Fighting Twitter spam with Bayes

Aug 31 2009

For a few days now I’ve been researching the nature of spam on Twitter and looking for the best way to combat it for a tool I’m making. It’s been an extremely interesting ride that has, surprise surprise, proven the initial assumptions I made several months ago while creating a different tool.

When I started looking into spambayes last week I realised a set of verified spam and ham was needed to train the filters so they could do their work. So I made myself a tool that would present everything from the public timeline (20 tweets) and set out to clicking

6 responses so far

  • http://brainmachine.mozganostroj.com sparkica

    Very (and I mean VERY) interesting experiment :) Twitter seems to break ‘old laws’ of text classification. Having some problems myself :S

  • http://swizec.com Swizec

    It’s giving _you_ problems? Wow, that means I must’ve chosen exactly the right field to do semantics :D

    It’s strange though, today I was told from about 50 sources that Disney bought Marvel. Are all of them spam? Is the first cool and the rest are spam? What? I think we need a whole new definition of what constitutes spam before we can even begin to make a technical solution.

    And then Bayes (or whatever) is trained and voila, tomorrow nobody will be talking about Disney anymore.

  • http://blarneyfellow.wordpress.com Simon

    1) Simplify the task by classifying users instead of individual tweets. That way you get more context.

    2) Include features such as :has-link-and-a-trending-topic, :has-multiple-trending-topics.

    3) Based on known spammers, build a domain blacklist.

    4) For :has-link-and-a-trending-topic, check how semantically related the content of the link and the tweet are.
    Use this as a feature for classification.

    5) Check whether there are any regularities in spam accounts (ratio & time distribution of camouflage/spam).

    The Disney problem is not about spam, but about the echo chamber. You can’t solve that with binary classification.
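A rough sketch of how the tweet-level features from points 2 and 3 might be computed. Everything here is an assumption for illustration: the function name `tweet_features`, the naive substring matching against trending topics, and the idea that the trending topics and the domain blacklist are passed in as plain collections. It is not the method used by the post's author or by any particular library.

```python
import re
from urllib.parse import urlparse

# Very loose URL matcher; good enough for a sketch, not for production.
URL_RE = re.compile(r"https?://\S+")

def tweet_features(text, trending_topics, blacklisted_domains):
    """Return boolean features for one tweet (hypothetical helper)."""
    lowered = text.lower()
    urls = URL_RE.findall(text)
    # Naive assumption: a topic is "mentioned" if it appears as a substring.
    topics_hit = [t for t in trending_topics if t.lower() in lowered]
    # Hosts the tweet links to, compared against the known-spammer blacklist.
    domains = {urlparse(u).netloc.lower() for u in urls}
    blacklist = {d.lower() for d in blacklisted_domains}
    return {
        "has-link-and-a-trending-topic": bool(urls) and len(topics_hit) >= 1,
        "has-multiple-trending-topics": len(topics_hit) >= 2,
        "links-to-blacklisted-domain": bool(domains & blacklist),
    }
```

The resulting dictionary could be fed to a classifier alongside the tweet's words, or aggregated per user as point 1 suggests.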

  • http://blarneyfellow.wordpress.com Simon

    If you are willing to endure some false positives, there are also things such as

    6) Classifying names as (non)auto-generated and using this as a feature.

    7) Using follower/following as a feature.

    8) Using %replies as a feature.

    9) Use bio and web either as another data point and/or (pre-classified) feature.

    10) :non-spammer-replied-back and :non-spammer-sent-msg (seems like a strong white-list feature).
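The per-user features in points 7 and 8 could be sketched as below. The function name `user_features`, the inputs (raw follower and following counts plus a list of the user's tweet texts), and the rule that a tweet starting with "@" counts as a reply are all assumptions made for this example.

```python
def user_features(followers, following, tweets):
    """Return numeric features for one account (hypothetical helper).

    followers, following: integer counts for the account.
    tweets: list of tweet texts posted by the account.
    """
    total = len(tweets)
    # Assumption: a tweet that starts with "@" is treated as a reply.
    replies = sum(1 for t in tweets if t.startswith("@"))
    return {
        # max(..., 1) avoids division by zero for accounts following nobody.
        "follower_following_ratio": followers / max(following, 1),
        "reply_fraction": replies / total if total else 0.0,
    }
```

An account that follows thousands, has few followers, and rarely replies would score low on both values, which is the kind of signal these points hint at; where to put the threshold is left open, since it trades off against false positives.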

  • http://blarneyfellow.wordpress.com Simon

    Apparently I have the attention span of a bear. Anyhow:

    11) Sign up your honey-pot to various get n-followers/day schemes.

    12) Use other services to bootstrap your probabilities & datasets, e.g. http://www.twitblock.org/

    13) Further reading: http://news.ycombinator.com/item?id=703623
    http://www.stoptwitterspam.com/blog/

  • http://www.onlineordering.com Ordering

    No spam tool will catch a wise spammer on Twitter, and even if you succeed, the chances that other users will repeat your efforts are almost zero.

    That is why I stopped using Twitter. Of my 2000 follows, every day at least 5 turned out to be spammers.
