count. I tested this service by searching for “@spam”
on Twitter. Surprisingly, the query results show that
this reporting service is itself abused by both hoaxes
and spam. Only a few tweets report real malicious
accounts. Some Twitter applications also allow users to
flag possible spam. However, all these ad hoc meth-
ods require users to identify spam manually and de-
pend on their own experience.
Twitter also puts effort into cleaning up suspicious
accounts and filtering out malicious tweets. Mean-
while, legitimate Twitter users complain that their ac-
counts are mistakenly suspended by Twitter’s anti-
spam action. Twitter recently admitted to accidentally
suspending accounts as a result of a spam clean-up ef-
fort (Twitter, 2009a).
In this paper, the suspicious behaviors of spam
accounts on Twitter are studied. The goal is to apply
machine learning methods to automatically distinguish
spam accounts from normal ones. The major
contributions of this paper are as follows:
1. To the best of our knowledge, this is the first effort
to automatically detect spam on Twitter;
2. A directed graph model is proposed to explore the
unique “follower” and “friend” relationships on
Twitter;
3. Based on Twitter’s spam policy, novel graph-based
features and content-based features are proposed
to facilitate spam detection;
4. A series of classification methods are compared
and applied to distinguish suspicious behaviors
from normal ones;
5. A Web crawler is developed, relying on the API
methods provided by Twitter, to extract publicly
available data from the Twitter website. A data set of
around 25K users, 500K tweets, and 49M
follower/friend relationships is collected;
6. Finally, a prototype system is established to
evaluate the detection method. Experiments are
conducted to analyze the data set and evaluate the
performance of the system. The results show that the
spam detection system achieves a precision of 89%.
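The follower and friend relationships named in contribution 2 can be illustrated with a small directed graph. The following is a minimal sketch, not the model defined in Section 3: it assumes that an edge from a to b means "a follows b", so that the friends of a are the accounts a follows and the followers of a are the accounts that follow a. The class name SocialGraph, its methods, and the account names are hypothetical.

```python
from collections import defaultdict


class SocialGraph:
    """Toy directed graph; an edge (a, b) is assumed to mean 'a follows b'."""

    def __init__(self):
        self._following = defaultdict(set)    # a -> accounts that a follows
        self._followed_by = defaultdict(set)  # b -> accounts that follow b

    def follow(self, a, b):
        """Record the directed edge a -> b."""
        self._following[a].add(b)
        self._followed_by[b].add(a)

    def friends(self, user):
        """Accounts that `user` follows (out-neighbors)."""
        return set(self._following.get(user, ()))

    def followers(self, user):
        """Accounts that follow `user` (in-neighbors)."""
        return set(self._followed_by.get(user, ()))

    def mutual(self, user):
        """Accounts connected to `user` in both directions."""
        return self.friends(user) & self.followers(user)


# Hypothetical accounts, for illustration only.
graph = SocialGraph()
graph.follow("alice", "bob")
graph.follow("bob", "alice")
graph.follow("carol", "alice")
```

Keeping separate in-edge and out-edge sets makes both relationships cheap to query, which matters because the two directions carry different meaning on Twitter.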
The rest of the paper is organized as follows.
Section 2 discusses related work. Section 3 proposes a
directed social graph model and defines the unique
friend and follower relationships. In Section 4, novel
graph-based and content-based features are proposed
based on Twitter’s spam policy. Section 5 describes
how I collect the data set. Bayesian classification
methods are adopted in Section 6 to detect spam
accounts on Twitter. Experiments are conducted in
Section 7 to analyze the labeled data, and traditional
classification methods are compared to evaluate the
performance of the detection system. Section 8
concludes the paper.
2 RELATED WORK
Spam detection has been studied for a long time.
Existing work mainly focuses on email spam detection
and Web spam detection. The authors of (Sahami et
al., 1998) were the first to apply a Bayesian approach
to filtering spam email. Their experimental results show
that the classifier performs better when it considers
domain-specific features in addition to the raw text of
email messages. Spam email filtering is now a fairly
mature technique, and Bayesian spam filters are widely
implemented in modern email clients and servers.
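To make the Bayesian idea concrete, the following is a minimal sketch of a word-based naive Bayes filter with add-one smoothing. It is not the classifier of (Sahami et al., 1998), which also uses domain-specific features; the function names, the whitespace tokenization, and the four training messages are assumptions made purely for illustration.

```python
import math
from collections import Counter


def train(messages):
    """Count words per class from (text, label) pairs; label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def log_posterior(text, label, model):
    """Unnormalized log P(label | text) under the naive independence assumption."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / total_docs)  # class prior
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training are ignored
        # Add-one (Laplace) smoothing avoids zero probabilities.
        score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
    return score


def classify(text, model):
    """Return the label with the higher posterior score."""
    return max(("spam", "ham"), key=lambda label: log_posterior(text, label, model))


# Hypothetical training data; both classes must be present.
training = [
    ("win free money now", "spam"),
    ("free prize click now", "spam"),
    ("meeting moved to monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training)
```

Working in log space keeps the product of many small probabilities from underflowing on longer messages.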
The Web is massive and changes more rapidly than
email, which makes detecting Web spam a significant
challenge. (Gyöngyi et al., 2004) first formalized the
Web spam detection problem and proposed a
comprehensive solution for detecting Web spam. Their
TrustRank algorithm computes trust scores over a Web
graph. Because good pages receive higher scores, spam
pages can be filtered out of search engine results. In
(Gyöngyi et al., 2006), the authors use the link
structure of the Web to propose a measure, Spam Mass,
for identifying link spamming. A directed graph model
of the Web is proposed in (Zhou et al., 2007), whose
authors apply classification algorithms for directed
graphs to detect real-world link spam. In (Castillo et
al., 2007), both link-based and content-based features
are proposed, and a basic decision tree classifier is
implemented to detect spam. In (Geng et al., 2009),
semi-supervised learning algorithms are proposed to
boost the performance of a classifier that needs only a
small amount of labeled samples.
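The trust-score idea behind TrustRank can be sketched as a PageRank-style iteration whose restart distribution is concentrated on a hand-picked set of trusted seed pages, so that trust flows outward along links from the seeds. The code below is a simplified illustration under that reading, not the implementation of (Gyöngyi et al., 2004): the function name, the damping value of 0.85, the iteration count, and the four-page graph are all assumptions.

```python
def trust_rank(links, trusted_seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along outgoing links.

    links: dict mapping each page to the list of pages it links to.
    trusted_seeds: set of pages assumed to be good.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    seeds = [page for page in pages if page in trusted_seeds]
    if not seeds:
        raise ValueError("at least one trusted seed must appear in the graph")

    # Restart distribution: uniform over the seeds, zero elsewhere.
    seed_share = {page: (1.0 / len(seeds) if page in trusted_seeds else 0.0)
                  for page in pages}
    trust = dict(seed_share)

    for _ in range(iterations):
        updated = {page: (1.0 - damping) * seed_share[page] for page in pages}
        for page, targets in links.items():
            if not targets:
                continue  # simplification: trust held by dead-end pages is dropped
            share = damping * trust[page] / len(targets)
            for target in targets:
                updated[target] += share
        trust = updated
    return trust


# Hypothetical four-page Web graph with a single trusted seed.
links = {
    "good": ["news"],
    "news": ["good"],
    "spam": ["good", "spamfarm"],
    "spamfarm": ["spam"],
}
scores = trust_rank(links, trusted_seeds={"good"})
```

In this toy graph the spam pages link to a good page but receive no links from trusted pages, so they accumulate no trust, which is the property that lets low-scoring pages be filtered.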
For spam detection in other applications, the
authors of (Yu-Sung et al., 2009) present an approach
for detecting spam calls over IP telephony, known as
SPIT, in VoIP systems. Building on popular
semi-supervised learning methods, they propose an
improved algorithm called MPCK-Means. In (Benevenuto
et al., 2009), the authors study video spammers and
promoters on YouTube and propose a supervised
classification algorithm to detect them. In (Wang,
2010), a machine learning approach is proposed to study
spam bot detection in online social networking sites,
using Twitter as an example. In (Krishnamurthy et al.,
2008), the authors
collected three datasets of the Twitter network. The
DON'T FOLLOW ME - Spam Detection in Twitter