5 CONCLUSIONS AND FUTURE WORK
In this paper we proposed a method based on vector space models
to classify “bad” messages and prevent their publication on a
social network account. We showed that the traditional tf-idf
model performs poorly because each account contains only a small
number of messages, but that it can be improved with different
expansion techniques. We also tested different time decay
parameters and showed that our model, determined empirically
from a Twitter dataset, performs best. Overall, it is feasible
to train a classifier for this task and reduce the number of
messages to be evaluated in a recommendation process.
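As an illustration of how a time decay can be combined with tf-idf weights when an account has only a few messages, the following minimal sketch builds a decayed account profile and compares it with a candidate message. The exponential half-life form, the smoothed idf, and the names `decayed_tfidf`, `cosine` and `half_life` are assumptions made for this example; they are not the decay model or parameters evaluated in the paper.

```python
import math
from collections import Counter


def decayed_tfidf(messages, now, half_life=24.0):
    """Build a term-weight profile from (tokens, timestamp_in_hours) pairs.

    Assumption: each message is down-weighted by an exponential decay
    with the given half-life (in hours). This is an illustrative choice.
    """
    n_docs = len(messages)
    # Document frequency: number of messages that contain each term.
    df = Counter()
    for tokens, _ in messages:
        df.update(set(tokens))

    weights = Counter()
    for tokens, timestamp in messages:
        decay = 0.5 ** ((now - timestamp) / half_life)
        for term, tf in Counter(tokens).items():
            # Smoothed idf so that terms present in every message keep weight 1.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            weights[term] += decay * tf * idf
    return dict(weights)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical account history: one message from 24 hours ago, one from now.
history = [(["a", "b"], 0.0), (["a"], 24.0)]
profile = decayed_tfidf(history, now=24.0, half_life=24.0)
similarity = cosine(profile, {"a": 1.0})
```

A candidate message whose similarity to the profile falls below a chosen threshold could then be flagged as “bad”; the threshold itself would have to be tuned on labelled data.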
In future work, we plan to extend our method by testing it with
different feature selection algorithms. Furthermore, feature
reduction could be performed with co-occurrence cluster
analysis, where the resulting clusters would represent latent
topics. This would result in a low-dimensional vector space
with denser feature vectors, which could help improve the
classification further. We also plan to test how this
classification could affect ranking algorithms aimed at
recommending messages on a social network account.
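To make the idea of co-occurrence-based feature reduction concrete, the sketch below groups terms that co-occur in at least `min_count` messages and projects each message onto the resulting groups. It is only one simple possibility: connected components over a thresholded co-occurrence graph stand in here for a proper cluster analysis, and the function names and the `min_count` parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_clusters(messages, min_count=2):
    """Map each term to a cluster id based on message-level co-occurrence."""
    pair_counts = Counter()
    vocab = set()
    for tokens in messages:
        terms = sorted(set(tokens))
        vocab.update(terms)
        pair_counts.update(combinations(terms, 2))

    # Union-find over terms; frequently co-occurring terms are merged.
    parent = {term: term for term in vocab}

    def find(term):
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for (a, b), count in pair_counts.items():
        if count >= min_count:
            parent[find(a)] = find(b)

    roots = sorted({find(term) for term in vocab})
    index = {root: i for i, root in enumerate(roots)}
    return {term: index[find(term)] for term in vocab}


def project(tokens, term_to_cluster, n_clusters):
    """Represent a message as counts over clusters instead of single terms."""
    vector = [0] * n_clusters
    for term in tokens:
        if term in term_to_cluster:
            vector[term_to_cluster[term]] += 1
    return vector


# Hypothetical toy corpus of tokenized messages.
corpus = [
    ["goal", "match"],
    ["goal", "match", "team"],
    ["vote", "poll"],
    ["vote", "poll"],
]
mapping = cooccurrence_clusters(corpus, min_count=2)
n_clusters = len(set(mapping.values()))
reduced = project(["goal", "match", "vote"], mapping, n_clusters)
```

In this toy corpus the five distinct terms collapse into three cluster features, so the reduced vectors are shorter and have fewer zero entries than the original term vectors.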