Classifying Short Messages on Social Networks using Vector Space Models
Ricardo Lage, Peter Dolog, Martin Leginus
2013
Abstract
In this paper we propose a method to classify irrelevant messages and filter them out before they are published on a social network. Previous work has tended to focus on the consumer of information, whereas the publisher of a message faces the challenge of addressing all of his or her followers or subscribers at once. We frame the problem as a supervised learning task and use vector space models to train a classifier on labeled messages from a user account. We test the precision and accuracy of the classifier on over 13,000 Twitter accounts. Results show the feasibility of our approach for most types of active accounts on this social network.
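To make the general idea concrete, the following is a minimal sketch of a vector space classifier for short messages. It is not the method evaluated in the paper: the abstract does not specify term weighting, features, or the learning algorithm, so the TF-IDF weighting, the per-class centroid with cosine similarity, the toy messages, and all function names here are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately naive tokenizer)."""
    return text.lower().split()


def build_idf(documents):
    """Smoothed inverse document frequency for every term seen in training."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def vectorize(text, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in training are dropped."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(labeled_messages):
    """Return the IDF table and one mean TF-IDF vector (centroid) per label."""
    idf = build_idf([text for text, _ in labeled_messages])
    per_label = Counter(label for _, label in labeled_messages)
    centroids = {}
    for text, label in labeled_messages:
        centroid = centroids.setdefault(label, {})
        for term, weight in vectorize(text, idf).items():
            centroid[term] = centroid.get(term, 0.0) + weight / per_label[label]
    return idf, centroids


def classify(text, idf, centroids):
    """Assign the label whose centroid is most similar to the message vector."""
    vector = vectorize(text, idf)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Hypothetical labeled messages from a single account (invented for this sketch).
TRAINING = [
    ("new paper on vector space models accepted", "relevant"),
    ("our classifier code is now on github", "relevant"),
    ("slides from the conference talk are online", "relevant"),
    ("having coffee and a sandwich for lunch", "irrelevant"),
    ("stuck in traffic again this morning", "irrelevant"),
    ("great weather for a walk today", "irrelevant"),
]

idf, centroids = train(TRAINING)
draft = "coffee and lunch in traffic"
if classify(draft, idf, centroids) == "irrelevant":
    print("filtered before publishing:", draft)
```

In this sketch a draft message is checked before publication and withheld when its nearest centroid is the irrelevant class. A message that shares no terms with the training data has zero similarity to every centroid, so a real filter would need an explicit default for that case.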
Paper Citation
in Harvard Style
Lage R., Dolog P. and Leginus M. (2013). Classifying Short Messages on Social Networks using Vector Space Models. In Proceedings of the 9th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8565-54-9, pages 413-422. DOI: 10.5220/0004357304130422
in Bibtex Style
@conference{webist13,
author={Ricardo Lage and Peter Dolog and Martin Leginus},
title={Classifying Short Messages on Social Networks using Vector Space Models},
booktitle={Proceedings of the 9th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2013},
pages={413-422},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004357304130422},
isbn={978-989-8565-54-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 9th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - Classifying Short Messages on Social Networks using Vector Space Models
SN - 978-989-8565-54-9
AU - Lage R.
AU - Dolog P.
AU - Leginus M.
PY - 2013
SP - 413
EP - 422
DO - 10.5220/0004357304130422