posed approach leaves room for improvement, but in
these experimental cases it has produced good results
and shown that it can be used for text categorization
in real-world applications.
ACKNOWLEDGEMENTS
The authors wish to thank Taloustutkimus Oy for sup-
porting the work, and Prof. Hannu Toivonen and the
anonymous reviewers for their valuable comments.
KDIR 2012 - International Conference on Knowledge Discovery and Information Retrieval