Combining N-gram based Similarity Analysis with Sentiment Analysis in Web Content Classification

Shuhua Liu, Thomas Forss

Abstract

This research concerns the development of web content detection systems that can automatically classify any web page into pre-defined content categories. Our work is motivated by practical experience and the observation that certain categories of web pages, such as those containing hatred and violence, remain much harder to classify accurately even when both content and structural features are taken into account. To further improve the performance of detection systems, we bring web sentiment features into the classification models. In addition, we incorporate n-gram representations into our classification approach, on the assumption that n-grams capture more local context information in text and could thus enhance topic similarity analysis. Unlike most studies, which consider only the presence or frequency count of n-grams, we use tf-idf weighted n-grams in building the content classification models. Our results show that unigram-based models, although a much simpler approach, have their own value and effectiveness in web content classification. Higher-order n-gram based approaches, especially 5-gram based models that combine topic similarity features with sentiment features, bring significant improvements in precision for the Violence category and the two Racism-related web categories.
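As a rough illustration of the idea described above, the following sketch computes tf-idf weighted word n-gram vectors, measures a page's similarity to a category profile, and pairs that similarity with a sentiment score. It is a minimal sketch under stated assumptions, not the authors' implementation: the use of cosine similarity, the smoothed idf formula, the summed category profile, and all function names (`word_ngrams`, `idf_table`, `tfidf_vector`, `cosine`, `page_features`) are hypothetical, and the sentiment score is assumed to come from an external sentiment tool that is not modelled here.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text, each joined by single spaces."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def idf_table(docs, n):
    """Smoothed inverse document frequency for every n-gram seen in docs.

    The smoothing (adding 1 to counts and to the result) is an assumption.
    """
    df = Counter()
    for doc in docs:
        df.update(set(word_ngrams(doc, n)))
    total = len(docs)
    return {g: math.log((1 + total) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf_vector(text, n, idf):
    """Sparse tf-idf vector (dict) for one text; unseen n-grams are dropped."""
    tf = Counter(word_ngrams(text, n))
    return {g: count * idf[g] for g, count in tf.items() if g in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def page_features(page, category_docs, n, sentiment_score):
    """Return [topic similarity to the category, sentiment score].

    The category profile is the sum of the per-document tf-idf vectors
    (an assumed design); sentiment_score is supplied by the caller.
    """
    idf = idf_table(category_docs, n)
    profile = Counter()
    for doc in category_docs:
        profile.update(tfidf_vector(doc, n, idf))
    similarity = cosine(tfidf_vector(page, n, idf), profile)
    return [similarity, sentiment_score]
```

The resulting feature pair could then be passed to any standard classifier. Setting `n` to 1 gives the unigram variant and `n` to 5 the 5-gram variant mentioned in the abstract.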



Paper Citation


in Harvard Style

Liu S. and Forss T. (2014). Combining N-gram based Similarity Analysis with Sentiment Analysis in Web Content Classification. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2014) ISBN 978-989-758-048-2, pages 530-537. DOI: 10.5220/0005170305300537


in Bibtex Style

@conference{sstm14,
author={Shuhua Liu and Thomas Forss},
title={Combining N-gram based Similarity Analysis with Sentiment Analysis in Web Content Classification},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2014)},
year={2014},
pages={530-537},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005170305300537},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2014)
TI - Combining N-gram based Similarity Analysis with Sentiment Analysis in Web Content Classification
SN - 978-989-758-048-2
AU - Liu S.
AU - Forss T.
PY - 2014
SP - 530
EP - 537
DO - 10.5220/0005170305300537