A large number of experiments were carried out.
Our results reveal that unigram-based models,
although a much simpler approach, show their
unique value and effectiveness in web content
classification. Raw data input, stemming, and the IDF
database all play important roles in determining
topic similarity, as does the choice of representation
model (unigram or higher-order n-grams).
Higher-order n-gram based approaches, especially
the 5-gram based models in our study, when combined
with sentiment features, bring significant
improvement in precision for the Violence
and the two Racism-related web categories. However,
the effect of higher-order n-grams on topic-similarity-based
models appears to be minor. We need to look
into this further to understand whether the improvements
in the classification models justify the large
amount of computation needed to process n-grams.
The main contributions of our paper are: (1)
an investigation of a new approach to web content
classification to serve online safety applications; (2)
contrary to most studies, which only consider the
presence or frequency count of n-grams in their
applications, the use of tf-idf weighted n-grams
in building the content classification models;
(3) a large number of feature extraction and model
development experiments, which contribute to a better
understanding of text summarization, sentiment
analysis methods, and learning models; and (4)
analytical results that directly benefit the
development of cyber safety solutions.
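To make the distinction in contribution (2) concrete, the following is a minimal sketch of tf-idf weighting applied to word n-grams, as opposed to raw presence or frequency counts. It is an illustration only and not the implementation used in our experiments: the function names, the whitespace tokenization, the toy documents, and the plain log(N/df) form of idf are all assumptions made for the example.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return the word n-grams (as tuples) of one tokenized document."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_ngram_vectors(docs, n):
    """Map each tokenized document to a dict {n-gram: tf-idf weight}.

    Assumed weighting: tf = count / number of n-grams in the document,
    idf = log(N / df), with N the number of documents.
    """
    grams_per_doc = [word_ngrams(doc, n) for doc in docs]

    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for grams in grams_per_doc:
        df.update(set(grams))

    num_docs = len(docs)
    vectors = []
    for grams in grams_per_doc:
        counts = Counter(grams)
        total = len(grams)
        vectors.append({
            gram: (count / total) * math.log(num_docs / df[gram])
            for gram, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the match was a violent clash".split(),
    "the match was a friendly game".split(),
]
vectors = tfidf_ngram_vectors(docs, n=2)
for gram, weight in sorted(vectors[0].items()):
    print(gram, round(weight, 4))
```

In this toy corpus, bigrams shared by both documents, such as ("the", "match"), receive weight zero, while bigrams unique to one document keep a positive weight. A presence or frequency representation would treat both kinds of bigram alike.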
In our future work we will explore the
incorporation of probabilistic topic models (Blei et
al., 2003; Blei, 2012; Lu et al., 2011b), revisit topic-
aware sentiment lexicons (Lu et al., 2011a), and fine-
tune the models with different learning methods.
We believe there is still much room for
improvement, and we hope some of these methods will
raise classification performance to a new level.
ACKNOWLEDGEMENTS
This research is supported by the Tekes-funded
DIGILE D2I research program, Arcada Research
Foundation, and our industry partner.
REFERENCES
Blei, D., Ng, A., and Jordan, M. I. 2003. Latent Dirichlet
allocation. Advances in Neural Information Processing
Systems, 601-608.
Blei, D. 2012. Probabilistic topic models.
Communications of the ACM, 55(4):77–84.
Calado, P., Cristo, M., Goncalves, M. A., de Moura, E. S.,
Ribeiro-Neto, B., and Ziviani, N. 2006. Link-based
similarity measures for the classification of web
documents. Journal of the American Society for
Information Science and Technology (57:2), 208-221.
Chakrabarti, S., Dom, B., and Indyk, P. 1998. Enhanced
hypertext categorization using hyperlinks.
Proceedings of ACM SIGMOD 1998.
Chen, Z., Wu, O., Zhu, M., and Hu, W. 2006. A novel web
page filtering system by combining texts and images.
Proceedings of the 2006 IEEE/WIC/ACM
International Conference on Web Intelligence, 732–
735. IEEE Computer Society.
Cohen, W. 2002. Improving a page classifier with anchor
extraction and link analysis. In S. Becker, S. Thrun,
and K. Obermayer (Eds.), Advances in Neural
Information Processing Systems (Volume 15,
Cambridge, MA: MIT Press) 1481–1488.
Dumais, S. T., and Chen, H. 2000. Hierarchical
classification of web content. Proceedings of
SIGIR'00, 256-263.
Elovici, Y., Shapira, B., Last, M., Zaafrany, O., Friedman,
M., Schneider, M., and Kandel, A. 2005. Content-
based detection of terrorists browsing the web using
an advanced terror detection system (ATDS),
Intelligence and Security Informatics (Lecture Notes
in Computer Science Volume 3495), 244-255.
Fürnkranz, J. 1999. Exploiting structural information for
text classification on the WWW. Advances in Intelligent
Data Analysis, 487-497.
Fürnkranz, J., Mitchell, T., and Riloff, E. 1998. A case
study in using linguistic phrases for text categorization
on the WWW. Working Notes of the 1998 AAAI/ICML
Workshop on Learning for Text Categorization.
Fürnkranz, J. 1998. A study using n-gram features for text
categorization. Austrian Research Institute for
Artificial Intelligence 3, 1-10.
Fürnkranz, J., Mitchell, T., and Riloff, E. 1999. A case
study in using linguistic phrases for text categorization
on the WWW. Proceedings from the AAAI/ICML Workshop
on Learning for Text Categorization, 5-12.
Gabrilovich, E., and Markovitch, S. 2007. Computing
semantic relatedness using Wikipedia-based explicit
semantic analysis. Proceedings of the 20th
International Joint Conference on Artificial
Intelligence (IJCAI’07), Hyderabad, India.
Hammami, M., Chahir, Y., and Chen, L. 2003.
WebGuard: web based adult content detection and
filtering system. Proceedings of the IEEE/WIC Inter.
Conf. on Web Intelligence (Oct. 2003), 574 – 578.
Kludas, J. 2007. Multimedia retrieval and classification for
web content. Proc. of the 1st BCS IRSG Conference on
Future Directions in Information Access, British
Computer Society, Swinton, UK.
Last, M., Shapira, B., Elovici, Y., Zaafrany, O., and
Kandel, A. 2003. Content-Based Methodology for
Anomaly Detection on the Web. Advances in Web
KDIR 2014 - International Conference on Knowledge Discovery and Information Retrieval