Large Scale Web-Content Classification

Luca Deri, Maurizio Martinelli, Daniele Sartiano, Loredana Sideri

Abstract

Web classification is used in many security devices to prevent users from accessing web sites that the current security policy does not allow, as well as for improving web search and implementing contextual advertising. Many commercial web classification services are available on the market, along with a few publicly available web directory services. Unfortunately, they mostly focus on English-language web sites, which makes their classification reliability and coverage inadequate for other languages. This paper covers the design and implementation of a web-based classification tool for TLDs (Top Level Domains). Each domain is classified by analysing its main web site and assigning it to categories according to its content. The tool has been successfully validated by classifying all the registered .it Internet domains; the results are presented in this paper.


Paper Citation


in Harvard Style

Deri L., Martinelli M., Sartiano D. and Sideri L. (2015). Large Scale Web-Content Classification. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: SSTM, (IC3K 2015) ISBN 978-989-758-158-8, pages 545-554. DOI: 10.5220/0005635605450554


in Bibtex Style

@conference{sstm15,
author={Luca Deri and Maurizio Martinelli and Daniele Sartiano and Loredana Sideri},
title={Large Scale Web-Content Classification},
booktitle={Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: SSTM, (IC3K 2015)},
year={2015},
pages={545-554},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005635605450554},
isbn={978-989-758-158-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: SSTM, (IC3K 2015)
TI - Large Scale Web-Content Classification
SN - 978-989-758-158-8
AU - Deri L.
AU - Martinelli M.
AU - Sartiano D.
AU - Sideri L.
PY - 2015
SP - 545
EP - 554
DO - 10.5220/0005635605450554