Automatic Web Page Classification Using Visual Content

António Videira, Nuno Goncalves

Abstract

Demand for automatic classification techniques with higher accuracy keeps growing. Current systems classify and process web pages using their text content, and little work has been done on using a page's visual content. Our work therefore focuses on classifying web pages using only their visual content. First, a descriptor is constructed by extracting different features from each page: simple color and edge histograms, Gabor features, and Tamura features. Two feature selection methods, one based on the Chi-Square criterion and the other on Principal Component Analysis, are then applied to that descriptor to select the most discriminative attributes. A second approach uses the Bag of Words (BoW) model, treating the SIFT local features extracted from each page image as words from which a dictionary is constructed. We then classify web pages by their aesthetic value, their recency, and their type of content. The machine learning methods used are Naive Bayes, Support Vector Machine, Decision Tree, and AdaBoost, and different tests are performed to evaluate the performance of each classifier. We thereby show that the visual appearance of a web page carries rich information not exploited by current web crawlers, which rely only on text content.
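To make the Bag of Words step concrete, the following is a minimal sketch of how local descriptors can be turned into a fixed-length "word" histogram for a page. It is an illustration under stated assumptions, not the authors' implementation: the function name `bow_histogram`, the toy 2-D descriptors, and the hand-written `codebook` are all hypothetical. In practice the dictionary is usually learned by clustering descriptors from training images, a step this sketch skips, and real SIFT descriptors are 128-dimensional.

```python
from math import dist


def bow_histogram(descriptors, codebook):
    """Map each local descriptor to its nearest visual word and count occurrences."""
    counts = [0] * len(codebook)
    for d in descriptors:
        # Index of the nearest codebook entry under Euclidean distance.
        nearest = min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    # Normalize so pages with different numbers of keypoints are comparable.
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Hypothetical toy data: 2-D points stand in for 128-D SIFT descriptors,
# and the codebook is written by hand instead of being learned.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.7, 5.2)]
histogram = bow_histogram(descriptors, codebook)
```

The resulting histogram has one entry per dictionary word, so every page yields a vector of the same length regardless of how many keypoints it contains, which is what allows it to be fed to a standard classifier.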

Paper Citation


in Harvard Style

Videira A. and Goncalves N. (2014). Automatic Web Page Classification Using Visual Content. In Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-989-758-024-6, pages 193-204. DOI: 10.5220/0004856201930204


in Bibtex Style

@conference{webist14,
author={António Videira and Nuno Goncalves},
title={Automatic Web Page Classification Using Visual Content},
booktitle={Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2014},
pages={193-204},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004856201930204},
isbn={978-989-758-024-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - Automatic Web Page Classification Using Visual Content
SN - 978-989-758-024-6
AU - Videira A.
AU - Goncalves N.
PY - 2014
SP - 193
EP - 204
DO - 10.5220/0004856201930204