TWO-PHASE CATEGORIZATION OF WEB DOCUMENTS

Vladimir Bartik, Radek Burget

Abstract

The number of pages on the World Wide Web is constantly growing, so there is a need to process pages efficiently and extract useful knowledge from them. Web page categorization is an important task in this area. The method proposed here takes both visual and textual information into account and consists of two phases. In the first phase, the page areas obtained by segmentation are classified based on their visual properties. In the second phase, whole pages are classified using information from the first phase together with textual information. The final part of the paper presents several experiments with web pages taken from news web sites.
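The two-phase idea can be illustrated with a minimal sketch. The abstract does not specify the visual features, the classifiers, or how the phases are combined, so everything below is an assumption made for illustration: the `Area` fields, the threshold rules in `classify_area`, the `AREA_WEIGHTS` table, and the keyword-based scoring in `classify_page` are hypothetical and are not the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Area:
    """One segmented page area (hypothetical features, not from the paper)."""
    text: str
    font_size: float   # assumed visual feature, in points
    y_position: float  # assumed visual feature: 0.0 = top of page, 1.0 = bottom


def classify_area(area: Area) -> str:
    """Phase 1: label an area from its visual properties only (toy rules)."""
    if area.font_size >= 20 and area.y_position < 0.3:
        return "heading"
    if area.font_size >= 11:
        return "main_text"
    return "other"


# Hypothetical weights: how much the text of each area class counts in phase 2.
AREA_WEIGHTS = {"heading": 3.0, "main_text": 1.0, "other": 0.0}


def weighted_terms(areas):
    """Weight each term occurrence by the class of the area it appears in."""
    weights = Counter()
    for area in areas:
        w = AREA_WEIGHTS[classify_area(area)]
        for term in area.text.lower().split():
            weights[term] += w
    return weights


def classify_page(areas, category_terms):
    """Phase 2: choose the category whose keywords have the highest weighted score."""
    terms = weighted_terms(areas)
    scores = {
        category: sum(terms[t] for t in keywords)
        for category, keywords in category_terms.items()
    }
    return max(scores, key=scores.get)


# Toy news page: a headline, an article body, and a small-print footer block.
page = [
    Area("Election results announced", font_size=24, y_position=0.05),
    Area("The parliament vote ended late on Sunday", font_size=12, y_position=0.4),
    Area("football scores football tickets football shop", font_size=8, y_position=0.9),
]
categories = {
    "politics": {"election", "parliament", "vote"},
    "sport": {"football", "scores"},
}
print(classify_page(page, categories))
```

In this toy page, plain term counting would favour "sport" (four keyword hits in the footer against three for "politics"), whereas the phase-1 labels suppress the footer block and the page is assigned to "politics". A real system would replace both the rules and the keyword scoring with trained classifiers.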


Paper Citation


in Harvard Style

Bartik V. and Burget R. (2010). TWO-PHASE CATEGORIZATION OF WEB DOCUMENTS. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010), ISBN 978-989-8425-28-7, pages 458-462. DOI: 10.5220/0003096204580462


in Bibtex Style

@conference{kdir10,
author={Vladimir Bartik and Radek Burget},
title={TWO-PHASE CATEGORIZATION OF WEB DOCUMENTS},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={458-462},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003096204580462},
isbn={978-989-8425-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - TWO-PHASE CATEGORIZATION OF WEB DOCUMENTS
SN - 978-989-8425-28-7
AU - Bartik V.
AU - Burget R.
PY - 2010
SP - 458
EP - 462
DO - 10.5220/0003096204580462