EXTRACTING THE MAIN CONTENT OF WEB DOCUMENTS BASED ON A NAIVE SMOOTHING METHOD

Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert, Gholamreza Nakhaeizadeh

2011

Abstract

Extracting the main content of web documents with high accuracy is an important challenge for researchers working on the web. In this paper, we present a novel language-independent method for extracting the main content of web pages. Compared with other main content extraction approaches, our method, called DANAg, achieves high performance in terms of both effectiveness and efficiency. The extraction process of DANAg is divided into four phases. In the first phase, we calculate the length of the content and the length of the code in fixed segments of an HTML file. The second phase applies a naive smoothing method to highlight the segments that form the main content. In the third phase, we use a simple algorithm to recognize the boundary of the main content in the HTML file. Finally, we feed the selected main content area to our parser in order to extract the main content of the targeted web page.
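The four phases can be illustrated with a minimal Python sketch. The abstract does not specify the details, so everything concrete here is an assumption made for illustration and is not the DANAg implementation: segments are taken to be lines of the HTML file, content length is the number of characters outside tags, code length is the number of characters inside tags, the smoothing is a plain moving average over the difference of the two, the boundary is the positive run around the highest smoothed score, and the parser is Python's standard `html.parser`. All function names and the `radius` parameter are hypothetical.

```python
import re
from html.parser import HTMLParser

TAG_RE = re.compile(r"<[^>]*>")  # crude tag matcher, good enough for a sketch


def segment_lengths(html):
    """Phase 1 (assumed form): per line, return (content_length, code_length)."""
    pairs = []
    for line in html.splitlines():
        code = sum(len(tag) for tag in TAG_RE.findall(line))
        content = len(TAG_RE.sub("", line).strip())
        pairs.append((content, code))
    return pairs


def smooth(values, radius=1):
    """Phase 2 (assumed form): moving average over each value and its neighbours."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def main_region(scores):
    """Phase 3 (assumed form): grow outward from the peak while scores stay positive."""
    if not scores:
        return None
    peak = max(range(len(scores)), key=scores.__getitem__)
    if scores[peak] <= 0:
        return None
    start = end = peak
    while start > 0 and scores[start - 1] > 0:
        start -= 1
    while end < len(scores) - 1 and scores[end + 1] > 0:
        end += 1
    return start, end


class _TextCollector(HTMLParser):
    """Phase 4 helper: collect the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def extract_main_content(html, radius=1):
    """Run the four assumed phases and return the extracted text."""
    lines = html.splitlines()
    diffs = [content - code for content, code in segment_lengths(html)]
    region = main_region(smooth(diffs, radius))
    if region is None:
        return ""
    start, end = region
    parser = _TextCollector()
    parser.feed("\n".join(lines[start:end + 1]))
    parser.close()
    return " ".join(parser.parts)
```

Calling `extract_main_content(page_source)` on a page whose article paragraphs sit between link-heavy navigation and footer lines should return mostly the paragraph text. Because smoothing blurs neighbouring segments, a single markup-heavy line directly next to the article can still be pulled into the selected region, which is why the boundary step matters in practice.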


Paper Citation


in Harvard Style

Mohammadzadeh H., Gottron T., Schweiggert F. and Nakhaeizadeh G. (2011). EXTRACTING THE MAIN CONTENT OF WEB DOCUMENTS BASED ON A NAIVE SMOOTHING METHOD. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011) ISBN 978-989-8425-79-9, pages 462-467. DOI: 10.5220/0003665304700475


in Bibtex Style

@conference{kdir11,
author={Hadi Mohammadzadeh and Thomas Gottron and Franz Schweiggert and Gholamreza Nakhaeizadeh},
title={EXTRACTING THE MAIN CONTENT OF WEB DOCUMENTS BASED ON A NAIVE SMOOTHING METHOD},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)},
year={2011},
pages={462-467},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003665304700475},
isbn={978-989-8425-79-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)
TI - EXTRACTING THE MAIN CONTENT OF WEB DOCUMENTS BASED ON A NAIVE SMOOTHING METHOD
SN - 978-989-8425-79-9
AU - Mohammadzadeh H.
AU - Gottron T.
AU - Schweiggert F.
AU - Nakhaeizadeh G.
PY - 2011
SP - 462
EP - 467
DO - 10.5220/0003665304700475