THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION

Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert, Gholamreza Nakhaeizadeh

2012

Abstract

In this paper, we introduce AdDANAg, a language-independent approach for extracting the main content of web documents. The approach combines best-of-breed techniques from recent content extraction methods to improve extraction results. It brings together two pre-processing steps, for example to normalize the document presentation and to reduce the impact of certain syntactical structures, and four phases for the actual content extraction. We show that AdDANAg achieves high effectiveness and efficiency and outperforms previous approaches.


Paper Citation


in Harvard Style

Mohammadzadeh H., Gottron T., Schweiggert F. and Nakhaeizadeh G. (2012). THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION. In Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8565-08-2, pages 677-682. DOI: 10.5220/0003931906770682


in Bibtex Style

@conference{webist12,
author={Hadi Mohammadzadeh and Thomas Gottron and Franz Schweiggert and Gholamreza Nakhaeizadeh},
title={THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION},
booktitle={Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2012},
pages={677-682},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003931906770682},
isbn={978-989-8565-08-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION
SN - 978-989-8565-08-2
AU - Mohammadzadeh H.
AU - Gottron T.
AU - Schweiggert F.
AU - Nakhaeizadeh G.
PY - 2012
SP - 677
EP - 682
DO - 10.5220/0003931906770682