THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION
Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert, Gholamreza Nakhaeizadeh
2012
Abstract
In this paper, we introduce AdDANAg, a language-independent approach for extracting the main content of web documents. The approach combines best-of-breed techniques from recent content extraction approaches to yield better extraction results. It brings together two pre-processing steps (for example, to normalize the document presentation and to reduce the impact of certain syntactical structures) and four phases for the actual content extraction. We show that AdDANAg performs well in terms of both effectiveness and efficiency and outperforms previous approaches.
Paper Citation
in Harvard Style
Mohammadzadeh H., Gottron T., Schweiggert F. and Nakhaeizadeh G. (2012). THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION. In Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8565-08-2, pages 677-682. DOI: 10.5220/0003931906770682
in Bibtex Style
@conference{webist12,
author={Hadi Mohammadzadeh and Thomas Gottron and Franz Schweiggert and Gholamreza Nakhaeizadeh},
title={THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION},
booktitle={Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2012},
pages={677-682},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003931906770682},
isbn={978-989-8565-08-2},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION
SN - 978-989-8565-08-2
AU - Mohammadzadeh H.
AU - Gottron T.
AU - Schweiggert F.
AU - Nakhaeizadeh G.
PY - 2012
SP - 677
EP - 682
DO - 10.5220/0003931906770682