A Comparison of Three Pre-processing Methods for Improving Main Content Extraction from Hyperlink Rich Web Documents

Moheb Ghorbani; Hadi Mohammadzadeh; Abdolreza Nazemi

doi:10.5220/0004947503350339

2014

Abstract

Most HTML documents on the World Wide Web contain many hyperlinks, both in the main content area and in the additional areas around it. Because extracting the main content of such hyperlink-rich web documents is rather complicated, this paper presents three simple, language-independent pre-processing methods that deal with the hyperlinks so that the main content can be identified accurately. To evaluate and compare the three methods, each is combined with a prominent main content extraction method called DANAg. The results show that one of the methods performs better in terms of effectiveness than the other two.

Paper Citation

in Harvard Style

Ghorbani M., Mohammadzadeh H. and Nazemi A. (2014). A Comparison of Three Pre-processing Methods for Improving Main Content Extraction from Hyperlink Rich Web Documents. In Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-989-758-024-6, pages 335-339. DOI: 10.5220/0004947503350339

in Bibtex Style

@conference{webist14,
author={Moheb Ghorbani and Hadi Mohammadzadeh and Abdolreza Nazemi},
title={A Comparison of Three Pre-processing Methods for Improving Main Content Extraction from Hyperlink Rich Web Documents},
booktitle={Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2014},
pages={335-339},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004947503350339},
isbn={978-989-758-024-6},
}

in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - A Comparison of Three Pre-processing Methods for Improving Main Content Extraction from Hyperlink Rich Web Documents
SN - 978-989-758-024-6
AU - Ghorbani M.
AU - Mohammadzadeh H.
AU - Nazemi A.
PY - 2014
SP - 335
EP - 339
DO - 10.5220/0004947503350339