A Comparative Study on Main Content Extraction Algorithms for Right to Left Languages
Houriye Esfahanian, Abdolreza Nazemi, Andreas Geyer-Schulz
2023
Abstract
As the volume of information published on the Web grows daily, extracting the main content of a web page has become an important problem. Since 2010, content in right-to-left languages such as Arabic and Persian has also been growing alongside English content. In this paper, we compare three well-known main content extraction algorithms published in the last decade, Boilerpipe, DANAg, and Web-AM, to identify the best algorithm in terms of both evaluation measures and runtime performance. The ArticleExtractor algorithm of the Boilerpipe approach was the most accurate, with the highest average F1 score of 0.951. In contrast, the DANAg algorithm had the best performance, processing more than 21 megabytes per second. Depending on whether accuracy or processing speed matters more to a main content extraction project, either Boilerpipe or DANAg can be used.
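To make the F1 measure quoted in the abstract concrete, here is a minimal Python sketch that scores an extracted text against a hand-labelled gold standard. It assumes a simple bag-of-tokens overlap with whitespace tokenization; the abstract does not say how the paper computes overlap, so the function `f1_score` and the sample strings are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter


def f1_score(extracted: str, gold: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over whitespace-delimited tokens.

    Assumption: token overlap is a multiset intersection. The paper may
    use a different overlap definition.
    """
    ext_tokens = extracted.split()
    gold_tokens = gold.split()
    if not ext_tokens or not gold_tokens:
        return 0.0, 0.0, 0.0
    # Each shared token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(ext_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(ext_tokens)  # share of the extraction that is main content
    recall = overlap / len(gold_tokens)    # share of the main content that was extracted
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical Persian example: the extractor kept one navigation word
# ("خانه", "home") in front of the five-word main content.
gold_text = "این متن اصلی مقاله است"
extracted_text = "خانه این متن اصلی مقاله است"
precision, recall, f1 = f1_score(extracted_text, gold_text)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

In this example the extraction recovers all of the main content (recall 1.0) but includes one boilerplate token, so precision drops to 5/6 and F1 to 10/11. Averaging such per-page scores over a test collection gives an aggregate figure of the kind the abstract reports.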
Paper Citation
in Harvard Style
Esfahanian H., Nazemi A. and Geyer-Schulz A. (2023). A Comparative Study on Main Content Extraction Algorithms for Right to Left Languages. In Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR; ISBN 978-989-758-671-2, SciTePress, pages 222-229. DOI: 10.5220/0012162000003598
in Bibtex Style
@conference{kdir23,
author={Houriye Esfahanian and Abdolreza Nazemi and Andreas Geyer-Schulz},
title={A Comparative Study on Main Content Extraction Algorithms for Right to Left Languages},
booktitle={Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR},
year={2023},
pages={222-229},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012162000003598},
isbn={978-989-758-671-2},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR
TI - A Comparative Study on Main Content Extraction Algorithms for Right to Left Languages
SN - 978-989-758-671-2
AU - Esfahanian H.
AU - Nazemi A.
AU - Geyer-Schulz A.
PY - 2023
SP - 222
EP - 229
DO - 10.5220/0012162000003598
PB - SciTePress