Authors:
Houriye Esfahanian 1; Abdolreza Nazemi 2 and Andreas Geyer-Schulz 2
Affiliations:
1 Non-Governmental Non-Profit College, Refah, Tehran, Iran; 2 Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
Keyword(s):
Main Content Extraction, Evaluation Methods, Boilerplate Detection, Right to Left Languages.
Abstract:
As the volume of information published on the Web grows daily, extracting a web page’s main content has become an important problem. Since 2010, content in right-to-left languages such as Arabic and Persian has also been increasing alongside English content. In this paper, we compare three well-known main content extraction algorithms published in the last decade, Boilerpipe, DANAg, and Web-AM, to find the best algorithm in terms of evaluation measures and performance. The ArticleExtractor algorithm of the Boilerpipe approach was the most accurate, with the highest average F1 score of 0.951. In contrast, the DANAg algorithm had the best performance, processing more than 21 megabytes per second. Depending on whether accuracy or processing speed matters more for a main content extraction project, either Boilerpipe or DANAg can be used.
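To illustrate the F1 measure reported above, the following is a minimal sketch, not the evaluation code used in the paper. It assumes a token-level comparison in which both the extracted text and a hand-labelled gold-standard main content are split on whitespace and compared as multisets of words; the function name `f1_score` and this matching scheme are illustrative assumptions, and the paper's evaluation may match text differently (for example, on subsequences).

```python
from collections import Counter


def f1_score(extracted: str, gold: str) -> float:
    """Token-level F1 between extracted text and gold-standard main content.

    Assumption: tokens are whitespace-separated words, compared as multisets.
    """
    extracted_tokens = Counter(extracted.split())
    gold_tokens = Counter(gold.split())
    # Multiset intersection: tokens present in both texts, with multiplicity.
    overlap = sum((extracted_tokens & gold_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(extracted_tokens.values())  # share of output that is correct
    recall = overlap / sum(gold_tokens.values())          # share of gold content recovered
    return 2 * precision * recall / (precision + recall)
```

Because the tokens are split on whitespace only, the same sketch applies unchanged to right-to-left text such as Persian or Arabic. An extractor that returns all of the main content plus some boilerplate has perfect recall but reduced precision, which lowers its F1 score.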