Authors:
Hadi Mohammadzadeh 1; Franz Schweiggert 1 and Gholamreza Nakhaeizadeh 2
Affiliations:
1 University of Ulm, Germany; 2 University of Karlsruhe, Germany
Keyword(s):
Main content extraction, Information retrieval, UTF-8, HTML documents, Right to left languages.
Related Ontology Subjects/Areas/Topics:
Business Analytics; Data and Information Retrieval; Data Engineering
Abstract:
In this paper, we propose a new and simple approach for extracting the main content of web pages written in right-to-left (R2L) languages. Independence from the DOM tree and HTML tags is one of the most important features of the proposed algorithm. In practice, HTML tags are written in English, and the English character set occupies the interval [0, 127]. Most R2L languages, such as Arabic, instead use a definite interval of the Unicode character set that lies outside this range. In the first phase of our approach, we use this distinction to separate the R2L characters from the English ones. Then, for each HTML file, we determine the density of R2L characters and the density of non-R2L characters. The part of the HTML file with a high density of R2L characters and a low density of non-R2L characters contains the main content of the web page with high accuracy. The proposed algorithm has been tested, evaluated, and compared with the latest main content extraction approach on 2166 selected web pages.
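
The idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how density is computed or which region is selected, so the sketch assumes a per-line ratio and a fixed threshold. The Arabic block U+0600 to U+06FF stands in for the "definite interval" of R2L characters, and the names `is_r2l`, `line_counts`, `extract_main_content`, and `min_ratio` are hypothetical.

```python
# Assumed R2L interval: the Unicode Arabic block (U+0600..U+06FF).
# The abstract only says "a definite interval", so this is an assumption.
R2L_START, R2L_END = 0x0600, 0x06FF


def is_r2l(ch):
    """Return True if the character falls in the assumed R2L interval."""
    return R2L_START <= ord(ch) <= R2L_END


def line_counts(line):
    """Count R2L characters and non-whitespace ASCII characters in a line."""
    r2l = sum(1 for ch in line if is_r2l(ch))
    non_r2l = sum(1 for ch in line if ord(ch) <= 127 and not ch.isspace())
    return r2l, non_r2l


def extract_main_content(html, min_ratio=0.5):
    """Keep the R2L text of lines whose share of R2L characters is high.

    min_ratio is a hypothetical threshold, not a value from the paper.
    """
    kept = []
    for line in html.splitlines():
        r2l, non_r2l = line_counts(line)
        total = r2l + non_r2l
        if total and r2l / total >= min_ratio:
            # Drop the ASCII markup, keep R2L characters and spacing.
            text = "".join(ch for ch in line if is_r2l(ch) or ch.isspace())
            kept.append(" ".join(text.split()))
    return "\n".join(kept)


if __name__ == "__main__":
    sample = (
        '<html><head><title>x</title></head>\n'
        '<div class="menu"><a href="/a">سلام</a></div>\n'
        '<p>هذا هو المحتوى الرئيسي للصفحة</p>\n'
        '</html>'
    )
    print(extract_main_content(sample))
```

In this sketch, the menu line is dominated by markup and is discarded, while the paragraph line is dominated by Arabic characters and is kept. No HTML parsing or DOM tree is involved, which mirrors the tag-independence claimed in the abstract.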