Authors: Hadi Mohammadzadeh 1 ; Franz Schweiggert 1 and Gholamreza Nakhaeizadeh 2

Affiliations: 1 University of Ulm, Germany ; 2 University of Karlsruhe, Germany

Keyword(s): Main content extraction, Information retrieval, UTF-8, HTML documents, Right to left languages.

Related Ontology Subjects/Areas/Topics: Business Analytics ; Data and Information Retrieval ; Data Engineering

Abstract: In this paper, we propose a new and simple approach to extracting the main content of right-to-left (R2L) language web pages. Independence from the DOM tree and HTML tags is one of the most important features of the proposed algorithm. In practice, HTML tags are written in English, and the English character set lies in the interval [0,127]. Most languages written from right to left, such as Arabic, instead use a definite interval of the Unicode character set that lies outside this interval. In the first phase of our approach, we use this distinction to separate the R2L characters from the English ones. Then, for each HTML file, we determine the density of the R2L characters and the density of the non-R2L characters. The part of the HTML file with a high density of R2L characters and a low density of non-R2L characters contains the main content of the web page with high accuracy. The proposed algorithm has been tested, evaluated, and compared with the latest main content extraction approach on 2166 selected web pages.
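
The density idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the granularity or the thresholds, so the per-line counting, the Arabic block range U+0600 to U+06FF, the `min_r2l` cutoff, and the function names `is_r2l`, `line_densities`, and `extract_main_content` are all assumptions made for illustration. The sketch works on decoded Unicode code points rather than raw UTF-8 bytes.

```python
def is_r2l(ch: str) -> bool:
    # Assumption: only the Arabic block (U+0600..U+06FF) counts as R2L here.
    # Other right-to-left scripts would need their own ranges.
    return 0x0600 <= ord(ch) <= 0x06FF


def line_densities(html: str) -> list[tuple[int, int]]:
    """Return (r2l_count, non_r2l_count) for every line of the HTML source.

    Non-R2L characters are taken to be non-whitespace code points in the
    interval [0, 127], which is where HTML tags and attributes live.
    """
    densities = []
    for line in html.splitlines():
        r2l = sum(1 for ch in line if is_r2l(ch))
        non_r2l = sum(1 for ch in line if ord(ch) < 128 and not ch.isspace())
        densities.append((r2l, non_r2l))
    return densities


def extract_main_content(html: str, min_r2l: int = 10) -> list[str]:
    """Keep lines with many R2L characters and comparatively few ASCII ones.

    The rule (r2l >= min_r2l and r2l > non_r2l) is a hypothetical stand-in
    for the "high R2L density, low non-R2L density" criterion.
    """
    lines = html.splitlines()
    kept = []
    for line, (r2l, non_r2l) in zip(lines, line_densities(html)):
        if r2l >= min_r2l and r2l > non_r2l:
            kept.append(line)
    return kept
```

Under these assumptions, a navigation line that is mostly markup with a short Arabic link label is dropped, while a paragraph line dominated by Arabic text is kept. No DOM parsing is involved, which matches the tag-independence claim in the abstract.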

CC BY-NC-ND 4.0

Paper citation in several formats:
Mohammadzadeh, H.; Schweiggert, F. and Nakhaeizadeh, G. (2011). USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES. In Proceedings of the 6th International Conference on Software and Database Technologies - Volume 1: ICSOFT; ISBN 978-989-8425-76-8; ISSN 2184-2833, SciTePress, pages 243-249. DOI: 10.5220/0003508502430249

@conference{icsoft11,
author={Hadi Mohammadzadeh and Franz Schweiggert and Gholamreza Nakhaeizadeh},
title={USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES},
booktitle={Proceedings of the 6th International Conference on Software and Database Technologies - Volume 1: ICSOFT},
year={2011},
pages={243-249},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003508502430249},
isbn={978-989-8425-76-8},
issn={2184-2833},
}

TY - CONF
JO - Proceedings of the 6th International Conference on Software and Database Technologies - Volume 1: ICSOFT
TI - USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES
SN - 978-989-8425-76-8
IS - 2184-2833
AU - Mohammadzadeh, H.
AU - Schweiggert, F.
AU - Nakhaeizadeh, G.
PY - 2011
SP - 243
EP - 249
DO - 10.5220/0003508502430249
PB - SciTePress