Authors:
Hadi Mohammadzadeh 1; Thomas Gottron 2; Franz Schweiggert 1 and Gholamreza Nakhaeizadeh 3
Affiliations:
1 University of Ulm, Germany; 2 Universität Koblenz-Landau, Germany; 3 University of Karlsruhe, Germany
Keyword(s):
Main content extraction, Information extraction, Web mining, HTML web pages.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Information Extraction; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Soft Computing; Symbolic Systems; Web Mining
Abstract:
Extracting the main content of web documents with high accuracy is an important challenge for researchers working on the web. In this paper, we present a novel language-independent method for extracting the main content of web pages. Our method, called DANAg, performs well in comparison with other main content extraction approaches in terms of both effectiveness and efficiency. The extraction process of DANAg is divided into four phases. In the first phase, we calculate the length of content and the length of code in fixed segments of an HTML file. The second phase applies a naive smoothing method to highlight the segments forming the main content. After that, we use a simple algorithm to recognize the boundary of the main content in the HTML file. Finally, we feed the selected main content area to our parser in order to extract the main content of the targeted web page.
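
The four phases can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not specify the details, so every concrete choice below is an assumption: segments are taken to be lines of the HTML file, the smoothing is a moving average over the difference between content length and code length, the boundary is the contiguous run of positive smoothed scores with the largest sum, and the "parser" simply strips tags. All function names (`segment_lengths`, `smooth`, `main_region`, `extract_main_content`) and the `radius` parameter are hypothetical.

```python
import re
from html import unescape

# Assumption: anything between '<' and '>' counts as code, the rest as content.
TAG_RE = re.compile(r"<[^>]*>")


def segment_lengths(html):
    """Phase 1 (sketch): per line, return (content_length, code_length)."""
    rows = []
    for line in html.splitlines():
        code = sum(len(tag) for tag in TAG_RE.findall(line))
        content = len(TAG_RE.sub("", line).strip())
        rows.append((content, code))
    return rows


def smooth(values, radius=1):
    """Phase 2 (sketch): naive moving average over a window of 2*radius+1 lines."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def main_region(scores):
    """Phase 3 (sketch): half-open (start, end) of the positive run with the largest sum."""
    best, best_sum = (0, 0), 0.0
    start, run = None, 0.0
    for i, score in enumerate(list(scores) + [0.0]):  # trailing sentinel closes the last run
        if score > 0:
            if start is None:
                start, run = i, 0.0
            run += score
        elif start is not None:
            if run > best_sum:
                best, best_sum = (start, i), run
            start = None
    return best


def extract_main_content(html, radius=1):
    """Phase 4 (sketch): strip tags from the lines inside the selected region."""
    lines = html.splitlines()
    diff = [content - code for content, code in segment_lengths(html)]
    start, end = main_region(smooth(diff, radius))
    texts = [unescape(TAG_RE.sub("", line)).strip() for line in lines[start:end]]
    return "\n".join(text for text in texts if text)
```

Calling `extract_main_content(page)` on an HTML string returns the text of the lines whose smoothed content-minus-code score is positive and which form the heaviest contiguous block. Because the sketch relies only on character counts and not on vocabulary, it stays language-independent in the sense the abstract describes; a larger `radius` bridges short code-heavy lines inside an article at the cost of pulling in neighbouring boilerplate.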