Authors: Hadi Mohammadzadeh 1 ; Thomas Gottron 2 ; Franz Schweiggert 1 and Gholamreza Nakhaeizadeh 3

Affiliations: 1 University of Ulm, Germany ; 2 Universität Koblenz-Landau, Germany ; 3 University of Karlsruhe, Germany

ISBN: 978-989-8565-08-2

ISSN: 2184-3252

Keyword(s): Main Content Extraction, Web Mining, HTML Web Pages, Normalization.

Abstract: In this paper, we introduce AdDANAg, a language-independent approach to extracting the main content of web documents. The approach combines best-of-breed techniques from recent content extraction approaches to yield better extraction results. This combination brings together two pre-processing steps, e.g. to normalize the document presentation and reduce the impact of certain syntactical structures, and four phases for the actual content extraction. We show that AdDANAg achieves high performance in terms of effectiveness and efficiency and outperforms previous approaches.

License: CC BY-NC-ND 4.0

Paper citation in several formats:
Mohammadzadeh, H.; Gottron, T.; Schweiggert, F. and Nakhaeizadeh, G. (2012). THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION. In Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8565-08-2, ISSN 2184-3252, pages 677-682. DOI: 10.5220/0003931906770682

@conference{webist12,
author={Hadi Mohammadzadeh and Thomas Gottron and Franz Schweiggert and Gholamreza Nakhaeizadeh},
title={THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION},
booktitle={Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2012},
pages={677-682},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003931906770682},
isbn={978-989-8565-08-2},
}

TY - CONF

JO - Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - THE IMPACT OF SOURCE CODE NORMALIZATION ON MAIN CONTENT EXTRACTION
SN - 978-989-8565-08-2
AU - Mohammadzadeh, H.
AU - Gottron, T.
AU - Schweiggert, F.
AU - Nakhaeizadeh, G.
PY - 2012
SP - 677
EP - 682
DO - 10.5220/0003931906770682
