A Comparison of Three Pre-processing Methods for Improving Main Content Extraction from Hyperlink Rich Web Documents

Moheb Ghorbani; Hadi Mohammadzadeh; Abdolreza Nazemi

Topics: Data Mining; Text Mining; Web Information Filtering and Retrieval

In Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, 335-339, 2014, Barcelona, Spain

Authors: Moheb Ghorbani¹; Hadi Mohammadzadeh² and Abdolreza Nazemi³

Affiliations: ¹ University of Tehran, Islamic Republic of Iran; ² University of Ulm, Germany; ³ Karlsruhe Institute of Technology (KIT), Germany

Keyword(s): Main Content Extraction, Pre-processing Algorithms, Hyperlink Rich Web Documents.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Data Mining; Databases and Information Systems Integration; Enterprise Information Systems; Sensor Networks; Signal Processing; Soft Computing

Abstract: Most HTML documents on the World Wide Web contain many hyperlinks, both in the main content area and in the additional areas around it. Because extracting the main content of such hyperlink-rich web documents is rather complicated, this paper presents three simple, language-independent pre-processing methods that handle the hyperlinks so that the main content can be identified more accurately. To evaluate and compare the methods, each of the three is combined with a prominent main content extraction method called DANAg. The results show that one of the methods is more effective than the other two.
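
The abstract does not say what the three pre-processing methods actually do, so the following is only a hypothetical illustration of the general idea of treating hyperlinks before a main content extractor runs. It sketches two generic strategies: dropping anchor elements together with their text, and unwrapping them so that only the text remains. The names `AnchorFilter` and `filter_anchors` are invented for this sketch; none of this is taken from the paper or from DANAg.

```python
from html.parser import HTMLParser


class AnchorFilter(HTMLParser):
    """Rebuild an HTML string while treating <a> elements in one of two ways.

    mode="drop":   remove each anchor together with its text
    mode="unwrap": remove only the <a> tags and keep the anchor text

    Both modes are illustrative assumptions, not the paper's methods.
    Comments and doctype declarations are not copied to the output.
    """

    def __init__(self, mode="drop"):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        if mode not in ("drop", "unwrap"):
            raise ValueError("mode must be 'drop' or 'unwrap'")
        self.mode = mode
        self.depth = 0   # number of currently open <a> elements
        self.out = []    # pieces of the rebuilt document

    def _skipping(self):
        # In "drop" mode everything inside an anchor is discarded.
        return self.mode == "drop" and self.depth > 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1
        elif not self._skipping():
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied as written.
        if tag != "a" and not self._skipping():
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a":
            self.depth = max(0, self.depth - 1)
        elif not self._skipping():
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skipping():
            self.out.append(data)

    def handle_entityref(self, name):
        if not self._skipping():
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self._skipping():
            self.out.append(f"&#{name};")


def filter_anchors(html, mode="drop"):
    """Return html with anchors dropped or unwrapped, depending on mode."""
    parser = AnchorFilter(mode)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    sample = '<p>Main text with <a href="/x">a link</a> inside.</p>'
    print(filter_anchors(sample, "drop"))
    print(filter_anchors(sample, "unwrap"))
```

The filtered string would then be handed to whichever extractor is under evaluation. Which of the two treatments helps, if either does, is an empirical question of the kind the paper's comparison sets out to answer.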

CC BY-NC-ND 4.0

Paper citation in several formats:

Ghorbani, M., Mohammadzadeh, H. and Nazemi, A. (2014). A Comparison of Three Pre-processing Methods for Improving Main Content Extraction from Hyperlink Rich Web Documents. In Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST; ISBN 978-989-758-024-6; ISSN 2184-3252, SciTePress, pages 335-339. DOI: 10.5220/0004947503350339

@conference{webist14,
author={Moheb Ghorbani and Hadi Mohammadzadeh and Abdolreza Nazemi},
title={A Comparison of Three Pre-processing Methods for Improving Main Content Extraction from Hyperlink Rich Web Documents},
booktitle={Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2014},
pages={335-339},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004947503350339},
isbn={978-989-758-024-6},
issn={2184-3252},
}

TY - CONF
JO - Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - A Comparison of Three Pre-processing Methods for Improving Main Content Extraction from Hyperlink Rich Web Documents
SN - 978-989-758-024-6
IS - 2184-3252
AU - Ghorbani, M.
AU - Mohammadzadeh, H.
AU - Nazemi, A.
PY - 2014
SP - 335
EP - 339
DO - 10.5220/0004947503350339
PB - SciTePress