Content-based Title Extraction from Web Page

Najlah Gali; Pasi Fränti

Topics: Data Web Mining; Web Information Filtering and Retrieval

In Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, pages 204-210, 2016, Rome, Italy

Authors: Najlah Gali and Pasi Fränti

Affiliation: University of Eastern Finland, Finland

Keyword(s): Title Extraction, Information Extraction, Web Data Extraction, Web Mining.

Abstract: Web pages are usually designed in a presentation-oriented fashion and therefore contain a large amount of non-informative data such as navigation banners, advertisements, and functional text. For a particular user, only informative data such as the title, main content, and representative images are considered useful. Existing methods for title extraction rely on the structural and visual features of the web page. In this paper, we propose a simpler but more effective method that analyses the content of the title and meta tags with respect to the main body of the page. We segment the title and meta tags using a set of predefined delimiters and score the segments using three criteria: placement in the tag, popularity within all header tags in the page, and position in the link of the web page. The method is fully automated, template independent, and not limited to any particular type of web page. Experimental results show that the method significantly improves the accuracy (average similarity to the ground truth title) from 62% to 84%.
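
The segment-and-score idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the delimiter set, the three scoring formulas, their equal weighting, and all function names (`segment`, `score_segment`, `extract_title`) are assumptions made for illustration, since the abstract does not specify them.

```python
import re
from urllib.parse import urlparse

# Hypothetical delimiter set; the abstract does not list the actual delimiters.
DELIMITERS = r"\s+[-|:»]\s+"


def segment(text):
    """Split a title or meta string into candidate segments."""
    return [s.strip() for s in re.split(DELIMITERS, text) if s.strip()]


def score_segment(seg, index, n_segments, headers, url):
    """Score one segment with three illustrative criteria (equal weights assumed)."""
    words = seg.lower().split()

    # 1. Placement in tag: earlier segments score higher (assumed formula).
    placement = 1.0 - index / n_segments

    # 2. Popularity: fraction of header texts (h1..h6) that contain the segment.
    popularity = 0.0
    if headers:
        popularity = sum(seg.lower() in h.lower() for h in headers) / len(headers)

    # 3. Position in link: fraction of segment words found among URL path tokens.
    path_tokens = set(re.split(r"[^a-z0-9]+", urlparse(url).path.lower()))
    in_link = sum(w in path_tokens for w in words) / len(words) if words else 0.0

    return placement + popularity + in_link


def extract_title(title_text, meta_texts, headers, url):
    """Return the highest-scoring segment across the title and meta strings."""
    best, best_score = None, float("-inf")
    for text in [title_text, *meta_texts]:
        segments = segment(text)
        for i, seg in enumerate(segments):
            s = score_segment(seg, i, len(segments), headers, url)
            if s > best_score:
                best, best_score = seg, s
    return best


if __name__ == "__main__":
    # Made-up page data, used only to exercise the sketch.
    print(extract_title(
        "Home | Luigi Pizza Helsinki | Restaurants",
        ["Luigi Pizza Helsinki - best pizza in town"],
        ["Luigi Pizza Helsinki", "Menu"],
        "https://example.com/luigi-pizza-helsinki",
    ))
```

In this sketch the inputs are passed in as plain strings; a real pipeline would first parse the HTML to collect the title tag, meta tags, and header texts.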

CC BY-NC-ND 4.0

Paper citation in several formats:

Gali, N. and Fränti, P. (2016). Content-based Title Extraction from Web Page. In Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST; ISBN 978-989-758-186-1; ISSN 2184-3252, SciTePress, pages 204-210. DOI: 10.5220/0005794102040210

@conference{webist16,
author={Najlah Gali and Pasi Fränti},
title={Content-based Title Extraction from Web Page},
booktitle={Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2016},
pages={204-210},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005794102040210},
isbn={978-989-758-186-1},
issn={2184-3252},
}

TY - CONF
JO - Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - Content-based Title Extraction from Web Page
SN - 978-989-758-186-1
IS - 2184-3252
AU - Gali, N.
AU - Fränti, P.
PY - 2016
SP - 204
EP - 210
DO - 10.5220/0005794102040210
PB - SciTePress