Authors:
Najlah Gali and Pasi Fränti
Affiliation:
University of Eastern Finland, Finland
Keyword(s):
Title Extraction, Information Extraction, Web Data Extraction, Web Mining.
Abstract:
Web pages are usually designed in a presentation-oriented fashion and therefore contain a large amount of non-informative data such as navigation banners, advertisements, and functional text. For a particular user, only informative data such as the title, main content, and representative images are considered useful. Existing methods for title extraction rely on the structural and visual features of the web page. In this paper, we propose a simpler but more effective method that analyses the content of the title and meta tags with respect to the main body of the page. We segment the title and meta tags using a set of predefined delimiters and score the segments using three criteria: placement in the tag, popularity within all header tags in the page, and position in the link of the web page. The method is fully automated, template independent, and not limited to any particular type of web page. Experimental results show that the method significantly improves the accuracy (average similarity to the ground truth title) from 62% to 84%.
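The segment-and-score idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the delimiter set, the equal weighting of the three criteria, and the function names (`segment`, `score`, `extract_title`) are all assumptions made for illustration, and for brevity it handles only the title tag, not the meta tags.

```python
import re

# Hypothetical delimiter set; the abstract only says "predefined delimiters".
DELIMITERS = r"\s+[|\-–:»]\s+|\s*\|\s*"


def segment(text):
    """Split a title string into candidate segments at the delimiters."""
    return [s.strip() for s in re.split(DELIMITERS, text) if s.strip()]


def normalize(text):
    """Lowercase, drop apostrophes, and collapse non-alphanumerics to spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower().replace("'", "")).strip()


def score(seg, index, total, headers, url):
    """Sum three illustrative criteria, each in [0, 1] (equal weights assumed)."""
    # 1. Placement in the tag: earlier segments score higher (assumption).
    placement = 1.0 - index / total
    # 2. Popularity: fraction of header texts that contain the segment.
    norm = normalize(seg)
    popularity = sum(norm in normalize(h) for h in headers) / max(len(headers), 1)
    # 3. Link: fraction of the segment's words that appear as tokens in the URL.
    url_tokens = set(normalize(url).split())
    words = norm.split()
    in_url = sum(w in url_tokens for w in words) / max(len(words), 1)
    return placement + popularity + in_url


def extract_title(title_tag, headers, url):
    """Return the highest-scoring segment of the title tag, or '' if none."""
    segs = segment(title_tag)
    scored = [(score(s, i, len(segs), headers, url), s) for i, s in enumerate(segs)]
    return max(scored)[1] if scored else ""


if __name__ == "__main__":
    print(extract_title(
        "Best Pizza in Town | Mario's Restaurant | Home",
        ["Mario's Restaurant", "Menu"],
        "http://www.marios-restaurant.example/home",
    ))
```

In this made-up example the middle segment wins because it also appears in a header tag and in the URL, even though it is not the first segment of the title tag.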