embedded in the article block and has a caption and
that non-article images have no caption. It considers
only images with captions as candidate images and
may therefore miss potentially useful web images
that do not have captions.
The image extraction method in (Adam et al.,
2010) focuses on web pages that are written in article-
style (title and body). The method locates the border
of the article and selects an image from this region
based on its size and aspect ratio. It annotates
images by identifying the captions assigned to
them. This method considers only images that
accompany the article, which is one section of the web
page, and may falsely select advertisement images if
they have an acceptable size and aspect ratio. Our task is
wider because we consider all images in the web page
and we select an image that represents the entire web
page.
The Google+ share preview snippet (Google+, 2014)
summarizes a post made to Google+. It includes a
link, a page title, a brief description of the page, and
a thumbnail image. The image is selected based on its
size and aspect ratio. The image height must be at
least 120 pixels, and if the width of the image is less
than 100 pixels, then the aspect ratio must be ≤ 3.
Although the explicit framework for the snippet has
not been published in any scientific forum, the
method is described in a technical document and is
used in a real application.
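The size rule quoted above can be written as a short predicate. This is a minimal sketch of the rule as stated in this section, not Google's implementation. The function name `passes_snippet_rule` is hypothetical, and the aspect ratio is assumed here to mean the longer side divided by the shorter side, since the text does not define it.

```python
def passes_snippet_rule(width: int, height: int) -> bool:
    """Check the size rule described in the text (sketch, not Google's code)."""
    if width <= 0 or height <= 0:
        return False
    # The image height must be at least 120 pixels.
    if height < 120:
        return False
    # If the width is below 100 pixels, the aspect ratio must be <= 3.
    if width < 100:
        # Assumption: aspect ratio = longer side / shorter side.
        aspect_ratio = max(width, height) / min(width, height)
        return aspect_ratio <= 3
    return True
```

Under these assumptions, a 400x300 image passes, a 400x100 image fails the height test, and a 30x150 image fails because its aspect ratio is 5.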
The method in (Tsymbalenko and Munson, 2001)
focuses on finding images relevant to a specific query
without downloading or analyzing the images. It
examines only the text that surrounds the image tag
in the source code of the web page and then decides
whether the image is relevant or not. However, many
web pages do not have text surrounding their useful
images, which leads to these images being excluded
from the set of candidate images.
A functional categorization of images is studied in
(Hu and Bagga, 2003). The images are classified by
their usage in the web page into eight categories:
story, preview, commercial, host, heading, icons and
logos, formatting, and miscellaneous. These are
further grouped into two
super-classes: one for useful images (story, preview
and host) and one for the images that are not
associated with the content (the other categories). We
also use image categorization, but we apply it directly
within the method to help choose the best image.
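The two-level grouping described above can be expressed as a small lookup. This is only an illustrative sketch of the grouping as summarized here. The identifiers `CATEGORIES`, `USEFUL`, and `super_class`, and the labels "useful" and "not content-related", are hypothetical names chosen for the example.

```python
# The eight functional categories listed in the text.
CATEGORIES = [
    "story", "preview", "commercial", "host",
    "heading", "icons and logos", "formatting", "miscellaneous",
]

# Categories that the text groups into the "useful images" super-class.
USEFUL = {"story", "preview", "host"}


def super_class(category: str) -> str:
    """Map a functional category to one of the two super-classes."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return "useful" if category in USEFUL else "not content-related"
```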
The method in (Gupta et al., 2003) recursively navigates
the Document Object Model (DOM) tree that is created
by parsing the HyperText Markup Language
(HTML) code and uses it to extract
relevant information, including images. It filters out
irrelevant data such as advertisement images by
examining the values of the src and href attributes to
determine the servers to which the links refer. If an
address matches an entry in a list of common
advertisement servers, the node of the DOM tree that
contains the link is deleted. We also use the DOM tree
of a web page, but we use more image attributes and
we define more categories.
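The server-based filtering idea can be sketched with the Python standard library. This is a simplified illustration, not the implementation of (Gupta et al., 2003): it makes a single pass over the tags instead of deleting nodes from a DOM tree, and the blocklist `AD_HOSTS`, the helper `refers_to_ad_server`, and the class `ImageCollector` are hypothetical names introduced for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical blocklist; the list used in the cited work is not given here.
AD_HOSTS = {"ads.example.com", "adserver.example.net"}


def refers_to_ad_server(url, ad_hosts=AD_HOSTS):
    """True if the URL's host is a listed ad server or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in ad_hosts)


class ImageCollector(HTMLParser):
    """Collect img src values, skipping images tied to ad servers."""

    def __init__(self):
        super().__init__()
        self.kept = []
        self._link_is_ad = []  # one flag per currently open <a> element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._link_is_ad.append(refers_to_ad_server(attrs.get("href") or ""))
        elif tag == "img":
            src = attrs.get("src") or ""
            inside_ad_link = any(self._link_is_ad)
            if src and not inside_ad_link and not refers_to_ad_server(src):
                self.kept.append(src)

    def handle_endtag(self, tag):
        if tag == "a" and self._link_is_ad:
            self._link_is_ad.pop()
```

After `feed()` is called with the page source, `kept` holds the `src` values of the images that are neither served from a listed host nor wrapped in a link to one.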
The method in (Parmar and Gadge, 2011)
removes advertisement images by using a rule-based
classifier. Seven rules are defined to decide whether
the image is an advertisement or not: domain name
difference, dimension, well-known advertisement
provider, advertisement related keywords,
advertisement by scripting, dynamic advertisement, and
Flash plug-in removal. This method eliminates most
advertisement images in the web page.
Although several researchers have worked
in related areas, none of the existing methods is
directly applicable to our problem as such. To our
knowledge, the only existing methods that address
this task directly are the commercial ones
implemented in Google+ and Facebook but,
according to our experiments, neither of them
works perfectly.
In this paper, we propose a method that parses the
source code of a web page, detects all the images and
selects one that best represents the content. Instead of
analyzing the content of the images or examining the
text surrounding them, we rely on the functional
purpose of the images within the web page and on
features such as the size, aspect ratio, and format
of the image and the attributes of the HTML tags.
Similarly to (Hu and Bagga, 2003), we classify the
images into categories. We define the following
categories based on image functionality:
representative, logos, banners, advertisements, and
formatting including icons. We rank the categories in
this order, based on how important they are with
respect to the content of the web page. The images in
each category are ranked based on their features.
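The two-level ranking described above (first by category, then by features within a category) can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring used in the paper: the `Image` record, `feature_score`, and `select_image` are hypothetical names, and the pixel area stands in as a placeholder for the feature-based ranking, whose details are not given in this section.

```python
from dataclasses import dataclass

# Category order as listed in the text, most important first.
CATEGORY_ORDER = ["representative", "logos", "banners",
                  "advertisements", "formatting"]


@dataclass
class Image:
    url: str
    category: str
    width: int
    height: int


def feature_score(img: Image) -> int:
    # Placeholder assumption: larger area ranks higher within a category.
    return img.width * img.height


def select_image(images):
    """Return the best image: best category first, then highest feature score."""
    if not images:
        return None
    return min(
        images,
        key=lambda im: (CATEGORY_ORDER.index(im.category), -feature_score(im)),
    )
```

In this sketch a small image in the representative category still outranks a large logo, because the category rank is compared before the feature score.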
The main contribution of our method is that it
does not rely on the surrounding text, on a particular
template, or on web page categories. Instead, it is targeted
to work with all types of web pages. It is therefore
general and not limited by the writing style or the
layout of the web page. Apart from the selection of the
threshold values, the method does not require any
training data. It is designed to work in real time,
without the need to store the results in a database or
to query a set of pre-downloaded web pages. Since the
images are classified beforehand, our method is
useful in several applications such as automatically
identifying advertisements, saving bandwidth in web
crawlers by carefully downloading only the most relevant
WEBIST 2015 - 11th International Conference on Web Information Systems and Technologies