2 RELATED WORK
The text content located directly on the page is
the most widely used feature. A WPC method presented
by Selamat and Omatu (Selamat and Omatu, 2004) used
a neural network with inputs based on principal
component analysis and class profile-based features.
By selecting the most regular words in each class and
weighting them, and by using several classification
methods, they were able to demonstrate acceptable
accuracy. Chen and Hsieh (Chen and Hsieh, 2006)
proposed a WPC method using an SVM based on a
weighted voting scheme. This method uses latent
semantic analysis to find relations between keywords
and documents, together with text features extracted
from the web page content. Both feature sets are then
passed to the SVM model for training and testing.
Then, based on the SVM output, a voting scheme is
used to determine the category of the web page.
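As an illustration only, a generic weighted vote over classifier outputs can be sketched as follows. The text does not state how Chen and Hsieh derive the weights or combine the SVM outputs, so the `Vote` type, the `weightedVote` function and the tie-breaking rule are assumptions of this sketch rather than their method.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One classifier decision for a page (hypothetical type for this sketch).
struct Vote {
    std::string category;  // predicted category label
    double weight;         // non-negative weight given to this decision
};

// Sums the weights per category and returns the category with the largest
// total. Returns an empty string when there are no votes. Ties go to the
// lexicographically smallest label, because std::map iterates in key order
// and the comparison below is strict.
std::string weightedVote(const std::vector<Vote>& votes) {
    std::map<std::string, double> totals;
    for (const Vote& v : votes) {
        assert(v.weight >= 0.0);
        totals[v.category] += v.weight;
    }

    std::string best;
    double bestTotal = -1.0;
    for (const auto& entry : totals) {
        if (entry.second > bestTotal) {
            best = entry.first;
            bestTotal = entry.second;
        }
    }
    return best;
}
```

With this rule, two moderately weighted decisions for one category can outweigh a single stronger decision for another.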
Few studies address WPC using visual content,
because traditionally only text information is used,
and it achieves reasonable accuracy. It has been
noted, however (de Boer et al., 2010), that visual
content can help to disambiguate a classification
based only on text content. Another factor in favor
of using visual content is that subjective variables
such as design recency and aesthetic value cannot be
studied using the text content contained in the HTML
code. These variables are increasing in importance
due to web marketing strategies.
A WPC approach based on visual information was
implemented by Asirvatham and Ravi (Asirvatham
and Ravi, 2001), in which a number of visual features,
as well as text features, were used. They proposed
a method for the automatic categorization of web
pages into a few broad categories based on the
structure of the web documents and the images they
contain.
Another approach was proposed by Kovacevic et al.
(Kovacevic et al., 2004), where a page is represented
as a hierarchical structure, the Visual Adjacency
Multigraph, in which nodes represent simple HTML
objects, texts and images, while directed edges
reflect spatial relations on the browser screen.
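The graph representation described above can be pictured with a minimal data structure. This is a sketch under stated assumptions, not the structure used by Kovacevic et al.: the `NodeKind` and `Relation` label sets, the class name and its methods are hypothetical, and the hierarchical aspect of the representation is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Kinds of page elements a node may stand for (assumed label set).
enum class NodeKind { HtmlObject, Text, Image };

// Spatial relation carried by a directed edge (assumed label set).
enum class Relation { LeftOf, RightOf, Above, Below };

struct Node {
    NodeKind kind;
    std::string label;
};

struct Edge {
    std::size_t from;
    std::size_t to;
    Relation relation;
};

// Directed multigraph: several edges may connect the same pair of nodes,
// one per spatial relation that holds between them.
class VisualAdjacencyMultigraph {
public:
    // Adds a node and returns its index.
    std::size_t addNode(NodeKind kind, std::string label) {
        nodes_.push_back({kind, std::move(label)});
        return nodes_.size() - 1;
    }

    // Adds a directed edge; parallel edges are allowed.
    void addEdge(std::size_t from, std::size_t to, Relation relation) {
        assert(from < nodes_.size() && to < nodes_.size());
        edges_.push_back({from, to, relation});
    }

    // Returns copies of all edges leaving the given node.
    std::vector<Edge> edgesFrom(std::size_t from) const {
        std::vector<Edge> out;
        for (const Edge& e : edges_)
            if (e.from == from) out.push_back(e);
        return out;
    }

    std::size_t nodeCount() const { return nodes_.size(); }
    std::size_t edgeCount() const { return edges_.size(); }

private:
    std::vector<Node> nodes_;
    std::vector<Edge> edges_;
};
```

Storing edges in a flat list keeps parallel edges distinct, which is what separates a multigraph from a simple directed graph.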
As mentioned previously, de Boer et al. (de Boer
et al., 2010) successfully classified web pages using
only visual features. They classified pages according
to two binary variables, aesthetic value and design
recency, achieving good accuracy. The authors also
applied the same classification algorithm and methods
to a multi-class categorization of the website topic;
although the results obtained were reasonable, they
concluded that this classification is more difficult
to perform.
3 CLASSIFICATION PROCESS
This section presents the methodology used to
fulfill the proposed objectives, namely how new web
pages are classified. Fig. 1 shows the steps needed
to predict the class of a new web page. The algorithms
were developed in C/C++ using the OpenCV library
(Bradski, 2000), which runs under Windows, Linux and
Mac OS X.
The next subsections explain the methods used to
extract features from the images and the construction
of the respective feature descriptors. The techniques
used to perform feature selection are also explained
in detail.
3.1 Feature Extraction
In computer vision and image processing, a feature
is a piece of information that is relevant and
distinctive. For each web page, different feature
descriptors (feature vectors) are computed. This
section describes how a low-level descriptor
containing 166 attributes that characterize the page
is obtained, and how the SIFT descriptor is built
using the Bag of Words model.
3.1.1 Low Level Descriptor
Visual descriptors are descriptions of the visual
features of the content of an image. These descriptors
describe elementary characteristics such as shape,
color, texture and motion, among others. To build this
descriptor, the following features were extracted from
each image: color histogram, edge histogram, Tamura
features and Gabor features.
Color Histogram. A color histogram is a
representation of the distribution of colors in an
image. It can be built in any color space, but the one
used in this work is the HSV color space. It was
selected because it reflects human vision quite
accurately and because it mainly uses only one of its
components (Hue) to describe the main properties of
color in the image. The Hue histogram is constructed
by discretizing the colors in the image into 32 bins,
each representing a range of hue values. The histogram
thus provides a compact summary of the distribution of
data in an image.
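The construction of a 32-bin Hue histogram can be sketched as below. This is a minimal, standard-library-only illustration and not the authors' OpenCV implementation: the `Rgb` type, the function names, the decision to skip achromatic pixels (whose hue is undefined) and the normalization to unit sum are all assumptions of this sketch.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// One 8-bit RGB pixel (hypothetical type for this sketch).
struct Rgb {
    std::uint8_t r, g, b;
};

constexpr int kHueBins = 32;  // number of bins stated in the text

// Hue in degrees, in [0, 360). Returns a negative value for achromatic
// pixels (r == g == b), for which hue is undefined.
double hueDegrees(const Rgb& p) {
    const double r = p.r / 255.0, g = p.g / 255.0, b = p.b / 255.0;
    const double maxc = std::max({r, g, b});
    const double minc = std::min({r, g, b});
    const double delta = maxc - minc;
    if (delta == 0.0) return -1.0;

    double h;
    if (maxc == r)      h = std::fmod((g - b) / delta, 6.0);
    else if (maxc == g) h = (b - r) / delta + 2.0;
    else                h = (r - g) / delta + 4.0;
    h *= 60.0;
    if (h < 0.0) h += 360.0;
    return h;
}

// Builds a hue histogram with kHueBins equal-width bins. Achromatic pixels
// are skipped, and the result is normalized to sum to 1 when at least one
// chromatic pixel is present (both choices are assumptions).
std::array<double, kHueBins> hueHistogram(const std::vector<Rgb>& pixels) {
    std::array<double, kHueBins> hist{};
    std::size_t counted = 0;
    for (const Rgb& p : pixels) {
        const double h = hueDegrees(p);
        if (h < 0.0) continue;
        int bin = static_cast<int>(h / 360.0 * kHueBins);
        bin = std::min(bin, kHueBins - 1);  // guard against rounding at 360
        assert(bin >= 0 && bin < kHueBins);
        hist[bin] += 1.0;
        ++counted;
    }
    if (counted > 0)
        for (double& v : hist) v /= static_cast<double>(counted);
    return hist;
}
```

Each bin covers 360/32 = 11.25 degrees of hue, so pure red, green and blue pixels fall into bins 0, 10 and 21 respectively.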
Edge Histogram. An edge histogram represents the
frequency and directionality of the brightness
changes in the image. The Edge Histogram Descriptor
(EHD) describes the edge distribution in an image.
It is a descriptor that expresses only the local edge
WEBIST 2014 - International Conference on Web Information Systems and Technologies