MACHINE LEARNING AND LINK ANALYSIS
FOR WEB CONTENT MINING
Moreno Carullo and Elisabetta Binaghi
Department of Computer Science and Communication, University of Insubria, 21100 Varese, Italy
Keywords:
Web content mining, Hyperlinks, Machine learning, Radial basis function networks.
Abstract:
In this work we define a hybrid Web Content Mining strategy aimed at recognizing within Web pages the main
entity, intended as the short text that refers directly to the main topic of a given page. The salient aspect of the
strategy is the use of a novel supervised Machine Learning model able to represent in a unified framework
the integrated use of visual page layout features, textual features and hyperlink descriptions. The proposed
approach has been evaluated with promising results.
1 INTRODUCTION
Web content is huge and dynamic, mainly constituted by unstructured or partially structured text data,
or by low-volume structured data following different non-standard schemes. Web documents are characterized
both by their heterogeneous contents, represented with different media and formats, and by the organization
of these contents within graphical page layouts that are increasingly consistent with usability guidelines.
Server log information, link structure and link descriptions constitute additional relevant information.
All these aspects make the extraction of useful knowledge a challenging research problem.
Since the year 2000 the application of learning strategies to Web Mining (WM) tasks has been intensively
studied, showing advantages in terms of both effectiveness and portability over conventional and earlier
strategies based on the knowledge engineering approach (Kosala and Blockeel, 2000). In supervised learning,
training examples consist of input/output pairs and the goal of the learning algorithm is to predict the
output values of previously unseen inputs. In unsupervised learning, training examples are constituted only
by input patterns; the learning algorithm generalizes from the input patterns to discover similarities among
data (Michalski et al., 1983; Mitchell, 1997).
According to (Kosala and Blockeel, 2000) three main areas of research can be distinguished: Web Content
Mining (WCM), the application of data mining techniques to Web documents; Web Usage Mining, the analysis
of interactions between the user and the Web; and Web Structure Mining and/or Link Analysis (LA), in which
the structure of hyperlinks is used to solve a problem in a graph-structured domain.
These WM tasks can be combined in a single application, reinforcing the mining process through the joint
analysis of different Web characteristics. Recent works propose in particular the integration of Web
Structure and Text Mining tasks. The hyperlink information, and in particular the anchor text (the text
appearing in the predecessor page and pointing to the target), has been used in Information Retrieval and
classification tasks (Spertus, 1997; Fürnkranz, 2002). Page classification in particular can benefit from
Link Analysis since it makes it possible to determine the topic of each page in a more robust way,
considering in the classification process the predicted category of neighbors (Chakrabarti et al., 1998;
Oh et al., 2000; Joachims et al., 2001).
Proceeding from these considerations, we intend to define and experimentally investigate a hybrid WCM
strategy aimed at recognizing within Web pages the main entity, intended as the short text that refers
directly to the main topic of a given page. We consider this application of great importance for the
building and maintenance of intelligent web agents oriented to advanced web-based user services. The
salient aspect of the strategy is the use of a novel supervised ML model able to represent in a unified
framework the integrated use of visual layout features, textual features and hyperlink descriptions.
2 WEB MAIN ENTITY RECOGNITION
Each page on the web has a definite topic that can be explicitly declared in the page itself with a short
text we name the main entity. Consider for instance online blogs: the main entity of each post's page is
the title of the post. An online newspaper's page will have the article's title as the main entity, and an
e-commerce website the product's name.

Each page p can be modeled as a collection of text entities B, where the main entity e ∈ B is the one best
describing the topic or title of the page. In addition we know the set of incoming links to p and, in
particular, for each link we know the anchor text, that is the text in the predecessor page pointing to p.
In the main entity recognition problem (MERP) the goal is to consider a page p and the set of Incoming
Link Anchor Texts (ILA) A = {a_1, ..., a_n} and to discover the text block ê recognized as the most
plausible main entity in the set H ⊆ B of candidate main entity blocks.
Suppose we have the possibility to collect a supervised truth for each web page, that is a set D of
tuples (p, e, A) where p is a web page in a given domain, e is the main entity and A is the set of ILA for
p. The supervised truth can be collected by providing domain experts with an environment where they can
select the main entity e ∈ B for each p.
MERP can be seen as a supervised classification task where the object to classify is the pair (b, A),
with b a candidate text block and A the set of anchor texts of its page, and the candidate labels are
"main entity" and "non main entity". The main issue lies in the fact that A is a set of elements of
variable cardinality while b is a single element. In order to allow a ML model to discover the
relationship between (b, A) and the label, a proper low-level feature representation has to be defined.
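As a purely illustrative aid, the data involved in MERP can be modeled with a few plain containers; the
class and field names below are assumptions made for this sketch and are not part of the proposed system.

from dataclasses import dataclass
from typing import List

@dataclass
class TextBlock:
    """A text block b extracted from a page (illustrative fields only)."""
    text: str
    font_size: float = 12.0
    bold: bool = False

@dataclass
class Page:
    """A web page p with its title and its set B of extracted text blocks."""
    url: str
    title: str
    blocks: List[TextBlock]

@dataclass
class SupervisedExample:
    """One (p, e, A) tuple of the supervised truth D."""
    page: Page
    main_entity: TextBlock      # e, selected by a domain expert
    anchor_texts: List[str]     # A, the incoming link anchor texts (ILA)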
3 THE PROPOSED APPROACH
The approach we propose has three main parts. The first is the definition of how the web page is
represented, that is, the structure and characteristics of a web page. The second is a set of
representative features that work on the defined web page structure. The third is the definition of a
suitable ML model for MERP.

We model each web page p as a semi-structured source of information where B = {b_1, ..., b_m} ⊆ 𝔹 is the
set of text blocks in p ∈ P, 𝔹 is the domain of all text blocks and P is the domain of pages. With the
term "text block" we refer to a string of text that appears in p as the inner text of a DOM [1] node. The
set B is obtained through the extract_blocks function, which maps a page p ∈ P to its set of text blocks
and is further explained in Sec. 3.4.
3.1 Feature Extraction
In the main entity recognition problem the object to classify is a text block b ∈ B, given its page p and
the set A of anchor texts for p. We therefore partition the features into two subsets: textblock features
and anchortext features. The elements of the former are functions f of the form f : 𝔹 × P → R, while in
the latter we have functions g of the form g : A × 𝔹 → R.

Textblock features describe low-level characteristics of the elements b ∈ B found in the page and their
relation with the page itself, such as the title, page dimensions, etc. (a code sketch of these features
is given after the list):
f_intitle: expresses the percentage of matching between the terms of the textblock b and the terms in the
title t of the page, following the web usability guideline stating that the main object of a web page
should be named in the title. For example, if the title of the web page p is t = "The quick brown fox
jumps over the lazy dog" and b = "The quick brown fox", then f_intitle(b, p) = 16/35, where 16 is the
total length of the common terms and 35 is the total length of the terms of t.

f_fsize: expresses the font size of the textblock b normalized w.r.t. a maximum value defined a priori.
The idea behind this feature is that the main entity is often presented with a big font size so that it
is easily identified by the user.

f_fbold: expresses whether or not the considered textblock b is displayed with a bold font. The idea
behind this feature is closely related to the f_fsize feature and is driven by the fact that the main
entity is often presented in bold.
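A minimal sketch of how the three textblock features could be computed follows; the tokenizer, the
reading of the 16/35 example as a ratio of summed term lengths, and the MAX_FONT_SIZE constant are
assumptions made for illustration.

import re

MAX_FONT_SIZE = 48.0  # assumed a-priori maximum font size used for normalization

def tokenize(text):
    """Letter-digit tokenizer (see Sec. 4): contiguous runs of letters or digits."""
    return re.findall(r"[a-zA-Z]+|[0-9]+", text.lower())

def f_intitle(block_text, page_title):
    """Share of the title covered by the block, measured in summed term lengths.

    With t = "The quick brown fox jumps over the lazy dog" and b = "The quick
    brown fox" this yields 16/35, as in the example above.
    """
    title_terms = tokenize(page_title)
    block_terms = tokenize(block_text)
    common_len = sum(len(term) for term in block_terms if term in title_terms)
    title_len = sum(len(term) for term in title_terms)
    return common_len / title_len if title_len > 0 else 0.0

def f_fsize(font_size):
    """Font size normalized w.r.t. an a-priori maximum, clipped to [0, 1]."""
    return min(font_size / MAX_FONT_SIZE, 1.0)

def f_fbold(is_bold):
    """1.0 if the block is rendered in bold, 0.0 otherwise."""
    return 1.0 if is_bold else 0.0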
Anchortext features take into account anchors and textblocks to allow the definition of low-level
relationships between the two. Here we define a simple, low-level feature of this kind:

g_dice: expresses the similarity between the textblock b and the anchortext a. The employed similarity
measure is the Dice coefficient (Frakes and Baeza-Yates, 1992), computed as:

g_dice(a, b) = 2C / (|a| + |b|)

where C is the number of common terms between a and b, and |a| and |b| are the number of terms of a and
b, respectively (a small sketch is given below).
[1] http://www.w3.org/TR/DOM-Level-2-HTML/
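The Dice coefficient is straightforward to compute once both texts have been tokenized; a minimal sketch
operating on pre-tokenized term lists follows (counting C as the number of distinct shared terms is one
plausible reading of the definition above).

def g_dice(anchor_terms, block_terms):
    """Dice coefficient between the term lists of an anchor text a and a text block b.

    anchor_terms, block_terms: term lists produced by the domain tokenizer.
    """
    if not anchor_terms and not block_terms:
        return 0.0
    common = len(set(anchor_terms) & set(block_terms))  # C, the shared terms
    return 2.0 * common / (len(anchor_terms) + len(block_terms))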
In the g_dice and f_intitle features a tokenizer is required to obtain a set of terms from a text block,
an anchor text or the title of the page. The tokenizer has to be selected depending on the domain of the
problem.
Given a set F of textblock features, a set G of anchortext features and a (p, b, A) tuple, we define the
textblock feature space Θ ⊆ R^n, n = |F|, and the anchortext feature space Γ. An element γ obtained from
the (b, A) pair is of the form γ = {x_1, ..., x_m}, where each instance x_i ∈ R^{|G|} is computed by
applying the features in G to an (a, b) pair, a ∈ A. The cardinality m of a γ element is not fixed a
priori, since the number |A| of anchortexts pointing to a given page may vary.
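A minimal sketch of how θ and γ could be assembled from the feature functions is given below; it assumes
each textblock feature is a callable f(b, p) and each anchortext feature a callable g(a, b), matching the
signatures of Sec. 3.1 (the concrete argument types are an illustrative choice).

import numpy as np

def compute_theta(block, page, F):
    """theta in R^|F|: one value per textblock feature applied to (b, p)."""
    return np.array([f(block, page) for f in F])

def compute_gamma(block, anchor_texts, G):
    """gamma = {x_1, ..., x_m}: one |G|-dimensional vector per anchor text a in A."""
    return [np.array([g(a, block) for g in G]) for a in anchor_texts]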
3.2 A Multi-source Machine Learning Algorithm
In this section we describe the ML model used to solve the MERP. In particular, we adopted the Radial
Basis Function Network (RBFN) model introduced in (Moody and Darken, 1989) for its proven training speed
and robustness on classification and regression tasks. These capabilities are especially suitable for the
inherent complexity of the WCM context.
The MERP defines the textblock b as the object to be classified, with the additional context of its page
p and the anchortext set A. As introduced in the previous section, we define two kinds of features that
give rise to two separate feature spaces Θ and Γ. Since Θ ⊆ R^n with n = |F|, the Euclidean L_2 distance
can be used in this space. The elements of the Γ space, however, are of the form γ = {x_1, ..., x_|γ|}
where |γ| is not fixed. A suitable distance for such variable-length objects is the Earth Mover's
Distance (EMD) (Rubner et al., 2000), which has been developed for variable-length distributions. The γ
object can be seen as a distribution modeling the relation of a textblock w.r.t. its anchors.
We describe the learning object (p, b, A) with ((θ, γ), y), where θ is obtained from (b, p) with the
features f ∈ F, γ is obtained from (A, b) with the features g ∈ G, and y ∈ Ω, Ω = {ω_1, ω_2}, is the
supervised label that defines whether (y = ω_1) or not (y = ω_2) a textblock b is a main entity.
RBFNs are a general-purpose ML solution and their application to a specific problem domain implies the
definition of a proper distance metric to learn from the feature space. In (Zhang and Zhou, 2006), for
instance, a RBFN has been adapted to Multi-Instance Learning problems by using the Hausdorff distance,
and in (Carullo et al., 2009) a content-based image soft-categorization algorithm has been defined by
integrating the EMD (Rubner et al., 2000) in a RBFN.
It is non-trivial to define a true distance metric over Θ and Γ since they are heterogeneous and defined
over two different spaces, with different distance metrics. Our idea is to circumvent the distance metric
definition problem by letting the ML model learn adaptively from the data how the two different feature
spaces should be combined. We thus define a novel ML model we name MS-RBFN (Multi Source Radial Basis
Function Network) as a non-linear function h : Θ × Γ → Ω that maps the problem space to the category
space as a result of the learning phase on the training set
TrS = {((θ_1, γ_1), y_1), ..., ((θ_N, γ_N), y_N)}.
The network is structured as follows:

1. a hybrid first level of:

(a) M_1 units φ_i : Θ → R that map the textblock feature space to the distance space w.r.t. the centroids µ^θ_i;

(b) M_2 units ψ_j : Γ → R that map the anchortext feature space to the distance space w.r.t. the centroids µ^γ_j;

with:

φ_i(θ) = exp(−‖θ − µ^θ_i‖ / σ^θ_i)

ψ_j(γ) = exp(−emd(γ, µ^γ_j) / σ^γ_j)

where µ^θ_i is the i-th centroid in the Θ space, µ^γ_j is the j-th centroid in the Γ space, and σ^θ_i and
σ^γ_j are the spreads of the basis functions for the Θ and Γ spaces, respectively. The distance ‖·‖ used
in φ_i is the Euclidean distance and emd(·,·) is the EMD.
2. a second level of linear weights

w_k = {w_{k,1}, ..., w_{k,|Ω|}}, k = 1, ..., M_1 + M_2

that connect each first-level unit with each output unit.
3. the two levels are then linearly combined to build the model function f:

o_c(γ, θ) = Σ_{i=1..M_1} φ_i(θ) · w_{i,c} + Σ_{j=1..M_2} ψ_j(γ) · w_{(j+M_1),c}    (1)

f(γ, θ) = argmax_{c=1,...,|Ω|} o_c(γ, θ)    (2)

A code sketch of this forward pass is given below.
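The sketch below follows equations (1) and (2); the emd argument is a placeholder for an Earth Mover's
Distance implementation, and the argument layout is an illustrative assumption.

import numpy as np

def ms_rbfn_output(theta, gamma, centroids_theta, sigmas_theta,
                   centroids_gamma, sigmas_gamma, W, emd):
    """Compute o_c for every class c (eq. 1) and return the winning class (eq. 2).

    theta:           textblock feature vector in R^n
    gamma:           list of anchortext feature vectors (variable length)
    centroids_theta: M1 centroids in the Theta space
    centroids_gamma: M2 centroids in the Gamma space (sets of vectors)
    W:               (M1 + M2) x |Omega| weight matrix
    emd:             callable computing the EMD between two gamma objects
    """
    phi = np.array([np.exp(-np.linalg.norm(theta - mu) / sigma)
                    for mu, sigma in zip(centroids_theta, sigmas_theta)])
    psi = np.array([np.exp(-emd(gamma, mu) / sigma)
                    for mu, sigma in zip(centroids_gamma, sigmas_gamma)])
    activations = np.concatenate([phi, psi])   # first-level outputs
    o = activations @ W                        # o_c for c = 1..|Omega| (eq. 1)
    return int(np.argmax(o)), o                # eq. 2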
The training scheme is two-phased, as in the original RBFN (Moody and Darken, 1989): one phase is
unsupervised and selects the centroids µ^θ_i, i = 1, ..., M_1, and µ^γ_j, j = 1, ..., M_2, while the
other solves a linear problem to find values for the weights w_k, k = 1, ..., M_1 + M_2. A code sketch of
both phases is given after the list.
1. the first phase finds suitable centroids µ^θ_i, i = 1, ..., M_1, by running the K-Means clustering
algorithm with K = M_1. Then the p-means heuristic (Moody and Darken, 1989) is applied to
compute the processing unit spreads σ^θ_i, i = 1, ..., M_1. Similarly, the centroids µ^γ_j,
j = 1, ..., M_2, are selected with an EMD-based K-Means algorithm with K = M_2, and the σ^γ_j,
j = 1, ..., M_2, values are set with the p-means heuristic.
2. the second phase is supervised and computes w_k, k = 1, ..., M_1 + M_2, by minimizing the difference
between the predicted output and the truth by Least Mean Squares:

(a) Φ is an N × (M_1 + M_2) matrix where Φ_{n,i} = φ_i(θ_n), i = 1, ..., M_1, and
Φ_{n,(M_1+j)} = ψ_j(γ_n), j = 1, ..., M_2, with n = 1, ..., N.

(b) W is a (M_1 + M_2) × |Ω| matrix where W_{i,j} = w_{i,j}.

(c) T is an N × |Ω| matrix where T_i = ŷ_i.

The minimization problem to solve is ΦW = T and thus W = Φ⁺T, where Φ⁺ is the pseudoinverse of Φ.
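Both training phases can be sketched compactly: K-Means (here scikit-learn's implementation) selects the
Θ-space centroids, the spread rule implements one common reading of the p-means heuristic (mean distance
from each centroid to its π nearest fellow centroids), and the second-level weights come from a single
pseudoinverse solve. The EMD-based K-Means for the Γ space is not shown, and the one-hot target encoding
is an assumption.

import numpy as np
from sklearn.cluster import KMeans

def fit_theta_units(thetas, M1, pi):
    """Phase 1 (Theta space): centroids via K-Means, spreads via a p-means-style rule."""
    km = KMeans(n_clusters=M1, n_init=10).fit(thetas)
    centroids = km.cluster_centers_
    dists = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    # assumed p-means rule: mean distance to the pi nearest other centroids
    spreads = np.sort(dists, axis=1)[:, :pi].mean(axis=1)
    return centroids, spreads

def fit_weights(Phi, labels, num_classes=2):
    """Phase 2: solve Phi W = T by Least Mean Squares using the pseudoinverse."""
    N = Phi.shape[0]
    T = np.zeros((N, num_classes))
    T[np.arange(N), labels] = 1.0     # assumed one-hot encoding of omega_1 / omega_2
    return np.linalg.pinv(Phi) @ T    # W = Phi^+ T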
The model therefore has three user parameters:

1. the value M_1, the number of first-level local processing units for the Θ space;

2. the value M_2, the number of first-level local processing units for the Γ space;

3. the value π of the p-means heuristic, used to determine the spreads of the first-level processing units.
3.3 Training and Generalization
Let us now consider the training phase of the algorithm and the step-by-step phases for the recognition
of the main entity.

Given the supervised dataset D, the textblock features F, the anchortext features G and the MS-RBFN
parameters M_1, M_2 and π, the first step is to train the algorithm with data extracted from D.

1. Split the set D into D_TrS and D_TeS with some rule, e.g. 2/3 of the elements in D_TrS and
D_TeS = D \ D_TrS.

2. Initialize the training set TrS = {}.

3. For each (p, e, A) ∈ D_TrS:

(a) Apply extract_blocks to obtain B.

(b) For each b ∈ B:

i. Compute θ with the features defined in F.
ii. Compute γ with the features defined in G.
iii. Set the label y to ω_1 (main entity) if b = e and to ω_2 (not main entity) otherwise.
iv. Add the ((θ, γ), y) element to TrS.

4. Train MS-RBFN with TrS.
The model can then be used to solve the MERP in the generalization phase, sketched in code below. Let
(p, A) be an input pair and H the output set of recognized main entities:

1. Initialize the output set H = {}.

2. Apply extract_blocks to obtain B.

3. For each b ∈ B:

(a) Compute θ with the features defined in F.
(b) Compute γ with the features defined in G.
(c) Recognize the label ŷ = h((θ, γ)).
(d) If ŷ = ω_1, add b to H.

The performance can then be evaluated by running the generalization phase over all elements in D_TeS.
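The generalization procedure maps directly to a short routine; extract_blocks, the feature lists and the
trained model are assumed to expose the interfaces used in the previous sketches.

def recognize_main_entities(page, anchor_texts, model, F, G, extract_blocks):
    """Return the set H of text blocks recognized as main entities of `page`.

    model(theta, gamma) is assumed to return the predicted class index, with 0
    standing for omega_1 (main entity).
    """
    H = []
    for block in extract_blocks(page):
        theta = [f(block, page) for f in F]
        gamma = [[g(a, block) for g in G] for a in anchor_texts]
        if model(theta, gamma) == 0:     # omega_1: main entity
            H.append(block)
    return H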
3.4 Procedural Details
In this section further details are given on how the proposed approach works. In particular, the text
block extraction process extract_blocks is described, and how the page is seen by the algorithm is
further explained.
3.4.1 Page Rendering
Our idea is to keep pace with current and future standards and technologies by approaching web content
mining with the help of a web rendering engine. The rendered page text is properly annotated with
structures that describe the formatting of the text (font weight, font size) and the presence of images
or other media objects. From this rendered view one can derive the position of each object (text, images)
in the web page, and by analyzing each resource additional metadata can be obtained: object size, width,
URL, file name, usage count in the page, etc.

The rendering engine program is based on XULRunner [2] and makes it possible to obtain visual layout
information for the elements b ∈ B. This procedure is configured to mimic a 1024x768 screen, one of the
most widely used setups on the web.
3.4.2 Text Block Extraction Process
The set of blocks b_i ∈ B should be as consistent as possible with the semantics of the page. A naïve
approach is to consider leaf DOM nodes only; however, due to the complex and heterogeneous real-world DOM
structures, it is necessary to also consider some non-leaf nodes, since the resulting b_i elements should
be aligned with the semantics as much as possible. In (Cai et al., 2003) the authors show how a good DOM
segmentation algorithm can contribute to obtaining a semantically meaningful aggregation of a page's
contents.

[2] https://developer.mozilla.org/en/XULRunner
Table 1: Sample documents from the e-commerce dataset.

Expected Main Entity (e)              | Anchor Set (A)
"IKEA STOCKHOLM"                      | "IKEA STOCKHOLM rahi"; "IKEA STOCKHOLM rahi 199"
"Sweet Time SY.6231M/26 Chrono Man"   | "Put in your basket"; "Sweet Time SY.6231M/26 Chrono Man"
"Marioneta enanito 35 cm"             | "MARIONETA ANÃO 35 CM"

Table 2: Experimental results on the e-commerce dataset. The MS-RBFN model was trained with M_1 = M_2 = 50 and π = 5.

Feature set (G) | Feature set (F)       | P    | R    | F_1
-               | intitle, fsize, fbold | 0.78 | 0.66 | 0.72
dice            | intitle               | 0.87 | 0.81 | 0.83
dice            | intitle, fsize, fbold | 0.88 | 0.84 | 0.86
Inspired by that work, we consider all leaf DOM nodes, plus nodes resulting from the merge of sub-nodes
with name equal to: "b", "i", "u", "span", "em". Considering for example the DOM subtree:

<div><b>This <em>is</em> the
<span class="entity">ENTITY</span>
<u>of</u> interest</b></div>

the resulting blocks are: "This", "is", "the", "ENTITY", "of", "interest", "This is the ENTITY of
interest". The latter block stems from the merge of the sub-nodes with the "em", "span" and "u" names.
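A rough, stand-alone approximation of this extraction rule on a parsed DOM is sketched below using lxml;
the actual system works on the XULRunner rendering output, so tag handling and block order here are only
an approximation of extract_blocks.

from lxml import html

MERGEABLE = {"b", "i", "u", "span", "em"}

def extract_blocks(html_source):
    """Collect leaf text fragments plus merged texts of nodes whose element
    children are all 'mergeable' inline tags (duplicates are then dropped)."""
    tree = html.fromstring(html_source)
    blocks = []
    for node in tree.iter():
        if not isinstance(node.tag, str):          # skip comments etc.
            continue
        # text fragments directly attached to this node
        for fragment in [node.text] + [child.tail for child in node]:
            if fragment and fragment.strip():
                blocks.append(fragment.strip())
        # merged block when every element child is a mergeable inline tag
        children = [c for c in node if isinstance(c.tag, str)]
        if children and all(c.tag in MERGEABLE for c in children):
            blocks.append(" ".join(node.text_content().split()))
    return list(dict.fromkeys(blocks))

print(extract_blocks('<div><b>This <em>is</em> the '
                     '<span class="entity">ENTITY</span> '
                     '<u>of</u> interest</b></div>'))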
4 EXPERIMENTS
In this section we report the experimental evaluation of the effectiveness of the proposed approach as an
automated main entity recognition tool. In particular, the experiments address two main goals:

1. to quantify whether or not the combined use of anchortext and textblock features improves the
recognition process;

2. to isolate the contribution of the non-conventional visual features.
Since, to the best of our knowledge, no suitable datasets are available for this problem, we have built
and published an initial dataset for the e-commerce domain. The dataset was collected from 51 European
e-commerce websites and is composed of 822 web pages, with a total textblock count of 172,937 and a total
anchor count of 1,676. It can be obtained from our website [3]. Each web page has a mean of 2 incoming
links and one supervised main entity. The set A was built considering a complete crawl of the website,
and thus downward, upward and crosswise hyperlinks were considered (Spertus, 1997). Both text links and
images were considered to build the A set, using the anchor text and the alternative text of the image,
respectively. In Table 1 we report some data from the (p, e, A) tuples.

[3] http://www.dicom.uninsubria.it/moreno.carullo/thesis/la/
In our experiments we used a letter-digit tokenizer for the features, defined as a simple automaton
splitting contiguous sequences of alphabetic or numerical characters. For example, the string "this is
the 1st" is split as "this", "is", "the", "1", "st". For further details on how the tokenizer is used in
the feature extraction process, see Section 3.1.
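The letter-digit tokenizer can be expressed with a single regular expression, as in the illustrative
tokenize() helper used in the earlier feature sketches:

import re

def letter_digit_tokenize(text):
    """Split contiguous runs of alphabetic or numeric characters."""
    return re.findall(r"[a-zA-Z]+|[0-9]+", text.lower())

print(letter_digit_tokenize("this is the 1st"))  # ['this', 'is', 'the', '1', 'st']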
4.1 Evaluation Metrics
Standard ML evaluation metrics such as the error matrix and derived measures (Congalton, 1991) can be
adopted to evaluate classification results. However, such metrics are not able to directly evaluate how
well the system recognizes the main entity in a given web page. This can be assessed by considering the
well-known Information Retrieval metrics Precision (P), Recall (R) and F-Measure (F_β) (Frakes and
Baeza-Yates, 1992), targeting the recognition of correct main entities. The metrics were evaluated with a
macro-average approach, and the F-Measure F_β was used with equal weight for P and R (β = 1).
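A small sketch of the macro-averaged Precision, Recall and F1 computation follows; evaluating each page's
set of recognized blocks against its single ground-truth main entity is our reading of the setup and is
stated here as an assumption.

def macro_prf(per_page_results):
    """Macro-averaged Precision, Recall and F1 over test pages.

    per_page_results: list of (recognized_set, truth_set) pairs, one per page.
    """
    precisions, recalls = [], []
    for recognized, truth in per_page_results:
        correct = len(set(recognized) & set(truth))
        precisions.append(correct / len(recognized) if recognized else 0.0)
        recalls.append(correct / len(truth) if truth else 0.0)
    P = sum(precisions) / len(precisions)
    R = sum(recalls) / len(recalls)
    F1 = 2 * P * R / (P + R) if (P + R) > 0 else 0.0
    return P, R, F1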
4.2 Results
The results obtained by the overall set of experiments are reported in Table 2. In the best configuration, using
the complete set of features, we obtained a satisfactory result with a value of F_1 = 0.86. The precision
value of P = 0.88 indicates that the proposed solution can be adopted in real-world scenarios. The
addition of anchortext features improves both P and R, raising the overall F_1 from 0.72 to 0.86. This
latter observation deserves particular attention since it justifies the definition of the proposed novel
ML model.
Moreover, experimental results show that the use of the f_fsize and f_fbold visual features enables a
further improvement of the recognition performance, in particular for the Recall value, which goes from
0.81 to 0.84.
5 CONCLUSIONS AND FUTURE WORK
In this paper we have shown that ML techniques can be used in conjunction with LA to achieve automatic
recognition of the main entity from pages with a given topic and a well-known, web-usability-driven
structure. Experiments on the proposed dataset show encouraging results and highlight the advantage of
combining the two sources of information, text blocks with their visual formatting styles and incoming
anchor texts. Future work includes experimentation on other website domains and the extension of the
current set of general-purpose features with additional domain-specific features.
REFERENCES
Cai, D., Yu, S., Wen, J.-R., and Ma, W.-Y. (2003). Ex-
tracting content structure for web pages based on vi-
sual representation. In Web Technologies and Applica-
tions: 5th Asia-Pacific Web Conference, APWeb 2003,
Xian, China, April 23-25, 2003. Proceedings, page
596.
Carullo, M., Binaghi, E., and Gallo, I. (2009). Soft cate-
gorization and annotation of images with radial basis
function networks. In VISSAPP, International Con-
ference on Computer Vision Theory and Applications,
volume 2, pages 309–314.
Chakrabarti, S., Dom, B., and Indyk, P. (1998). En-
hanced hypertext categorization using hyperlinks. In
SIGMOD ’98: Proceedings of the 1998 ACM SIG-
MOD international conference on Management of
data, pages 307–318, New York, NY, USA. ACM.
Congalton, R. (1991). A review of assessing the accuracy of
classifications of remotely sensed data. Remote sens-
ing of environment, 37(1):35–46.
Frakes, W. B. and Baeza-Yates, R. A., editors (1992). In-
formation Retrieval: Data Structures & Algorithms.
Prentice-Hall.
Fürnkranz, J. (2002). Web structure mining - exploiting the
graph structure of the world-wide web. ÖGAI Journal,
21(2):17–26.
Joachims, T., Cristianini, N., and Shawe-Taylor, J. (2001).
Composite kernels for hypertext categorisation. In
Proceedings of the International Conference on Machine
Learning (ICML), pages 250–257. Morgan Kaufmann Publishers.
Kosala, R. and Blockeel, H. (2000). Web mining research:
a survey. SIGKDD Explor. Newsl., 2(1):1–15.
Michalski, R. S., Carbonell, J. G., and Mitchell, T. M.
(1983). Machine Learning, An Artificial Intelligence
Approach. McGraw-Hill.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill,
New York.
Moody, J. E. and Darken, C. (1989). Fast learning in net-
works of locally-tuned processing units. Neural Com-
putation, 1:281–294.
Oh, H.-J., Myaeng, S. H., and Lee, M.-H. (2000). A prac-
tical hypertext categorization method using links and
incrementally available class information. In SIGIR
’00: Proceedings of the 23rd annual international
ACM SIGIR conference on Research and development
in information retrieval, pages 264–271, New York,
NY, USA. ACM.
Rubner, Y., Tomasi, C., and Guibas, L. J. (2000). The earth
mover’s distance as a metric for image retrieval. Int.
J. Comput. Vision, 40(2):99–121.
Spertus, E. (1997). Parasite: mining structural informa-
tion on the web. Comput. Netw. ISDN Syst., 29(8-
13):1205–1215.
Zhang, M.-L. and Zhou, Z.-H. (2006). Adapting rbf neural
networks to multi-instance learning. Neural Process.
Lett., 23(1):1–26.