MACHINE LEARNING AND LINK ANALYSIS
FOR WEB CONTENT MINING
Moreno Carullo and Elisabetta Binaghi
Department of Computer Science and Communication, University of Insubria, 21100 Varese, Italy
Keywords:
Web content mining, Hyperlinks, Machine learning, Radial basis function networks.
Abstract:
In this work we define a hybrid Web Content Mining strategy aimed at recognizing within Web pages the main
entity, intended as the short text that refers directly to the main topic of a given page. The salient aspect of the
strategy is the use of a novel supervised Machine Learning model able to represent in a unified framework
the integrated use of visual page layout features, textual features and hyperlink descriptions. The proposed
approach has been evaluated with promising results.
1 INTRODUCTION
Web content is huge and dynamic, mainly constituted by unstructured or partially structured text data,
or by low-volume structured data following different non-standard schemes. Web documents are characterized
both by their heterogeneous contents, represented with different media and formats, and by the organization
of these contents within graphical page layouts that are increasingly consistent with usability guidelines.
Server log information, link structure and link descriptions constitute additional relevant information.
All these aspects make the extraction of useful knowledge a challenging research problem.
Since the year 2000 the application of learning strategies to Web Mining (WM) tasks has been intensively
studied, showing advantages in terms of both effectiveness and portability over conventional and earlier
strategies based on the knowledge engineering approach (Kosala and Blockeel, 2000). In supervised learning,
training examples consist of input/output pairs and the goal of the learning algorithm is to predict the
output values of previously unseen inputs. In unsupervised learning, training examples are constituted only
by input patterns; the learning algorithm generalizes from the input patterns to discover similarities among
data (Michalski et al., 1983; Mitchell, 1997).
According to (Kosala and Blockeel, 2000) three main areas of research can be distinguished: Web Content
Mining (WCM), the application of data mining techniques to Web documents; Web Usage Mining, the analysis
of interactions between the user and the Web; and Web Structure Mining and/or Link Analysis (LA), in which
the structure of hyperlinks is used to solve a problem in a graph-structured domain.
These WM tasks can be combined in a single application, reinforcing the mining process through the joint
analysis of different Web characteristics. Recent works propose in particular the integration of Web
Structure and Text Mining tasks. The hyperlink information, and in particular the anchor text (the text
appearing in the predecessor page and pointing to the target), has been used in Information Retrieval and
classification tasks (Spertus, 1997; Fürnkranz, 2002). Page classification in particular can benefit from
Link Analysis since it makes it possible to determine the topic of each page in a more robust way,
considering in the classification process the predicted category of neighbors (Chakrabarti et al., 1998;
Oh et al., 2000; Joachims et al., 2001).
Proceeding from these considerations, we intend to define and experimentally investigate a hybrid WCM
strategy aimed at recognizing within Web pages the main entity, intended as the short text that refers
directly to the main topic of a given page. We consider this application of great importance for the
building and maintenance of intelligent web agents oriented to advanced web-based user services. The
salient aspect of the strategy is the use of a novel supervised ML model able to represent in a unified
framework the integrated use of visual layout features, textual features and hyperlink descriptions.
2 WEB MAIN ENTITY RECOGNITION
Each page on the web has a definite topic that can be explicitly declared in the page itself with a short
text we name the main entity. Consider for instance online blogs: the main entity of each post's page is
the title of the post. An online newspaper's page will have the article's title as the main entity, and an
e-commerce website the product's name.

Each page p can be modeled as a collection of text entities B, where the main entity e ∈ B is the one best
describing the topic or title of the page. In addition we know the set of incoming links to p and, in
particular, for each link we know the anchor text, that is the text in the predecessor page pointing to p.
In the main entity recognition problem (MERP) the goal is to consider a page p and the set of Incoming
Link Anchor Texts (ILA) A = {a_1, ..., a_n} and to discover the text block ê recognized as the most
plausible main entity in the set H ⊆ B of candidate main entity blocks.
Suppose we have the possibility to collect a supervised truth for each web page, that is a set D of
tuples (p, e, A) where p is a web page in a given domain, e is the main entity and A is the set of ILA for
p. The supervised truth can be collected by providing domain experts with an environment where they can
select the main entity e ∈ B for each p.
MERP can be seen as a supervised classification task where the object to classify is the pair (b, A),
with b a candidate text block and A the set of anchor texts of its page, and the candidate labels are
"main entity" and "non main entity". The main issue lies in the fact that A is a set of elements of
variable cardinality while b is a single element. In order to allow a ML model to discover the
relationship between (b, A) and the label, a proper low-level feature representation has to be defined.
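As a purely illustrative aid, the data involved in MERP can be modeled with a few plain containers; the
class and field names below are assumptions made for this sketch and are not part of the proposed system.

from dataclasses import dataclass
from typing import List

@dataclass
class TextBlock:
    """A text block b extracted from a page (illustrative fields only)."""
    text: str
    font_size: float = 12.0
    bold: bool = False

@dataclass
class Page:
    """A web page p with its title and its set B of extracted text blocks."""
    url: str
    title: str
    blocks: List[TextBlock]

@dataclass
class SupervisedExample:
    """One (p, e, A) tuple of the supervised truth D."""
    page: Page
    main_entity: TextBlock      # e, selected by a domain expert
    anchor_texts: List[str]     # A, the incoming link anchor texts (ILA)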
3 THE PROPOSED APPROACH
The approach we propose has three main parts. The first is the definition of how the web page is
represented, that is, the structure and characteristics of a web page. The second is a set of
representative features that work on the defined web page structure. The third is the definition of a
suitable ML model for MERP.

We model each web page p as a semi-structured source of information where B = {b_1, ..., b_m} ⊆ 𝔹 is the
set of text blocks in p ∈ P, 𝔹 is the domain of all text blocks and P is the domain of pages. With the
term "text block" we refer to a string of text that appears in p as the inner text of a DOM [1] node. The
set B is obtained through the extract_blocks function, which maps a page p ∈ P to its set of text blocks
and is further explained in Sec. 3.4.
3.1 Feature Extraction
In the main entity recognition problem the object to classify is a text block b ∈ B, given its page p and
the set A of anchor texts for p. We therefore partition the features into two subsets: textblock features
and anchortext features. The elements of the former are functions f of the form f : 𝔹 × P → R, while in
the latter we have functions g of the form g : A × 𝔹 → R.

Textblock features describe low-level characteristics of the elements b ∈ B found in the page and their
relation with the page itself, such as the title, page dimensions, etc. (a code sketch of these features
is given after the list):
f_intitle: expresses the percentage of matching between the terms of the textblock b and the terms in the
title t of the page, following the web usability guideline stating that the main object of a web page
should be named in the title. For example, if the title of the web page p is t = "The quick brown fox
jumps over the lazy dog" and b = "The quick brown fox", then f_intitle(b, p) = 16/35, where 16 is the
total length of the common terms and 35 is the total length of the terms of t.

f_fsize: expresses the font size of the textblock b normalized w.r.t. a maximum value defined a priori.
The idea behind this feature is that the main entity is often presented with a big font size so that it
is easily identified by the user.

f_fbold: expresses whether or not the considered textblock b is displayed with a bold font. The idea
behind this feature is closely related to the f_fsize feature and is driven by the fact that the main
entity is often presented in bold.
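A minimal sketch of how the three textblock features could be computed follows; the tokenizer, the
reading of the 16/35 example as a ratio of summed term lengths, and the MAX_FONT_SIZE constant are
assumptions made for illustration.

import re

MAX_FONT_SIZE = 48.0  # assumed a-priori maximum font size used for normalization

def tokenize(text):
    """Letter-digit tokenizer (see Sec. 4): contiguous runs of letters or digits."""
    return re.findall(r"[a-zA-Z]+|[0-9]+", text.lower())

def f_intitle(block_text, page_title):
    """Share of the title covered by the block, measured in summed term lengths.

    With t = "The quick brown fox jumps over the lazy dog" and b = "The quick
    brown fox" this yields 16/35, as in the example above.
    """
    title_terms = tokenize(page_title)
    block_terms = tokenize(block_text)
    common_len = sum(len(term) for term in block_terms if term in title_terms)
    title_len = sum(len(term) for term in title_terms)
    return common_len / title_len if title_len > 0 else 0.0

def f_fsize(font_size):
    """Font size normalized w.r.t. an a-priori maximum, clipped to [0, 1]."""
    return min(font_size / MAX_FONT_SIZE, 1.0)

def f_fbold(is_bold):
    """1.0 if the block is rendered in bold, 0.0 otherwise."""
    return 1.0 if is_bold else 0.0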
Anchortext features take into account anchors and textblocks to allow the definition of low-level
relationships between the two. Here we define a simple, low-level feature of this kind:

g_dice: expresses the similarity between the textblock b and the anchortext a. The employed similarity
measure is the Dice coefficient (Frakes and Baeza-Yates, 1992), computed as:

g_dice(a, b) = 2C / (|a| + |b|)

where C is the number of common terms between a and b, and |a| and |b| are the number of terms of a and
b, respectively (a small sketch is given below).
[1] http://www.w3.org/TR/DOM-Level-2-HTML/
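The Dice coefficient is straightforward to compute once both texts have been tokenized; a minimal sketch
operating on pre-tokenized term lists follows (counting C as the number of distinct shared terms is one
plausible reading of the definition above).

def g_dice(anchor_terms, block_terms):
    """Dice coefficient between the term lists of an anchor text a and a text block b.

    anchor_terms, block_terms: term lists produced by the domain tokenizer.
    """
    if not anchor_terms and not block_terms:
        return 0.0
    common = len(set(anchor_terms) & set(block_terms))  # C, the shared terms
    return 2.0 * common / (len(anchor_terms) + len(block_terms))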
In the g_dice and f_intitle features a tokenizer is required to obtain a set of terms from a text block,
an anchor text or the title of the page. The tokenizer has to be selected depending on the domain of the
problem.
Given a set F of textblock features, a set G of anchortext features and a (p, b, A) tuple, we define the
textblock feature space Θ ⊆ R^n, n = |F|, and the anchortext feature space Γ. An element γ obtained from
the (b, A) pair is of the form γ = {x_1, ..., x_m}, where each instance x_i ∈ R^{|G|} is computed by
applying the features in G to an (a, b) pair, a ∈ A. The cardinality m of a γ element is not fixed a
priori, since the number |A| of anchortexts pointing to a given page may vary.
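A minimal sketch of how θ and γ could be assembled from the feature functions is given below; it assumes
each textblock feature is a callable f(b, p) and each anchortext feature a callable g(a, b), matching the
signatures of Sec. 3.1 (the concrete argument types are an illustrative choice).

import numpy as np

def compute_theta(block, page, F):
    """theta in R^|F|: one value per textblock feature applied to (b, p)."""
    return np.array([f(block, page) for f in F])

def compute_gamma(block, anchor_texts, G):
    """gamma = {x_1, ..., x_m}: one |G|-dimensional vector per anchor text a in A."""
    return [np.array([g(a, block) for g in G]) for a in anchor_texts]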
3.2 A Multi-source Machine Learning Algorithm
In this section we describe the ML model used to solve the MERP. In particular, we adopted the Radial
Basis Function Network (RBFN) model introduced in (Moody and Darken, 1989) for its proven training speed
and robustness on classification and regression tasks. These capabilities are especially suitable for the
inherent complexity of the WCM context.
The MERP defines the textblock b as the object to be classified, with the additional context of its page
p and the anchortext set A. As introduced in the previous section, we define two kinds of features that
give rise to two separate feature spaces Θ and Γ. Since Θ ⊆ R^n with n = |F|, the Euclidean L_2 distance
can be used in this space. The elements of the Γ space, however, are of the form γ = {x_1, ..., x_|γ|}
where |γ| is not fixed. A suitable distance for such variable-length objects is the Earth Mover's
Distance (EMD) (Rubner et al., 2000), which has been developed for variable-length distributions. The γ
object can be seen as a distribution modeling the relation of a textblock w.r.t. its anchors.
We describe the learning object (p, b, A) with ((θ, γ), y), where θ is obtained from (b, p) with the
features f ∈ F, γ is obtained from (A, b) with the features g ∈ G, and y ∈ Ω, Ω = {ω_1, ω_2}, is the
supervised label that defines whether (y = ω_1) or not (y = ω_2) a textblock b is a main entity.
RBFNs are a general-purpose ML solution and their application to a specific problem domain implies the
definition of a proper distance metric to learn from the feature space. In (Zhang and Zhou, 2006), for
instance, a RBFN has been adapted to Multi-Instance Learning problems by using the Hausdorff distance,
and in (Carullo et al., 2009) a content-based image soft-categorization algorithm has been defined by
integrating the EMD (Rubner et al., 2000) in a RBFN.
It is non-trivial to define a true distance metric over Θ and Γ since they are heterogeneous and defined
over two different spaces, with different distance metrics. Our idea is to circumvent the distance metric
definition problem by letting the ML model learn adaptively from the data how the two different feature
spaces should be combined. We thus define a novel ML model we name MS-RBFN (Multi Source Radial Basis
Function Network) as a non-linear function h : Θ × Γ → Ω that maps the problem space to the category
space as a result of the learning phase on the training set
TrS = {((θ_1, γ_1), y_1), ..., ((θ_N, γ_N), y_N)}.
The network is structured as follows:

1. a hybrid first level of:

(a) M_1 units φ_i : Θ → R that map the textblock feature space to the distance space w.r.t. the centroids µ^θ_i;

(b) M_2 units ψ_j : Γ → R that map the anchortext feature space to the distance space w.r.t. the centroids µ^γ_j;

with:

φ_i(θ) = exp(−‖θ − µ^θ_i‖ / σ^θ_i)

ψ_j(γ) = exp(−emd(γ, µ^γ_j) / σ^γ_j)

where µ^θ_i is the i-th centroid in the Θ space, µ^γ_j is the j-th centroid in the Γ space, and σ^θ_i and
σ^γ_j are the spreads of the basis functions for the Θ and Γ spaces, respectively. The distance ‖·‖ used
in φ_i is the Euclidean distance and emd(·,·) is the EMD.
2. a second level of linear weights

w_k = {w_{k,1}, ..., w_{k,|Ω|}}, k = 1, ..., M_1 + M_2

that connect each first-level unit with each output unit.
3. the two levels are then linearly combined to build the model function f:

o_c(γ, θ) = Σ_{i=1..M_1} φ_i(θ) · w_{i,c} + Σ_{j=1..M_2} ψ_j(γ) · w_{(j+M_1),c}    (1)

f(γ, θ) = argmax_{c=1,...,|Ω|} o_c(γ, θ)    (2)

A code sketch of this forward pass is given below.
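The sketch below follows equations (1) and (2); the emd argument is a placeholder for an Earth Mover's
Distance implementation, and the argument layout is an illustrative assumption.

import numpy as np

def ms_rbfn_output(theta, gamma, centroids_theta, sigmas_theta,
                   centroids_gamma, sigmas_gamma, W, emd):
    """Compute o_c for every class c (eq. 1) and return the winning class (eq. 2).

    theta:           textblock feature vector in R^n
    gamma:           list of anchortext feature vectors (variable length)
    centroids_theta: M1 centroids in the Theta space
    centroids_gamma: M2 centroids in the Gamma space (sets of vectors)
    W:               (M1 + M2) x |Omega| weight matrix
    emd:             callable computing the EMD between two gamma objects
    """
    phi = np.array([np.exp(-np.linalg.norm(theta - mu) / sigma)
                    for mu, sigma in zip(centroids_theta, sigmas_theta)])
    psi = np.array([np.exp(-emd(gamma, mu) / sigma)
                    for mu, sigma in zip(centroids_gamma, sigmas_gamma)])
    activations = np.concatenate([phi, psi])   # first-level outputs
    o = activations @ W                        # o_c for c = 1..|Omega| (eq. 1)
    return int(np.argmax(o)), o                # eq. 2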
The training scheme is two-phased, as in the original RBFN (Moody and Darken, 1989): one phase is
unsupervised and selects the centroids µ^θ_i, i = 1, ..., M_1, and µ^γ_j, j = 1, ..., M_2, while the
other solves a linear problem to find values for the weights w_k, k = 1, ..., M_1 + M_2. A code sketch of
both phases is given after the list.
1. the first phase finds suitable centroids µ^θ_i, i = 1, ..., M_1, by running the K-Means clustering
algorithm with K = M_1. Then the p-means heuristic (Moody and Darken, 1989) is applied to
compute the processing unit spreads σ^θ_i, i = 1, ..., M_1. Similarly, the centroids µ^γ_j,
j = 1, ..., M_2, are selected with an EMD-based K-Means algorithm with K = M_2, and the σ^γ_j,
j = 1, ..., M_2, values are set with the p-means heuristic.
2. the second phase is supervised and computes w_k, k = 1, ..., M_1 + M_2, by minimizing the difference
between the predicted output and the truth by Least Mean Squares:

(a) Φ is an N × (M_1 + M_2) matrix where Φ_{n,i} = φ_i(θ_n), i = 1, ..., M_1, and
Φ_{n,(M_1+j)} = ψ_j(γ_n), j = 1, ..., M_2, with n = 1, ..., N.

(b) W is a (M_1 + M_2) × |Ω| matrix where W_{i,j} = w_{i,j}.

(c) T is an N × |Ω| matrix where T_i = ŷ_i.

The minimization problem to solve is ΦW = T and thus W = Φ⁺T, where Φ⁺ is the pseudoinverse of Φ.
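Both training phases can be sketched compactly: K-Means (here scikit-learn's implementation) selects the
Θ-space centroids, the spread rule implements one common reading of the p-means heuristic (mean distance
from each centroid to its π nearest fellow centroids), and the second-level weights come from a single
pseudoinverse solve. The EMD-based K-Means for the Γ space is not shown, and the one-hot target encoding
is an assumption.

import numpy as np
from sklearn.cluster import KMeans

def fit_theta_units(thetas, M1, pi):
    """Phase 1 (Theta space): centroids via K-Means, spreads via a p-means-style rule."""
    km = KMeans(n_clusters=M1, n_init=10).fit(thetas)
    centroids = km.cluster_centers_
    dists = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    # assumed p-means rule: mean distance to the pi nearest other centroids
    spreads = np.sort(dists, axis=1)[:, :pi].mean(axis=1)
    return centroids, spreads

def fit_weights(Phi, labels, num_classes=2):
    """Phase 2: solve Phi W = T by Least Mean Squares using the pseudoinverse."""
    N = Phi.shape[0]
    T = np.zeros((N, num_classes))
    T[np.arange(N), labels] = 1.0     # assumed one-hot encoding of omega_1 / omega_2
    return np.linalg.pinv(Phi) @ T    # W = Phi^+ T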
The model therefore has three user parameters:

1. the value M_1, the number of first-level local processing units for the Θ space;

2. the value M_2, the number of first-level local processing units for the Γ space;

3. the value π of the p-means heuristic, used to determine the spreads of the first-level processing units.
3.3 Training and Generalization
Let us now consider the training phase of the algorithm and the step-by-step phases for the recognition
of the main entity.

Given the supervised dataset D, the textblock features F, the anchortext features G and the MS-RBFN
parameters M_1, M_2 and π, the first step is to train the algorithm with data extracted from D.

1. Split the set D into D_TrS and D_TeS with some rule, e.g. 2/3 of the elements in D_TrS and
D_TeS = D \ D_TrS.

2. Initialize the training set TrS = {}.

3. For each (p, e, A) ∈ D_TrS:

(a) Apply extract_blocks to obtain B.

(b) For each b ∈ B:

i. Compute θ with the features defined in F.
ii. Compute γ with the features defined in G.
iii. Set the label y to ω_1 (main entity) if b = e and to ω_2 (not main entity) otherwise.
iv. Add the ((θ, γ), y) element to TrS.

4. Train MS-RBFN with TrS.
The model can then be used to solve the MERP in the generalization phase, sketched in code below. Let
(p, A) be an input pair and H the output set of recognized main entities:

1. Initialize the output set H = {}.

2. Apply extract_blocks to obtain B.

3. For each b ∈ B:

(a) Compute θ with the features defined in F.
(b) Compute γ with the features defined in G.
(c) Recognize the label ŷ = h((θ, γ)).
(d) If ŷ = ω_1, add b to H.

The performance can then be evaluated by running the generalization phase over all elements in D_TeS.
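The generalization procedure maps directly to a short routine; extract_blocks, the feature lists and the
trained model are assumed to expose the interfaces used in the previous sketches.

def recognize_main_entities(page, anchor_texts, model, F, G, extract_blocks):
    """Return the set H of text blocks recognized as main entities of `page`.

    model(theta, gamma) is assumed to return the predicted class index, with 0
    standing for omega_1 (main entity).
    """
    H = []
    for block in extract_blocks(page):
        theta = [f(block, page) for f in F]
        gamma = [[g(a, block) for g in G] for a in anchor_texts]
        if model(theta, gamma) == 0:     # omega_1: main entity
            H.append(block)
    return H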
3.4 Procedural Details
In this section further details are given on how the proposed approach works. In particular, the text
block extraction process extract_blocks is described, and how the page is seen by the algorithm is
further explained.
3.4.1 Page Rendering
Our idea is to keep pace with current and future standards and technologies by approaching web content
mining with the help of a web rendering engine. The rendered page text is properly annotated with
structures that describe the formatting of the text (font weight, font size) and the presence of images
or other media objects. From this rendered view one can derive the position of each object (text, images)
in the web page, and by analyzing each resource additional metadata can be obtained: object size, width,
URL, file name, usage count in the page, etc.

The rendering engine program is based on XULRunner [2] and makes it possible to obtain visual layout
information for the elements b ∈ B. This procedure is configured to mimic a 1024x768 screen, one of the
most widely used setups on the web.
3.4.2 Text Block Extraction Process
The set of blocks b_i ∈ B should be as consistent as possible with the semantics of the page. A naïve
approach is to consider leaf DOM nodes only; however, due to the complex and heterogeneous real-world DOM
structures, it is necessary to also consider some non-leaf nodes, since the resulting b_i elements should
be aligned with the semantics as much as possible. In (Cai et al., 2003) the authors show how a good DOM
segmentation algorithm can contribute to obtaining a semantically meaningful aggregation of a page's
contents.

[2] https://developer.mozilla.org/en/XULRunner
Table 1: Sample documents from the e-commerce dataset.

Expected Main Entity (e)              | Anchor Set (A)
"IKEA STOCKHOLM"                      | "IKEA STOCKHOLM rahi"; "IKEA STOCKHOLM rahi 199"
"Sweet Time SY.6231M/26 Chrono Man"   | "Put in your basket"; "Sweet Time SY.6231M/26 Chrono Man"
"Marioneta enanito 35 cm"             | "MARIONETA ANÃO 35 CM"

Table 2: Experimental results on the e-commerce dataset. The MS-RBFN model was trained with M_1 = M_2 = 50 and π = 5.

Feature set (G) | Feature set (F)       | P    | R    | F_1
-               | intitle, fsize, fbold | 0.78 | 0.66 | 0.72
dice            | intitle               | 0.87 | 0.81 | 0.83
dice            | intitle, fsize, fbold | 0.88 | 0.84 | 0.86
Inspired by that work, we consider all leaf DOM nodes, plus nodes resulting from the merge of sub-nodes
with name equal to: "b", "i", "u", "span", "em". Considering for example the DOM subtree:

<div><b>This <em>is</em> the
<span class="entity">ENTITY</span>
<u>of</u> interest</b></div>

the resulting blocks are: "This", "is", "the", "ENTITY", "of", "interest", "This is the ENTITY of
interest". The latter block stems from the merge of the sub-nodes with the "em", "span" and "u" names.
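A rough, stand-alone approximation of this extraction rule on a parsed DOM is sketched below using lxml;
the actual system works on the XULRunner rendering output, so tag handling and block order here are only
an approximation of extract_blocks.

from lxml import html

MERGEABLE = {"b", "i", "u", "span", "em"}

def extract_blocks(html_source):
    """Collect leaf text fragments plus merged texts of nodes whose element
    children are all 'mergeable' inline tags (duplicates are then dropped)."""
    tree = html.fromstring(html_source)
    blocks = []
    for node in tree.iter():
        if not isinstance(node.tag, str):          # skip comments etc.
            continue
        # text fragments directly attached to this node
        for fragment in [node.text] + [child.tail for child in node]:
            if fragment and fragment.strip():
                blocks.append(fragment.strip())
        # merged block when every element child is a mergeable inline tag
        children = [c for c in node if isinstance(c.tag, str)]
        if children and all(c.tag in MERGEABLE for c in children):
            blocks.append(" ".join(node.text_content().split()))
    return list(dict.fromkeys(blocks))

print(extract_blocks('<div><b>This <em>is</em> the '
                     '<span class="entity">ENTITY</span> '
                     '<u>of</u> interest</b></div>'))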
4 EXPERIMENTS
In this section we report the experimental evaluation of the effectiveness of the proposed approach as an
automated main entity recognition tool. In particular, the experiments address two main goals:

1. to quantify whether or not the combined use of anchortext and textblock features improves the
recognition process;

2. to isolate the contribution of the non-conventional visual features.
Since, to the best of our knowledge, no suitable datasets are available for this problem, we have built
and published an initial dataset for the e-commerce domain. The dataset was collected from 51 European
e-commerce websites and is composed of 822 web pages, with a total textblock count of 172,937 and a total
anchor count of 1,676. It can be obtained from our website [3]. Each web page has a mean of 2 incoming
links and one supervised main entity. The set A was built considering a complete crawl of the website,
and thus downward, upward and crosswise hyperlinks were considered (Spertus, 1997). Both text links and
images were considered to build the A set, using the anchor text and the alternative text of the image,
respectively. In Table 1 we report some data from the (p, e, A) tuples.

[3] http://www.dicom.uninsubria.it/moreno.carullo/thesis/la/
In our experiments we used a letter-digit tokenizer for the features, defined as a simple automaton
splitting contiguous sequences of alphabetic or numerical characters. For example, the string "this is
the 1st" is split as "this", "is", "the", "1", "st". For further details on how the tokenizer is used in
the feature extraction process, see Section 3.1.
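The letter-digit tokenizer can be expressed with a single regular expression, as in the illustrative
tokenize() helper used in the earlier feature sketches:

import re

def letter_digit_tokenize(text):
    """Split contiguous runs of alphabetic or numeric characters."""
    return re.findall(r"[a-zA-Z]+|[0-9]+", text.lower())

print(letter_digit_tokenize("this is the 1st"))  # ['this', 'is', 'the', '1', 'st']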
4.1 Evaluation Metrics
Standard ML evaluation metrics such as the error matrix and derived measures (Congalton, 1991) can be
adopted to evaluate classification results. However, such metrics are not able to directly evaluate how
well the system recognizes the main entity in a given web page. This can be assessed by considering the
well-known Information Retrieval metrics Precision (P), Recall (R) and F-Measure (F_β) (Frakes and
Baeza-Yates, 1992), targeting the recognition of correct main entities. The metrics were evaluated with a
macro-average approach, and the F-Measure F_β was used with equal weight for P and R (β = 1).
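A small sketch of the macro-averaged Precision, Recall and F1 computation follows; evaluating each page's
set of recognized blocks against its single ground-truth main entity is our reading of the setup and is
stated here as an assumption.

def macro_prf(per_page_results):
    """Macro-averaged Precision, Recall and F1 over test pages.

    per_page_results: list of (recognized_set, truth_set) pairs, one per page.
    """
    precisions, recalls = [], []
    for recognized, truth in per_page_results:
        correct = len(set(recognized) & set(truth))
        precisions.append(correct / len(recognized) if recognized else 0.0)
        recalls.append(correct / len(truth) if truth else 0.0)
    P = sum(precisions) / len(precisions)
    R = sum(recalls) / len(recalls)
    F1 = 2 * P * R / (P + R) if (P + R) > 0 else 0.0
    return P, R, F1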
4.2 Results
The results obtained by the overall set of experiments are reported in Table 2. In the best configuration, using
the complete set of features, we obtained a satisfactory result with a value of F_1 = 0.86. The precision
value of P = 0.88 indicates that the proposed solution can be adopted in real-world scenarios. The
addition of anchortext features improves both P and R, raising the overall F_1 from 0.72 to 0.86. This
latter observation deserves particular attention since it justifies the definition of the proposed novel
ML model.
Moreover, experimental results show that the use of the f_fsize and f_fbold visual features enables a
further improvement of the recognition performance, in particular for the Recall value, which goes from
0.81 to 0.84.
5 CONCLUSIONS AND FUTURE WORK
In this paper we have shown that ML techniques can be used in conjunction with LA to achieve automatic
recognition of the main entity from pages with a given topic and a well-known, web-usability-driven
structure. Experiments on the proposed dataset show encouraging results and highlight the advantage of
combining the two sources of information, text blocks with their visual formatting styles and incoming
anchor texts. Future work includes experimentation on other website domains and the extension of the
current set of general-purpose features with additional domain-specific features.
REFERENCES
Cai, D., Yu, S., Wen, J.-R., and Ma, W.-Y. (2003). Ex-
tracting content structure for web pages based on vi-
sual representation. In Web Technologies and Applica-
tions: 5th Asia-Pacific Web Conference, APWeb 2003,
Xian, China, April 23-25, 2003. Proceedings, page
596.
Carullo, M., Binaghi, E., and Gallo, I. (2009). Soft cate-
gorization and annotation of images with radial basis
function networks. In VISSAPP, International Con-
ference on Computer Vision Theory and Applications,
volume 2, pages 309–314.
Chakrabarti, S., Dom, B., and Indyk, P. (1998). En-
hanced hypertext categorization using hyperlinks. In
SIGMOD ’98: Proceedings of the 1998 ACM SIG-
MOD international conference on Management of
data, pages 307–318, New York, NY, USA. ACM.
Congalton, R. (1991). A review of assessing the accuracy of
classifications of remotely sensed data. Remote sens-
ing of environment, 37(1):35–46.
Frakes, W. B. and Baeza-Yates, R. A., editors (1992). In-
formation Retrieval: Data Structures & Algorithms.
Prentice-Hall.
Fürnkranz, J. (2002). Web structure mining - exploiting the
graph structure of the world-wide web. ÖGAI Journal,
21(2):17–26.
Joachims, T., Cristianini, N., and Shawe-Taylor, J. (2001).
Composite kernels for hypertext categorisation. In
Proceedings of the International Conference on Machine
Learning (ICML), pages 250–257. Morgan Kaufmann Publishers.
Kosala, R. and Blockeel, H. (2000). Web mining research:
a survey. SIGKDD Explor. Newsl., 2(1):1–15.
Michalski, R. S., Carbonell, J. G., and Mitchell, T. M.
(1983). Machine Learning, An Artificial Intelligence
Approach. McGraw-Hill.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill,
New York.
Moody, J. E. and Darken, C. (1989). Fast learning in net-
works of locally-tuned processing units. Neural Com-
putation, 1:281–294.
Oh, H.-J., Myaeng, S. H., and Lee, M.-H. (2000). A prac-
tical hypertext categorization method using links and
incrementally available class information. In SIGIR
’00: Proceedings of the 23rd annual international
ACM SIGIR conference on Research and development
in information retrieval, pages 264–271, New York,
NY, USA. ACM.
Rubner, Y., Tomasi, C., and Guibas, L. J. (2000). The earth
mover’s distance as a metric for image retrieval. Int.
J. Comput. Vision, 40(2):99–121.
Spertus, E. (1997). Parasite: mining structural informa-
tion on the web. Comput. Netw. ISDN Syst., 29(8-
13):1205–1215.
Zhang, M.-L. and Zhou, Z.-H. (2006). Adapting rbf neural
networks to multi-instance learning. Neural Process.
Lett., 23(1):1–26.