A MULTI-SEQUENCE ALIGNMENT ALGORITHM FOR WEB

TEMPLATE DETECTION

Filippo Geraci

and Marco Maggini

Istituto di Informatica e Telematica, CNR, Pisa, Italy

Dipartimento di Ingegneria dell’Informazione, Universit

a di Siena, Siena, Italy

ﬁlippo.geraci@iit.cnr.it, maggini@ing.unisi.it

Keywords:

Web information retrieval, Web template detection.

Abstract:

Nowadays most of Web pages are automatically assembled by content management systems or editing tools

that apply a ﬁxed template to give a uniform structure to all the documents beloging to the same site. The tem-

plate usually contains side information that provides better graphics, navigation bars and menus, banners and

advertisements that are aimed to improve the users’ browsing experience but may hinder tools for automatic

processing of Web documents. In this paper, we present a novel template removing technique that exploits

a sequence alignment algorithm from bioinformatics that is able to automatically extract the template from a

quite small sample of pages from the same site. The algorithm detects the common structure of HTML tags

among pairs of pages and merges the partial hypotheses using a binary tree consensus schema. The experi-

mental results show that the algorithm is able to attain a good precision and recall in the retrieval of the real

template structure exploiting just 16 sample pages from the site. Moreover, the positive impact of the template

removing technique is shown on a Web page clustering task.

1 INTRODUCTION

Automatic processing of Web contents is nowadays

an important tool to provide added value services to

users. For instance, search engines have become an

indispensable and powerful gateway to access infor-

mation on the Web, information extraction techniques

are the basis to design systems for data integration and

collection from unstructured Web documents, senti-

ment analysis and topic monitoring from Web posts

are attracting an increasing interest from companies.

Most of these applications need to focus only on the

most informative part of a Web document and the

quality of their results may be heavily hindered by the

large amount of side information that is usually em-

bedded in Web documents. In fact, most of Web pages

are usually dynamically constructed by content man-

agement systems or edited with design tools that use

a ﬁxed template to format them. Basically a template

provides a common structure of all the pages that is

reﬂected both in the graphical layout and in the skele-

ton of HTML tags that are used to format the page.

Many different techniques for template detection

and removal have been proposed in the literature. All

the reported results show that automatic processing

techniques, like indexing, Web page classiﬁcation and

clustering, may beneﬁt from template removal. Some

early works approached the task by a layout analy-

sis based on a graphical rendering of the page but

their practical use is limited by the high computa-

tional costs (Kovacevic et al., 2002). The most re-

cent algorithms try to exploit the structure of HTML

tags by searching for common parts of the pages or

blocks (Bar-Yossef and Rajagopalan, 2002; Debnath

et al., 2005; Song et al., 2004) or by an analysis of

the document tree as deﬁned by HTML tags (Yi et al.,

2003; Chakrabarti et al., 2007). A related application

that exploits similar techniques is wrapper induction

(see e.g. (Wong and Lam, 2007)), that aims at the au-

tomatic generation of information extraction modules

by exploiting the common structure of the Web pages

containing the data we are interested to collect.

Desirable features of an automatic template ex-

traction algorithms are low computational costs since

they may employed in large scale systems, easy

adaptability to any template structure using a mini-

mum number of sample pages, the use of a completely

unsupervised (or a very limited supervised) approach

to avoid the need of human effort that would limit its

applicability, and high precision in the detection of the

template structure. In this paper we present a template

extraction algorithm that meets most of these require-

Geraci F. and Maggini M. (2011).

A MULTI-SEQUENCE ALIGNMENT ALGORITHM FOR WEB TEMPLATE DETECTION.

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, pages 121-128

DOI: 10.5220/0003712801210128

 SciTePress

ments. The proposed technique exploits the Needle-

man and Wunsch alignment algorithm originally de-

vised for DNA sequences (Needlemana and Wunscha,

1970). This algorithm is exploited to ﬁnd the most

likely alignment among two sequences of HTML tags

extracted from two different pages from the same site.

Since the common sequence structure that is extracted

from a pair of pages may be strongly dependent on

the speciﬁc selected pages, a hierarchical consensus

schema is adopted: given a certain number of differ-

ent pairs of pages, the extracted common sequences,

that constitute the current set of hypotheses for the

target template, are recursively compared to further

distillate the common parts. The comparisons form

a binary consensus tree and the output of the algo-

rithm is the template sequence available in the tree

root. The experimental results show that the proposed

method is able to yield a good precision and recall

in the detection of the HTML tags belonging to the

template given only a very limited number of sample

pages from the site (in the considered setting only 16

pages are able to provide satisfactory performances).

The algorithm is also efﬁcient and the running time

for template extraction is quite low. Finally, the eval-

uation shows that the proposed extraction technique

can provide signiﬁcant beneﬁts in a Web mining task,

namely Web page clustering, conﬁrming similar re-

sults as those reported in the literature.

The paper is organized as follows. The next sec-

tion describes the template extraction algorithm in de-

tails. Then section 3 reports both the evaluation of the

accuracy in the prediction of the real template and an

analysis of the impact of the template removal in a

Web mining application (Web page clustering). Fi-

nally in section 4 the conclusions are drawn and the

future developments are sketched.

2 TEMPLATE DETECTION

ALGORITHM

The proposed template extraction algorithm is

based on the assumption that a Web template is made

up of a set of overrepresented contiguous subse-

quences of HTML tags shared by the pages from a

given web site. According to this assumption, some

local parts of the template can be missing in some par-

ticular web page.

The ﬁrst processing step consists in splitting each

page into a sequence of tokens. In the following, a

token corresponds to a single HTML tag or to the

text snippet contained between two consecutive tags.

The tokens are then normalized to remove the po-

tential differences due to the human editing of the

pages or irrelevant features (i.e. extra white spaces

are removed, capital letters are lowered). Moreover

tags are normalized by sorting their internal attributes.

The normalized tokens form an alphabet of sym-

bols over which the page sequences are generated.

Hence, the problem of ﬁnding the common parts in

two Web pages can be cast as the computation of the

global alignment of the two corresponding sequences

of normalized tokens. The alignment can be obtained

by exploiting a modiﬁed version of the DNA global

alignment algorithm due to Needleman and Wunsch

(Needlemana and Wunscha, 1970), which is based on

a dynamic programming technique.

However, a single alignment of only two Web

pages is likely to produce a poor approximation of

the template since they can share the same tokens just

by chance or a portion of the real template could be

missing in one of the two pages. Hence, in order to

compute a more reliable template, the alignment pro-

cedure is repeated exploiting a set t pages using a re-

cursive procedure. At each step the k input sequences

are paired and aligned to yield k/2 template proﬁle

candidates, starting from the initial t sequences rep-

resenting the available samples. The template candi-

date originating from a given pair of sequences is ob-

tained by pruning those tokens that have receive low

evidence in the alignment steps performed so far. The

procedure iterates until it remains only a single pro-

ﬁle that is returned as ﬁnal Web template (i.e. log

(t)

steps are required).

2.1 The Alignment Algorithm

Two strings s

and s

are aligned by inserting white

spaces (gaps) into them, such that the probability of

ﬁnding the same symbol in the same position of the

two modiﬁed strings is maximized, and that only one

string can contain a gap in a certain position. An

alignment corresponds to the sequence of operations

required to align the two strings. Consider the case

in which we are comparing the i-th symbol of string

and the j-th symbol of string s

(later referred to

as s

[i] and s

[ j]). There are three options: skip both

symbols and move to the next one for both the strings;

insert a gap in s

[i] or in s

[ j]. We assign a score (or

a penalty) to each operation so as to compute a global

score for the whole alignment. In particular we re-

ward the case in which s

[i] = s

[ j] giving score 1,

we ignore the case in which we skip s

[i] and s

[ j]

because they are different, and we penalize the inser-

tion of a gap giving score -1. The goal is to ﬁnd an

alignment with highest score among all the possible

alignments.

In the considered application, the strings cor-

<H1>

</H1>

<BR>

<HR>

Text 1

<BR>

</H1>

Title 1

<H1>

Text 2

Title 2

−10 −2 −3 −4 −5 −6

−7

−6

−5

−4

−3

−2

−1

(a) Matrix initialization

<H1>

</H1>

<BR>

<HR>

Text 1

<BR>

</H1>

Title 1

<H1>

Text 2

Title 2

−10 −2 −3 −4 −5 −6

−6

−5

−4

−3

−2

−1

−7

1 0 −1 −2 −3 −4

0 01 −1 −2 −3

−1 0 2 01 −1

−2 −1 1 23 1

−3 −2 0 2 3 2

−4 −3 −1 321

310−2−5 −4

(b) Filled matrix

Figure 1: Similarity matrix for two small HTML snippets.

respond to sequences of tokens extracted from the

HTML pages, as described before. Let U =

,. ..,u

} and V = {v

,. . .,v

} be the sequences of

tokens representing two Web pages. The alignment

algorithm builds a (n + 1) × (m + 1) matrix S (called

similarity matrix) such that each entry contains the

score of the best alignment of a preﬁx of U and V .

We initialize the ﬁrst row and column of S as follows:

S[0][i] = S[i][0] = −i. Figure 1(a) shows the matrix S

initialized for a small portion of two HTML pages.

Once initializated, we ﬁll each line of the similar-

ity matrix S from left to right. Consider the case in

which we are ﬁlling the cell in position i, j. If we in-

sert a gap in position i of U, the partial score of the

alignment becomes S[ j][i − 1] − 1. Symmetrically, if

we insert a gap in position j of the sequence V , the

partial alignment score becomes S[ j − 1][i] − 1. If we

align v

and u

there are two cases: the two tokens

match, thus score becomes S[ j − 1][i − 1] + 1, the to-

kens do not match, thus the score does not change.

Since we are interested in the best possible alignment,

the algorithm chooses the alternative that maximizes

the score. Note that the value of S[i][ j] is always de-

rived from one of its neighbours. In a separate matrix

S we keep trace of the path from a cell of S to its pre-

decessor. Figure 1(b) shows the ﬁnal similarity matrix

and a visual representation of the corresponding

The ﬁnal phase of the algorithm navigates

S from

the bottom right position to the top left so as to ex-

tract the aligned tokens. Given the element

S[ j][i], if

its predecessor is

S[ j][i − 1] the algorithm outputs v

symmetrically, if the predecessor is

S[ j − 1][i], the al-

gorithm outputs u

. When the predecessor of

S[ j][i] is

S[ j −1][i− 1] we have a match/missmatch. In the ﬁrst

case we output indifferently u

or v

, otherwise we

output both tokens. Note that this sequence of tokens

has to be inverted to match the original order.

2.2 The Template Distillation Algorithm

Given a set of t input pages represented by their nor-

malized token sequences s

, i = 1, ...,t, we can ran-

domly pair these pages to obtain t/2 pairs (s

Each token in these sequences is assigned a score

value that accounts the frequency of the token at each

given position in the sample pages. Hence, at the ﬁrst

step the tokens are all assigned the initial value of 1/t,

since they are observed in just one case. The output of

the alignment algorithm applied on each of these pairs

is an aligned sequence s

,k = 1,.. .,t/2 containing the

union of the two original sequences. The conﬁdence

score of each token in the output sequence is com-

puted as the sum of the scores available in the input

sequences for the matching tokens, whereas if a token

does not match with a token in the other sequence its

original score is left unchanged.

The procedure is repeated at each step r using

as input the sequences s

r−1

,k = 1, ... ,t/2

r−1

, com-

puted after the alignment and a token pruning phase

in the previous step. The pruning phase aims at re-

moving the less frequent tokens from the sequences

and, in particular, all the tokens having a score below

the value 2

r−2

/t are pruned (given the initial values

of the scores, for r = 1,2 no tokens are actually re-

moved). Without loss of generality, we can assume

that t = 2

such that the algorithm terminates after

n = log

(t) steps when the ﬁnal sequence correspond-

ing to the output template is computed. The distilla-

Table 1: The Web sites collected in the benchmark dataset, with the related average number of tokens in a typical page and in

the corresponding template.

Web site # tokens # template tokens

PI: punto-informatico.it 876 686

Ansa: ansa.it 2245 1794

Wikipedia: en.wikipedia.org 2058 467

allMusic: allmusic.com 1559 282

zdNet: zdnet.com 2882 2820

tion procedure can be described as a binary tree whose

leaves correspond to the input pages whereas the in-

ternal nodes are the result of the recursive distillation

process. Nodes in the higher levels represent more

reliable template hypotheses since they collect the ev-

idence extracted by combining the alignments orig-

inating from a larger set of pages. Finally the ﬁnal

result is the represented by the sequence available at

the root of the tree. Hence, it is easy to verify that the

procedure requires to perform a total of 2

−1 = t − 1

alignments.

The sequence available at the root node is ﬁnally

pruned by removing those tokens whose score is be-

low 1/2, that is that the evidence of their presence is

supported by less than half of the pages used in the

process (this is the value used in the experiments but

a different threshold could be used).

3 EXPERIMENTAL RESULTS

The computational cost and the impact of the tem-

plate extraction algorithm in a typical Web mining

task were evaluated in a set of experiments. In par-

ticular, the ﬁrst evaluation was designed to assess the

ability of the algorithm to predict exactly which to-

kens belong to the template for a given Web site,

whereas a second benchmark has been performed to

investigate the contribution of the algorithm on the ﬁ-

nal outcome of an information retrieval task (namely

Web page clustering).

The experiments were performed on a Mac

equipped with a 3.33 Ghz Intel core 2 Duo, 8Gb of

RAM and the Mac OS X v. 10.6.7 operating system.

The algorithm was implemented in Python. Even if

we could speed up the algorithm by exploiting the in-

trinsic parallelism of the CPU by performing many

alignments of pages in parallel, we decided to imple-

ment the algorithm as a sequential program to avoid

biases in the running time evaluation due to the par-

ticular CPU architecture. However we would like to

remark that, using an architecture with t/2 processors,

it is possible to speed up the algorithm from O(t) to

O(log

(t)).

3.1 Evaluation of Template Prediction

The main goal of the ﬁrst set of experiments was to

assess the ability of the algorithm to predict which to-

kens belong to the Web site template. To the best of

our knowledge there is not a public available dataset

of Web templates to be used for the comparison of

template extraction algorithms. To perform an ac-

curate evaluation the dataset should contain a set of

pages extracted from Web sites that exploit different

template structures and for each of them the actual

template, that is the ground truth, should be available.

Hence, we decided to select ﬁve Web sites with dif-

ferent size and characteristics in their templates (i.e.

the ratio between the number of template tokens and

the overall number of tokens in a typical page). For

each of them, the “true” template was manually ex-

tracted. The sites we selected are reported in table 1,

that shows for each of them the average number of

tokens per page and the number of tokens in the asso-

ciated template.

The task of extracting the template from a Web

page (or a set of pages) is made complex because of

two reasons:

• to discriminate whether a certain section of

HTML is part or not of the template can result in

an arbitrary choice;

• the template can contain some tags which require

to be manually modiﬁed.

The ﬁrst situation can be found in Web sites that are

divided into sections, where each section shares with

the others most of the template except some details.

In these cases, we could deﬁne a more general tem-

plate for the whole site or more speciﬁc templates for

each section. Another interesting case is that of on–

line newspapers in which often the home page has a

slightly different structure than the other pages. An

example of the second case is a template featuring

a context menu such that each page contains a link

pointing to itself or to the category to which it be-

longs. Another case is when the pages contain ran-

dom generated banners/ads for which the template al-

locates some space, but the target link changes every

time.

Table 2: Precision, recall and number of tags in the extracted template for different values of the parameter t.

PI: punto-informatico.it Ansa: ansa.it

# Pages 2 4 8 16 32 2 4 8 16 32

Precision 0.720 0.804 0.842 0.855 0.811 0.681 0.816 0.891 0.856 0.859

Recall 0.940 0.940 0.930 0.868 0.739 0.973 0.957 0.903 0.908 0.892

# tags 915 808 757 697 625 2651 2159 1822 1909 1869

Wikipedia: en.wikipedia.org allMusic: allmusic.com

# Pages 2 4 8 16 32 2 4 8 16 32

Precision 0.155 0.226 0.569 0.906 0.911 0.119 0.196 0.395 0.702 0.725

Recall 0.946 0.905 0.899 0.931 0.907 0.950 0.946 0.939 0.936 0.936

# tags 3012 1864 738 480 466 2242 1362 671 377 365

zdNet: zdnet.com

# Pages 2 4 8 16 32

Precision 0.674 0.874 0.890 0.934 0.959

Recall 0.941 0.917 0.915 0.910 0.777

# tags 4051 2961 2908 2748 2285

Table 2 reports the precision and recall of the com-

puted template against the manually extracted ground

truth, for different values of the parameter t. Even

if there is not a theoretical limitation for the value of

parameter t we narrowed the tests only to powers of

It is not surprising that the recall tends to have

a monotonically decreasing trend with respect to the

value of the parameter t. In fact, when using more

pages it is likely to evidence slight differences in the

use of the template tags that prevent the algorithm

to select them. In contrast the precision tends to in-

crease. This is due to the fact that the higher is the

number of pages exploited in the computation of the

consensus, the lower is the probability for a token not

belonging to the template to be shared enough to be

kept in the ﬁnal hypothesis. On the other hand, neg-

ative effects in the precision due to the local discrep-

ancy of some pages are also mitigated by the higher

number of pages.

With the purpose of better investigating the ef-

fects of the choice of t in the ﬁnal template prediction

quality, we report in ﬁgure 2(a) the f-measure for the

experiments described above. F-measure is the har-

monic mean of precision and recall, and it is com-

monly used in information retrieval to evaluate the

trade-off between these two measures. The plot shows

how a overestimated value for t tends to slightly re-

duce the overall F-measure (as highlighted from PI

and zdnet). On the other hand, a underestimated

value of t can result in a poor F-measure (as high-

lighted from wikipedia and allmusic) due to a low

precision.

We observed that precision is strongly affected by

the ratio between the size of the template and the num-

ber of tokens in the considered web pages. In partic-

ular, for a given assignment of t, the higher the ratio,

the higher the precision. We observed also that, for

increasing values of t, the number of tokens returned

from the algorithm decreases (see table 2). This latter

observation suggests that precision is not inﬂuenced

by the number of retrieved tokens effectively belong-

ing to the template, but from the number of false pos-

itives.

According to Figure 2(a) we can conclude that set-

ting t = 16 seems to be a good choice in most situa-

tions, at least when an estimation of the web template

size can not be automatically accomplished.

Figure 2(b) reports the running time of the algo-

rithm using the hardware and software conﬁguration

described earlier in this section. Not surprisingly the

algorithm running time depends linearly on t. Note

that for the assignment of t = 16 the algorithm is al-

ways able to extract the template in few minutes.

We expected also to ﬁnd a strong correlation be-

tween the running time and the characteristics of the

Web sites (average number of tokens, template size),

but our experiments conﬁrm only partially this corre-

lation. Consider the cases of allMusic and Ansa, the

former has a lower number of tokens per page and the

template size is smaller, however the running time is

comparable to the latter. The explanation of this fact

is that the running time is affected by the number of

tokens in each alignment and not by either the page

size or the template size. This feature depends from

intrinsic characteristics of the Web pages, thus it is

unpredictable in advance.

3.2 Web Page Clustering

The main application ﬁelds of Web template extrac-

tion algorithms are: information retrieval and Web

mining. Experimental results in the literature have

shown how these algorithms can improve the perfor-

0.2

0.4

0.6

0.8

2 4 8 16 32

F-measure

# pages

ansa

wikipedia

allMusic

zdnet

(a) F-measure

120

180

240

300

360

420

480

540

600

660

720

780

2 4 8 16 32

Running time (secs)

# pages

ansa

wikipedia

allMusic

zdnet

(b) Running time

Figure 2: Template extraction algorithm performances with respect to the parameter t.

mance in many applications like clustering, classiﬁ-

cation and indexing (Yi et al., 2003; Bar-Yossef and

Rajagopalan, 2002). In this context, the exact identiﬁ-

cation of the Web template itself is not the main goal,

but it is crucial to remove from the document corpus

those repeated parts which are not discriminative for

the speciﬁc task, even if they are not properly part of

the Web template.

In this section we report experiments aimed at

evaluating the effects of the template extraction algo-

rithm in a clustering task. To this purpose we set up

two datasets to which we applied the clustering proce-

dure before and after the removal of the template. We

also compare our algorithm with the naive baseline al-

gorithm which removes the real (manually extracted)

template.

In table 3 we report some details about the two

document corpora used. Both these corpora were ob-

tained downloading a certain number of articles from

an Italian on-line magazine. We only downloaded ar-

ticles and not other kinds of pages (i.e. the home

page). This choice has two advantages: ﬁrst it reduces

the potential effects due to the use of slightly differ-

ent templates in different categories of pages; second,

the subdivision of the articles into categories can be

easily derived from the manual classiﬁcation made by

the web site owners.

Table 3: Datasets used in the Web page clustering evalua-

tion.

Dataset PI Ansa

Web site punto-informatico.it ansa.it

# articles 9364 3470

# Categories 7 13

As clustering procedure we used an implemen-

tation of the Furthest-Point-First (FPF) algorithm

(Gonzalez, 1985), which has already been used for

related Web mining tasks (Geraci et al., 2008). As

quality measures we used four standard functions

commonly used in IR to evaluate clustering qual-

ity: F-measure, purity, normalized mutual informa-

tion (NMI), and normalized complementary entropy

(NCE).

Let GT = {GT

,. . .,GT

} be the partition of the

articles as extracted by the original classiﬁcation in

the Web sites, and let C = {c

,. . .,c

} be the outcome

of the clustering procedure. Given a cluster c

and a

class GT

, precision and recall can be deﬁned as

i, j

|GT

∩ c

, r

i, j

|GT

∩ c

|GT

Let F(GT

) be the F-measure for a given cluster

and a class GT

deﬁned as the harmonic mean of

precision and recall. We can extend the concept of F-

measure to the entire clustering by using the following

formula:

F =

∑

|GT

max

(F(GT

)),

where n is the sum of the cardinality of all the

classes. The value of F is in the range [0,1] and a

higher value indicates better quality.

Purity is devised to evaluate how well each clus-

ter ﬁts in a class of the ground truth. Each cluster is

assigned to the class which share the highest number

of articles with it. The higher number of shared doc-

uments the higher cluster quality according to purity.

The overall purity for a clustering is the sum of the

contributions provided by each cluster. In details pu-

rity is deﬁned as:

A =

∑

max

|GT

∩ c

| .

The value of A is in the range [0, 1] and a higher

value indicates better quality.

The normalized mutual information (see

e.g. (Strehl, 2002, page 110)), comes from in-

formation theory and is deﬁned as follows:

NMI(C, GT ) =

log|C||GT |

∑

c∈C

∑

∈GT

P(c,c

) · log

P(c,c

)

P(c) · P(c

)

where P(c) represents the probability that a ran-

domly selected article o

belongs to c, and P(c,c

)

represents the probability that a randomly selected ar-

ticle o

belongs to both c and c

. The normalization,

achieved by the

log|C||GT |

factor, is necessary in or-

der to account for the fact that the cardinalities of C

and GT are in general different (Cover and Thomas,

1991). Higher values of NMI mean better clustering

quality. NMI is designed for hard clustering.

The normalized complementary entropy (Strehl,

2002, page 108) is a modiﬁed version of entropy

which ranges in the interval [0,1]. The entropy of a

cluster c

∈ C is

|GT |

∑

k=1

−r

k, j

logr

k, j

where r

k, j

is the recall of cluster c

for class GT

To normalize the entropy we divide E

for log|GT |.

To evaluate the entropy for the whole clustering we

can now sum the contribution of each cluster. Then

we compute the complement of the entropy to make

it coherent with the above described measures so that

higher values mean better quality. The ﬁnal formula

for the normalized complementary entropy is the fol-

lowing:

NCE(C,GT ) = 1 −

log|GT |

|C|

∑

j∈1

Table 4: Clustering performance when using the original

pages, the proposed template removal algorithm with t =

8,16, and the removal of the manually extracted template.

PI: punto-informatico.it

Original Our Baseline

t = 8 t = 16

F-measure 0.234 0.260 0.276 0.274

Purity 0.249 0.261 0.272 0.279

NMI 0.040 0.044 0.059 0.058 6

NCE 0.064 0.068 0.084 0.083

Ansa: ansa.it

Original Our Baseline

t = 8 t = 16

F-measure 0.387 0.445 0.449 0.422

Purity 0.397 0.466 0.471 0.452

NMI 0.346 0.425 0.384 0.346

NCE 0.383 0.388 0.421 0.383

Table 4 reports an evaluation of the clustering

quality for the datasets PI and Ansa in different cases:

when the template is not removed (column Original

in the tables), when the template is identiﬁed and re-

moved by the template extraction algorithm (with dif-

ferent values of the parameter t), and when we remove

the manually extracted template. For each measure,

we highlighted the case reporting the highest perfor-

mance.

A ﬁrst important observation is that, the experi-

ments conﬁrm how template removal is an effective

tool to improve the quality of Web information re-

trieval tasks. Another important observation is that

these experiments conﬁrm how setting t = 16 in the

extraction algorithm is, in general, a good choice in-

dependently of the document corpus characteristics.

Hence, even if it works quite well in practice,

the removal of the manually extracted template is not

the best possible choice for improving the clustering

quality. In fact, for this task the perfect identiﬁcation

of the template is not the main issue. The experiments

show that the automatic template extraction algorithm

consistently provides a better result in the clustering

task when compared to the removal of the manually

extracted template.

Figure 3 shows the beneﬁts of Web template re-

moval for clustering in terms of the reduction of the

corpus size, and of the clustering running time. In par-

ticular, ﬁgure 3 (a) reports the clustering running time

for PI and Ansa. The experiments conﬁrm that Web

template removal has consistently positive effects on

the clustering time. The extraction algorithm makes

clustering faster than that obtained by the manual Web

template removal. In the case of PI The proposed al-

gorithm can contribute to save up to 50% of the clus-

tering time (compared to the clustering of the original

dataset). In the case of Ansa the amount of saved time

depends on the choice of t. Figure 3 (a) shows that,

setting t = 16, the saved time is still about 40%.

Figure 3 (b) shows that the extraction algorithm

reduces the amount of used space to store the docu-

ment corpus of a factor 3. Note that for the PI dataset

the algorithm reduces the corpus size more than the

removal of the manually extracted template. This is

due to the fact that, in this case, the extraction algo-

rithm has identiﬁed snippets of HTML, which are re-

peated in the most of pages, even if they are not part of

the Web template (i.e. tags commonly used for article

formatting).

4 CONCLUSIONS

The paper describes a template extraction technique

that exploits a sequence alignment algorithm to de-

tect the HTML structure that is shared by the pages

5000"

10000"

15000"

20000"

25000"

Original"

Our"t"="8"

Our"t"="16"

Baseline"

Clustering+running+,me+in+Seconds+

PI"

Ansa"

(a) Clustering running time

296$

103$

106$

125$

212$

74$

66$

64$

50$

100$

150$

200$

250$

300$

350$

Original$

Our$t$=$8$

Our$t$=$16$

Baseline$

Dataset&occupa+on&in&Mb&

PI$

Ansa$

(b) Dataset size in Mb

Figure 3: Time and space requirements of the clustering algorithm for the datasets PI and Ansa before and after template

removal.

from a target site. The algorithm is able to yield good

retrieval performance given just a small number of

pages from the site. It is also shown that the removal

of the template extracted by the algorithm can signif-

icantly improve the performance of a Web page clus-

tering application. Future work will concern the im-

provement of the alignment step by studying different

solutions for assigning the similarity matrix on the ba-

sis of different features of the compared elements.

ACKNOWLEDGEMENTS

This work was partially supported by the Italian Reg-

istry of the ccTLD “.it” Registro.it.

REFERENCES

Bar-Yossef, Z. and Rajagopalan, S. (2002). Template detec-

tion via data mining and its applications. In Proceed-

ings of the 11th international conference on World

Wide Web, WWW ’02, pages 580–591, New York,

NY, USA. ACM.

Chakrabarti, D., Kumar, R., and Punera, K. (2007). Page-

level template detection via isotonic smoothing. In

Proceedings of the 16th international conference on

World Wide Web, WWW ’07, pages 61–70, New York,

NY, USA. ACM.

Cover, T. M. and Thomas, J. A. (1991). Elements of infor-

mation theory. John Wiley & Sons, New York, US.

Debnath, S., Mitra, P., and Giles, C. L. (2005). Automatic

extraction of informative blocks from webpages. In

Proceedings of the 2005 ACM symposium on Applied

computing, SAC ’05, pages 1722–1726, New York,

NY, USA. ACM.

Geraci, F., Pellegrini, M., Maggini, M., and Sebastiani, F.

(2008). Cluster generation and labelling for web snip-

pets: A fast, accurate hierarchical solution. Internet

Matematics, 3(4):413–443.

Gonzalez, T. F. (1985). Clustering to minimize the maxi-

mum intercluster distance. Theoretical Computer Sci-

ence, 38(2/3):293–306.

Kovacevic, M., Diligenti, M., Gori, M., and Milutinovic,

V. (2002). Recognition of common areas in a web

page using visual information: a possible application

in a page classiﬁcation. In Proceedings of the IEEE

International Conference on Data Mining, ICDM ’02.

Needlemana, S. B. and Wunscha, C. D. (1970). A gen-

eral method applicable to the search for similarities in

the amino acid sequence of two proteins. Journal of

Molecular Biology, 48(3):443–453.

Song, R., Liu, H., Wen, J.-R., and Ma, W.-Y. (2004).

Learning block importance models for web pages. In

Proceedings of the 13th international conference on

World Wide Web, WWW ’04, pages 203–211, New

York, NY, USA. ACM.

Strehl, A. (2002). Relationship-based Clustering and Clus-

ter Ensembles for High-dimensional Data Mining.

PhD thesis, University of Texas, Austin, US.

Wong, T.-L. and Lam, W. (2007). Adapting web infor-

mation extraction knowledge via mining site-invariant

and site-dependent features. ACM Trans. Internet

Technol., 7.

Yi, L., Liu, B., and Li, X. (2003). Eliminating noisy infor-

mation in web pages for data mining. In Proceedings

of the ninth ACM SIGKDD international conference

on Knowledge discovery and data mining, pages 296–

305.