ments. The proposed technique exploits the Needleman-Wunsch
alignment algorithm, originally devised for DNA sequences
(Needleman and Wunsch, 1970). The algorithm is used to find
the most likely alignment between two sequences of HTML tags
extracted from two different pages of the same site.
Since the common sequence structure that is extracted
from a pair of pages may be strongly dependent on
the specific selected pages, a hierarchical consensus
schema is adopted: given a certain number of differ-
ent pairs of pages, the extracted common sequences,
that constitute the current set of hypotheses for the
target template, are recursively compared to further
distill the common parts. The comparisons form
a binary consensus tree and the output of the algo-
rithm is the template sequence available in the tree
root. The experimental results show that the proposed
method yields good precision and recall in detecting
the HTML tags that belong to the template, given only a
very limited number of sample pages from the site (in the
considered setting, 16 pages suffice for satisfactory
performance).
The algorithm is also efficient and the running time
for template extraction is quite low. Finally, the eval-
uation shows that the proposed extraction technique
can provide significant benefits in a Web mining task,
namely Web page clustering, confirming results similar
to those reported in the literature.
The paper is organized as follows. The next section
describes the template extraction algorithm in detail.
Section 3 reports both an evaluation of the accuracy in
predicting the real template and an analysis of the impact
of template removal on a Web mining application (Web page
clustering). Finally, Section 4 draws the conclusions and
sketches future developments.
2 TEMPLATE DETECTION
ALGORITHM
The proposed template extraction algorithm is
based on the assumption that a Web template is made
up of a set of overrepresented contiguous subse-
quences of HTML tags shared by the pages from a
given Web site. Under this assumption, some local parts
of the template may be missing from a particular Web page.
The first processing step consists in splitting each
page into a sequence of tokens. In the following, a
token corresponds to a single HTML tag or to the
text snippet contained between two consecutive tags.
The tokens are then normalized to remove potential
differences due to human editing of the pages or to
irrelevant features (i.e. extra white spaces are removed
and capital letters are converted to lowercase). Moreover,
tags are normalized by sorting their internal attributes.
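The tokenization and normalization steps above can be sketched as follows. This is only a minimal illustration built on Python's standard html.parser module, not the authors' implementation. The names TokenCollector and tokenize are hypothetical, and two choices are assumptions: text snippets made only of white space are dropped, and attribute values are left untouched.

```python
import re
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Hypothetical helper: one normalized token per tag or text snippet."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names;
        # sorting by attribute name makes the original order irrelevant.
        parts = [tag]
        for name, value in sorted(attrs, key=lambda attr: attr[0]):
            parts.append(name if value is None else f'{name}="{value}"')
        self.tokens.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse runs of white space and lowercase the text snippet.
        text = re.sub(r"\s+", " ", data).strip().lower()
        if text:  # assumption: whitespace-only snippets are not tokens
            self.tokens.append(text)


def tokenize(html):
    """Return the sequence of normalized tokens for one page."""
    parser = TokenCollector()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

With this sketch, two tags that differ only in case, spacing, or attribute order map to the same symbol of the alphabet.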
The normalized tokens form an alphabet of sym-
bols over which the page sequences are generated.
Hence, the problem of finding the common parts in
two Web pages can be cast as the computation of the
global alignment of the two corresponding sequences
of normalized tokens. The alignment can be obtained
by exploiting a modified version of the DNA global
alignment algorithm due to Needleman and Wunsch
(Needleman and Wunsch, 1970), which is based on
a dynamic programming technique.
However, a single alignment of only two Web
pages is likely to produce a poor approximation of
the template since they can share the same tokens just
by chance or a portion of the real template could be
missing in one of the two pages. Hence, in order to
compute a more reliable template, the alignment procedure
is repeated on a set of t pages using a recursive
procedure. At each step the k input sequences are paired
and aligned to yield k/2 template profile candidates,
starting from the initial t sequences representing the
available samples. The template candidate originating from
a given pair of sequences is obtained by pruning those
tokens that have received low evidence in the alignment
steps performed so far. The procedure iterates until only
a single profile remains, which is returned as the final
Web template (i.e. $\log_2(t)$ steps are required).
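The pairing scheme can be illustrated with the sketch below, which rests on two simplifying assumptions. First, the pairwise alignment is replaced by a stand-in (common_tokens, a hypothetical helper built on difflib.SequenceMatcher from the Python standard library), not the modified Needleman-Wunsch procedure. Second, the evidence-based pruning is approximated by keeping only the tokens matched in every comparison, which is stricter than pruning low-evidence tokens. The name consensus_template is likewise hypothetical.

```python
from difflib import SequenceMatcher


def common_tokens(seq_a, seq_b):
    """Stand-in for the pairwise alignment: tokens matched in both inputs."""
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(seq_a[block.a:block.a + block.size])
    return shared


def consensus_template(sequences, combine=common_tokens):
    """Pair and merge sequences level by level until one profile remains."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    level = list(sequences)
    steps = 0
    while len(level) > 1:
        # k input sequences are paired to yield k/2 candidates.
        merged = [combine(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # Assumption: an unpaired sequence is carried to the next level.
            merged.append(level[-1])
        level = merged
        steps += 1
    return level[0], steps


pages = [
    ["<html>", "<nav>", "a", "</nav>", "<p>", "x", "</p>", "</html>"],
    ["<html>", "<nav>", "a", "</nav>", "<p>", "y", "</p>", "</html>"],
    ["<html>", "<nav>", "a", "</nav>", "<div>", "z", "</div>", "</html>"],
    ["<html>", "<nav>", "a", "</nav>", "<div>", "w", "</div>", "</html>"],
]
template, steps = consensus_template(pages)
```

For the four toy sequences, the root of the tree is reached after two levels, in agreement with the $\log_2(t)$ bound for t = 4.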
2.1 The Alignment Algorithm
Two strings $s_1$ and $s_2$ are aligned by inserting white
spaces (gaps) into them, such that the probability of
finding the same symbol in the same position of the two
modified strings is maximized, and such that only one
string can contain a gap in a given position. An alignment
corresponds to the sequence of operations required to align
the two strings. Consider the case in which we are
comparing the i-th symbol of string $s_1$ and the j-th
symbol of string $s_2$ (later referred to as $s_1[i]$ and
$s_2[j]$). There are three options: skip both symbols and
move to the next one in both strings; insert a gap at
$s_1[i]$; or insert a gap at $s_2[j]$. We assign a score (or
a penalty) to each operation so as to compute a global
score for the whole alignment. In particular, we reward the
case in which $s_1[i] = s_2[j]$ with score 1, we ignore the
case in which we skip $s_1[i]$ and $s_2[j]$ because they are
different, and we penalize the insertion of a gap with
score -1. The goal is to find an alignment with the highest
score among all possible alignments.
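The scoring scheme can be made concrete with the textbook Needleman-Wunsch dynamic program sketched below. It is not the modified version used by the proposed method. The function name align and its parameters are hypothetical, and "ignoring" a pair of different symbols is interpreted here as a score of 0.

```python
def align(s1, s2, match=1, mismatch=0, gap=-1):
    """Return (best global score, aligned pairs); None marks a gap."""
    n, m = len(s1), len(s2)
    # score[i][j] is the best score for aligning s1[:i] with s2[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # consume both symbols
                score[i - 1][j] + gap,       # gap inserted in s2
                score[i][j - 1] + gap,       # gap inserted in s1
            )

    # Trace back from the bottom-right cell to recover one best alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                pairs.append((s1[i - 1], s2[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((s1[i - 1], None))
            i -= 1
        else:
            pairs.append((None, s2[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


best, pairs = align(list("abcd"), list("abd"))
# best is 2: three matches (a, b, d) and one gap opposite c.
```

Because the inputs are indexed sequences of comparable symbols, the same function applies unchanged to lists of normalized tokens.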
In the considered application, the strings cor-