The key idea in this step is to choose the record
division that maximizes the similarity between
records. The similarity measure used is based on
serializing the subtrees as strings and then applying
string edit-distance techniques (Levenshtein, 1966).
The main problem in applying this idea is that the
number of candidate divisions is 2^(n-1), where n is the
number of subtrees. In most real Web pages, this
number is too high to compute the similarities for all
the possible divisions. Therefore, we first need to
generate a set of candidate divisions.
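As a concrete illustration of this similarity measure, the sketch below computes the Levenshtein edit distance between two serialized subtrees and derives a normalized similarity from it. The token lists and the normalization are our assumptions for illustration, not the authors' exact implementation.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical serializations."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Subtrees serialized as lists of tag tokens (a hypothetical serialization):
# similarity(["TR", "TD", "B"], ["TR", "TD"]) is 1 - 1/3.
```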
The method for choosing the candidate divisions
starts by clustering all the subtrees in the data region
according to their similarity. Then, we assign an
identifier to each generated cluster and build a
sequence by listing the subtrees in the data region in
order, representing each subtree by the identifier of
the cluster it belongs to (see Figure 3).
The string will tend to be formed by a repetitive
sequence of cluster identifiers, with each repetition
corresponding to a data record. Since there may be
optional data fields in the extracted records, the
sequence for one record may be slightly different
from the sequence corresponding to another record.
Nevertheless, we will assume that all records either
start or end with a subtree belonging to the same
cluster (i.e. all the data records always either start or
end in a similar way). This assumption is based on the
heuristic that, in most Web sites, records are visually
delimited in an unambiguous manner to improve clarity.
This process reduces the number of candidate
divisions to 1+2k, where k is the number of clusters.
Figure 3 shows the subtrees of Figure 2 that have
been chosen for each record r0-r3 of our example by
applying this process and choosing the division with
the highest similarity between records.
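Under the start-or-end assumption above, the 1+2k candidates can be enumerated directly from the cluster-identifier sequence. The sketch below is a hypothetical illustration (the function name and the handling of leading/trailing subtrees are our assumptions): for each cluster identifier there is one division cutting just before every occurrence and one cutting just after it, plus the trivial whole-region division.

```python
def candidate_divisions(seq):
    """Yield divisions of seq (a list of cluster ids) as lists of records."""
    yield [seq]  # trivial division: the whole data region is one record
    for cid in sorted(set(seq)):
        # records *start* with cid: cut just before each occurrence
        # (subtrees before the first occurrence are dropped here)
        starts = [i for i, c in enumerate(seq) if c == cid]
        yield [seq[a:b] for a, b in zip(starts, starts[1:] + [len(seq)])]
        # records *end* with cid: cut just after each occurrence
        # (subtrees after the last occurrence are dropped here)
        ends = [i + 1 for i, c in enumerate(seq) if c == cid]
        yield [seq[a:b] for a, b in zip([0] + ends[:-1], ends)]
```

With k distinct cluster identifiers this yields exactly 1 + 2k candidate divisions, matching the bound stated above.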
[Figure omitted: a TABLE node with TR subtrees t0-t10, labeled with the
cluster-identifier sequence c2 c0 c1 c2 c1 c2 c0 c1 c2 c1 c1 and grouped
into the records r0-r3.]
Figure 3: Record division for the page in Figure 1.
The third step is extracting the attributes of the
data records. This step involves two stages: (1)
transforming each record into a string and (2)
applying string alignment techniques across the
records in order to identify their attributes. Variations of the
“center star” algorithm (Gonnet et al., 1992) are
used to solve this problem. The key idea is that the
aligned variable values in several records represent
the different values of the same field in different
records. Figure 4 shows an excerpt of the alignment
between the strings that represent the records of our
example. In this excerpt, it can be seen how the last
data fields of each record are aligned (fields that
correspond to the attributes COMPENSATION,
EXTRACOMPENSATION and JOB_TYPE).
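A rough sketch of this alignment stage, assuming records are given as token lists: pick a "center" record minimizing the total distance to the others, then align every other record against it, so that tokens mapped to the same center position are taken as values of the same field. Here difflib.SequenceMatcher stands in for a full edit-distance alignment; this is an illustration, not the variant of the center star algorithm actually used by the authors.

```python
import difflib

def pick_center(records):
    """Return the record with minimal total distance to all others."""
    def dist(a, b):
        return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()
    return min(records, key=lambda r: sum(dist(r, s) for s in records))

def align_to_center(center, other, gap="-"):
    """Align `other` to `center`, padding unmatched positions with gaps."""
    out = [gap] * len(center)
    sm = difflib.SequenceMatcher(None, center, other)
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            out[block.a + k] = other[block.b + k]
    return out

# e.g. aligning a short record against a longer center record:
# align_to_center(["TR","TD","B","TEXT","TEXT"], ["TR","TD","TEXT"])
```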
[Figure omitted: excerpt of the column-wise alignment of the tag strings of
records r0-r3 (e.g. r0: ... TR TD B TEXT TEXT B TEXT TEXT TR TD B TEXT TEXT
SPAN TEXT; r1: ... TR TD B TEXT TEXT), with the aligned fields labeled
COMPENSATION ("Compensation:"), EXTRACOMPENSATION ("Extra compensations:",
"(only on vacancies)") and JOB_TYPE ("Job Type:").]
Figure 4: Alignment between records of Figure 1.
4 IMPROVEMENTS TO THE
EXISTING SOLUTION
The method described in the previous section has
several limitations. For instance, it is unable to
identify the sublists marked in our example of
Figure 1. In this section, we will describe the
improvements we have incorporated to deal with
these limitations.
The main idea behind our algorithm is very
simple: each record from the list identified by the
above method can be considered a reduced HTML
page, obtained by simply adding a root node over its
subtrees. We can then execute the method again on
these “reduced pages” to find sublists inside them.
For instance, applying our algorithm to the first
record of Figure 1 (r0) will correctly identify the
3-record sublist inside it. Whenever sublists are
found, we keep applying the method recursively to
each of their records. Unfortunately, this simple
method has several important drawbacks:
1. Sublists with very few records will not be
identified. Our method for finding lists is based
on finding records with high inter-similarity,
but, for instance, in record r1 of Figure 1, the
sublist contains only one record, so the method
cannot work directly.
2. As the “pages” used to search for lists become
more and more reduced, lists tend to be shorter
and their elements tend to have fewer fields. In
this situation, the list identification process will
have less information, so we need additional
heuristics to maintain the accuracy.
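The recursive idea can be sketched as follows. Node and find_records are illustrative stand-ins (find_records abbreviates the whole list-extraction method of the previous section), not the authors' implementation; the depth limit is our own safeguard for the sketch.

```python
class Node:
    """Minimal DOM-like tree node."""
    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = children or []

def find_records(root):
    """Placeholder for the base list-extraction method: returns groups of
    subtrees, one group per record (trivial grouping for illustration)."""
    node = root
    while len(node.children) == 1:   # descend through wrapper nodes
        node = node.children[0]
    if len(node.children) < 2:
        return []
    return [[child] for child in node.children]

def extract_recursive(root, depth=0, max_depth=3):
    """Wrap each record as a 'reduced page' and re-run the method on it,
    producing a nested, multilevel result structure."""
    if depth >= max_depth:
        return []
    results = []
    for record_subtrees in find_records(root):
        reduced_page = Node("root", record_subtrees)  # add a root node
        sublists = extract_recursive(reduced_page, depth + 1, max_depth)
        results.append({"record": record_subtrees, "sublists": sublists})
    return results
```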
Section 4.1 discusses these issues. The problem
of removing unwanted content (ads, pagination
controls, etc.) contained inside the data region is
discussed in section 4.2.
The results of the algorithm are stored in a
multilevel map due to the hierarchical nature of the
KDIR 2012 - International Conference on Knowledge Discovery and Information Retrieval