DESIGN OF AUTOMATICALLY ADAPTABLE WEB WRAPPERS

Emilio Ferrara

Department of Mathematics, University of Messina, Messina, Italy

Robert Baumgartner

Lixto Software GmbH, Vienna, Austria

Keywords:

Semantic Web, Information Extraction, Data Mining.

Abstract:

The huge amount of information distributed through the Web motivates the study of techniques for extracting relevant data efficiently and reliably. Both academia and enterprises have developed several approaches to Web data extraction, for example using techniques from artificial intelligence or machine learning. Some commonly adopted procedures, namely wrappers, ensure a high degree of precision in the information extracted from Web pages; at the same time, they must prove robust so as not to compromise the quality and reliability of the data themselves.

In this paper we focus on some experimental aspects related to the robustness of the data extraction process and the possibility of automatically adapting wrappers. We discuss the implementation of algorithms for finding similarities between two different versions of a Web page, in order to handle modifications, avoid the failure of data extraction tasks and ensure the reliability of the extracted information. Our purpose is to evaluate the performance, advantages and drawbacks of our novel system of automatic wrapper adaptation.

1 INTRODUCTION

The World Wide Web today contains an enormous amount of information, mostly unstructured, in the form of Web pages but also of documents of various kinds. In recent years, considerable effort has gone into developing techniques for information extraction on top of the Web. The approaches adopted span several fields of Mathematics and Computer Science, including, for example, logic programming and machine learning. Several projects, initially developed in academic settings, evolved into commercial products, and it is possible to identify different methodologies for tackling the problem of Web data extraction. A widely adopted approach is to define Web wrappers, procedures that analyze the structure of HTML Web pages (i.e. the DOM tree) to extract the required information. Wrappers can be defined in several ways; for example, the most advanced tools let users design them visually and semi-automatically, selecting elements of interest in a Web page and defining rules for their extraction and validation. Regardless of how they are generated, wrappers intrinsically refer to the HTML structure of the Web page at the time of their creation. This introduces non-negligible robustness problems: a wrapper may fail in its data extraction task if the underlying structure of the Web page changes, even slightly. Moreover, the extraction process may not fail outright and yet return corrupted data.

These aspects lead to the following considerations. During their definition, wrappers should be made as elastic as possible, so that they can intrinsically handle minor modifications to the structure of Web pages (small local changes of this kind are much more frequent than heavy modifications). Although elastic wrappers can react efficiently to minor changes, maintenance is still required throughout the whole wrapper life-cycle. Wrapper maintenance is expensive because it requires highly qualified personnel, specialized in defining wrappers, to spend their time rebuilding or fixing wrappers whenever they stop working properly. To mitigate this, several commercial tools include notification features that report warnings or errors during wrapper execution. Moreover, to increase reliability, data extracted by wrappers can be subjected to validation processes, and data cleaning is also a fundamental step; some tools provide caching services that store the last working copy of the Web pages involved in data extraction processes. Sometimes it is even more convenient to rewrite a wrapper from scratch than to track down and fix the causes of a malfunction, because debugging wrapper executions can be non-trivial. Since it is impossible to predict which changes will occur in a specific Web page, and therefore when a wrapper will stop working properly, a smart approach to wrapper maintenance is required.

Ferrara E. and Baumgartner R.

DESIGN OF AUTOMATICALLY ADAPTABLE WEB WRAPPERS.

DOI: 10.5220/0003131802110217

In Proceedings of the 3rd International Conference on Agents and Artificial Intelligence (ICAART-2011), pages 211-217

ISBN: 978-989-8425-40-9

© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

Our purpose in this paper is to describe the realization, and to investigate the performance, of an automatic process for adapting wrappers to structural modifications of Web pages. We designed and implemented a system that stores, during the wrapper definition step, a snapshot of the DOM tree of the original Web page, namely a tree-gram. If problems occur during wrapper execution, this sample is compared with the new DOM structure, finding similarities between trees and sub-trees, in order to automatically try to adapt the wrapper with a custom degree of accuracy. The paper is structured as follows: Section 2 focuses on related work, covering the literature on wrapper generation and adaptation. In Section 3 we explain some concepts related to the tree similarity algorithm we implemented, to support the correctness of our approach. Section 4 gives details about our implementation of automatic wrapper adaptation. The most important results of our experiments are reported in Section 5. Finally, Section 6 concludes with some remarks on future work.

2 RELATED WORK

The concept of analyzing similarities between trees, widely adopted in this work, was introduced by Tai (Tai, 1979), who defined the distance between two trees as a measure of the dissimilarity between them. The problem of transforming a tree into another, similar tree, namely tree edit distance, can be solved by applying elementary transformations to nodes, step by step; the minimum cost of this operation is the tree edit distance between the two trees. This technique has high computational requirements and complex implementations (Bille, 2005), and it is not the optimal solution to our problem of finding similarities between two trees. The simple tree matching technique (Selkow, 1977) represents a turning point: it is a lightweight, recursive, top-down algorithm that evaluates the position of nodes to measure the degree of isomorphism between two trees, analyzing and comparing their sub-trees. Several improvements to this technique have been suggested: Ferrara and Baumgartner (Ferrara and Baumgartner, 2010), extending the concept of weights introduced by Yang (Yang, 1991), developed a variant of this algorithm capable of discovering clusters of similar sub-trees. Kim et al. (Kim et al., 2008) carried out an interesting evaluation of simple tree matching and its weighted version, using the two algorithms to extract information from HTML Web pages; we found their results very useful for developing automatically adaptable wrappers.

Web data extraction and adaptation rely mainly on algorithms that work with DOM trees. The related work, in particular on Web wrappers and their maintenance, is extensive: Laender et al. (Laender et al., 2002) presented a taxonomy of wrapper generation methodologies, while Ferrara et al. (Ferrara et al., 2010) presented a comprehensive survey of the techniques and fields of application of Web data extraction and adaptation. Some novel wrapper adaptation techniques have been introduced in recent years: a valid hybrid approach, mixing logic-based and grammar rules, was presented by Chidlovskii (Chidlovskii, 2001). Machine-learning techniques have also been investigated, e.g. Lerman et al. (Lerman et al., 2003) drew on their expertise in this field to develop a system for wrapper verification and re-induction. Meng et al. (Meng et al., 2003) developed SG-WRAM (Schema-Guided WRApper Maintenance), starting from the observation that changes in Web pages, even substantial ones, always preserve syntactic features (i.e. syntactic characteristics of data items such as data patterns, string lengths, etc.), hyperlinks and annotations (e.g. descriptive information representing the semantic meaning of a piece of information in its context). This system has been implemented in their Web data extraction platform: wrappers are defined by providing both HTML Web pages and their XML schemas, and describing a mapping between them. When the system executes the wrapper, data are extracted in an XML format reflecting the previously specified XML schema; the wrapper maintainer checks for any issues and, if necessary, provides protocols for the automatic adaptation of the problematic wrapper. The XML schema is a DTD (Document Type Definition), while the HTML Web page is represented by its DOM tree. The framework described by Wong and Lam (Wong and Lam, 2005) adapts previously learned wrappers by applying them to Web pages never seen before; they assert that this platform can also discover new attributes in the Web page, using a probabilistic approach that exploits the extraction knowledge acquired through previous wrapping tasks. Raposo et al. (Raposo et al., 2005) likewise suggested exploiting previously acquired information, e.g. results of queries, to ensure a reliable degree of accuracy during the wrapper adaptation process. Finally, Kowalkiewicz et al. (Kowalkiewicz et al., 2006) investigated the possibility of increasing the robustness of wrappers that identify HTML elements inside Web pages through their XPath, by adopting relative XPath expressions instead of absolute ones.
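The difference between absolute and relative XPath expressions can be illustrated with a small sketch. It is not taken from any of the cited systems: the two page snippets, the `price` class and the `extract` helper are hypothetical, and the example relies on the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module, which also requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical original page and a later version in which the content
# has been wrapped in an additional <section> element.
OLD_PAGE = "<html><body><div><span class='price'>42</span></div></body></html>"
NEW_PAGE = ("<html><body><section><div><span class='price'>42</span>"
            "</div></section></body></html>")

# Absolute-style path: depends on every ancestor of the target element.
ABSOLUTE = "./body/div/span"
# Relative path: depends only on the target element and one attribute.
RELATIVE = ".//span[@class='price']"


def extract(page, path):
    """Return the text of every element of `page` matched by `path`."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(path)]


print(extract(OLD_PAGE, ABSOLUTE))  # the wrapper works on the old page
print(extract(NEW_PAGE, ABSOLUTE))  # the absolute path no longer matches
print(extract(NEW_PAGE, RELATIVE))  # the relative path still matches
```

In this toy case the absolute path breaks as soon as one ancestor is inserted, while the relative expression keeps working; real pages may of course change in ways that defeat relative expressions as well.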

3 MATCHING UP HTML TREES

Our approach to the automatic adaptation of wrappers can be explained in three steps: first, outlining how to extract information from Web pages (i.e., in our case, how a Web wrapper works); then, describing how to recover previously extracted information from a different version of the Web page (i.e. how to compare structural information between the two versions of the Web page, finding similarities); finally, defining how to automate this process (i.e. how to build reliable, robust, automatically adaptable wrappers).

Our solution has been implemented in a commercial product (Lixto Suite, www.lixto.com); Baumgartner et al. (Baumgartner et al., 2009) described the details of its design. This platform provides tools for designing Web wrappers visually, by selecting the elements to be extracted from Web pages. During wrapper execution, the selected elements, identified through their XPath(s) in the DOM tree of the Web page, are automatically extracted. Although the wrapper design process lets users define several restricting or generalizing conditions to make wrappers as elastic as possible, wrappers remain strictly tied to the structure of the Web page on top of which they are built. Even slight modifications to this structure can alter the wrapper execution or corrupt the extracted data.

In this section we discuss some theoretical foundations on which our solution relies; in detail, we show an efficient algorithm for finding similar elements within different Web pages.

3.1 Methodology

A simple measure of similarity between two trees, once their comparable elements have been defined, can be established by applying the simple tree matching algorithm (Selkow, 1977), introduced in Section 2. As comparable elements among HTML Web pages we take the nodes of the DOM trees of these pages, which represent HTML elements identified by tags (or, otherwise, free text). Similarly, by comparable attributes we mean all the attributes exposed by HTML elements, both generic (e.g. class, id, etc.) and type-specific (e.g. href for anchors, src for images, etc.); these properties can be exploited to introduce additional comparisons that refine the similarity measure. Several implementations of simple tree matching have been proposed; our solution uses an improved version, namely clustered tree matching (Ferrara et al., 2010), designed to match up HTML trees by identifying clusters of sub-trees with similar structures that satisfy a custom degree of accuracy.

3.2 Tree Matching Algorithms

Previous studies proved the effectiveness of the simple tree matching algorithm applied to Web data extraction tasks (Kim et al., 2008; Zhai and Liu, 2005); it measures the degree of similarity between two HTML trees, computing the maximum matching through dynamic programming and ensuring an acceptable compromise between precision and recall.

As an improvement to this algorithm, a possible implementation of clustered tree matching is the following: let d(n) be the degree of a node n (i.e. the number of first-level children); let T(i) be the i-th sub-tree of the tree rooted at node T; let t(n) be the total number of siblings of a node n, including itself.

Algorithm 1: ClusteredTreeMatching(T′, T″)

    if T′ has the same label as T″ then
        m ← d(T′)
        n ← d(T″)
        for i = 0 to m do
            M[i][0] ← 0
        for j = 0 to n do
            M[0][j] ← 0
        for all i such that 1 ≤ i ≤ m do
            for all j such that 1 ≤ j ≤ n do
                M[i][j] ← Max(M[i][j − 1], M[i − 1][j], M[i − 1][j − 1] + W[i][j])
                    where W[i][j] = ClusteredTreeMatching(T′(i − 1), T″(j − 1))
        if m > 0 AND n > 0 then
            return M[m][n] * 1 / Max(t(T′), t(T″))
        else
            return M[m][n] + 1 / Max(t(T′), t(T″))
    else
        return 0
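Algorithm 1 can be transcribed almost line by line into Python. The following is a minimal sketch under a few assumptions that are not part of the original listing: the `Node` class is a hypothetical stand-in for a DOM node, node labels are plain strings, and the sibling count t(n) is passed down as a parameter (1 for a root) rather than looked up on the node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM-like node: a label (e.g. a tag name) and ordered children."""
    label: str
    children: list = field(default_factory=list)


def clustered_tree_matching(a, b, siblings_a=1, siblings_b=1):
    """Similarity of the trees rooted at a and b, following Algorithm 1.

    siblings_a and siblings_b play the role of t(n): the number of
    siblings of each node, including the node itself (1 for a root).
    """
    if a.label != b.label:
        return 0.0
    m, n = len(a.children), len(b.children)
    # M[i][j]: best weighted matching between the first i children of a
    # and the first j children of b, preserving sibling order.
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = clustered_tree_matching(a.children[i - 1], b.children[j - 1], m, n)
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    weight = 1.0 / max(siblings_a, siblings_b)
    if m > 0 and n > 0:
        return M[m][n] * weight
    return M[m][n] + weight


page_v1 = Node("ul", [Node("li"), Node("li")])
page_v2 = Node("ul", [Node("li"), Node("p")])
same = clustered_tree_matching(page_v1, page_v1)     # identical trees
changed = clustered_tree_matching(page_v1, page_v2)  # one child replaced
print(same, changed)
```

On these two toy trees the score is 1.0 for identical structures and 0.5 when one of the two children no longer matches, which gives a feel for how a similarity threshold would be applied to the result.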

The main difference between simple and clustered tree matching is the way values are assigned to matching elements. The former adopts a fixed matching value of 1; the latter, instead, computes some additional information, retrieved from the sub-trees of the matched nodes.

Figure 1: Two labeled trees, A and B, which show similarities in their structures.

Omitting the details, provided in (Ferrara et al., 2010), the clustered tree matching algorithm assigns a weighted value equal to 1 divided by the greater of the numbers of siblings of the two compared nodes (each count including the node itself).

Figure 1 shows two similar simple rooted, labeled trees, together with the weights that would be assigned by applying clustered tree matching to them.

3.2.1 Motivations

Several motivations led us to introduce these improvements. Considering the characteristics commonly shown by Web pages provides some useful hints: rich sub-levels (i.e. sub-levels with several nodes) usually represent list items, table rows, menus, etc., which are affected by modifications more frequently than other elements of Web pages; moreover, analyzing which kinds of modification usually affect Web pages suggests assigning less importance to slight changes in deep sub-levels of the DOM tree, because these commonly correspond to details being added to or removed from elements.

Simple tree matching ignores these important aspects, whereas clustered tree matching exploits information such as the position and number of mismatches to produce a more accurate result.

3.2.2 Advantages and Limitations

The main advantage of our clustered tree matching is its capability to calculate an absolute measure of similarity, while simple tree matching produces the mapping value between two trees. Moreover, the more complex and similar the structures of the considered trees are, the more accurate the measure of similarity established by this algorithm will be. It is particularly well suited to matching up HTML Web pages, because they often have rich and varied structures.

One important limitation of algorithms based on tree matching is that they cannot match permutations of nodes. Intuitively, this is a consequence of the dynamic programming technique used to solve the problem efficiently: both algorithms execute recursive calls, scanning sub-trees sequentially, so as to reduce the number of required iterations (e.g. in Figure 1, the permutation of nodes [c,b] in A with [b,c] in B is not computed). It is possible to modify the algorithm by introducing the analysis of permutations of sub-trees, but this would heavily affect performance. Despite this intrinsic limit, the technique fits our purpose of measuring the similarity of HTML trees very well.
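The permutation limitation can be reproduced on a toy pair of trees in the spirit of the [c,b] versus [b,c] case mentioned above. The sketch below is a plain implementation of simple tree matching with a fixed matching value of 1; the tuple representation `(label, children)` and the two example trees are assumptions made for illustration and are not the trees of Figure 1.

```python
def simple_tree_matching(a, b):
    """Size of the maximum order-preserving matching between two trees.

    Trees are (label, children) tuples; every matched pair of nodes counts 1.
    """
    if a[0] != b[0]:
        return 0
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    # M[i][j]: best matching between the first i children of a and the
    # first j children of b, scanned strictly left to right.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(kids_a[i - 1], kids_b[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1


tree_a = ("a", [("c", []), ("b", [])])   # children in order [c, b]
tree_b = ("a", [("b", []), ("c", [])])   # same children, order [b, c]

print(simple_tree_matching(tree_b, tree_b))  # all three nodes match
print(simple_tree_matching(tree_a, tree_b))  # only the root and one child match
```

Because the dynamic programming table only combines matches that keep the left-to-right order of siblings, the swapped pair contributes a single match instead of two, even though both trees contain exactly the same nodes.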

It is important to remark that simple tree matching produces a more accurate result when the trees being compared are simple and quite different. Nevertheless, because most modifications in Web pages are slight changes, clustered tree matching is by far the better algorithm to adopt when building automatically adaptable wrappers. Moreover, this algorithm makes it possible to establish a custom level of accuracy for the matching process, by defining a similarity threshold required to match two trees.

4 ADAPTABLE WEB WRAPPERS

Based on the adaptation algorithms described above, a proof-of-concept extension to the Lixto Visual Developer (VD) has been implemented. Wrappers are automatically adapted based on given configuration settings, integrity constraints, and triggers.

Figure 2: State diagram of Web wrappers design and adaptation in the Lixto Visual Developer.

Usually, wrapper generation in VD is a hierarchical top-down process: e.g. first a “hotel record” is characterized and then, inside the hotel record, entities such as “rating” and “room types”. Such entities are referred to as patterns. To define a pattern, the wrapper designer visually selects an example and, together with system suggestions, generalizes the rule configuration until the desired instances are matched.

In this extension, to support the automatic adaptation process at runtime, the wrapper designer further specifies what it means for extraction to have failed. In general, this means wrong or missing data, and integrity constraints can be used to indicate what correct results look like. Typical integrity constraints are:

• Occurrence restrictions: e.g. minimum and/or maximum number of allowed occurrences of a pattern instance, minimum and/or maximum number of child pattern instances;

• Data types: e.g. all results of a “price” pattern need to be of data type integer.

Integrity constraints can be specified for each pattern individually or be based on a data model (in our case, a given XML Schema). If integrity constraints are violated at runtime, the adaptation process for that particular pattern is started.
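The two kinds of integrity constraint listed above can be sketched as a small check that decides whether adaptation should be started for a pattern. Everything in this example is hypothetical: the `IntegrityConstraint` class, its field names and the `violates` function are illustrative and do not reflect how the Lixto Visual Developer actually represents constraints.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IntegrityConstraint:
    """Hypothetical per-pattern constraint: occurrence bounds and a data type."""
    pattern: str
    min_occurs: int = 0
    max_occurs: Optional[int] = None   # None means "no upper bound"
    data_type: type = str              # callable used to parse each value


def violates(constraint, values):
    """Return True if the extracted values break the constraint."""
    count = len(values)
    if count < constraint.min_occurs:
        return True
    if constraint.max_occurs is not None and count > constraint.max_occurs:
        return True
    for value in values:
        try:
            constraint.data_type(value)   # e.g. int("120") succeeds
        except (TypeError, ValueError):
            return True                   # e.g. int("n/a") fails
    return False


price = IntegrityConstraint("price", min_occurs=1, data_type=int)
print(violates(price, ["120", "95"]))  # well-formed results
print(violates(price, []))             # missing data
print(violates(price, ["n/a"]))        # wrong data type
```

In a setting like the one described, a `True` result for a pattern would be the signal that triggers the adaptation process for that pattern only.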

During wrapper creation, the application designer provides a number of configuration settings for this process. These include:

• Threshold values;

• Priorities/order of the adaptation algorithms used;

• Flags of the chosen algorithm (e.g. using the HTML element name as node label, using id/class attributes as node labels, etc.);

• Triggers for bottom-up, top-down and process flow adaptation bubbling;

• Whether stored tree-grams and XPath statements are updated based on adaptation results, to be additionally used as inputs in future adaptation procedures (reflecting and addressing regular slight changes of a Web page over time).

The algorithms used for adaptation rely on two inputs (the stored example tree-gram(s) and the DOM tree of the current page) and provide as output the sub-trees that are sufficiently similar to the original (example) ones and, in consequence, a generated XPath statement that matches the nodes (Fig. 2 summarizes the process from the design-time and execution-time perspectives).
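The input/output contract just described (a stored tree-gram and the current DOM tree in, sufficiently similar sub-trees and a matching XPath statement out) can be sketched as follows. This is only an illustration under stated assumptions: trees are `(label, children)` tuples, `similarity` is a compact restatement of clustered tree matching, and the positional XPath strings built by the hypothetical `find_candidates` function are one possible way of addressing the matched nodes, not necessarily the one used by the system.

```python
def similarity(a, b, sib_a=1, sib_b=1):
    """Clustered-tree-matching style score between two (label, children) trees."""
    if a[0] != b[0]:
        return 0.0
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = similarity(kids_a[i - 1], kids_b[j - 1], m, n)
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    weight = 1.0 / max(sib_a, sib_b)
    return M[m][n] * weight if m > 0 and n > 0 else M[m][n] + weight


def find_candidates(tree_gram, page, threshold):
    """Return (xpath, score) pairs for sub-trees of page scoring >= threshold."""
    found = []

    def walk(node, path):
        score = similarity(tree_gram, node)
        if score >= threshold:
            found.append((path, score))
        seen = {}  # per-label position counter, for positional XPath steps
        for child in node[1]:
            seen[child[0]] = seen.get(child[0], 0) + 1
            walk(child, f"{path}/{child[0]}[{seen[child[0]]}]")

    walk(page, "/" + page[0])
    return sorted(found, key=lambda item: item[1], reverse=True)


# Stored example: a list with three items. Current page: the list has
# gained a fourth item and sits deeper in the document.
stored = ("ul", [("li", []) for _ in range(3)])
current = ("html", [("body", [("div", [("ul", [("li", []) for _ in range(4)])])])])

print(find_candidates(stored, current, threshold=0.7))
print(find_candidates(stored, current, threshold=0.8))
```

With a threshold of 0.7 the modified list is still recognized (score 0.75) and addressed as `/html/body[1]/div[1]/ul[1]`; raising the threshold to 0.8 rejects it, which mirrors the role of the threshold values among the configuration settings.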

Algorithms under consideration include the clustered tree matching discussed above, as well as tree-based variants of the Bigram (Collins, 1996) and Jaro-Winkler (Winkler, 1999) similarity measures (which are advantageous when permutations of the tree nodes are assumed to be likely over time). Moreover, for the extraction of leaf nodes, which exhibit no inherent tree structure, we rely on string similarity metrics. Finally, triggers in the adaptation settings can be used to force the adaptation of further fragments of the wrapper:

• Top-down: forcing adaptation of all/some descendant patterns (e.g. adapting the “price” pattern as well, to identify prices within a record, if the “record” pattern was adapted).

• Bottom-up: forcing adaptation of a parent pattern in case adaptation of a particular pattern was not successful. Experimental evaluation pointed out that in such cases the problem often is that the parent pattern already provides wrong or missing results (even if these satisfy the integrity constraints) and has to be adapted first.

• Process flow: it might happen that particular patterns are no longer detected because the wrapper evaluates on the wrong page. Hence, there is a need to use variations in the deep web navigation processes. A simple approach explored at this time is to use a switch-window or back-step action to check whether the previous window or another tab/pop-up provides the required information.
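For the leaf-node case mentioned above, where only string similarity is available, one common bigram-based measure is the Dice coefficient over character bigrams. The text does not state which string metrics the system actually uses, so the following sketch, including the `bigram_similarity` name and the case-insensitive comparison, is an assumption chosen for illustration.

```python
from collections import Counter


def char_bigrams(text):
    """Multiset of overlapping two-character substrings, case-insensitive."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def bigram_similarity(a, b):
    """Dice coefficient over character bigrams, in the range [0, 1]."""
    x, y = char_bigrams(a), char_bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Strings too short to contain a bigram: fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    shared = sum((x & y).values())  # multiset intersection
    return 2.0 * shared / total


print(bigram_similarity("night", "nacht"))      # one shared bigram out of eight
print(bigram_similarity("Price", "price"))      # identical up to case
print(bigram_similarity("EUR 120", "EUR 125"))  # mostly shared bigrams
```

Unlike the tree matching algorithms, this measure ignores the order in which bigrams occur, which is why bigram-style metrics are attractive when permutations are expected.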

5 EXPERIMENTAL RESULTS

The best way of measuring the reliability of automatically adaptable wrappers is to test their behavior in real-world use cases. Several common areas of application of Web wrappers have been identified: social networks and bookmarking, retail market and comparison shopping, Web search and information distribution, and, finally, Web communities. For each of these fields, we designed a test using a representative Website, studying a total of 7 use cases and defining wrappers applied to 70 Web pages. Websites like Facebook, Google News, Ebay, etc. are subject to countless, although often invisible, structural modifications, which alter the correct behavior of Web wrappers. Table 1 summarizes the results: each wrapper automatically tries to adapt itself using both the algorithms described in Section 3. The column labeled thresh. gives the threshold value of similarity required to match two elements. Columns tp, fp and fn give the numbers of true positives, false positives and false negatives, the measures usually adopted to evaluate the precision and recall of this kind of task.

The performance obtained using simple and clustered tree matching is, respectively, good and excellent; clustered tree matching is definitely a viable solution for automatically adapting wrappers with a high degree of reliability (F-Measure greater than 98%). The system also makes it possible to improve these results by including additional checks on comparable attributes (e.g. id, name or class). The role of the required degree of accuracy is fundamental; the experimental results highlight the following considerations: very high threshold values can result in false negatives (e.g. the Google news and Google.com scenarios), while low values can result in false positives (e.g. the Facebook scenario). Our solution, which exploits the clustered tree matching algorithm we designed, helps to reduce wrapper maintenance tasks, bearing in mind that, in cases of deep structural changes, manual intervention may still be required to fix a specific wrapper, since it is impossible to handle all possible malfunctions automatically.

Table 1: Evaluation of the reliability of automatically adaptable wrappers applied to real world scenarios.

                         Simple T. M.          Clustered T. M.
                         Precision/Recall      Precision/Recall
Scenario      thresh.    tp      fp    fn      tp      fp    fn
Delicious     40%        100     4     -       100     -     -
Ebay          85%        200     12    -       196     -     4
Facebook      65%        240     72    -       240     12    -
Google news   90%        604     -     52      644     -     12
Google.com    80%        100     -     60      136     -     24
Kelkoo        40%        60      4     -       58      -     2
Techcrunch    85%        52      -     28      80      -     -
Total         -          1356    92    140     1454    12    42
Recall        -          90.64%                97.19%
Precision     -          93.65%                99.18%
F-Measure     -          92.13%                98.18%
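The summary rows of Table 1 follow from the tp, fp and fn totals through the standard definitions of precision, recall and F-Measure. As a short check, the sketch below recomputes them for the clustered tree matching totals (tp = 1454, fp = 12, fn = 42); the function name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: P = tp/(tp+fp), R = tp/(tp+fn), F = 2PR/(P+R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Totals for clustered tree matching, taken from Table 1.
p, r, f = precision_recall_f1(tp=1454, fp=12, fn=42)
print(f"precision={p:.2%} recall={r:.2%} f-measure={f:.2%}")
# prints: precision=99.18% recall=97.19% f-measure=98.18%
```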

6 CONCLUSIONS

In this paper we described several novel ideas, investigating the possibility of designing smart Web wrappers that automatically react to structural modifications of the underlying Web pages and adapt themselves to avoid malfunctions or the corruption of extracted data. After explaining the core algorithms on which this system relies, we showed how to implement this feature in Web wrappers. Finally, we analyzed the performance of this system through rigorous testing of the behavior of automatically adaptable wrappers in real-world use cases.

This work opens new scenarios in wrapper adaptation techniques and leaves room for several improvements. First of all, overcoming some limitations of the matching algorithms with computationally efficient solutions, for example the inability to handle permutations of nodes explained previously, could be important for improving the robustness of wrappers. Another limitation of the adopted tree matching algorithms is that they do not work very well if levels of nodes are added or removed. We have already investigated the possibility of adopting different tree similarity algorithms that work better in such cases. We could try to “generalize” other similarity metrics on strings, such as the n-gram distance and the Jaro-Winkler distance. Implementing these two metrics does not require dynamic programming and might be computationally efficient; in particular, variants of the Bigram distance on trees might work well with permutations of groups of nodes, and the Jaro-Winkler distance could better reflect missing or added node levels. Another idea is to investigate the possibility of improving the matching criteria by including additional information to be compared during the tree match-up process (e.g. full path information, all attributes, etc.), and then exploiting logic-based rules (e.g. regular expressions, string edit distance, and so on) to analyze textual properties.

Finally, the tree-grammar, already exploited to store a lightweight signature of the structure of the elements identified by the wrapper, could be extended to classify the topologies of templates frequently shown by Web pages, in order to define standard protocols of automatic adaptation in these particular contexts. Adaptation in deep web navigation is a different topic from adaptation on a particular page, but it is also extremely important for wrapper adaptation. Future work will include investigating focused spidering techniques: instead of explicitly modeling a work flow on a Web page (form fill-out, button clicks, etc.), we develop a tree-grammar based approach that decides, for a given Web page, which template it matches best and executes the data extraction rules defined for this template. Navigation steps are carried out implicitly by following all links and DOM events that have been defined as interesting, crawling a site in a focused way to find the relevant information.

In conclusion, the system for designing automatically adaptable wrappers described in this paper has proved to be robust and reliable. The clustered tree matching algorithm is highly extensible and could be adopted for different tasks, including ones not strictly related to Web wrappers (e.g. any operation that requires matching up trees could exploit this algorithm).

REFERENCES

Baumgartner, R., Gottlob, G., and Herzog, M. (2009). Scalable web data extraction for online market intelligence. Proc. VLDB Endow., 2(2):1512–1523.

Bille, P. (2005). A survey on tree edit distance and related problems. Theoretical Computer Science, 337(1-3):217–239.

Chidlovskii, B. (2001). Automatic repairing of web wrappers. In Proc. of the 3rd international workshop on Web information and data management, pages 24–30.

Collins, M. J. (1996). A new statistical parser based on bigram lexical dependencies. In Proc. of the 34th Annual Meeting on Association for Computational Linguistics, pages 184–191, Morristown, NJ, USA.

Ferrara, E. and Baumgartner, R. (2010). Automatic Wrapper Adaptation by Tree Edit Distance Matching (to appear). Smart Innovation, Systems and Technologies. Springer-Verlag.

Ferrara, E., Fiumara, G., and Baumgartner, R. (2010). Web Data Extraction, Applications and Techniques: A Survey. Technical report.

Kim, Y., Park, J., Kim, T., and Choi, J. (2008). Web information extraction by HTML tree edit distance matching. In Convergence Information Technology, 2007. International Conference on, pages 2455–2460.

Kowalkiewicz, M., Kaczmarek, T., and Abramowicz, W. (2006). MyPortal: robust extraction and aggregation of web content. In Proc. of the 32nd international conference on Very large data bases, pages 1219–1222.

Laender, A., Ribeiro-Neto, B., da Silva, A., and Teixeira, J. S. (2002). A brief survey of web data extraction tools. ACM SIGMOD Record, 31(2):84–93.

Lerman, K., Minton, S., and Knoblock, C. (2003). Wrapper maintenance: A machine learning approach. Journal of Artificial Intelligence Research, 18(2003):149–181.

Meng, X., Hu, D., and Li, C. (2003). Schema-guided wrapper maintenance for web-data extraction. In Proc. of the 5th ACM international workshop on Web information and data management, pages 1–8, NY, USA.

Raposo, J., Pan, A., Álvarez, M., and Viña, A. (2005). Automatic wrapper maintenance for semi-structured web sources using results from previous queries. SAC ’05: Proc. of the 2005 ACM symposium on Applied computing, pages 654–659.

Selkow, S. (1977). The tree-to-tree editing problem. Information Processing Letters, 6(6):184–186.

Tai, K. (1979). The tree-to-tree correction problem. Journal of the ACM (JACM), 26(3):433.

Winkler, W. E. (1999). The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Census Bureau.

Wong, T. and Lam, W. (2005). A probabilistic approach for adapting information extraction wrappers and discovering new attributes. In ICDM’04. Proc. of the fourth IEEE International Conference on Data Mining, pages 257–264.

Yang, W. (1991). Identifying syntactic differences between two programs. Software - Practice and Experience, 21(7):739–755.

Zhai, Y. and Liu, B. (2005). Web data extraction based on partial tree alignment. In WWW ’05: Proc. of the 14th International Conference on World Wide Web, pages 76–85, New York, NY, USA.
