on strings, such as the n-gram distance and the Jaro-Winkler
distance. Implementing these two metrics does not require
dynamic programming and might be computationally efficient;
in particular, variants of the bigram distance on trees might
work well with permutations of groups of nodes, and the
Jaro-Winkler distance could better reflect missing or added
node levels. Another idea is to investigate whether the
matching criteria can be improved by comparing additional
information during the tree matching process (e.g. full path
information, all attributes, etc.), and then exploiting
logic-based rules (e.g. regular expressions, string edit
distance, and so on) to analyze textual properties.
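As an illustration of why these two metrics need no dynamic programming, the following minimal sketch implements them on generic sequences (strings, or lists of node labels). It is not the implementation proposed in this paper: the function names, the use of the Dice coefficient for the n-gram distance, and the conventional Winkler parameters (prefix scale 0.1, prefix length capped at 4) are assumptions made for the example.

```python
from collections import Counter


def ngrams(seq, n=2):
    """Return the contiguous n-grams of a sequence (string or list)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def ngram_distance(a, b, n=2):
    """1 - Dice coefficient over the multisets of n-grams (one common definition)."""
    ga, gb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        # Both sequences are shorter than n: fall back to plain equality.
        return 0.0 if a == b else 1.0
    shared = sum((ga & gb).values())  # multiset intersection
    return 1.0 - 2.0 * shared / total


def jaro(a, b):
    """Jaro similarity in [0, 1]; 1.0 means identical sequences."""
    if a == b:
        return 1.0
    la, lb = len(a), len(b)
    if la == 0 or lb == 0:
        return 0.0
    window = max(max(la, lb) // 2 - 1, 0)
    a_matched, b_matched = [False] * la, [False] * lb
    matches = 0
    for i in range(la):
        # Look for an unmatched equal item of b inside the sliding window.
        for j in range(max(0, i - window), min(lb, i + window + 1)):
            if not b_matched[j] and a[i] == b[j]:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched items that appear in a different order in a and b.
    k = 0
    out_of_order = 0
    for i in range(la):
        if a_matched[i]:
            while not b_matched[k]:
                k += 1
            if a[i] != b[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / la + matches / lb + (matches - t) / matches) / 3.0


def jaro_winkler(a, b, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    sim = jaro(a, b)
    prefix = 0
    for x, y in zip(a, b):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)


if __name__ == "__main__":
    print(ngram_distance("night", "nacht"))  # 0.75
    # The same functions accept sequences of node labels, e.g. sibling tags.
    print(ngram_distance(["html", "body", "div", "p"], ["html", "body", "p", "div"]))
    print(round(jaro_winkler("MARTHA", "MARHTA"), 3))
```

Both functions run in a single pass over windows of the input rather than filling a cost table, which is the property the paragraph above refers to. How best to lift them from sequences to trees is left open here, as it is in the text.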
Finally, the tree-grammar, already exploited to store a
lightweight signature of the structure of elements
identified by the wrapper, could be extended to classify
the template topologies that Web pages frequently exhibit,
in order to define standard protocols of automatic
adaptation in these particular contexts.
Adaptation in deep Web navigation is a different topic from
adaptation on a particular page, but it is also extremely
important for wrapper adaptation. Future work will
investigate focused spidering techniques: instead of
explicitly modeling a workflow on a Web page (form fill-out,
button clicks, etc.), we will develop a tree-grammar based
approach that decides for a given Web page which template it
matches best and executes the data extraction rules defined
for this template. Navigation steps are carried out
implicitly by following all links and DOM events that have
been defined as interesting, crawling a site in a focused
way to find the relevant information.
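The template-selection step described above could be sketched as follows. This is only an illustration under stated assumptions: the tree-grammar is not specified here, so the sketch substitutes a much simpler signature (a multiset of root-to-node tag paths) compared with a Dice overlap. The names `path_signature`, `similarity`, `best_template`, the `(tag, children)` tree encoding and the 0.5 threshold are all hypothetical.

```python
from collections import Counter


def path_signature(tree, prefix=()):
    """Count every root-to-node tag path in a (tag, children) tree."""
    tag, children = tree
    path = prefix + (tag,)
    sig = Counter([path])
    for child in children:
        sig += path_signature(child, path)
    return sig


def similarity(sig_a, sig_b):
    """Dice overlap between two path multisets, in [0, 1]."""
    total = sum(sig_a.values()) + sum(sig_b.values())
    if total == 0:
        return 1.0
    return 2.0 * sum((sig_a & sig_b).values()) / total


def best_template(page, templates, threshold=0.5):
    """Return the name of the best-matching template, or None if no match is good enough."""
    page_sig = path_signature(page)
    best_name, best_score = None, 0.0
    for name, sig in templates.items():
        score = similarity(page_sig, sig)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Two toy templates: a result listing and a detail page.
listing = ("html", [("body", [("ul", [("li", []), ("li", []), ("li", [])])])])
detail = ("html", [("body", [("h1", []),
                             ("table", [("tr", [("td", []), ("td", [])])])])])
templates = {
    "listing": path_signature(listing),
    "detail": path_signature(detail),
}

# A crawled page with two list items instead of three still maps to "listing".
page = ("html", [("body", [("ul", [("li", []), ("li", [])])])])
print(best_template(page, templates))
```

In a focused crawler, the returned template name would select which extraction rules to run on the page, and a `None` result would mark the page as not relevant.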
In conclusion, the system for designing automatically
adaptable wrappers described in this paper has proven to be
robust and reliable. The clustered tree matching algorithm
is highly extensible and could be adopted for different
tasks, including ones not strictly related to Web wrappers
(e.g. any operation that requires matching up trees could
exploit this algorithm).
REFERENCES
Baumgartner, R., Gottlob, G., and Herzog, M. (2009). Scal-
able web data extraction for online market intelli-
gence. Proc. VLDB Endow., 2(2):1512–1523.
Bille, P. (2005). A survey on tree edit distance and re-
lated problems. Theoretical Computer Science, 337(1-
3):217–239.
Chidlovskii, B. (2001). Automatic repairing of web wrap-
pers. In Proc. of the 3rd international workshop on
Web information and data management, pages 24–30.
Collins, M. J. (1996). A new statistical parser based on
bigram lexical dependencies. In Proc. of the 34th An-
nual Meeting on Association for Computational Lin-
guistics, pages 184–191, Morristown, NJ, USA.
Ferrara, E. and Baumgartner, R. (2010). Automatic Wrap-
per Adaptation by Tree Edit Distance Matching (to ap-
pear). Smart Innovation, Systems and Technologies.
Springer-Verlag.
Ferrara, E., Fiumara, G., and Baumgartner, R. (2010). Web
Data Extraction, Applications and Techniques: A Sur-
vey. Technical report.
Kim, Y., Park, J., Kim, T., and Choi, J. (2008). Web infor-
mation extraction by HTML tree edit distance match-
ing. In Convergence Information Technology, 2007.
International Conference on, pages 2455–2460.
Kowalkiewicz, M., Kaczmarek, T., and Abramowicz, W.
(2006). MyPortal: robust extraction and aggregation
of web content. In Proc. of the 32nd international con-
ference on Very large data bases, pages 1219–1222.
Laender, A., Ribeiro-Neto, B., Silva, A. D., and Teixeira, J. S.
(2002). A brief survey of web data extraction tools. ACM
Sigmod, 31(2):84–93.
Lerman, K., Minton, S., and Knoblock, C. (2003). Wrapper
maintenance: A machine learning approach. Journal
of Artificial Intelligence Research, 18(2003):149–181.
Meng, X., Hu, D., and Li, C. (2003). Schema-guided wrap-
per maintenance for web-data extraction. In Proc. of
the 5th ACM international workshop on Web informa-
tion and data management, pages 1–8, NY, USA.
Raposo, J., Pan, A., Álvarez, M., and Viña, A. (2005).
Automatic wrapper maintenance for semi-structured web
sources using results from previous queries. SAC ’05:
Proc. of the 2005 ACM symposium on Applied computing,
pages 654–659.
Selkow, S. (1977). The tree-to-tree editing problem.
Information Processing Letters, 6(6):184–186.
Tai, K. (1979). The tree-to-tree correction problem. Journal
of the ACM (JACM), 26(3):433.
Winkler, W. E. (1999). The state of record linkage and cur-
rent research problems. Technical report, Statistical
Research Division, U.S. Census Bureau.
Wong, T. and Lam, W. (2005). A probabilistic approach
for adapting information extraction wrappers and dis-
covering new attributes. In ICDM’04. Proc. of the
fourth IEEE International Conference on Data Min-
ing, pages 257–264.
Yang, W. (1991). Identifying syntactic differences between
two programs. Software - Practice and Experience,
21(7):739–755.
Zhai, Y. and Liu, B. (2005). Web data extraction based on
partial tree alignment. In WWW ’05: Proc. of the 14th
International Conference on World Wide Web, pages
76–85, New York, NY, USA.
DESIGN OF AUTOMATICALLY ADAPTABLE WEB WRAPPERS