Authors:
Julián Grigera (1, 2, 3); Juan Cruz Gardey (2, 3); Alejandra Garrido (2, 3) and Gustavo Rossi (2, 3)
Affiliations:
(1) CICPBA, Argentina; (2) CONICET, Argentina; (3) LIFIA, Facultad de Informática, Universidad Nacional de La Plata, La Plata, CP 1900, Argentina
Keyword(s):
Information Extraction, Web Adaptation, Refactoring for Usability.
Abstract:
Most documents on the Web are generated from templates that define user interface (UI) elements and are later filled with content. In the field of information extraction, many approaches have emerged to analyze document structure, identify features shared across documents, and generate wrappers that extract the raw content from them. As a result, most techniques documented in the literature are optimized to compare full documents, yet other application areas, such as web augmentation or transcoding, require analyzing structural similarity on smaller UI components. In this paper we present two flexible algorithms that measure similarity between DOM elements using a mixed approach that considers both an element's location and its inner structure. The proposed algorithms were used in two projects: an approach for automatic usability refactoring and a web accessibility helper. We also present a wrapper induction technique based on these algorithms. Additionally, we present a precision and recall evaluation that compares our algorithms with other known approaches on DOM elements of different sizes, all smaller than full documents. The proposed algorithms run in linear time, so they are faster than most approaches that analyze structural similarity.