of interest are located in the page. This phase has two steps: object-rich subtree extraction and object separator extraction. Finally, the objects of interest are extracted using the result of the previous phase.
The main concern of this paper is the identification of object separators. To this end, five heuristics and their combinations are compared. The combination of all five heuristics shows the best results on cached pages from 50 different web sites.
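As a rough illustration of how such heuristics can be combined, the sketch below averages the votes of two hypothetical separator heuristics over the candidate tags of a subtree; the five heuristics actually evaluated are not detailed in this summary, so both scoring functions are stand-ins.

```python
# Two hypothetical separator heuristics combined by averaging their scores;
# the five heuristics evaluated in the paper are not detailed here.
def repeated_tag_score(tag, siblings):
    # Heuristic: a tag repeated among siblings often separates objects.
    return siblings.count(tag) / len(siblings)

def presentation_tag_score(tag, siblings):
    # Heuristic: presentation tags such as <hr> frequently act as separators.
    return 1.0 if tag in ("hr", "br", "table") else 0.0

HEURISTICS = [repeated_tag_score, presentation_tag_score]

def separator_score(tag, siblings):
    # Combination: the average vote of all heuristics.
    return sum(h(tag, siblings) for h in HEURISTICS) / len(HEURISTICS)

siblings = ["tr", "tr", "tr", "hr"]
best = max(set(siblings), key=lambda t: separator_score(t, siblings))
print(best)   # the most plausible object separator among the sibling tags
```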
2.2 More
Gaeremynck et al. (2003) focus on discovering the models behind web forms. The main challenge they address is discovering the relationships between strings and widgets. These relations are mutually exclusive, as they assume that each entity (string or interactor) plays a unique role; for example, a string cannot be both a caption and a hint for an interactor at the same time.
The starting point of the study is a collection of facts extracted from the web page: descriptions of the entities (“S1 is a string” or “I2 is an interactor”), relationships between those entities (“S1 is a caption for I2” or “S1 is a hint for I2”), and so on. Facts are then manipulated through a forward-chaining rule system. Three types of rules are defined: deduction rules to produce new facts from selected facts, exclusion rules to determine whether two facts are mutually exclusive, and scoring rules to rank facts depending on the properties of the interactors involved.
The model recovery algorithm can be summed up as follows. As long as unprocessed facts remain, new facts are created by applying the deduction rules. Among the resulting facts, those that, when combined, are least likely to create future conflicts (or exclusions) are selected first. A second selection among them uses the scoring rules to keep the largest set of compatible facts. Only the facts that pass both selections expand the fact set; the remaining created facts are discarded and the loop continues.
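The following Python sketch shows this kind of forward-chaining loop under simplified assumptions: the Fact structure, the exclusion test (one role per string), and the scoring and deduction rules are illustrative placeholders rather than the rules defined by Gaeremynck et al.

```python
# A forward-chaining sketch; Fact, excluded(), score(), and deduce() are
# simplified placeholders, not the rules defined by Gaeremynck et al.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    relation: str   # e.g. "caption-for" or "hint-for"
    string: str     # string entity, e.g. "S1"
    widget: str     # interactor entity, e.g. "I2"

def excluded(a: Fact, b: Fact) -> bool:
    # Exclusion rule: each string plays a unique role, so two distinct
    # facts about the same string conflict.
    return a.string == b.string and a != b

def score(f: Fact) -> int:
    # Scoring rule (hypothetical): prefer captions over hints.
    return 2 if f.relation == "caption-for" else 1

def recover_model(initial, deduce):
    accepted = set(initial)
    frontier = set(initial)
    while frontier:                       # loop while unprocessed facts remain
        candidates = deduce(frontier) - accepted
        # First selection: drop candidates conflicting with accepted facts.
        compatible = {c for c in candidates
                      if not any(excluded(c, a) for a in accepted)}
        # Second selection: greedily keep the largest high-scoring set of
        # mutually compatible facts.
        kept = set()
        for c in sorted(compatible, key=score, reverse=True):
            if not any(excluded(c, k) for k in kept):
                kept.add(c)
        accepted |= kept                  # only selected facts expand the set
        frontier = kept                   # discarded facts are not revisited
    return accepted

def deduce(facts):
    # Hypothetical deduction rule: every caption also suggests a hint fact.
    return {Fact("hint-for", f.string, f.widget)
            for f in facts if f.relation == "caption-for"}

print(recover_model({Fact("caption-for", "S1", "I2")}, deduce))
```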
The result is a list of relations between strings and interactors, which can then be used to split a form.
2.3 Web RevEnge
Web RevEnge (Paganelli and Paternò, 2003) was developed to automatically extract task models from a web application, i.e., multiple web pages.
In order to do so, each page is processed individually first. The DOM of the page is parsed to find links, interaction objects (such as <input> tags), their groupings (forms, radio button groups), and finally frames. As the task models are represented in ConcurTaskTrees (Paternò, 2000), the task model representation of each page is a graph with a root element and link nodes to other pages.
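A minimal sketch of the per-page scanning step, using only the Python standard library (the actual extraction logic of Web RevEnge is not specified at this level of detail):

```python
# Collect links, interaction objects, groupings, and frames from one page;
# a rough stand-in for Web RevEnge's per-page DOM analysis.
from html.parser import HTMLParser

class PageScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.inputs, self.forms, self.frames = [], [], [], []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and "href" in a:
            self.links.append(a["href"])              # navigation to other pages
        elif tag in ("input", "select", "textarea"):
            self.inputs.append((tag, a.get("name")))  # interaction objects
        elif tag == "form":
            self.forms.append(a.get("action"))        # grouping of interactors
        elif tag in ("frame", "iframe"):
            self.frames.append(a.get("src"))          # frames

scanner = PageScanner()
scanner.feed('<form action="/search"><input name="q"></form>'
             '<a href="page2.html">next</a>')
print(scanner.links, scanner.inputs, scanner.forms)
```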
To build the task model of the whole web application, the process uses the home page as its starting point. Since all links are represented in the task model, the internal links (those within the same site) are replaced with the task models of the pages they target.
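The assembly step can be pictured as a recursive substitution of internal link nodes, sketched below with a dictionary-based page model; the node structure is hypothetical and omits ConcurTaskTrees operators.

```python
# Replace internal link nodes with the task model of the target page,
# starting from the home page; a rough stand-in for the assembly step.
def build_app_model(url, page_models, visited=None):
    """page_models maps a URL to (root_task, internal_links) for that page."""
    visited = visited or set()
    if url in visited:                 # avoid looping on cyclic links
        return {"task": url, "children": []}
    visited.add(url)
    root_task, internal_links = page_models[url]
    children = [build_app_model(target, page_models, visited)
                for target in internal_links]
    return {"task": root_task, "children": children}

models = {"index.html": ("browse home", ["page2.html"]),
          "page2.html": ("fill form", [])}
print(build_app_model("index.html", models))
```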
2.4 WARE and WANDA
Even though presented in the same publication (Lucca and Penta, 2005), WARE and WANDA are web application reverse-engineering tools that were developed independently. The former addresses the static analysis of web applications. The latter intervenes upstream by extracting information from the PHP files.
WARE implements a two-step process. Relevant information is retrieved from the static code (mainly HTML) by extractors. Then abstractors take this result as input and abstract it. The final output is a UML representation of the web application.
WANDA does the same work but on dynamic data instead of static data. Dynamic information is collected during web application executions and serves as input to the extraction step that produces UML diagrams.
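As a toy illustration of the extractor/abstractor pipeline shared by both tools, the sketch below turns static HTML facts into UML-like associations; the function names and the intermediate representation are hypothetical.

```python
# Extractor -> abstractor pipeline sketch; the intermediate fact dictionary
# and the UML-like output format are invented for illustration.
import re

def extractor(html: str) -> dict:
    """Retrieve relevant facts from static code: linked pages and form targets."""
    return {"links": re.findall(r'href="([^"]+)"', html),
            "forms": re.findall(r'<form[^>]*action="([^"]+)"', html)}

def abstractor(facts: dict) -> list[str]:
    """Abstract raw facts into model elements, e.g. UML-like associations."""
    model = [f"Page --navigates--> {t}" for t in facts["links"]]
    model += [f"Page --submits--> {t}" for t in facts["forms"]]
    return model

html = '<a href="cart.html">cart</a><form action="checkout.php"></form>'
print(abstractor(extractor(html)))
```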
Bringing the two tools together makes it possible to identify groups of equivalent dynamically built pages, provided there are enough execution runs.
2.5 ReversiXML and TransformiXML
ReversiXML and TransformiXML (Bouillon et al.,
2005) are respectively a tool to reverse-engineer web
pages and a tool to transform abstract representations
from one context of use to another.
For this purpose, Bouillon et al. take the Cameleon framework (Calvary et al., 2003) as a reference for the development process. In order to express any abstraction level of the UI, they rely on UsiXML (http://www.usixml.org).
Regarding the reverse-engineering part, the derivation from source code to any abstraction level is performed by derivation rules, i.e., functions interpreted at design- and run-time. The output of this first stage is a UsiXML file that represents the graph of the UI at the selected abstraction level.
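A toy derivation rule might map concrete HTML elements onto abstract interaction objects, as sketched below; the element names in the mapping are illustrative and do not reproduce the actual UsiXML vocabulary or the published derivation rules.

```python
# A toy derivation rule: concrete HTML tag -> abstract interaction object.
# The target element names are illustrative, not the real UsiXML vocabulary.
import xml.etree.ElementTree as ET

DERIVATION = {
    "input": "abstractIndividualComponent",
    "a": "abstractNavigation",
    "form": "abstractContainer",
}

def derive(html_tag: str) -> ET.Element:
    # Produce a UsiXML-like abstract node recording its concrete origin.
    return ET.Element(DERIVATION.get(html_tag, "abstractObject"),
                      {"derivedFrom": html_tag})

print(ET.tostring(derive("input")).decode())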
The transformation takes place at any level of ab-
straction. As UsiXML has an underlying graph struc-
ture, the model transformation system is equivalent to
a graph transformation system based on the theory of
graph grammars.
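A minimal sketch of one graph-rewriting step in this spirit, using a plain dictionary-based graph (the node labels and the relabelling rule are invented for illustration):

```python
# A trivial graph-grammar-style rewrite: match nodes by label (left-hand
# side) and replace the label (right-hand side), preserving all edges.
def apply_rule(graph, lhs_label, rhs_label):
    nodes = {n: (rhs_label if lbl == lhs_label else lbl)
             for n, lbl in graph["nodes"].items()}
    return {"nodes": nodes, "edges": graph["edges"]}

ui = {"nodes": {1: "graphicalButton", 2: "graphicalText"},
      "edges": [(1, 2)]}
# Example transformation: retarget a graphical button to a vocal prompt.
print(apply_rule(ui, "graphicalButton", "vocalPrompt"))
```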