icated web service which takes time, effort, and re-
sources to develop. Furthermore, apart from the cost
issue, this can be an insurmountable obstacle when legacy web applications have to be integrated into a new system and modifications to the existing application cannot be made. This is, in fact, a very common situation in system integration projects, where the developers of the original application have often left the organization since the application was deployed.
Consequently, for the purposes of system integra-
tion of legacy IT systems, it is very desirable to find
ways to extract information that is intended to be used
by humans and encoded in HTML format, and present
it to remote machines in a computer-interpretable for-
mat such as XML. This is a problem that has been
under investigation in the field of information extrac-
tion. We review the existing approaches and their lim-
itations in the next section, and propose two novel al-
gorithms for semi-supervised information extraction
from web pages with lists of variable length in Sec-
tion 3. In Section 4, we describe one possible im-
plementation of the proposed methods, and its role in
a system integration solution and method. Section 5
proposes directions for expanding these solutions to a
wider class of web pages, and concludes the paper.
2 WEB INFORMATION EXTRACTION
The field of information extraction is an area of in-
formation technology that is concerned with extract-
ing useful information from natural language text that
is intended to be read and interpreted by humans.
Such text can be produced either by other humans
(e.g., a classified ad), or generated by machines, pos-
sibly using the content of a structured database (e.g., a
product description page on an e-commerce web site)
(Laender et al., 2002). Although humans do not necessarily make a significant distinction between these two cases as far as their ability to interpret the text is concerned, the difference between them is of enormous importance for the success of interpreting such text by machines. Understanding free-form natural language generated by humans is a very complicated problem whose complete solution is not expected in the foreseeable future. In contrast, if
text has been generated by a machine, using a boiler-
plate template for page layout and presentation (such
as an XSLT file), and a database for actual content,
the rate of success of automated information extrac-
tion methods can be very high. This type of text is
often called semi-structured data, and due to its high
practical significance, is the focus of this paper.
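To make the distinction concrete, the following sketch shows how a boilerplate template filled from database records yields highly regular markup; the template, the field names, and the records are assumptions invented purely for this illustration.

from string import Template

# Hypothetical boilerplate template for one entry of a product listing;
# only the placeholder values change from record to record
# ($$ escapes a literal dollar sign in string.Template).
row = Template("<b>$name</b> <i>$$$price</i>")

# Records as they might come from the backing database.
records = [
    {"name": "Olympus E-420", "price": "399.00"},
    {"name": "Nikon D60", "price": "549.95"},
]

# The generated markup is perfectly regular, which is what makes automated
# extraction from such semi-structured pages tractable.
for record in records:
    print(row.substitute(record))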
The usual method for extracting information from
web pages that are output by legacy applications is by
means of programs called wrappers (van den Heuvel
and Thiran, 2003). The simplest approach is to write
such wrappers manually, for example using a general-
purpose programming language such as Java or Perl.
Since this can be difficult, tedious, and error-prone,
various methods have been proposed for automating
the development of wrappers. Although most of these methods focus on creating extraction rules that are applied to web pages by an extraction tool, they differ significantly in how they apply the induced rules to web pages, and in how they actually induce these rules.
Regarding the first difference, some methods ap-
ply the rules directly to the stream of tokens in the
web page. In such cases, the rules can be encoded as regular expressions, as context-free grammars, or in more advanced specialized languages. One advantage
of these methods is that they can easily filter out irrel-
evant text, e.g. interstitial ads in web pages. However,
finding the rules that would extract all the needed in-
formation and only the needed information is not a
trivial problem.
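As an illustration, one such rule might look like the sketch below, which extracts name and price fields from a hypothetical machine-generated product listing like the one sketched above; the page fragment and the regular expression are assumptions made for the example, not rules produced by any particular tool discussed here.

import re

# Hypothetical fragment of a machine-generated product listing; the markup
# pattern around each data item, not the concrete values, is what the rule
# relies on.
page_text = """
<b>Olympus E-420</b> <i>$399.00</i>
<b>Nikon D60</b> <i>$549.95</i>
"""

# Extraction rule over the raw token stream, expressed as a regular
# expression: capture the text placed between the <b>...</b> and <i>...</i>
# delimiters by the page template.
rule = re.compile(r"<b>(.*?)</b>\s*<i>\$(\d+\.\d{2})</i>")

for name, price in rule.findall(page_text):
    print(name, price)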
Other methods use the fact that a web page en-
coded in HTML is not just a stream of tokens, but
has a tree-like structure. This structure is in fact rec-
ognized by web browsers when they transform the
HTML code into a Document Object Model (DOM)
tree prior to rendering it on screen. When a needed
data item can be found in one of the leaves of the
DOM tree, an extraction rule for its retrieval can be
encoded by means of a standard XPath expression
that specifies the path that has to be traversed in the
DOM tree to reach the respective leaf. Applying such
rules to new web pages in a deployed system is very
straightforward: an embedded web browser is used to
retrieve the web page and create its DOM tree, after
which the XPath expression is applied to retrieve the
data. Fully automated tools such as W4F (Sahuguet
and Azavant, 1999) and XWRAP (Liu et al., 2000)
operate on the DOM tree, although they use different languages for representing the extraction rules.
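A minimal sketch of this style of rule is shown below for the same hypothetical product listing, now laid out as a table; the lxml library merely stands in for the embedded browser that would build the DOM tree in a deployed system, and the table id and class names are assumptions made for the example.

from lxml import html  # stands in here for the embedded browser's DOM builder

# The same hypothetical product listing, now viewed as a DOM tree rather
# than as a flat stream of tokens.
page = html.fromstring("""
<html><body>
  <table id="products">
    <tr><td class="name">Olympus E-420</td><td class="price">$399.00</td></tr>
    <tr><td class="name">Nikon D60</td><td class="price">$549.95</td></tr>
  </table>
</body></html>
""")

# Extraction rules encoded as XPath expressions: each one specifies the path
# from the root of the DOM tree to the leaves holding the data of interest.
names = page.xpath('//table[@id="products"]//td[@class="name"]/text()')
prices = page.xpath('//table[@id="products"]//td[@class="price"]/text()')

for name, price in zip(names, prices):
    print(name, price)

In a deployed wrapper, such expressions would be applied to the DOM tree produced by the embedded browser rather than to a string parsed offline as in this sketch.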
Regarding the second difference among wrapper
construction tools — how extraction rules are induced
— there are several principal approaches. Supervised
methods require explicit instruction on where the data
fields are in a web page, in the form of examples. One practical way for a human user to provide such examples is to point to the data items on a rendered web page, for example by highlighting them with the mouse; the corresponding extraction rules are then generated automatically. As noted, when the extrac-