start problem. As soon as there is enough information
in the local IE service, documents should be handled
locally. The second problem concerns the combina-
tion of results from multiple IE services. An aggre-
gated result set has to be created. Third, we need to
test the overall scalability of the solution to prove that
it is able to handle millions of documents.
Our approach offers strategies to solve these three
challenges. We tested these strategies in our own IE
system and provide evaluation results on large sets of
business documents.
The remainder of this paper is organized as fol-
lows. In Section 2 we give a general definition of the
problem of extracting relevant information from business
documents. Section 3 focuses on related work
in the area of local and centralized IE and presents
open research questions regarding the combination of
these topics. In Section 4 we present an overview
of our approach called Modelspace. Section 5 goes
into more detail and gives a technical view of the strategies
and algorithms we developed to solve the introduced
problems. We show evaluation results in Section 6
and conclude the paper in Section 7.
2 PROBLEM DEFINITION
The information extraction task to be carried out for
document archiving can be described as follows:
Given any scanned, photographed or printed document,
an OCR process is called to transform the document
into a semi-structured representation, which includes
text zones. Each word from these text zones is
recognized as a pair (text, boundingbox). The bounding
box contains the relative coordinates of the upper
left and lower right corners of the rectangular area
where the word is printed on the document page. The
information extraction system takes a document in
this representation and tries to retrieve values for typ-
ical archiving index fields like document type, sender,
receiver, date or total amount.
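The document representation described above can be sketched as a small data structure. This is only an illustration: the class names, the field names, and the assumption that relative coordinates lie in the range [0, 1] are ours and are not taken from the system described in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Rectangle given by its upper left and lower right corners.

    Assumption: coordinates are relative to the page size, i.e. in [0, 1].
    """
    left: float
    top: float
    right: float
    bottom: float

    def __post_init__(self):
        for value in (self.left, self.top, self.right, self.bottom):
            if not 0.0 <= value <= 1.0:
                raise ValueError("relative coordinates must lie in [0, 1]")
        if self.left > self.right or self.top > self.bottom:
            raise ValueError("upper left corner must not exceed lower right")


@dataclass(frozen=True)
class Word:
    """One OCR result: the pair (text, boundingbox)."""
    text: str
    box: BoundingBox


# Typical archiving index fields named in the text.
INDEX_FIELDS = ("document type", "sender", "receiver", "date", "total amount")

# Hypothetical example: a word printed near the top right of a page.
word = Word("Invoice", BoundingBox(0.70, 0.05, 0.85, 0.08))
```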
The IE process within a self-learning system
usually follows two steps: classification
and functional role labeling (Saund, 2011). Classification
(or clustering, in case the class of the training documents
is not known) tries to identify a group of similar
training documents that share the same properties
as the document to be processed. Properties defining
a group may be the type, sender, period of creation
or document template. Functional role labeling uses
these similar training documents to create a knowl-
edge base of where to find the field values. For each of
the fields, the IE system finally returns a list of candidate
tuples of the form (value, score), sorted by descending
score.
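The output format described above, one candidate list per field ordered by descending score, can be sketched as follows. The names Candidate, rank_candidates and rank_all_fields are hypothetical, and the scores in the example are made up for illustration; how the scores are computed is not covered by this sketch.

```python
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    """A candidate tuple of the form (value, score)."""
    value: str
    score: float


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates sorted by descending score."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


def rank_all_fields(raw: Dict[str, List[Candidate]]) -> Dict[str, List[Candidate]]:
    """Rank the candidate list of every index field independently."""
    return {field: rank_candidates(cands) for field, cands in raw.items()}


# Hypothetical, made-up candidates for two index fields.
raw_results = {
    "total amount": [Candidate("19.00", 0.35), Candidate("119.00", 0.82)],
    "date": [Candidate("2015-01-12", 0.91)],
}
ranked = rank_all_fields(raw_results)
```

With this layout, the first element of each list is the best guess of the system for that field, and the remaining elements can be offered to the user as alternatives.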
Whenever the results are not sufficient for the user,
the user may correct them by providing the right result pairs
(text, boundingbox) for each of the recognized fields.
This feedback set is added to the knowledge base of
the IE system and improves future results.
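The feedback step can be sketched as a knowledge base that starts empty and grows with each user correction. The class KnowledgeBase and its methods are hypothetical and only illustrate the data flow; the bounding box is simplified to a plain tuple of four relative coordinates, and the way the stored examples influence later extraction is left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumption: a box is (left, top, right, bottom) in relative coordinates.
Box = Tuple[float, float, float, float]
ResultPair = Tuple[str, Box]  # the pair (text, boundingbox)


class KnowledgeBase:
    """Stores corrected result pairs per index field; starts empty."""

    def __init__(self) -> None:
        self._examples: Dict[str, List[ResultPair]] = defaultdict(list)

    def add_feedback(self, corrections: Dict[str, ResultPair]) -> None:
        """Add one feedback set: the corrected pair for each field."""
        for field, pair in corrections.items():
            self._examples[field].append(pair)

    def examples(self, field: str) -> List[ResultPair]:
        """Return a copy of all stored pairs for a field."""
        return list(self._examples.get(field, []))


kb = KnowledgeBase()
# Hypothetical correction supplied by a user for one document.
kb.add_feedback({"sender": ("ACME Ltd.", (0.10, 0.05, 0.30, 0.08))})
```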
The IE system should not depend on an initial set
of training documents nor should it use any language-
or document-dependent rules to improve extraction
results. The approach should be purely self-learning
to be able to adapt to user needs without intervention
of a system administrator.
3 RELATED WORK
Solutions for extracting relevant information out of
documents are used in many domains. These ap-
proaches can be differentiated based on the struc-
turedness of the documents. While extraction meth-
ods for unstructured documents are built on the docu-
ment’s text (Nadeau and Sekine, 2007), approaches
for semi-structured documents, like business docu-
ments (Marinai, 2008) or web pages (Chang et al.,
2006), focus mainly on the document’s layout and
structure.
Within this paper, we are going to deal with an-
other important aspect of extraction systems, namely,
the locality of the extraction knowledge. Local and
centralized information extraction approaches have
been adopted widely by existing scientific and commercial
solutions. Depending on where extraction
knowledge is stored and where information extraction
is carried out, a system performs either local or
centralized processing.
Local information extraction systems are used for
processing business documents. Large and
medium-sized organizations in particular rely on such systems,
as they perform very well with large training
sets of business documents and avoid sharing documents
with external services. Examples of such systems
are smartFIX (Klein et al., 2004) and Open Text
Capture Center (Opentext, 2012).
In centralized systems the task of information ex-
traction is shifted to a centralized component (in the
cloud), whose knowledge base is shared among and
used by all participants. Examples are commercial
services in the area of Named Entity Recognition
(e.g., AlchemyAPI (AlchemyAPI, 2013)) or Business
Document Processing (e.g., Gini (Gini, 2013)). Sci-
entific works like Edelweiss (Roussel et al., 2001)
and KIM (Popov et al., 2004) present scalable servers
for information extraction that provide well-defined
APIs to make use of the centralized knowledge base
ICEIS 2015 - 17th International Conference on Enterprise Information Systems