the system has to be (1) constant for all documents independent of the number k of similar documents in the training set and (2) close to the extraction effectivity the fully trained system reaches.
As presented in Section 5, we define our similarity function sim based on training documents with the same template and our evaluation metric p on top of the common metrics Precision, Recall, and F1 score. To compare the learning behavior of different systems, we implemented a measure called Few-Exemplar Extraction Performance (FEEP), whose calculation is presented in Equation 2. By averaging the relative performances with respect to the system's maximum performance p_max over a number of bins with k ≤ t, we obtain an indicator of how well a system works with few examples in its starting period relative to its maximum performance. Due to an often very uneven number of instances for each k, we use the average performance p_avg of the system instead of the maximum performance p_max.
\[
\mathrm{FEEP}_t = \frac{1}{t} \sum_{k=1}^{t} \frac{p_k}{p_{max}} \qquad (2)
\]
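To make the calculation concrete, the following minimal Python sketch computes FEEP from a list of per-bin performances; the function name, the reference value, and the example numbers are illustrative assumptions and not taken from the paper's evaluation.

```python
def feep(p, p_ref):
    """Few-Exemplar Extraction Performance (cf. Equation 2).

    p     -- per-bin performances p_1, ..., p_t, where p_k is the
             extraction performance with k similar training documents
    p_ref -- reference performance, i.e. p_max in Equation 2 or,
             as described in the text, the average performance p_avg
    """
    t = len(p)
    return sum(p_k / p_ref for p_k in p) / t


# Hypothetical per-bin values for t = 5 bins:
print(feep([0.70, 0.80, 0.85, 0.88, 0.90], p_ref=0.92))
```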
4 TEMPLATE-BASED APPROACH
To find an approach that fits the users' requirements and performs few-exemplar extraction, we analyzed the content and layout of business documents in detail. Most documents are based on document templates. While many related works define a template as a schema explicitly describing the document and its relevant information, from our point of view a template is a theoretical function that transforms index data, fixed textual elements, and layout components into a graphical representation of the document, a so-called template instance.
The key idea of our information extraction system is to reverse this transformation in order to identify the index data used to create the representation. While we do not have any information about the function itself, we try to identify documents with the same template and benefit from commonalities between them. By grouping documents according to their layout and generating extraction rules on-the-fly out of at least a minimal number of similar training examples, we want to reach the proposed enhancement of the extraction effectivity and the speed-up in the starting period. Ideally, template instances are similar enough that sufficient extraction knowledge can be obtained from a single instance in the training set. Details on our approach are shown in Figure 2. It is part of the Intellix process (Schuster et al., 2013), which focuses on the extraction of information out of business documents with a high overall effectivity. Since this work focuses on the ability of few-exemplar extraction, we reduce the description of our algorithms to a minimum. Further details can be found in the referenced paper.
The input to our extraction system consists of XML files describing the content and layout of business documents. Starting with a document image taken by one of various source devices, e.g. a scanner, printer, smartphone, or computer, the document is preprocessed and transformed by a commercial OCR engine into a hierarchical representation. This XML file describes the structure of the document from page level down to character level. For each element, additional information such as position, bounding box, font details, and formatting styles is detected. Since this information is delivered by an external OCR engine, we do not focus on any optimizations of this step.
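As an illustration only, the following Python sketch shows how such a hierarchical OCR result might be traversed; the element names (page, line, word) and bounding-box attributes (l, t, r, b) are hypothetical placeholders, since the concrete schema depends on the commercial OCR engine and is not specified here.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: <document><page><line><word l=".." t=".." r=".." b="..">
tree = ET.parse("document.xml")
for page in tree.getroot().iter("page"):
    for line in page.iter("line"):
        for word in line.iter("word"):
            # position and bounding box as detected by the OCR engine
            bbox = tuple(int(word.get(k)) for k in ("l", "t", "r", "b"))
            print(word.text, bbox)
```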
4.1 Template Document Detection
Similar to common solutions, the first step of our approach is a classification. The template document detection searches the model of available training examples for similar documents, called template documents. We try to identify training documents that are based on the same template as the extraction document. Since we do not have a formal definition of what a template looks like, we analyze textual and structural characteristics to find similarities that lead to the decision that two documents are based on the same template. Because the subsequent algorithms depend heavily on the results of the template document detection, we focus on reaching a very high Precision of 99% and higher. Technically, we use a two-step approach to find template documents.
In the first step, we use the search engine Lucene with a tf-idf-based ranking as a fast heuristic. Due to its independence from the size of the training set and its ability to learn new documents immediately, it is well suited to the SOHO use case. As features, we combine the document's words with their positions. For this purpose, we overlay each document with a grid of fixed size and append to each word the coordinates of the cell that contains the upper-left corner of the word's bounding box. Validation runs have shown that a grid size of 6 by 3 works best. The word "Invoice" in the grid cell with coordinates x=2 and y=4 results in the feature "Invoice 0204". Querying Lucene returns a ranked list of k training documents that match the input document.
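The feature generation can be sketched as follows; the mapping of the 6-by-3 grid to rows and columns and the zero-padded "XXYY" format are assumptions derived from the "Invoice 0204" example above, not a definitive implementation.

```python
def grid_feature(word, x0, y0, page_width, page_height, cols=3, rows=6):
    """Combine a word with the grid cell containing the upper-left
    corner (x0, y0) of its bounding box, e.g. "Invoice 0204"."""
    cell_x = min(int(x0 / page_width * cols), cols - 1)
    cell_y = min(int(y0 / page_height * rows), rows - 1)
    return f"{word} {cell_x:02d}{cell_y:02d}"


# A word starting at 90% of the page width and 75% of the page height:
print(grid_feature("Invoice", 900, 1050, 1000, 1400))  # -> "Invoice 0204"
```

The resulting feature strings are then used for indexing the training documents and querying the Lucene index described above.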
To identify relevant documents within the ranked
list, we rely on a common distance metric. In this sec-
ond step we calculate a normalized and comparative