Robust Template Identification of Scanned Documents

Xiaofan Feng(1), Abdou Youssef(1) and Sithu D. Sudarsan(2)

(1) Department of Computer Science, The George Washington University, Washington DC, U.S.A.
(2) Office of Science and Engineering Labs, Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, MD, U.S.A.
Keywords:
Scanned Document Identification, Maximum A-Posteriori Estimation, Information Retrieval.
Abstract:
Identification of low-quality scanned documents is not trivial in real-world settings. Existing research, mainly
focusing on similarity-based approaches, relies on perfect string data from a document. Likewise, studies using image
processing techniques for document identification rely on clean data and large differences among templates.
Both kinds of approaches fail to maintain accuracy in the context of noisy data or when document templates are
too similar to each other. In this paper, a probabilistic approach is proposed to identify the document template
of scanned documents. The proposed algorithm works on imperfect OCR output and on document collections
containing very similar templates. Through experiments and analysis, this novel probabilistic approach is
shown to achieve high accuracy on different data sets.
1 INTRODUCTION
Although electronic documents have become preva-
lent, governments and enterprises still possess a large
volume of paper documents. A major task is to dig-
itize, label, and extract information from these paper
documents. Many documents used in enterprises and
governments are typically derived from templates, es-
pecially forms completed by users, e.g., tax forms,
medical forms, job application forms, etc. Given a
set of templates and a scanned paper document, an
open problem is to quickly and accurately identify
which template this scanned document was originally
derived from (Esser et al., 2011). To solve this problem, a number of systems based on labeled information have been proposed and developed (Cunningham et al., 2002; T. S. Jayram et al., 2006). The labeled information is usually manually generated, making document identification a time-consuming and expensive process. Studies have been performed to use image
features to match a scanned document to its template
(Hu et al., 2000). Some of these studies still require
labeled information (Esser et al., 2011), while others
require consistent high-quality data in order to func-
tion properly.
Identifying documents in a repository of scanned documents via manual labeling is inefficient and expensive. Most automatic image processing techniques require clean data, making them ineffective in the presence of noise. Another drawback of those techniques is that they cannot correctly correlate a document to a template when the templates are too similar to each other, which is a serious problem because many form templates in governments and enterprises are nearly identical. As such, a robust system that automatically identifies the originating template of a scanned document is necessary. Once the template has been correctly identified, existing techniques can be utilized to retrieve specific information.
Now, suppose that a scanned document has been successfully matched to its template. To extract information from the scanned document, one of the first steps commonly applied is Optical Character Recognition (OCR). Even though state-of-the-art OCR techniques are highly accurate at recognizing printed words and characters in clean, simply formatted documents, they are still error-prone when the input is noisy or the document format is complicated. The accuracy of an OCR engine becomes less reliable for documents that have been distorted by scanning, aging, or folding. Furthermore, OCR output tends to be inaccurate when the document contains multiple columns and text of different font types and font sizes.
This paper introduces an efficient method for matching a document to its originating template by utilizing the results generated by OCR. First, the text contained in the templates is retrieved. Then a scanned
document image is OCRed. This result is compared to all the templates' text using a probabilistic model, and the most likely template is selected. The proposed approach works well on noisy documents and does not require manual intervention or labeling. It also works well both on document templates that are significantly different from each other and on document templates that are nearly identical to each other. The approach can easily be incorporated into other OCR techniques to reduce document processing time.

Feng, X., Youssef, A. and Sudarsan, S. Robust Template Identification of Scanned Documents. DOI: 10.5220/0004144601030110. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2012), pages 103-110. ISBN: 978-989-8565-29-7. Copyright (c) 2012 SCITEPRESS (Science and Technology Publications, Lda.)
2 RELATED WORKS
Document template identification has been studied
extensively. In this section we summarize the related
work. A survey of the literature quickly reveals that
existing approaches fall into two categories: Informa-
tion Retrieval approaches, and Image Feature based
approaches.
2.1 Information Retrieval Approaches
To identify the template of a scanned document with
information retrieval techniques, the textual informa-
tion of the document is treated as a query against
a database of template documents (Salton, 1986).
The template document which has the highest rank
is taken as the identification result.
Cosine similarity is a common similarity measure used in document query or identification (Press et al., 2007; Tan et al., 2005; Salton et al., 1975). Given two documents represented as term-frequency vectors v_1 and v_2, their cosine similarity is defined as:

    similarity(v_1, v_2) = cos(θ) = (v_1 · v_2) / (||v_1|| ||v_2||)    (1)

These term-frequency vectors can be combined with local and global weighting functions to improve the results in different scenarios.
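As an illustration, Equation 1 can be implemented directly over term-frequency counts. The following is a minimal sketch; the function name and the list-of-strings input format are our own choices, not part of the systems surveyed here:

```python
import math
from collections import Counter

def cosine_similarity(doc1, doc2):
    """Equation 1 over two documents given as lists of term strings:
    cos(theta) = (v1 . v2) / (||v1|| ||v2||)."""
    v1, v2 = Counter(doc1), Counter(doc2)
    dot = sum(v1[t] * v2[t] for t in v1)  # terms missing from v2 contribute 0
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

Identical documents score 1 and documents with no common terms score 0, matching the bounds discussed in Section 5.2.1.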
Singular Value Decomposition (SVD) is another
well developed technique that is widely used in doc-
ument query or identification. There are variations of
this technique, namely latent semantic indexing (LSI) (Deerwester, 1988), probabilistic latent semantic indexing (pLSI) (Hofmann, 1999), and latent Dirichlet allocation (LDA) (Blei et al., 2003). SVD-based techniques sometimes achieve better results than using raw terms for retrieval: taking a lower-rank approximation can help filter out noise in the document-term matrix. OCR may recognize some terms incorrectly, but such error terms tend to recur across documents generated from the same template, while fill-in data that do not occur frequently can be filtered out as noise by the SVD.
2.2 Image Feature Approaches
Image-feature-based approaches have also been extensively studied for document template identification. To identify the template of a given document, different image features have been selected (Lu and Tan, 2004; Hu et al., 2000; Zheng et al., 2005; Jinhui Liu, 2000; Shin et al., 2001), and, depending on the features in use, different similarity measures have been proposed. In (Jinhui Liu, 2000) and (Zheng et al., 2005), the image is classified using geometric line information. These methods require the image data to be clean and noise-free, and do not work on forms containing free-size cells, i.e., cells whose sizes adjust with the filled-in data. Image-based approaches also cannot handle forms generated with the same form structure but different contents. The ZoneSeg method used by (Esser et al., 2011) is chosen as a representative of the image-feature approaches to compare against ours; in (Esser et al., 2011), this method is reported to achieve over 98% accuracy.
3 PROBLEM FORMULATION
In this section, we formulate the problem of scanned
document identification, describe the intuitive ap-
proaches, and identify shortcomings in those ap-
proaches.
3.1 Problem Definition
A modified bag-of-words representation is used for
the document. Each document is treated as a set of
strings. This modified bag-of-words representation
maintains the frequency of each string occurrence and
does not disregard grammar. For example, unlike the
traditional bag-of-words representation, this modified
representation does not use stemming and does not
disregard punctuation.
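This modified representation is essentially a multiset of raw strings. As an illustrative sketch (the helper name is hypothetical), it can be built with a simple counter, with no stemming, lowercasing, or punctuation removal:

```python
from collections import Counter

def string_multiset(strings):
    """Modified bag-of-words: each distinct string is kept verbatim
    (no stemming, no punctuation stripping), with its frequency
    recorded separately."""
    counts = Counter(strings)
    return set(counts), counts

# A template containing "Date:" twice and "Name" once yields the set
# {"Date:", "Name"} together with the counts {"Date:": 2, "Name": 1}.
strings_in_T, C_T = string_multiset(["Date:", "Name", "Date:"])
```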
Let N_T denote the size of the document-template set. Each template is denoted as T_i, with i ∈ [1, N_T]. Each template T_i can be represented as a set:

    T_i = {s_i1, s_i2, ..., s_im}

where s_i1, s_i2, ..., s_im are the strings that appear in the template T_i. If a string appears multiple times in T_i, it is represented once in T_i, and its frequency is recorded. The frequencies (or counts) of appearance of the strings in template T_i are denoted by

    C_i = {c_i1, c_i2, ..., c_im}
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
104
Similarly, an unlabeled query document E, i.e., the document to be identified, is represented as a set:

    E = {e_1, e_2, ..., e_n}

where the e_j are the strings that appear in E, and the frequencies of these strings in E are

    C_E = {c_e1, c_e2, ..., c_en}

The goal is to find the template T_k that the input document E was derived from, where T_k ∈ {T_1, T_2, ..., T_NT}.
Lemma 3.1. Suppose the OCR result of E is error-free. In this case, T_k ⊆ E.

We have the case that T_k ⊂ E when the input document E was filled out by a user or annotated by a fax machine (e.g., adding the timestamp and the fax number). Since E was derived from the template T_k, all the strings in T_k will also be in E, given that the OCR result is error-free. We can also have the case that E = T_k in the event the input document E was just T_k scanned in, without any additional strings added by a user or a fax machine.

In this particular situation, common information retrieval techniques (e.g., cosine similarity) can produce decent results. However, many scanned and faxed documents cannot be OCRed error-free.
Observation 3.1. Suppose E is derived from T_k. It may be the case that E ∩ T_j≠k ≠ ∅. Let E = A ∪ B, with A ⊆ T_k and B ∩ T_k = ∅. One or more of the following cases may hold true:

1. A ∩ T_j≠k ≠ ∅
2. B ∩ T_j≠k ≠ ∅
3. (A ∪ B) ∩ T_j≠k ≠ ∅

The first case stems from the fact that the strings which appear in input E may also appear in templates other than T_k (due to the overlap of strings between templates). The second case stems from strings appearing in E and not appearing in T_k but appearing in other templates in the set, as a result of filled-in data. The third case can occur when both the first and second cases occur simultaneously.
The challenges of this problem are due to the imperfect output from OCR and the user fill-in data. Our goal is to find the correct template T_k for E not only when E does not include all strings of T_k, but also when E contains information appearing on other templates T_j≠k, j ∈ [1, N_T], given that the templates themselves can be very similar to each other.

Note that representing a document with the traditional bag-of-words model is not desirable in this situation. Stemming removes grammatical information, which is useful in our setting: for example, a term appearing on a template in singular form should be treated differently from a term appearing in plural form. Comparing documents represented as sequences of strings or sequences of words involves searching a space that is exponentially larger than our representation. Furthermore, in the current problem context, any representation that considers word order (e.g., n-grams) is unsuitable for comparing documents, since the OCR output will have errors and word order is not guaranteed in complicated document formats.
3.2 Intuitive Approaches
Given the problem formulation, there are a few intuitive solutions, which we discuss in this section.

The most direct and intuitive solution is as follows. Given N_T different templates, if every template T_i has a subset of strings unique to T_i, i.e., occurring in T_i and in no other template, then these subsets of strings can be used as signatures. We denote the signature of template T_i by S_i. Since, for every s ∈ S_i, we have s ∈ T_i and s ∉ T_j for all j ≠ i, finding any of the signature strings in the OCR result of E identifies the template. This approach is problematic, however.
Depending on the similarities among the templates in the set, the number |S_i| of uniquely appearing strings differs across templates. For some template sets, we will be able to find many strings that uniquely identify each template; for others, we may have degenerate cases with no strings to uniquely identify some of the templates. In addition, as the number of templates in the set grows, the sizes of the signatures tend to shrink. More importantly, the success of this approach relies on perfect OCR, with all strings from a query document E being extracted correctly: if all the strings in the signature S_i are incorrectly recognized, it is not possible to find the originating template. Furthermore, realistically speaking, there is no control over the information that will be filled in on the query document E, so E may contain strings that are part of the signatures of multiple templates.
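The signature idea, and its degenerate case, can be made concrete with a small sketch. Templates are modeled here simply as sets of strings; the function and template names are illustrative only:

```python
def template_signatures(templates):
    """For each template (a set of strings), return the strings occurring
    in no other template. An empty signature is the degenerate case in
    which a template cannot be identified by this method."""
    signatures = {}
    for name, strings in templates.items():
        others = set().union(*(s for n, s in templates.items() if n != name))
        signatures[name] = set(strings) - others
    return signatures
```

For instance, with T1 = {a, b}, T2 = {b, c}, T3 = {b}, the signatures are {a}, {c}, and the empty set: T3 shares every string with the other templates and has no signature at all.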
Another intuitive solution is cosine similarity. If E was derived from the template T_k, its vector should be close to T_k's vector. However, there are two caveats here. First, with OCR errors and user filled-in data, the vector representation of E might be dramatically different from that of T_k, resulting in a large distance between E and T_k; this may produce a false match when E's vector is closer to another template's vector. Second, if two templates T_k and T_j are very similar (i.e., the vector-space distance between T_k and T_j is very small), we will not be able to determine whether E was derived from T_k or T_j.
4 A PROBABILISTIC METHOD FOR DOCUMENT IDENTIFICATION
In this section, we discuss in detail how to solve the document identification problem using a probabilistic approach.

The basic idea of our probabilistic approach is to exploit the differences in the feature space of similar templates. To identify the template of an unlabeled document E, both the similarity and the dissimilarity between E and a template T_i should be considered. The similarity serves as a filter to exclude the templates which E cannot belong to, while the dissimilarity is used to enlarge the differences between similar templates.
4.1 The Probabilistic Method
For a query document E, we first use OCR to extract all the strings and build the set C_E. If E was derived from the template T_k, there may be instances where E ∩ T_k ≠ ∅ and E ∩ T_j ≠ ∅, for some j ≠ k. These instances result from two situations, or a combination of them, as defined in Observation 3.1. First, the templates may share common strings among themselves, that is, T_k ∩ T_j ≠ ∅ for some k ≠ j. Second, the input may have been filled with information that was correctly or incorrectly recognized as a string occurring in another template T_j, i.e., (E \ T_k) ∩ T_j ≠ ∅. Thus, for a string s appearing in both input E and template T_k, it is possible that E was derived from T_k, but it is also possible that E was derived from another template T_j and the filled-in data in E contained s. To handle this uncertainty, we use maximum a posteriori (MAP) estimation to estimate the most likely template from which E was derived.
From the set of templates {T_1, T_2, ..., T_NT}, the union of their string sets can be obtained as U_T = ∪_{i=1..N_T} T_i. Any string that is in E but not in U_T is ignored. Identifying which template T_i, i ∈ [1, N_T], the input E was generated from is thus equivalent to deciding which template the set E′ = E ∩ U_T is most likely to follow.
For any string e_i ∈ E′ (so e_i ∈ U_T), let c_ei denote the number of times e_i appears in E′, and suppose e_i appears in template T_i with count of occurrence c_ij. The case c_ij > c_ei may be the result of OCR errors, with some occurrences of e_i being incorrectly recognized as other strings; it may also be that some occurrences of e_i have been incorrectly recognized while some fill-in data containing e_i have been correctly recognized, with the total number of recognized occurrences still less than c_ij. In the case c_ij < c_ei, it is possible that all c_ij occurrences of e_i from template T_i were correctly recognized by OCR and the fill-in data contain exactly c_ei − c_ij additional occurrences of e_i; it is also possible that not all c_ij occurrences of e_i were correctly recognized, but that the fill-in data contributed more than c_ei − c_ij recognized occurrences of e_i.
To handle these uncertain situations, two probabilities are defined. In general, we can state that the OCR success in recognizing one occurrence of a string e_i follows a Bernoulli distribution with probability p. For a template T_i in which e_i appears c_ij times, the number of recognized occurrences of e_i in E therefore follows a Binomial(c_ij, p) distribution. In addition, a string e_i appearing in U_T follows a Bernoulli distribution with probability q of actually coming from the fill-in data.
Given the input E as an observation and a set of templates T_1, T_2, ..., T_NT, the template that maximizes the posterior probability is selected as the template that E was derived from. The posterior probability is computed as:

    î_MAP(C_E) = argmax_i p(T_i | C_E)
               = argmax_i p(C_E | T_i) p(T_i) / p(C_E)
               = argmax_i p(C_E | T_i) p(T_i)    (2)

where p(T_i) is the prior probability of template T_i, and the denominator p(C_E) can be dropped because it does not depend on i. Based on the particular application, if knowledge regarding which templates are more frequent is available, this prior can be assigned accordingly. In the absence of such prior knowledge, p(T_i) can be assigned the uniform distribution, p(T_i) = 1/N_T, in which case the MAP estimator reduces to the maximum likelihood estimator (MLE).
Now, we discuss how to compute p(C_E | T_i). The recognition result of each string in the template T_i from which E has been derived can be taken as a random variable following a Binomial(c_ij, p_i) distribution. The recognition results of all strings in template T_i are assumed to be independent and identically distributed. The probability that any string from U_T \ T_i, i.e., filled-in data, coincides with a string occurring in other templates is q. The probability p(C_E | T_i) could then be computed exactly as a combination of these two distributions, but that computation is too complicated. Under the assumption p >> q, p(c_ei | T_i) can be simplified as follows: if c_ei ≤ c_ij, all recognized occurrences are considered to come from the template; if c_ei > c_ij, then c_ij occurrences are assumed to come from the template, while the remaining c_ei − c_ij are strings accidentally falling in U_T.

    p(c_ei | T_i) = { C(c_ij, c_ei) · p_i^(c_ei) · (1 − p_i)^(c_ij − c_ei)   if c_ei ≤ c_ij
                    { p_i^(c_ij) · q^(c_ei − c_ij)                           if c_ei > c_ij    (3)

where C(c_ij, c_ei) denotes the binomial coefficient.
Hence, given a template T_i, the conditional probability of observing data E with frequency vector C_E can be computed approximately as:

    p(C_E | T_i) = ∏_{l=1..n} p(c_el | T_i)    (4)
With Equations 3 and 4, the MAP estimate of the template can be computed. We use the logarithm of the probability:

    î_MAP(E) = argmax_i log[ p(C_E | T_i) p(T_i) ]    (5)

which is faster to compute and avoids underflow problems.
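Equations 3-5 can be combined into a short scoring routine. The sketch below simplifies by assuming a single recognition probability p shared by all strings (rather than per-template estimates p_i) and takes q as given; the function and variable names are our own:

```python
import math
from collections import Counter

def log_likelihood(c_e, c_t, p, q):
    """log p(c_ei | T_i) per Equation 3: Binomial(c_t, p) recognition of the
    template's occurrences, with surplus occurrences attributed to fill-in
    data at probability q (requires 0 < p < 1 and 0 < q < 1)."""
    if c_e <= c_t:
        return (math.log(math.comb(c_t, c_e))
                + c_e * math.log(p) + (c_t - c_e) * math.log(1.0 - p))
    return c_t * math.log(p) + (c_e - c_t) * math.log(q)

def map_template(doc_counts, templates, p, q, priors=None):
    """Equations 2, 4, 5: return the template name maximizing
    log p(C_E | T_i) + log p(T_i).  Strings absent from every template add
    the same q-penalty to every score, so they do not affect the argmax,
    matching the rule that strings outside U_T are ignored."""
    best_name, best_score = None, -math.inf
    for name, t_counts in templates.items():
        prior = priors[name] if priors is not None else 1.0 / len(templates)
        score = math.log(prior)
        for s in set(t_counts) | set(doc_counts):
            score += log_likelihood(doc_counts.get(s, 0), t_counts.get(s, 0), p, q)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

With p >> q, a document matching template A's counts is scored far above a template whose surplus or missing strings must be explained by the unlikely fill-in channel.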
4.2 Considerations
The probabilistic model allows for situations in which an important string is not properly recognized by OCR. At the same time, the model leverages the lengths of the words. The p_i used in our experiments is obtained by OCRing the known templates and comparing the OCR output with the ground truth of each template to obtain an estimate of p_i. It can be argued that this p_i does not reflect the actual error rate on a particular input E; however, it is useful for expressing the relative error rates among the templates.
5 EXPERIMENTAL PERFORMANCE EVALUATION
In this section, the proposed algorithm is evaluated on
two data sets. Some properties of the data sets are dis-
cussed first, followed by a brief introduction to other
algorithms we compare against. Then the experiment
results are shown, followed by detailed evaluation and
discussion.
5.1 Data Sets
Two data sets are used in the experiments. The first is the Special Database II from the National Institute of Standards and Technology (NIST). It contains 12 different sets of tax forms from the IRS 1040 Package X for the year 1988, totaling 5,590 IRS tax documents; we refer to it as the NIST-IRS data set. These documents include Forms 1040, 2106, 2441, 4562, and 6251, together with Schedules A, B, C, D, E, F, and SE. They have 20 different form faces and are all binary images of consistent quality. The forms have small pairwise similarity, as shown in Figures 1(a) and 1(b), in terms of both image features and the actual string contents they contain.
The other data set contains 4,576 event report forms from the US Food and Drug Administration (FDA). This set consists of 12 different form faces, which are actual event reports. These forms are also binary images, scanned at different resolutions from 199 dpi to 300 dpi. Documents following the same templates were produced at different times, with various fill-in data. Most of the documents are noisy, with various skew angles and annotations. Some of the templates are very similar in terms of both image layout and string contents, making them difficult to differentiate; Figures 1(c) and 1(d) illustrate a pair of similar templates.
5.2 Comparison Methods
The detailed implementation and set up for the com-
parison methods, Cosine Similarity, SVD, and Zone-
Seg are explained as follows.
5.2.1 Cosine Similarity
Given two document templates T_i and T_j, all terms (i.e., strings) that occur in the whole collection are first collected to obtain the term-frequency vectors C_i and C_j, and their cosine similarity is computed with Equation 1.
Since term frequencies cannot be negative, the similarity between T_i and T_j is bounded in the interval [0, 1]: when T_i and T_j have no common terms, similarity(T_i, T_j) = 0, and if T_i and T_j are exactly the same, similarity(T_i, T_j) = 1. To use cosine similarity to identify the template of a document, each document template is represented as a term-frequency vector C_i (i = 1, 2, ..., N_T), whose dimension is the total number of distinct terms that appear in the whole set of templates. For each unlabeled test document E, the corresponding term-frequency vector C_E is computed, and the similarities similarity(E, T_i) between the test document and the templates are computed. The template T_i with the largest cosine similarity to the unlabeled test document is taken as the matching template.
RobustTemplateIdentificationofScannedDocuments
107
(a) A NIST template (b) Another NIST template (c) An FDA template (d) Another FDA template
Figure 1: Sample NIST and FDA templates. Notice that (a) and (b) are significantly different, while (c) and (d) are quite similar.
5.2.2 Singular Value Decomposition
The Singular Value Decomposition (SVD) approach is implemented as follows. All terms on templates T_1, T_2, ..., T_NT are used to generate the term-document matrix. We experimented with several common weighting functions; the two that gave the best accuracy were used in the experiment. The log weighting function is used as the local weighting function: for term i in template T_j, the weight is l_(i,Tj) = log(c_ij + 1), with c_ij being the count of occurrences of term i in template T_j. The global weighting used is the entropy weighting function, which is reported to work well with LSI. A term i's global weight is computed as

    g_i = 1 + Σ_j [ p_(i,Tj) · log p_(i,Tj) / log N_T ]

where p_(i,Tj) = c_ij / c_i(UT), and c_i(UT) is the total count of occurrences of term i in the whole collection U_T. All ranks of singular values have been tested, and the half-rank and full-rank results are reported.
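The log-entropy weighting just described can be sketched per term as follows; the function name and the list-of-counts input convention are our own:

```python
import math

def log_entropy_weights(term_counts, n_templates):
    """For one term i: local log weights l_{i,j} = log(c_ij + 1) and global
    entropy weight g_i = 1 + sum_j p_ij log(p_ij) / log(N_T), where
    p_ij = c_ij / (total count of term i).  term_counts holds the term's
    count in each of the N_T templates (zeros allowed)."""
    total = sum(term_counts)
    local = [math.log(c + 1) for c in term_counts]
    g = 1.0
    if total > 0 and n_templates > 1:
        for c in term_counts:
            if c > 0:
                p = c / total
                g += p * math.log(p) / math.log(n_templates)
    return local, g
```

A term spread evenly over all templates gets global weight 0 (uninformative), while a term confined to a single template gets weight 1, which is why entropy weighting pairs well with LSI here.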
5.2.3 Image Feature Similarity / ZoneSeg
Each template T_i and the input document E are treated as images, and a grid of patches of size m × n is superimposed on each image. For each small patch, if the fraction of black pixels in the patch exceeds a threshold K_threshold = 5%, the patch is represented as 1, and as 0 otherwise. Reading the patches in the same fixed order, each image can then be represented as a string of 0s and 1s. The similarity between a template T_i and an input document E is defined by the Levenshtein distance between the two strings: the more visually similar the two images are, the smaller the Levenshtein distance will be. The Levenshtein distance between the test document E and each template T_i is computed, and the template that minimizes the distance is taken as the identified template. Different image patch sizes were also experimented with, and the size m = 36 pixels, n = 24 pixels, which renders the best results, is used in the experiment.
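A minimal sketch of this comparison method could look as follows, assuming the image is given as a list of rows of 0/1 pixel values (1 = black); the names and the pure-Python edit distance are our own choices:

```python
def zone_string(image, patch_h=36, patch_w=24, threshold=0.05):
    """Encode a binary image (rows of 0/1 ints, 1 = black) as a bit string:
    one character per patch, '1' if the black-pixel ratio exceeds the
    threshold (K_threshold = 5%)."""
    h = len(image)
    bits = []
    for top in range(0, h, patch_h):
        for left in range(0, len(image[0]), patch_w):
            patch = [row[left:left + patch_w] for row in image[top:top + patch_h]]
            area = sum(len(r) for r in patch)
            black = sum(sum(r) for r in patch)
            bits.append('1' if black > threshold * area else '0')
    return ''.join(bits)

def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]
```

The identified template is then the one whose zone string has the smallest Levenshtein distance to the test document's zone string.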
5.3 Experiment Design
The experiments were conducted on a set of Intel Core Duo machines, using Python 2.7 and MATLAB 2009b. For OCR, we used the open-source Tesseract 3.01 project.
5.4 Results and Analysis
In this section, we compare the performance of our approach with cosine similarity, SVD, and ZoneSeg. The performance is reported in terms of accuracy. We evaluate the accuracy of a document identification algorithm with two metrics: total accuracy and average per-template accuracy. The total accuracy ACC_t of an algorithm is defined as the ratio of the number of correctly identified documents to the total number of documents. The average per-template accuracy ACC_a is defined as the mean of the accuracies obtained on the documents of each template. While ACC_t demonstrates the overall performance of each method on a data set, ACC_a shows the algorithm's robustness across different templates. The corresponding results on the NIST-IRS and FDA data sets are shown in Figures 2 and 3, respectively.
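The two metrics can be computed as follows (a small sketch with hypothetical names, given parallel lists of true and predicted template labels):

```python
from collections import defaultdict

def accuracies(true_labels, predicted_labels):
    """ACC_t: correctly identified documents / all documents.
    ACC_a: unweighted mean of the per-template accuracies."""
    per_template = defaultdict(lambda: [0, 0])  # template -> [correct, total]
    for truth, pred in zip(true_labels, predicted_labels):
        per_template[truth][1] += 1
        per_template[truth][0] += int(truth == pred)
    acc_t = (sum(c for c, _ in per_template.values())
             / sum(n for _, n in per_template.values()))
    acc_a = sum(c / n for c, n in per_template.values()) / len(per_template)
    return acc_t, acc_a
```

Because ACC_a weights every template equally regardless of its sample size, the two metrics diverge exactly when an algorithm's accuracy varies across templates, which is the effect analyzed in the last observation below.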
From Figures 2 and 3, we make the following observations.

1. All algorithms achieve good results on the NIST-IRS data set.
2. For the NIST-IRS data set, all algorithms achieve 100% ACC_t and ACC_a accuracy, as expected, except for ZoneSeg. However, the result of the ZoneSeg algorithm is still consistent with the results reported in (Esser et al., 2011).
3. For the FDA data set, our algorithm achieves the best ACC_t and ACC_a accuracy.
4. The SVD method performs better on full rank than on half rank on both data sets.
5. For the NIST-IRS data set, the ACC_a result is better than the ACC_t result for all algorithms; for the FDA data set, ACC_t is consistently better than ACC_a for all algorithms.
The first observation is expected: the accuracy of document identification algorithms depends on the data set. When the differences among the templates in a data set are significant, it is easier to identify the template, and the accuracy will therefore be high. Since the NIST-IRS data set's templates are significantly different both at the image level and in their contents, high accuracy was expected for most algorithms.

The second observation indicates that the differences between the templates are significant. Both the algorithms based on OCR output and those based on image-level features achieve good results, all over 98%. However, the algorithms based on OCR output achieve better accuracy than those based on image-level features, because the differences in string-level features are more significant than the differences in image-level features. The NIST-IRS data set also has less fill-in data in common, which makes the test images easier to identify using image-level features; even so, image noise, including scanning skew and translation, still affects the accuracy.
The third observation indicates that the proposed probability-based algorithm works very well even on templates that are very similar. Unlike the other algorithms, which mainly measure the similarity between a query and a template, our model measures both the similarity and the dissimilarity between them: it rewards evidence that an unlabeled test document matches a template, and it probabilistically penalizes mismatches. The accuracy of the cosine similarity and SVD methods is not as good as that of our algorithm because they lack such a penalty mechanism and cannot account for the possibility that fill-in data coincide with a term used in another template. The accuracy of ZoneSeg is lower on the FDA set, indicating that its performance is sensitive to the quality of the documents and to the similarity between templates.
The fourth observation indicates that full-rank SVD works better than lower ranks. In most information retrieval tasks, only a few ranks are usually needed to approximate the data in feature space accurately enough; for this task, however, a lower-rank approximation does not capture the full-rank behavior well.
The last observation indicates an important result. Since ACC_a is the average of the per-template accuracies, the two metrics would give the same result if an algorithm performed consistently on all templates. However, when the number of samples per template differs and the algorithm's accuracy varies across templates, the two metrics take different values. For the NIST-IRS data set, ACC_a is consistently higher than ACC_t because, on several templates with smaller sample sizes, all algorithms still achieve high accuracy, mostly higher than on the templates with larger sample sizes. For the FDA data set, by contrast, there are a few templates with small sample sizes for which the accuracy is lower; the templates that are harder to identify happen to have fewer test documents. Note that our probabilistic approach has the smallest decrease from ACC_t to ACC_a among all methods, which indicates that it is the most consistent across different templates.
We also studied the 23 failed cases out of the 4,576 tests in the FDA data set and found that they generally fall into two situations: (1) the images are severely malformed (e.g., more than half of the document was not scanned), so only the portions of the images that are identical across templates are kept and OCRed; (2) for a pair of templates that share almost the same structure and differ in 10 or fewer words, the test was unable to determine the template, because the scores computed by Equation 5 for the pair were too close, and simply choosing the maximum score results in incorrect classification. To handle these cases, a threshold on the likelihood should be set based on the application and the similarity among the templates; any test file whose likelihood falls below this threshold should be verified manually.

With respect to speed, the experiments show that our method is comparable to cosine similarity and SVD, while ZoneSeg's speed varies with the patch size. In the experiments, the image patch size that gives the best accuracy is significantly slow, since a typical test image is converted into a string of length greater than 5,000, which causes the Levenshtein distance computation to take a long time.
6 CONCLUSIONS AND FUTURE WORK
A probabilistic method for identifying the document templates of noisy scanned documents has been studied. This method works well with the low-accuracy OCR results produced from noisy documents. Through experiments and analysis, the proposed method is shown to perform consistently across different template sets, and to work well even when document templates are very similar. We recognize that certain documents contain, in addition to text, non-text content to which our technique does not apply. The intent is that the technique in this paper be incorporated into larger application systems that handle both text recognition (our technique) and non-text recognition (image feature-based techniques).

Our future research will focus on incorporating other image-feature-based techniques with our method and on automatically identifying fill-in data based on the proposed method.
REFERENCES
Blei, D. M., Ng, A., and Jordan, M. (2003). Latent dirichlet
allocation. J. Mach. Learn. Res., 3:993–1022.
Cunningham, H., Maynard, D., Bontcheva, K., and Tablan, V. (2002). GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.
Deerwester, S. (1988). Improving information retrieval with latent semantic indexing. In Proceedings of the 51st ASIS Annual Meeting, ASIS '88.
Esser, D., Schuster, D., Muthmann, K., Berger, M., and
Schill, A. (2011). Automatic indexing of scanned
documents - a layout-based approach. In Document
Recognition and Retrieval XVIII.
Hofmann, T. (1999). Probabilistic latent semantic indexing.
In Proceedings of the 22nd Annual International ACM
SIGIR Conference on Research and Development in
Information Retrieval, SIGIR ’99, pages 50–57.
Hu, J., Kashi, R., and Wilfong, G. (2000). Comparison and
classification of documents based on layout similarity.
Inf. Retr., 2:227–243.
Jinhui Liu, A. K. J. (2000). Image-based form document
retrieval. Pattern Recognition, 33:503–513.
Lu, Y. and Tan, C. L. (2004). Information retrieval in docu-
ment image databases. IEEE Transactions on Knowl-
edge and Data Engineering, 16:1398–1410.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flan-
nery, B. P. (2007). Numerical Recipes 3rd Edition:
The Art of Scientific Computing. Cambridge Univer-
sity Press.
Salton, G. (1986). Another look at automatic text-retrieval
systems. Commun. ACM, 29:648–656.
Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18.
Shin, C., Doermann, D., and Rosenfeld, A. (2001). Classi-
fication of document pages using structure-based fea-
tures. International Journal on Document Analysis
and Recognition, 3:232–247.
Tan, P.-N., Steinbach, M., and Kumar, V. (2005). Introduc-
tion to Data Mining. Addison-Wesley.
Jayram, T. S., Krishnamurthy, R., Raghavan, S., Vaithyanathan, S., and Zhu, H. (2006). Avatar information extraction system. IEEE Data Engineering Bulletin, 29.
Zheng, Y., Li, H., and Doermann, D. (2005). A parallel-line
detection algorithm based on HMM decoding. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 27:777–792.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
110