Robust Template Identification of Scanned Documents

Xiaofan Feng(1), Abdou Youssef(1) and Sithu D. Sudarsan(2)

(1) Department of Computer Science, The George Washington University, Washington DC, U.S.A.
(2) Office of Science and Engineering Labs, Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, MD, U.S.A.
Keywords:
Scanned Document Identification, Maximum A-Posteriori Estimation, Information Retrieval.
Abstract:
Identification of low-quality scanned documents is not trivial in real-world settings. Existing research, mainly
focusing on similarity-based approaches, relies on perfect string data from a document. Likewise, studies using image
processing techniques for document identification rely on clean data and large differences among templates.
Both kinds of approaches fail to maintain accuracy in the context of noisy data or when document templates are
too similar to each other. In this paper, a probabilistic approach is proposed to identify the document template
of scanned documents. The proposed algorithm works on imperfect OCR output and on document collections
containing very similar templates. Through experiments and analysis, this novel probabilistic approach is
shown to achieve high accuracy on different data sets.
1 INTRODUCTION
Although electronic documents have become preva-
lent, governments and enterprises still possess a large
volume of paper documents. A major task is to dig-
itize, label, and extract information from these paper
documents. Many documents used in enterprises and
governments are typically derived from templates, es-
pecially forms completed by users, e.g., tax forms,
medical forms, job application forms, etc. Given a
set of templates and a scanned paper document, an
open problem is to quickly and accurately identify
which template this scanned document was originally
derived from (Esser et al., 2011). To solve this problem, a number of systems based on labeled information have been proposed and developed (Cunningham et al., 2002; T. S. Jayram et al., 2006). The labeled information is usually manually generated, making document identification a time-consuming and expensive process. Studies have been performed to use image
features to match a scanned document to its template
(Hu et al., 2000). Some of these studies still require
labeled information (Esser et al., 2011), while others
require consistent high-quality data in order to func-
tion properly.
Identifying documents in a repository of scanned documents via manual labeling is inefficient and expensive. Most automatic image processing techniques require clean data, making them ineffective in the presence of noise. Another drawback of those techniques is that they cannot correctly correlate a document to a template when the templates are too similar to each other, which is a serious problem because many form templates in governments and enterprises are nearly identical. As such, a robust system that automatically identifies the originating template of a scanned document is necessary. Once the template has been correctly identified, existing techniques can be utilized to retrieve specific information.
Now, suppose that a scanned document has been successfully matched to its template. To extract information from the scanned document, one of the first steps commonly applied is Optical Character Recognition (OCR). Even though state-of-the-art OCR techniques are highly accurate at recognizing printed words and characters in clean, simply formatted documents, they are still error-prone when the input is noisy or the document format is complicated. The accuracy of an OCR engine becomes less reliable for documents that have been distorted by scanning, aging, or folding. Furthermore, OCR output tends to be inaccurate when the document contains multiple columns and text of different font types and font sizes.
This paper introduces an efficient method for matching a document to its originating template by utilizing the results generated by OCR. First, the text contained in the templates is retrieved. Then a scanned
document image is OCRed. This result is compared to all the templates' text using a probabilistic model, and the most likely template is selected. The proposed approach works well on noisy documents and does not require manual intervention or labeling. It also works well both on document templates that are significantly different from each other and on document templates that are nearly identical to each other. The approach can easily be incorporated into other OCR techniques to reduce document processing time.

Feng, X., Youssef, A. and Sudarsan, S. Robust Template Identification of Scanned Documents. DOI: 10.5220/0004144601030110. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2012), pages 103-110. ISBN: 978-989-8565-29-7. Copyright (c) 2012 SCITEPRESS (Science and Technology Publications, Lda.)
2 RELATED WORKS
Document template identification has been studied
extensively. In this section we summarize the related
work. A survey of the literature quickly reveals that
existing approaches fall into two categories: Informa-
tion Retrieval approaches, and Image Feature based
approaches.
2.1 Information Retrieval Approaches
To identify the template of a scanned document with
information retrieval techniques, the textual informa-
tion of the document is treated as a query against
a database of template documents (Salton, 1986).
The template document which has the highest rank
is taken as the identification result.
Cosine similarity is a common similarity measure used in document query or identification (Press et al., 2007; Tan et al., 2005; Salton et al., 1975). Given two documents represented as term-frequency vectors v_1 and v_2, their cosine similarity is defined as:

    similarity(v_1, v_2) = cos(θ) = (v_1 · v_2) / (||v_1|| ||v_2||)    (1)

These term-frequency vectors can be combined with local and global weighting functions to improve the results in different scenarios.
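As an illustration, Equation 1 can be implemented directly over term-frequency counts. The following is a minimal sketch; the function name and the list-of-strings input format are our own choices, not part of the systems surveyed here:

```python
import math
from collections import Counter

def cosine_similarity(doc1, doc2):
    """Equation 1 over two documents given as lists of term strings:
    cos(theta) = (v1 . v2) / (||v1|| ||v2||)."""
    v1, v2 = Counter(doc1), Counter(doc2)
    dot = sum(v1[t] * v2[t] for t in v1)  # terms missing from v2 contribute 0
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

Identical documents score 1 and documents with no common terms score 0, matching the bounds discussed in Section 5.2.1.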
Singular Value Decomposition (SVD) is another
well developed technique that is widely used in doc-
ument query or identification. There are variations of
this technique, namely latent semantic indexing (LSI) (Deerwester, 1988), probabilistic latent semantic indexing (pLSI) (Hofmann, 1999), and latent Dirichlet allocation (LDA) (Blei et al., 2003). SVD-based techniques sometimes achieve better results than using raw terms for retrieval: taking a lower-rank approximation can help filter out noise in the document-term matrix. OCR may recognize some terms incorrectly, but such error terms tend to recur across documents generated from the same template, while fill-in data that do not occur frequently can be filtered out as noise by the SVD.
2.2 Image Feature Approaches
Image-feature-based approaches have also been extensively studied for document template identification. To identify the template of a given document, different image features have been selected (Lu and Tan, 2004; Hu et al., 2000; Zheng et al., 2005; Jinhui Liu, 2000; Shin et al., 2001), and, depending on the features in use, different similarity measures have been proposed. In (Jinhui Liu, 2000) and (Zheng et al., 2005), the image is classified using geometric line information. These methods require the image data to be clean and noise-free, and do not work on forms containing free-size cells, i.e., cells whose sizes adjust with the filled-in data. Image-based approaches also cannot handle forms generated with the same form structure but different contents. The ZoneSeg method used by (Esser et al., 2011) is chosen as a representative of the image-feature approaches to compare against ours; in (Esser et al., 2011), this method is reported to achieve over 98% accuracy.
3 PROBLEM FORMULATION
In this section, we formulate the problem of scanned
document identification, describe the intuitive ap-
proaches, and identify shortcomings in those ap-
proaches.
3.1 Problem Definition
A modified bag-of-words representation is used for
the document. Each document is treated as a set of
strings. This modified bag-of-words representation
maintains the frequency of each string occurrence and
does not disregard grammar. For example, unlike the
traditional bag-of-words representation, this modified
representation does not use stemming and does not
disregard punctuation.
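This modified representation is essentially a multiset of raw strings. As an illustrative sketch (the helper name is hypothetical), it can be built with a simple counter, with no stemming, lowercasing, or punctuation removal:

```python
from collections import Counter

def string_multiset(strings):
    """Modified bag-of-words: each distinct string is kept verbatim
    (no stemming, no punctuation stripping), with its frequency
    recorded separately."""
    counts = Counter(strings)
    return set(counts), counts

# A template containing "Date:" twice and "Name" once yields the set
# {"Date:", "Name"} together with the counts {"Date:": 2, "Name": 1}.
strings_in_T, C_T = string_multiset(["Date:", "Name", "Date:"])
```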
Let N_T denote the size of the document-template set. Each template is denoted as T_i, with i ∈ [1, N_T]. Each template T_i can be represented as a set:

    T_i = {s_i1, s_i2, ..., s_im}

where s_i1, s_i2, ..., s_im are the strings that appear in the template T_i. If a string appears multiple times in T_i, it is represented once in T_i, and its frequency is recorded. The frequencies (or counts) of appearance of the strings in template T_i are denoted by

    C_i = {c_i1, c_i2, ..., c_im}
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
104
Similarly, an unlabeled query document E, i.e., the document to be identified, is represented as a set:

    E = {e_1, e_2, ..., e_n}

where the e_j are the strings that appear in E, and the frequencies of these strings in E are

    C_E = {c_e1, c_e2, ..., c_en}

The goal is to find the template T_k that the input document E was derived from, where T_k ∈ {T_1, T_2, ..., T_NT}.
Lemma 3.1. Suppose the OCR result of E is error-free. In this case, T_k ⊆ E.

We have the case that T_k ⊂ E when the input document E was filled out by a user or annotated by a fax machine (e.g., adding the timestamp and the fax number). Since E was derived from the template T_k, all the strings in T_k will also be in E, given that the OCR result is error-free. We can also have the case that E = T_k in the event the input document E was just T_k scanned in, without any additional strings added by a user or a fax machine.

In this particular situation, common information retrieval techniques (e.g., cosine similarity) can produce decent results. However, many scanned and faxed documents cannot be OCRed error-free.
Observation 3.1. Suppose E is derived from T_k. It may be the case that E ∩ T_j≠k ≠ ∅. Let E = A ∪ B, with A ⊆ T_k and B ∩ T_k = ∅. One or more of the following cases may hold true:

1. A ∩ T_j≠k ≠ ∅
2. B ∩ T_j≠k ≠ ∅
3. (A ∪ B) ∩ T_j≠k ≠ ∅

The first case stems from the fact that the strings which appear in input E may also appear in templates other than T_k (due to the overlap of strings between templates). The second case stems from strings appearing in E and not appearing in T_k but appearing in other templates in the set, as a result of filled-in data. The third case can occur when both the first and second cases occur simultaneously.
The challenges of this problem are due to the imperfect output from OCR and the user fill-in data. Our goal is to find the correct template T_k for E not only when E does not include all strings of T_k, but also when E contains information appearing on other templates T_j≠k, j ∈ [1, N_T], given that the templates themselves can be very similar to each other.

Note that representing a document with the traditional bag-of-words model is not desirable in this situation. Stemming removes grammatical information, which is useful in our setting: for example, a term appearing on a template in singular form should be treated differently from a term appearing in plural form. Comparing documents represented as sequences of strings or sequences of words involves searching a space that is exponentially larger than our representation. Furthermore, in the current problem context, any representation that considers word order (e.g., n-grams) is unsuitable for comparing documents, since the OCR output will have errors and word order is not guaranteed in complicated document formats.
3.2 Intuitive Approaches
Given the problem formulation, there are a few intuitive solutions, which we discuss in this section.

The most direct and intuitive solution is as follows. Given N_T different templates, if every template T_i has a subset of strings unique to T_i, i.e., occurring in T_i and in no other template, then these subsets of strings can be used as signatures. We denote the signature of template T_i by S_i. Since, for every s ∈ S_i, we have s ∈ T_i and s ∉ T_j for all j ≠ i, finding any of the signature strings in the OCR result of E identifies the template. This approach is problematic, however.
Depending on the similarities among the templates in the set, the number |S_i| of uniquely appearing strings differs across templates. For some template sets, we will be able to find many strings that uniquely identify each template; for others, we may have degenerate cases with no strings to uniquely identify some of the templates. In addition, as the number of templates in the set grows, the sizes of the signatures tend to shrink. More importantly, the success of this approach relies on perfect OCR, with all strings from a query document E being extracted correctly: if all the strings in the signature S_i are incorrectly recognized, it is not possible to find the originating template. Furthermore, realistically speaking, there is no control over the information that will be filled in on the query document E, so E may contain strings that are part of the signatures of multiple templates.
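The signature idea, and its degenerate case, can be made concrete with a small sketch. Templates are modeled here simply as sets of strings; the function and template names are illustrative only:

```python
def template_signatures(templates):
    """For each template (a set of strings), return the strings occurring
    in no other template. An empty signature is the degenerate case in
    which a template cannot be identified by this method."""
    signatures = {}
    for name, strings in templates.items():
        others = set().union(*(s for n, s in templates.items() if n != name))
        signatures[name] = set(strings) - others
    return signatures
```

For instance, with T1 = {a, b}, T2 = {b, c}, T3 = {b}, the signatures are {a}, {c}, and the empty set: T3 shares every string with the other templates and has no signature at all.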
Another intuitive solution is cosine similarity. If E was derived from the template T_k, its vector should be close to T_k's vector. However, there are two caveats here. First, with OCR errors and user filled-in data, the vector representation of E might be dramatically different from that of T_k, resulting in a large distance between E and T_k; this may produce a false match when E's vector is closer to another template's vector. Second, if two templates T_k and T_j are very similar (i.e., the vector-space distance between T_k and T_j is very small), we will not be able to determine whether E was derived from T_k or T_j.
4 A PROBABILISTIC METHOD FOR DOCUMENT IDENTIFICATION
In this section, we discuss in detail how to solve the document identification problem using a probabilistic approach.

The basic idea of our probabilistic approach is to exploit the differences in the feature space of similar templates. To identify the template of an unlabeled document E, both the similarity and the dissimilarity between E and a template T_i should be considered. The similarity serves as a filter to exclude the templates which E cannot belong to, while the dissimilarity is used to enlarge the differences between similar templates.
4.1 The Probabilistic Method
For a query document E, we first use OCR to extract all the strings and build the set C_E. If E was derived from the template T_k, there may be instances where E ∩ T_k ≠ ∅ and E ∩ T_j ≠ ∅, for some j ≠ k. These instances result from two situations, or a combination of them, as defined in Observation 3.1. First, the templates may share common strings among themselves, that is, T_k ∩ T_j ≠ ∅ for some k ≠ j. Second, the input may have been filled with information that was correctly or incorrectly recognized as a string occurring in another template T_j, i.e., (E \ T_k) ∩ T_j ≠ ∅. Thus, for a string s appearing in both input E and template T_k, it is possible that E was derived from T_k, but it is also possible that E was derived from another template T_j and the filled-in data in E contained s. To handle this uncertainty, we use maximum a posteriori (MAP) estimation to estimate the most likely template from which E was derived.
From the set of templates {T_1, T_2, ..., T_NT}, the union of their string sets can be obtained as U_T = ∪_{i=1..N_T} T_i. Any string that is in E but not in U_T is ignored. Identifying which template T_i, i ∈ [1, N_T], the input E was generated from is thus equivalent to deciding which template the set E′ = E ∩ U_T is most likely to follow.
For any string e_i ∈ E′ (so e_i ∈ U_T), let c_ei denote the number of times e_i appears in E′, and suppose e_i appears in template T_i with count of occurrence c_ij. The case c_ij > c_ei may be the result of OCR errors, with some occurrences of e_i being incorrectly recognized as other strings; it may also be that some occurrences of e_i have been incorrectly recognized while some fill-in data containing e_i have been correctly recognized, with the total number of recognized occurrences still less than c_ij. In the case c_ij < c_ei, it is possible that all c_ij occurrences of e_i from template T_i were correctly recognized by OCR and the fill-in data contain exactly c_ei − c_ij additional occurrences of e_i; it is also possible that not all c_ij occurrences of e_i were correctly recognized, but that the fill-in data contributed more than c_ei − c_ij recognized occurrences of e_i.
To handle these uncertain situations, two probabilities are defined. In general, we can state that the OCR success in recognizing one occurrence of a string e_i follows a Bernoulli distribution with probability p. For a template T_i in which e_i appears c_ij times, the number of recognized occurrences of e_i in E therefore follows a Binomial(c_ij, p) distribution. In addition, a string e_i appearing in U_T follows a Bernoulli distribution with probability q of actually coming from the fill-in data.
Given the input E as an observation and a set of templates T_1, T_2, ..., T_NT, the template that maximizes the posterior probability is selected as the template that E was derived from. The posterior probability is computed as:

    î_MAP(C_E) = argmax_i p(T_i | C_E)
               = argmax_i p(C_E | T_i) p(T_i) / p(C_E)
               = argmax_i p(C_E | T_i) p(T_i)    (2)

where p(T_i) is the prior probability of template T_i, and the denominator p(C_E) can be dropped because it does not depend on i. Based on the particular application, if knowledge regarding which templates are more frequent is available, this prior can be assigned accordingly. In the absence of such prior knowledge, p(T_i) can be assigned the uniform distribution, p(T_i) = 1/N_T, in which case the MAP estimator reduces to the maximum likelihood estimator (MLE).
Now, we discuss how to compute p(C_E | T_i). The recognition result of each string in the template T_i from which E has been derived can be taken as a random variable following a Binomial(c_ij, p_i) distribution. The recognition results of all strings in template T_i are assumed to be independent and identically distributed. The probability that any string from U_T \ T_i, i.e., filled-in data, coincides with a string occurring in other templates is q. The probability p(C_E | T_i) could then be computed exactly as a combination of these two distributions, but that computation is too complicated. Under the assumption p >> q, p(c_ei | T_i) can be simplified as follows: if c_ei ≤ c_ij, all recognized occurrences are considered to come from the template; if c_ei > c_ij, then c_ij occurrences are assumed to come from the template, while the remaining c_ei − c_ij are strings accidentally falling in U_T.

    p(c_ei | T_i) = { C(c_ij, c_ei) · p_i^(c_ei) · (1 − p_i)^(c_ij − c_ei)   if c_ei ≤ c_ij
                    { p_i^(c_ij) · q^(c_ei − c_ij)                           if c_ei > c_ij    (3)

where C(c_ij, c_ei) denotes the binomial coefficient.
Hence, given a template T_i, the conditional probability of observing data E with frequency vector C_E can be computed approximately as:

    p(C_E | T_i) = ∏_{l=1..n} p(c_el | T_i)    (4)
With Equations 3 and 4, the MAP estimate of the template can be computed. We use the logarithm of the probability:

    î_MAP(E) = argmax_i log[ p(C_E | T_i) p(T_i) ]    (5)

which is faster to compute and avoids underflow problems.
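Equations 3-5 can be combined into a short scoring routine. The sketch below simplifies by assuming a single recognition probability p shared by all strings (rather than per-template estimates p_i) and takes q as given; the function and variable names are our own:

```python
import math
from collections import Counter

def log_likelihood(c_e, c_t, p, q):
    """log p(c_ei | T_i) per Equation 3: Binomial(c_t, p) recognition of the
    template's occurrences, with surplus occurrences attributed to fill-in
    data at probability q (requires 0 < p < 1 and 0 < q < 1)."""
    if c_e <= c_t:
        return (math.log(math.comb(c_t, c_e))
                + c_e * math.log(p) + (c_t - c_e) * math.log(1.0 - p))
    return c_t * math.log(p) + (c_e - c_t) * math.log(q)

def map_template(doc_counts, templates, p, q, priors=None):
    """Equations 2, 4, 5: return the template name maximizing
    log p(C_E | T_i) + log p(T_i).  Strings absent from every template add
    the same q-penalty to every score, so they do not affect the argmax,
    matching the rule that strings outside U_T are ignored."""
    best_name, best_score = None, -math.inf
    for name, t_counts in templates.items():
        prior = priors[name] if priors is not None else 1.0 / len(templates)
        score = math.log(prior)
        for s in set(t_counts) | set(doc_counts):
            score += log_likelihood(doc_counts.get(s, 0), t_counts.get(s, 0), p, q)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

With p >> q, a document matching template A's counts is scored far above a template whose surplus or missing strings must be explained by the unlikely fill-in channel.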
4.2 Considerations
The probabilistic model allows for situations in which an important string is not properly recognized by OCR. At the same time, the model leverages the lengths of the words. The p_i used in our experiments is obtained by OCRing the known templates and comparing the OCR output with the ground truth of each template to obtain an estimate of p_i. It can be argued that this p_i does not reflect the actual error rate on a particular input E; however, it is useful for expressing the relative error rates among the templates.
5 EXPERIMENTAL PERFORMANCE EVALUATION
In this section, the proposed algorithm is evaluated on
two data sets. Some properties of the data sets are dis-
cussed first, followed by a brief introduction to other
algorithms we compare against. Then the experiment
results are shown, followed by detailed evaluation and
discussion.
5.1 Data Sets
Two data sets are used in the experiments. The first is the Special Database II from the National Institute of Standards and Technology (NIST). It contains 12 different sets of tax forms from the IRS 1040 Package X for the year 1988, totaling 5,590 IRS tax documents; we refer to it as the NIST-IRS data set. These documents include Forms 1040, 2106, 2441, 4562, and 6251, together with Schedules A, B, C, D, E, F, and SE. They have 20 different form faces and are all binary images of consistent quality. The forms have small pairwise similarity, as shown in Figures 1(a) and 1(b), in terms of both image features and the actual string contents they contain.
The other data set contains 4,576 event report forms from the US Food and Drug Administration (FDA). This set consists of 12 different form faces, which are actual event reports. These forms are also binary images, scanned at different resolutions from 199 dpi to 300 dpi. Documents following the same templates were produced at different times, with various fill-in data. Most of the documents are noisy, with various skew angles and annotations. Some of the templates are very similar in terms of both image layout and string contents, making them difficult to differentiate; Figures 1(c) and 1(d) illustrate a pair of similar templates.
5.2 Comparison Methods
The detailed implementation and set up for the com-
parison methods, Cosine Similarity, SVD, and Zone-
Seg are explained as follows.
5.2.1 Cosine Similarity
Given two document templates T_i and T_j, all terms (i.e., strings) that occur in the whole collection are first collected to obtain the term-frequency vectors C_i and C_j, and their cosine similarity is computed with Equation 1.
Since term frequencies cannot be negative, the similarity between T_i and T_j is bounded in the interval [0, 1]: when T_i and T_j have no common terms, similarity(T_i, T_j) = 0, and if T_i and T_j are exactly the same, similarity(T_i, T_j) = 1. To use cosine similarity to identify the template of a document, each document template is represented as a term-frequency vector C_i (i = 1, 2, ..., N_T), whose dimension is the total number of distinct terms that appear in the whole set of templates. For each unlabeled test document E, the corresponding term-frequency vector C_E is computed, and the similarities similarity(E, T_i) between the test document and the templates are computed. The template T_i with the largest cosine similarity to the unlabeled test document is taken as the matching template.
RobustTemplateIdentificationofScannedDocuments
107
(a) A NIST template (b) Another NIST template (c) An FDA template (d) Another FDA template
Figure 1: Sample NIST and FDA templates. Notice that (a) and (b) are significantly different, while (c) and (d) are quite similar.
5.2.2 Singular Value Decomposition
The Singular Value Decomposition (SVD) approach is implemented as follows. All terms on templates T_1, T_2, ..., T_NT are used to generate the term-document matrix. We experimented with several common weighting functions; the two that gave the best accuracy were used in the experiment. The log weighting function is used as the local weighting function: for term i in template T_j, the weight is l_(i,Tj) = log(c_ij + 1), with c_ij being the count of occurrences of term i in template T_j. The global weighting used is the entropy weighting function, which is reported to work well with LSI. A term i's global weight is computed as

    g_i = 1 + Σ_j [ p_(i,Tj) · log p_(i,Tj) / log N_T ]

where p_(i,Tj) = c_ij / c_i(UT), and c_i(UT) is the total count of occurrences of term i in the whole collection U_T. All ranks of singular values have been tested, and the half-rank and full-rank results are reported.
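The log-entropy weighting just described can be sketched per term as follows; the function name and the list-of-counts input convention are our own:

```python
import math

def log_entropy_weights(term_counts, n_templates):
    """For one term i: local log weights l_{i,j} = log(c_ij + 1) and global
    entropy weight g_i = 1 + sum_j p_ij log(p_ij) / log(N_T), where
    p_ij = c_ij / (total count of term i).  term_counts holds the term's
    count in each of the N_T templates (zeros allowed)."""
    total = sum(term_counts)
    local = [math.log(c + 1) for c in term_counts]
    g = 1.0
    if total > 0 and n_templates > 1:
        for c in term_counts:
            if c > 0:
                p = c / total
                g += p * math.log(p) / math.log(n_templates)
    return local, g
```

A term spread evenly over all templates gets global weight 0 (uninformative), while a term confined to a single template gets weight 1, which is why entropy weighting pairs well with LSI here.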
5.2.3 Image Feature Similarity / ZoneSeg
Each template T_i and the input document E are treated as images, and a grid of patches of size m × n is superimposed on each image. For each small patch, if the fraction of black pixels in the patch exceeds a threshold K_threshold = 5%, the patch is represented as 1, and as 0 otherwise. Reading the patches in the same fixed order, each image can then be represented as a string of 0s and 1s. The similarity between a template T_i and an input document E is defined by the Levenshtein distance between the two strings: the more visually similar the two images are, the smaller the Levenshtein distance will be. The Levenshtein distance between the test document E and each template T_i is computed, and the template that minimizes the distance is taken as the identified template. Different image patch sizes were also experimented with, and the size m = 36 pixels, n = 24 pixels, which renders the best results, is used in the experiment.
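A minimal sketch of this comparison method could look as follows, assuming the image is given as a list of rows of 0/1 pixel values (1 = black); the names and the pure-Python edit distance are our own choices:

```python
def zone_string(image, patch_h=36, patch_w=24, threshold=0.05):
    """Encode a binary image (rows of 0/1 ints, 1 = black) as a bit string:
    one character per patch, '1' if the black-pixel ratio exceeds the
    threshold (K_threshold = 5%)."""
    h = len(image)
    bits = []
    for top in range(0, h, patch_h):
        for left in range(0, len(image[0]), patch_w):
            patch = [row[left:left + patch_w] for row in image[top:top + patch_h]]
            area = sum(len(r) for r in patch)
            black = sum(sum(r) for r in patch)
            bits.append('1' if black > threshold * area else '0')
    return ''.join(bits)

def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]
```

The identified template is then the one whose zone string has the smallest Levenshtein distance to the test document's zone string.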
5.3 Experiment Design
The experiments were conducted on a set of Intel Core Duo machines, using Python 2.7 and MATLAB 2009b. For OCR, we used the open-source Tesseract 3.01 project.
5.4 Results and Analysis
In this section, we compare the performance of our approach with cosine similarity, SVD, and ZoneSeg. The performance is reported in terms of accuracy. We evaluate the accuracy of a document identification algorithm with two metrics: total accuracy and average per-template accuracy. The total accuracy ACC_t of an algorithm is defined as the ratio of the number of correctly identified documents to the total number of documents. The average per-template accuracy ACC_a is defined as the mean of the accuracies obtained on the documents of each template. While ACC_t demonstrates the overall performance of each method on a data set, ACC_a shows the algorithm's robustness across different templates. The corresponding results on the NIST-IRS and FDA data sets are shown in Figures 2 and 3, respectively.
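The two metrics can be computed as follows (a small sketch with hypothetical names, given parallel lists of true and predicted template labels):

```python
from collections import defaultdict

def accuracies(true_labels, predicted_labels):
    """ACC_t: correctly identified documents / all documents.
    ACC_a: unweighted mean of the per-template accuracies."""
    per_template = defaultdict(lambda: [0, 0])  # template -> [correct, total]
    for truth, pred in zip(true_labels, predicted_labels):
        per_template[truth][1] += 1
        per_template[truth][0] += int(truth == pred)
    acc_t = (sum(c for c, _ in per_template.values())
             / sum(n for _, n in per_template.values()))
    acc_a = sum(c / n for c, n in per_template.values()) / len(per_template)
    return acc_t, acc_a
```

Because ACC_a weights every template equally regardless of its sample size, the two metrics diverge exactly when an algorithm's accuracy varies across templates, which is the effect analyzed in the last observation below.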
From Figures 2 and 3, we make the following observations.

1. All algorithms achieve good results on the NIST-IRS data set.
2. For the NIST-IRS data set, all algorithms achieve 100% ACC_t and ACC_a accuracy, as expected, except for ZoneSeg. However, the result of the ZoneSeg algorithm is still consistent with the results reported in (Esser et al., 2011).
3. For the FDA data set, our algorithm achieves the best ACC_t and ACC_a accuracy.
4. The SVD method performs better on full rank than on half rank on both data sets.
5. For the NIST-IRS data set, the ACC_a result is better than the ACC_t result for all algorithms; for the FDA data set, ACC_t is consistently better than ACC_a for all algorithms.
The first observation is expected: the accuracy of document identification algorithms depends on the data set. When the differences among the templates in a data set are significant, it is easier to identify the template, and the accuracy will therefore be high. Since the NIST-IRS data set's templates are significantly different both at the image level and in their contents, high accuracy was expected for most algorithms.

The second observation indicates that the differences between the templates are significant. Both the algorithms based on OCR output and those based on image-level features achieve good results, all over 98%. However, the algorithms based on OCR output achieve better accuracy than those based on image-level features, because the differences in string-level features are more significant than the differences in image-level features. The NIST-IRS data set also has less fill-in data in common, which makes the test images easier to identify using image-level features; even so, image noise, including scanning skew and translation, still affects the accuracy.
The third observation indicates that the proposed probability-based algorithm works very well even on templates that are very similar. Unlike the other algorithms, which mainly measure the similarity between a query and a template, our model measures both the similarity and the dissimilarity between them: it rewards evidence that an unlabeled test document matches a template, and it probabilistically penalizes mismatches. The accuracy of the cosine similarity and SVD methods is not as good as that of our algorithm because they lack such a penalty mechanism and cannot account for the possibility that fill-in data coincide with a term used in another template. The accuracy of ZoneSeg is lower on the FDA set, indicating that its performance is sensitive to the quality of the documents and to the similarity between templates.
The fourth observation indicates that full-rank SVD works better than lower ranks. In most information retrieval tasks, only a few ranks are usually needed to approximate the data in feature space accurately enough; for this task, however, a lower-rank approximation does not capture the full-rank behavior well.
The last observation indicates an important result. Since ACC_a is the average of the per-template accuracies, the two metrics would give the same result if an algorithm performed consistently on all templates. However, when the number of samples per template differs and the algorithm's accuracy varies across templates, the two metrics take different values. For the NIST-IRS data set, ACC_a is consistently higher than ACC_t because, on several templates with smaller sample sizes, all algorithms still achieve high accuracy, mostly higher than on the templates with larger sample sizes. For the FDA data set, by contrast, there are a few templates with small sample sizes for which the accuracy is lower; the templates that are harder to identify happen to have fewer test documents. Note that our probabilistic approach has the smallest decrease from ACC_t to ACC_a among all methods, which indicates that it is the most consistent across different templates.
We also studied the 23 failed cases out of the 4,576 tests in the FDA data set and found that they generally fall into two situations: (1) the images are severely malformed (e.g., more than half of the document was not scanned), so only the portions of the images that are identical across templates are kept and OCRed; (2) for a pair of templates that share almost the same structure and differ in 10 or fewer words, the test was unable to determine the template, because the scores computed by Equation 5 for the pair were too close, and simply choosing the maximum score results in incorrect classification. To handle these cases, a threshold on the likelihood should be set based on the application and the similarity among the templates; any test file whose likelihood falls below this threshold should be verified manually.

With respect to speed, the experiments show that our method is comparable to cosine similarity and SVD, while ZoneSeg's speed varies with the patch size. In the experiments, the image patch size that gives the best accuracy is significantly slow, since a typical test image is converted into a string of length greater than 5,000, which causes the Levenshtein distance computation to take a long time.
6 CONCLUSIONS AND FUTURE WORK
A probabilistic method for identifying the document templates of noisy scanned documents has been studied. This method works well with the low-accuracy OCR results produced from noisy documents. Through experiments and analysis, the proposed method is shown to perform consistently across different template sets, and to work well even when document templates are very similar. We recognize that certain documents contain, in addition to text, non-text content to which our technique does not apply. The intent is that the technique in this paper be incorporated into larger application systems that handle both text recognition (our technique) and non-text recognition (image feature-based techniques).

Our future research will focus on incorporating other image-feature-based techniques with our method and on automatically identifying fill-in data based on the proposed method.
REFERENCES
Blei, D. M., Ng, A., and Jordan, M. (2003). Latent dirichlet
allocation. J. Mach. Learn. Res., 3:993–1022.
Cunningham, H., Maynard, D., Bontcheva, K., and Tablan, V. (2002). GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.
Deerwester, S. (1988). Improving information retrieval with latent semantic indexing. In Proceedings of the 51st ASIS Annual Meeting, ASIS '88.
Esser, D., Schuster, D., Muthmann, K., Berger, M., and
Schill, A. (2011). Automatic indexing of scanned
documents - a layout-based approach. In Document
Recognition and Retrieval XVIII.
Hofmann, T. (1999). Probabilistic latent semantic indexing.
In Proceedings of the 22nd Annual International ACM
SIGIR Conference on Research and Development in
Information Retrieval, SIGIR ’99, pages 50–57.
Hu, J., Kashi, R., and Wilfong, G. (2000). Comparison and
classification of documents based on layout similarity.
Inf. Retr., 2:227–243.
Jinhui Liu, A. K. J. (2000). Image-based form document
retrieval. Pattern Recognition, 33:503–513.
Lu, Y. and Tan, C. L. (2004). Information retrieval in docu-
ment image databases. IEEE Transactions on Knowl-
edge and Data Engineering, 16:1398–1410.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flan-
nery, B. P. (2007). Numerical Recipes 3rd Edition:
The Art of Scientific Computing. Cambridge Univer-
sity Press.
Salton, G. (1986). Another look at automatic text-retrieval
systems. Commun. ACM, 29:648–656.
Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18.
Shin, C., Doermann, D., and Rosenfeld, A. (2001). Classi-
fication of document pages using structure-based fea-
tures. International Journal on Document Analysis
and Recognition, 3:232–247.
Tan, P.-N., Steinbach, M., and Kumar, V. (2005). Introduc-
tion to Data Mining. Addison-Wesley.
Jayram, T. S., Krishnamurthy, R., Raghavan, S., Vaithyanathan, S., and Zhu, H. (2006). Avatar information extraction system. IEEE Data Engineering Bulletin, 29.
Zheng, Y., Li, H., and Doermann, D. (2005). A parallel-line
detection algorithm based on HMM decoding. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 27:777–792.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
110