cases generally fall into two situations: (1) the images
are severely malformed (e.g., more than half of the
document was not scanned), so only the identical portions
of the images are kept and OCRed; (2) for a pair of
templates that share almost the same structure and differ
by 10 or fewer words, the test was unable to determine
the template because the scores computed by Equation 5
for the pair were too close. In such cases, simply choosing
the maximum score results in incorrect classification. To
handle these cases, a threshold on the likelihood should
be set based on the application and the similarity among
the templates. Any test file whose likelihood is below this
threshold should be verified manually.
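The thresholding rule described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `select_template`, the `scores` dictionary (template name mapped to its Equation 5 score), and the parameters `min_score` and `min_margin` are all hypothetical. The margin check between the two best scores is an added assumption, motivated by the second failure case, and is not part of the rule stated in the text.

```python
from typing import Dict, Optional, Tuple


def select_template(
    scores: Dict[str, float],
    min_score: float,
    min_margin: float = 0.0,
) -> Tuple[Optional[str], bool]:
    """Return (best_template, needs_manual_review).

    scores     -- hypothetical mapping: template name -> likelihood score
    min_score  -- application-specific threshold on the best score
    min_margin -- assumed extra check: minimum gap between the top two scores
    """
    if not scores:
        # Nothing to choose from, so a human has to look at the file.
        return None, True

    # Rank templates from the highest score to the lowest.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_score = ranked[0]

    # Case stated in the text: the best likelihood is below the threshold.
    too_low = best_score < min_score

    # Assumed extension: the two best templates are nearly indistinguishable.
    too_close = len(ranked) > 1 and (best_score - ranked[1][1]) < min_margin

    return best_name, too_low or too_close
```

A file for which the second returned value is `True` would be routed to manual verification. Suitable values for both parameters depend on the application and on how similar the templates are, so they would have to be tuned per template set.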
With respect to speed, the experiments show that our
method’s speed is comparable to that of cosine similarity
and SVD, while the speed of ZoneSeg varies with the size
of the patch. In the experiments, ZoneSeg is significantly
slower at the image patch size that gives the best accuracy,
since a typical test image is converted into a string with
a length greater than 5,000. As a result, the Levenshtein
distance computation takes a long time to execute.
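To illustrate why long strings make this step expensive, the sketch below shows the textbook dynamic-programming computation of the Levenshtein distance. It is an assumption that a comparable quadratic-time algorithm is used in ZoneSeg; the source does not show that implementation. The number of table cells filled is the product of the two string lengths, so two strings of length 5,000 require 25,000,000 cell updates per comparison.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (unit-cost insert, delete, substitute)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] is the distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        # current[0] is the distance between a[:i] and the empty string.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current

    return previous[-1]
```

Only two rows are kept in memory, but the running time remains proportional to `len(a) * len(b)`, which is the cost that dominates for long strings.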
6 CONCLUSIONS AND FUTURE WORK
A probabilistic method for identifying document templates
in noisy scanned documents has been studied. This method
works well with the low-accuracy OCR results produced
from noisy documents. Through experiment and analysis,
the proposed method is shown to perform consistently
across different template sets and to work well even when
document templates are very similar. We recognize that
certain documents will contain, in addition to text,
non-text content to which our technique does not apply.
The intent is that the technique in this paper will be
incorporated into larger application systems that handle
both text recognition (our technique) and non-text
recognition (image feature-based techniques).
Our future research will focus on combining other image
feature-based techniques with our method and on
automatically identifying fill-in data based on the
proposed method.
KDIR 2012 - International Conference on Knowledge Discovery and Information Retrieval