FILLING THE GAPS USING GOOGLE 5-GRAMS CORPUS

Costin-Gabriel Chiru, Andrei Hanganu, Traian Rebedea, Stefan Trausan-Matu

Abstract

In this paper we present a text recovery method based on probabilistic post-recognition processing of the output of an Optical Character Recognition (OCR) system. The proposed method attempts to fill in the gaps of missing text that result from recognizing degraded documents. For this task, a corpus of n-grams of up to five words (5-grams) provided by Google is used. After presenting the general problem and alternative solutions, we describe several heuristics for applying this corpus to the task. These heuristics were validated in a set of experiments, which are discussed together with the results obtained.
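
The general idea of filling a gap from n-gram counts can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the heuristics, so the voting rule, the backoff to shorter n-grams, the function names (`fill_gap`, `fill_gap_with_backoff`) and the toy counts below are all assumptions made for illustration, and the table is a stand-in for the real Google corpus.

```python
from collections import Counter

# Toy stand-in for an n-gram table: word tuple -> count.
# The counts are invented for illustration; they are NOT Web 1T data.
TOY_NGRAMS = {
    ("the", "quick", "brown", "fox", "jumps"): 120,
    ("the", "quick", "red", "fox", "jumps"): 15,
    ("a", "quick", "brown", "fox", "runs"): 8,
    ("quick", "brown", "fox"): 300,
    ("quick", "red", "fox"): 40,
}

GAP = None  # marker for the word the OCR system failed to recognize


def fill_gap(window, ngrams):
    """Rank candidate words for the single GAP position in `window`.

    Every n-gram of the same length that agrees with the window on all
    known positions votes for its word at the gap, weighted by its count.
    Returns a list of (word, score) pairs, best first.
    """
    window = tuple(window)
    if window.count(GAP) != 1:
        raise ValueError("window must contain exactly one gap")
    gap_index = window.index(GAP)
    votes = Counter()
    for ngram, count in ngrams.items():
        if len(ngram) != len(window):
            continue
        if all(w is GAP or w == g for w, g in zip(window, ngram)):
            votes[ngram[gap_index]] += count
    return votes.most_common()


def fill_gap_with_backoff(window, ngrams):
    """Try the full window first, then progressively shorter contexts.

    Hypothetical backoff rule: when nothing matches, drop the context
    word that lies farthest from the gap and retry.
    """
    window = tuple(window)
    while len(window) >= 2:
        ranked = fill_gap(window, ngrams)
        if ranked:
            return ranked
        gap_index = window.index(GAP)
        words_right = len(window) - 1 - gap_index
        if gap_index >= words_right:
            window = window[1:]   # drop the leftmost context word
        else:
            window = window[:-1]  # drop the rightmost context word
    return []


if __name__ == "__main__":
    print(fill_gap_with_backoff(("the", "quick", GAP, "fox", "jumps"), TOY_NGRAMS))
    print(fill_gap_with_backoff(("a", "quick", GAP, "fox", "leaps"), TOY_NGRAMS))
```

In the second call no 5-gram or 4-gram matches the context, so the sketch falls back to the 3-gram window around the gap. A real system would query an indexed corpus rather than scan a dictionary, and would likely combine the counts with the OCR engine's own character-level evidence.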



Paper Citation


in Harvard Style

Chiru C., Hanganu A., Rebedea T. and Trausan-Matu S. (2010). FILLING THE GAPS USING GOOGLE 5-GRAMS CORPUS. In Proceedings of the 5th International Conference on Software and Data Technologies - Volume 2: ICSOFT, ISBN 978-989-8425-23-2, pages 438-443. DOI: 10.5220/0002932204380443


in Bibtex Style

@conference{icsoft10,
author={Costin-Gabriel Chiru and Andrei Hanganu and Traian Rebedea and Stefan Trausan-Matu},
title={FILLING THE GAPS USING GOOGLE 5-GRAMS CORPUS},
booktitle={Proceedings of the 5th International Conference on Software and Data Technologies - Volume 2: ICSOFT},
year={2010},
pages={438-443},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002932204380443},
isbn={978-989-8425-23-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Conference on Software and Data Technologies - Volume 2: ICSOFT
TI - FILLING THE GAPS USING GOOGLE 5-GRAMS CORPUS
SN - 978-989-8425-23-2
AU - Chiru C.
AU - Hanganu A.
AU - Rebedea T.
AU - Trausan-Matu S.
PY - 2010
SP - 438
EP - 443
DO - 10.5220/0002932204380443