High Level Shape Representation in Printed Gujarati Character

Mukesh M. Goswami, Suman K. Mitra

2017

Abstract

This paper presents the extraction and identification of high-level strokes (HLS) from printed Gujarati characters. The HLS feature describes a character as a sequence of predefined high-level strokes. Such a high-level shape representation enables approximate shape similarity computation between characters and can easily be extended to the word level. Shape-similarity-based character and word matching has extensive applications in word-spotting-based document image retrieval and character classification. The proposed features were therefore tested on a printed Gujarati character database consisting of 12000 samples from 42 different symbol classes. Classification is performed using a k-nearest neighbor classifier with a shape similarity measure. In addition, a shape-similarity-based printed Gujarati word matching experiment is reported on a small word image database, and the initial results are encouraging.
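
As a rough illustration of how a stroke-sequence representation can drive k-nearest neighbor classification, the sketch below scores two sequences with a global alignment (Needleman-Wunsch style) and lets the k most similar training samples vote. The abstract does not specify the similarity measure, the stroke alphabet, or the scoring weights, so the function names, the stroke symbols ("V", "C", "H", "L"), the class labels, and the match/mismatch/gap values are all assumptions made for illustration only.

```python
from collections import Counter


def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two stroke sequences.

    The weights are illustrative assumptions, not values from the paper.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def knn_classify(query, training, k=3):
    """Label a query sequence by majority vote among its k most similar samples.

    training is a list of (stroke_sequence, label) pairs.
    """
    ranked = sorted(
        training,
        key=lambda item: alignment_score(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical stroke symbols and class labels, for illustration only.
training = [
    (["V", "C"], "ka"),
    (["V", "C", "H"], "ka"),
    (["H", "L"], "ga"),
    (["H", "L", "L"], "ga"),
]
predicted = knn_classify(["V", "C", "H"], training, k=3)
```

Because the score tolerates insertions, deletions, and substitutions of strokes, two characters whose sequences differ by a stroke or two still compare as similar, which is one way to read the "approximate shape similarity" described above. The same scoring could in principle be applied to concatenated stroke sequences for word-level matching.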

Paper Citation


in Harvard Style

Goswami M. and Mitra S. (2017). High Level Shape Representation in Printed Gujarati Character. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-222-6, pages 418-425. DOI: 10.5220/0006191104180425


in Bibtex Style

@conference{icpram17,
author={Mukesh M. Goswami and Suman K. Mitra},
title={High Level Shape Representation in Printed Gujarati Character},
booktitle={Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2017},
pages={418-425},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006191104180425},
isbn={978-989-758-222-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - High Level Shape Representation in Printed Gujarati Character
SN - 978-989-758-222-6
AU - Goswami M.
AU - Mitra S.
PY - 2017
SP - 418
EP - 425
DO - 10.5220/0006191104180425