A Multi-fonts Kanji Character Recognition Method for Early-modern Japanese Printed Books with Ruby Characters

Taeka Awazu, Manami Fukuo, Masami Takata, Kazuki Joe

Abstract

The website of the National Diet Library in Japan makes many early-modern (AD 1868-1945) Japanese printed books available to the public, but full-text search is essentially impossible. To support advanced search of this historical literature, automatic textualization of the page images is required. However, the ruby system, which is peculiar to Japanese books, poses a serious obstacle to textualization. When existing OCR software is applied to early-modern Japanese printed books, the recognition rate is extremely low. To address this problem, we have previously proposed a multi-font Kanji character recognition method using the PDC feature and an SVM. In this paper, we propose a ruby character removal method for early-modern Japanese printed books using genetic programming, and evaluate our multi-font Kanji character recognition method on 1,000 types of early-modern Japanese printed Kanji characters.



Paper Citation


in Harvard Style

Awazu T., Fukuo M., Takata M. and Joe K. (2014). A Multi-fonts Kanji Character Recognition Method for Early-modern Japanese Printed Books with Ruby Characters. In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-018-5, pages 637-645. DOI: 10.5220/0004825306370645


in Bibtex Style

@conference{icpram14,
author={Taeka Awazu and Manami Fukuo and Masami Takata and Kazuki Joe},
title={A Multi-fonts Kanji Character Recognition Method for Early-modern Japanese Printed Books with Ruby Characters},
booktitle={Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2014},
pages={637-645},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004825306370645},
isbn={978-989-758-018-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - A Multi-fonts Kanji Character Recognition Method for Early-modern Japanese Printed Books with Ruby Characters
SN - 978-989-758-018-5
AU - Awazu T.
AU - Fukuo M.
AU - Takata M.
AU - Joe K.
PY - 2014
SP - 637
EP - 645
DO - 10.5220/0004825306370645