Entity Matching in OCRed Documents with Redundant Databases
Nihel Kooli, Abdel Belaïd
2015
Abstract
This paper presents an approach to entity recognition in documents processed by OCR (Optical Character Recognition). Recognition is formulated as the task of matching entities in a database with their representations in a document. As a pre-processing step, entity resolution is performed on the database to obtain a better representation of the entities, using a statistical model based on record linkage and record merge phases. Because documents recognized by OCR can contain noisy data and altered structure, an adapted method is proposed that retrieves entities from their structures while tolerating possible OCR errors. A modified version of EROCS is applied to this problem by adapting its notion of segments to the blocks provided by the OCR; it uses these document segments to match the document to its corresponding entities. For efficiency, data labeling is applied to the document in order to filter the entities and segments that are compared. An evaluation on business documents shows a significant improvement in matching rates over EROCS.
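To make the idea of matching database entities against OCR blocks more concrete, the following is a minimal sketch and not the authors' method. It assumes each entity is a dictionary of attribute values and each OCR block is a plain string, and it replaces the EROCS scoring with a simple count of entity terms that approximately match a block token. The function names, the 0.8 similarity threshold, and the sample data are all hypothetical; the similarity measure comes from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Return True when two tokens are close enough to absorb small OCR errors."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def block_score(entity: dict, block: str, threshold: float = 0.8) -> int:
    """Count the entity terms that have an approximate match among the block's tokens."""
    tokens = block.split()
    terms = [t for value in entity.values() for t in value.split()]
    return sum(any(similar(term, tok, threshold) for tok in tokens) for term in terms)


def best_match(entities: dict, blocks: list, threshold: float = 0.8):
    """Return (entity_id, block_index, score) for the highest-scoring pair, or None."""
    best = None
    for entity_id, entity in entities.items():
        for i, block in enumerate(blocks):
            score = block_score(entity, block, threshold)
            if score > 0 and (best is None or score > best[2]):
                best = (entity_id, i, score)
    return best


# Hypothetical database records and OCR blocks; "lndustries" simulates
# an OCR confusion between "I" and "l".
entities = {
    "e1": {"name": "Dupont Industries", "city": "Nancy"},
    "e2": {"name": "Martin Logistics", "city": "Metz"},
}
blocks = ["Invoice 2014", "Dupont lndustries 54000 Nancy"]
result = best_match(entities, blocks)
print(result)
```

In this sketch the misrecognized token still counts as a match because only one character differs, so the first entity is selected on the second block. A fuller implementation would weight terms by how discriminative they are and would restrict comparisons using labeled data, as the abstract describes.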
References
- Bilenko, M. (2006). Adaptive blocking: Learning to scale up record linkage. In Proceedings of the 6th IEEE International Conference on Data Mining, pages 87-96.
- Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., and Fienberg, S. (2003). Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16-23.
Table 2: Entity matching rates.
Paper Citation
in Harvard Style
Kooli N. and Belaïd A. (2015). Entity Matching in OCRed Documents with Redundant Databases. In Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-076-5, pages 165-172. DOI: 10.5220/0005177301650172
in Bibtex Style
@conference{icpram15,
author={Nihel Kooli and Abdel Belaïd},
title={Entity Matching in OCRed Documents with Redundant Databases},
booktitle={Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2015},
pages={165-172},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005177301650172},
isbn={978-989-758-076-5},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Entity Matching in OCRed Documents with Redundant Databases
SN - 978-989-758-076-5
AU - Kooli N.
AU - Belaïd A.
PY - 2015
SP - 165
EP - 172
DO - 10.5220/0005177301650172