Few-exemplar Information Extraction for Business Documents

Daniel Esser, Daniel Schuster, Klemens Muthmann, Alexander Schill

Abstract

The automatic extraction of relevant information from business documents (sender, recipient, date, etc.) is a valuable task in document management and archiving. Although current scientific and commercial self-learning solutions for document classification and extraction work well, they still require considerable on-site configuration by domain experts and administrators. Small office/home office (SOHO) users and private individuals often do not benefit from such systems. Low extraction effectiveness in the starting period, caused by the small number of initially available example documents, together with the high effort of annotating new documents, drastically lowers their willingness to use a self-learning information extraction system. We therefore present a solution for information extraction that fits the requirements of these users. It adapts the idea of one-shot learning from computer vision to the domain of business document processing and requires only a minimal number of training examples to reach competitive extraction effectiveness. In our evaluation on a set of 12,500 documents covering 399 different layouts/templates, the approach achieves an F1 score of 88% on 10 commonly used fields such as document type, sender, recipient, and date. With only one document of each template in the training set, it already reaches an F1 score of 78%.



Paper Citation


in Harvard Style

Esser D., Schuster D., Muthmann K. and Schill A. (2014). Few-exemplar Information Extraction for Business Documents. In Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-027-7, pages 293-298. DOI: 10.5220/0004946702930298


in Bibtex Style

@conference{iceis14,
author={Daniel Esser and Daniel Schuster and Klemens Muthmann and Alexander Schill},
title={Few-exemplar Information Extraction for Business Documents},
booktitle={Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2014},
pages={293-298},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004946702930298},
isbn={978-989-758-027-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Few-exemplar Information Extraction for Business Documents
SN - 978-989-758-027-7
AU - Esser D.
AU - Schuster D.
AU - Muthmann K.
AU - Schill A.
PY - 2014
SP - 293
EP - 298
DO - 10.5220/0004946702930298