Modelspace - Cooperative Document Information Extraction in Flexible Hierarchies

Daniel Schuster, Daniel Esser, Klemens Muthmann, Alexander Schill

2015

Abstract

Business document indexing for ordered filing of documents is a crucial task for every company. Since this is tedious, error-prone work, automatic or at least semi-automatic approaches are highly valuable. One approach for semi-automated indexing of business documents uses self-learning information extraction methods based on user feedback. While these methods require no management of complex indexing rules, learning from user feedback requires each user to first provide a number of correct extractions before getting appropriate automatic results. To eliminate this cold-start problem, we propose a cooperative approach to document information extraction involving dynamic hierarchies of extraction services. We provide strategies for deciding when to contact another information extraction service within the hierarchy, methods to combine results from different sources, and aging and split strategies to reduce the size of cooperatively used indexes. An evaluation with a large number of real-world business documents shows the benefits of our approach.


Paper Citation


in Harvard Style

Schuster D., Esser D., Muthmann K. and Schill A. (2015). Modelspace - Cooperative Document Information Extraction in Flexible Hierarchies. In Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-096-3, pages 321-329. DOI: 10.5220/0005376403210329


in Bibtex Style

@conference{iceis15,
author={Daniel Schuster and Daniel Esser and Klemens Muthmann and Alexander Schill},
title={Modelspace - Cooperative Document Information Extraction in Flexible Hierarchies},
booktitle={Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2015},
pages={321-329},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005376403210329},
isbn={978-989-758-096-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Modelspace - Cooperative Document Information Extraction in Flexible Hierarchies
SN - 978-989-758-096-3
AU - Schuster D.
AU - Esser D.
AU - Muthmann K.
AU - Schill A.
PY - 2015
SP - 321
EP - 329
DO - 10.5220/0005376403210329