Authors:
Daniel Esser; Daniel Schuster; Klemens Muthmann and Alexander Schill
Affiliation:
TU Dresden, Germany
Keyword(s):
Information Extraction, Few-exemplar Learning, One-shot Learning, Business Documents.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Data Mining; Databases and Information Systems Integration; Enterprise Information Systems; Enterprise Resource Planning; Enterprise Software Technologies; Performance Evaluation and Benchmarking; Sensor Networks; Signal Processing; Simulation and Modeling; Simulation Tools and Platforms; Soft Computing; Software Engineering
Abstract:
The automatic extraction of relevant information from business documents (sender, recipient, date, etc.) is a valuable task in the application domain of document management and archiving. Although current scientific and commercial self-learning solutions for document classification and extraction work well, they still require considerable on-site configuration effort from domain experts and administrators. Small office/home office (SOHO) users and private individuals often do not benefit from such systems. Low extraction effectiveness, especially in the starting period when only a small number of example documents is available, together with the high effort needed to annotate new documents, drastically lowers their willingness to adopt a self-learning information extraction system. We therefore present a solution for information extraction that fits the requirements of these users. It adapts the idea of one-shot learning from computer vision to the domain of business document processing and requires only a minimal number of training examples to reach competitive extraction effectiveness. In our evaluation on a set of 12,500 documents covering 399 different layouts/templates, the approach achieves an F1 score of 88% on 10 commonly used fields such as document type, sender, recipient, and date. It already reaches an F1 score of 78% with only one document of each template in the training set.