Authors:
Camelia Lemnaru; Andreea Sin-Neamțiu; Mihai-Andrei Vereș and Rodica Potolea
Affiliation:
Technical University of Cluj-Napoca, Romania
Keyword(s):
Handwriting Recognition, Historical Document, Hierarchical Classifier, Dictionary Analysis, Kurrent Schrift.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Clustering and Classification Methods; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Mining High-Dimensional Data; Pre-Processing and Post-Processing for Data Mining; Structured Data Analysis and Statistical Methods; Symbolic Systems
Abstract:
Information contained in historical sources is highly important for historical research; yet extracting it manually from documents written in difficult scripts is often an expensive and time-consuming process. This paper proposes a modular system for transcribing documents written in a challenging script (German Kurrent Schrift). The solution comprises three main stages: Document Processing, Word Processing and Word Selector, chained together in a linear pipeline. The system is currently under development, with several modules in each stage already implemented and evaluated. The main focus so far has been on the character recognition module, for which a hierarchical classifier is proposed. Preliminary evaluations of the character recognition module have yielded an overall character recognition rate of approximately 82% and have revealed several groups of confusable characters, for which an additional identification model is currently being investigated. In addition, a word composition method based on dictionary matching using the Levenshtein distance is presented.
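The dictionary matching step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' implementation, so everything below is an assumption made for illustration: the function names `levenshtein` and `closest_words`, the `max_distance` threshold, and the sample word list are hypothetical. The sketch computes the standard unit-cost Levenshtein distance and ranks dictionary entries by their distance to a recognized character string.

```python
from typing import Iterable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def closest_words(candidate: str,
                  dictionary: Iterable[str],
                  max_distance: int = 2) -> List[Tuple[int, str]]:
    """Return (distance, word) pairs within max_distance, nearest first.

    Hypothetical helper: the threshold and the alphabetical tie-breaking
    are illustrative choices, not taken from the paper.
    """
    scored = ((levenshtein(candidate, word), word) for word in dictionary)
    return sorted((d, w) for d, w in scored if d <= max_distance)


if __name__ == "__main__":
    # Hypothetical sample dictionary of German words.
    words = ["Haus", "Hand", "Hund", "Baum"]
    for distance, word in closest_words("Hans", words):
        print(distance, word)
```

In a setting like the one described, the candidate string would come from the character recognition module, and a confusion-aware substitution cost (lower cost for characters known to be confusable) could replace the unit cost used here; that refinement is a suggestion and is not stated in the abstract.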