Authors:
Ashraf AbdelRaouf (1); Colin A. Higgins (2); Tony Pridmore (2) and Mahmoud I. Khalil (3)
Affiliations:
(1) Cloud brokerage company, United Arab Emirates; (2) The University of Nottingham, United Kingdom; (3) Ain Shams University, Egypt
Keyword(s):
Arabic Character Recognition, Document Analysis & Understanding, Haar-like Features, Cascade Classifiers.
Related Ontology Subjects/Areas/Topics:
Applications; Character Recognition; Classification; Document Analysis and Understanding; Feature Selection and Extraction; Pattern Recognition; Software Engineering; Theory and Methods
Abstract:
Optical Character Recognition (OCR) is an important technology. Compared with Roman scripts, Arabic has fewer OCR systems and a shallower body of research. The Haar-Cascade classifier (HCC), a machine learning approach introduced by Viola and Jones (2001), achieves rapid object detection using a boosted cascade of Haar-like features. Here, that approach is modified for the first time to suit Arabic glyph recognition. The HCC approach eliminates problematic steps in the pre-processing and recognition phases and, most importantly, the character segmentation stage. A recognizer was produced for each of the 61 Arabic glyphs that exist after the removal of diacritical marks. These recognizers were trained and tested on some 2,000 images each. The system was tested with real text images and achieved a recognition rate for Arabic glyphs of 87%. The proposed method is fast, with an average document recognition time of 14.7 seconds compared with 15.8 seconds for commercial software.
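The Haar-like features that the cascade is built on can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`integral_image`, `rect_sum`, `two_rect_feature`) and the toy image are assumptions made for illustration. It shows only the integral-image trick described by Viola and Jones, which lets any rectangle sum, and hence any rectangle-difference feature, be evaluated with a few table lookups.

```python
def integral_image(img):
    """Build a summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above and to the left, inclusive.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w-by-h rectangle whose top-left corner is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Hypothetical two-rectangle feature: left half minus right half.

    Assumes w is even. A large magnitude suggests a vertical edge.
    """
    half = w // 2
    left = rect_sum(ii, x, y, half, h)
    right = rect_sum(ii, x + half, y, half, h)
    return left - right


# Toy 2x4 image: bright left half, dark right half.
img = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
ii = integral_image(img)
print(two_rect_feature(ii, 0, 0, 4, 2))
```

In a full cascade, many such features at different positions and scales would be selected and weighted by boosting and arranged in stages that reject non-matching windows early; none of that is shown here.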