In addition to these usual handwriting recognition
issues, recognizing Arabic characters requires
dealing with ligatures, diacritics, high variability
in writer style, and poor writing habits such as
touching characters and misplaced dots.
Therefore, most research in the character recognition
field is moving towards experimental evaluation of
the proposed systems in order to test new approaches
and algorithms. Researchers thus need databases to
validate their theories: they either use the databases
available for on-line and off-line writing or they
develop their own, as presented in
(Plamondon et al., 2000), (Tagougui et al., 2013),
(Vinciarelli, 2002), (Jäger et al., 2001) and
(Steinherz et al., 1999).
However, on examining the existing databases,
we find that none of them deals with on-line / off-line
cursive Arabic writing with diacritical marks. Thus,
the development of a new database is useful for the
scientific community, given the continuously growing
number of researchers working on the recognition
of Arabic writing. To respond to this need, we
propose in this paper a new database for Arabic
diacritical characters named "CHAKEL-DB".
The remainder of this article is organized as
follows. Section 2 gives an overview of related
work. Section 3 presents the data preparation and
acquisition stage, supplemented by a quick overview
of the existing data formats and databases, presented
as a comparative study based on indicators reflecting
the requirements for choosing the formats adopted in
our work. The experimental results of our work are
presented in Section 4, and Section 5 concludes the
paper.
2 RELATED WORK
All scientific domains have standard databases for
developing, evaluating and comparing the different
techniques developed for their various tasks.
Handwriting recognition is no exception: the
handwriting recognition community has proposed many
databases, some of which are listed below together
with statistics and comparisons based on a few
specific dimensions of handwriting. These dimensions
allow us to subdivide the field. Handwriting
recognition can be divided into two fields, depending
on the form in which the data is represented: on-line
or off-line.
In the case of on-line handwriting recognition,
the user writes on a digitizing tablet with a special
stylus, so that the strokes are sampled as (x, y)
coordinates at successive time intervals. In the
case of off-line handwriting recognition, by contrast,
the user writes on paper, which is then scanned.
The data is then presented to the system as an
image (color or gray scale), which is binarized by
thresholding so that each image pixel is either 1
or 0.
The on-line case concerns a spatio-temporal
representation of the input, whereas the off-line case
involves an analysis of the spatial luminance of the
image.
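To make the two representations concrete, the
following minimal Python sketch contrasts an on-line
sample (a stroke stored as time-stamped points) with
an off-line sample (a gray-scale image binarized by a
global threshold). It is only an illustration: the
names Stroke, binarize and stroke_duration, the
threshold value of 128, the convention that ink maps
to 1, and all sample values are assumptions made for
this example, not part of any database described here.

```python
from typing import List, Tuple

# On-line sample: one pen stroke as (x, y, t) points.
# The type alias and the values below are hypothetical.
Stroke = List[Tuple[float, float, float]]


def stroke_duration(stroke: Stroke) -> float:
    """Elapsed time between the first and last sampled point."""
    return stroke[-1][2] - stroke[0][2]


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Global thresholding of a gray-scale image (levels 0-255).

    Assumed convention: pixels darker than the threshold are ink (1),
    all others are background (0).
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


# On-line input: spatial and temporal information are both kept.
stroke: Stroke = [(10.0, 20.0, 0.00), (11.5, 21.0, 0.01), (13.0, 23.5, 0.02)]

# Off-line input: only the luminance of the scanned page is available.
page = [[255, 250, 30],
        [240, 20, 25],
        [10, 245, 255]]
binary_page = binarize(page)
```

A fixed global threshold is the simplest choice; on
scans with uneven illumination an adaptive threshold
would usually be preferred.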
The most important and widely used handwritten
databases include:
IAM databases: used as collections of handwritten
samples; they are adopted for a variety of
segmentation and recognition tasks. Several off-
line and on-line databases have been developed
within the IAM, such as:
IAM-DB (IAM handwriting database) (Marti and
Bunke, 1999): this handwriting database, available
since 1999, contains forms of unconstrained
handwritten English text. The IAM
Handwriting Database 3.0, published in 2002,
includes contributions from 657 writers making a
total of 1,539 handwritten pages including 5,685
sentences, 13,353 lines of text and 115,320 words.
The database is labelled at the level of sentences,
lines and words, and it has been widely used in
word spotting, writer identification, handwritten
text segmentation and off-line handwriting
recognition. The database consists of image
files described by metadata files in XML format.
IAM-OnDB (Liwicki and Bunke, 2005): this
database is a collection of handwritten samples
written on a whiteboard and acquired with the
E-Beam system. The data is stored in XML format
which, in addition to the transcription of the text,
also contains demographic information about the
writers. The database includes 221 writers
contributing a total of more than 1,700 forms with
13,049 lines of labelled text and 86,272
word occurrences from a dictionary of 11,059
words. In addition to on-line handwriting
recognition, the database has also been used for
on-line writer identification and gender
classification from handwriting. The collected data
is also available as TIFF images.
Rimes (Augustin et al., 2006): an off-line
database composed of letters sent by individuals
to companies or administrations. It contains
12,723 pages corresponding to 5,605 letters from
1,300 volunteers. The collected pages were scanned
and annotated in RIM format; the database is fully used