Baseline Estimation in Arabic Handwritten Text-Line

Evaluation on AHTID/MW Database

Anis Mezghani, Slim Kanoun, Souhir Bouaziz, Maher Khemakhem and Haikal El Abed

MIRACL Lab, ISIMS, University of Sfax, Sfax, Tunisia

University of Sfax, National School of Engineers (ENIS), Sfax, Tunisia

Braunschweig Technical University, Institute for Communications Technology (IfN), Braunschweig, Germany

Keywords: Baseline Estimation, Handwritten Text-Line, AHTID/MW Database, Ground Truth.

Abstract: Baseline extraction is one of the most important phases of handwriting recognition. Due to the complexity of Arabic script, baseline detection in Arabic handwritten text-lines is a difficult task compared with other languages. In this work, a method that combines several baseline extraction techniques from the literature is presented to provide a fine estimation of the baseline in Arabic handwritten text-lines. For evaluation purposes, the AHTID/MW database was extended with a baseline ground truth annotation. The database is freely available to researchers worldwide, which enables them to test their own baseline detection systems.

1 INTRODUCTION

In the pattern recognition field, handwritten text recognition is considered one of the most complicated problems. The complexity is greater still for the Arabic language because the text is written cursively, in addition to the complexity of the text characteristics (Al-Badr and Mahmoud, 1995). Existing research on recognizing Arabic text is still limited compared with that on Latin or Chinese text. In an Arabic OCR (Optical Character Recognition) system, the preprocessing stage is the most important because it directly affects the reliability and efficiency of the segmentation and feature extraction processes (Farooq et al., 2005). After the quality of the image has been enhanced, one of the first steps of text preprocessing is the estimation of the writing line, called the baseline.

The baseline is a vertical reference position for the characters and subwords in a handwritten text-line image. It is used by most Arabic OCR systems: for skew normalization (Pechwitz and Märgner, 2003), to segment Arabic text into words or characters (Amin, 1998), and to prepare the text for the feature extraction stage (El-Hajj et al., 2005). For Arabic handwritten text, classic baseline estimation methods such as the horizontal projection are not suitable because of the wide variety of writing styles and specific characteristics such as cursive writing and the large number of dots.

Considering these issues, we propose a method that provides a fine estimation of the baseline in Arabic handwritten text-lines. To the best of our knowledge, there is no Arabic handwritten text-line dataset with baseline ground truth information. Therefore, we extended the AHTID/MW database (Mezghani et al., 2012) with a baseline ground truth annotation of the text-line images. This database is freely available to the scientific community and may be used as a benchmark on which researchers can evaluate their algorithms and compare their results with other published work.

In Section 2, we outline the published work on baseline detection methods. Section 3 describes the developed method of baseline estimation in Arabic handwritten text-lines. In Section 4, the AHTID/MW database structure along with the baseline ground truth data is presented. Experimental results are reported in Section 5. Finally, concluding remarks are given in Section 6.

Mezghani A., Kanoun S., Bouaziz S., Khemakhem M. and El Abed H. (2013). Baseline Estimation in Arabic Handwritten Text-Line - Evaluation on AHTID/MW Database. In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods, pages 430-434. DOI: 10.5220/0004218704300434. SciTePress.

2 RELATED WORK

Different baseline extraction methods are presented in the literature. Al-Shatnawi and Omar (2008) classified the Arabic baseline extraction methods

into four different groups based on the techniques

used. The simplest one is based on horizontal

projection. Elgammal and Ismail (2001) detected the baseline by finding the peak value of the horizontal projection profile in a printed text-line. This method has the drawback of being very sensitive to skew (Pechwitz and Märgner, 2002). A modified projection technique based on rotating the word image through different angles is presented by Al-Rashaideh (2006): a projection peak is computed for each angle, and the baseline is identified by the maximum of these peaks and its corresponding angle. Pechwitz and Märgner (2002) proposed the only work that detects the Arabic handwriting baseline from the word skeleton. The main idea of this approach is to calculate robust features from the skeleton and use these features to classify the connected components into baseline-relevant and baseline-irrelevant areas. In a subsequent step, a regression analysis of points of the relevant objects is performed to estimate the final baseline position. Some researchers extract baselines after correcting the slant of the word by a linear regression on the critical points of the contour having nearly the same horizontal positions (Farooq et al., 2005). Burrow (2004) presented a method based on angle detection by principal component analysis.

Other methods, such as entropy minimization and Hough transform based methods, which were first used for Latin script, have been developed and applied to Arabic script (Côté et al., 1996; Likforman-Sulem et al., 1995). These methods have the drawback of being expensive in terms of computation time. Lemaitre et al. (2009) proposed a script-independent method for baseline detection. This method is based on the principle of perceptive vision, which combines several points of view of the same word (from low to high resolution). Boubaker et al. (2009) described a baseline detection method that considers geometric and topological features. It was tested on online and offline short Arabic handwriting. Recently, a two-stage Persian/Arabic baseline detection and correction algorithm was presented by Ziaratban and Faez (2008). The first stage estimates the writing path of a text-line by a curve fitted to candidate baseline pixels, which are detected using a template matching algorithm. Then the slant and position of the components in the line are adjusted. In the second stage, the baseline of each subword is corrected. Another method for tracing the baseline in handwritten Persian/Arabic text-lines was proposed by Nagabhushan and Alaei (2010). This method is based on preparing patches of black and white blocks all along the text-line, identifying some candidate points and regressing a curve through these candidate points to trace the baseline.

The majority of the methods presented in the literature fail to estimate the correct baseline for handwritten text with a large number of ascenders and descenders. Menasri et al. (2008) described a word baseline extraction method that overcomes some difficulties of Arabic script, such as the presence of loops and the various shapes taken by a group of two or three dots. Inspired by this work, we developed a baseline estimation method adapted to Arabic handwritten text-lines.

3 BASELINE ESTIMATION

After a document is scanned, basic preprocessing tasks such as image binarization and noise reduction have to be performed to increase the readability of the input for the baseline detection system. Using the binary image, we perform noise reduction filtering. Small holes produced by the writing and binarization processes are closed, and unwanted information is deleted, by using the closing and opening morphological operations, respectively (Figure 1(b)).
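The noise reduction step can be illustrated with a minimal sketch in plain Python. It assumes a binary image stored as a list of rows (1 for ink, 0 for background) and a 3x3 structuring element; the structuring element size, the order in which closing and opening are applied, and the function names are illustrative assumptions, not details given in this paper.

```python
def _apply(img, rule):
    """Slide a 3x3 window (clipped at the image border) and apply `rule` to it."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [img[y][x]
                      for y in range(max(0, r - 1), min(h, r + 2))
                      for x in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = 1 if rule(window) else 0
    return out

def erode(img):
    return _apply(img, all)    # keep a pixel only if its whole window is ink

def dilate(img):
    return _apply(img, any)    # set a pixel if any pixel in its window is ink

def closing(img):
    return erode(dilate(img))  # fills small holes inside strokes

def opening(img):
    return dilate(erode(img))  # removes small isolated specks

def denoise(img):
    # Assumed order: close the holes first, then remove the specks.
    return opening(closing(img))

# Toy 9x9 image: a 5x5 blob with a one-pixel hole, plus one isolated noise pixel.
img = [[0] * 9 for _ in range(9)]
for r in range(2, 7):
    for c in range(2, 7):
        img[r][c] = 1
img[4][4] = 0   # hole inside the blob
img[0][8] = 1   # isolated speck
clean = denoise(img)
```

In this toy case the hole at row 4, column 4 is filled and the speck in the corner disappears, while the blob keeps its 5x5 extent.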

The developed baseline detection process consists of three stages. The first is a basic stage in which diacritical marks are detected and removed. The second stage extracts the upper baseline and the lower baseline based on the horizontal projection histogram. In the final stage, we estimate the baseline more precisely using support points.

3.1 Diacritical Marks Elimination

More than half of the Arabic letters include dots in their shape: one, two or three. These dots, called diacritical marks, and their positions allow us to differentiate between letters that belong to the same shape family. The diacritical marks lie either above or below the baseline, depending on the character. To avoid their adverse influence on baseline detection by horizontal projection, we start by removing the diacritical marks based on the size of the connected components, as described in (Menasri et al., 2008). A sample text-line image after removal of the diacritical marks is shown in Figure 1(c).
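As a rough illustration of size-based removal, the sketch below labels 8-connected components of a binary image and erases those that are small relative to the largest one. The relative threshold `min_ratio` and both function names are hypothetical; the actual size criterion of Menasri et al. (2008) is not reproduced here.

```python
from collections import deque

def connected_components(img):
    """Return the 8-connected ink components as lists of (row, col) pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append(pixels)
    return comps

def remove_small_components(img, min_ratio=0.2):
    """Erase components smaller than min_ratio times the largest component.

    min_ratio is an illustrative assumption, not a value from the paper.
    """
    out = [row[:] for row in img]
    comps = connected_components(img)
    if not comps:
        return out
    threshold = min_ratio * max(len(p) for p in comps)
    for pixels in comps:
        if len(pixels) < threshold:
            for y, x in pixels:
                out[y][x] = 0
    return out

# Toy text-line: one dot above an 8-pixel stroke.
line = [
    [0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 0],
]
cleaned = remove_small_components(line)
```

A purely size-based rule of this kind will also miss large or touching dots, which is consistent with the limitation reported in Section 5.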

BaselineEstimationinArabicHandwrittenText-Line-EvaluationonAHTID/MWDatabase

431

Figure 1: Arabic handwritten text-line image from the AHTID/MW database: (a) before binarization; (b) after binarization and noise reduction; (c) after diacritical marks elimination.

3.2 Primary Baseline Estimation

For primary baseline estimation, we used the horizontal projection method. In general, this method is used to detect two baselines in each input image, the upper and the lower baseline (El-Hajj et al., 2005). The horizontal projection histogram can be disturbed by many kinds of noise. One of them is a succession of descenders, or long tails under the baseline, which can produce a high peak in the histogram below the baseline. In Arabic script, loops are often located on the baseline. So, to overcome this problem, we start by detecting the loops in order to pre-locate a horizontal band. This band is centered on the loops, and its height is three times the maximum length of the loops.

Figure 2: Pre-localization of horizontal band by the use of

loop position: (a) pre-localization of horizontal band; (b)

high peak due to the succession of descenders.

After pre-localization based on loops, we compute the projection histogram and calculate the global maximum. The upper and lower baselines correspond to the local minima above and below the global maximum whose values are lower than 1/3 of the global maximum. The coefficient 1/3 was obtained from tests. Figure 3 illustrates the improved baselines of a text-line image.
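The rule for the upper and lower baselines can be sketched as follows. The sketch assumes a binary image as a list of rows, omits the loop-based pre-localization, and picks the nearest qualifying local minimum on each side of the peak; the choice of the nearest minimum, the fallback to the image border, and the function names are assumptions made for illustration.

```python
def horizontal_projection(img):
    """Number of ink pixels in each row."""
    return [sum(row) for row in img]

def primary_baselines(img, ratio=1 / 3):
    """Return (upper, peak, lower) row indices.

    upper/lower are the nearest local minima above/below the histogram peak
    whose value is below ratio * peak value (1/3 in the paper).
    """
    hist = horizontal_projection(img)
    peak = max(range(len(hist)), key=hist.__getitem__)
    limit = hist[peak] * ratio

    def is_local_min(i):
        left = hist[i - 1] if i > 0 else float("inf")
        right = hist[i + 1] if i + 1 < len(hist) else float("inf")
        return hist[i] <= left and hist[i] <= right

    def search(indices, fallback):
        for i in indices:
            if hist[i] < limit and is_local_min(i):
                return i
        return fallback  # assumed fallback: the image border

    upper = search(range(peak - 1, -1, -1), 0)
    lower = search(range(peak + 1, len(hist)), len(hist) - 1)
    return upper, peak, lower

# Toy image built from a chosen row profile (width 9).
row_sums = [1, 0, 2, 6, 9, 3, 1, 2]
img = [[1] * n + [0] * (9 - n) for n in row_sums]
upper, peak, lower = primary_baselines(img)
```

For this profile the peak is row 4 (value 9, so the limit is 3); the search returns row 1 as the upper baseline and row 6 as the lower baseline.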

Figure 3: Improvement of horizontal band: (a) search area

of the global maximum; (b) primary baseline estimation.

3.3 Fine Baseline Estimation

In this step, we estimate the lower baseline more precisely using support points. These support points are singular points of the skeleton and local minima located in the baseline.

- A skeletonization of the text-line image is performed using the Toumazet algorithm (Toumazet, 1990). The Thomé algorithm (Thomé, 1978) is then used to reduce the stroke thickness to a single pixel while preserving the geometry, location and connections. Singular points of the skeleton are defined as points at which a stroke starts inside the baseline and finishes under the baseline (Figure 4).

Figure 4: Detected singular points from skeleton text-line.

- Local minima points are deduced from the contour of subwords. We retain only the local minima located in the baseline.

Based on all these support points, a linear interpolation is applied to obtain the approximate baseline. This fine detection of the baseline adapts well to small changes in the inclination of the writing within the same text-line. Figure 5 gives an example of the resulting baseline estimation for an Arabic handwritten text-line.
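One possible reading of this interpolation step is a piecewise-linear curve through the support points, sketched below. The sketch assumes the support points have already been detected and are given as (x, y) pixel coordinates; averaging points that share a column, holding the baseline constant outside the first and last point, and the function name are all assumptions.

```python
def interpolate_baseline(support_points, width):
    """Return one baseline y value per column by piecewise-linear interpolation."""
    if not support_points:
        raise ValueError("at least one support point is required")
    # Average the y values of support points that share the same column.
    by_x = {}
    for x, y in support_points:
        by_x.setdefault(x, []).append(y)
    xs = sorted(by_x)
    ys = [sum(by_x[x]) / len(by_x[x]) for x in xs]

    baseline = []
    j = 0
    for x in range(width):
        if x <= xs[0]:
            baseline.append(ys[0])        # left of the first support point
        elif x >= xs[-1]:
            baseline.append(ys[-1])       # right of the last support point
        else:
            while xs[j + 1] < x:          # advance to the segment containing x
                j += 1
            t = (x - xs[j]) / (xs[j + 1] - xs[j])
            baseline.append(ys[j] + t * (ys[j + 1] - ys[j]))
    return baseline

# Three support points on a slightly bent writing line.
points = [(2, 10), (6, 12), (10, 10)]
baseline = interpolate_baseline(points, width=12)
```

Because each segment follows its own pair of support points, the resulting curve can track small changes of inclination along the line, unlike a single straight line fitted to the whole text-line.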

Figure 5: Baseline estimation of an Arabic handwritten

text-line.

ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods

432

4 BASELINE GROUND TRUTH DESCRIPTION

To evaluate the present work, we extended the AHTID/MW database developed by Mezghani et al. (2012) with a baseline ground truth annotation. The AHTID/MW database contains 3710 text-line images written by 53 native Arabic writers. These images are divided into 4 sets. For each of these sets, we provide the baseline ground truth of the data in an XML file. An example of such an XML file is given in Figure 6.

A baseline is drawn manually on each text-line image. This straight line should give a good estimation of the writing line. The baseline is parameterized by two values, Y1 and Y2, which represent the endpoints of the baseline.

Figure 6: An example of a baseline ground truth data file.

5 RESULTS AND DISCUSSION

Experiments were carried out on the AHTID/MW database. It consists of 3710 Arabic handwritten text-line images containing 22896 words (Mezghani et al., 2012) with ground truth information. Because the baseline ground truth is provided, the estimated baseline can be evaluated quantitatively. The error of a baseline is calculated as the average distance between the ground truth and the proposed result (Figure 7). Two thresholds are used with this metric: with an average pixel error of less than 5 pixels, the baseline is considered good, whereas with an error of less than 7 pixels, the baseline is considered acceptable. To obtain the average pixel error, we discretized the baselines into equal intervals of 20 pixels, a value fixed from tests. The average pixel error is defined as the average of the distances between the ground truth and the proposed result at these positions.
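The metric can be sketched as follows, under several assumptions that the text does not settle: Y1 and Y2 are taken as the ground truth heights at the left and right image edges, the distance is measured vertically, and sampling starts at column 0. The function names and the label "poor" for baselines above both thresholds are hypothetical.

```python
def ground_truth_y(x, y1, y2, width):
    """Height of the straight ground truth line at column x (assumed edge to edge)."""
    if width == 1:
        return float(y1)
    return y1 + (y2 - y1) * x / (width - 1)

def average_pixel_error(estimated, y1, y2, step=20):
    """Mean vertical distance, sampled every `step` columns (20 in the paper).

    `estimated` holds one baseline y value per image column.
    """
    width = len(estimated)
    distances = [abs(estimated[x] - ground_truth_y(x, y1, y2, width))
                 for x in range(0, width, step)]
    return sum(distances) / len(distances)

def quality(error, good=5.0, acceptable=7.0):
    """Map an average pixel error to the two thresholds of Table 1."""
    if error < good:
        return "good"
    if error < acceptable:
        return "acceptable"
    return "poor"  # hypothetical label, not used in the paper

# A flat estimate at y = 50 against a ground truth rising from 50 to 60.
estimated = [50.0] * 101
error = average_pixel_error(estimated, y1=50, y2=60)
label = quality(error)
```

Here the sampled columns are 0, 20, 40, 60, 80 and 100, the distances are 0, 2, 4, 6, 8 and 10 pixels, and the average error is 5 pixels, so the baseline counts as acceptable but not good.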

Figure 7: Visualization of the baseline error: (a) estimated

baseline; (b) baseline ground truth.

The results of the proposed baseline estimation method are reported in Table 1. We must point out that our text-line dataset is complex due to the presence of different handwriting styles. As a result, the dot removal algorithm described in Section 3.1 is not able to detect all diacritical marks. The absence of other Arabic handwritten text-line datasets with baseline ground truth information prevents us from making a precise comparison of our method with existing work. Therefore, we invite interested researchers to use the AHTID/MW database.

Table 1: Performance of the proposed baseline estimation method.

Baseline quality (average pixel error)    Percentage (%)
Good (<5)                                 84.3
Acceptable (<7)                           88.7

6 CONCLUSIONS

In this work, a baseline estimation method for Arabic handwritten text-lines is presented. In the first stage, the diacritical marks are eliminated. In the second stage, the upper and lower baselines are extracted using the horizontal projection method. Finally, the baseline is estimated more precisely using support points.

To evaluate the present work, we extended the AHTID/MW database developed by Mezghani et al. (2012) with baseline ground truth information for the text-line images. This database contains 3710 text-line images written by 53 native Arabic writers. The AHTID/MW database, including the baseline ground truth annotation, is freely available to interested researchers worldwide.

BaselineEstimationinArabicHandwrittenText-Line-EvaluationonAHTID/MWDatabase

433

REFERENCES

Al-Badr, B., Mahmoud, S., 1995. Survey and bibliography of Arabic optical text recognition. Signal Processing. Vol.41(1): 49–77.
Al-Rashaideh, H., 2006. Preprocessing phase for Arabic Word Handwritten Recognition. Electronic Scientific Journal. Vol.6(1): 11–19.
Al-Shatnawi, A., Omar, K., 2008. Methods of Arabic Baseline Detection - The State of Art. International Journal of Computer Science and Network Security. Vol.8(10): 137–142.
Amin, A., 1998. Off-line Arabic character recognition: the state of the art. Pattern Recognition. Vol.31(5): 517–530.
Boubaker, H., Kherallah, M., Alimi, M. A., 2009. New Algorithm of Straight or Curved Baseline Detection for Short Arabic Handwritten Writing. International Conference on Document Analysis and Recognition. 778–782.
Burrow, P., 2004. Arabic handwriting recognition, Thesis. University of Edinburgh. England.
Côté, M., Chériet, M., Suen, C., Lecolinet, E., 1996. Détection des Lignes de Base de Mots Cursifs à l'aide de l'Entropie. Colloque sur l'Intelligence Artificielle dans les Technologies de l'Information.
Elgammal, A. M., Ismail, M. A., 2001. A Graph-Based Segmentation and Feature Extraction Framework for Arabic Text Recognition. International Conference on Document Analysis and Recognition. 622–626.
El-Hajj, R., Likforman-Sulem, L., Mokbel, C., 2005. Arabic Handwriting Recognition Using Baseline Dependant Features and Hidden Markov Modeling. International Conference on Document Analysis and Recognition. 893–897.
Farooq, F., Govindaraju, V., Perrone, M., 2005. Preprocessing Methods for Handwritten Arabic Documents. International Conference on Document Analysis and Recognition. 267–271.
Lemaitre, A., Camillerapp, J., Coüasnon, B., 2009. Multi-script Baseline Detection Using Perceptive Vision. Biennial Conference of the International Graphonomics Society.
Likforman-Sulem, L., Hanimyan, A., Faure, C., 1995. A Hough based algorithm for extracting text lines in handwritten documents. International Conference on Document Analysis and Recognition. 774–777.
Menasri, F., Vincent, N., Augustin, E., Cheriet, M., 2008. Un système de reconnaissance de mots arabes manuscrits hors-ligne sans signes diacritiques. Conférence Internationale francophone sur l'écrit et le document.
Mezghani, A., Kanoun, S., Khemakhem, M., El Abed, H., 2012. A Database for Arabic Handwritten Text Image Recognition and Writer Identification. International Conference on Frontiers in Handwriting Recognition. 397–400.
Nagabhushan, P., Alaei, A., 2010. Tracing and Straightening the Baseline in Handwritten Persian/Arabic Text-line: A New Approach Based on Painting-technique. International Journal on Computer Science and Engineering. Vol.2(4): 907–916.
Pechwitz, M., Märgner, V., 2003. HMM Based approach for handwritten Arabic Word Recognition Using the IFN/ENIT DataBase. International Conference on Document Analysis and Recognition. 890–894.
Pechwitz, M., Märgner, V., 2002. Baseline Estimation For Arabic Handwritten Words. Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition. 479–484.
Thomé, S., 1978. Prétraitement du chiffre manuscrit. Congrès AFCET, France. 568–576.
Toumazet, J. J., 1990. Traitement de l’image par l’exemple, Sybex.
Ziaratban, M., Faez, K., 2008. A Novel Two-Stage Algorithm for Baseline Estimation and Correction in Farsi and Arabic Handwritten Text line. International Conference on Pattern Recognition. 1–5.

ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods

434