Strokes Trajectory Recovery for Unconstrained Handwritten Documents

with Automatic Evaluation

Sidra Hanif

and Longin Jan Latecki

Temple University, Philadelphia, PA, U.S.A.

Keywords:

Handwriting Strokes Trajectory Recovery, Dynamic Time Warping, Chamfer Distance, LSTM, Word

Detection, Automatic Evaluation.

Abstract:

The focus of this paper is ofﬂine handwriting Stroke Trajectory Recovery (STR), which facilitates the tasks

such as handwriting recognition and synthesis. The input is an image of handwritten text, and the output

is a stroke trajectory, where each stroke is a sequence of 2D point coordinates. Usually, Dynamic Time

Warping (DTW) or Euclidean distance-based loss function is employed to train the STR network. In DTW loss

calculation, the predicted and ground-truth stroke sequences are aligned, and their differences are accumulated.

The DTW loss penalizes the alignment of far-off points proportional to their distance. As a result, DTW loss

incurs a small penalty if the predicted stroke sequence is aligned to the ground truth stroke sequence but

includes stray points/ artifacts away from ground truth points. To address this issue, we propose to compute a

marginal Chamfer distance between the predicted and the ground truth point sets to penalize the stray points

more heavily. Our experiments show that the loss penalty incurred by complementing DTW with the marginal

Chamfer distance gives better results for learning STR. We also propose an evaluation method for STR cases

where ground truth stroke points are unavailable. We digitalize the predicted stroke points by rendering the

stroke trajectory as an image and measuring the image similarity between the input handwriting image and

the rendered digital image. We further evaluate the readability of recovered strokes. By employing an OCR

system, we determine whether the input image and recovered strokes represent the same words.

1 INTRODUCTION

For a long time, handwriting analysis, such as hand-

writing recognition (Faundez-Zanuy et al., 2020) and

signature veriﬁcation (Diaz et al., 2022), has been

an active research area. There are two categories of

handwriting, online and ofﬂine. Online handwriting

is captured in real-time on a digital device such as a

tablet screen with a stylus pen. In contrast, the hand-

written text scanned or captured by a camera from a

physical medium such as paper is referred to as of-

ﬂine handwriting (Liu et al., 2011; Plamondon and

Srihari, 2000). The handwriting inscribed on a digital

device or captured from a physical medium is often

unconstrained with varying orientations. The avail-

ability of temporal movements of a stylus pen for on-

line handwriting makes the handwriting analysis task

easier. However, for ofﬂine handwriting, the input is

limited to handwritten images, making handwriting

analysis much more difﬁcult.

The current STR architectures for English hand-

https://orcid.org/0000-0001-6531-7656

https://orcid.org/0000-0002-5102-8244

writing use lines of text (Archibald et al., 2021; Bhu-

nia et al., 2018) or characters of alphabets (Rabhi

et al., 2021; Rabhi et al., 2022) as input. One of

the recent datasets for STR, IAM-online (Marti and

Bunke, 2002), includes only line-level annotations.

For English handwritten documents, line detection is

a prominent topic for processing historical document

images (Boillet et al., 2021; Alberti et al., 2019).

A learning-free mechanism for detecting lines in a

historical handwritten document is presented in (Ku-

rar Barakat et al., 2020). Another algorithm for line

segmentation in handwritten documents is proposed

in (Surinta et al., 2014b). This work is limited to

separating the overlapping words only in horizontal

lines (Surinta et al., 2014a). For more complex Arabic

handwriting, (Gader and Echi, 2022; Gader and Echi,

2020) segment the curved text lines with overlapping

words. A generative model to segment lines for Ara-

bic handwriting is proposed in (Demır et al., 2021).

This method can segment slanting and curved lines

but is not accurately enclosing a complete word in

lines because they consider a binary mask to train the

generative model. We have tried a few line segmenta-

Hanif, S. and Latecki, L.

Strokes Trajectory Recovery for Unconstrained Handwritten Documents with Automatic Evaluation.

DOI: 10.5220/0011700100003411

In Proceedings of the 12th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2023), pages 661-671

ISBN: 978-989-758-626-2; ISSN: 2184-4313

 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

661

tion (Surinta et al., 2014b; Alberti et al., 2019) meth-

ods. They are either proposed for non-English text

or do not work well for unconstrained non-horizontal

text in handwritten document images. Even when

line detection works well, performing STR on words

yields better performance than on lines, as this paper

demonstrates (the ﬁrst two rows of Table 2).

Non-English handwriting datasets (Barakat et al.,

2019; Yavariabdi et al., 2022) emphasize word-level

annotations and provide word, line, and page-level an-

notations for historical handwritten documents. The

Chinese, Japanese, Arabic, and Tamil (Bhunia et al.,

2018; Nguyen et al., 2021; Viard-Gaudin et al., 2005;

Privitera and Plamondon, 1995) datasets for stroke

trajectory also provide word-level annotations. Older

STR datasets such as IRONOFF (Viard-Gaudin et al.,

1999) consists of words/characters/digits of English

handwriting. Unipen dataset (Guyon et al., 1994)

is available with character-level annotation, whereas

IRONOFF consists of words. But we were unable to

obtain any copies of IRONOFF datasets. In contrast,

the publically available English handwriting dataset

IAM-online (Marti and Bunke, 2002) for STR in-

cludes line-level annotations with missing word-level

annotations. Therefore, we propose constructing a

large-scale word-level annotation for the IAM-online

dataset in our work. In recent years conventional

methods (Diaz et al., 2022; Senatore et al., 2022) de-

vised rule-based algorithms for signature/word trajec-

tory recovery. Moreover, stroke trajectory recovery

has progressed through deep neural networks. For

stroke trajectory recovery, (Bhunia et al., 2018) in-

troduces a ﬁrst trainable convolutional network. This

LSTM architecture learns strokes from Tamil scripts

with Euclidean distance loss, making it hard to apply

to long words with multiple strokes, such as English

handwriting. Recently, (Rabhi et al., 2021; Rabhi

et al., 2022) introduced an attention mechanism to

train the writing order recovery for characters. These

attention networks are trained on characters with L1-

loss, which is, again difﬁcult to train for words. Sim-

ilarly, (Nguyen et al., 2021) employs an LSTM ar-

chitecture with an attention layer and Gaussian mix-

ture model trained with cross-entropy loss. How-

ever, it is limited to encoding only a single Japanese

character. Recently, (Archibald et al., 2021) presents

the stroke trajectory recovery, where LSTM is trained

with a Dynamic Time Waring (DTW) loss function.

(Archibald et al., 2021) has a disadvantage that it can

work only for a line of text. Apart from the restric-

tion of the input to the lines of text, DTW loss has

a drawback for long sequence matching: it sums the

loss function for all the points when ﬁnding the best

alignment between two sequences. Hence order pre-

serving stray points, i.e., predicted stroke points far

apart from their matching originals, have a minor in-

ﬂuence on the DTW loss. However, they result in

noticeable artifacts in the predicted strokes. To cir-

cumvent this issue, we propose to add the Chamfer

distance (Akmal Butt and Maragos, 1998) between

predicted and ground truth point sets to the loss func-

tion. In order to prevent penalizing stroke points with

small deviations, we augment the Chamfer distance to

a marginal Chamfer distance. Applying the marginal

Chamfer distance yields a more signiﬁcant penalty for

stray points/artifacts.

Another challenge a stroke trajectory recovery

system faces is the availability of ground truth strokes.

Annotating ground-truth strokes for the STR system

is a laborious process that demands time and resource

allocation. Therefore, our work proposes an eval-

uation algorithm that does not require ground truth

stroke points to evaluate the STR system. To the best

of our knowledge, the evaluation of the STR system

without ground truth stroke points has not been pro-

posed before. The main contributions of this work

are as follows: 1) We introduce large-scale word-level

annotations for the English handwriting STR dataset

sampled from the IAM-online dataset. Our version of

the IAM-online dataset contains 62,000 words. 2) A

word-level STR method estimates loss for each word

rather than averaging DTW loss over the entire line

of text. To avoid the stray points/artifacts in pre-

dicted stroke points, we employ a marginal Chamfer

distance that penalizes large, easily noticeable devi-

ations and artifacts. 3) We also introduce an algo-

rithm for evaluating the STR system on images with-

out ground truth stroke points. 4) Since our method

works with words, we demonstrate that our method

is scalable to unconstrained handwritten documents,

i.e., full-page text. 5) The quantitative and qualitative

results demonstrate the superior performance of our

approach in comparison to the state-of-the-art (SOTA)

methods.

We introduce the proposed method in Sec. 2 and

demonstrate the experimental results in Sec. 3.

2 METHOD

2.1 Network Architecture

We deploy a CNN with bidirectional LSTM for stroke

trajectory learning. The input to the CNN is a word

image resized to a ﬁxed height with variable width to

keep the same aspect ratio. A CNN branch consists

of seven convolutional blocks with ReLU activation.

The convolutional ﬁlters have a 3x3 kernel size with

ICPRAM 2023 - 12th International Conference on Pattern Recognition Applications and Methods

662

Figure 1: The block diagram of the overall system for training, inference, and automatic evaluation.

2x2 and 2x1 max pooling in each layer. The output

of the CNN branch, a Wx1024 dimensional feature

vector, where W is the width of the image, is fed into

eight bidirectional LSTM blocks. Each bidirectional

LSTM block consists of 128 hidden units, so each

LSTM block’s input is Wx128. Lastly, a bidirectional

LSTM is followed by a 1-D convolutional block that

predicts a Wx4-dimensional output. The number of

output points is proportional to the width of the input

image.

The ﬁrst two dimensions of the output indicate the

relative coordinates (x,y) with respect to the previous

location. The last two dimensions indicate start-of-

stroke (sos) and end-of-stroke (eos) tokens, respec-

tively. Cross-entropy loss is employed to learn the

start-of-stroke and end-of-stroke tokens. The overall

architecture is shown in Fig. 1.

2.2 Loss Function

The ground truth in IAM-online dataset (Marti and

Bunke, 2002) is a sequence of points deﬁned as (x, y)

coordinates with a time stamp. Since the number of

predicted coordinates is propositional to the width of

the input image, we re-sample the equidistant ground-

truth coordinates such that the number of points is

proportional to the image’s width.

Figure 2: The impact of marginal Chamfer distance loss on

the stroke trajectory recovery. (a) DTW loss only. (b) DTW

+ Chamfer distance loss.

2.2.1 Dynamic Time Warping (DTW) Loss

In general, DTW (Berndt and Clifford, 1994) com-

putes the optimal match between ground-truth (GT)

(T = (t

,....t

)) and predicted sequences (P =

, p

,....p

)) of different lengths by ﬁnding the

warping path between two sequences. In DTW loss,

the cost matrix is calculated as:

cost(i, j) = ||p

−t

(1)

The accumulative cost matrix (A) is given as,

A(i, j) = cost(i, j) + min[A(i − 1, j),A(i −1, j − 1),

A(i, j − 1)]

(2)

(3)

for 1 ≤ i ≤ n and 1 ≤ j ≤ m.

Given matrix A, DTW computes the optimal

warping path from A(n, m) to A(1,1) as the alignment

of points in P to points in T is expressed as index map-

ping α : {1, .. .,n} → {1,...,m}, where α is an onto

Strokes Trajectory Recovery for Unconstrained Handwritten Documents with Automatic Evaluation

663

function.

DTW

(P,T) =

∑

i=1

||p

−t

α(i)

||. (4)

2.2.2 Chamfer Distance Loss

The DTW gives promising results for training the

stroke trajectory recovery systems (Archibald et al.,

2021). It helps to match the point trajectory for

ground truth and predicted strokes. However, the

DTW loss function does not impose a sufﬁciently

large penalty for predicted points following a similar

trajectory as the ground truth points but far off com-

pared to the ground truth at the pixel level. Fig. 2(a)

shows that the predicted and ground truth points are

far off, but the DTW loss is small since the predicted

strokes follow the same trajectory as a ground truth

stroke. Therefore, in this work, we propose to add a

marginal Chamfer distance. Its effect is illustrated in

Fig. 2(b). The proposed marginal Chamfer distance

between the predicted and ground truth point sets is

given by

(P,T) =

∑

p∈P

max(min

g∈T

∥p − g∥

) − c

,0) + (5)

∑

g∈T

max(min

p∈P

∥g − p∥

) − c

,0), (6)

where P and T represent the point sets for predicted

and ground truth strokes. The Chamfer distance

is calculated on the pixel level. In the experimen-

tal section, we will discuss the setting of parameter

c. The intuition behind the marginal Chamfer dis-

tance for STR is to consider the distance only if the

predicted stroke point is at least a unit pixel apart

from its nearest ground-truth stroke point. It leads

to quantitative improvements in loss calculations as

shown in Table 2. The proposed loss is simply a sum

DTW

(P,T) + d

(P,T)

2.3 Automatic Evaluation

In the previous stroke trajectory recovery evaluation

system, we computed the distance between ground-

truth stroke points and predicted stroke points. For

this purpose, we require the ground truth strokes’

coordinate information to compute the difference.

However, adding the coordinates information is ex-

tra work and requires extensive labor. Moreover,

the existing handwriting datasets have limited avail-

ability of stroke coordinates information. Hence, al-

though the proposed system can be applied to hand-

writing datasets without stroke coordinates informa-

tion, it is impossible to evaluate the quality of its pre-

dicted strokes using the existing methods. Therefore,

we propose two measures for evaluating the quality of

recovered stroke trajectories when ground truth stroke

information is not given, namely image matching and

readability.

2.3.1 Image Matching

The LSTM predicts the stroke trajectory recovery

point coordinates (x,y). We digitize the obtained

strokes (X, Y) by plotting the points (x,y) and con-

structing a digital image from all the recovered stroke

points (X, Y). As we compare the original and recon-

structed image, the dimensions of the reconstructed

image are the same as the original image. However,

since the text is plotted with unit thickness, the thick-

ness of the text in the input and reconstructed image

differs. To overcome this issue, we propose dilating

the reconstructed images so that the thickness of the

text in the input and the reconstructed image are the

same. We dilate for the kernel size ranging from 0 to

10 and select the kernel size that yields the least num-

ber of pixels in the absolute difference between the

input and dilated images. Let I

input

be the input hand-

written text image, and let I

predict

denote the image

reconstructed from the coordinates of stroke points

predicted by LSTM. I

predict

is reconstructed by dig-

italizing the predicted stroke trajectory as described

in Sec. 4.2. Hence all the words in I

predict

are one

pixel thick.

Next, we dilate I

predict

with a dilation kernel k ∈

(1,10), and denote the dilated image with kernel k

as D(I

predict

,k). We compare two digital images by

computing their symmetric difference as

Di f f (k)

= |I

input

− D(I

predict

,k)|. (7)

We deﬁne I

Di f f

as the image I

Di f f (k)

with the min-

imum number of foreground pixels for k ∈ (0,10).

This allows for estimating the thickness of the input

text in the reconstructed image.

Next, we check the quality of the reconstructed

image I

Di f f

by performing connected component

analysis. Let C be the largest connected component

in I

Di f f

image. The ratio of the number of foreground

pixels in C to the total number of foreground pixels in

input

gives us the error in the stroke trajectory predic-

tion, which is denoted as ε. This value can be used as

a quantitative measure of the predicted strokes. Em-

pirically, we observed that if the error ε is less than

the threshold (T = 0.025), then the quality of stroke

trajectory recovery is good and vice versa.

The intuition of the proposed method is that the

image I

Di f f

has small scattered connected compo-

nents if stroke trajectory recovery is good. However,

the I

Di f f (k)

image has large and quite noticeable con-

nected components if stroke trajectory recovery is of

ICPRAM 2023 - 12th International Conference on Pattern Recognition Applications and Methods

664

poor quality. The example of good and poor stroke

trajectory recovery validated by the proposed method

is shown in Fig. 8.

2.3.2 Readability

The second part of the automatic evaluation checks

the preservation of the readability of the input text and

the text from the recovered handwriting trajectory. To

verify that the recovered stroke trajectory is read the

same as the input handwriting word, we recognize the

characters in both images. Let I

and I

be the two im-

ages for word from the input handwriting image and

the one recovered from the proposed stroke trajectory

recovery method, respectively. The text recognition

on I

and I

gives us the string of characters for input

word denoted by W

= [w

,..., w

] and the string

characters from recovered stroke trajectory denoted

by W

= [w

,..., w

], where m and n are the to-

tal character recognized in I

and I

. We utilize the

pre-trained text recognition network (Li et al., 2021)

to compute W

and W

. The difference between the

two recognized strings is computed by the edit dis-

tance between the two strings. In our work, we com-

pute the edit distance between two strings W

and W

with Levenshtein distance. Let the Levenshtein dis-

tance between two strings be d

lev

and the number of

characters in input string W

is m. The readability

error R is deﬁned as d

lev

/m, that is, the ratio of incor-

rect string matching to the total number of characters

in the input string. Ideally, the Levenshtein distance

lev

) and readability error (R ) are expected to be zero

for good stroke trajectory recovery. Empirically, we

noticed that the R less than T = 0.1 results in satis-

factory reconstruction, which we deﬁne as acceptable

readability. This process is illustrated in Fig. 3.

Figure 3: The readability module based on text recognition

(Li et al., 2021) utilized for our automatic evaluation. The

top-left text is the input and the bottom-left text is recon-

structed from predicted stroke trajectory recovery.

3 EXPERIMENTS

We ﬁrst present the construction of word-level anno-

tations for IAM-online dataset (Sec. 3.1) and intro-

duce the greeting-card handwritten messages (GHM)

dataset (3.1.1). We discuss the evaluation metric

and results of the proposed method on these datasets

(Secs. 3.2 and 4). Finally, we present the application

of our proposed method for the GHM dataset (Sec.

4.2).

3.1 Word-Level Annotation

In the IAM-online dataset (Marti and Bunke, 2002),

the stylus pen movement on an electronic device’s

screen provides the coordinates of the ground-truth

point for stroke trajectory recovery. However, the

IAM-online dataset only provides the stroke’s ground

truth for the line-level text. Therefore, we use word

detection to construct stroke annotations for words.

Word detection generally requires word bounding

boxes to train the detection network, but the IAM-

online dataset does not include word bounding boxes.

So, we propose to train the word detection on GNHK

(Lee et al., 2021) dataset and then applied to the IAM-

online dataset. The images in the GNHK dataset

are sourced from different regions of Europe, North

America, Africa, and Asia containing 39k+ words,

sufﬁcient to train a word detector with data augmenta-

tion (Bochkovskiy et al., 2020). Hence, it is a diverse

dataset regarding writing style and image quality as

the penmanship varies in different parts of the world,

and camera quality varies for each captured image.

We explore three state-of-the-art detectors; one

scene text detector (Baek et al., 2019) and two object

detectors for localizing the words in unconstrained

handwritten text. A scene text detection network

identiﬁes character regions in natural images and de-

tects the words based on character regions and afﬁnity

scores between them (Baek et al., 2019). However,

the deterministic approach to constructing bound-

ing boxes around character region scores is not well

suited for handwriting word detection. Because in

the handwritten text, adjacent lines may overlap, and

characters may have high afﬁnity scores in adjacent

lines, which misleads the word detection results. This

shortcoming of the scene text detector gives low de-

tection accuracy for unconstrained handwritten text

(as shown in Fig. 5).

On the other hand, the object detector attributes

words as objects and is more efﬁcient for detect-

ing overlapping words in adjacent lines in a non-

horizontal orientation. For our task, a single-stage

detector YOLO (Bochkovskiy et al., 2020) performs

better than a two-stage detector Faster R-CNN (Ren

et al., 2015; Hanif et al., 2019; Hanif and Latecki, ).

Both are trained and evaluated on GNHK dataset (Lee

et al., 2021). In Table 1, we list the quantitative re-

sults for word detection on GNHK dataset. Due to its

best performance, we select the single-stage YOLO

word detector (Bochkovskiy et al., 2020) and apply

Strokes Trajectory Recovery for Unconstrained Handwritten Documents with Automatic Evaluation

665

Table 1: The accuracy (mAP) of word detection with state-of-the-art scene text detector (Baek et al., 2019) and single and

two-stage object detectors (Ren et al., 2015; Bochkovskiy et al., 2020).

Method mAP@0.5 mAP@0.5:0.95

Scene text detector (Baek et al., 2019) 0.603 0.565

Two-stage word detector (Ren et al., 2015) 0.780 0.565

Single-stage word detector (Bochkovskiy et al., 2020) 0.913 0.619

it to IAM-online dataset (Marti and Bunke, 2002) for

word detection.

Figure 4: (a) Word detection for the GNHK dataset (Lee

et al., 2021), (b) word detection for IAM-online dataset

(Marti and Bunke, 2002), (c) Words from IAM-online

dataset used to train our network.

In Fig. 4(a,b), we show the word detection vi-

sualization for sample images from GNHK dataset

(Lee et al., 2021) and IAM-online dataset (Marti

and Bunke, 2002) respectively. The images in

GNHK (Lee et al., 2021) vary in handwriting style,

background, and camera conditions. Therefore, the

word detection trained in GNHK dataset (Lee et al.,

2021) has acceptable performance for the IAM-online

dataset (Marti and Bunke, 2002). We used the word-

level annotations for the IAM-online dataset (Marti

and Bunke, 2002) to ﬁnetune and evaluate our sys-

tem. Word detection on the IAM-online dataset gives

us 41,665 and 19,496 words for train and test sets, re-

spectively. Fig. 4(b) shows the word detection from

lines from IAM-online, and Fig. 4(c) shows the words

used in our work to train the network.

3.1.1 Greeting Card Messages Dataset

In our work, we propose a word-level stroke trajectory

recovery.

To evaluate our system on unlabelled handwritten

documents, we acquire approximately 2,000 greeting-

cards handwritten messages (GHM) dataset from

greeting cards company (Signed, 2022). The GHM

dataset shared

. Handwritten messages from the

GHM dataset do not follow any ﬁxed template be-

cause it consists of user-uploaded handwritten mes-

sages to greeting cards company (Signed, 2022). The

https://drive.google.com/file/d/1G-EZBfEhsHThR9d

R1YtJdPE3Mg0ay0w-/view?usp=sharing

samples from the GHM dataset are shown in Fig. 5.

Figure 5: Sample documents form GHM dataset with unla-

belled handwriting images.

3.2 Evaluation Metric

We used a distance-based evaluation metric to evalu-

ate stroke trajectory recovery. The average distance

of points in the ground-truth (T) stroke to its nearest

predicted stroke (P) is denoted by dist

t,p

. Similarly,

the average distance of points in the predicted stroke

(P) to its nearest ground-truth stroke (T) is denoted by

dist

p,t

. The metric dist

t,p

signiﬁes that every ground-

truth stroke point is close to the predicted point and

vice versa for dist

p,t

. dist

t,p

and dist

p,t

are the same

evaluation metrics as used in (Archibald et al., 2021).

However, apart from the mean (mean) of the distances

between predicted and ground truth stroke points, we

also compute the standard deviation (std) of the met-

rics. We also compute the loss for predicting the start-

of-stroke token ε

sos

. ε

sos

should have the lowest value

if the start-of-stroke token is predicted correctly.

4 RESULTS

DTW and Chamfer distance are complementary loss

functions to train our system. DTW loss ensures

that the predicted stroke sequences are similar to the

ground truth stroke sequence. The Chamfer distance

between the predicted and ground truth point set en-

sures that there are no spurious points/artifacts in pre-

dicted points; that is, no predicted points are far away

from ground truth points.

ICPRAM 2023 - 12th International Conference on Pattern Recognition Applications and Methods

666

Table 2: The quantitative comparison for training on DTW loss for lines and words. The third row lists the quantitative results

for training with combined DTW and Chamfer distance loss between predicted and ground-truth points.

Loss function

dist

t,p

(mean)

dist

t,p

(std)

dist

p,t

(mean)

dist

p,t

(std)

sos

DTW

line

(Archibald et al., 2021) 0.01558 0.01402 0.02776 0.0353 0.1553

DTW

word

0.01653 0.014862 0.01287 0.00965 0.1427

DTW

word

+ Chamfer distance 0.01492 0.01042 0.01257 0.00946 0.1479

Table 3: The quantitative comparison of increasing the value of c from 1 to 3.

Loss function

dist

t,p

(mean)

dist

t,p

(std)

dist

p,t

(mean)

dist

p,t

(std)

c = 1 0.01492 0.01042 0.01257 0.00946

DTW

word

+ Chamfer distance

c = 3 0.0177 0.0118 0.01859 0.0162

4.1 IAM-Online Dataset

Table 2 presents a quantitative comparison, where

bold numbers show the best results (lowest value).

The ﬁrst and second rows of Table 2 show the eval-

uation with line-level and word-level input for DTW

loss, respectively. We observe that both dist

t,p

and

dist

p,t

metrics are much lower for word-level than the

line-level input. These results show that training the

stroke trajectory recovery with a word-level dataset,

as we proposed, improves the results. We also val-

idated that Chamfer distance loss for predicted and

ground truth point sets improves the quantitative re-

sults. Both mean and standard deviation of dist

p,t

and dist

t,p

decrease by adding Chamfer distance to

the loss function. It means that the predicted strokes

are better at imitating the ground-truth strokes. The

lower values of both dist

t,p

and dist

p,t

illustrate that

every ground-truth stroke has a close predicted stroke

and vice versa. So, we do not get spurious predicted

strokes and yet do not miss to follow the shape of

the ground-truth strokes. We noticed that chamfer

distance loss has minimal inﬂuence on start-of-stroke

(ε

sos

) as shown in Table 2. In another experiment, we

try the higher values of c (Eq. 6) as shown in Table

3, but by increasing the value of c increases the std

of dist

t,p

and dist

p,t

. Therefore, we keep the value of

c=1 in our work.

The visualization of the elimination of spurious

predicted points/artifacts from the predicted stroke

trajectory by adding Chamfer distance loss is illus-

trated in Fig. 6. We can see that extra stroke points

cause more artifacts in (a) than in (b).

4.2 Greeting Cards Messages Dataset

The previous methods on the IAM-online dataset

work with line-level text for stroke trajectory recovery

(Archibald et al., 2021), which is not scaleable to un-

Figure 6: Samples of stroke trajectory recovery with (a)

DTW loss and (b) DTW loss with Chamfer distance on the

predicted and ground truth point sets. The recovered stroke

trajectory is shown by red, blue, and green arrows, and the

predicted start-of-stroke point is shown as an orange circle

(best view in colored).

constrained handwritten text images without line de-

tection. This is one of the main reasons why we work

with word-level text.

In one of the applications of the proposed work,

we show the stroke trajectory recovery for greeting-

card handwritten messages (Signed, 2022). We ap-

plied the developed method trained for word-level an-

notation using DTW and Chamfer distance to the im-

ages containing greeting-card handwritten messages.

Handwriting images from greeting card messages do

not follow any ﬁxed template because they consist of

user-uploaded handwritten messages. Therefore, ﬁrst,

we detect the words in greeting card handwritten mes-

sages with word detector described in Sec. 3.1. Then

we execute the trained STR model on each detected

word. The recovery of stroke trajectory for greeting

card handwritten messages is the word-level stroke

trajectory recovery of each word.

We render the image from predicted stroke points

as described in Fig . 7, where the proposed STR sys-

tem predicts the stroke trajectory recovery for each

word. Whereas the render image module converts

stroke points into an image. Finally, we align the ren-

dered image to the location of the detected word.

The visual results of the proposed word-level STR

system on handwritten greeting card messages as

Strokes Trajectory Recovery for Unconstrained Handwritten Documents with Automatic Evaluation

667

Figure 7: The mechanism of image rendering from pre-

dicted stroke points.

shown in Fig. 9 are rendered word-wise by the im-

age render module illustrated in Fig. 7. Our proposed

STR system can easily be applied to text with differ-

ent orientations, as shown in Fig. 9. The left-hand

images are input handwritten messages, and the right-

hand images are rendered from the recovered stroke

trajectory for each word.

4.2.1 Automatic Evaluation

We also applied the proposed automated evaluation

method to the GHM dataset. In Fig. 8, we show the

input handwritten messages, the dilated image, and

connected components for I

Di f f

. The recovered im-

age is dilated to match the width of the words in the

input handwriting image as described in Sec. 2.3.1.

According to the criteria deﬁned in Sec. 2.3.1, the

Fig. 8(top) example shows the good stroke trajec-

tory recovery with small scattered connected compo-

nents. Whereas, Fig. 8(bottom) shows the poor stroke

trajectory recovery as there are larger connected com-

ponents in the difference image I

Di f f

In Table 4, we listed the accuracy of the image

matching and readability-based evaluation proposed

in our work. The ﬁrst column in Table 4 shows

the percentage of documents with correctly recov-

ered stroke trajectories according to our two proposed

quality measures. According to the thresholds de-

ﬁned in Sec 2.3.1 and Sec. 2.3.2, the accuracy of

stroke recovery from image matching and readability

is 24.30% and 24.76%, respectively.

We manually veriﬁed the accuracy with user scor-

ing.

If the threshold deﬁned in image matching and

readability evaluation classiﬁes the image from the re-

covered stroke trajectory as satisfactory and the user

also gives a satisfactory score to the recovered image,

then a conﬁdence score of 1 is assigned to that recon-

struction. The average conﬁdence score (con f idence)

for all the images is computed. The user scores the

image in binary, scoring either 0 or 1. We applied

this binary criterion to evaluate the robustness of the

thresholds deﬁned in Sec. 2.3.1 and Sec. 2.3.2. The

accuracy of image matching and readability with the

conﬁdence score (con f idence) are listed in Table 4.

Our observation shows that the automatic evaluation

Figure 8: The input image, dilated image (after dilation is

applied to the recovered image), and the difference of the

input and dilated images.

based on image matching is a better evaluation mea-

sure than the readability evaluation. One of the rea-

sons is that the text recognizer can correctly recognize

words even if they are visually dissimilar to the input

handwritten words.

Table 4: The quantitative analysis of automatic evaluation

on greeting cards handwritten messages (Signed, 2022).

Method Accuracy Conﬁdence

Image matching 24.30% 74.0%

Readability evaluation 24.76% 53.8%

5 CONCLUSIONS

In our proposed work, we trained a neural net-

work with word-level annotations for the IAM-online

dataset using DTW and Chamfer distance loss func-

tions. We demonstrated that adding Chamfer distance

loss to DTW is beneﬁcial for removing artifacts and

spurious stroke points for better stroke trajectory re-

covery. We also proposed automatic evaluation meth-

ods using image matching and readability consistency

to evaluate the quality of stroke trajectory recovery for

unlabeled datasets. Finally, we demonstrate the abil-

ity of our proposed work to work in unconstrained

practical applications by applying and evaluating it

on an unlabeled handwriting greeting card messages

dataset.

ACKNOWLEDGEMENT

We would like to thank Signed Inc. for funding this

work. The calculations were carried out on HPC

resources at Temple University, supported in part

ICPRAM 2023 - 12th International Conference on Pattern Recognition Applications and Methods

668

Figure 9: The input image and the rendered image from predicted stroke trajectory recovery for GHM dataset.

by the NSF through grant number 1625061 and by

the U.S. ARL under contract number W911NF-16-2-

0189. We would also like to thank the undergraduate

student Abdulrahman Samir Alshehri for cleaning the

dataset for the handwritten documents

REFERENCES

Akmal Butt, M. and Maragos, P. (1998). Optimum design

of chamfer distance transforms. IEEE Transactions on

Image Processing, 7(10):1477–1484.

Alberti, M., V

ogtlin, L., Pondenkandath, V., Seuret, M., In-

gold, R., and Liwicki, M. (2019). Labeling, cutting,

grouping: an efﬁcient text line segmentation method

Strokes Trajectory Recovery for Unconstrained Handwritten Documents with Automatic Evaluation

669

for medieval manuscripts. In 2019 International Con-

ference on Document Analysis and Recognition (IC-

DAR), pages 1200–1206. IEEE.

Archibald, T., Poggemann, M., Chan, A., and Martinez, T.

(2021). Trace: A differentiable approach to line-level

stroke recovery for ofﬂine handwritten text. arXiv

preprint arXiv:2105.11559.

Baek, Y., Lee, B., Han, D., Yun, S., and Lee, H. (2019).

Character region awareness for text detection. In Pro-

ceedings of the IEEE/CVF Conference on Computer

Vision and Pattern Recognition, pages 9365–9374.

Barakat, B. K., El-Sana, J., and Rabaev, I. (2019). The

pinkas dataset. In 2019 International Conference on

Document Analysis and Recognition (ICDAR), pages

732–737. IEEE.

Berndt, D. J. and Clifford, J. (1994). Using dynamic time

warping to ﬁnd patterns in time series. In KDD work-

shop, volume 10, pages 359–370. Seattle, WA, USA:.

Bhunia, A. K., Bhowmick, A., Bhunia, A. K., Konwer, A.,

Banerjee, P., Roy, P. P., and Pal, U. (2018). Hand-

writing trajectory recovery using end-to-end deep

encoder-decoder network. In 2018 24th International

Conference on Pattern Recognition (ICPR), pages

3639–3644. IEEE.

Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M. (2020).

Yolov4: Optimal speed and accuracy of object detec-

tion. arXiv preprint arXiv:2004.10934.

Boillet, M., Kermorvant, C., and Paquet, T. (2021). Mul-

tiple document datasets pre-training improves text

line detection with deep neural networks. In 2020

25th International Conference on Pattern Recognition

(ICPR), pages 2134–2141. IEEE.

Demır, A. A.,

OzS¸eker,

I., and

Ozkaya, U. (2021). Text line

segmentation in handwritten documents with genera-

tive adversarial networks. In 2021 International Con-

ference on INnovations in Intelligent SysTems and Ap-

plications (INISTA), pages 1–5. IEEE.

Diaz, M., Crispo, G., Parziale, A., Marcelli, A., and Ferrer,

M. A. (2022). Writing order recovery in complex and

long static handwriting.

Faundez-Zanuy, M., Fierrez, J., Ferrer, M. A., Diaz, M.,

Tolosana, R., and Plamondon, R. (2020). Hand-

writing biometrics: Applications and future trends

in e-security and e-health. Cognitive Computation,

12(5):940–953.

Gader, T. B. A. and Echi, A. K. (2020). Unconstrained

handwritten arabic text-lines segmentation based on

ar2u-net. In 2020 17th International Conference on

Frontiers in Handwriting Recognition (ICFHR), pages

349–354. IEEE.

Gader, T. B. A. and Echi, A. K. (2022). Deep learning-

based segmentation of connected components in ara-

bic handwritten documents. In International Confer-

ence on Intelligent Systems and Pattern Recognition,

pages 93–106. Springer.

Guyon, I., Schomaker, L., Plamondon, R., Liberman, M.,

and Janet, S. (1994). Unipen project of on-line data

exchange and recognizer benchmarks. In Proceed-

ings of the 12th IAPR International Conference on

Pattern Recognition, Vol. 3-Conference C: Signal Pro-

cessing (Cat. No. 94CH3440-5), volume 2, pages 29–

33. IEEE.

Hanif, S. and Latecki, L. J. Autonomous character region

score fusion for word detection in camera-captured

handwriting documents.

Hanif, S., Li, C., Alazzawe, A., and Latecki, L. J. (2019).

Image retrieval with similar object detection and local

similarity to detected objects. In Paciﬁc Rim Inter-

national Conference on Artiﬁcial Intelligence, pages

42–55. Springer.

Kurar Barakat, B., Cohen, R., Droby, A., Rabaev, I., and

El-Sana, J. (2020). Learning-free text line segmen-

tation for historical handwritten documents. Applied

Sciences, 10(22):8276.

Lee, A. W., Chung, J., and Lee, M. (2021). Gnhk: A dataset

for english handwriting in the wild. In International

Conference on Document Analysis and Recognition,

pages 399–412. Springer.

Li, M., Lv, T., Cui, L., Lu, Y., Florencio, D., Zhang, C.,

Li, Z., and Wei, F. (2021). Trocr: Transformer-based

optical character recognition with pre-trained models.

arXiv preprint arXiv:2109.10282.

Liu, C.-L., Yin, F., Wang, D.-H., and Wang, Q.-F.

(2011). Casia online and ofﬂine chinese handwriting

databases. In 2011 International Conference on Doc-

ument Analysis and Recognition, pages 37–41. IEEE.

Marti, U.-V. and Bunke, H. (2002). The iam-database:

an english sentence database for ofﬂine handwrit-

ing recognition. International Journal on Document

Analysis and Recognition, 5(1):39–46.

Nguyen, H. T., Nakamura, T., Nguyen, C. T., and

Nakawaga, M. (2021). Online trajectory recovery

from ofﬂine handwritten japanese kanji characters of

multiple strokes. In 2020 25th International Con-

ference on Pattern Recognition (ICPR), pages 8320–

8327. IEEE.

Plamondon, R. and Srihari, S. N. (2000). Online and off-

line handwriting recognition: a comprehensive survey.

IEEE Transactions on pattern analysis and machine

intelligence, 22(1):63–84.

Privitera, C. M. and Plamondon, R. (1995). A system for

scanning and segmenting cursively handwritten words

into basic strokes. In Proceedings of 3rd International

Conference on Document Analysis and Recognition,

volume 2, pages 1047–1050. IEEE.

Rabhi, B., Elbaati, A., Boubaker, H., Hamdi, Y., Hussain,

A., and Alimi, A. M. (2021). Multi-lingual character

handwriting framework based on an integrated deep

learning based sequence-to-sequence attention model.

Memetic Computing, 13(4):459–475.

Rabhi, B., Elbaati, A., Boubaker, H., Pal, U., and Alimi,

A. (2022). Multi-lingual handwriting recovery frame-

work based on convolutional denoising autoencoder

with attention model.

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster

r-cnn: Towards real-time object detection with region

proposal networks. Advances in neural information

processing systems, 28.

ICPRAM 2023 - 12th International Conference on Pattern Recognition Applications and Methods

670

Senatore, R., Santoro, A., Parziale, A., and Marcelli, A.

(2022). A biologically inspired approach for recov-

ering the trajectory of off-line handwriting.

Signed (2022).

Surinta, O., Holtkamp, M., Karaaba, M. F., van Oosten, J.,

Schomaker, L. R. B., and Wiering, M. A. (2014a).

A* path planning for line segmentation of handwrit-

ten documents. In Frontiers in Handwriting Recog-

nition (ICFHR), 2014 14th International Conference

on, pages 175–180. IEEE.

Surinta, O., Holtkamp, M., Karabaa, F., Van Oosten, J.-P.,

Schomaker, L., and Wiering, M. (2014b). A path plan-

ning for line segmentation of handwritten documents.

In 2014 14th international conference on frontiers in

handwriting recognition, pages 175–180. IEEE.

Viard-Gaudin, C., Lallican, P.-M., and Knerr, S. (2005).

Recognition-directed recovering of temporal informa-

tion from handwriting images. Pattern Recognition

Letters, 26(16):2537–2548.

Viard-Gaudin, C., Lallican, P. M., Knerr, S., and Binter,

P. (1999). The ireste on/off (ironoff) dual handwrit-

ing database. In Proceedings of the Fifth Interna-

tional Conference on Document Analysis and Recog-

nition. ICDAR’99 (Cat. No. PR00318), pages 455–

458. IEEE.

Yavariabdi, A., Kusetogullari, H., Celik, T., Thummana-

pally, S., Rijwan, S., and Hall, J. (2022). Cardis:

A swedish historical handwritten character and word

dataset. IEEE Access, 10:55338–55349.

Strokes Trajectory Recovery for Unconstrained Handwritten Documents with Automatic Evaluation

671