body, writing angle, density, word spacing, and
ductus. Although effective, all such features
represent a completely manual solution to the
problem, since extracting them requires human
experts, possibly supported by image editing tools
(David and Karl, 2014).
In order to automate the management of handwritten
corpora, we propose a completely automatic
representation of elements based on the notion of
handwriting shape. To model the writing shape, a set of
effective visual characteristics (called local features)
is extracted from each element using specific image
analysis techniques such as SIFT (Lowe, 1999) and
SURF (Bay et al., 2008). The local features so
obtained are then compared in order to establish
their degree of (dis)similarity, with the final aim of
establishing whether a corpus is authentic or not with
respect to a specific author. This implies a
preprocessing phase in which some handwritten pages
of authentic writings are analyzed in order to build
“ground truth” reference information against which
suspicious writings are compared. Given an input
page, composed of a set
of target elements, and an element distance function
that measures the (dis)similarity of a given pair of
elements using their local features, we want to de-
termine automatically if the query manuscript page
could be considered authentic with respect to a
specific author. The (dis)similarity between pages is
numerically assessed by way of a page distance
function that “combines” the single element
distances into an overall value. Preliminary
experimental results obtained with a software
implementation of the proposed solution, namely
WRITINGSIMILARITYSEARCH (WSS), on real data
demonstrate the effectiveness of our method and
encourage further investigations in this direction.
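
The relationship between local features, element distance, and page distance can be illustrated with a minimal sketch. It is not the WSS implementation: it assumes each element is already described by a list of numeric descriptor vectors (as SIFT or SURF would produce), and the specific choices made here (Euclidean descriptor distance, symmetric average nearest-neighbour element distance, mean of best matches as page distance, and a fixed decision threshold) are hypothetical, since this section does not specify how the distances are combined. All function names are illustrative.

```python
import math


def element_distance(e1, e2):
    """Assumed element distance: symmetric average nearest-neighbour
    Euclidean distance between two lists of local-feature descriptors."""
    def directed(src, dst):
        # For every descriptor in src, distance to its closest descriptor in dst.
        return sum(min(math.dist(d, t) for t in dst) for d in src) / len(src)

    return 0.5 * (directed(e1, e2) + directed(e2, e1))


def page_distance(query_page, reference_page):
    """Assumed page distance: each query element is matched to its closest
    reference element, and the resulting element distances are averaged."""
    return sum(
        min(element_distance(q, r) for r in reference_page)
        for q in query_page
    ) / len(query_page)


def is_authentic(query_page, reference_pages, threshold):
    """Hypothetical decision rule: the query page is accepted if it is close
    enough to at least one authentic ("ground truth") page."""
    best = min(page_distance(query_page, ref) for ref in reference_pages)
    return best <= threshold


# Toy data: a page is a list of elements, an element is a list of descriptors.
authentic = [[(0.0, 0.0), (1.0, 0.0)], [(5.0, 5.0)]]
suspicious = [[(100.0, 100.0)]]
print(page_distance(authentic, authentic))
print(is_authentic(suspicious, [authentic], threshold=0.5))
```

Averaging the best matches is only one way of combining element distances; a minimum, a maximum, or a weighted sum would fit the same interface.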
The rest of the paper is organized as follows: Sec-
tion 2 reports the related work; Section 3 details the
proposed similarity-based image retrieval approach.
In Section 4 we describe WSS, whereas in Section 5
we comment on some preliminary experimental results
based on real document collections. Finally, Section 6
concludes the paper.
2 RELATED WORK
In this section we report state-of-the-art solutions to
authorship attribution with respect to the specific field
of image analysis.
OCR is a traditional approach based on pattern
recognition techniques that enable a computer to read
texts (i.e., scanned images of texts) (Bunke and
Wang, 1997). However, while this is a feasible solution
for printed text, its use for manuscripts is rather
problematic. In general, OCR applied to handwritten texts
is far from perfect because of the issue of
“variations”. For example, the same letter drawn by the
same person is slightly different each time, just as
letters drawn by different hands differ from one
another. These variations make it hard for the
computer to read the writing correctly and to make a
successful match in the context of authorship
attribution.
Due to the poor recognition results provided by OCR,
handwritten document image retrieval remains a very
challenging problem. Keeping documents in image
format is a more economical and flexible alternative
to converting image documents into text format by
OCR; furthermore, it is more robust to
variations and degradations (David and Karl, 2014).
In (Aiolli and Ciula, 2009), the authors propose
the tool System for Paleographic Inspections (SPI).
SPI solves the problem of variations by training and
working on prototypes of letters, i.e., collecting
abstracted models of a single person’s handwriting. Each
prototype comes with a predefined set of limits
within which a letter belonging to the unidentified
document may deviate from the prototype. The main
limitation of SPI is that the segmentation process
focuses on the shape of individual letters only. Thus,
the overall appearance of the manuscript page at the
different levels of granularity (i.e., sentences, words,
etc.) and its immediate context are completely ignored.
In (Rath et al., 2004), a probabilistic annotation
model for word matching in written documents is
presented. Word images are represented by means
of Fourier coefficients. A learning model is trained
to map any given word image to a specific word
from a vocabulary with an associated probability. At
query time, the model estimates the probability of a
query word and a sequence of feature vectors occurring
together. The method performs purely text-based
retrieval and supports multi-word queries; however, it
suffers when query words do not appear in the training
set.
Finally, (Cao et al., 2011) propose an adapted vector
model for word retrieval, where documents and
queries are represented in a vector space by means of
term frequency (TF) and inverse document frequency
(IDF) values for each term in the vocabulary. TFs and
IDFs are estimated by means of word segmentation and
recognition likelihood. Retrieval is achieved by
measuring the similarity between the query vector and
the vectors of the data documents, which yields a
ranked list. Similarly to (Rath et al., 2004), this
approach is also impracticable when queries do not
belong to the vocabulary.
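
The vocabulary limitation can be made concrete with a minimal sketch of the textbook TF-IDF vector model. This is not the adapted model of (Cao et al., 2011): it assumes that documents are already available as lists of recognized terms, uses raw counts instead of likelihood-based estimates, and adopts cosine similarity and a smoothed IDF as illustrative choices. All names are hypothetical.

```python
import math
from collections import Counter


def build_idf(documents):
    """Smoothed inverse document frequency for every term in the collection
    (the smoothing variant is an assumption of this sketch)."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(1 + n / count) for term, count in df.items()}


def tfidf_vector(terms, idf):
    """Sparse TF-IDF vector; terms outside the vocabulary are dropped."""
    tf = Counter(t for t in terms if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (score, document index) pairs, best match first."""
    idf = build_idf(documents)
    q = tfidf_vector(query, idf)
    scores = [(cosine(q, tfidf_vector(doc, idf)), i)
              for i, doc in enumerate(documents)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


docs = [["letter", "from", "rome"],
        ["letter", "from", "paris"],
        ["diary", "of", "travel"]]
print(rank(["rome"], docs))    # document 0 is ranked first
print(rank(["venice"], docs))  # out-of-vocabulary query: every score is 0.0
```

In this sketch an out-of-vocabulary query maps to an empty vector, so every document receives the same score and the ranking carries no information.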
To the best of our knowledge, WRITINGSIMI-
LARITYSEARCH is the first attempt to provide a thor-