show that this model greatly improves the generation performance over a state-of-the-art benchmark model.
•
We experiment with performing style transfer to new writers using this model, and we show that it achieves much better results than the benchmark model.
•
Finally, and perhaps most interestingly, we further analyze the latent space extracted from our model to show that there is a limited number of styles for each letter and that the style manifold is not a continuous space.
2 RELATED WORK
2.1 Generative Models
Recent advances in deep learning (Goodfellow et al., 2016) architectures and optimization methods have led to remarkable results in the area of generative models. For
static data, like images, the mainstream research builds
on the advances in Variational Autoencoders (Kingma
and Welling, 2013) and Generative Adversarial Net-
works (Goodfellow et al., 2014).
For generating sequences, the problem is more dif-
ficult: the model generates one frame at a time, and
the final result must be coherent over long sequences.
Recent recurrent neural network architectures, such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRU) (Chung et al., 2014), achieve unprecedented performance in handling long sequences.
These architectures have been used in many applications, such as learning language models (Sutskever et al., 2014), image captioning (Vinyals et al., 2015), music generation (Briot and Pachet, 2017), and speech synthesis (Oord et al., 2016).
We use these powerful tools to extract meaningful latent spaces for styles. Our work is strongly inspired by the seminal work of Ha and Eck (2017), who investigated the problem of sketch drawing (Google, 2017) using a Variational Autoencoder. The latent space that emerged from training encoded meaningful semantic information about these drawings. We use a similar architecture here, without the variational part, and observe a similar behaviour.
2.2 Data Representation
For handwriting, a continuous coordinate representa-
tion (e.g. continuous X, Y) seems the natural option.
However, generating continuous data is not straight-
forward. Traditionally, in neural networks, when we
want to output a continuous value, a simple linear or
Tanh activation function is used in the output layer of
the neural network.
However, Bishop (Bishop, 1994) studied the limitations of these functions and showed that they cannot model rich distributions. In particular, when an input can have multiple valid outputs (one-to-many), these functions will average over all the outputs. He proposed using a Gaussian Mixture Model (GMM) as the final activation function of a neural network. The combination of a neural network and a GMM is called a Mixture Density Network (MDN). Training consists in optimizing the GMM parameters (mixture weights, means, standard deviations). Inference is done by sampling from the GMM distribution.
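To make the MDN idea concrete, the following minimal NumPy sketch shows how raw network outputs can be interpreted as the parameters of a one-dimensional GMM and then sampled from. The layer sizes, variable names, and parameter values here are illustrative assumptions, not taken from any specific model in this paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mdn_sample(raw, n_components, rng):
    """Interpret raw network outputs as 1-D GMM parameters and sample.

    `raw` holds 3 * n_components values: mixture logits, means, log std devs.
    """
    logits, means, log_stds = np.split(raw, 3)
    weights = softmax(logits)                 # mixture weights sum to 1
    k = rng.choice(n_components, p=weights)   # pick a mixture component
    return rng.normal(means[k], np.exp(log_stds[k]))  # sample from it

rng = np.random.default_rng(0)
raw = np.array([0.1, 2.0, -1.0,     # mixture logits
                -3.0, 0.0, 3.0,     # component means
                -1.0, -1.0, -1.0])  # log standard deviations
x = mdn_sample(raw, 3, rng)
```

Because inference samples a component first, a one-to-many input can yield any of its plausible outputs instead of their uninformative average.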
To simplify the process and focus our study on investigating styles, we extract two features from the tracings: direction and speed (explained in section 3), and we quantize these features. Thus, we model each point in the letter tracings as two categorical distributions, and use two SoftMax functions (one for each feature) as the outputs of the network, which is much simpler than an MDN. This approach was inspired by the studies in (Oord et al., 2016), which report impressive results on originally continuous data using a suitable quantization policy. Categorical distributions are more flexible and generic than continuous ones.
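As an illustration of this representation, the sketch below quantizes one pen displacement into a direction bin and a speed bin, and samples a point from the two SoftMax heads. The bin counts and the maximum speed are illustrative assumptions, not the actual configuration used in this work.

```python
import numpy as np

N_DIR_BINS, N_SPEED_BINS = 16, 8  # illustrative bin counts

def quantize_point(dx, dy, max_speed=10.0):
    """Quantize one pen displacement into (direction bin, speed bin)."""
    angle = np.arctan2(dy, dx)  # in [-pi, pi]
    dir_bin = int((angle + np.pi) / (2 * np.pi) * N_DIR_BINS) % N_DIR_BINS
    speed = min(np.hypot(dx, dy), max_speed)
    speed_bin = min(int(speed / max_speed * N_SPEED_BINS), N_SPEED_BINS - 1)
    return dir_bin, speed_bin

def sample_point(dir_logits, speed_logits, rng):
    """A point is two categorical distributions: sample one SoftMax each."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    d = rng.choice(N_DIR_BINS, p=softmax(dir_logits))
    s = rng.choice(N_SPEED_BINS, p=softmax(speed_logits))
    return d, s
```

At generation time, each sampled (direction, speed) pair is converted back into a displacement and accumulated into the pen trajectory.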
2.3 Evaluation Metrics
The objective evaluation of a generative model's performance is a challenging task, since there is no consensus on objective evaluation metrics. In many cases,
a subjective evaluation is performed to overcome this
problem. For handwriting of Chinese characters, Chang et al. (2018) proposed two metrics: Content accuracy and Style discrepancy. For the first metric, a classifier is trained on the reference letters to recognize the type of the letter, and is then used to evaluate the generated letters. However, it is not clear how reliably a classifier trained on one distribution (the reference letters) can evaluate a new distribution (the generated letters). The second metric is not applicable to our case, since it assumes a Convolutional Neural Network (CNN) operating on the image of the letter, whereas we use the pen sequence of drawing the letter (i.e., temporal data) with RNNs.
We use the same metrics as in (Mohammed et al.,
2018) to evaluate the quality of handwriting generation:
the BLEU score (Papineni et al., 2002) – a metric
widely used in text translation and image captioning –
and the End of Sequence (EoS) analysis (both metrics
are explained in section 5).
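As a reminder of how BLEU applies to symbol sequences (in our setting, quantized stroke tokens play the role of words), here is a minimal pure-Python sketch of single-reference BLEU with clipped n-gram precisions and a brevity penalty. The full metric of Papineni et al. also supports multiple references and corpus-level aggregation, which are omitted here.

```python
import math
from collections import Counter

def ngrams(seq, n):
    """Count the n-grams of a sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU: geometric mean of clipped n-gram
    precisions, times a brevity penalty (no smoothing)."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0  # any empty precision zeroes the (unsmoothed) score
        log_prec += math.log(overlap / total)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_prec / max_n)
```

An identical candidate and reference score 1.0, while sequences sharing no n-grams score 0.0.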
ICAART 2019 - 11th International Conference on Agents and Artificial Intelligence