approach could provide will be measured by the improvement in classification accuracy. A Neural Network based on the encoder-decoder paradigm projects the data into a manifold, yielding feature vectors on which modifications are performed; this allows us to explore different strategies for creating new examples. These new instances can come either from the feature space or from the original data space, through the reconstruction provided by the decoder.
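The following minimal sketch illustrates this pipeline. The linear encode/decode functions, the dimensionalities and the Gaussian noise distortion are hypothetical stand-ins chosen so that the example runs; they are not the trained network or the specific distortion strategies studied in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a trained encoder/decoder pair: simple random
# linear maps, so the sketch runs end to end. In practice the projection is
# learned by a neural network.
W_enc = rng.normal(size=(64, 16))   # data space (64-d) -> feature space (16-d)
W_dec = rng.normal(size=(16, 64))   # feature space -> data space

def encode(x):
    return x @ W_enc

def decode(z):
    return z @ W_dec

x = rng.normal(size=(10, 64))       # a small batch of original samples
z = encode(x)                       # feature vectors on the manifold

# Controlled distortion in feature space, here additive Gaussian noise.
z_aug = z + rng.normal(scale=0.1, size=z.shape)

# New examples can be kept as feature vectors (z_aug) or mapped back to the
# original data space through the decoder.
x_aug = decode(z_aug)
```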
2 PREVIOUS WORK
Data augmentation has been used for years in problems involving images. Applying this process is relatively straightforward under these conditions, since the transformations typically used in this context are plausible in the real world, such as scaling or rotating an image, or simulating different camera positions. Using this technique is therefore advantageous when the number of original images is scarce, or to improve generalisation, making the classifier more robust to such modifications when they occur in an uncontrolled manner, as is very common in natural images. Examples of the application of these methods can be found in one of the first successful Convolutional Neural Networks (CNN) (LeCun et al., 1998) and in the more recent breakthrough in machine learning led by the AlexNet model (Krizhevsky et al., 2012). This kind of procedure has also been applied to problems that do not involve images, for instance adding Gaussian noise to speech (Schlüter and Grill, 2015). However, the modifications used to create these synthetic sets are usually handcrafted and problem-dependent; even for images, it is difficult to determine which kind of transformation works best.
On the other hand, the class imbalance scenario was one of the precursors of these techniques. Under such a disproportion, training an unbiased classifier can be difficult. To address this situation, the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla et al., 2002) was proposed. This approach is based on the idea of performing modifications in the feature space, that is, the sub-space learned by the classifier, for example the space that accommodates the projection after a hidden layer in a Neural Network. Once the data is projected into this space, the method oversamples the least frequent class by generating new instances with different schemes, such as interpolation or noise addition.
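As a rough illustration of this oversampling idea, the sketch below interpolates between pairs of minority-class feature vectors. It is simplified with respect to the original algorithm: partners are drawn at random rather than from the k nearest neighbours, and the function name, dimensions and sample counts are illustrative.

```python
import numpy as np

def smote_like_oversampling(minority, n_new, rng=None):
    """Create n_new synthetic minority-class samples by interpolating between
    pairs of existing minority samples. Simplified with respect to SMOTE
    (Chawla et al., 2002): partners are chosen at random instead of among the
    k nearest neighbours."""
    if rng is None:
        rng = np.random.default_rng(0)
    new = []
    for _ in range(n_new):
        i, j = rng.choice(len(minority), size=2, replace=False)
        lam = rng.uniform()  # interpolation factor in [0, 1)
        new.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.stack(new)

# Example: oversample a small set of minority-class feature vectors.
minority_feats = np.random.default_rng(1).normal(size=(20, 16))
synthetic = smote_like_oversampling(minority_feats, n_new=40)
```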
In (Wong et al., 2016), the authors studied the benefits of using synthetically created data to train different classifiers. Their strategy was to limit the available samples to a certain number and then compare adding the remaining data, which simulates the acquisition of unseen real data, against adding generated samples. As in SMOTE, they distinguished between the original data space and the feature space. Their results showed that, when suitable distortions are known, it is better to perform them in the original space than in the feature space. The authors also showed that the improvement is bounded by the accuracy obtained when the same amount of real unseen data is included.
Another approach that follows this idea can be seen in (DeVries and Taylor, 2017), where a method, also inspired by SMOTE, is proposed to perform modifications in the feature space. The idea behind this is to be able to deal with any problem once the data has been projected into this space; the method can therefore be applied to any task, as it does not depend on the input of the system. In this case, the authors address the limited availability of labelled data. Their approach is based on an encoder that performs the projection into the feature space and a decoder that maps these vectors back to the original space. During this encoding-decoding procedure, new instances are created with different techniques such as interpolation, extrapolation and noise addition. After decoding the feature vectors, they can either obtain new instances in the original space or use the distorted versions in the feature space. The model they used is based on a Sequence AutoEncoder (Srivastava et al., 2015) implemented as a stacked LSTM (Li and Wu, 2015). Concerning datasets, they conducted experiments with sequential data, e.g. sequences of strokes or sequences of vectors, but they also performed experiments with images treated as sequences of row vectors. They found that performing extrapolation in the feature space and then classifying the resulting vectors gives better results than applying affine transformations in the original space, projecting them into the feature space and then classifying. In addition, they carried out experiments using the reconstruction provided by the decoder, but the results showed that including extrapolated and reconstructed samples decreases performance in this case.
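The sketch below illustrates the three feature-space operations mentioned above (interpolation, extrapolation and noise addition). The function names, the fixed interpolation/extrapolation factor and the feature dimensionality are illustrative assumptions, not taken from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate(z_i, z_j, lam=0.5):
    # Move z_i towards a neighbouring feature vector z_j.
    return z_i + lam * (z_j - z_i)

def extrapolate(z_i, z_j, lam=0.5):
    # Push z_i away from z_j, beyond z_i itself.
    return z_i + lam * (z_i - z_j)

def add_noise(z_i, sigma=0.1):
    # Perturb the feature vector with isotropic Gaussian noise.
    return z_i + rng.normal(scale=sigma, size=z_i.shape)

# Augment a pair of same-class feature vectors with the three operations.
z_i, z_j = rng.normal(size=(2, 16))
augmented = [interpolate(z_i, z_j), extrapolate(z_i, z_j), add_noise(z_i)]
# Each augmented vector can either be used directly in feature space or
# passed through the decoder to obtain a sample in the original space.
```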
In this work, we provide a study of the capacity of a generative model to create examples that improve the generalisation of different classifiers. To do so, we use a generative network with an encoder/decoder architecture whose manifold constitutes the feature space. The generated samples are the result of a process of controlled distortion in this feature space. Additionally, we want to evaluate two possi-