converter. In general, creating a voice converter requires parallel data, that is, utterances with the same speech content spoken by both a source speaker and a target speaker, and preparing a large amount of such data is expensive.
When the cost of collecting data is high, developing applications becomes difficult, so this problem must be solved in order to develop various applications using VC techniques. Although a VC method that converts the spectral envelope with a DNN has been proposed (Xie et al., 2014), the large dimensionality of the log spectral envelope used as the DNN input makes the network structure complicated; consequently, a large amount of training data is required and the conversion time becomes long.
The aforementioned VC methods are one-to-one VC, in which a particular source speaker's voice is converted into a particular target speaker's. Many-to-one VC methods, which convert an arbitrary source speaker's voice into a particular target speaker's, and many-to-many VC methods, which convert an arbitrary source speaker's voice into an arbitrary target speaker's, have also been proposed (Toda et al., 2006; Liu et al., 2015). Here, an arbitrary speaker is a speaker whose data have not been used for training the voice converter. If many-to-one VC becomes possible, the cost of creating a voice converter is reduced because a new converter does not have to be built for every new source speaker. However, creating a many-to-one voice converter requires more speech data than creating a one-to-one one, because a many-to-one voice converter must be trained on speech data from multiple speakers. Thus, to realize many-to-one VC easily, the voice converter should be built from less data.
In this study, we aim to reduce the amount of training data and to shorten the conversion time for one-to-one and many-to-one VC by using autoencoders and a relatively simple DNN.
In the proposed method, autoencoders are first trained on data of the source speaker and the target speaker, respectively, and higher-order features are extracted from each autoencoder. Then a DNN that converts the higher-order features of the source speaker into those of the target speaker is trained. For new voice data from the source speaker, the target speaker's higher-order features are obtained by feeding the higher-order features of that voice data into the DNN. The acoustic features are then restored from the converted higher-order features using the weights of the target speaker's autoencoder. Finally, the converted voice is obtained from the acoustic features by speech synthesis.
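The conversion pipeline described above (encode with the source autoencoder, map with the DNN, decode with the target autoencoder) can be sketched as follows. This is a minimal NumPy illustration with untrained random weights; the single-hidden-layer tied-weight autoencoder, the two-layer mapping network, and all dimensions are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Autoencoder:
    """Single-hidden-layer autoencoder with tied weights (illustrative)."""
    def __init__(self, n_in, n_hidden):
        self.W = rng.normal(0, 0.1, (n_in, n_hidden))
        self.b_h = np.zeros(n_hidden)
        self.b_v = np.zeros(n_in)

    def encode(self, x):
        # Acoustic features -> higher-order features.
        return sigmoid(x @ self.W + self.b_h)

    def decode(self, h):
        # Restore acoustic features from higher-order features
        # using this autoencoder's (tied) weights.
        return h @ self.W.T + self.b_v

class MappingDNN:
    """Small network converting source higher-order features to target ones."""
    def __init__(self, n_src, n_tgt, n_hidden=32):
        self.W1 = rng.normal(0, 0.1, (n_src, n_hidden))
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_tgt))

    def forward(self, h):
        return sigmoid(sigmoid(h @ self.W1) @ self.W2)

def convert(frames, ae_src, ae_tgt, dnn):
    """Encode source frames, map to the target feature space, decode."""
    h_src = ae_src.encode(frames)
    h_tgt = dnn.forward(h_src)
    return ae_tgt.decode(h_tgt)

# Example: 24-dimensional MFCC-like frames, 16-dimensional higher-order features.
n_mfcc, n_hid = 24, 16
ae_src, ae_tgt = Autoencoder(n_mfcc, n_hid), Autoencoder(n_mfcc, n_hid)
dnn = MappingDNN(n_hid, n_hid)
frames = rng.normal(size=(100, n_mfcc))   # stand-in for extracted acoustic features
converted = convert(frames, ae_src, ae_tgt, dnn)
print(converted.shape)  # (100, 24)
```

In practice, each component would be trained as described above (the autoencoders on each speaker's data, the DNN on paired higher-order features) before conversion, and a vocoder would synthesize speech from the converted frames.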
The remainder of the paper is organized as follows. In Section 2, we describe related work on VC methods using RBMs and autoencoders. In Section 3, we explain the overview and key aspects of the proposed method and the benefits of autoencoders. In Section 4, we present evaluations in which the proposed method was compared with conventional methods in one-to-one and many-to-one VC settings. In Section 5, we conclude the paper and discuss future work.
2 RELATED WORK
There are many studies on VC. In this section, we describe VC methods using DNNs, as they have been actively proposed in recent years.
Nakashika et al. (Nakashika et al., 2015) proposed a VC method using speaker-dependent CRBMs. A CRBM of the source speaker and one of the target speaker are trained; then the higher-order features obtained from the source speaker's CRBM are converted into those obtained from the target speaker's CRBM by a neural network (NN). The converted higher-order features are restored to acoustic features by the inverse projection of the target speaker's CRBM, from which the speech signal is obtained. In the evaluation experiments, this method outperformed conventional VC methods using GMMs, RBMs, and recurrent neural networks (RNNs). The voice converter can be created without a large dataset, and VC can be realized in a shorter time owing to the use of 24-dimensional MFCCs as acoustic features.
Nguyen et al. (Nguyen et al., 2016) proposed a speaker conversion method that comprehensively converts the spectral envelope, fundamental frequency (F0), intensity trajectory, and phone duration. For spectral envelope conversion, which corresponds to VC, they proposed a method that employs autoencoders whose weights are pre-trained under an L1 norm constraint. This method outperforms a VC method using a DNN with randomly initialized weights. Although it can convert the spectral envelope with high accuracy, a large dataset is required to build the voice converter, and its conversion time would be long because the method employs a 512-dimensional log spectral envelope and a large NN with three hidden layers of 3000 nodes each.
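The L1 constraint in such pre-training can be sketched as adding an L1 penalty on the autoencoder weights to the reconstruction loss, which drives many weights toward zero. The following is a toy linear example in NumPy; the dimensions, learning rate, and penalty strength are illustrative assumptions, not values from Nguyen et al.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy pre-training of one autoencoder layer with an L1 weight penalty
# (sparsity-inducing). All hyperparameters here are illustrative.
X = rng.normal(size=(64, 8))            # stand-in for spectral frames
W_enc = rng.normal(0, 0.1, (8, 4))      # encoder weights
W_dec = rng.normal(0, 0.1, (4, 8))      # decoder weights
lam, lr, n = 1e-3, 0.05, X.shape[0]

def loss(We, Wd):
    R = (X @ We) @ Wd                    # linear reconstruction
    return (0.5 * np.sum((R - X) ** 2) / n
            + lam * (np.abs(We).sum() + np.abs(Wd).sum()))

before = loss(W_enc, W_dec)
for _ in range(200):
    H = X @ W_enc
    err = (H @ W_dec - X) / n
    # Squared-error gradients plus the L1 subgradient lam * sign(W).
    g_dec = H.T @ err + lam * np.sign(W_dec)
    g_enc = X.T @ (err @ W_dec.T) + lam * np.sign(W_enc)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
after = loss(W_enc, W_dec)
print(after < before)  # the penalized reconstruction loss decreases
```

In a full system, pre-trained weights such as these would initialize the DNN, replacing the random initialization that the compared baseline uses.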
Mohammadi et al. (Mohammadi and Kain, 2014) proposed a VC method using deep autoencoders. In this method, input features are compressed by deep autoencoders of the source speaker and the target speaker, and higher-order features are obtained. An artificial neural network (ANN) that converts the higher-order features of the source speaker into those of the
Fast Many-to-One Voice Conversion using Autoencoders