converter. In general, creating a voice converter requires parallel data, that is, utterances with the same speech content spoken by both a source speaker and a target speaker, and preparing a large amount of such data is expensive.
When the cost of collecting data is high, developing applications becomes difficult, so this problem must be solved in order to develop various applications using VC techniques. Although a VC method that converts the spectral envelope with a DNN has been proposed (Xie et al., 2014), the large dimensionality of the log spectral envelope used as the DNN input makes the network structure complicated; consequently, a large amount of training data is required and the conversion time becomes long.
The aforementioned VC methods are one-to-one VC, in which a particular source speaker's voice is converted into a particular target speaker's. Many-to-one VC methods, which convert an arbitrary source speaker's voice into a particular target speaker's, and many-to-many VC methods, which convert an arbitrary source speaker's voice into an arbitrary target speaker's, have also been proposed (Toda et al., 2006; Liu et al., 2015). Here, an arbitrary speaker is a speaker whose data have not been used for training the voice converter. If many-to-one VC becomes possible, the cost of creating a voice converter is reduced because a new converter does not have to be built for every new source speaker. However, creating a many-to-one voice converter requires more speech data than creating a one-to-one one, because a many-to-one voice converter must be trained on speech data from multiple speakers. Thus, to realize many-to-one VC easily, the voice converter should be built from less data.
In this study, we aim to reduce the amount of training data and to shorten the conversion time for one-to-one and many-to-one VC by using autoencoders and a relatively simple DNN.
In the proposed method, autoencoders are first trained on data of the source speaker and the target speaker, respectively, and higher-order features are extracted from each autoencoder. Then a DNN that converts the higher-order features of the source speaker into those of the target speaker is trained. For new voice data from the source speaker, the target speaker's higher-order features are obtained by feeding the higher-order features of that voice data into the DNN. The acoustic features are then restored from the converted higher-order features using the weights of the target speaker's autoencoder. Finally, the converted voice is obtained from the acoustic features by speech synthesis.
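The conversion pipeline described above (encode with the source autoencoder, map with the DNN, decode with the target autoencoder) can be sketched as follows. This is a minimal NumPy illustration with untrained random weights; the single-hidden-layer tied-weight autoencoder, the two-layer mapping network, and all dimensions are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Autoencoder:
    """Single-hidden-layer autoencoder with tied weights (illustrative)."""
    def __init__(self, n_in, n_hidden):
        self.W = rng.normal(0, 0.1, (n_in, n_hidden))
        self.b_h = np.zeros(n_hidden)
        self.b_v = np.zeros(n_in)

    def encode(self, x):
        # Acoustic features -> higher-order features.
        return sigmoid(x @ self.W + self.b_h)

    def decode(self, h):
        # Restore acoustic features from higher-order features
        # using this autoencoder's (tied) weights.
        return h @ self.W.T + self.b_v

class MappingDNN:
    """Small network converting source higher-order features to target ones."""
    def __init__(self, n_src, n_tgt, n_hidden=32):
        self.W1 = rng.normal(0, 0.1, (n_src, n_hidden))
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_tgt))

    def forward(self, h):
        return sigmoid(sigmoid(h @ self.W1) @ self.W2)

def convert(frames, ae_src, ae_tgt, dnn):
    """Encode source frames, map to the target feature space, decode."""
    h_src = ae_src.encode(frames)
    h_tgt = dnn.forward(h_src)
    return ae_tgt.decode(h_tgt)

# Example: 24-dimensional MFCC-like frames, 16-dimensional higher-order features.
n_mfcc, n_hid = 24, 16
ae_src, ae_tgt = Autoencoder(n_mfcc, n_hid), Autoencoder(n_mfcc, n_hid)
dnn = MappingDNN(n_hid, n_hid)
frames = rng.normal(size=(100, n_mfcc))   # stand-in for extracted acoustic features
converted = convert(frames, ae_src, ae_tgt, dnn)
print(converted.shape)  # (100, 24)
```

In practice, each component would be trained as described above (the autoencoders on each speaker's data, the DNN on paired higher-order features) before conversion, and a vocoder would synthesize speech from the converted frames.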
The remainder of the paper is organized as follows. In Section 2, we describe related work on VC methods using RBMs and autoencoders. In Section 3, we explain the overview and key aspects of the proposed method and the benefits of autoencoders. In Section 4, we present evaluations in which the proposed method was compared with conventional methods in one-to-one and many-to-one VC settings. In Section 5, we conclude the paper and discuss future work.
2 RELATED WORK
There are many studies on VC. In this section, we describe VC methods using DNNs, as they have been actively proposed in recent years.
Nakashika et al. (Nakashika et al., 2015) proposed a VC method using speaker-dependent CRBMs. A CRBM of the source speaker and one of the target speaker are trained; then the higher-order features obtained from the source speaker's CRBM are converted into those obtained from the target speaker's CRBM by a neural network (NN). The converted higher-order features are restored to acoustic features by the inverse projection of the target speaker's CRBM, from which the speech signal is obtained. In the evaluation experiments, this method outperformed conventional VC methods using GMMs, RBMs, and recurrent neural networks (RNNs). The voice converter can be created without a large dataset, and VC can be realized in a shorter time owing to the use of 24-dimensional MFCCs as acoustic features.
Nguyen et al. (Nguyen et al., 2016) proposed a speaker conversion method that comprehensively converts the spectral envelope, fundamental frequency (F0), intensity trajectory, and phone duration. For spectral envelope conversion, which corresponds to VC, they proposed a method that employs autoencoders whose weights are pre-trained under an L1 norm constraint. This method outperforms a VC method using a DNN with randomly initialized weights. Although it can convert the spectral envelope with high accuracy, a large dataset is required to build the voice converter, and its conversion time would be long because the method employs a 512-dimensional log spectral envelope and a large NN with three hidden layers of 3000 nodes each.
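The L1 constraint in such pre-training can be sketched as adding an L1 penalty on the autoencoder weights to the reconstruction loss, which drives many weights toward zero. The following is a toy linear example in NumPy; the dimensions, learning rate, and penalty strength are illustrative assumptions, not values from Nguyen et al.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy pre-training of one autoencoder layer with an L1 weight penalty
# (sparsity-inducing). All hyperparameters here are illustrative.
X = rng.normal(size=(64, 8))            # stand-in for spectral frames
W_enc = rng.normal(0, 0.1, (8, 4))      # encoder weights
W_dec = rng.normal(0, 0.1, (4, 8))      # decoder weights
lam, lr, n = 1e-3, 0.05, X.shape[0]

def loss(We, Wd):
    R = (X @ We) @ Wd                    # linear reconstruction
    return (0.5 * np.sum((R - X) ** 2) / n
            + lam * (np.abs(We).sum() + np.abs(Wd).sum()))

before = loss(W_enc, W_dec)
for _ in range(200):
    H = X @ W_enc
    err = (H @ W_dec - X) / n
    # Squared-error gradients plus the L1 subgradient lam * sign(W).
    g_dec = H.T @ err + lam * np.sign(W_dec)
    g_enc = X.T @ (err @ W_dec.T) + lam * np.sign(W_enc)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
after = loss(W_enc, W_dec)
print(after < before)  # the penalized reconstruction loss decreases
```

In a full system, pre-trained weights such as these would initialize the DNN, replacing the random initialization that the compared baseline uses.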
Mohammadi et al. (Mohammadi and Kain, 2014) proposed a VC method using deep autoencoders. In this method, input features are compressed by deep autoencoders of the source speaker and the target speaker, and higher-order features are obtained. An artificial neural network (ANN) that converts the higher-order features of the source speaker into those of the
Fast Many-to-One Voice Conversion using Autoencoders