Non-Parallel Training Approach for Emotional Voice Conversion Using CycleGAN

Mohamed Elsayed¹, Sama Hadhoud¹, Alaa Elsetohy¹, Menna Osman¹ and Walid Gomaa¹,²

¹Department of Computer Science and Engineering, Egypt-Japan University of Science and Technology, Alexandria, Egypt
²Faculty of Engineering, Alexandria University, Alexandria, Egypt
Keywords: Emotional Voice Conversion, CycleGAN, WORLD Vocoder, Non-Parallel.
Abstract: This research proposes a nonparallel emotional voice conversion approach for Egyptian Arabic speech. The method aims to change the emotion-related features of a speech signal without changing its lexical content or speaker identity. We rely on the assumption that any speech signal can be decomposed into a content code and a style code, so that conversion between different emotion domains is performed by combining the target style code with the content code of the input speech signal. We evaluated the model on an Egyptian Arabic dataset covering two emotion domains, and the conversion results were judged successful based on a survey of randomly selected listeners. Our purpose is to produce a state-of-the-art pre-trained model, which, to the best of our knowledge, is the first of its kind for the Egyptian Arabic language.
1 INTRODUCTION
Voice conversion (VC) focuses on extracting the acoustic features of the source voice and mapping them to those of the target voice; the waveforms are then synthesized from the generated acoustic features. Pre-processing is performed before training the mapping function. When working with parallel data, such pre-processing includes dynamic time warping (DTW), which time-aligns the source and target voice features under study. The emotional voice conversion approach is similar to that of VC, with prosody playing an important role in expressing emotions in speech (Choi and Hahn, 2021). Emotional voice conversion focuses on converting the emotion-related features from the source emotion domain to the target domain while preserving the linguistic content and speaker identity. Such emotion-related features include prosodic and spectrum-related features.
Emotional voice conversion has various applica-
tions in conversational agents, intelligent dialogue sys-
tems, and other expressive speech synthesis applica-
tions (Luo et al., 2017). Additionally, it is promising
for applications in human-machine interaction, such
as enabling robots to respond to people with emotional
intelligence (Olaronke and Ikono, 2017). Speech not
only conveys information but also shows one’s emo-
tional state.
Given the significance of emotions in communication, we focused on emotional voice conversion in this work. Prosodic features like pitch, intensity,
and speaking rate can be used to help identify emo-
tions (Scherer et al., 1991). Emotional voice con-
version aims to change emotion-related features of
a speech signal without changing its lexical content or
speaker identity (Choi and Hahn, 2021).
It is proposed in (Huang and Akagi, 2008) that the
perception of emotion is multi-layered. Thus, from
top to bottom, the layers are represented by emotion
categories, semantic primitives, and acoustic features.
It is further suggested that emotion production and
perception are inverse processes. An inverse three-
layered model for speech emotion production was
proposed in (Xue et al., 2018). Studies on speech emotion generally utilize prosodic features concerning voice quality, speech rate, fundamental frequency ($F_0$), spectral features, duration, the $F_0$ contour, and the energy envelope (Schröder, 2006).
Early studies on emotional voice conversion mostly relied on parallel training data, i.e., pairs of utterances that contain the same content but express different emotions from the same speaker. Through the paired feature vectors, the conversion model learns a mapping from the source emotion A to the target emotion B during training. The authors in (Tao et al., 2006) mainly addressed prosody conversion by decomposing the pitch contour of the source speech using classification and regression trees, then utilizing Gaussian Mixture Model (GMM) and regression-based clustering techniques.
Recent works proposed deep learning approaches
achieving remarkable performance. In (Luo et al., 2016), an emotional voice conversion model is proposed. The model is divided into two parts: in the first, Deep Belief Networks (DBNs) are used to modify the spectral features, while in the second, Neural Networks (NNs) are used to modify the fundamental frequency ($F_0$). A STRAIGHT-based approach was used to extract features from the source and target speech signals feeding the spectral conversion part and the $F_0$ conversion part. Compared to traditional approaches (NNs and GMMs), this model successfully adjusts both the acoustic features and the prosody of the emotional voice simultaneously. However, this work and other recent emotional voice conversion techniques require temporally aligned parallel samples, which are very difficult to obtain in practical applications. Additionally, accurate time alignment requires manual segmentation of the speech signal, which is also time-consuming.
Beyond parallel training data, new methods for learning the translation across emotion domains utilizing CycleGAN and StarGAN have been developed (Gao et al., 2019). In this paper*, we used the CycleGAN architecture, which switches between two emotion domains rather than learning a one-to-one mapping between pairs of emotional utterances. The training approach for the model is speaker-dependent. It depends on extracting the emotion-related speech features using the WORLD vocoder; such features include the fundamental frequency ($F_0$) and the spectral envelope. Gaussian normalization is applied to the $F_0$, whereas the spectral envelope is fed to the auto-encoder model. Disentanglement is applied to separate the content code from the style code, which is very beneficial for preserving the lexical content and speaker identity during the learning process. A survey was conducted in which randomly selected participants listened to original and converted samples. The survey included two emotions, neutral and angry; the accuracy of conversion to the neutral domain is about 63.42%, whereas it is 56.19% for conversion to the angry domain.
The paper is organized as follows. Section 1 is an introduction that gives a general overview of the subject under research. Section 2 covers related work. In Section 3, we introduce the structure of our model, the loss function used in training, and the experimental setup, followed by a description of the training dataset. In Section 4, we discuss the results obtained from the training, using dashboard graphs to demonstrate these outputs. Section 5 presents the survey outputs and an analysis of the results. Section 6 summarizes our work and gives general ideas about future work.

* The conversion system code is uploaded to the following repository: https://github.com/MohamedElsayed-22/non-parallel-training-for-emotion-conversion-of-arabic-speech-using-cycleGAN-and-WORLD-Vocoder
2 RELATED WORK
Earlier conversion models changed prosody-related
emotional features directly. According to (Xue et al.,
2018), the acoustic features of spectral sequence and
$F_0$ have the highest effect in converting emotions,
with duration and the power envelope having the least
impact. Traditional approaches to conversion include
modeling spectral mapping using statistical techniques
such as partial least squares regression (Helander et al.,
2010), sparse representation (Sisman et al., 2019), and
Gaussian Mixture models (GMMs) (Toda et al., 2007).
2.1 Emotion in Speech
Emotion is present in both the spoken linguistic content and the acoustic features (Zhou et al., 2021). Acoustic features play an important role in human interactions. Emotions can be characterized by either a categorical or a dimensional representation.

The categorical approach is simpler and more straightforward: emotions are labeled, and the model architecture is built based on those labels. Ekman's six basic emotions theory (Ekman, 1992) is one of the most famous categorical approaches, where emotions are grouped into six categories: anger, disgust, fear, happiness, sadness, and surprise. The downside of the categorical approach is that it ignores the minor differences in human emotions. In contrast, the dimensional approach models the physical properties of emotion-related features; Russell's circumplex model, for example, represents emotion in terms of arousal, valence, and dominance. In this paper, we use the categorical way of representing emotions. We assume that the speaker's production (encoding) of emotion and the listener's perception (decoding) of that emotion are inverse processes.
In (Gao et al., 2019), an approach was introduced to learn a mapping between the distributions of two utterances from distinct emotion categories, $x_1 \in X_1$ and $x_2 \in X_2$. However, the joint distribution $p(x_1, x_2)$ cannot be directly determined for nonparallel data. Therefore, they assume that the speech signal can be broken down into an emotion-invariant content component and an emotion-dependent style component, and that the encoder $E$ and the decoder $D$ are inverse functions. These assumptions allow estimating the conversion models $p(x_1 \mid x_2)$ and $p(x_2 \mid x_1)$.
2.2 Disentanglement
Based on the results obtained from disentangled representation learning in image style transfer (Gatys et al., 2016; Huang et al., 2018), a similar approach can be utilized for speech. Each speech signal taken from either domain $X_1$ or $X_2$ can be divided into a content code (C) and a style code (S), as proposed in (Gao et al., 2019). The content code carries emotion-independent information, while the style code represents emotion-dependent information.
The content code is shared across domains and
contains the data that we wish to retain. The style code
is domain-specific and includes the data we wish to
alter. As shown in fig. 1, we take the content code of
the source speech and merge it with the style code of
the target emotion during the conversion stage.
Figure 1: Nonparallel training inspired by disentanglement
(Gao et al., 2019).
2.3 WORLD Vocoder
WORLD (Morise et al., 2016), shown in fig. 2, is a vocoder-based, high-quality, real-time speech synthesis system. It consists of three analysis algorithms that estimate three speech parameters, plus a synthesis algorithm that takes these parameters as inputs. The parameters obtained are the fundamental frequency contour ($F_0$), the spectral envelope, and the excitation signal. The $F_0$ estimation algorithm is DIO (Morise et al., 2009), while the spectral envelope is determined using cepstrum and linear predictive coding (LPC) algorithms (Atal and Hanauer, 1971). These two parameters, as mentioned before, are emotion-related speech features: they are taken as input to the auto-encoder and the Gaussian normalization for the conversion process. Afterwards, the outputs are passed to the synthesis part of WORLD to obtain the resulting converted voice.
Figure 2: WORLD Vocoder Working Mechanism (Morise
et al., 2016).
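To make the analysis/synthesis flow concrete, the following is a minimal sketch assuming the pyworld Python bindings of WORLD and the soundfile package for audio I/O; the paper does not specify which WORLD implementation or wrapper it used.

```python
# Minimal sketch of WORLD analysis/resynthesis, assuming the pyworld bindings
# (DIO for F0, CheapTrick for the spectral envelope, D4C for aperiodicity)
# and soundfile for I/O. Not the paper's actual toolchain.
import numpy as np
import pyworld as pw
import soundfile as sf

def world_analyze(wav_path):
    """Extract F0, spectral envelope, and aperiodicity with WORLD."""
    x, fs = sf.read(wav_path)            # mono waveform
    x = x.astype(np.float64)             # pyworld expects float64
    f0, t = pw.dio(x, fs)                # DIO F0 estimation
    f0 = pw.stonemask(x, f0, t, fs)      # F0 refinement
    sp = pw.cheaptrick(x, f0, t, fs)     # spectral envelope
    ap = pw.d4c(x, f0, t, fs)            # aperiodicity (excitation-related)
    return f0, sp, ap, fs

def world_synthesize(f0, sp, ap, fs):
    """Resynthesize a waveform from (possibly converted) WORLD parameters."""
    return pw.synthesize(f0, sp, ap, fs)
```

After $F_0$ and the spectral envelope have been converted, `world_synthesize` recombines them with the untouched aperiodicity to produce the output waveform, mirroring the flow in fig. 2.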
2.4 CycleGAN
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have produced remarkable results in various fields like image processing, computer vision, and sequential data (Gui et al., 2021). The main goal is to generate new data based on the training data, which is done by concurrently training two models: a generative model, the generator $G$, which captures the data distribution, and a discriminative model, the discriminator $D$, which decides whether a given sample is real or fake. The learning behavior of $G$ is designed to maximize the probability of $D$ making a mistake. This typically produces a two-player minimax game. In this paper, we are concerned with a particular GAN-based network architecture called CycleGAN (Zhu et al., 2017). CycleGAN learns a mapping $G_{X \to Y}$ from a source domain $X$ to a target domain $Y$ without the need for parallel data, combined with an inverse mapping $G_{Y \to X}$.
CycleGAN is based on two losses. The first one is the adversarial loss, which is defined as follows:

$$\mathcal{L}_{ADV}(G_{X \to Y}, D_Y, X, Y) = \mathbb{E}_{y \sim P(y)}\left[\log D_Y(y)\right] + \mathbb{E}_{x \sim P(x)}\left[\log\left(1 - D_Y(G_{X \to Y}(x))\right)\right] \quad (1)$$
This loss measures how closely the distribution of generated images $G_{X \to Y}(x)$ matches the distribution of the target domain $Y$: the closer the distribution of the generated images $P(x)$ gets to that of the target domain $P(y)$, the smaller the loss becomes. The generator $G_{X \to Y}$ generates images while trying to maximize the error of the discriminator $D_Y$, as mentioned before, whereas $D_Y$ works in the opposite direction. The second loss is the cycle-consistency loss, defined using the $L_1$ norm as follows:
$$\mathcal{L}_{CYC}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{y \sim P(y)}\left[\left\| G_{X \to Y}(G_{Y \to X}(y)) - y \right\|_1\right] + \mathbb{E}_{x \sim P(x)}\left[\left\| G_{Y \to X}(G_{X \to Y}(x)) - x \right\|_1\right] \quad (2)$$
The cycle-consistency loss focuses on preventing the learned mappings $G_{X \to Y}$ and $G_{Y \to X}$ from contradicting each other. As explained in (Kaneko and Kameoka, 2018), the adversarial loss alone does not guarantee that the core information of $x$ and $G_{X \to Y}(x)$ is preserved (here $X, Y$ denote the domains, while $x, y$ denote samples, as in Section 2.1: $x_1 \in X_1$ and $x_2 \in X_2$). This is because the main target of the adversarial loss is only to ensure that $G_{X \to Y}(x)$ follows the distribution of the target domain. The cycle-consistency loss, in contrast, pushes $G_{X \to Y}$ and $G_{Y \to X}$ to find a pair $(x, y)$ carrying the same core information.
An identity mapping loss is also introduced in (Kaneko and Kameoka, 2018):

$$\mathcal{L}_{ID}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{y \sim P(y)}\left[\left\| G_{X \to Y}(y) - y \right\|_1\right] + \mathbb{E}_{x \sim P(x)}\left[\left\| G_{Y \to X}(x) - x \right\|_1\right] \quad (3)$$
To achieve identity transfer, the features we are concerned with must be transferred without modification from the source domain to the target domain. $G_{X \to Y}$ and its inverse generator $G_{Y \to X}$ are guided to find a mapping that achieves this target.
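As a hedged illustration of equations (1)-(3), the PyTorch sketch below computes the three losses for generic generator and discriminator modules; `G_xy`, `G_yx`, and `D_y` are placeholder names (with `D_y` assumed to output a probability in (0, 1)), not the paper's actual implementation.

```python
# Illustrative PyTorch sketch of the CycleGAN losses in Eqs. (1)-(3).
import torch
import torch.nn.functional as F

def adversarial_loss(D_y, G_xy, x, y, eps=1e-8):
    # Eq. (1): the value D_y tries to maximize and G_xy tries to minimize.
    fake_y = G_xy(x)
    return (torch.log(D_y(y) + eps).mean()
            + torch.log(1.0 - D_y(fake_y) + eps).mean())

def cycle_consistency_loss(G_xy, G_yx, x, y):
    # Eq. (2): mapping a sample to the other domain and back should
    # reproduce the original sample (L1 norm).
    return (F.l1_loss(G_xy(G_yx(y)), y)
            + F.l1_loss(G_yx(G_xy(x)), x))

def identity_loss(G_xy, G_yx, x, y):
    # Eq. (3): feeding a generator a sample already in its target domain
    # should leave it unchanged.
    return (F.l1_loss(G_xy(y), y)
            + F.l1_loss(G_yx(x), x))
```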
3 EXPERIMENTAL WORK
3.1 Methodology
In this paper, we propose a nonparallel training approach for emotional voice conversion in Egyptian Arabic, since creating parallel datasets for the traditional parallel approach is inefficient and often infeasible. We therefore concentrate on learning the conversion of the spectral sequence and the fundamental frequency $F_0$ using CycleGAN and Gaussian normalization models. Moreover, the CycleGAN architecture was chosen because of its breakthrough in image style transfer, a problem similar in scope to ours, as discussed in Section 2.4. The concept of disentanglement (Section 2.2) is used to separate the emotion-related features, the style code (S), from the speech content (C); training is thus performed on these features separately without affecting the content (C). We used the WORLD vocoder (Section 2.3), which extracts the emotion-related features that are fed to the training architecture described in detail in Section 3.3.
3.2 Non-Parallel Training
In a nonparallel training method, we train a conversion model between two partially shared emotion domains ($X_1$, $X_2$). This is unlike the parallel training approach, which depends on a mapping between two utterances that are identical in all features except the aspects under study. In this way, nonparallel training is more practical and less costly, making it feasible for industrial applications. Parallel training requires time alignment of the samples. This can be done manually, which is difficult for large datasets, or using dynamic time warping (DTW) (Berndt and Clifford, 1994), which relies on pattern detection in time series to match word templates against the discretized waveforms of the continuous voice samples.

The approximate word matching that DTW relies on can be a drawback, as it might cover a wide range of pronunciations and map them to the same word, even though each pronunciation may carry a different emotion. This can lead to inaccuracies in emotional conversion.
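For readers unfamiliar with DTW, the sketch below shows the classic dynamic-programming alignment cost it computes between two 1-D feature sequences; this is only an illustration of the parallel-data pre-processing step that our nonparallel approach avoids, not part of the proposed system.

```python
# Classic dynamic-programming DTW cost between two 1-D feature sequences.
import numpy as np

def dtw_distance(a, b):
    """Minimal DTW alignment cost between sequences a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local distance
            cost[i, j] = d + min(cost[i - 1, j],      # stretch a
                                 cost[i, j - 1],      # stretch b
                                 cost[i - 1, j - 1])  # step both
    return cost[n, m]
```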
3.3 Model
The conversion system, shown in fig. 3, takes the emotion-related features, specifically the fundamental frequency $F_0$ and the spectral sequence, of both the source and target domains, which are extracted by the WORLD vocoder. The $F_0$ of the source emotion domain is then transformed by a linear transformation using log Gaussian normalization to match the $F_0$ of the target emotion domain.

Aperiodicity, which is analyzed from the input sample, is mapped directly since it is not one of the features under study. A low-dimensional representation of the spectral sequence in the mel-cepstrum domain is fed to the auto-encoder. Gated CNNs are used to implement the encoders and decoders, and a GAN module is used to produce realistic spectral frames. In this way, each feature is converted separately without depending on the other features under study. Lastly, the converted emotion-related features are passed to the WORLD vocoder to recombine them with the content code of the speech signal.

Figure 3: Voice emotion conversion mechanism.
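A minimal sketch of the log Gaussian normalization step described above, assuming per-emotion log-$F_0$ mean and standard deviation statistics computed offline from the training utterances (the function and variable names are illustrative):

```python
# Log-Gaussian F0 normalization: map source-domain F0 statistics onto the
# target emotion domain. Statistics are assumed to be computed in the log
# domain over voiced frames of each emotion's training set.
import numpy as np

def convert_f0(f0, mean_src, std_src, mean_tgt, std_tgt):
    """Linear transform of log-F0 from the source to the target emotion."""
    f0_out = np.zeros_like(f0)
    voiced = f0 > 0                      # unvoiced frames stay zero
    logf0 = np.log(f0[voiced])
    f0_out[voiced] = np.exp((logf0 - mean_src) / std_src * std_tgt + mean_tgt)
    return f0_out
```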
The introduced neural network architecture consists of an autoencoder (a content encoder, a style encoder, and a decoder) and a GAN discriminator. These components work sequentially to generate the output, as shown in fig. 4.
For an emotional speech signal $x_i$ that is mapped between two partially shared emotion domains $X_1, X_2$, instance normalization (Ulyanov et al., 2016) was used to remove the emotional style mean and variance. The content encoder $E^c_i$ is responsible for extracting the content code $c_i$ of a signal sample $x_i$:

$$E^c_i(x_i) = c_i \quad (4)$$
A 3-layer MLP (Huang and Akagi, 2008) was used to encode the emotional characteristics. The style encoder $E^s_i$ is responsible for extracting the style code $s_i$ of a signal sample $x_i$:

$$E^s_i(x_i) = s_i \quad (5)$$
The decoder $G_i(c_i, s_j)$ is responsible for recombining the content code from one emotion with the style code of another; for example, it uses $c_1$ and $s_2$ to get $x_{2 \to 1}$:

$$G_i(c_i, s_j) = x_{j \to i} \quad (6)$$
Note that the style code is learned from the entire
emotion domain. This was accomplished by adding
an adaptive instance normalization layer (Huang and
Belongie, 2017).
Finally, the GAN discriminator is responsible for distinguishing real samples from machine-synthesized samples.
Figure 4: Network Structure.
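The sketch below gives a schematic PyTorch rendering of these components (content encoder with instance normalization, 3-layer MLP style encoder, decoder with adaptive-instance-normalization-style modulation, and a discriminator); the layer sizes and the use of plain rather than gated convolutions are illustrative assumptions, not the paper's exact configuration.

```python
# Schematic sketch of the network components in fig. 4 (sizes are placeholders).
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_mcep=36, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mcep, ch, 5, padding=2),
            nn.InstanceNorm1d(ch),               # strips style statistics
            nn.ReLU(),
            nn.Conv1d(ch, ch, 5, padding=2),
            nn.InstanceNorm1d(ch),
            nn.ReLU(),
        )

    def forward(self, x):                        # x: (B, n_mcep, T)
        return self.net(x)                       # content code: (B, ch, T)

class StyleEncoder(nn.Module):
    """3-layer MLP over time-averaged features -> style code."""
    def __init__(self, n_mcep=36, style_dim=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_mcep, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, style_dim),
        )

    def forward(self, x):                        # x: (B, n_mcep, T)
        return self.mlp(x.mean(dim=2))           # style code: (B, style_dim)

class Decoder(nn.Module):
    """Recombines a content code with a style code via AdaIN-style modulation."""
    def __init__(self, n_mcep=36, ch=64, style_dim=8):
        super().__init__()
        self.to_scale_shift = nn.Linear(style_dim, 2 * ch)
        self.out = nn.Conv1d(ch, n_mcep, 5, padding=2)

    def forward(self, content, style):
        gamma, beta = self.to_scale_shift(style).chunk(2, dim=1)
        h = content * (1 + gamma.unsqueeze(2)) + beta.unsqueeze(2)
        return self.out(h)                       # converted mcep frames

class Discriminator(nn.Module):
    """Scores whether a spectral sequence is real or synthesized."""
    def __init__(self, n_mcep=36, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mcep, ch, 5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(ch, 1, 5, padding=2),
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x)).mean(dim=(1, 2))  # (B,) real prob.
```

During conversion, the decoder is fed the content code of the source utterance and the style code of the target emotion, e.g. `dec(content_enc(x_src), style_enc(x_tgt))`, mirroring equation (6).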
3.4 Loss Function
The auto-encoder (fig. 1) consists of an encoder and a decoder; to keep them inverse operations of each other (Gao et al., 2019), a reconstruction loss is applied in the direction $x_i \to (c_i, s_i) \to \hat{x}_i$. The reconstruction loss, given in (7), where $\mathbb{E}$ denotes the expected value, quantifies the model's ability to regenerate the original sample $x_i$ as the synthesized sample $\hat{x}_i$ in the same domain.
$$\mathcal{L}^{x_i}_{rec} = \mathbb{E}_{x_i}\left[\left\| \hat{x}_i - x_i \right\|_1\right] \quad (7)$$
The spectral sequence ought to remain unchanged after encoding and decoding:

$$\hat{x}_i = G_i\left(E^c_i(x_i), E^s_i(x_i)\right) \quad (8)$$
The latent space is partially shared between the two emotions; specifically, the content code is the shared space, so a semi-cycle consistency loss is preferred, applied in the direction of encoding. For the content code, the content of an arbitrary sample $x_1 \in X_1$, represented as $c_1$, is coded to that of the equivalent sample $x_{2 \to 1}$ in the target domain $X_2$ as follows: $c_1 \to x_{2 \to 1} \to c_{2 \to 1}$. The coding direction cannot be represented as a full cycle $x_{1 \to 2 \to 1} \approx x_1$ because we take the semi-cycle consistency approach: the latent space of the content code is shared, and such a coding direction would lead to changes in the content of the speech. For the style code, the following coding direction is implemented as well: $s_2 \to x_{2 \to 1} \to s_{2 \to 1}$. Consequently, we can construct the loss functions for the content and style codes separately, in equations 9 and 10 respectively, similar to the reconstruction loss in (7).
$$\mathcal{L}^{c_1}_{cycle} = \mathbb{E}_{c_1, s_2}\left[\left\| c_1 - c_{2 \to 1} \right\|_1\right], \qquad c_{2 \to 1} = E^c_2(x_{2 \to 1}) \quad (9)$$
$$\mathcal{L}^{s_2}_{cycle} = \mathbb{E}_{c_1, s_2}\left[\left\| s_2 - s_{2 \to 1} \right\|_1\right], \qquad s_{2 \to 1} = E^s_2(x_{2 \to 1}) \quad (10)$$
A GAN module is used to keep the converted samples indistinguishable from those in the target emotion, thereby improving the quality of the synthesized samples. The GAN loss is computed on $x_{i \to j}$, where $i \neq j$:
$$\mathcal{L}^{i}_{GAN} = \mathbb{E}_{c_j, s_i}\left[\log\left(1 - D_i(x_{i \to j})\right)\right] + \mathbb{E}_{x_i}\left[\log D_i(x_i)\right] \quad (11)$$
Thus, the loss function used for training is the weighted sum of the three loss functions $\mathcal{L}_{rec}$, $\mathcal{L}_{cycle}$, and $\mathcal{L}_{GAN}$.
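A rough sketch of how these terms could be combined for one conversion direction is given below, reusing the hypothetical module names from the earlier architecture sketch; the loss weights are placeholders, since the paper does not report its weighting.

```python
# Weighted training objective for one direction (Eqs. (7), (9)-(11)), using
# the illustrative modules defined earlier: Ec*/Es* encoders, G1 decoder,
# and D2 discriminator of the emotion domain whose style is applied.
import torch
import torch.nn.functional as F

def training_losses(Ec1, Es1, Ec2, Es2, G1, D2, x1, x2,
                    w_rec=10.0, w_cyc=10.0, w_gan=1.0, eps=1e-8):
    # Reconstruction within domain 1, Eq. (7)
    c1, s1 = Ec1(x1), Es1(x1)
    x1_hat = G1(c1, s1)
    loss_rec = F.l1_loss(x1_hat, x1)

    # Conversion: content of x1 combined with the style of emotion 2
    s2 = Es2(x2)
    x_21 = G1(c1, s2)

    # Semi-cycle consistency on the content and style codes, Eqs. (9)-(10)
    loss_cyc = F.l1_loss(Ec2(x_21), c1) + F.l1_loss(Es2(x_21), s2)

    # GAN loss: the converted sample should fool D2, Eq. (11)
    loss_gan = (torch.log(1.0 - D2(x_21) + eps).mean()
                + torch.log(D2(x2) + eps).mean())

    return w_rec * loss_rec + w_cyc * loss_cyc + w_gan * loss_gan
```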
3.5 Experimental Setup
Although there are several speech emotion databases for different European and Asian languages, there are very few Arabic speech emotion databases in the literature. Moreover, those datasets were recorded in Arabic spoken by Syrian, Saudi, and Yemeni native speakers, so there is a deficiency in Egyptian Arabic datasets and research. The Egyptian Arabic dataset that we used consists of a total of 0.9 hours of audio recordings from 1 native Egyptian speaker covering 2 different emotion categories (neutral and angry). It was recorded using the Audacity sound engineering software in a silent room with a sampling rate of 48 kHz. The dataset contains 1000 utterances for each emotion. The training and testing sets were randomly selected (80% for training, 20% for testing). The training set was sampled with a fixed length of 128 frames of 5 ms each.
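For concreteness, the following is a small sketch of how the random split and fixed-length 128-frame chunks could be drawn from the extracted mel-cepstral sequences; the helper names are illustrative, not the paper's preprocessing code.

```python
# Illustrative data preparation: random 80/20 split and fixed-length chunks
# of 128 frames (5 ms frame period) per training sample.
import numpy as np

rng = np.random.default_rng(0)

def train_test_split(utterances, train_ratio=0.8):
    """Randomly split a list of per-utterance feature arrays."""
    idx = rng.permutation(len(utterances))
    n_train = int(train_ratio * len(utterances))
    return ([utterances[i] for i in idx[:n_train]],
            [utterances[i] for i in idx[n_train:]])

def random_chunk(mcep, n_frames=128):
    """Crop a fixed-length window (n_frames x 5 ms) from one utterance."""
    if mcep.shape[0] <= n_frames:            # pad short utterances
        pad = n_frames - mcep.shape[0]
        return np.pad(mcep, ((0, pad), (0, 0)))
    start = rng.integers(0, mcep.shape[0] - n_frames)
    return mcep[start:start + n_frames]
```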
4 RESULTS
The focus here is mainly on the results of the loss functions discussed earlier in Section 3.4. In fig. 5, the discriminator loss curve shows significant oscillations. This is because of the small batch size chosen, which is 1. We tested various values, including 8 and 16; however, a batch size of 1 was the most suitable in our experiments to keep a trade-off with the generator loss. The generator loss keeps converging to a fair extent, indicating that the generation of utterances in the target domain is improving.

Figure 5: Generator Loss and Discriminator Loss.

The semi-cycle loss convergence shown in fig. 6 reflects the naturalness of the reconstructed utterances. It leads the utterance generation in both emotion domains to show low generation losses, which are clearly visible in fig. 7. The discriminator losses in both emotion domain directions, shown in fig. 8, provide further evidence of the naturalness of the converted samples, since they represent the model's ability to distinguish between real and synthesized samples.
Figure 6: Semi-cycle Loss.
Figure 7: Generator Loss from Angry to Neutral and Gener-
ator Loss from Neutral to Angry.
Figure 8: Discriminator Loss of the Angry Domain and
Discriminator Loss of the Neutral Domain.
The speaker's identity remained almost steady, with little change, as shown in fig. 9. Moreover, over the training steps, the identity loss decreased significantly at the beginning and then stayed at a relatively low value. Since the model training approach is speaker-dependent, each pre-trained model is for one speaker, and the training dataset for each pre-trained model comes from one speaker. Thus, the speaker's identity is preserved.

Generally, the generation loss decreased significantly at the beginning, rose slightly, and nevertheless ended at a value sufficient to generate the target emotion. The discriminator loss oscillates; however, the cycle loss stays at low values and utterances are generated efficiently in both domain directions.
Figure 9: Speaker Identity Change.
5 DISCUSSION
The assessment of the results is based on a survey in which participants rated the clarity of the voice, the emotion, and the speaker identity out of 10. About 105 participants took part in the assessment, and each one listened to 5 different utterances for both the angry and neutral emotions. Based on the survey results, the mean accuracies of conversion to neutral and to angry are 63.42% and 56.19%, respectively. The assessment results are shown in fig. 10. The results might not be up to our expectations given the model architecture; this is due to the fact that the angry emotion in the dataset is closer to shouting than to expressing anger. Nevertheless, our model performed better than StarGAN-VC (Kameoka et al., 2018) in terms of conversion ability (average 59.8% vs 44%), and it also outperformed the recently proposed EVC-USEP model (Shah et al., 2023) (average 59.8% vs 41.5%). Further modifications to the dataset might produce better results, in addition to taking into consideration the other emotion-related features that affect the delivery of emotions. Those features might not be significant enough to change the output drastically; however, they would still introduce improvements to the model results.
Figure 10: Survey Results.
6 CONCLUSION
This research proposes a nonparallel speaker-
dependent emotional voice conversion approach for
Egyptian Arabic speech using CycleGAN. The pro-
posed method successfully changes emotion-related
features of a speech signal without altering the lexi-
cal content or speaker identity. However, the results
might not be up to expectations due to the nature of
the dataset. Further modifications to the dataset and
considering other emotion-related features are likely
to introduce improvements to the model results. Future work includes using the continuous wavelet transform (CWT) to decompose $F_0$ into 10 different scales so that abrupt changes can be captured, and modifying the model to be speaker-independent. Overall, this study provides a significant contribution to the development of emotional voice conversion for Egyptian Arabic speech and can pave the way for further research in this area.
REFERENCES
Atal, B. S. and Hanauer, S. L. (1971). Speech analy-
sis and synthesis by linear prediction of the speech
wave. The Journal of the Acoustical Society of Amer-
ica, 50(2B):637–655.
Berndt, D. J. and Clifford, J. (1994). Using dynamic time
warping to find patterns in time series. In Proceedings
of the 3rd International Conference on Knowledge Dis-
covery and Data Mining, pages 359–370. AAAI Press.
Choi, H. and Hahn, M. (2021). Sequence-to-sequence emo-
tional voice conversion with strength control. IEEE
Access, 9:42674–42687.
Ekman, P. (1992). An argument for basic emotions. Cogni-
tion and Emotion, 6(3-4):169–200.
Gao, J., Chakraborty, D., Tembine, H., and Olaleye, O.
(2019). Nonparallel emotional speech conversion. In
Interspeech 2019. ISCA.
Gatys, L. A., Ecker, A. S., and Bethge, M. (2016). Image
style transfer using convolutional neural networks. In
CVPR, pages 2414–2423. IEEE.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27.
Gui, J., Sun, Z., Wen, Y., Tao, D., and Ye, J. (2021). A review
on generative adversarial networks: Algorithms, theory,
and applications. IEEE Transactions on Knowledge
and Data Engineering.
Helander, E., Virtanen, T., Nurminen, J., and Gabbouj, M.
(2010). Voice conversion using partial least squares
regression. IEEE Transactions on Audio, Speech, and
Language Processing.
Huang, C.-F. and Akagi, M. (2008). A three-layered model
for expressive speech perception. Speech Communica-
tion, 50(10):810–828.
Huang, X. and Belongie, S. (2017). Arbitrary style transfer
in real-time with adaptive instance normalization. In
Proceedings of the IEEE International Conference on
Computer Vision, pages 1501–1510. IEEE.
Huang, X., Liu, M.-Y., Belongie, S., and Kautz, J. (2018).
Multimodal unsupervised image-to-image translation.
In The European Conference on Computer Vision
(ECCV).
Kameoka, H., Kaneko, T., Tanaka, K., and Hojo, N. (2018).
Stargan-vc: non-parallel many-to-many voice conver-
sion using star generative adversarial networks. In
2018 IEEE Spoken Language Technology Workshop
(SLT), pages 266–273.
Kaneko, T. and Kameoka, H. (2018). Cyclegan-vc: Non-
parallel voice conversion using cycle-consistent ad-
versarial networks. In 2018 26th European Signal
Processing Conference (EUSIPCO), pages 2100–2104.
IEEE.
Luo, Z., Takiguchi, T., and Ariki, Y. (2016). Emotional voice
conversion using deep neural networks with mcc and f0
features. In 2016 IEEE/ACIS 15th International Con-
ference on Computer and Information Science (ICIS).
Luo, Z.-H., Chen, J., Takiguchi, T., and Sakurai, T. (2017).
Emotional voice conversion using neural networks with
arbitrary scales f0 based on wavelet transform. Journal
of Audio, Speech, and Music Processing, 18.
Morise, M., Kawahara, H., and Katayose, H. (2009). Fast
and reliable f0 estimation method based on the period
extraction of vocal fold vibration of singing voice and
speech. Paper 11.
Morise, M., Yokomori, F., and Ozawa, K. (2016). World:
A vocoder-based high-quality speech synthesis system
for real-time applications. IEICE Transactions on In-
formation and Systems, E99-D(7):1877–1884.
Olaronke, I. and Ikono, R. (2017). A systematic review of
emotional intelligence in social robots.
Scherer, K. R., Banse, R., Wallbott, H. G., and Goldbeck, T.
(1991). Vocal cues in emotion encoding and decoding.
Motivation and Emotion, 15(2):123–148.
Schröder, M. (2006). Expressing degree of activation in
synthetic speech. IEEE Transactions on Audio, Speech,
and Language Processing, 14(4):1128–1136.
Shah, N., Singh, M. K., Takahashi, N., and Onoe, N. (2023).
Nonparallel emotional voice conversion for unseen
speaker-emotion pairs using dual domain adversarial
network & virtual domain pairing.
Sisman, B., Zhang, M., and Li, H. (2019). Group sparse
representation with wavenet vocoder adaptation for
spectrum and prosody conversion. IEEE/ACM Trans-
actions on Audio, Speech, and Language Processing,
27(6):1085–1097.
Tao, J., Kang, Y., and Li, A. (2006). Prosody conversion
from neutral speech to emotional speech. IEEE Trans-
actions on Audio, Speech, and Language Processing,
14(4):1145–1154.
Toda, T., Black, A. W., and Tokuda, K. (2007). Voice con-
version based on maximum-likelihood estimation of
spectral parameter trajectory. IEEE Transactions on
Audio, Speech, and Language Processing, 15(8):2222–
2235.
Ulyanov, D., Vedaldi, A., and Lempitsky, V. S. (2016). In-
stance normalization: The missing ingredient for fast
stylization. ArXiv, abs/1607.08022.
Xue, Y., Hamada, Y., and Akagi, M. (2018). Voice conver-
sion for emotional speech: Rule-based synthesis with
degree of emotion controllable in dimensional space.
Speech Communication.
Zhou, K., Sisman, B., Liu, R., and Li, H. (2021). Emotional
voice conversion: Theory, databases and esd. arXiv.
Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017).
Unpaired image-to-image translation using cycle-
consistent adversarial networks. In Proceedings of
the IEEE international conference on computer vision,
pages 2223–2232. IEEE.