Authors:
Kenji Yasuda; Ryohei Orihara; Yuichi Sei; Yasuyuki Tahara and Akihiko Ohsuga
Affiliation:
Graduate School of Informatics and Engineering, University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585, Japan
Keyword(s):
Deep Learning, Domain Transfer, Generative Adversarial Network, Unsupervised Learning, Voice Conversion.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Biomedical Engineering; Biomedical Signal Processing; Computational Intelligence; Health Engineering and Technology Applications; Human-Computer Interaction; Methodologies and Methods; Neural Networks; Neurocomputing; Neurotechnology, Electronics and Informatics; Pattern Recognition; Physiological Computing Systems; Sensor Networks; Signal Processing; Soft Computing; Theory and Methods
Abstract:
In recent years, deep learning techniques have achieved natural and highly accurate outputs in domain transfer tasks. In particular, the advent of Generative Adversarial Networks (GANs) has enabled the transfer of objects between unspecified domains. Voice conversion, a popular example of domain transfer in speech, can be regarded as domain transfer between speakers. However, most voice conversion studies have focused only on transforming speaker identity, whereas capturing other nuances in the voice is necessary for natural speech synthesis. To address this issue, we transform the emotion expressed in speech using CycleGAN, one of the most promising GAN models. Specifically, we investigate the usefulness of speech with low emotional intensity as training data. Such speech is found to be useful when the training data contain multiple speakers.
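For readers unfamiliar with the CycleGAN objective referenced above, the sketch below illustrates one training step that combines the adversarial loss with the cycle-consistency loss on acoustic feature frames. It is a minimal, hypothetical PyTorch example and not the implementation used in the paper: the feature dimension, network sizes, loss weights, and optimizer settings are illustrative assumptions.

```python
# Minimal, hypothetical sketch of a CycleGAN training step for emotion conversion
# of acoustic feature frames. NOT the paper's implementation: the feature dimension,
# network sizes, loss weights, and optimizers are illustrative assumptions.
import torch
import torch.nn as nn

FEAT_DIM = 36        # assumed acoustic feature dimension (e.g., mel-cepstral coefficients)
LAMBDA_CYC = 10.0    # assumed weight of the cycle-consistency loss


def mlp(in_dim, out_dim):
    # Tiny fully connected network standing in for the real generator/discriminator.
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))


# Generators map frames between emotion domains A and B; discriminators
# score how realistic a frame looks within each domain.
G_ab, G_ba = mlp(FEAT_DIM, FEAT_DIM), mlp(FEAT_DIM, FEAT_DIM)
D_a, D_b = mlp(FEAT_DIM, 1), mlp(FEAT_DIM, 1)

opt_g = torch.optim.Adam(list(G_ab.parameters()) + list(G_ba.parameters()), lr=2e-4)
opt_d = torch.optim.Adam(list(D_a.parameters()) + list(D_b.parameters()), lr=1e-4)
adv = nn.MSELoss()   # least-squares GAN loss
l1 = nn.L1Loss()     # cycle-consistency loss


def train_step(real_a, real_b):
    # Generator update: fool both discriminators and reconstruct the inputs
    # after a round trip A -> B -> A (and B -> A -> B).
    fake_b, fake_a = G_ab(real_a), G_ba(real_b)
    cyc_a, cyc_b = G_ba(fake_b), G_ab(fake_a)
    s_fa, s_fb = D_a(fake_a), D_b(fake_b)
    loss_g = (adv(s_fb, torch.ones_like(s_fb)) + adv(s_fa, torch.ones_like(s_fa))
              + LAMBDA_CYC * (l1(cyc_a, real_a) + l1(cyc_b, real_b)))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # Discriminator update: real frames scored as 1, generated frames as 0.
    s_ra, s_rb = D_a(real_a), D_b(real_b)
    s_fa, s_fb = D_a(fake_a.detach()), D_b(fake_b.detach())
    loss_d = (adv(s_ra, torch.ones_like(s_ra)) + adv(s_rb, torch.ones_like(s_rb))
              + adv(s_fa, torch.zeros_like(s_fa)) + adv(s_fb, torch.zeros_like(s_fb)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_g.item(), loss_d.item()


# Usage with random stand-in batches of 16 frames per emotion domain:
print(train_step(torch.randn(16, FEAT_DIM), torch.randn(16, FEAT_DIM)))
```

The least-squares adversarial loss and the L1 cycle-consistency term follow the standard CycleGAN formulation; the paper's actual network architectures and feature extraction pipeline are not reproduced here.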