Authors:
Kotaro Koseki; Yuichi Sei; Yasuyuki Tahara and Akihiko Ohsuga
Affiliation:
The University of Electro-Communications, Tokyo, Japan
Keyword(s):
Multimodal Learning, Deep Learning, Machine Learning, CNN, VAE.
Abstract:
The task of “face generation from voice” could significantly change the way voice calls are made. Voice calls create a psychological gap compared with face-to-face communication because the other party’s face is not visible. Generating a face from voice can alleviate this gap and contribute to more efficient communication. Multimodal learning is a machine learning approach that uses data of different modalities (e.g., voice and face images) and is being studied as a way to combine various types of information such as text, images, and voice, as in Google’s Imagen (Saharia et al., 2022). In this study, we perform multimodal learning of speech and face images using a convolutional neural network (CNN) speech encoder and a variational autoencoder (VAE) for face images, building models that represent speech and face images of different modalities in the same latent space. Focusing on the emotional information in speech, we also built a model that generates face images reflecting the speaker’s emotions and attributes from input speech. As a result, we were able to generate face images that reflect coarse emotions and attributes, although the degree to which an emotion is reflected varies by emotion type.
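Below is a minimal sketch, assuming a PyTorch implementation, of the kind of architecture the abstract describes: a CNN speech encoder that maps speech features to a latent distribution, and a VAE-style decoder that renders a face image from the same latent space. All layer sizes, the 128-dimensional latent space, the log-mel spectrogram input, and the 64x64 output resolution are hypothetical choices for illustration, not details taken from the paper.

```python
# Sketch only: a CNN speech encoder and a face-image VAE decoder sharing one
# latent space, so a face can be decoded from encoded speech (assumed PyTorch;
# layer sizes and input/output shapes are hypothetical).
import torch
import torch.nn as nn


class SpeechEncoder(nn.Module):
    """CNN mapping a log-mel spectrogram (B, 1, 64, T) to latent mean/log-variance."""
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool over time/frequency -> fixed-size vector
        )
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)

    def forward(self, spec: torch.Tensor):
        h = self.conv(spec).flatten(1)
        return self.mu(h), self.logvar(h)


class FaceDecoder(nn.Module):
    """VAE decoder mapping a latent vector to a 64x64 RGB face image."""
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z: torch.Tensor):
        return self.deconv(self.fc(z).view(-1, 128, 8, 8))


def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Standard VAE reparameterization: z = mu + sigma * eps."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)


if __name__ == "__main__":
    enc, dec = SpeechEncoder(), FaceDecoder()
    spec = torch.randn(2, 1, 64, 100)        # dummy batch of spectrograms
    mu, logvar = enc(spec)
    face = dec(reparameterize(mu, logvar))   # (2, 3, 64, 64) generated faces
    print(face.shape)
```

In such a setup, training would also need a face-image encoder and losses (e.g., reconstruction, KL divergence, and a term aligning speech and face latents) to make the two modalities share the latent space; those are omitted here for brevity.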