SPEAR: SPADE-Net and HuBERT for Enhanced Audio-to-Facial Reconstruction
Xuan-Nam Cao, Minh-Triet Tran
Talking-face generation has become an essential area of research due to its broad applications. Previous facial synthesis methods have struggled to keep generated facial images consistent with their input landmarks, especially under complex expressions or pose variations. To address these challenges, this paper proposes a novel generative approach for face synthesis driven by audio, pose, and reference images. The proposed system combines a pretrained Variational Autoencoder (VAE), Transformer encoders, SPADE (Spatially Adaptive Normalization) modules, and optical flow-based warping to generate realistic facial images. It uses HuBERT for audio feature extraction, a pose encoder to capture pose-driven features, and a reference encoder to provide contextual facial information. The generated face, which incorporates audio cues, pose variations, and reference images, is refined through optical flow to align with the driving pose and landmarks, ensuring high fidelity and natural facial animation. Experimental results demonstrate the effectiveness of this system in generating high-quality, emotion-driven facial animations.
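As a rough illustration of the SPADE modules named in the abstract, the following minimal NumPy sketch shows the general idea of spatially adaptive normalization: activations are normalized per channel, then rescaled and shifted per pixel by values predicted from a conditioning map. It is not the authors' implementation. The function name `spade_modulate`, the 1x1 projection weights `w_gamma` and `w_beta`, and the use of a landmark-style conditioning map are assumptions made for this example; real SPADE layers predict the modulation with a small convolutional network, and the `1 + gamma` scaling follows a common implementation convention.

```python
import numpy as np


def spade_modulate(features, cond_map, w_gamma, w_beta, eps=1e-5):
    """Simplified spatially adaptive normalization (illustrative only).

    features: (C, H, W) activations to normalize.
    cond_map: (K, H, W) conditioning map, e.g. a rasterized landmark map
        (an assumption for this sketch).
    w_gamma, w_beta: (C, K) weights of hypothetical 1x1 projections that
        stand in for the small conv net used in real SPADE layers.
    """
    # Parameter-free normalization: zero mean, unit variance per channel.
    mean = features.mean(axis=(1, 2), keepdims=True)
    var = features.var(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / np.sqrt(var + eps)

    # Per-pixel scale and shift predicted from the conditioning map.
    gamma = np.einsum("ck,khw->chw", w_gamma, cond_map)
    beta = np.einsum("ck,khw->chw", w_beta, cond_map)

    # Modulate the normalized activations with the spatial parameters.
    return normalized * (1.0 + gamma) + beta


# Usage with random placeholder data (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8))
cond_map = rng.normal(size=(3, 8, 8))
w_gamma = 0.1 * rng.normal(size=(4, 3))
w_beta = 0.1 * rng.normal(size=(4, 3))
out = spade_modulate(features, cond_map, w_gamma, w_beta)
```

Because the scale and shift vary per pixel, the conditioning signal can steer different regions of the feature map differently, which is the property that makes this kind of layer a plausible fit for landmark-consistent synthesis.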
Paper Citation
in Harvard Style
Cao X. and Tran M. (2025). SPEAR: SPADE-Net and HuBERT for Enhanced Audio-to-Facial Reconstruction. In Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART; ISBN 978-989-758-737-5, SciTePress, pages 1311-1318. DOI: 10.5220/0013349100003890