
Figure 5: Visualization of face construction for different persons.
REFERENCES
Abdul, Z. K. and Al-Talabani, A. K. (2022). Mel frequency cepstral coefficient and its applications: A review. IEEE Access, 10:122136–122158.
Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations.
Cao, X.-N., Trinh, Q.-H., Do-Nguyen, Q.-A., Ho, V.-S., Dang, H.-T., and Tran, M.-T. (2024a). EAPC: Emotion and audio prior control framework for the emotional and temporal talking face generation. In ICAART (2), pages 520–530.
Cao, X.-N., Trinh, Q.-H., Ho, V.-S., and Tran, M.-T. (2023). SpeechSyncNet: Speech to talking landmark via the fusion of prior frame landmark and the audio. In 2023 IEEE International Conference on Visual Communications and Image Processing (VCIP), pages 1–5. IEEE.
Cao, X.-N., Trinh, Q.-H., and Tran, M.-T. (2024b). TransAPL: Transformer model for audio and prior landmark fusion for talking landmark generation. In ICMV.
Chen, L., Li, Z., Maddox, R. K., Duan, Z., and Xu, C. (2018). Lip movements generation at a glance. In Proceedings of the European Conference on Computer Vision (ECCV), pages 520–535.
Eskimez, S. E., Zhang, Y., and Duan, Z. (2021). Speech driven talking face generation from a single image and an emotion condition.
Gong, Y., Chung, Y.-A., and Glass, J. (2021). AST: Audio spectrogram transformer. arXiv preprint arXiv:2104.01778.
Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y., and Pang, R. (2020). Conformer: Convolution-augmented transformer for speech recognition.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2018). GANs trained by a two time-scale update rule converge to a local Nash equilibrium.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units.
Ji, X., Zhou, H., Wang, K., Wu, W., Loy, C. C., Cao, X., and Xu, F. (2021). Audio-driven emotional video portraits. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14080–14089.
Kingma, D. P. and Welling, M. (2019). An introduction to variational autoencoders. CoRR, abs/1906.02691.
Kingma, D. P. and Welling, M. (2022). Auto-encoding variational Bayes.
Larkin, K. G. (2015). Structural similarity index SSIMplified: Is there really a simpler concept at the heart of image quality measurement?
Park, T., Liu, M.-Y., Wang, T.-C., and Zhu, J.-Y. (2019). Semantic image synthesis with spatially-adaptive normalization.
Sinha, S., Biswas, S., Yadav, R., and Bhowmick, B. (2022). Emotion-controllable generalized talking face generation.
Song, L., Wu, W., Qian, C., He, R., and Loy, C. C. (2022). Everybody’s talkin’: Let me talk as you want. IEEE Transactions on Information Forensics and Security, 17:585–598.
Verma, P. and Berger, J. (2021). Audio transformers: Transformer architectures for large scale audio understanding. Adieu convolutions.
Vougioukas, K., Petridis, S., and Pantic, M. (2020). Realistic speech-driven facial animation with GANs. International Journal of Computer Vision, 128:1398–1413.
Wang, C., Wu, Y., Qian, Y., Kumatani, K., Liu, S., Wei, F., Zeng, M., and Huang, X. (2021). UniSpeech: Unified speech representation learning with labeled and unlabeled data.
Wang, K., Wu, Q., Song, L., Yang, Z., Wu, W., Qian, C., He, R., Qiao, Y., and Loy, C. C. (2020). MEAD: A large-scale audio-visual dataset for emotional talking-face generation. In ECCV.
Zhong, W., Fang, C., Cai, Y., Wei, P., Zhao, G., Lin, L., and Li, G. (2023). Identity-preserving talking face generation with landmark and appearance priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738.
Zhou, Y., Han, X., Shechtman, E., Echevarria, J., Kalogerakis, E., and Li, D. (2020). MakeItTalk: speaker-aware talking-head animation. ACM Transactions on Graphics (TOG), 39(6):1–15.