
Ghorbani, S., Ferstl, Y., Holden, D., Troje, N. F., and Carbonneau, M.-A. (2023). ZeroEGGS: Zero-shot example-based gesture generation from speech. In Computer Graphics Forum, volume 42, pages 206–216. Wiley Online Library.
Gong, Y., Chung, Y.-A., and Glass, J. (2021). AST: Audio Spectrogram Transformer. arXiv preprint arXiv:2104.01778.
Habibie, I., Xu, W., Mehta, D., Liu, L., Seidel, H.-P., Pons-Moll, G., Elgharib, M., and Theobalt, C. (2021). Learning speech-driven 3D conversational gestures from video. In Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents, pages 101–108.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kipp, M. (2001). Anvil: A generic annotation tool for multimodal dialogue. In Seventh European Conference on Speech Communication and Technology. Citeseer.
Kopp, S., Jung, B., Lessmann, N., and Wachsmuth, I. (2003). Max: A multimodal assistant in virtual reality construction. KI, 17(4):11.
Kopp, S., Krenn, B., Marsella, S., Marshall, A. N., Pelachaud, C., Pirker, H., Thórisson, K. R., and Vilhjálmsson, H. (2006). Towards a common framework for multimodal generation: The behavior markup language. In Intelligent Virtual Agents: 6th International Conference, IVA 2006, Marina Del Rey, CA, USA, August 21-23, 2006, Proceedings 6, pages 205–217. Springer.
Kopp, S. and Wachsmuth, I. (2002). Model-based animation of co-verbal gesture. In Proceedings of Computer Animation 2002 (CA 2002), pages 252–257. IEEE.
Lei, W., Ge, Y., Yi, K., Zhang, J., Gao, D., Sun, D., Ge, Y., Shan, Y., and Shou, M. Z. (2023). ViT-Lens-2: Gateway to omni-modal intelligence. arXiv preprint arXiv:2311.16081.
Levine, S., Theobalt, C., and Koltun, V. (2009). Real-time
prosody-driven synthesis of body language. In ACM
SIGGRAPH Asia 2009 papers, pages 1–10.
Li, J., Kang, D., Pei, W., Zhe, X., Zhang, Y., He, Z., and Bao, L. (2021). Audio2Gestures: Generating diverse gestures from speech audio with conditional variational autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11293–11302.
Liu, X., Wu, Q., Zhou, H., Xu, Y., Qian, R., Lin, X., Zhou, X., Wu, W., Dai, B., and Zhou, B. (2022). Learning hierarchical cross-modal association for co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10462–10472.
McNeill, D. (2019). Gesture and thought. University of Chicago Press.
Neff, M., Kipp, M., Albrecht, I., and Seidel, H.-P. (2008). Gesture modeling and animation based on a probabilistic re-creation of speaker style. ACM Transactions on Graphics (TOG), 27(1):1–24.
Peebles, W. and Xie, S. (2023). Scalable diffusion models
with transformers. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
4195–4205.
Pelachaud, C., Carofiglio, V., De Carolis, B., de Rosis, F., and Poggi, I. (2002). Embodied contextual agent in information delivering application. In Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems: Part 2, pages 758–765.
Piwek, P., Krenn, B., Schröder, M., Grice, M., Baumann, S., and Pirker, H. (2004). RRL: A rich representation language for the description of agent behaviour in NECA. arXiv preprint cs/0410022.
Ruan, L., Ma, Y., Yang, H., He, H., Liu, B., Fu, J., Yuan, N. J., Jin, Q., and Guo, B. (2023). MM-Diffusion: Learning multi-modal diffusion models for joint audio and video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10219–10228.
Tanke, J., Zhang, L., Zhao, A., Tang, C., Cai, Y., Wang, L., Wu, P.-C., Gall, J., and Keskin, C. (2023). Social diffusion: Long-term multiple human motion anticipation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9601–9611.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Vilhjálmsson, H., Cantelmo, N., Cassell, J., E. Chafai, N., Kipp, M., Kopp, S., Mancini, M., Marsella, S., Marshall, A. N., Pelachaud, C., et al. (2007). The behavior markup language: Recent developments and challenges. In Intelligent Virtual Agents: 7th International Conference, IVA 2007, Paris, France, September 17-19, 2007, Proceedings 7, pages 99–111. Springer.
Wei, Y., Hu, D., Tian, Y., and Li, X. (2022). Learning in audio-visual context: A review, analysis, and new perspective. arXiv preprint arXiv:2208.09579.
Yang, Y., Yang, J., and Hodgins, J. (2020). Statistics-based motion synthesis for social conversations. In Computer Graphics Forum, volume 39, pages 201–212. Wiley Online Library.
Yi, H., Liang, H., Liu, Y., Cao, Q., Wen, Y., Bolkart, T., Tao, D., and Black, M. J. (2023). Generating holistic 3D human motion from speech. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 469–480.
Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., and
Lee, G. (2020). Speech gesture generation from the
trimodal context of text, audio, and speaker identity.
ACM Transactions on Graphics (TOG), 39(6):1–16.
Yoon, Y., Ko, W.-R., Jang, M., Lee, J., Kim, J., and Lee, G. (2019). Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots. In 2019 International Conference on Robotics and Automation (ICRA), pages 4303–4309. IEEE.
Zhu, L., Liu, X., Liu, X., Qian, R., Liu, Z., and Yu, L. (2023). Taming diffusion models for audio-driven co-speech gesture generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10544–10553.