
ments from speech. Advances in neural information
processing systems, 20.
Ferstl, Y., Neff, M., and McDonnell, R. (2019). Multi-
objective adversarial gesture generation. In Motion,
Interaction and Games, pages 1–10.
Habibie, I., Elgharib, M., Sarkar, K., Abdullah, A., Nyat-
sanga, S., Neff, M., and Theobalt, C. (2022). A mo-
tion matching-based framework for controllable ges-
ture synthesis from speech. In ACM SIGGRAPH 2022
Conference Proceedings, pages 1–9.
Hasegawa, D., Kaneko, N., Shirakawa, S., Sakuta, H., and
Sumi, K. (2018). Evaluation of speech-to-gesture gen-
eration using bi-directional lstm network. In Proceed-
ings of the 18th International Conference on Intelli-
gent Virtual Agents, pages 79–86.
Holden, D., Komura, T., and Saito, J. (2017). Phase-
functioned neural networks for character control.
ACM Transactions on Graphics (TOG), 36(4):42.
Holden, D., Saito, J., and Komura, T. (2016). A deep
learning framework for character motion synthesis
and editing. ACM Transactions on Graphics (TOG),
35(4):138.
Holden, D., Saito, J., Komura, T., and Joyce, T. (2015).
Learning motion manifolds with convolutional au-
toencoders. In SIGGRAPH Asia 2015 Technical
Briefs, SA ’15, pages 18:1–18:4, New York, NY,
USA. ACM.
Iwamoto, N., Kato, T., Shum, H. P., Kakitsuka, R., Hara,
K., and Morishima, S. (2017). Dancedj: A 3d dance
animation authoring system for live performance. In
International Conference on Advances in Computer
Entertainment, pages 653–670. Springer.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Laban, R. and Ullmann, L. (1971). The mastery of move-
ment.
Lee, H.-Y., Yang, X., Liu, M.-Y., Wang, T.-C., Lu, Y.-D.,
Yang, M.-H., and Kautz, J. (2019). Dancing to music.
Advances in neural information processing systems,
32.
Lee, S., Lee, S., Lee, Y., and Lee, J. (2021). Learning a
family of motor skills from a single motion clip. ACM
Transactions on Graphics (TOG), 40(4):1–13.
Levine, S., Kr
¨
ahenb
¨
uhl, P., Thrun, S., and Koltun, V.
(2010). Gesture controllers. In ACM SIGGRAPH
2010 papers, pages 1–11.
Levine, S., Theobalt, C., and Koltun, V. (2009). Real-time
prosody-driven synthesis of body language. In ACM
SIGGRAPH Asia 2009 papers, pages 1–10.
Li, J., Yin, Y., Chu, H., Zhou, Y., Wang, T., Fidler,
S., and Li, H. (2020). Learning to generate di-
verse dance motions with transformer. arXiv preprint
arXiv:2008.08171.
Li, R., Yang, S., Ross, D. A., and Kanazawa, A. (2021).
Ai choreographer: Music conditioned 3d dance gen-
eration with aist++. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
13401–13412.
Li, Z., Zhou, Y., Xiao, S., He, C., and Li, H. (2017). Auto-
conditioned LSTM network for extended complex hu-
man motion synthesis. CoRR, abs/1707.05363.
Liu, L. and Hodgins, J. (2017). Learning to schedule con-
trol fragments for physics-based characters using deep
q-learning. ACM Transactions on Graphics (TOG),
36(3):29.
Liu, L. and Hodgins, J. (2018). Learning basketball
dribbling skills using trajectory optimization and
deep reinforcement learning. ACM Trans. Graph.,
37(4):142:1–142:14.
McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M.,
Battenberg, E., and Nieto, O. (2015). librosa: Audio
and music signal analysis in python. In Proceedings
of the 14th python in science conference, volume 8,
pages 18–25.
McNeill, D. (2008). Gesture and thought. In Gesture and
Thought. University of Chicago press.
Morro Motion (2017). Dance mocap collection.
https://assetstore.unity.com/packages/3d/animations/
dance-mocap-collection-102966.
M
¨
uller, M., R
¨
oder, T., Clausen, M., Eberhardt, B., Kr
¨
uger,
B., and Weber, A. (2007). Documentation mocap
database hdm05.
Neff, M., Kipp, M., Albrecht, I., and Seidel, H.-P. (2008).
Gesture modeling and animation based on a proba-
bilistic re-creation of speaker style. ACM Transactions
On Graphics (TOG), 27(1):1–24.
Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., and Bajcsy, R.
(2013). Berkeley mhad: A comprehensive multimodal
human action database. In 2013 IEEE workshop on
applications of computer vision (WACV), pages 53–
60. IEEE.
Okamoto, T., Shiratori, T., Kudoh, S., and Ikeuchi, K.
(2010). Temporal scaling of leg motion for music
feedback system of a dancing humanoid robot. In
2010 IEEE/RSJ International Conference on Intelli-
gent Robots and Systems, pages 2256–2263. IEEE.
Park, J. and Ko, H. (2008). Real-time continuous
phoneme recognition system using class-dependent
tied-mixture hmm with hbt structure for speech-
driven lip-sync. IEEE Transactions on Multimedia,
10(7):1299–1306.
Peng, X. B., Abbeel, P., Levine, S., and van de Panne,
M. (2018). Deepmimic: Example-guided deep rein-
forcement learning of physics-based character skills.
CoRR, abs/1804.02717.
Peng, X. B., Berseth, G., Yin, K., and Van De Panne, M.
(2017). Deeploco: Dynamic locomotion skills using
hierarchical deep reinforcement learning. ACM Trans-
actions on Graphics (TOG), 36(4):41.
Peng, X. B., Ma, Z., Abbeel, P., Levine, S., and Kanazawa,
A. (2021). Amp: Adversarial motion priors for styl-
ized physics-based character control. ACM Transac-
tions on Graphics (TOG), 40(4):1–20.
Sargin, M. E., Erzin, E., Yemez, Y., Tekalp, A. M., Erdem,
A. T., Erdem, C., and Ozkan, M. (2007). Prosody-
driven head-gesture animation. In 2007 IEEE Inter-
national Conference on Acoustics, Speech and Sig-
GRAPP 2024 - 19th International Conference on Computer Graphics Theory and Applications
138