REFERENCES
Ahuja, C., Lee, D. W., Nakano, Y. I., and Morency, L.-P. (2020). Style transfer for co-speech gesture animation: A multi-speaker conditional-mixture approach. In European Conference on Computer Vision (ECCV), volume 12363 of LNCS, pages 248–265. Springer International Publishing.
Asakawa, E., Kaneko, N., Hasegawa, D., and Shirakawa, S. (2022). Evaluation of text-to-gesture generation model using convolutional neural network. Neural Networks, 151:365–375.
Bhattacharya, U., Rewkowski, N., Banerjee, A., Guhan, P., Bera, A., and Manocha, D. (2021). Text2Gestures: A transformer-based network for generating emotive body gestures for virtual agents. In 2021 IEEE Virtual Reality and 3D User Interfaces (VR).
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
Cao, Z., Simon, T., Wei, S.-E., and Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1302–1310.
Cassell, J., Vilhjálmsson, H. H., and Bickmore, T. (2001). BEAT: The behavior expression animation toolkit. In 28th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '01), pages 477–486. Association for Computing Machinery.
Chiu, C.-C., Morency, L.-P., and Marsella, S. (2015). Predicting co-verbal gestures: A deep and temporal modeling approach. In 15th International Conference on Intelligent Virtual Agents (IVA), pages 152–166. Springer International Publishing.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4171–4186. Association for Computational Linguistics.
Feng, F., Yang, Y., Cer, D., Arivazhagan, N., and Wang, W. (2022). Language-agnostic BERT sentence embedding. In 60th Annual Meeting of the Association for Computational Linguistics (ACL), pages 878–891. Association for Computational Linguistics.
Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., and Malik, J. (2019). Learning individual styles of conversational gesture. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3492–3501.
Hasegawa, D., Kaneko, N., Shirakawa, S., Sakuta, H., and Sumi, K. (2018). Evaluation of speech-to-gesture generation using bi-directional LSTM network. In 18th International Conference on Intelligent Virtual Agents (IVA), pages 79–86. Association for Computing Machinery.
Kucherenko, T., Jonell, P., van Waveren, S., Henter, G. E., Alexandersson, S., Leite, I., and Kjellström, H. (2020). Gesticulator: A framework for semantically-aware speech-driven gesture generation. In 2020 International Conference on Multimodal Interaction (ICMI), pages 242–250. Association for Computing Machinery.
Levine, S., Krähenbühl, P., Thrun, S., and Koltun, V. (2010). Gesture controllers. ACM Transactions on Graphics, 29(4).
McNeill, D. (1996). Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241. Springer International Publishing.
Simon, T., Joo, H., Matthews, I., and Sheikh, Y. (2017). Hand keypoint detection in single images using multiview bootstrapping. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4645–4653.
Takeuchi, K., Kubota, S., Suzuki, K., Hasegawa, D., and Sakuta, H. (2017). Creating a gesture-speech dataset for speech-based automatic gesture generation. In HCI International 2017 – Posters' Extended Abstracts, pages 198–202. Springer International Publishing.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Yang, Y. and Ramanan, D. (2013). Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2878–2890.
Yang, Z., Yang, Y., Cer, D., Law, J., and Darve, E. (2021). Universal sentence representation learning with conditional masked language model. In 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6216–6228. Association for Computational Linguistics.
Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., and Lee, G. (2020). Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics, 39(6).