
model spatial relationships between joints compared with methods using 1D convolution. Evaluation experiments demonstrated that 2CM-GPT achieved superior accuracy in motion-reconstruction tasks, both quantitatively and qualitatively, and showed high performance in text-guided 2D motion generation. Applying the motions generated with 2CM-GPT to pose-guided human video generation confirmed that the resulting videos exhibited natural movements. These results indicate the practicality of 2CM-GPT in motion-generation tasks. Future work will include training with more diverse motion datasets, introducing advanced architectures to further strengthen the relationship between text and motion, and exploring other possible applications for 2D motion generation.
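
To make the contrast in the first point concrete, the sketch below compares the two encoder choices: a 1D convolution that flattens the joint axis into channels against a 2D convolution whose kernel spans both time and joints. This is a minimal PyTorch illustration, not the 2CM-GPT implementation; the tensor layout, channel sizes, and the 17-joint pose format are assumptions for the example.

import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not taken from the paper):
# T frames, J joints, C feature channels per joint.
T, J, C = 64, 17, 32
motion = torch.randn(1, C, T, J)  # (batch, channels, time, joints)

# 1D-convolution baseline: the joint axis is flattened into channels,
# so inter-joint structure is only mixed implicitly through channels.
conv1d = nn.Conv1d(in_channels=C * J, out_channels=128, kernel_size=3, padding=1)
flat = motion.permute(0, 1, 3, 2).reshape(1, C * J, T)  # (1, C*J, T)
out1d = conv1d(flat)                                    # (1, 128, T)

# 2D convolution: the kernel slides over time AND the joint axis, so
# spatial relationships between nearby joints (in the chosen joint
# ordering) enter the receptive field explicitly.
conv2d = nn.Conv2d(in_channels=C, out_channels=128, kernel_size=3, padding=1)
out2d = conv2d(motion)                                  # (1, 128, T, J)

print(out1d.shape, out2d.shape)
# torch.Size([1, 128, 64]) torch.Size([1, 128, 64, 17])

Note that the 2D kernel relates joints that are adjacent in index order, which need not coincide with skeletal adjacency; the point of the sketch is only that the joint axis is convolved over at all, rather than being absorbed into the channel dimension.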