2D Motion Generation Using Joint Spatial Information with 2CM-GPT
Ryota Inoue, Tsubasa Hirakawa, Takayoshi Yamashita and Hironobu Fujiyoshi
Chubu University, 1200 Matsumoto-cho, Kasugai, Aichi, Japan
{inoue, hirakawa}@mprg.cs.chubu.ac.jp, {takayoshi, fujiyoshi}@isc.chubu.ac.jp
Keywords:
Deep Learning, 2D Motion Generation, Variational AutoEncoder, Language Model.
Abstract:
Various methods have been proposed for generating human motion from text due to advancements in large lan-
guage models and diffusion models. However, most research has focused primarily on 3D motion generation.
While 3D motion enables realistic representations, the creation and collection of datasets using motion-capture
technology is costly, and its application to downstream tasks, such as pose-guided human video generation, is
limited. Therefore, we propose 2D Convolutional Motion Generative Pre-trained Transformer (2CM-GPT), a
method for generating two-dimensional (2D) motion from text. 2CM-GPT is based on the framework of Mo-
tionGPT, a method for 3D motion generation, and uses a motion tokenizer to convert 2D motion into motion
tokens while learning the relationship between text and motion using a language model. Unlike MotionGPT,
which utilizes 1D convolution for processing 3D motion, 2CM-GPT uses 2D convolution for processing 2D
motion. This enables more effective capture of spatial relationships between joints. Evaluation experiments
demonstrated that 2CM-GPT is effective in both motion reconstruction and text-guided 2D motion generation.
The generated 2D motion is also shown to be effective for pose-guided human video generation.
1 INTRODUCTION
The task of generating human motion from text is
gaining attention due to its potential applications in
fields such as games, movies, virtual reality, and aug-
mented reality. With significant advancements in
large language models (LLMs) and diffusion mod-
els, various methods for generating human motion
from text have been proposed, e.g., methods based
on LLMs, such as Generative Pre-trained Trans-
former (GPT) including MotionGPT (Jiang et al.,
2024) and T2M-GPT (Zhang et al., 2023), and
those using diffusion models such as motion diffu-
sion model (MDM) (Tevet et al., 2023), MotionDif-
fuse (Zhang et al., 2024), and motion latent diffusion
(MLD) (Chen et al., 2023).
Conventional human-motion-generation methods
mainly focus on generating three-dimensional (3D)
motions using natural-language texts describing mo-
tions as input. These methods are trained on a
skinned multi-person linear (SMPL) model (Loper
et al., 2015) motion-text-pair dataset to generate
human motions. However, 3D motion generation
incurs high dataset-creation costs. Typically, 3D-
human-motion data are created and collected using
motion-capture technology, but this requires dedi-
cated motion-capture equipment, studios, and ac-
tors. This also requires specialized knowledge for
setup and post-processing. Applications such as pose-
guided human-video-generation tasks also often re-
quire 2D motions as input. Therefore, it is necessary
to convert 3D motions to 2D motions. In summary,
conventional methods focusing on 3D motion generation incur high dataset-creation costs and are inefficient when applied to 2D applications.
We propose 2D Convolutional Motion Genera-
tive Pre-trained Transformer (2CM-GPT), a method
for generating 2D motion from text. By focusing on
2D motion, datasets can be created more easily using
techniques such as pose estimation, compared with
3D motion, enabling direct application of the gener-
ated 2D motion to various applications. Using text as
input also enables intuitive motion generation. Our
method consists of a motion tokenizer for process-
ing motion as text and a language model for learning
the relationship between text and motion. The motion
tokenizer uses a vector quantised variational autoen-
coder (VQ-VAE) (Van Den Oord et al., 2017) to con-
vert human 2D motion into discrete motion tokens.
This enables handling motion in a feature space sim-
ilar to text. The language model uses a pre-trained
text-to-text transfer Transformer (T5) (Raffel et al.,
2020) to learn motion as language. This enables the
generation of diverse 2D motions on the basis of dif-
ferent texts. We converted the HumanML3D (Guo
et al., 2022) dataset, which pairs text with 3D motion
data and is widely used in text-to-motion tasks, into
2D-motion data for model training.
Our contributions are as follows.
• We propose a generation method that modifies the convolution of the motion tokenizer in a conventional method relying on 3D data, enabling 2D motion generation from text.
• We applied the 2D motion generated with our method to a pose-guided human-video-generation method, demonstrating the effectiveness of our method.
2 RELATED WORK
In this section, we introduce datasets that pair text
with 3D motion and methods for generating 3D mo-
tion from text.
2.1 3D Human Motion-Language
Dataset
Datasets that pair human motion with text annota-
tions play a crucial role in text-to-motion tasks. These
datasets are essential for training models that generate
motion based on natural-language descriptions, and
the quality and diversity of the datasets significantly
impact the performance of generation models. Repre-
sentative datasets widely used in text-to-motion tasks
include HumanML3D and the KIT Motion-Language
(ML) dataset (Plappert et al., 2016).
HumanML3D is a combined dataset of Human-
Act12 (Guo et al., 2020) and the Archive of Motion Capture as Surface Shapes (AMASS) (Mahmood et al., 2019), comprising 14,616 motions and 44,970 text annotations.
This dataset includes everyday motions covering a
wide range of actions, such as walking and jumping;
sports motions, such as swimming and karate; acro-
batic motions, such as rotating; and artistic motions,
such as dancing. Therefore, HumanML3D is a versa-
tile dataset that can be used in various contexts and is
frequently used as a benchmark for diverse motion-
generation tasks. The KIT-ML dataset consists of
3,911 motions and 6,353 text annotations, including
more detailed motions, e.g., gesture motions, such as
pointing and waving; movements, such as walking
and crawling; manipulations, such as throwing and
wiping; and sports motions such as martial arts and
tennis.
2.2 Text-Guided 3D Motion Generation
Text-guided 3D motion generation aims to generate
human motion in 3D space using natural-language
text as input. Text2Action (Ahn et al., 2018) uses a
recurrent neural network (Graves and Graves, 2012)-
based sequence-to-sequence (Sutskever et al., 2014)
architecture to generate motion from text. Joint
language to pose (Ahuja and Morency, 2019) com-
bines a text encoder based on long short-term mem-
ory (Hochreiter and Schmidhuber, 1997) and a motion
encoder-decoder using a gated recurrent unit (Cho
et al., 2014) that learns text and motion in a shared
feature space for more accurate motion generation.
MotionCLIP (Tevet et al., 2022) leverages the text-
image latent space of contrastive language-image pre-
training (Radford et al., 2021), enabling flexible and
broad mappings from language to motion. This ap-
proach is particularly adaptable to diverse instruc-
tions given in language, contributing to a wide range
of motion-generation tasks. TEMOS (Petrovich
et al., 2022) uses a combination of a VAE and Trans-
former (Vaswani et al., 2017) for sequence-level em-
beddings of text and motion, generating diverse mo-
tion. MLD, a method based on diffusion models, exe-
cutes diffusion processes in the latent space of motion
to generate plausible human-motion sequences. Mo-
tionGPT uses a motion tokenizer using a VQ-VAE to
build a motion vocabulary and an LLM to learn the re-
lationship between text and motion vocabularies. This
enables MotionGPT to handle multiple motion tasks
such as text-to-motion, motion-to-text, motion predic-
tion, and motion in-between in a unified framework.
In summary, research in text-to-motion is cur-
rently active and has made significant progress
through these previous studies. However, most of
these methods focus only on 3D motion generation,
and research on 2D motion generation has not pro-
gressed much.
3 PROPOSED METHOD
In this section, we introduce 2CM-GPT, our method
for generating 2D motion from text.
3.1 Overview
Previous methods focusing on 3D motion generation
incur high costs for dataset creation and are inefficient
when applied to 2D applications. Thus, 2CM-GPT
Figure 1: 2CM-GPT architecture. The model consists of a motion tokenizer for processing motion as text and a language
model for learning the relationship between text and motion. The input to the language model is a text-motion vocabulary that
combines two types of tokens: text tokens that describe motion and motion tokens that represent 2D motion.
Figure 2: Motion tokenizer architecture. The model uses a CNN-based architecture with 2D convolution (Conv2D), residual
block (ResBlock), ReLU activation, and nearest neighbor interpolation (Upsample).
was developed for generating 2D motion from text.
By focusing on 2D motion, it becomes possible to
create datasets using methods, such as pose estima-
tion, enabling easier and more precise data creation
compared with focusing on 3D motion. Using text
as input enables intuitive generation, and the gener-
ated 2D motion can be directly applied to 2D appli-
cations. The architecture of 2CM-GPT is shown in
Figure 1. 2CM-GPT is based on the framework of the
3D-motion-generation method MotionGPT and con-
sists of two main components: a motion tokenizer
and language model, with modifications to the con-
volution method in the motion tokenizer and training
method of the language model. The next sections de-
scribe these components: Section 3.2 explains the ar-
chitecture, role, and training of the motion tokenizer,
and Section 3.3 discusses the role and training of the
language model when learning the relationship be-
tween text and motion.
3.2 Motion Tokenizer
The motion tokenizer in 2CM-GPT uses a VQ-VAE
architecture to build a vocabulary of 2D motion. VQ-
VAE is an extension of VAE with a discrete latent
space and consists of an encoder-decoder architec-
ture. The motion tokenizer first uses an encoder to
estimate latent variables from the input motion. It
then quantizes the latent variables using a codebook
of discrete vectors. Specifically, it finds the nearest
vector $b_k$ in the codebook $B := \{b_k\}_{k=1}^{K}$ for each latent variable $z_e$ output by the encoder. The codebook is constructed with $K$ latent-variable vectors of dimension $d$. The vector quantization is shown in Equation (1):

$$z_q = b_k, \quad \text{where } k = \arg\min_{k} \left\| z_e - b_k \right\|_2 \tag{1}$$
Next, the quantized latent variables are used by the
decoder to reconstruct the input motion.
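As a concrete illustration, the following is a minimal sketch of the quantization step in Equation (1), assuming a PyTorch implementation; the tensor shapes and the straight-through gradient trick are standard VQ-VAE practice rather than details given in the paper.

```python
# Minimal sketch of the vector-quantization step in Equation (1).
import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Replace each latent vector by its nearest codebook entry.

    z_e:      (num_latents, d) encoder outputs
    codebook: (K, d) learnable embedding vectors b_k
    """
    # Euclidean distance between every latent and every codebook entry
    dists = torch.cdist(z_e, codebook)      # (num_latents, K)
    indices = dists.argmin(dim=1)           # k = argmin_k ||z_e - b_k||_2
    z_q = codebook[indices]                 # quantized latents
    # Straight-through estimator: copy gradients from z_q to z_e during training
    z_q = z_e + (z_q - z_e).detach()
    return z_q
```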
With 2CM-GPT, we change the convolution used
in the encoder and decoder of the motion tokenizer
from 1D convolution to 2D convolution. While 3D
motion data are represented in the same dimension
for joint positions, velocities, and angles, to which MotionGPT applies 1D convolution, 2D motion data represent joints and xy-coordinates in two different dimensions. Applying 1D convolution after aggregating
these into a single dimension fails to properly model
the relationships between joints. Instead, by using 2D
convolution while keeping the two dimensions sep-
arate, 2CM-GPT can directly account for the spatial
relationships between joints. The architecture of the
motion tokenizer and motion decoder is shown in Fig-
ure 2.
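To make this difference concrete, the following sketch contrasts the two input layouts; the channel widths and kernel sizes are illustrative assumptions, not the exact configuration shown in Figure 2.

```python
# Illustrative comparison of the encoder input layouts: MotionGPT flattens
# joints into the channel axis for Conv1d, whereas 2CM-GPT keeps joints and
# xy-coordinates as separate axes and applies Conv2d over (time, joints).
import torch
import torch.nn as nn

T, J = 64, 18                                    # frames, COCO joints

# 1D case: joints and coordinates collapsed into one channel dimension
motion_1d = torch.randn(1, J * 2, T)             # (batch, channels=J*2, time)
conv1d = nn.Conv1d(J * 2, 128, kernel_size=3, padding=1)
print(conv1d(motion_1d).shape)                   # torch.Size([1, 128, 64])

# 2D case: xy-coordinates as channels, joints kept as a spatial axis
motion_2d = torch.randn(1, 2, T, J)              # (batch, channels=2, time, joints)
conv2d = nn.Conv2d(2, 128, kernel_size=3, padding=1)
print(conv2d(motion_2d).shape)                   # torch.Size([1, 128, 64, 18])
```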
The motion tokenizer in MotionGPT uses three
different loss functions for training: reconstruction
loss, embedding loss, and commitment loss. Of
these losses, embedding loss is computed by extract-
ing joint velocity information from the embedding.
However, 2CM-GPT uses 2D motion, which is represented only by the coordinates of the joints and thus has no velocity information. Therefore, we use two dif-
ferent loss functions for training: reconstruction loss
and commitment loss. To improve codebook utiliza-
tion, we use the exponential moving average and the
codebook-resetting technique (Razavi et al., 2019).
The L1 smooth loss is used for reconstruction loss.
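A minimal sketch of the resulting training objective is shown below, assuming a PyTorch-style training step; the commitment weight beta and the omission of the EMA and codebook-resetting machinery are simplifications for illustration.

```python
# Sketch of the motion-tokenizer loss: smooth-L1 reconstruction + commitment.
# EMA codebook updates and codebook resetting are omitted for brevity.
import torch
import torch.nn.functional as F

def tokenizer_loss(motion, recon, z_e, z_q, beta: float = 0.25):
    """motion/recon: input and reconstructed 2D motion;
    z_e/z_q: pre- and post-quantization latents."""
    recon_loss = F.smooth_l1_loss(recon, motion)
    # Commitment loss pulls encoder outputs toward their (frozen) codebook entries
    commit_loss = F.mse_loss(z_e, z_q.detach())
    return recon_loss + beta * commit_loss
```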
3.3 Language Model
2CM-GPT uses a pre-trained T5 model to learn the
relationship between text tokens and motion tokens
generated by the motion tokenizer. The input to the
language model is a text-motion vocabulary that com-
bines two types of tokens: text tokens that describe
motion and motion tokens that represent 2D motion.
Depending on the task, the text-motion vocabulary
can represent text tokens, motion tokens, and tokens
that represent both text and motion, enabling the model to generate diverse and flexible text and motion.
The language model in MotionGPT involves pre-
training that learns the relationship between text to-
kens and motion tokens and instruction tuning for var-
ious motion-related tasks. During pre-training, 15%
of the input tokens are randomly replaced with special
sentinel tokens, and the model learns to generate the
corresponding tokens in the output. It also learns the
relationship using paired text and motion tokens. For
instruction tuning, prompts for tasks, such as text-to-
motion, motion-to-text, motion prediction, and mo-
tion in-between, are used. However, to achieve higher
accuracy in the text-to-motion task, we fine-tune the
language model in 2CM-GPT using only the instruc-
tion prompts for that task. The instruction prompts
used as input are those for text-to-motion from Mo-
tionGPT.
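The following hypothetical sketch illustrates the two training regimes described above, assuming a HuggingFace T5 tokenizer extended with motion tokens; the token names, prompt wording, and masking example are assumptions made for illustration, not the exact MotionGPT templates.

```python
# Hypothetical sketch of the language-model inputs, assuming a HuggingFace
# T5 tokenizer whose vocabulary is extended with motion tokens <motion_id_k>.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
K = 128                                            # codebook size used in Sec. 4.2
tokenizer.add_tokens([f"<motion_id_{k}>" for k in range(K)])

# (1) Pre-training: ~15% of mixed text/motion tokens are replaced by T5
#     sentinel tokens (<extra_id_0>, ...) and the model reconstructs the spans.
corrupted = "A person <extra_id_0> with their right leg. <motion_id_3> <extra_id_1> <motion_id_9>"
targets = "<extra_id_0> kicks <extra_id_1> <motion_id_7>"

# (2) Fine-tuning on text-to-motion only: instruction prompt -> motion tokens.
prompt = "Generate motion: A person kicks with their right leg."   # illustrative wording
answer = " ".join(f"<motion_id_{k}>" for k in [3, 7, 9, 12])
inputs = tokenizer(prompt, return_tensors="pt")
labels = tokenizer(answer, return_tensors="pt").input_ids
```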
4 EXPERIMENT
We quantitatively and qualitatively evaluated the ef-
fectiveness of 2CM-GPT by comparing the recon-
structed 2D motion using the motion tokenizer and
the motion generation guided by text. We also ap-
plied the 2D motion generated by 2CM-GPT to the
downstream task of pose-guided human video gener-
ation and qualitatively assessed the accuracy of the
generated 2D motion.
4.1 2D Human-Motion-Language
Dataset
Text-guided 2D motion generation requires a dataset
consisting of pairs of 2D motion and text, but no
suitable dataset currently exists. Therefore, creating such a dataset is a major challenge, and it is essential to convert existing datasets into a suitable data format. For human
video generation, a 2D motion with 18 joints based on
the Common Objects in Context (COCO) (Lin et al.,
2014) dataset is widely used, so a dataset based on this
format is desirable. Therefore, we used a 3D motion
dataset to create a 2D motion dataset in the COCO
format. Specifically, we converted 3D motion to 2D
motion on the basis of HumanML3D, which consists
of pairs of text and 3D motion. The conversion pro-
cedure begins with estimating an SMPL mesh from
the 3D motion data included in HumanML3D using
the SMPL model. The estimated SMPL mesh is then
multiplied by a regressor matrix to obtain the coordinates of 18 joints on a 2D plane. This method en-
ables us to obtain accurate joint information based on
the COCO format. Our 2D motion dataset was con-
structed on the basis of the text, type of motion, and
length of the action, similar to HumanML3D. This en-
ables the generation of diverse 2D motions guided by
text.
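As a rough sketch of this conversion, the code below maps per-frame SMPL mesh vertices to 18 joints with a fixed regressor matrix and keeps the x and y components; the file name and array shapes are assumptions for illustration, not the paper's released assets.

```python
# Sketch of the mesh-to-joint conversion: a regressor matrix maps SMPL mesh
# vertices to 18 COCO joints, and the x/y components are kept as the 2D pose.
import numpy as np

def mesh_to_coco_2d(vertices: np.ndarray, regressor: np.ndarray) -> np.ndarray:
    """vertices: (T, 6890, 3) SMPL mesh per frame; regressor: (18, 6890)."""
    joints_3d = np.einsum("jv,tvc->tjc", regressor, vertices)  # (T, 18, 3)
    return joints_3d[..., :2]                                  # keep x, y -> (T, 18, 2)

regressor = np.load("J_regressor_coco.npy")        # hypothetical (18, 6890) matrix
motion_2d = mesh_to_coco_2d(np.zeros((64, 6890, 3)), regressor)
print(motion_2d.shape)                             # (64, 18, 2)
```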
4.2 Experimental Settings
In this experiment, we used the 2D Human Motion-
Language dataset as the training and evaluation
dataset. To enhance the diversity of the dataset, we
applied left-right mirrored data augmentation to all
motions and specific texts (such as “right hand,” “left hand,” “right foot,” “left foot”), effectively doubling the dataset size. We used 23,384 motions for training, 1,460 for validation, and 4,384 for evaluation.
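A minimal sketch of this augmentation is given below, assuming the 18-joint COCO/OpenPose ordering and x-coordinates normalized to [0, 1]; the joint-swap table and the text-swapping rule are assumptions made for the example.

```python
# Left-right mirroring augmentation sketch for 2D pose sequences and captions.
import numpy as np

# Left/right COCO joint index pairs to swap after mirroring (assumed ordering)
LR_PAIRS = [(2, 5), (3, 6), (4, 7), (8, 11), (9, 12), (10, 13), (14, 15), (16, 17)]

def mirror_motion(motion: np.ndarray) -> np.ndarray:
    """motion: (T, 18, 2) 2D pose sequence with x normalized to [0, 1]."""
    mirrored = motion.copy()
    mirrored[..., 0] = 1.0 - mirrored[..., 0]       # flip x about the vertical axis
    for a, b in LR_PAIRS:
        mirrored[:, [a, b]] = mirrored[:, [b, a]]   # swap left/right joints
    return mirrored

def mirror_text(text: str) -> str:
    """Swap 'left'/'right' mentions so the caption matches the mirrored pose."""
    return text.replace("right", "\0").replace("left", "right").replace("\0", "left")
```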
For training the motion tokenizer, the batch size
was set to 256, training iterations to 3,000 epochs,
learning rate to 2 × 10⁻⁴, and the codebook size to 128 × 512. For pre-training and fine-tuning the lan-
guage model, the batch size was set to 16, with 300
epochs for pre-training, 100 epochs for fine-tuning,
and a learning rate of 1 × 10⁻⁴. The optimizer used
for all training processes was AdamW.
4.3 Evaluation Metrics
We evaluated the accuracy of the motion tokenizer’s
2D motion reconstruction with 2CM-GPT using mean
per joint position error (MPJPE), Fréchet inception distance (FID) (Heusel et al., 2017), and diversity (Guo et al., 2020).
MPJPE measures the Euclidean distance between
the generated joints and ground-truth joints, taking
the mean of these distances. A value closer to 0 in-
dicates that the generated motion is accurate. MPJPE
is useful for measuring the reconstruction accuracy of
motion because it directly reflects the joint-position
errors. MPJPE is expressed as
$$\mathrm{MPJPE} = \frac{1}{N} \sum_{i=1}^{N} \left\| j_i - \hat{j}_i \right\|_2 \tag{2}$$

where $N$ is the number of joints, $j_i$ denotes the ground-truth joint coordinates, and $\hat{j}_i$ denotes the predicted joint coordinates.
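A minimal sketch of Equation (2), assuming NumPy arrays holding corresponding ground-truth and predicted joint coordinates:

```python
# MPJPE: mean Euclidean distance between corresponding joints.
import numpy as np

def mpjpe(gt_joints: np.ndarray, pred_joints: np.ndarray) -> float:
    """gt_joints, pred_joints: arrays of shape (N, 2)."""
    return float(np.linalg.norm(gt_joints - pred_joints, axis=-1).mean())
```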
FID measures the Fréchet distance between the distribution of generated motions and that of ground-truth motions, with values closer to 0 indicating that the distribution of generated motions is close to the ground-truth distribution. FID is useful for evaluating the overall quality and realism of generated motions. FID is expressed as

$$\mathrm{FID} = \left\| \mu - \mu' \right\|_2^2 + \mathrm{Tr}\!\left( \Sigma + \Sigma' - 2 (\Sigma \Sigma')^{1/2} \right) \tag{3}$$

where $\mu$ and $\Sigma$ respectively represent the mean and covariance of the ground-truth-motion distribution, and $\mu'$ and $\Sigma'$ represent those of the generated-motion distribution.
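A minimal sketch of Equation (3), assuming feature matrices of shape (number of samples × feature dimension) extracted by a motion tokenizer:

```python
# FID between real and generated motion features.
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```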
Diversity measures the average distance between randomly sampled pairs of generated motions, with larger values indicating greater diversity. Diversity is expressed as

$$\mathrm{Diversity} = \frac{1}{S} \sum_{i=1}^{S} \left\| v_i - v'_i \right\|_2 \tag{4}$$

where $S$ is the number of sampled pairs, and $v_i$ and $v'_i$ are the motion vectors of the $i$-th sampled pair.
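A minimal sketch of Equation (4), assuming a matrix of motion feature vectors and S randomly sampled pairs:

```python
# Diversity: mean distance between S randomly sampled pairs of motion features.
import numpy as np

def diversity(feats: np.ndarray, num_pairs: int = 300, seed: int = 0) -> float:
    """feats: (num_motions, feat_dim) motion feature vectors."""
    rng = np.random.default_rng(seed)
    first = feats[rng.integers(0, len(feats), size=num_pairs)]
    second = feats[rng.integers(0, len(feats), size=num_pairs)]
    return float(np.linalg.norm(first - second, axis=-1).mean())
```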
For FID and diversity evaluation, feature vectors extracted from a pre-trained evaluation model are typically used. However, there is currently no suitable fea-
ture extractor for the 2D Human Motion-Language
dataset. Therefore, we used the motion tokenizer of
MotionGPT and 2CM-GPT as feature extractors, en-
abling fair comparison of the inherent performance
of each method. This enables accurate evaluation of
the reconstruction and generation accuracy of 2CM-
GPT, clearly demonstrating performance differences
with other methods.
4.4 Motion Reconstruction
We quantitatively and qualitatively evaluated the 2D-
motion-reconstruction accuracy of 2CM-GPT’s mo-
tion tokenizer.
Quantitative Evaluation. The accuracy of motion
reconstruction using the motion tokenizer is shown
in Table 1, calculated using MPJPE, FID, and diver-
sity. The table also shows that 2CM-GPT improved
reconstruction accuracy compared with MotionGPT.
Specifically, MPJPE improved by 0.018 points, in-
dicating improved joint-position accuracy, enabling
closer reconstruction to the ground truth. To com-
plement the quantitative evaluation, we also show
MPJPE results for individual joints in Table 2. The
table shows that 2CM-GPT consistently outperforms
MotionGPT at all joints. Notably, joints such as the
hips, shoulder, and neck exhibit the lowest MPJPE
values. These joints typically involve less complex
movement patterns, making them more predictable
for the model. In contrast, joints like the wrists and
elbows show higher MPJPE values, suggesting that
these are more challenging to reconstruct. This is
likely due to their greater range of motion and higher
degrees of freedom, as well as the fact that they are
involved in more intricate and dynamic movements.
FID improved by over 7.894 points, indicating that
the reconstructed motion distribution is similar to that
of real motions. Diversity improved by over 0.859
points, indicating enhanced diversity in the generated
motions, enabling 2CM-GPT to reconstruct a broader
range of motion patterns. These results indicate that
using 2D convolution for processing 2D motion data
is more effective than using 1D convolution for mo-
tion reconstruction.
Qualitative Evaluation. The reconstructed motion
using the motion tokenizer is shown in Figure 3.
The figure shows that the motion reconstructed by 2CM-GPT is closer to the input motion than that reconstructed by MotionGPT. Focusing on the movements of each
joint, the generated results from MotionGPT indi-
cate noticeable deviations in joint positions and ori-
entations. For example, even when the input motion
faces left, the reconstructed motion from MotionGPT
Table 1: Comparison of accuracy of motion reconstruction by the motion tokenizer between MotionGPT and 2CM-GPT. The method names in the columns for FID and diversity indicate the motion tokenizer used as the feature extractor. “Real” represents the evaluation results of ground-truth 2D motions.

Method      MPJPE   FID (MotionGPT)   FID (2CM-GPT)   Diversity (MotionGPT)   Diversity (2CM-GPT)
Real        0.000   0.000             0.000           16.981                  16.954
MotionGPT   0.179   33.753            34.545          11.385                  10.896
2CM-GPT     0.161   25.859            25.150          12.244                  12.129
Figure 3: Comparison of reconstructed motions by motion tokenizer between MotionGPT and 2CM-GPT. The left image
shows the initial frame, and the right image shows the final frame.
Table 2: Comparison of MPJPE for each joint of motion reconstruction by the motion tokenizer between MotionGPT and 2CM-GPT.

Joint            MotionGPT   2CM-GPT
Nose             0.180       0.162
Neck             0.155       0.139
Shoulder Right   0.168       0.149
Elbow Right      0.196       0.171
Wrist Right      0.247       0.223
Shoulder Left    0.168       0.155
Elbow Left       0.195       0.180
Wrist Left       0.245       0.234
Hip Right        0.151       0.133
Knee Right       0.156       0.137
Ankle Right      0.173       0.151
Hip Left         0.151       0.135
Knee Left        0.156       0.141
Ankle Left       0.172       0.155
Eye Right        0.177       0.159
Eye Left         0.177       0.161
Ear Right        0.170       0.154
Ear Left         0.171       0.156
might face forward, resulting in incorrect motion re-
construction. In contrast, 2CM-GPT maintains joint
positions and orientations closer to the input motion,
accurately reproducing the overall motion. This sug-
gests that using 2D convolution in 2CM-GPT enables
better modeling of spatial relationships between joints
compared with using 1D convolution in MotionGPT.
4.5 Text-Guided Motion Generation
We qualitatively evaluated the accuracy of text-
guided 2D motion generation using 2CM-GPT, which
uses a motion tokenizer with improved motion-
reconstruction accuracy. The 2D motions generated
by inputting various texts are shown in Figure 4. The
results in Figure 4 indicate that the generated 2D mo-
tions align with the input texts. Notably, the start
and end points of actions, such as leg kicks and hand
claps, are accurately represented, and the consistency
of movements is maintained. 2CM-GPT effectively
interprets the semantic information in the text and
translates it into 2D motion, demonstrating high per-
formance in both accuracy and realism.
4.6 Application of Text-Guided 2D Motion Generation
We applied the 2D motion generated with 2CM-GPT
from text to pose-guided human video generation,
demonstrating the effectiveness of 2CM-GPT through
qualitative evaluation. For pose-guided human video
generation, we used DisCo (Wang et al., 2024), a
method characterized by faithfulness, generalizabil-
(a) A person kicks with their right leg.
(b) A man jumps and brings both arms above his head as he spreads his legs then moves them back
into the original position.
Figure 4: Text-guided motion generation results from 2CM-GPT. The corresponding input texts are shown in (a) and (b).
Figure 5: Human video generation results using 2D motion generated with 2CM-GPT. The input text used for generating the
2D motion is “A person claps their hands together well above their head.”
ity, and compositionality. The result of inputting the
2D motion generated with 2CM-GPT into DisCo is
shown in Figure 5. The results in this figure indicate
that the input 2D motion results in natural actions in
the generated human video. The 2D motion gener-
ated with 2CM-GPT was seamlessly integrated into
the output of DisCo, confirming the effectiveness of
2CM-GPT.
4.7 Ablation Studies
We evaluated the appropriate codebook size K by training the motion tokenizer with different codebook sizes. The accuracy of motion reconstruction using the motion tokenizer, calculated using MPJPE, FID, and diversity, is shown in Table 3. The table shows that K = 32 is appropriate for the motion tokenizer. This suggests that a codebook that is too large cannot represent 2D motion well, since 2D motion has relatively few features. However, if the codebook size K is too small, the motions generated by the language model become monotonous and less diverse, which suggests that the language model cannot learn the relationship between text and motion well because there are not enough tokens to represent motion.
Table 3: Comparison of accuracy of motion reconstruction by the motion tokenizer in 2CM-GPT with different codebook sizes. 2CM-GPT is used as the feature extractor.

Codebook size   MPJPE   FID      Diversity
K=16            0.122    7.816   14.251
K=32            0.114    8.313   14.049
K=64            0.134   15.933   13.104
K=128           0.161   25.150   12.129
K=256           0.176   34.037   11.054
K=512           0.175   33.580   11.213
K=1024          0.185   40.502   10.952
5 CONCLUSIONS
We proposed 2CM-GPT for generating 2D motion
from text. By adopting 2D convolution for the mo-
tion tokenizer, 2CM-GPT is able to more accurately
model spatial relationships between joints compared
with methods using 1D convolution. Evaluation ex-
periments demonstrated that 2CM-GPT achieved su-
perior accuracy in motion-reconstruction tasks both
quantitatively and qualitatively and showed high per-
formance in text-guided 2D motion generation. Ap-
plying the motions generated with 2CM-GPT to
pose-guided human video generation confirmed that
the resulting videos exhibited natural movements.
These results indicate the practicality of 2CM-GPT
in motion-generation tasks. Future work will include
training with more diverse motion datasets, introduc-
ing advanced architectures to further strengthen the
relationship between text and motion, and exploring
other possible applications for 2D motion generation.
REFERENCES
Ahn, H., Ha, T., Choi, Y., Yoo, H., and Oh, S. (2018).
Text2Action: Generative Adversarial Synthesis from
Language to Action. In 2018 IEEE International Con-
ference on Robotics and Automation (ICRA), pages
5915–5920. IEEE.
Ahuja, C. and Morency, L.-P. (2019). Language2Pose: Nat-
ural Language Grounded Pose Forecasting. In 2019
International Conference on 3D Vision (3DV), pages
719–728. IEEE.
Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T.,
and Yu, G. (2023). Executing your Commands via
Motion Diffusion in Latent Space. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 18000–18010.
Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau,
D., Bougares, F., Schwenk, H., and Bengio, Y.
(2014). Learning Phrase Representations using RNN
Encoder-Decoder for Statistical Machine Translation.
In Proceedings of the 2014 Conference on Empiri-
cal Methods in Natural Language Processing, pages
1724–1734. ACL.
Graves, A. and Graves, A. (2012). Long Short-Term Mem-
ory. Supervised sequence labelling with recurrent
neural networks, pages 37–45.
Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., and
Cheng, L. (2022). Generating Diverse and Natural
3D Human Motions From Text. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 5152–5161.
Guo, C., Zuo, X., Wang, S., Zou, S., Sun, Q., Deng, A.,
Gong, M., and Cheng, L. (2020). Action2Motion:
Conditioned Generation of 3D Human Motions. In
Proceedings of the 28th ACM International Confer-
ence on Multimedia, pages 2021–2029.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and
Hochreiter, S. (2017). GANs Trained by a Two Time-
Scale Update Rule Converge to a Local Nash Equi-
librium. Advances in neural information processing
systems, 30.
Hochreiter, S. and Schmidhuber, J. (1997). Long Short-
Term Memory. Neural Comput., 9(8):1735–1780.
Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., and Chen,
T. (2024). MotionGPT: Human Motion as a Foreign
Language. Advances in Neural Information Process-
ing Systems, 36.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Dollár, P., and Zitnick, C. L. (2014).
Microsoft COCO: Common Objects in Context. In
Computer Vision–ECCV 2014: 13th European Con-
ference, Zurich, Switzerland, September 6-12, 2014,
Proceedings, Part V 13, pages 740–755. Springer.
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., and
Black, M. J. (2015). SMPL: A Skinned Multi-Person
Linear Model. ACM Trans. Graphics (Proc. SIG-
GRAPH Asia), 34(6):248:1–248:16.
Mahmood, N., Ghorbani, N., Troje, N. F., Pons-Moll, G.,
and Black, M. J. (2019). AMASS: Archive of Mo-
tion Capture as Surface Shapes. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion, pages 5442–5451.
Petrovich, M., Black, M. J., and Varol, G. (2022). TEMOS:
Generating diverse human motions from textual de-
scriptions. In European Conference on Computer Vi-
sion, pages 480–497. Springer.
Plappert, M., Mandery, C., and Asfour, T. (2016). The KIT
Motion-Language Dataset. Big Data, 4(4):236–252.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., et al. (2021). Learning Transferable Visual Models
From Natural Language Supervision. In International
conference on machine learning, pages 8748–8763.
PMLR.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020).
Exploring the Limits of Transfer Learning with a Uni-
fied Text-to-Text Transformer. Journal of machine
learning research, 21(140):1–67.
Razavi, A., Van den Oord, A., and Vinyals, O. (2019). Gen-
erating Diverse High-Fidelity Images with VQ-VAE-
2. Advances in neural information processing systems,
32.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence
to Sequence Learning with Neural Networks. In Ad-
vances in Neural Information Processing Systems 27:
Annual Conference on Neural Information Processing
Systems, pages 3104–3112.
Tevet, G., Gordon, B., Hertz, A., Bermano, A. H., and
Cohen-Or, D. (2022). MotionCLIP: Exposing Human
Motion Generation to CLIP Space. In European Con-
ference on Computer Vision, pages 358–374. Springer.
Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-or, D.,
and Bermano, A. H. (2023). Human Motion Diffusion
Model. In The Eleventh International Conference on
Learning Representations.
Van Den Oord, A., Vinyals, O., et al. (2017). Neural Dis-
crete Representation Learning. Advances in neural in-
formation processing systems, 30.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
(2017). Attention Is All You Need. In Proceedings of
the 31st International Conference on Neural Informa-
tion Processing Systems, NIPS’17, page 6000–6010.
Curran Associates Inc.
Wang, T., Li, L., Lin, K., Zhai, Y., Lin, C.-C., Yang, Z.,
Zhang, H., Liu, Z., and Wang, L. (2024). DisCo: Dis-
entangled Control for Realistic Human Dance Gener-
ation. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 9326–9336.
Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y.,
Zhao, H., Lu, H., and Shen, X. (2023). T2M-GPT:
Generating Human Motion from Textual Descriptions
with Discrete Representations. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR).
Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L.,
and Liu, Z. (2024). MotionDiffuse: Text-Driven Hu-
man Motion Generation with Diffusion Model. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence.