Evaluation of Body Parts Representations in Motion Reconstruction

Philippe de Clermont Gallerande [1,2,a], Quentin Avril [2,b], Philippe Henri Gosselin [2,c], Ferran Argelaguet [1,d] and Ludovic Hoyet [1,e]

[1] Inria, Univ. Rennes, CNRS, IRISA, France

[2] InterDigital, Rennes, France

{philippe.de-clermont-gallerande, ferran.argelaguet, ludovic.hoyet}@inria.fr,

Keywords:

Animation, Neural Networks, Human Motion, Body Parts.

Abstract:

Acquiring, encoding, transmitting, decoding, and displaying motion signals is an essential challenge in our new world of interconnected immersive applications (XR, online games, etc.). In addition to being potentially disturbed by multiple factors (e.g., signal noise, latency, packet loss), this motion data should be modifiable and customizable to fit the needs of specific applications. Simultaneously, several approaches have successfully proposed to explicitly integrate the semantics of the human body in a deep learning framework by separating it into smaller parts. We propose to use such an approach to obtain robust streamed animation data. Specifically, we create and train several neural networks on the motion of different body parts independently from each other. We further compare the performance of several body decompositions using multiple objective reconstruction metrics. Finally, we show that this Body Parts approach brings new opportunities compared to a compact one, such as perfectly partitioned and more interpretable motion data, while obtaining comparable reconstruction results.

1 INTRODUCTION

In an increasingly digital and connected world, applications showcasing virtual humans for video calls and multi-user immersive applications are rapidly spreading. In the foreseeable future, such programs will typically need to represent users with customizable and realistic virtual avatars driven in real time by their own movements. As with current requirements for video streams, novel constraints will then be introduced on the reliability and quality of the transmitted motion data, in addition to traditional human motion modeling needs (e.g., character animation).

In particular, a number of errors can be introduced in the motion data, e.g., due to missing or drifting sensors, user occlusions, or packet losses (i.e., transmission disturbances). As current approaches rely on a holistic representation of the human body motion, such errors typically affect the complete reconstructed body motion, which can be detrimental to the user experience. To illustrate this problem, consider a situation where only the upper body of the user is visible (e.g., facing a webcam): in this scenario, any erroneous information about the lower body pose would be incorporated in the entangled representation and result in an incorrect reconstruction of the whole-body animation on the remote client side.

https://orcid.org/0009-0002-2151-8135

https://orcid.org/0000-0002-0101-3351

https://orcid.org/0000-0002-0973-4030

https://orcid.org/0000-0002-6160-8015

https://orcid.org/0000-0002-7373-6049

Figure 1: The different body decompositions evaluated in this work. A Body Parts (BPs) model is a wrapper of one or several neural networks, each reconstructing a body part motion. They are linked together by overlapping joints: the BPs Connectors (orange circles). From left to right: BPs1 (whole-body), BPs2 (upper/lower body), BPs3 (spine, arms and legs) and BPs5 (spine, right/left arm, right/left leg).

de Clermont Gallerande, P., Avril, Q., Gosselin, P. H., Argelaguet, F. and Hoyet, L.
Evaluation of Body Parts Representations in Motion Reconstruction.
DOI: 10.5220/0013138200003912
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2025) - Volume 1: GRAPP, HUCAPP and IVAPP, pages 53-64
ISBN: 978-989-758-728-3; ISSN: 2184-4321

To tackle these challenges, we propose in this paper to explore the value of a semantically partitioned representation of the motion data (i.e., a body parts-based representation). Such representations have already been successfully leveraged for various motion modeling tasks. Specifically, their common key interest was to precisely model and act on the local motion data, thus driving changes at a global scale.

Taking inspiration from this idea, we explore the potential benefits of independent body parts-based representations in the context of motion data streaming. More precisely, we propose a novel human motion modeling framework enabling the creation of distinct body parts models, which we evaluate on a motion reconstruction task through several experiments. We believe that such an approach can provide: a) a robust and semantically partitioned representation that restrains local degradation (e.g., noise or errors) of the whole-body (global) animation to the motion of specific body parts; b) an efficient solution to update only specific body parts at a time, which could be valuable in situations where some of them might not be available or relevant (e.g., an occluded arm motion, or a lower-body pose sequence in a seated meeting context); c) a novel and intuitive solution to animate complex non-humanoid characters based on body semantics, for instance to accommodate applications where users are not necessarily represented by anthropomorphic characters (e.g., Cheymol et al., 2023).

Although such an approach comes with potential limitations in some use cases (e.g., motion synthesis), the results of our experiments on a pure motion reconstruction task suggest that independent sub-models lead to performance similar to that of a whole-body model. Such results could imply that modeling every body motion correlation might not be relevant in all applications, in contrast to recent works in deep learning-based character animation, which tend to involve the creation of complex and holistic neural networks.

The key contributions of this paper are therefore:

• A novel and highly modular human motion mod-

eling framework to explicitly create semantic rep-

resentations.

• A ﬁrst comparative study of the impact of the

body decomposition granularity in a reconstruc-

tion task involving several noisy scenarios.

2 RELATED WORK

In this section, we present related research on deep-

based human motion processing with whole-body

(Section 2.1) and body parts approaches (Section 2.2).

2.1 Whole Body Motion Modeling

Over the years, many approaches have been proposed

to edit (Aristidou and Lasenby, 2011), organize (Ko-

var et al., 2002), or synthesize skeletal animations (Ko

and Badler, 1996) based on a variety of heuristics or

paradigms. As their limits became apparent on in-

creasingly challenging poses, data-based approaches

that can model stochastic processes gained interest.

The spread of deep neural networks to model non-

linear manifolds (Mourot et al., 2022; Chen, 2023)

further increased their performances. For instance,

the pioneering work from Holden et al. (2015) lever-

ages a convolutional network to learn a motion mani-

fold, which can then be used to denoise human anima-

tions. Increasingly complex architectures were later

proposed to model the spatiotemporal correlations of

the human body, such as graph convolutions to exploit

the topological information of the human body at various scales (Dang et al., 2021). Along with improvements in the deep learning literature such as attention layers and Transformers, performance increased (Shu et al., 2022; Mao et al., 2020), but these models were still focused

on a speciﬁc skeleton. This issue was then addressed

through retargeting tasks (Aberman et al., 2020), cul-

minating in the use of transformers to learn motion

semantics (Lee et al., 2023a; Zhang et al., 2024).

In the context of motion synthesis, Generative

Adversarial Networks (Liu et al., 2021; Malek-

Podjaski and Deligianni, 2023) and Variational Auto-

Encoders (Yan et al., 2018) have been widely used.

These architectures produce coherent latent spaces

that can be used to interpolate between learned sam-

ples and generate new ones that respect the data dis-

tribution constraints. Various strategies have been de-

signed to ease the training without losing interesting

features, such as using multi-task learning (Li et al.,

2022a; Butepage et al., 2017). Others have lever-

aged prior knowledge (e.g., motion periodicity) to dy-

namically change the approximated manifold (Holden

et al., 2017; Zhang et al., 2018). Starke et al. (2022)

extended this approach to learn the motion phase

manifold by creating approximated sinusoids in the

latent space. Finally, other recent approaches explore

reconstructing skeleton motions from sparse signals,

which is a severely ill-deﬁned problem, particularly

when the pelvis transformation is not available (Yang

et al., 2021). To compensate for the lack of input

signals to condition the network, deep reinforcement

learning (Ye et al., 2022) and physics simulation were

leveraged (Jiang et al., 2022; Lee et al., 2023b). Reda

et al. (2023) extended these approaches to solve a

combination of the previous challenges by retargeting

motions on other topologies using such sparse signals.

GRAPP 2025 - 20th International Conference on Computer Graphics Theory and Applications

2.2 Body Parts-Based Representations

Separating the body into parts is a well-established

idea for its modeling, used by Hanavan (1964) to

obtain a mathematical model and later leveraged

by Zhou et al. (2014) to design a multivariate Gaus-

sian process model based on inter-joints correlations.

It was ﬁrst exploited in a Deep Learning context

by Wang and Neff (2015) in their multi-channel auto-

encoder and shortly after to evaluate a structural

RNN (Jain et al., 2016). Since then, it has been

widely used either by imitating the hierarchical struc-

ture over the depth of the network (Shao et al., 2019;

Li et al., 2019) or by progressively incorporating each body part's features (Li et al., 2022b). Wang et al. (2022) used multiple hierarchical decompositions of the body and the different intrinsic relationship properties between nodes, merged together by a dedicated information fusion network, to perform a robust human semantic parsing of challenging images.

Other recent works used body parts as basic units for various tasks. For example, Hu et al. (2024) used them

for skeletal motion retargeting, feeding body part fea-

tures to attention layers to obtain a ﬁxed-size vector,

independent of the number of input joints. These la-

tent vectors were then stacked, added to the features

of the same dimension from the targeted skeleton, and

decoded. To some extent, this approach is reminiscent

of the “primal skeleton” proposed by Aberman et al.

(2020). Body decompositions were also used to have

a ﬁner control of the local motion style of each body

part (Jang et al., 2022), e.g., enabling the addition of

different style features on the legs and arms. Interest-

ingly, this approach leveraged retargeting principles,

although used to transfer motion styles and not skele-

tons, providing ﬂexibility for modifying the global

motion style by acting locally. Zou et al. (2025) used

such an approach to train independent generators for

each body part motion, coupled with a part coordination module for global coherency, which were then used to obtain more detailed text-to-motion results.

Another advantage is that each body part can act

as a building block that can be added, removed, or

swapped. Lee et al. (2022) leveraged this modular-

ity to create and animate chimera characters, where

the different body parts animations came from prede-

ﬁned virtual source characters equipped with simple

motions such as walking, running, jumping, or punch-

ing. Deep Reinforcement Learning was then used to

assemble body parts and synthesize synchronized and

realistic motions, in combination with energy-aware

rewards and editing techniques (similar to Dynamic

Time Warping).

These works demonstrate the value of body part approaches on a number of topics related to human motion modeling. First, they allow the learning to focus on local, high-frequency features and prevent excessive averaging at the global level. Second, they often lead to models with greater flexibility and versatility. However, even though various body parts decompositions have been used in the past, no evaluation has been conducted yet on their relative performance.

3 BODY PARTS MODELING

We describe here our human motion modeling frame-

work for creating independent body parts sub-models.

First, Section 3.1 deﬁnes the different Body Parts

(BPs) models corresponding to different body decom-

positions. Then, Section 3.2 describes the connectors

articulating different sub-models. Finally, Section 3.3

elaborates on the combination of the body parts mo-

tions to produce the ﬁnal result.

3.1 Body Parts Deﬁnition

To evaluate several body parts approaches, we first create a common structure. We start by splitting an animation skeleton into one or several sub-graphs at various granularities. Each of these sets of trees corresponds to a specific hierarchy of body parts and forms a Body Parts (BPs) model. A BPs model is thus a wrapper around one or several neural networks, each tasked with the motion modeling of a body part (i.e., a semantic sub-graph). In other words, a BPsn model uses n neural networks trained independently on the motions of n distinct body parts.

Formally, for animation data $X \in \mathbb{R}^{T \times J \times F}$, with $T$ frames, an entire skeleton of $J$ joints and $F$ features, a body part $bp$ sub-model will be trained on the tensor $X_{bp} \in \mathbb{R}^{T \times J_{bp} \times F}$, with $J_{bp} \leq J$ the number of joints in the body part.

Conveniently, the particular case of a single-body

part model, BPs1, is our baseline: a whole-body ap-

proach (Holden et al., 2015). Then, we progressively

increase the number of body parts while keeping a se-

mantic decomposition. Based on previous works in

the literature, we design some BPs models (Figure 1):

• BPs1 wraps one body part: Whole-Body

• BPs2 wraps two body parts: Upper-Body and

Lower-Body (Yang et al., 2021)

• BPs3 wraps three body parts: Both-Arms, Spine,

and Lower-Body (Jang et al., 2022)

• BPs5 wraps ﬁve body parts: Right-/Left-Arm,

Right-/Left-Leg, and Spine (Hu et al., 2024)


3.2 Body Parts Connectors

The literature highlights diverse methods to effec-

tively aggregate the information from different body

parts, e.g., using graph-based approaches with seman-

tic edges (Wang et al., 2022) or skeleton-based pool-

ing (Aberman et al., 2020; Jang et al., 2022). How-

ever, these approaches lead to interdependence be-

tween the body parts, which can impair ﬂexibility and

generalization (either to topologies, morphologies or

motions). With the goal of proposing a versatile, par-

titioned, and semantic human motion model, we in-

stead chose to use a simpler approach to connect the

sub-models. Speciﬁcally, we reduce the overlap be-

tween different body parts to a single joint, which we

call the BPs Connector. This joint thus links two con-

nected body parts (as illustrated in Figure 1). Early

results suggested that keeping this joint (at the cost

of managing the overlap) was important to create the

local manifolds as it often acts as the root of the tree

and thus impacts the whole kinematic chain.

Although this choice introduces a number of strong constraints in some contexts, as we explicitly discard long-range correlations, it also presents a number of benefits that are further explored in Section 5. First, as the body parts motions are not conditioned on each other, we obtain independent sub-models. This means that bending the right arm (Right-Arm model) has no effect on the knee orientation (Right-Leg model). Even though this makes some motion ambiguities harder to resolve, e.g., in locomotion (no upper/lower body coordination), such a model could be less biased in handling other similar motions (e.g., walking while carrying an object). In other words, it should suffer less from overfitting and generalize more easily to diverse motion datasets. Second, this separation of the BPs sub-models enables fully independent (and parallel) training. It also provides an unexpected benefit, as different sub-models can thus be trained or fine-tuned on specific datasets (cf. Section 5). Finally, this separation into independent sub-models makes it possible to work on (generate, reconstruct, edit) the animation of a body part without impacting anything else. For example, we can generate the animation of an arm moving alone, or attach it to an arbitrary kinematic chain.

3.3 Body Parts Combination

To obtain an end-to-end skeleton reconstruction regardless of the body decomposition, we designed a generic strategy to concatenate the sub-graph motions. For anatomical reasons, we make the assumption that the sub-model corresponding to the motion of the axial skeleton (i.e., the spine) is the most important for the global animation. This body part sub-model is framed in blue in Figure 1, i.e., Upper-Body for BPs2, and Spine for BPs3 and BPs5. This prioritization choice is also motivated by a) the comparatively easier task of predicting the spine movements compared to the limbs and b) the intuitiveness of having the torso and hips lead the overall motion.

Figure 2: Inference of two Body Parts models with different body decompositions. (Top) a single body part model (BPs1) and (bottom) a multiple body parts one (BPs2). [Diagram labels: Input Animation, Split, Concat, Output Animations.]

Indeed, combining the different sub-models

through their common joint leads to two practical im-

plementations: a) overwriting the child BPs Connec-

tor rotation or b) interpolating them with speciﬁc hy-

perparameters as weights. In practice, we restricted

ourselves to scenario (a) because of its faster infer-

ence time, comparatively good results and explicit

separation of the body parts in line with the rest of

our work. The pipeline also includes a splitting opera-

tion (cf. Figure 2), which is a straightforward process

leveraging masks along the joints’ dimension.
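The split and combination steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the 6-joint toy skeleton, the joint indices, the part names and the helper names `split` and `combine` are all hypothetical, and "overwriting the child BPs Connector rotation" is interpreted here as letting the prioritized (parent) part write the shared joint last.

```python
import numpy as np

# Hypothetical toy skeleton with 6 joints (NOT the paper's 31-joint layout):
# 0 = hips, 1 = spine, 2 = chest, 3 = left knee, 4 = right knee, 5 = head.
BODY_PARTS = {
    "upper": [0, 1, 2, 5],  # prioritized part (contains the axial skeleton)
    "lower": [0, 3, 4],     # child part; joint 0 is the shared BPs Connector
}
PARENT_PART = "upper"


def joint_mask(joints, num_joints):
    """Boolean mask along the joint dimension for one body part."""
    mask = np.zeros(num_joints, dtype=bool)
    mask[joints] = True
    return mask


def split(motion, body_parts):
    """Slice a (T, J, F) motion into one (T, J_bp, F) array per body part."""
    num_joints = motion.shape[1]
    return {name: motion[:, joint_mask(joints, num_joints), :]
            for name, joints in body_parts.items()}


def combine(parts, body_parts, num_joints, parent_part):
    """Concatenate body part motions into a (T, J, F) skeleton motion.

    Child parts are written first and the parent part last, so the
    parent's value for a shared connector joint overwrites the child's.
    """
    first = next(iter(parts.values()))
    out = np.zeros((first.shape[0], num_joints, first.shape[2]), dtype=first.dtype)
    order = [n for n in body_parts if n != parent_part] + [parent_part]
    for name in order:
        out[:, joint_mask(body_parts[name], num_joints), :] = parts[name]
    return out


rng = np.random.default_rng(0)
motion = rng.normal(size=(61, 6, 6))   # T = 61 frames, 6 toy joints, F = 6
parts = split(motion, BODY_PARTS)      # each sub-model would process its own slice
parts["lower"][:, 0, :] = 99.0         # simulate a corrupted connector in the child part
rebuilt = combine(parts, BODY_PARTS, num_joints=6, parent_part=PARENT_PART)
```

In this sketch the corrupted connector value produced by the child part never reaches the output, because the prioritized part writes the shared joint last; the interpolation variant (b) would instead blend the two connector rotations with a weight.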

In the end, we thus work at a level of abstraction that helps to track the networks' performance and compare the different body decompositions.

4 EXPERIMENT

In this section, we detail the experiments conducted to

evaluate the BPs models. Section 4.1 provides train-

ing details (i.e., datasets, architectures and losses),

while Section 4.2 describes our experimental results.

4.1 Training

4.1.1 Dataset & Implementation Details

For our experiments, we rely on several motion cap-

ture datasets processed by Holden et al. (2016). We


particularly rely on the CMU part, a choice moti-

vated by its size, its widespread use, and the extensive

range of covered motions. 60% of the CMU dataset

is used for training, 20% for validation and hyperpa-

rameters tuning, while the rest of the sequences and

other datasets are kept for evaluation.

Since the skeleton is normalized across the datasets, we base our definition of a motion sequence on the joint rotations. A motion sequence is therefore a tensor $X \in \mathbb{R}^{T \times J_{bp} \times F}$ with $T$ the number of frames, $J_{bp}$ the number of joints in the body part, and $F$ the dimension of our rotation representation. In our experiment, $T = 61$, which corresponds to approximately 0.5 s at 120 Hz, $J_{bp} = 31$ when considering the entire skeleton, and $F = 6$ as suggested by Zhou et al. (2019).
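As an illustration of the $F = 6$ rotation representation of Zhou et al. (2019), the sketch below converts 6D features to rotation matrices with a Gram-Schmidt step and back. It assumes that the six values store the first two columns of the rotation matrix one after the other; the paper does not specify its memory layout, and the function names are hypothetical.

```python
import numpy as np


def rotation_6d_to_matrix(x):
    """Map (..., 6) features to (..., 3, 3) rotation matrices via Gram-Schmidt."""
    a1, a2 = x[..., :3], x[..., 3:6]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Remove the component of a2 along b1, then normalize.
    a2_orth = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)
    # b1, b2, b3 become the columns of the rotation matrix.
    return np.stack([b1, b2, b3], axis=-1)


def matrix_to_rotation_6d(rot):
    """Keep the first two columns of (..., 3, 3) matrices as (..., 6) features."""
    return np.concatenate([rot[..., :, 0], rot[..., :, 1]], axis=-1)


rng = np.random.default_rng(0)
features = rng.normal(size=(61, 31, 6))      # T = 61 frames, J = 31 joints, F = 6
rotations = rotation_6d_to_matrix(features)  # shape (61, 31, 3, 3)
round_trip = rotation_6d_to_matrix(matrix_to_rotation_6d(rotations))
```

Any 6D vector with two linearly independent halves maps to a valid rotation, which is one reason this representation is convenient as a network output.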

To mitigate overfitting, we use a random sampling strategy with a fixed number of sequences per animation for training, and we switch to sequential sampling during evaluation. We manually fix the seed of the random generators to enable a fair and reproducible comparison between models. We use the AdamW optimizer with a learning rate of $10^{-4}$ and a weight decay of $10^{-2}$, together with learning rate schedulers at the epoch (decay) and step (cyclical triangular) scales. The batch size is fixed at 16.

4.1.2 Architectures

Since one of our objectives is to create and compare robust, efficient, and effective representations of our data, we use simple Auto-Encoder-like architectures. We take inspiration from siMLPe for the encoder and decoder layers (Guo et al., 2023) to benefit from its good performance, its small number of trainable parameters and its reduced time complexity. In the following, we describe our main architectures:

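The first is an Auto-Encoder (AE). Let $\phi$ be the encoder and $\psi$ the decoder. The architecture is thus described as $\phi : \mathbb{R}^{T \times J_{bp} \times F} \to \mathbb{R}^{d}$ and $\psi : \mathbb{R}^{d} \to \mathbb{R}^{T \times J_{bp} \times 6}$, with (here) $d < T \cdot J_{bp} \cdot F$.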

The second is a Periodic Auto-Encoder

(PAE) (Starke et al., 2022). It exploits the in-

herent periodicity of the motion by constraining the

network to produce a latent phase manifold.

The third is a Variational Auto-Encoder (VAE).

It enables the automatic creation of a continuous la-

tent space, which is extensively used in generation

and leads to a strong regularization. It takes root in

Bayesian inference and assumes that its optimized la-

tent representation will approximate the real data dis-

tribution, implicitly capturing the hidden causes be-

hind it. The initial prior distribution and subsequent

parameterization are Gaussians in the VAE original

and widespread form.

Finally, there is the Hyperspherical Variational Auto-Encoder (S-VAE) (Davidson et al., 2018). This VAE takes a von Mises-Fisher (vMF) distribution as prior instead of a Gaussian to better fit hyperspherical structures such as rotations. As originally reported, the vMF collapses in higher dimensions, which led us to use several low-dimensional latent channels.

Specifically, in our experiments, each sub-model latent code has a size $d_{bp} = \lfloor 100 / n \rfloor$, with $n$ the number of body parts, to obtain a comparable total dimension of $d \approx 100$ for all BPs models. Since early experiments on tuning the size of the latent vectors did not significantly change the results, we chose to divide the latent size equally between body parts depending on the addressed granularity, which would also generalize to other decompositions in the future.
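The equal split of the latent budget can be checked with a short computation. The sketch assumes the per-part size is the floor of 100 divided by the number of body parts, which is the reading suggested by the "comparable dimension of d ≈ 100" stated above; the function name is hypothetical.

```python
TOTAL_LATENT = 100  # target total latent dimension shared by all BPs models


def latent_size_per_part(num_parts, total=TOTAL_LATENT):
    """Latent size of one sub-model when the budget is split equally."""
    return total // num_parts


sizes = {n: latent_size_per_part(n) for n in (1, 2, 3, 5)}
totals = {n: n * d_bp for n, d_bp in sizes.items()}

for n in sizes:
    print(f"BPs{n}: d_bp = {sizes[n]}, total = {totals[n]}")
```

Under this assumption only BPs3 deviates from the budget, with a total latent dimension of 99 instead of 100.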

4.1.3 Models Complexity

In this Section, we report the models complexity as

number of trainable parameters and inference time.

We report the mean and standard deviation of the

number of parameters in several BPs models based on

different sub-model architectures (cf. Section 4.1.2).

Overall, BPs1 is the largest with 184,221 ± 378 parameters, BPs2 and BPs3 are the smallest with respectively 146,423 ± 362 and 145,304 ± 361 parameters, and BPs5 lies in between with 171,968 ± 374 parameters.

In terms of inference times, all our models are below 0.08 s, with variations depending on the architecture of the sub-models (e.g., AE-based models ≈ 0.005 s, VAE-based models ≈ 0.012 s). In particular, BPs2, BPs3 and BPs5 necessarily need the

forward pass of several models to produce a whole

skeleton. Such a process could be parallelized, but

we decided to leave it sequential in our experiment

to prevent the need for several GPUs. However, this

choice leads to a steady increase in the time complex-

ity with the number of sub-models (e.g. an AE-based

BPs1 is 4 times faster than an AE-based BPs5).

4.1.4 Loss

We train our models with a total loss described in this

Section, which includes some reconstruction losses

and regularization terms. Note that we also use some

of the following as pure evaluation metrics, up to a

scaling factor.

• Mean Squared Error (MSE) compares term-wise

the ground-truth and predicted matrices.

• Mean Per Joints Angular Error (MPJAE) is a geodesic distance on SO(3). We first estimate the difference rotation between $R_{gt}$ and $R_{pred}$ (global scale) before computing its angle. Formally: $R_{diff} = R_{gt}^{\top} R_{pred}$ and $\theta_{diff} = \arccos\left(\frac{\mathrm{tr}(R_{diff}) - 1}{2}\right)$.

• Mean Per Joints Local Angular Error (MPJLAE) is like the MPJAE, but computed at the local scale.

• The Angular Velocity (AV) is estimated by keep-

ing the geodesic distance results over several sub-

sequent frames. We then approximate the angu-

lar velocity by observing the angle variations over

multiple frames. In practice, we average the an-

gular difference every 10 frames to prevent an ex-

plosion of the metric in initial steps.

• The Angular Acceleration (AA) is estimated us-

ing the results of the angular velocity.

• Mean Per Joints Positional Error (MPJPE), which

computes the distance of the global position of the

joints between ground-truth and prediction.

• Jerk, which compares the 3rd order time deriva-

tive of the global positions between ground-truth

and prediction.
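The geodesic angular error used by the MPJAE and MPJLAE terms in the list above can be sketched as follows. The difference rotation is assumed to be the transpose of the ground-truth rotation multiplied by the prediction (the multiplication order does not change the resulting angle), and the helper `rot_z` only exists to build test rotations; both function names are hypothetical.

```python
import numpy as np


def geodesic_angle(rot_gt, rot_pred):
    """Angle (radians) of the rotation taking rot_gt to rot_pred, batched over leading axes."""
    rot_diff = np.swapaxes(rot_gt, -1, -2) @ rot_pred
    trace = np.trace(rot_diff, axis1=-2, axis2=-1)
    # Clip to guard against values slightly outside [-1, 1] due to rounding.
    cos_theta = np.clip((trace - 1.0) / 2.0, -1.0, 1.0)
    return np.arccos(cos_theta)


def rot_z(angle):
    """Rotation of `angle` radians around the z axis (test helper)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


single = geodesic_angle(rot_z(0.2), rot_z(0.5))
batch_gt = np.stack([rot_z(0.0), rot_z(1.0)])
batch_pred = np.stack([rot_z(0.0), rot_z(0.25)])
per_joint = geodesic_angle(batch_gt, batch_pred)
mean_error_deg = np.degrees(per_joint.mean())
```

Averaging the per-joint angles over joints and frames, then converting to degrees, yields a value in the unit reported for the MPJAE in the result tables.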

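The complete loss used for training is then:

$$\mathcal{L}_{recon} = w_{MSE} \cdot \mathcal{L}_{MSE} + w_{MPJPE} \cdot \mathcal{L}_{MPJPE} + w_{MPJAE} \cdot \mathcal{L}_{MPJAE} + w_{MPJLAE} \cdot \mathcal{L}_{MPJLAE} + w_{AV} \cdot \mathcal{L}_{AV} + w_{AA} \cdot \mathcal{L}_{AA} + w_{Jerk} \cdot \mathcal{L}_{Jerk} \quad (1)$$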

If a sub-model is an instance of a VAE, we further add the corresponding Kullback-Leibler Divergence with its own weight ($w_{KL}$), and a cyclical KL annealing (Fu et al., 2019) to mitigate KL vanishing. In the experiments described in the next section, we used $w_{MSE} = 1$, $w_{MPJAE} = 1$, $w_{MPJLAE} = 0.1$, $w_{AV} = 0.001$, $w_{AA} = 0.01$, $w_{MPJPE} = 0.1$, $w_{Jerk} = 0.001$ and $w_{KL} = 0.005$, and we further added an L1-norm on the weights as regularization to complement the weight decay.
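The weighted sum of Equation 1 and the optional KL term can be sketched as below. The assignment of 0.001 and 0.01 to the angular velocity and acceleration terms, and of 0.005 to the KL term, is an assumption based on the order of the terms in Equation 1. The cyclical schedule follows the general idea of Fu et al. (2019), a linear ramp from 0 to 1 followed by a plateau within each cycle, but `cycle_length` and `ramp_ratio` are hypothetical values that the paper does not report.

```python
LOSS_WEIGHTS = {
    "MSE": 1.0,
    "MPJAE": 1.0,
    "MPJLAE": 0.1,
    "AV": 0.001,    # assumed assignment
    "AA": 0.01,     # assumed assignment
    "MPJPE": 0.1,
    "Jerk": 0.001,
}
KL_WEIGHT = 0.005   # assumed assignment


def cyclical_kl_scale(step, cycle_length=1000, ramp_ratio=0.5):
    """Scale in [0, 1]: linear ramp over the first part of each cycle, then 1."""
    position = (step % cycle_length) / cycle_length
    return min(1.0, position / ramp_ratio)


def total_loss(terms, kl=None, step=0):
    """Weighted sum of the reconstruction terms, plus the annealed KL term for VAEs."""
    loss = sum(weight * terms[name] for name, weight in LOSS_WEIGHTS.items())
    if kl is not None:
        loss += KL_WEIGHT * cyclical_kl_scale(step) * kl
    return loss


unit_terms = {name: 1.0 for name in LOSS_WEIGHTS}
recon_only = total_loss(unit_terms)
with_kl = total_loss(unit_terms, kl=1.0, step=500)
```

With every individual term equal to 1, the reconstruction part sums to 2.212, which makes the relative importance of the MSE and MPJAE terms in this weighting easy to see.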

4.2 Results

In this section, we present our different evaluation

steps. The ﬁrst focuses on the objective robustness,

which includes the reconstruction accuracy in perfect

(Section 4.2.1) and noisy (Section 4.2.2) conditions.

Then we observe the interpretability (Section 4.2.3)

and controllability (Section 4.2.4) of this partitioned

latent representation. In the following, the reported

means and standard deviations are computed based on

the performance of the BPs models on the test anima-

tions of the corresponding dataset(s) for seven gener-

ation seeds.

4.2.1 Reconstruction Accuracy

This section first reports the reconstruction metrics computed on the original dataset (CMU) test subset (Table 1, top). As introduced previously, CMU is a very large dataset involving numerous motions (4,164,000 frames), ranging from walking and running to dancing, boxing and digging. The results show that BPs1 achieves the best overall results on the MPJPE, which can be explained by the additional information provided by long-range dependencies. However, the other BPs models still provide interesting results: in particular, all three decompositions have similar performance, although worse than BPs1, which suggests that the upper/lower body coordination was particularly relevant to model in this dataset (e.g., many locomotion animations). It is also relevant to note that the MPJAE results are more ambiguous, with BPs3 and BPs5 producing overall better results on this metric.

To assess the generalization capacities of the models, we then report the reconstruction metrics computed on three unseen datasets: MHAD, Edin locomotion and Edin fight. MHAD includes several general motions (601,000 frames), mostly involving the upper body, e.g., clapping, waving arms, bending, throwing, or a long-lasting T-pose. Edin locomotion is focused on mostly unstructured locomotion tasks (81,000 frames), i.e., walking, jogging, running forward and backward, but also side-stepping. Edin fight contains boxing and kicking motions (90,000 frames), with an ever-present and quite specific “en garde” posture.

In Table 1 (bottom) we describe the distribution of errors on a concatenation of the datasets (thus weighted by their number of animation frames), while Table 2 details the metrics per dataset. The Mean results in Table 1 are thus closer to the results of MHAD in Table 2 because this dataset has four times more animations than the other two. On these unseen datasets, we observe that BPs1 is still better than the others on Edin locomotion, which is linked to the upper/lower body coordination addressed earlier. However, the gap is narrower on Edin fight, on which BPs5 has performance comparable to BPs1 and better than BPs2 and BPs3. This suggests that for some specific datasets, having a biased coordination between body parts might hurt generalizability. Furthermore, BPs1 performs the worst out of all BPs models on MHAD, which suggests good overall generalization capacities for perfectly independent body parts sub-models.

Table 1: MPJPE and MPJAE results on the CMU test (top) and other (bottom) datasets. MPJPE was normalized to correspond to the error (in cm) on a 1.80 m skeleton. The metrics reported for the distinct datasets are weighted by the number of animations per dataset. Results in bold are the best overall BPs model for this metric, and those underlined are the second one.

CMU test subset

| Model | MPJPE [cm] BPs1 | MPJPE [cm] BPs2 | MPJPE [cm] BPs3 | MPJPE [cm] BPs5 | MPJAE [deg] BPs1 | MPJAE [deg] BPs2 | MPJAE [deg] BPs3 | MPJAE [deg] BPs5 |
|---|---|---|---|---|---|---|---|---|
| AE | 4.98 ± 6.69 | 6.18 ± 8.09 | 6.24 ± 8.07 | 6.62 ± 8.51 | 16.8 ± 22.7 | 16.2 ± 21.4 | 15.0 ± 19.8 | 15.3 ± 19.9 |
| PAE | 4.92 ± 6.46 | 7.00 ± 8.99 | 7.17 ± 9.14 | 6.91 ± 8.98 | 16.0 ± 21.1 | 18.8 ± 24.6 | 17.3 ± 22.5 | 16.0 ± 21.0 |
| VAE | 5.58 ± 7.30 | 6.43 ± 8.24 | 6.35 ± 8.05 | 7.16 ± 9.01 | 17.5 ± 23.3 | 17.1 ± 22.5 | 15.4 ± 20.1 | 16.5 ± 20.9 |
| S-VAE | 6.54 ± 8.64 | 7.40 ± 9.55 | 8.60 ± 11.07 | 8.42 ± 10.72 | 19.9 ± 26.8 | 20.6 ± 27.1 | 21.0 ± 27.4 | 20.8 ± 26.5 |
| Mean | 5.50 ± 0.65 | 6.75 ± 0.48 | 7.09 ± 0.94 | 7.28 ± 0.69 | 17.6 ± 1.5 | 18.2 ± 1.7 | 17.2 ± 2.4 | 17.2 ± 2.1 |

MHAD, Edin locomotion, Edin fight

| Model | MPJPE [cm] BPs1 | MPJPE [cm] BPs2 | MPJPE [cm] BPs3 | MPJPE [cm] BPs5 | MPJAE [deg] BPs1 | MPJAE [deg] BPs2 | MPJAE [deg] BPs3 | MPJAE [deg] BPs5 |
|---|---|---|---|---|---|---|---|---|
| AE | 12.00 ± 13.18 | 12.03 ± 12.60 | 11.16 ± 11.71 | 11.46 ± 12.30 | 30.3 ± 31.5 | 29.7 ± 30.9 | 28.4 ± 29.0 | 27.3 ± 28.1 |
| PAE | 12.03 ± 12.98 | 12.65 ± 13.20 | 11.95 ± 12.77 | 11.85 ± 12.74 | 30.4 ± 31.7 | 31.2 ± 32.3 | 29.3 ± 30.2 | 27.9 ± 28.8 |
| VAE | 12.91 ± 13.74 | 12.22 ± 12.72 | 11.30 ± 12.14 | 11.52 ± 12.38 | 31.2 ± 31.8 | 30.4 ± 31.4 | 28.6 ± 29.2 | 28.0 ± 29.0 |
| S-VAE | 13.47 ± 14.23 | 13.13 ± 13.86 | 13.08 ± 13.92 | 12.75 ± 13.95 | 32.6 ± 32.8 | 32.4 ± 33.1 | 32.0 ± 32.5 | 31.0 ± 32.2 |
| Mean | 12.60 ± 0.62 | 12.51 ± 0.42 | 11.87 ± 0.76 | 11.90 ± 0.52 | 31.1 ± 0.9 | 30.9 ± 1.0 | 29.6 ± 1.4 | 28.5 ± 1.4 |

Table 2: Mean reconstruction metrics MPJPE and MPJAE on distinct datasets. Results in bold are the best overall Body Parts model for this metric, and those underlined are the second one.

| Dataset | MPJPE [cm] BPs1 | MPJPE [cm] BPs2 | MPJPE [cm] BPs3 | MPJPE [cm] BPs5 | MPJAE [deg] BPs1 | MPJAE [deg] BPs2 | MPJAE [deg] BPs3 | MPJAE [deg] BPs5 |
|---|---|---|---|---|---|---|---|---|
| MHAD | 13.43 ± 0.60 | 12.84 ± 0.44 | 12.09 ± 0.73 | 12.19 ± 0.47 | 30.9 ± 0.9 | 30.3 ± 0.9 | 29.0 ± 1.3 | 28.0 ± 1.2 |
| Edin loco | 8.13 ± 0.62 | 9.96 ± 0.44 | 9.78 ± 0.70 | 10.00 ± 0.61 | 26.3 ± 1.0 | 27.8 ± 1.1 | 25.7 ± 1.4 | 24.5 ± 1.9 |
| Edin fight | 11.25 ± 0.81 | 12.74 ± 0.51 | 12.39 ± 1.00 | 11.72 ± 0.83 | 37.1 ± 1.4 | 38.3 ± 1.4 | 36.8 ± 2.4 | 35.9 ± 2.5 |

4.2.2 Noise Robustness

In numerous applications, such as immersive videoconferencing, motion models robust to various kinds of noise are necessary to preserve user experience. In our application context, such detrimental perturbations can typically come from noisy input (e.g., tracking errors) or corruption during transmission. Traditionally, a more resilient model is obtained through data augmentation (e.g., incorporating noise into training samples); however, such a curriculum could miss unexpected noise distributions or impair the learning of the main task. An inherently robust model could therefore be interesting, and we evaluate the different BPs models in this respect in three noisy situations:

1. “Input Noise”: add random SO(3) noise (i.e., unit quaternions with small angles) to each input sample. This simulates tracking errors or occlusions.

2. “Latent Noise”: remove (i.e., set to 0) a random part (10%) of the latent vector. This typically corresponds to data corruption during transmission.

3. “Input+Latent Noise”: a combination of the two above.

In each situation, we compute the metrics against the noise-free animation. The results are reported for all the test datasets and averaged over the various architectures and seeds (Table 3).
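The two perturbations listed above can be sketched as follows. This is a minimal illustration rather than the exact evaluation code: the quaternion layout (w, x, y, z), the `max_angle_deg` bound, the joint count and the choice of zeroing individual latent entries (instead of, say, a contiguous block) are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def random_small_rotations(n, max_angle_deg, rng):
    """Sample n unit quaternions (w, x, y, z) with random axes and angles below max_angle_deg."""
    axes = rng.normal(size=(n, 3))
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)
    half = np.deg2rad(rng.uniform(0.0, max_angle_deg, size=(n, 1))) / 2.0
    return np.concatenate([np.cos(half), np.sin(half) * axes], axis=1)

def quat_mul(a, b):
    """Hamilton product of two arrays of quaternions of shape (..., 4)."""
    aw, ax, ay, az = np.moveaxis(a, -1, 0)
    bw, bx, by, bz = np.moveaxis(b, -1, 0)
    return np.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], axis=-1)

def add_input_noise(joint_quats, max_angle_deg, rng):
    """Input noise: compose every joint rotation with a small random rotation."""
    flat = joint_quats.reshape(-1, 4)
    noise = random_small_rotations(len(flat), max_angle_deg, rng)
    return quat_mul(noise, flat).reshape(joint_quats.shape)

def add_latent_noise(latent, drop_ratio, rng):
    """Latent noise: set a random subset (drop_ratio of the entries) of the latent vector to 0."""
    noisy = latent.copy()
    n_drop = int(round(drop_ratio * latent.size))
    idx = rng.choice(latent.size, size=n_drop, replace=False)
    noisy.flat[idx] = 0.0
    return noisy

rng = np.random.default_rng(0)
pose = np.tile([1.0, 0.0, 0.0, 0.0], (22, 1))  # 22 identity joint rotations (arbitrary joint count)
noisy_pose = add_input_noise(pose, max_angle_deg=5.0, rng=rng)
noisy_latent = add_latent_noise(rng.normal(size=64), drop_ratio=0.10, rng=rng)
```

The “Input+Latent Noise” situation would then simply apply `add_input_noise` before encoding and `add_latent_noise` before decoding.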

First of all, most trends are the same as in the noise-free evaluation (previous section), i.e., BPs3 performs better than BPs1 on MHAD and BPs5 is still second on half of the datasets. We can also note that the BPs models overall perform better in the latent noise scenario than in the input noise one, which suggests that latent noise is the easier perturbation to recover from here. However, BPs1 is an exception on the Edin fight dataset, where it is slightly worse in the latent noise case (13.28 cm vs. 13.14 cm MPJPE), and we observe the same tendency for BPs5, with virtually the same results in both scenarios (13.62 cm vs. 13.63 cm). In other words, on certain datasets, an error on the latent vector might be particularly difficult to recover from. It is nevertheless interesting to note that BPs2, BPs3 and BPs5 provide a relevant alternative to BPs1 without losing excessive reconstruction accuracy.

4.2.3 Partitioned Latent Representation

Complementary to the traditional evaluation of noise

on latent representations, we provide in this section

visual examples of animations where noise is applied

to strategic parts of the motion data. Such errors could

be representative of local tracking errors such as a

drifting sensor. Even in a case where the noise was

not properly eliminated, we would reasonably expect

the noise to still be localized on the same body part.

In other words, we would expect some interpretability

of the model.

Specifically, we present examples where we add noise to specific parts of the latent vectors and visually compare the results between all the BPs models (Figure 3). In this scenario, we a) tamper with the latent vector produced by one or several sub-models of either BPs2, BPs3 or BPs5, b) tamper with an arbitrary part of the latent vector produced by BPs1, and c) compare and analyze the results. Since we use distinct latent spaces to reconstruct our animation, it is natural that adding noise to a specific latent vector will not impact the others. Indeed, we observe in Figure 3 (bottom) that tampering with the lower-body latent vector modifies the entire BPs1 animation, i.e., all the joint positions are visibly different from the ground truth. However, for BPs2, BPs3 or BPs5, the upper-body pose is almost exactly the same as the ground-truth one. In other words, a partitioned approach constrains the noise to remain localized, in contrast to BPs1, which suffers from its entangled representation. Interestingly, we can note in Figure 3 (top) that all the models suffer from tampering with the upper-body latent vectors, i.e., the lower-body joint positions are disturbed compared to the ground truth. Non-BPs1 models still seem to produce closer poses overall, and we can observe that the lower-body pose is locally similar to the ground truth (i.e., the knees are bent, but not as much as in the BPs1 reconstruction, and the legs are less spread out). In traditional Character Animation, disturbing the root rotation produces an overall bad animation, and we observe a similar behavior in our framework. Indeed, the root rotation is managed by the axial skeleton (e.g., Spine or Upper-Body sub-models), which might explain the worse overall performance when it is disturbed, and highlights its importance.

Table 3: MPJPE and MPJAE results on the different datasets after applying input, latent, or input+latent noise. MPJPE was normalized to correspond to the error in centimeters on a 1.80 m skeleton. Results in bold are the best overall Body Parts model for this metric, and those underlined are the second one.

Input noise
            MPJPE [cm]                                              MPJAE [deg]
            BPs1          BPs2          BPs3          BPs5          BPs1        BPs2        BPs3        BPs5
CMU          8.05 ± 0.32   9.47 ± 0.28   9.63 ± 0.61  10.08 ± 0.26  21.0 ± 1.1  22.4 ± 1.2  21.8 ± 1.6  21.7 ± 1.2
MHAD        16.27 ± 0.50  15.74 ± 0.48  15.15 ± 0.63  15.24 ± 0.35  32.9 ± 0.5  32.7 ± 0.7  32.5 ± 0.8  31.2 ± 0.6
Edin loco   10.38 ± 0.40  12.44 ± 0.36  12.18 ± 0.55  12.57 ± 0.33  28.4 ± 0.8  30.5 ± 1.0  29.0 ± 1.2  27.6 ± 1.6
Edin fight  13.14 ± 0.70  14.67 ± 0.56  14.41 ± 0.88  13.63 ± 0.65  39.4 ± 1.3  41.3 ± 1.4  40.3 ± 2.1  39.3 ± 2.0

Latent noise
            MPJPE [cm]                                              MPJAE [deg]
            BPs1          BPs2          BPs3          BPs5          BPs1        BPs2        BPs3        BPs5
CMU          8.02 ± 1.28   8.51 ± 1.08   8.82 ± 1.39   9.15 ± 1.17  21.3 ± 2.1  21.3 ± 2.6  20.6 ± 3.2  21.1 ± 3.0
MHAD        15.28 ± 0.97  14.40 ± 0.84  13.87 ± 1.10  14.13 ± 0.79  33.1 ± 1.4  32.4 ± 1.6  31.6 ± 2.0  31.0 ± 1.8
Edin loco    9.72 ± 1.20  10.95 ± 0.96  10.82 ± 1.20  11.22 ± 1.02  28.4 ± 1.6  29.3 ± 1.9  27.6 ± 2.3  27.1 ± 2.7
Edin fight  13.28 ± 1.22  14.30 ± 1.02  14.07 ± 1.41  13.62 ± 1.20  39.8 ± 1.8  40.7 ± 2.1  39.5 ± 3.0  38.8 ± 3.0

Input + Latent noise
            MPJPE [cm]                                              MPJAE [deg]
            BPs1          BPs2          BPs3          BPs5          BPs1        BPs2        BPs3        BPs5
CMU          9.77 ± 0.99  10.57 ± 0.80  10.74 ± 1.05  11.28 ± 0.75  23.9 ± 1.7  24.6 ± 2.0  24.1 ± 2.5  24.5 ± 2.1
MHAD        17.63 ± 0.82  16.82 ± 0.77  16.36 ± 0.89  16.64 ± 0.55  34.8 ± 1.1  34.4 ± 1.3  34.4 ± 1.5  33.6 ± 1.2
Edin loco   11.44 ± 0.93  12.98 ± 0.73  12.82 ± 1.04  13.30 ± 0.69  30.1 ± 1.4  31.6 ± 1.7  30.4 ± 2.2  29.7 ± 2.2
Edin fight  14.75 ± 1.08  15.89 ± 0.95  15.66 ± 1.22  15.11 ± 0.98  41.8 ± 1.7  43.3 ± 2.1  42.4 ± 2.7  41.6 ± 2.5

Figure 3: Reconstruction results after adding noise to specific parts of the latent vector. The left skeleton is the original animation (black), then each column displays the reconstruction from BPs1 (purple), BPs2 (green), BPs3 (orange) and BPs5 (blue). In the case of BPs2, BPs3 and BPs5, we tamper with the latent vectors of the upper-body (top row, “Upper body latent noise”) or lower-body (bottom row, “Lower body latent noise”) sub-models. We can notice that the whole animation produced by BPs1 is noisy, whereas the other BPs models produce coherent animations for the unperturbed part.
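The locality argument of this section can be reproduced on a toy example. The sketch below does not use the trained networks: it replaces them with random linear decoders (an assumption made purely for illustration, as are the sizes and part names) and compares a partitioned decoder, with one independent latent per body part, with an entangled one that maps a single latent to the whole pose.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, out_dim = 8, 12  # per body part (hypothetical sizes)
parts = ["upper", "lower"]

# Partitioned model: one independent (toy, linear) decoder per body part.
part_decoders = {p: rng.normal(size=(out_dim, latent_dim)) for p in parts}
# Entangled model: a single decoder mapping the full latent to the full pose.
whole_decoder = rng.normal(size=(out_dim * len(parts), latent_dim * len(parts)))

def decode_partitioned(latents):
    return np.concatenate([part_decoders[p] @ latents[p] for p in parts])

def decode_whole(latents):
    return whole_decoder @ np.concatenate([latents[p] for p in parts])

clean = {p: rng.normal(size=latent_dim) for p in parts}
# Tamper with the lower-body latent only.
tampered = dict(clean, lower=clean["lower"] + rng.normal(scale=0.5, size=latent_dim))

diff_part = np.abs(decode_partitioned(tampered) - decode_partitioned(clean))
diff_whole = np.abs(decode_whole(tampered) - decode_whole(clean))

upper = slice(0, out_dim)
print("partitioned, upper-body change:", diff_part[upper].max())
print("entangled,   upper-body change:", diff_whole[upper].max())
```

In this toy setting the upper-body output of the partitioned decoder is untouched, whereas the entangled decoder spreads the perturbation over every output. Note that this only holds for the local outputs of each sub-model: global joint positions obtained after forward kinematics still depend on parent parts such as the root, which is consistent with the upper-body case discussed above.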

4.2.4 Modular Body Parts Models

Finally, we illustrate an additional benefit of a modular motion modeling approach. Its high flexibility enables its use in scenarios other than the original one and turns it into a reliable and controllable framework. For instance, it provides an easy way of creating coarse animations of novel character topologies, e.g., as often seen in applications where users can choose to embody non-anthropomorphic characters. In particular, we can easily copy, remove, or replace a specific sub-model with another one with only minor modifications to the original BPs model. This process enables the use of the sub-models as building blocks to reconstruct or create local animations. To illustrate this modularity, we showcase (cf. Figure 4) the use of an AE-based BPs5 model to reconstruct the original skeleton, together with two VAEs trained on the right and left arm motions, respectively. By slightly altering the latent code produced by the VAE encoders, we effectively generate the new arm animations. The motions can then be concatenated following a novel topology to obtain a new animation. The whole process is close to a straightforward manipulation of the animation skeleton in traditional Character Animation, but would usually be difficult to handle using a BPs1 model.

Figure 4: Animation of a novel topology with several sub-models. Here, we simultaneously reconstruct the humanoid skeleton motion with several Auto Encoder-based body parts sub-models (blue) and generate an animation for another pair of arms with two Variational Auto Encoder-based sub-models (cyan).
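The assembly described in this section can be sketched as a dictionary of interchangeable sub-models. The code below is only a structural illustration: `ToySubModel` is a hypothetical stand-in made of random linear maps (not a trained AE or VAE), and the part names, feature counts and noise scale are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToySubModel:
    """Stand-in for a trained body-part auto-encoder (random linear maps)."""
    def __init__(self, n_features, latent_dim):
        self.enc = rng.normal(size=(latent_dim, n_features)) / np.sqrt(n_features)
        self.dec = rng.normal(size=(n_features, latent_dim)) / np.sqrt(latent_dim)

    def encode(self, x):
        return self.enc @ x

    def decode(self, z):
        return self.dec @ z

# Hypothetical five-part decomposition; feature counts are placeholders.
part_sizes = {"spine": 20, "left_arm": 16, "right_arm": 16, "left_leg": 16, "right_leg": 16}
bps5 = {name: ToySubModel(n, latent_dim=8) for name, n in part_sizes.items()}
arm_vaes = {name: ToySubModel(part_sizes[name], latent_dim=8) for name in ("left_arm", "right_arm")}

pose = {name: rng.normal(size=n) for name, n in part_sizes.items()}

# 1) Reconstruct the original skeleton part by part.
output = {name: model.decode(model.encode(pose[name])) for name, model in bps5.items()}

# 2) Generate an extra pair of arms by slightly altering the arm latent codes.
for name, vae in arm_vaes.items():
    z = vae.encode(pose[name])
    output["extra_" + name] = vae.decode(z + rng.normal(scale=0.1, size=z.shape))

# 3) Concatenate following the novel topology (the part order is a design choice).
topology = ["spine", "left_arm", "right_arm", "extra_left_arm", "extra_right_arm",
            "left_leg", "right_leg"]
animation_frame = np.concatenate([output[name] for name in topology])
```

Copying, removing or replacing a body part then amounts to editing the `topology` list and the corresponding dictionary entry, without touching the other sub-models.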

5 DISCUSSION

The experiments presented in this paper, which evaluate the benefits of body parts-based representations on human motion reconstruction tasks, yielded several interesting results, which we discuss in this section.

Although BPs1 achieves the best results overall on the objective reconstruction metrics, the other models show comparable, and sometimes better, results, despite discarding the upper/lower-body coordination that is crucial in locomotion tasks. In particular, on datasets with many motions that mostly involve the coordination of both arms (such as MHAD), BPs3 performs better. This result could be explained by the relatively good modeling of the arm coordination (a single sub-model in this 3-part decomposition) and the removal of the upper/lower-body communication that might bring spurious motions. We observe similar objective performances on Edin fight, which also relies heavily on the upper body, although there are a) some boxing motions in CMU that could explain the success of BPs1 and b) some kicks and steps that might be easier to model with prior locomotion knowledge. Overall, and counter-intuitively, BPs5 achieves second place in terms of reconstruction performance in half of the studied scenarios. In other words, the trade-off between global coherency and local modeling could be interesting to explore in some reconstruction tasks, using appropriate metrics.

The results presented in this paper rely on several objective metrics, some of them correlated to a certain degree, that were chosen for their wide usage in the literature. In particular, we focused on reporting the MPJPE and MPJAE, which are extensively used in skeletal motion modeling (i.e., human pose estimation, motion prediction, or reconstruction) and, although strongly correlated, can be complementary. As introduced previously, the MPJPE is an objective, interpretable and differentiable metric, and as such is our main evaluation criterion. In a kinematic representation, it is linked to the MPJAE, which itself suffers from a) relying only on global angular information (and thus ignoring potential local errors, hence the need for other metrics such as MPJLAE), and b) a lack of interpretability. However, the MPJAE more strongly penalizes incorrect global rotations that might still produce correct joint positions (e.g., a wrist rotation around the wrist-elbow axis), and thus a low MPJPE.
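The complementarity of the two metrics can be illustrated with the wrist example above. The sketch below assumes the usual definitions (MPJPE as the mean Euclidean distance between global joint positions, MPJAE as the mean geodesic angle between global joint rotations); the exact normalization used in the paper may differ, and the two-joint arm is a toy construction.

```python
import numpy as np

def mpjpe(pred_pos, true_pos):
    """Mean Euclidean distance between joint positions, arrays of shape (..., J, 3)."""
    return np.linalg.norm(pred_pos - true_pos, axis=-1).mean()

def mpjae_deg(pred_rot, true_rot):
    """Mean geodesic angle (degrees) between rotation matrices of shape (..., J, 3, 3)."""
    rel = np.swapaxes(pred_rot, -1, -2) @ true_rot
    cos = (np.trace(rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()

def rot_x(deg):
    """Rotation matrix of `deg` degrees around the x axis."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

# Toy forearm lying along the x axis: elbow at the origin, wrist 25 cm away.
true_pos = np.array([[0.0, 0.0, 0.0], [25.0, 0.0, 0.0]])
true_rot = np.stack([np.eye(3), np.eye(3)])

# Twist the wrist by 30 degrees around the wrist-elbow (x) axis:
# its global orientation changes, but no joint position moves.
pred_pos = true_pos.copy()
pred_rot = np.stack([np.eye(3), rot_x(30.0)])

print("MPJPE [cm]: ", mpjpe(pred_pos, true_pos))
print("MPJAE [deg]:", mpjae_deg(pred_rot, true_rot))
```

Here the position error is zero while the angular error is not, which is the situation where the MPJAE adds information to the MPJPE.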

Furthermore, we chose to evaluate the reconstruction results using metrics applied to the whole skeleton, which provides three benefits: a) a fair comparison between the various models, b) coherency with our main use case, and c) accounting for accumulated joint errors by including the most proximal joints in the skeletal chains. Indeed, since the MPJPE and MPJAE are computed at the global scale (i.e., after a forward kinematics pass), they tend to accumulate the errors along the kinematic chain (i.e., the whole skeleton). While they are not specifically tailored for evaluating discrepancies at the junction between body part models, such errors are then necessarily accounted for by the metrics. However, exploring metrics that describe more precisely the error distribution across joints would be valuable, and would provide insights into the difficulty of modeling the connectors, which are key joints from an anatomical standpoint. Situational metrics, such as the MPJPE computed on specific body part chains (here used as training losses), could also be evaluated, for instance in scenarios similar to the one described in Section 4.2.3. Finally, multiple other metrics would be interesting to consider in the future for human motion modeling, ranging from metrics based on time derivatives (in particular jerk), which would penalize bad temporal coherency and jerkiness, to metrics closer to human perception. While time-derivative metrics were indeed computed for this work, obtaining reliable perceptual metrics that do not require subjective evaluations is still an open issue in deep learning-based animation, as is the availability of data for creating them.
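As an illustration of a time-derivative metric, the sketch below estimates the mean jerk magnitude with third-order finite differences. It is one common way to do so and not necessarily the formulation computed for this work; the function name, frame rate and toy trajectories are assumptions.

```python
import numpy as np

def mean_jerk(positions, fps):
    """Mean jerk magnitude of joint positions of shape (T, J, 3),
    estimated with third-order finite differences in time."""
    dt = 1.0 / fps
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    return np.linalg.norm(jerk, axis=-1).mean()

fps = 30
t = np.arange(60) / fps
rng = np.random.default_rng(0)

# Constant-acceleration trajectory (zero jerk up to rounding) for 4 joints.
smooth = np.zeros((60, 4, 3))
smooth[...] = (0.5 * t**2)[:, None, None]
# Same trajectory with small per-frame position noise, e.g. tracking jitter.
noisy = smooth + rng.normal(scale=0.01, size=smooth.shape)

print("smooth:", mean_jerk(smooth, fps))
print("noisy: ", mean_jerk(noisy, fps))
```

Because the third difference is divided by `dt**3`, even small frame-to-frame jitter yields a large jerk value, which is why such a metric penalizes poor temporal coherency.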

We believe that the above results highlight the difficulty of obtaining a comprehensive human motion representation. In fact, in terms of motion diversity, structured motion capture datasets are sparse, prioritizing widespread and generic motions that might not be relevant for specific applications. To obtain such specific motions, we thus often rely on approximate methods (e.g., latent or linear interpolations) to manage transitions or generalization, or on unstructured datasets. However, capturing new data requires expensive and cumbersome motion capture setups, actors and plenty of time, which prevents the creation of large-scale datasets comparable in size to what can be found in the image community. Furthermore, some motions are inherently challenging to capture and thus under-represented, either because they are specialized (e.g., the “en garde” posture or making pottery) or because they occur in challenging conditions (e.g., swimming implies water, refraction, reflection and waterproof equipment). As such, creating models complex enough to learn all the correlations at the gesture and body scales on a dataset might lead to overfitting its motion distribution and thus hurt the generalizability of the model. In contrast, small independent models could learn to reconstruct a set of local motions that would together form a complex gesture, and an action could then be obtained by specifying a number of objectives, e.g., having some models learn body part motion distributions and an autonomous agent sample them to perform specific, goal-driven tasks.

Another major advantage of a body parts-based approach is its modularity, which produces a structured latent vector while providing several benefits that apply at different levels of the animation pipeline. For example, body parts approaches enable the editing or synthesis of local animations independently from one another, which can be used for instance to denoise or update them selectively. As illustrated in the results, it could prevent spatially localized noise from corrupting the whole signal (e.g., a motion capture method with drifting trackers), while also providing novel editing capabilities for characters with novel topologies. It would further increase the controllability of the deep learning-based pipeline, for instance by adapting the body parts latent vectors depending on other control signals. For example, in case of high streaming latency, we could preserve the reconstruction fidelity of the axial skeleton while sacrificing the updates or high frequencies of the other body parts. In short, we obtain a flexible and adaptable representation that can address a wide variety of exotic scenarios, ranging from an interactive tool for artistic creation at the body part level to a reduction in the quantity and quality of transmitted information in a streaming setting (i.e., reconstructing relevant body parts and generating the others, or having a coarse-to-fine approach on the server side).
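The streaming scenario mentioned above is not implemented in this work; the following sketch only illustrates how a partitioned latent vector would make such a policy easy to express. The priority order, the budget expressed in number of floats and both function names are hypothetical.

```python
import numpy as np

# Hypothetical priorities: the axial skeleton (spine/root) is refreshed first.
PRIORITY = ["spine", "left_leg", "right_leg", "left_arm", "right_arm"]

def select_updates(latents, budget_floats):
    """Sender side: pick the body-part latents to transmit this frame,
    highest priority first, without exceeding the budget."""
    sent, used = {}, 0
    for name in PRIORITY:
        size = latents[name].size
        if used + size <= budget_floats:
            sent[name] = latents[name]
            used += size
    return sent

def receive(previous, sent):
    """Receiver side: refresh the parts that arrived, keep the last known latent otherwise."""
    return {name: sent.get(name, previous[name]) for name in previous}

# Toy usage: five 8-dimensional latents, but room for only 20 floats this frame.
latents = {name: np.full(8, i + 1.0) for i, name in enumerate(PRIORITY)}
previous = {name: np.zeros(8) for name in PRIORITY}
sent = select_updates(latents, budget_floats=20)
current = receive(previous, sent)
print(sorted(sent))
```

With an entangled whole-body latent, dropping part of the vector would degrade every joint at once, whereas here the parts that are not refreshed simply keep their previous state.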

Finally, separating the human body into parts also provides increased flexibility for training human motion models. As introduced previously and illustrated in Figures 2 and 5 (top), the evaluation process directed our framework towards several BPs wrappers designed to train different sub-models with the same pipeline as for a whole-body model, in a transparent “Centralized” manner. Such a method could be extended in the future to better consider the relationships between body parts, e.g., by training a network in charge of the synchronization between sub-models. However, our framework also enables training the different sub-models separately (as illustrated in Figure 5, bottom). The training could then be fully parallelized, or performed on entirely different data sources, similarly to a Federated Learning process. Hybrid schemes, e.g., first training sub-models in a “Centralized” fashion before fine-tuning them using the “Federated” approach on dedicated datasets (e.g., expert motions), would also be interesting to explore. As another example, we could consider use cases where relevant body part models would be tuned on datasets of people with missing limbs, to model their motions without the artifacts that a traditional post-processing of the skeleton could produce. However, we leave such novel learning mechanisms for future work.
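The two data flows discussed above can be sketched as follows. This is a schematic illustration under simplifying assumptions: the sub-models are random linear auto-encoders, the split of a 12-feature pose into three parts is hypothetical, and the training signal is a plain feature-wise mean squared error rather than the losses used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical split of a 12-feature pose vector into three body parts.
part_indices = {"spine": [0, 1, 2, 3], "arms": [4, 5, 6, 7, 8], "legs": [9, 10, 11]}

def make_toy_autoencoder(n_features, latent_dim=2):
    """Random linear encoder/decoder pair standing in for a trainable sub-model."""
    enc = rng.normal(size=(latent_dim, n_features))
    dec = rng.normal(size=(n_features, latent_dim))
    return lambda x: x @ enc.T @ dec.T

sub_models = {name: make_toy_autoencoder(len(idx)) for name, idx in part_indices.items()}

def centralized_forward(batch):
    """Split -> per-part sub-models -> concat back into the whole-body order."""
    out = np.empty_like(batch)
    for name, idx in part_indices.items():
        out[:, idx] = sub_models[name](batch[:, idx])
    return out

batch = rng.normal(size=(32, 12))

# "Centralized": one training signal computed on the re-assembled whole body.
whole_body_loss = np.mean((centralized_forward(batch) - batch) ** 2)

# "Federated": each sub-model only sees its own body part and its own loss.
part_losses = {name: np.mean((sub_models[name](batch[:, idx]) - batch[:, idx]) ** 2)
               for name, idx in part_indices.items()}

# With a feature-wise MSE, the whole-body loss is the size-weighted mean of the part losses.
weighted = sum(len(idx) / batch.shape[1] * part_losses[name]
               for name, idx in part_indices.items())
print(whole_body_loss, weighted)
```

With this simple loss the two schemes provide the same training signal up to a per-part weighting. A whole-body loss computed after forward kinematics would instead couple the parts, which is one reason the “Centralized” and “Federated” variants may behave differently in practice.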

6 CONCLUSION

This paper proposes a new human motion modeling framework for creating independent body parts models and a simple technique for combining them to address several use cases. Additionally, we evaluate different body decompositions used in the literature by testing them on various use case simulations. Our results suggest that even though we explicitly discard long-range dependencies across the body, our independent body parts models can produce results comparable to a whole-body model in a reconstruction task. Moreover, the results suggest that their denoising (especially at the local scale) and generalization capacities are relevant, and their modularity provides other interesting features. For instance, they semantically partition the animation data, enabling a more interpretable and controllable representation, which is a desirable property in unsupervised and unpredictable contexts. This property covers, for example, the modification of the latent vector in a scenario of real-time (UDP-based) animation transmission over the Internet. Finally, this semantic representation could further lead to novel semantic-based animation techniques for non-humanoid avatars, novel training schemes for large or privacy-aware models, or even to different compression methods.

Figure 5: Possible generic training processes for a Body Parts model with N sub-models. In the “Federated” case (bottom), we make the assumption that the sources of data would provide the data corresponding to each body part. In other words, if we use one or several whole-body datasets, the masking would be done at the level of the data sources. [Diagram labels: N sources of data, “Federated”, Centralized, 1 source of data, Split, Body part i, BPsN, Concat, Add, Output, Whole-Body loss ?, Training signal.]

In future work, we aim to address some limitations of the current approach and further exploit its advantages. In particular, our work aimed at exploring the benefits of an alternative approach compared to the current state-of-the-art techniques that involve highly specialized and engineered architectures. As such, this study should be perceived as complementary to other body parts works, as an encouragement to exploit more modular motion representations, and as a possible milestone in the corresponding open discussion. The future steps would thus involve a more in-depth exploration and analysis of independent body parts motion modeling, and comparisons with state-of-the-art models on specialized topics. We believe that creatively leveraging local motion distributions, which are fast to model, tune, or modify, could prove valuable in numerous scenarios that do not necessarily need an explicit and holistic modeling of every joint correlation (e.g., learning expert motions or managing occlusions and tracker issues).

REFERENCES

Aberman, K., Li, P., Lischinski, D., Sorkine-Hornung, O.,

Cohen-Or, D., and Chen, B. (2020). Skeleton-aware

networks for deep motion retargeting. ACM Trans. on

Graph., 39(4).

Aristidou, A. and Lasenby, J. (2011). Fabrik: A fast, itera-

tive solver for the inverse kinematics problem. Graph-

ical Models, 73(5):243–260.

Butepage, J., Black, M. J., Kragic, D., and Kjellstrom, H.

(2017). Deep representation learning for human mo-

tion prediction and classiﬁcation. In IEEE/CVF Conf.

on Comp. Vision and Pattern Recognition.

Chen, A. (2023). Character motion synthesis based on deep

learning: A survey. Highlights in Science, Engineer-

ing and Technology, 76:705–724.

Cheymol, A., Fribourg, R., Lécuyer, A., Normand, J.-M., and Argelaguet, F. (2023). Beyond my real

body: Characterization, impacts, applications and

perspectives of “dissimilar” avatars in virtual real-

ity. IEEE Trans. on Visualization and Comp. Graph.,

29(11):4426–4437.

Dang, L., Nie, Y., Long, C., Zhang, Q., and Li, G. (2021).

Msr-gcn: Multi-scale residual graph convolution net-

works for human motion prediction. In IEEE/CVF Int.

Conf. on Comp. Vision, pages 11467–11476.

Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tom-

czak, J. M. (2018). Hyperspherical variational auto-

encoders. Conf. on Uncertainty in Artiﬁcial Intelli-

gence.

Fu, H., Li, C., Liu, X., Gao, J., Celikyilmaz, A., and Carin,

L. (2019). Cyclical annealing schedule: A simple ap-

proach to mitigating kl vanishing. In North American

Chapter of the Ass. for Computational Linguistics.

Guo, W., Du, Y., Shen, X., Lepetit, V., Alameda-Pineda,

X., and Moreno-Noguer, F. (2023). Back to mlp:

A simple baseline for human motion prediction. In

IEEE/CVF Conf. on Comp. Vision and Pattern Recog-

nition, pages 4809–4819.

Hanavan, E. P. (1964). A mathematical model of the human

body, volume 32. Aerospace Medical Research Lab.,

Aerospace Medical Div., Air Force Systems Com-

mand.

Holden, D., Komura, T., and Saito, J. (2017). Phase-

functioned neural networks for character control.

ACM Trans. on Graph., 36(4):1–13.

Holden, D., Saito, J., and Komura, T. (2016). A deep learn-

ing framework for character motion synthesis and

editing. ACM Trans. on Graph., 35(4):1–11.

Holden, D., Saito, J., Komura, T., and Joyce, T. (2015).

Learning motion manifolds with convolutional au-

toencoders. In ACM SIGGRAPH Asia Technical

Briefs.

Hu, L., Zhang, Z., Zhong, C., Jiang, B., and Xia, S. (2024).

Pose-aware attention network for ﬂexible motion re-

targeting by body part. IEEE Trans. on Visualization

and Comp. Graph., 30(8):4792–4808.

Jain, A., Zamir, A. R., Savarese, S., and Saxena, A.

(2016). Structural-rnn: Deep learning on spatio-


temporal graphs. In IEEE/CVF Conf. on Comp. Vision

and Pattern Recognition, pages 5308–5317.

Jang, D.-K., Park, S., and Lee, S.-H. (2022). Motion puzzle:

Arbitrary motion style transfer by body part. ACM

Trans. on Graph., 41(3).

Jiang, J., Streli, P., Qiu, H., Fender, A., Laich, L., Snape,

P., and Holz, C. (2022). Avatarposer: Articulated full-

body pose tracking from sparse motion sensing. In

Comp. Vision – ECCV 2022, pages 443–460. Springer

Nature.

Ko, H. and Badler, N. (1996). Animating human locomo-

tion with inverse dynamics. IEEE Comp. Graph. and

Applications, 16(2):50–59.

Kovar, L., Gleicher, M., and Pighin, F. (2002). Motion

graphs. ACM Trans. on Graph., 21(3):473–482.

Lee, S., Kang, T., Park, J., Lee, J., and Won, J. (2023a).

Same: Skeleton-agnostic motion embedding for char-

acter animation. In ACM SIGGRAPH Asia Conf. Proc.

Lee, S., Lee, J., and Lee, J. (2022). Learning virtual

chimeras by dynamic motion reassembly. ACM Trans.

on Graph., 41(6):1–13.

Lee, S., Starke, S., Ye, Y., Won, J., and Winkler, A. (2023b).

Questenvsim: Environment-aware simulated motion

tracking from sparse sensors. In ACM SIGGRAPH

Conf. Proc.

Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., and Tian,

Q. (2022a). Symbiotic graph neural networks for 3d

skeleton-based human action recognition and motion

prediction. IEEE Trans. on Pattern Analysis and Ma-

chine Intelligence, 44(6).

Li, M., Chen, S., Zhang, Z., Xie, L., Tian, Q., and Zhang,

Y. (2022b). Skeleton-parted graph scattering networks

for 3d human motion prediction. In Comp. Vision –

ECCV 2022, pages 18–36.

Li, Y., Wang, Z., Yang, X., Wang, M., Poiana, S. I.,

Chaudhry, E., and Zhang, J. (2019). Efﬁcient con-

volutional hierarchical autoencoder for human motion

prediction. The Visual Computer, 35(6):1143–1156.

Liu, Z., Lyu, K., Wu, S., Chen, H., Hao, Y., and Ji, S.

(2021). Aggregated multi-gans for controlled 3d hu-

man motion prediction. AAAI Conf. on Artiﬁcial In-

telligence Proc., 35(3):2225–2232.

Malek-Podjaski, M. and Deligianni, F. (2023). Adversarial

attention for human motion synthesis. In IEEE Sym-

posium Series on Computational Intelligence, pages

69–74.

Mao, W., Liu, M., and Salzmann, M. (2020). History re-

peats itself: Human motion prediction via motion at-

tention. In Comp. Vision – ECCV 2020, pages 474–

489.

Mourot, L., Hoyet, L., Le Clerc, F., Schnitzler, F., and

Hellier, P. (2022). A survey on deep learning for

skeleton-based human animation. Comp. Graph. Fo-

rum, 41:122–157.

Reda, D., Won, J., Ye, Y., van de Panne, M., and Winkler,

A. W. (2023). Physics-based motion retargeting from

sparse inputs. Proc. ACM Comput. Graph. Interact.

Tech., 6.

Shao, Z., Li, Y., Guo, Y., Zhou, X., and Chen, S. (2019). A

hierarchical model for human action recognition from

body-parts. IEEE Trans. on Circuits and Systems for

Video Technology, 29(10):2986–3000.

Shu, X., Zhang, L., Qi, G.-J., Liu, W., and Tang, J. (2022).

Spatiotemporal co-attention recurrent neural networks

for human-skeleton motion prediction. IEEE Trans.

on Pattern Analysis and Machine Intelligence, 44(6).

Starke, S., Mason, I., and Komura, T. (2022). Deepphase:

periodic autoencoders for learning motion phase man-

ifolds. ACM Trans. on Graph., 41(4):1–13.

Wang, W., Zhou, T., Qi, S., Shen, J., and Zhu, S.-C. (2022).

Hierarchical human semantic parsing with compre-

hensive part-relation modeling. IEEE Trans. on Pat-

tern Analysis and Machine Intelligence, 44(7):3508–

3522.

Wang, Y. and Neff, M. (2015). Deep signatures for index-

ing and retrieval in large motion databases. In ACM

SIGGRAPH Conf. on Motion in Games, pages 37–45.

Yan, X., Rastogi, A., Villegas, R., Sunkavalli, K., Shecht-

man, E., Hadap, S., Yumer, E., and Lee, H. (2018).

Mt-vae: Learning motion transformations to gener-

ate multimodal human dynamics. In Comp. Vision –

ECCV 2018.

Yang, D., Kim, D., and Lee, S.-H. (2021). Lobstr:

Real-time lower-body pose prediction from sparse

upper-body tracking signals. Comp. Graph. Forum,

40(2):265–275.

Ye, Y., Liu, L., Hu, L., and Xia, S. (2022). Neural3points:

Learning to generate physically realistic full-body

motion for virtual reality users. Comp. Graph. Forum,

41(8):183–194.

Zhang, H., Starke, S., Komura, T., and Saito, J. (2018).

Mode-adaptive neural networks for quadruped motion

control. ACM Trans. on Graph., 37(4):1–11.

Zhang, J., Tu, Z., Weng, J., Yuan, J., and Du, B. (2024). A

modular neural motion retargeting system decoupling

skeleton and shape perception. IEEE Trans. on Pattern

Analysis and Machine Intelligence.

Zhou, L., Shang, L., Shum, H. P., and Leung, H. (2014).

Human motion variation synthesis with multivariate

gaussian processes. Comp. Animation and Virtual

Worlds, 25(3-4):301–309.

Zhou, Y., Barnes, C., Lu, J., Yang, J., and Li, H. (2019). On

the Continuity of Rotation Representations in Neural

Networks. In IEEE/CVF Conf. on Comp. Vision and

Pattern Recognition, pages 5738–5746.

Zou, Q., Yuan, S., Du, S., Wang, Y., Liu, C., Xu, Y., Chen,

J., and Ji, X. (2025). Parco: Part-coordinating text-to-

motion synthesis. In Comp. Vision – ECCV 2024.

GRAPP 2025 - 20th International Conference on Computer Graphics Theory and Applications