Diffusion Transformer Framework for Speech-Driven Stylized Gesture Generation

Nada Elmasry (https://orcid.org/0009-0006-3694-1453), Yanbo Cheng (https://orcid.org/0009-0009-1684-0585) and Yingying Wang (https://orcid.org/0000-0002-5680-1929)
Department of Computing and Software, McMaster University, Hamilton, Canada

Keywords: Speech-Driven, Gesture Synthesis, Motion Stylization, Multimodal Coordination, Deep Learning, Diffusion Transformer.
Abstract:
Gestures are a vital component of human expression, playing a pivotal role in conveying information and emotions. Generating co-speech gestures remains challenging in human-computer interaction due to the intricate relationship between speech and gestures. While recent advances in learning-based methodologies have shown some progress, they still encounter limitations, such as a lack of diversity and a mismatch between generated gestures and the semantic and emotional context of speech, which impact the effectiveness of communication. In this work, we propose a novel gesture generation framework that takes speech audio and a target-style gesture example as inputs and automatically synthesizes new gesture performances that align with the speech in the desired style. Specifically, our framework comprises four main components: a dual-stream audio encoder, a gesture-style encoder, a cross-attention modality fusion module, and a latent diffusion generation module. The dual-stream audio encoder and the gesture style encoder extract modality-specific embeddings from the audio and motion inputs; the cross-attention fusion module maps the multi-modal embeddings into a unified latent space; and the diffusion module produces expressive and stylized gestures. The results demonstrate the strong performance of our method in generating natural and diversified gestures that accurately and coherently convey the intended information, surpassing the benchmarks established by traditional methods. Finally, we discuss future directions for our research.
1 INTRODUCTION
Gestures are integral components of human commu-
nication, functioning as co-expressive elements com-
plementing speech (David, 1992; McNeill, 2019).
They consist of non-verbal hand and arm movements
that enhance communication by synchronizing with
speech emphasis in time and matching speech con-
tent in semantics. Thus, gestures play a pivotal role in
conveying information, emotion and personality.
Co-speech gesture synthesis is crucial in devel-
oping lifelike conversational virtual characters, for
human-computer interaction, computer graphics, and
social robotics applications. However, the multi-
modal and multi-functional nature of gestures causes
great challenges in their automatic generation. Un-
like generic human motions, gesture performance is
not standalone, but part of multimodal conversational
behaviors dependent on speech and prosody. McNeill
(David, 1992) categorizes gestures into four types: beat, iconic, metaphoric, and deictic, each of which correlates with different prosodic or semantic aspects of speech. Thus, automatically generating realistic gestures of all categories that synchronize well with speech emphasis and match the spoken content is a
hard multi-modal coordination problem to solve. Another challenge is that gesture motions are free-form, expressive motions that do not follow a regular pattern like locomotion, and identities, personalities, and emotions can all have a significant impact on gesture performance styles. Recent research has utilized deep learning approaches to predict and produce gestures; however, the generated motion quality and diversity are still restricted by the data and network design, due to the spatiotemporal complexity of gesture performance.
In this work, we propose an example-based
stylistic co-speech gesture generation framework that
solves the aforementioned challenges. Our frame-
work takes speech audio and an example of tar-
get style gesture as input, and outputs novel gesture
performance in the specified style that matches the
speech. The gesture synthesis task is achieved by
four major components: a dual-stream audio encoder,
a gesture style encoder, a multimodal cross-attention
fusion module, and a diffusion-based gesture genera-
tor. First, the dual-stream audio encoder extracts ef-
fective acoustic embeddings from the speech input,
and the gesture style encoder extracts motion style
embeddings from the example gesture. The multi-
modal cross-attention module aligns the audio and
motion embeddings extracted from different modal-
ities through attention mechanisms, and fuses them into a unified latent space. The diffusion model takes the fused latent embedding and outputs diverse gesture performances.
Compared to existing gesture synthesis research,
our proposed framework has many advantages.
A single-shot gesture example is an efficient and feasible way of specifying the desired style. Incorporating style features addresses the challenge of capturing the vast combination of motion content and stylistic variations in human movement. Our dual-stream audio encoder extracts effective acoustic features from speech, and the subsequent cross-attention module captures the correspondence between the gesture style features and the speech audio features from different modalities. Our diffusion-based
gesture generator ensures the stylistic diversity in the
synthesized gestures. Preliminary results demonstrate
that our framework outperforms existing gesture syn-
thesis work under similar training conditions, gener-
ating expressive and context-appropriate gestures that
align with the given speech. We summarize the contributions of our work as follows:
• We propose a novel framework that takes a single-shot style example for synthesizing expressive gestures in desired styles matching the speech input;
• We introduce a dual-stream audio encoder that effectively extracts acoustic features from speech;
• We demonstrate a multi-modal cross-attention module for fusing the correlated features between speech and style;
• We present a latent diffusion-based gesture generator capable of synthesizing diverse stylistic gesture performances.
2 RELATED WORK
Rule-based Methods for Gesture Synthesis
Early gesture generation relied on rule-based
systems with manual speech-gesture mappings.
Cassell et al.'s Animated Conversation (Cassell et al., 1994) pioneered the automatic production of context-appropriate gestures, facial expressions, and intonation by integrating dialogue generation, text-to-speech, and symbolic representations. Thórisson's Ymir (Wei et al., 2022) enhanced this approach by
incorporating multimodal inputs—speech, gaze, ges-
ture, and intonation—through perception, dialogue,
decision-making, and action scheduling modules,
enabling more interactive animations. Further
advancements included Cassell et al.'s Behaviour Expression Animation Toolkit (BEAT) (Cassell et al., 2001), which synthesized nonverbal cues with customizable personalities; Kopp et al.'s Max (Kopp and Wachsmuth, 2002; Kopp et al., 2003), which generates complex gestures from XML specifications using non-uniform cubic B-splines; and Pelachaud et al.'s Greta (Pelachaud et al., 2002), a 3D virtual agent ex-
pressing emotions through a Belief-Desire-Intention
framework. The development of domain-specific
languages (DSLs) such as MURML (Kopp et al.,
2003), APML (De Carolis et al., 2004), and RRL
(Piwek et al., 2004) followed, although they were primarily XML-based and mutually incompatible. To resolve this, the Behavior Markup Language (BML) (Kopp et al., 2006; Vilhjálmsson et al., 2007) was created as
a comprehensive framework for intent and behavior
planning, becoming the standard for rule-based sys-
tems and integrating into platforms like SmartBody
and humanoid robots. Despite their ability to produce
synchronized gestures, rule-based systems are limited
by finite handcrafted rules and pre-recorded motions,
resulting in restricted motion diversity, scalability
challenges due to manual effort, and reliance on
explicit speech-gesture mappings based on text or
acoustic features.
Data-driven Statistical Gesture Generation
Researchers developed data-driven statistical models
for gesture synthesis to address the limitations
of rule-based methods, but these often relied on
curated gesture libraries and manual annotations,
limiting scalability and adaptability. Kipp used
ANVIL (Kipp, 2001) to annotate co-speech gestures,
modeling them based on features like handedness,
timing, and communicative function. Neff et al.
(Neff et al., 2008) created an animation lexicon to
generate gestures from text. Bergmann and Kopp
introduced Bayesian networks for transforming
speech into gestures (Bergmann and Kopp, 2009),
enhancing them with probabilistic and rule-based
components. Levine et al. employed hidden Markov models (HMMs) and conditional random fields (CRFs) (Levine et al., 2009) to select motion clips based on prosodic features and reduce overfitting. Chiu et al. developed the Hierarchical Factored Conditional Restricted Boltzmann Machine (HFCRBM) for
audio-based smooth gesture generation, and Yang et
al. (Yang et al., 2020) implemented statistical motion
graphs for synchronized body motions, enhancing diversity with stochastic search algorithms.
Figure 1: Framework overview, where four major components are illustrated, i.e., the dual-stream audio encoder, gesture style encoder, multi-modal cross-attention module, and diffusion-based gesture generator.
Deep Learning for Gesture Synthesis
Deep learning has significantly advanced co-speech
gesture generation by enabling the synthesis of
natural and diverse gestures from large datasets,
eliminating the need for manually designed lexi-
cons and mapping rules. Early approaches utilized
deterministic models such as Convolutional Neural
Networks (CNNs) (Habibie et al., 2021) and Re-
current Neural Networks (RNNs) (Liu et al., 2022;
Yoon et al., 2019; Yoon et al., 2020) to map speech
inputs directly to gesture sequences. While these
models improved the perceived naturalness and
appropriateness of generated gestures, they often
produced more averaged and less diverse outputs.
Generative models have emerged as a superior
alternative by introducing stochasticity into the
generation process, leading to more diverse and
human-like gestures. These approaches include Normalizing Flows, Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Vector Quantized VAEs (VQ-VAEs), and diffusion-based models, which have been employed to learn diverse and realistic gesture distributions. For instance, Ahuja
et al. (Ahuja et al., 2020) developed a Temporal
Convolutional Network (TCN) to create stylized
gestures, enhancing motion expressiveness; Yoon et
al. (Yoon et al., 2020) utilized adversarial networks
with multimodal information for gesture generation,
and Li et al. (Li et al., 2021) employed VAEs to
train generators with shared and motion-specific
latent spaces for coherent gesture sequences. Despite
these advancements, generative models often suffer
from low semantic alignment with speech input due
to the inherent many-to-many relationship between
speech and gestures. Recent approaches aim to
improve intent alignment with gesture prediction and
incorporate gesture styles for personalized synthesis.
Diffusion Models for Gesture Generation
Recently, diffusion-based models have advanced
gesture generation by leveraging stochastic diffusion
processes to learn data distributions, enhancing
flexibility and diversity. These models produce
gestures that are semantically or emotionally aligned
with input speech. Notable approaches include
DiffGesture, which employs a transformer-based
diffusion pipeline with annealed noise sampling for
temporal consistency (Zhu et al., 2023); GestureD-
iffuCLIP, which integrates latent-diffusion models
and CLIP-based conditioning for better control (Ao
et al., 2023); TalkSHOW, utilizing VQ-VAEs for
body and hand motions (Yi et al., 2023); and LDA, which provides style control using classifier-free guidance for diffusion models in both music-to-dance and speech-to-gesture synthesis (Alexanderson et al., 2023). Additionally, models
have been developed for predicting the movement
of multiple speakers in social settings (Tanke et al.,
2023), multi-modal diffusion for video and audio
generation (Ruan et al., 2023), and efficient omni-
modal representation learning paradigms (Lei et al.,
2023).
Despite these advancements, diffusion-based gen-
erative models still struggle to maintain semantic
alignment with speech due to the many-to-many re-
lationship between speech and gestures. Recent re-
search aims to improve intent alignment and incorpo-
rate personalized gesture styles.
3 METHOD
In this work, we propose a novel framework for gener-
ating diverse and stylized co-speech gestures through
diffusion. As illustrated in Figure 1, our frame-
work mainly consists of four components: a dual-
stream audio encoder (Sec. 3.1), a gesture style en-
coder (Sec. 3.2), a multi-modal cross-attention fusion
module (Sec. 3.3) and a diffusion-based gesture gen-
erator (Sec. 3.4). Given a speech input, the framework allows users to provide a one-shot gesture example to specify their desired target style. The dual-
stream audio encoder and the gesture style encoder
take the speech and the style example as input, and
project them to audio embeddings and style embed-
dings respectively. Instead of naively piecing the audio and style embeddings together, our multimodal
cross-attention module correlates the audio emphasis
and speech elements with the salient gesture style fea-
tures, and aligns embeddings from the two modalities
in one unified latent space. Lastly, the unified embed-
dings are passed to the diffusion-based gesture generator,
which synthesizes co-speech gesture performance in
the target style frame by frame in an auto-regressive
fashion. We discuss the details of each component
of our gesture generation framework in the following
sections.
3.1 Speech Encoding
Given a speech input, our framework employs a dual-
stream audio encoder to project the speech audio into
latent embeddings. Specifically, the audio input is
a sequence of $T$ frames in total. At each frame, a window of $N$ neighboring frames is cut into an
audio segment and fed to the audio encoder to ex-
tract its features. We propose to extract the audio
features from two streams: the raw audio input in the time domain, and the mel-spectrogram in the frequency domain.
Outputs from the two streams are then fused into
the sequence of audio embedding vectors denoted by
$A = [a_0, a_1, \ldots, a_{T-1}]$, where $A \in \mathbb{R}^{T \times D_a}$ and $D_a$ denotes the dimension of the audio embedding vector for each frame.
Time Domain Stream: A convolutional neural net-
work (CNN) designed to process raw audio features.
It consists of four 1D convolutional layers with pro-
gressively increasing channel sizes (64, 128, 256,
512) and kernel sizes (1, 3, 5, 7). Each convolutional
layer is followed by a GELU activation function and
dropout layers for regularization. The convolutional
layers are succeeded by three fully connected layers,
reducing the dimensionality to the target embedding
size.
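The paper specifies the channel and kernel sizes of this stream but not every hyperparameter, so the following PyTorch sketch fills in assumed defaults (padding, dropout rate, pooling strategy, and the output dimension embed_dim are assumptions):

```python
import torch
import torch.nn as nn

class TimeDomainEncoder(nn.Module):
    """Sketch of the time-domain stream: four 1D conv layers with increasing
    channels (64, 128, 256, 512) and kernel sizes (1, 3, 5, 7), each followed
    by GELU and dropout, then three fully connected layers."""
    def __init__(self, embed_dim=256, dropout=0.1):
        super().__init__()
        channels = [1, 64, 128, 256, 512]
        kernels = [1, 3, 5, 7]
        convs = []
        for i, k in enumerate(kernels):
            convs += [
                nn.Conv1d(channels[i], channels[i + 1], kernel_size=k, padding=k // 2),
                nn.GELU(),
                nn.Dropout(dropout),
            ]
        self.convs = nn.Sequential(*convs)
        self.pool = nn.AdaptiveAvgPool1d(1)          # collapse the time axis (assumed)
        self.mlp = nn.Sequential(                    # three FC layers down to embed_dim
            nn.Linear(512, 512), nn.GELU(),
            nn.Linear(512, 384), nn.GELU(),
            nn.Linear(384, embed_dim),
        )

    def forward(self, wav):                          # wav: (batch, samples)
        x = self.convs(wav.unsqueeze(1))             # (batch, 512, samples)
        x = self.pool(x).squeeze(-1)                 # (batch, 512)
        return self.mlp(x)                           # (batch, embed_dim)
```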
Frequency Domain Stream: Based on the Audio
Spectrogram Transformer (AST) architecture (Gong
et al., 2021), this encoder transforms the input spec-
trogram into embedded patches using a patch em-
bedding layer. Positional embeddings are added to
provide spatial context. The patches are processed
through multiple self-attention layers and feedfor-
ward networks, producing a refined spectrogram en-
coding.
Figure 2: Dual-stream Audio Encoder.
Stream Fusion Block: The fusion block integrates the outputs from the Speech Encoder and Audio Spectrogram Encoder by concatenating them. The combined vector is then normalized using layer normalization and processed through a multi-layer perceptron (MLP) with SiLU activation. This fusion allows the model to jointly learn from both time-domain (raw audio) and frequency-domain (spectrogram) representations, resulting in a comprehensive audio embedding.
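A minimal sketch of this fusion block, assuming the two stream outputs are already pooled to fixed-size vectors and assuming the hidden and output dimensions:

```python
import torch
import torch.nn as nn

class StreamFusion(nn.Module):
    """Sketch of the fusion block: concatenate the time-domain and spectrogram
    embeddings, apply layer normalization, and project through an MLP with SiLU."""
    def __init__(self, time_dim=256, spec_dim=768, out_dim=256):
        super().__init__()
        self.norm = nn.LayerNorm(time_dim + spec_dim)
        self.mlp = nn.Sequential(
            nn.Linear(time_dim + spec_dim, 512),
            nn.SiLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, time_emb, spec_emb):
        fused = torch.cat([time_emb, spec_emb], dim=-1)  # (batch, time_dim + spec_dim)
        return self.mlp(self.norm(fused))                # (batch, out_dim)
```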
3.2 Style Encoding
The style input processing module encodes desired
gesture characteristics, including motion type, pos-
ture, and expressiveness. It utilizes detailed animation
and joint data, encompassing joint local translations
and rotations, translational and rotational velocities,
and joint movements relative to the character’s root
transform (Ghorbani et al., 2023). These features cap-
ture both static and dynamic properties of gestures,
ensuring that the generated gestures are realistic and
stylistically accurate.
Each frame of the animation clip is represented
by a feature vector $a = [\rho_p, \rho_r, \dot{\rho}_p, \dot{\rho}_r, \dot{r}_p, \dot{r}_r]$, where $\rho_p \in \mathbb{R}^{3j}$ and $\rho_r \in \mathbb{R}^{6j}$ represent the joint local translations and rotations, $\dot{\rho}_p$ and $\dot{\rho}_r$ represent the joint local translational and rotational velocities, and $\dot{r}_p$ and $\dot{r}_r$ represent the character root translational and rotational velocities local to the character root transform. Here $j$ corresponds to the number of joints in the kinematic tree.
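For illustration, a hedged sketch of assembling this per-frame feature vector from precomputed joint data; the array layouts, the 6D rotation representation implied by $\rho_r \in \mathbb{R}^{6j}$, and the use of finite differences for the joint velocities are assumptions:

```python
import numpy as np

def frame_features(pos, rot6d, root_vel, root_ang_vel, fps=60):
    """Sketch: assemble a = [rho_p, rho_r, rho_p_dot, rho_r_dot, r_p_dot, r_r_dot].

    pos:          (T, J, 3) joint local translations
    rot6d:        (T, J, 6) joint local rotations (assumed 6D representation)
    root_vel:     (T, 3)    root translational velocity in root space
    root_ang_vel: (T, 3)    root rotational velocity in root space
    """
    dt = 1.0 / fps
    vel_p = np.gradient(pos, dt, axis=0)     # joint translational velocities
    vel_r = np.gradient(rot6d, dt, axis=0)   # joint rotational velocities
    T = pos.shape[0]
    return np.concatenate([
        pos.reshape(T, -1),                  # rho_p  in R^{3j}
        rot6d.reshape(T, -1),                # rho_r  in R^{6j}
        vel_p.reshape(T, -1),
        vel_r.reshape(T, -1),
        root_vel,
        root_ang_vel,
    ], axis=-1)                              # (T, feature_dim)
```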
Inspired by attention mechanisms and the variational autoencoder (VAE) (Vaswani, 2017; Kingma, 2013), the Style Encoder transforms a reference style animation clip into a low-dimensional embedding vector that encodes the stylistic properties of the gestures. The VAE samples the style embeddings from a multivariate Gaussian distribution. The extracted style sequence is then processed through convolutional layers and an attention-based feed-forward transformer network to produce the style embedding vector $e$.
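A minimal PyTorch sketch of such a style encoder, assuming specific layer counts, widths, and mean/log-variance heads for the Gaussian sampling (none of which are given in the text):

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Sketch of the VAE-style gesture style encoder: convolutional layers over
    the frame features, a transformer block, then reparameterized sampling."""
    def __init__(self, feat_dim, hidden=256, style_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2), nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.mu = nn.Linear(hidden, style_dim)
        self.logvar = nn.Linear(hidden, style_dim)

    def forward(self, motion):                      # motion: (batch, frames, feat_dim)
        x = self.conv(motion.transpose(1, 2)).transpose(1, 2)
        x = self.transformer(x).mean(dim=1)         # temporal pooling
        mu, logvar = self.mu(x), self.logvar(x)
        e = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return e, mu, logvar
```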
3.3 Multimodal Cross-Attention Fusion
The Cross-Attention Fusion Network integrates the
audio and style embeddings to enable the generator to produce semantically and stylistically coherent gestures. Leveraging multi-head attention mechanisms,
the cross-attention network captures the intricate relationships between speech and style. The audio and style embeddings are first concatenated. Multi-head attention layers are then applied to capture interactions between these modalities, as shown in Figure 3. A feedforward network further processes the fused embeddings to the target embedding size, enabling the model to produce high-quality, context-aware latent embeddings for gesture generation.
Figure 3: Multimodal Cross-Attention Fusion Network.
Let $F_A$ represent the deep features extracted from the speech encoder, and $F_S$ represent the deep features from the style encoder. The joint feature representation is obtained by concatenating $F_S$ and $F_A$, followed by a transformer encoder with self-attention mechanisms. Cross-attention layers are subsequently applied to share context between the audio and style features, producing a mixed encoding that informs the final gesture output.
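The following sketch illustrates one plausible reading of this fusion design; the number of heads, the layer widths, and the pooling used to obtain a single conditioning vector are assumptions:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch: self-attention over the concatenated audio/style tokens, then
    cross-attention from audio queries to style keys/values, then an FFN."""
    def __init__(self, dim=256, heads=4, out_dim=256):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.self_attn = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, out_dim))

    def forward(self, f_a, f_s):
        # f_a: (batch, T_a, dim) audio features; f_s: (batch, T_s, dim) style features
        joint = self.self_attn(torch.cat([f_a, f_s], dim=1))
        ctx, _ = self.cross_attn(query=f_a, key=f_s, value=f_s)  # share context across modalities
        return self.ffn(joint.mean(dim=1) + ctx.mean(dim=1))     # fused conditioning vector c
```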
3.4 Gesture Diffusion
Our gesture generation method leverages a diffusion
transformer model operating within a pose feature
space to synthesize realistic and contextually appro-
priate gestures. During training, we employ a forward
diffusion process that incrementally adds Gaussian
noise to the initial pose sequence representation $x_0$, resulting in a sequence of progressively noisier pose representations $\{x_t\}_{t=1}^{T}$ that approximate a standard normal distribution $\mathcal{N}(0, I)$. This process is defined by Eq. (1), where $\beta_t$ is a predefined variance schedule:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\right), \quad (1)$$

The cumulative effect over $t$ timesteps can be expressed directly in terms of $x_0$ in Eq. (2), with $\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)$:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1-\bar{\alpha}_t) I\right), \quad (2)$$
In the reverse diffusion process, our model learns to
recover the original pose sequence from the noisy in-
put by estimating the noise added at each timestep.
Figure 4: Our Diffusion Transformer architecture based on the adaLN-Zero architecture introduced by (Peebles and Xie, 2023).
The denoising model $\varepsilon_\theta$ predicts the noise given the noisy pose $x_t$, the timestep $t$, and the conditioning information $c$:

$$\hat{\varepsilon} = \varepsilon_\theta(x_t, t, c). \quad (3)$$

The conditioning information $c$ is represented by the output embedding of our multimodal cross-attention fusion network. The model is trained by minimizing the mean squared error between the predicted noise and the actual noise added during the forward process:

$$\mathcal{L}_{LD} = \mathbb{E}_{x_0, t, \varepsilon}\left[ \left\| \varepsilon - \varepsilon_\theta(x_t, t, c) \right\|^2 \right], \quad (4)$$

where $\varepsilon \sim \mathcal{N}(0, I)$ and $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$.
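To make Eqs. (2) and (4) concrete, a hedged training-step sketch in PyTorch; the denoiser `model` stands in for the Diffusion Transformer and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, x0, c, alpha_bar):
    """One training step for the noise-prediction objective in Eq. (4).
    x0:        (batch, frames, pose_dim) clean pose features
    c:         (batch, cond_dim) fused audio/style conditioning
    alpha_bar: (T,) cumulative products of (1 - beta_t)"""
    batch = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (batch,), device=x0.device)
    a_bar = alpha_bar[t].view(batch, 1, 1)
    eps = torch.randn_like(x0)
    # Closed-form forward process, Eq. (2): x_t = sqrt(a_bar) x0 + sqrt(1 - a_bar) eps
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    eps_pred = model(x_t, t, c)            # denoiser predicts the injected noise
    return F.mse_loss(eps_pred, eps)       # L_LD, Eq. (4)
```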
During inference, we generate gestures by starting
from random noise $x_T \sim \mathcal{N}(0, I)$ and iteratively applying the reverse diffusion steps using the Denoising Diffusion Probabilistic Model (DDPM) sampling algorithm (Ho et al., 2020) to obtain the denoised pose sequence $x_0$. At each timestep $t$, the model predicts the noise to be removed, guided by the conditioning information $c$.
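A compact sketch of this sampling loop, using the standard DDPM ancestral update from Ho et al. (2020); the fixed variance choice of beta_t per step is an assumption:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, c, shape, betas):
    """Sketch of DDPM ancestral sampling: start from Gaussian noise and
    iteratively remove the noise predicted by the conditioned denoiser."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=betas.device)            # x_T ~ N(0, I)
    for t in reversed(range(betas.shape[0])):
        t_batch = torch.full((shape[0],), t, device=betas.device, dtype=torch.long)
        eps = model(x, t_batch, c)                          # predicted noise
        mean = (x - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)
        else:
            x = mean
    return x                                                # denoised pose sequence x_0
```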
4 IMPLEMENTATION
4.1 Dataset and Data Preprocessing
We train and evaluate our system using the Ze-
roEGGS dataset (Ghorbani et al., 2023), which com-
prises full-body motion capture and synchronized au-
dio recordings from a single English-speaking fe-
male actor performing 67 monologues across 19 dis-
tinct gesture styles. These styles range from posture-
focused categories like "Tired" and "Oration" to intri-
cate hand and head movements, ensuring a wide vari-
ety of gesture types for training context-appropriate models. The dataset includes 135 minutes of data recorded at 60 frames per second, represented by a 75-joint skeletal model that captures detailed hand and finger movements, providing high fidelity for gesture generation tasks.
Figure 5: Visual result comparison.
To augment the data, we mirrored all anima-
tion sequences, effectively doubling the training data.
Head orientation was processed by projecting the
head z-axis direction onto the ground plane and com-
puting the median to establish a global target facing
direction for each sequence, which is set to the global
z-axis during runtime. The style labels, based on actor
instructions, may differ from external annotations, in-
troducing subjectivity that is considered during train-
ing and analysis. Preprocessing steps include nor-
malizing skeleton data to ensure consistent joint po-
sitioning, downsampling audio to 16kHz for compat-
ibility with the speech encoder, and extracting Mel-
frequency cepstral coefficients (MFCCs) and energy
per frame to represent speech content.
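As an illustration of this audio preprocessing, a hedged torchaudio sketch; the MFCC count, FFT size, and hop length (chosen here to roughly match 60 fps motion frames at 16 kHz) are assumptions not stated in the paper:

```python
import torch
import torchaudio

def preprocess_audio(wav, sr, target_sr=16000, n_mfcc=20, hop=267):
    """Sketch: resample to 16 kHz, then extract MFCCs and per-frame energy."""
    if sr != target_sr:
        wav = torchaudio.transforms.Resample(sr, target_sr)(wav)
    mfcc = torchaudio.transforms.MFCC(
        sample_rate=target_sr, n_mfcc=n_mfcc,
        melkwargs={"n_fft": 1024, "hop_length": hop, "n_mels": 80},
    )(wav)                                              # (..., n_mfcc, frames)
    frames = wav.unfold(-1, 1024, hop)                  # frame the waveform
    energy = frames.pow(2).mean(dim=-1)                 # per-frame energy
    return mfcc, energy
```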
4.2 Implementation Details
Our gesture generation model is implemented using
PyTorch and trained on an NVIDIA RTX 3060 GPU
with a batch size of 32 and an initial learning rate of
0.0001. We utilize the RAdam optimizer for its adap-
tive learning rate properties and an exponential learn-
ing rate scheduler to promote faster convergence and
better generalization.
For audio feature extraction, our speech en-
coder comprises a custom CNN-based encoder that
processes the energy and log-amplitude of mel-
spectrograms, alongside a pretrained Audio Spectro-
gram Transformer (AST) that extracts additional mel-
spectrogram features. These features are fused to
form a single feature vector representing each speech
segment. Gesture style data is encoded using an
attention-based style encoder, which captures general
features from a reference animation style sample clip
with a dynamic window length between 256 and 512
frames, sampled from the same animation clip as the
target sequence. Our diffusion-based gesture gener-
ator employs 1000 diffusion timesteps with a linear
variance schedule ranging from $\beta_1 = 1 \times 10^{-4}$ to 0.1,
and the hidden dimension of all transformer layers in
the Diffusion Transformer (DiT) is set to 1024.
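A short sketch of this training configuration; the placeholder module standing in for the DiT, the scheduler decay rate, and the exact schedule construction are assumptions beyond the reported values (batch size 32, initial learning rate 1e-4, 1000 timesteps, beta from 1e-4 to 0.1):

```python
import torch
import torch.nn as nn
from torch.optim import RAdam
from torch.optim.lr_scheduler import ExponentialLR

# `model` is a placeholder standing in for the Diffusion Transformer; the
# exponential decay rate is an assumption (the paper does not report it).
model = nn.Linear(1024, 1024)                          # placeholder for the DiT
optimizer = RAdam(model.parameters(), lr=1e-4)         # batch size 32, initial lr 1e-4
scheduler = ExponentialLR(optimizer, gamma=0.999)      # exponential lr decay

# Linear variance schedule over 1000 timesteps: beta_1 = 1e-4 up to 0.1
betas = torch.linspace(1e-4, 0.1, steps=1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)          # used in Eq. (2)
```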
5 RESULTS
5.1 Training Loss
The training process maximizes the Evidence Lower
Bound (ELBO) of the gesture motion’s log-likelihood
given a speech sequence by minimizing the negative
ELBO, which serves as the training loss. The total
loss is defined as:
$$\mathcal{L} = -\mathbb{E}_{q(z|e)}[\log p(Y \mid S, z)] + D_{KL}(q(z \mid e) \,\|\, p(z)) = \mathcal{L}_{recon} + D_{KL}(q(z \mid e) \,\|\, p(z)) + \mathcal{L}_{LD} \quad (5)$$
Reconstruction Loss. $\mathcal{L}_{recon}$ evaluates how accurately the model reconstructs the target gesture sequence from the speech and style embeddings. It is composed of:

$$\mathcal{L}_{recon} = \lambda_p \mathcal{L}_p + \lambda_r \mathcal{L}_r + \lambda_{vp} \mathcal{L}_{vp} + \lambda_{vr} \mathcal{L}_{vr} + \lambda_{dp} \mathcal{L}_{dp} + \lambda_{dr} \mathcal{L}_{dr} + \lambda_f \mathcal{L}_f \quad (6)$$
where:
• $\mathcal{L}_p$ and $\mathcal{L}_r$: Mean Absolute Error (MAE) for joint positions and rotations, ensuring pose accuracy.
• $\mathcal{L}_{vp}$ and $\mathcal{L}_{vr}$: MAE for joint translational and rotational velocities, promoting smooth motion.
• $\mathcal{L}_{dp}$ and $\mathcal{L}_{dr}$: MAE of velocities computed via finite differences, enhancing motion smoothness.
• $\mathcal{L}_f$: MAE for the facing direction in world space, preventing rotational drift.
The weights $\lambda_p$, $\lambda_r$, $\lambda_{vp}$, $\lambda_{vr}$, $\lambda_{dp}$, $\lambda_{dr}$, and $\lambda_f$ balance each loss component and are determined empirically during training.
Regularization Term. $D_{KL}(q(z \mid e) \,\|\, p(z))$ measures the Kullback–Leibler divergence between the posterior distribution $q(z \mid e)$ from the style encoder and the prior $p(z)$, a standard Gaussian. This encourages the latent space to resemble the prior, preventing overfitting and enhancing generalization.
Cost Annealing. Cost annealing gradually increases the weight of the regularization term during training, stabilizing the learning process and promoting a meaningful latent space.
Diffusion Loss. $\mathcal{L}_{LD}$ is the standard noise estimation loss used in diffusion models (Ho et al., 2020):

$$\mathcal{L}_{LD} = \mathbb{E}_{x_0, t, \varepsilon}\left[ \left\| \varepsilon - \varepsilon_\theta(x_t, t, c) \right\|^2 \right] \quad (7)$$
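Putting the pieces of Eq. (5) together, a hedged sketch of the total loss; the component keys, the use of L1 loss for MAE, and the annealed KL weight passed in as `kl_weight` are assumptions about details the text leaves open:

```python
import torch
import torch.nn.functional as F

def total_loss(pred, target, mu, logvar, diffusion_loss, weights, kl_weight):
    """Sketch of Eq. (5): weighted reconstruction terms (Eq. (6)), a KL
    regularizer with an annealed weight, and the diffusion loss."""
    recon = sum(
        weights[k] * F.l1_loss(pred[k], target[k])        # MAE per component
        for k in ("pos", "rot", "vel_pos", "vel_rot", "dpos", "drot", "facing")
    )
    # KL divergence between N(mu, sigma^2) and the standard Gaussian prior
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl + diffusion_loss
```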
5.2 Qualitative Results
Figure 5 displays the gesture generation results for
a sample from the ZeroEGGS dataset. The top fig-
ures illustrate gestures produced by our framework,
while the bottom figure shows those generated by
the ZeroEGGS (Ghorbani et al., 2023) model. Our
model effectively captures emotional and semantic
cues, demonstrating the desired style with consistent
motion and appropriate emphasis in threatening ges-
tures. In contrast, ZeroEGGS fails to accurately cap-
ture the style, resulting in average gestures with no-
ticeable repetitiveness. More qualitative results for our approach with different styles are available in the video submission.
6 CONCLUSIONS
In this study, we introduced an example-based stylis-
tic co-speech gesture generation framework that ef-
fectively produces expressive gestures aligned with
speech and desired styles. The framework com-
bines a dual-stream audio encoder, a gesture style
encoder, a multimodal cross-attention fusion mod-
ule, and a diffusion-based gesture generator to create
high-quality and diverse gesture performances. Qualitative results show that the model outperforms benchmark systems by generating gestures that are contextually appropriate, coherent, and realistic. Fu-
ture work aims to enhance the framework’s robust-
ness and scalability by evaluating it on larger, more
diverse datasets and benchmarking against state-of-
the-art methods. We also plan to address limitations
in generating specific styles—such as laughter and el-
derly gestures—to improve the model’s generaliza-
tion capabilities, conduct user studies to validate the
naturalness of the gestures, and optimize the infer-
ence process by reducing the number of required seed
frames for faster, near real-time gesture generation.
REFERENCES
Ahuja, C., Lee, D. W., Nakano, Y. I., and Morency, L.-P.
(2020). Style transfer for co-speech gesture anima-
tion: A multi-speaker conditional-mixture approach.
In Computer Vision–ECCV 2020: 16th European
Conference, Glasgow, UK, August 23–28, 2020, Pro-
ceedings, Part XVIII 16, pages 248–265. Springer.
Alexanderson, S., Nagy, R., Beskow, J., and Henter, G. E.
(2023). Listen, denoise, action! audio-driven motion
synthesis with diffusion models. ACM Transactions
on Graphics (TOG), 42(4):1–20.
Ao, T., Zhang, Z., and Liu, L. (2023). Gesturediffuclip:
Gesture diffusion model with clip latents. ACM Trans-
actions on Graphics (TOG), 42(4):1–18.
Bergmann, K. and Kopp, S. (2009). Increasing the expres-
siveness of virtual agents: autonomous generation of
speech and gesture for spatial description tasks. In
AAMAS (1), pages 361–368.
Cassell, J., Pelachaud, C., Badler, N., Steedman, M.,
Achorn, B., Becket, T., Douville, B., Prevost, S., and
Stone, M. (1994). Animated conversation: rule-based
generation of facial expression, gesture & spoken in-
tonation for multiple conversational agents. In Pro-
ceedings of the 21st annual conference on Computer
graphics and interactive techniques, pages 413–420.
Cassell, J., Vilhjálmsson, H. H., and Bickmore, T. (2001).
Beat: the behavior expression animation toolkit. In
Proceedings of the 28th annual conference on Com-
puter graphics and interactive techniques, pages 477–
486.
David, M. (1992). Hand and mind: What gestures reveal about thought. University of Chicago Press.
De Carolis, B., Pelachaud, C., Poggi, I., and Steedman, M.
(2004). Apml, a markup language for believable be-
havior generation. Life-like characters: tools, affec-
tive functions, and applications, pages 65–85.
Ghorbani, S., Ferstl, Y., Holden, D., Troje, N. F., and
Carbonneau, M.-A. (2023). Zeroeggs: Zero-shot
example-based gesture generation from speech. In
Computer Graphics Forum, volume 42, pages 206–
216. Wiley Online Library.
Gong, Y., Chung, Y.-A., and Glass, J. (2021). Ast:
Audio spectrogram transformer. arXiv preprint
arXiv:2104.01778.
Habibie, I., Xu, W., Mehta, D., Liu, L., Seidel, H.-P.,
Pons-Moll, G., Elgharib, M., and Theobalt, C. (2021).
Learning speech-driven 3d conversational gestures
from video. In Proceedings of the 21st ACM Interna-
tional Conference on Intelligent Virtual Agents, pages
101–108.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion
probabilistic models. Advances in neural information
processing systems, 33:6840–6851.
Kingma, D. P. (2013). Auto-encoding variational bayes.
arXiv preprint arXiv:1312.6114.
Kipp, M. (2001). Anvil-a generic annotation tool for mul-
timodal dialogue. In Seventh European conference on
speech communication and technology. Citeseer.
Kopp, S., Jung, B., Lessmann, N., and Wachsmuth, I.
(2003). Max-a multimodal assistant in virtual reality
construction. KI, 17(4):11.
Kopp, S., Krenn, B., Marsella, S., Marshall, A. N.,
Pelachaud, C., Pirker, H., Thórisson, K. R., and Vilhjálmsson, H. (2006). Towards a common
framework for multimodal generation: The behavior
markup language. In Intelligent Virtual Agents: 6th
International Conference, IVA 2006, Marina Del Rey,
CA, USA, August 21-23, 2006. Proceedings 6, pages
205–217. Springer.
Kopp, S. and Wachsmuth, I. (2002). Model-based anima-
tion of co-verbal gesture. In Proceedings of Computer
Animation 2002 (CA 2002), pages 252–257. IEEE.
Lei, W., Ge, Y., Yi, K., Zhang, J., Gao, D., Sun, D., Ge,
Y., Shan, Y., and Shou, M. Z. (2023). Vit-lens-2:
Gateway to omni-modal intelligence. arXiv preprint
arXiv:2311.16081.
Levine, S., Theobalt, C., and Koltun, V. (2009). Real-time
prosody-driven synthesis of body language. In ACM
SIGGRAPH Asia 2009 papers, pages 1–10.
Li, J., Kang, D., Pei, W., Zhe, X., Zhang, Y., He, Z., and
Bao, L. (2021). Audio2gestures: Generating diverse
gestures from speech audio with conditional varia-
tional autoencoders. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
11293–11302.
Liu, X., Wu, Q., Zhou, H., Xu, Y., Qian, R., Lin, X., Zhou,
X., Wu, W., Dai, B., and Zhou, B. (2022). Learn-
ing hierarchical cross-modal association for co-speech
gesture generation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 10462–10472.
McNeill, D. (2019). Gesture and thought. University of
Chicago press.
Neff, M., Kipp, M., Albrecht, I., and Seidel, H.-P. (2008).
Gesture modeling and animation based on a proba-
bilistic re-creation of speaker style. ACM Transactions
On Graphics (TOG), 27(1):1–24.
Peebles, W. and Xie, S. (2023). Scalable diffusion models
with transformers. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
4195–4205.
Pelachaud, C., Carofiglio, V., De Carolis, B., de Rosis, F.,
and Poggi, I. (2002). Embodied contextual agent in
information delivering application. In Proceedings of
the first international joint conference on Autonomous
agents and multiagent systems: part 2, pages 758–
765.
Piwek, P., Krenn, B., Schröder, M., Grice, M., Baumann, S.,
and Pirker, H. (2004). Rrl: A rich representation lan-
guage for the description of agent behaviour in neca.
arXiv preprint cs/0410022.
Ruan, L., Ma, Y., Yang, H., He, H., Liu, B., Fu, J., Yuan,
N. J., Jin, Q., and Guo, B. (2023). Mm-diffusion:
Learning multi-modal diffusion models for joint au-
dio and video generation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 10219–10228.
Tanke, J., Zhang, L., Zhao, A., Tang, C., Cai, Y., Wang,
L., Wu, P.-C., Gall, J., and Keskin, C. (2023). Social
diffusion: Long-term multiple human motion antici-
pation. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 9601–9611.
Vaswani, A. (2017). Attention is all you need. Advances in
Neural Information Processing Systems.
Vilhjálmsson, H., Cantelmo, N., Cassell, J., E. Chafai, N.,
Kipp, M., Kopp, S., Mancini, M., Marsella, S., Mar-
shall, A. N., Pelachaud, C., et al. (2007). The behav-
ior markup language: Recent developments and chal-
lenges. In Intelligent Virtual Agents: 7th International
Conference, IVA 2007 Paris, France, September 17-
19, 2007 Proceedings 7, pages 99–111. Springer.
Wei, Y., Hu, D., Tian, Y., and Li, X. (2022). Learning in
audio-visual context: A review, analysis, and new per-
spective. arXiv preprint arXiv:2208.09579.
Yang, Y., Yang, J., and Hodgins, J. (2020). Statistics-based
motion synthesis for social conversations. In Com-
puter Graphics Forum, volume 39, pages 201–212.
Wiley Online Library.
Yi, H., Liang, H., Liu, Y., Cao, Q., Wen, Y., Bolkart, T.,
Tao, D., and Black, M. J. (2023). Generating holistic
3d human motion from speech. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 469–480.
Yoon, Y., Cha, B., Lee, J.-H., Jang, M., Lee, J., Kim, J., and
Lee, G. (2020). Speech gesture generation from the
trimodal context of text, audio, and speaker identity.
ACM Transactions on Graphics (TOG), 39(6):1–16.
Yoon, Y., Ko, W.-R., Jang, M., Lee, J., Kim, J., and
Lee, G. (2019). Robots learn social skills: End-to-
end learning of co-speech gesture generation for hu-
manoid robots. In 2019 International Conference on
Robotics and Automation (ICRA), pages 4303–4309.
IEEE.
Zhu, L., Liu, X., Liu, X., Qian, R., Liu, Z., and Yu, L.
(2023). Taming diffusion models for audio-driven
co-speech gesture generation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 10544–10553.