ODPG: Outfitting Diffusion with Pose Guided Condition
Seohyun Lee
, Jintae Park
and Sanghyeok Park
Korea University, Seoul, South Korea
{happy8825, sonjt00, huanhuan1235}@korea.ac.kr
These authors contributed equally.
Keywords:
Computer Vision, Diffusion, Virtual Try-On, Bias-Augmented Query Attention, Multi-Conditioned Gen-
eration, End-to-End, Implicit Garment Integration, Dynamic Pose Alignment, Data-Efficient, Non-Warped
Synthesis.
Abstract:
Virtual Try-On (VTON) technology allows users to visualize how clothes would look on them without phys-
ically trying them on, gaining traction with the rise of digitalization and online shopping. Traditional VTON
methods, often using Generative Adversarial Networks (GANs) and Diffusion models, face challenges in
achieving high realism and handling dynamic poses. This paper introduces Outfitting Diffusion with Pose
Guided Condition (ODPG), a novel approach that leverages a latent diffusion model with multiple condition-
ing inputs during the denoising process. By transforming garment, pose, and appearance images into latent
features and integrating these features in a UNet-based denoising model, ODPG achieves non-explicit synthe-
sis of garments on dynamically posed human images. Our experiments on the FashionTryOn and a subset of
the DeepFashion dataset demonstrate that ODPG generates realistic VTON images with fine-grained texture
details across various poses, utilizing an end-to-end architecture without the need for explicit garment warping
processes. Future work will focus on generating VTON outputs in video format and on applying our attention
mechanism, as detailed in the Method section, to other domains with limited data.
1 INTRODUCTION
Virtual Try-On (VTON) technology allows users to
visualize how clothes would look on them without
physically trying them on. Recently, VTON technol-
ogy has seen a significant surge in demand, driven by
advancements in digitalization and the growing pref-
erence for online shopping experiences. The VTON
process involves sophisticated image synthesis tech-
niques that map clothing items onto images of users,
creating a realistic visual representation.
Some previous popular approaches in VTON uti-
lized Generative Adversarial Networks (GANs), with
a generator network creating realistic clothing im-
ages on users and a discriminator network evaluating
the authenticity of these images (Goodfellow et al.,
2014). The generator improves over time by learn-
ing to produce images that the discriminator cannot distinguish from real ones, enabling the generation of high-
quality and realistic try-on images. However, the use
of GANs in VTON faces significant challenges. One
major issue is the mode collapse problem, where sam-
ples generated by GANs lack diversity even when
trained on multimodal data. This leads to a critical issue in the VTON task, as GANs continuously generate similar
results. Additionally, the adversarial nature of GAN
training can lead to unstable synthesized results, as
the generator and discriminator oscillate rather than
converge to a stable point. This instability is exacer-
bated when the discriminator becomes too powerful,
causing vanishing gradients and halting the genera-
tor’s updates.
Therefore, recent VTON works prefer diffusion-
based methods (Gou et al., 2023) (Kim et al., 2023),
which synthesize more realistic images through a se-
ries of denoising steps. Stable Diffusion (SD) (Lu
et al., 2024) (Rombach et al., 2022) is the most fre-
quently used method in the diffusion-based VTON
domain. However, a notable constraint arises
when using Stable Diffusion directly for VTON appli-
cations. Stable Diffusion generates images through a
process of denoising guided by text information. Due
to the difference in information densities between lan-
guage and vision, the produced images often exhibit
ambiguity. Textual descriptions may not capture the
fine details and complexities of visual appearances,
leading to images that do not accurately reflect the in-
tended appearance.
To address these challenges, we introduce Outfitting Diffusion with Pose Guided Condition (ODPG),
a method that first transforms the target garment image into latent features using a VAE and a Swin Transformer. We then incorporate these features as conditions in the denoising process, enabling non-explicit synthesis of the garment onto the source human image. In this way, we generate VTON images based on dynamic poses by incorporating additional clothing latent conditions, unlike traditional VTON models, which often produce images with reduced realism and are limited to specific static poses.
The main contributions of our ODPG method are:
Multi-Image-Conditioned Controllable Gener-
ation for VTON: We introduce a novel model for
VTON, designed to capture a rich semantic un-
derstanding of the target garment, target pose, and
source appearance. Our architecture enhances the
generation process by embedding a target garment
image as an additional condition in the denoising
process.
VTON Without Explicit Garment Warping:
Unlike traditional VTON models that rely on
explicit garment warping processes, our ap-
proach integrates garment information implicitly
throughout the denoising process. This allows for
seamless garment alignment with various human
poses and appearances without the need for warp-
ing.
Bias-Augmented Query Attention Mechanism:
We propose a bias-augmented query attention
mechanism that allows the model to learn com-
plex relationships between the source appearance,
target pose, and target garment. This mecha-
nism enhances the model’s ability to generalize to
new combinations of appearances, poses, and gar-
ments, thereby reducing the need for a four-paired
dataset.
End-to-End Dynamic VTON Model: We
present a novel end-to-end VTON model that in-
corporates pose transfer in a single unified train-
ing process. Unlike existing approaches that rely
on multi-stage pipelines, this architecture simpli-
fies the training process while maintaining high-
quality synthesis results.
2 RELATED WORK
2.1 Controllable Diffusion Models
Given the challenges with GAN-based methods
(Goodfellow et al., 2014), which struggle with in-
stability and difficulty in generating high-quality im-
ages in a single pass, diffusion models (Ho et al.,
2020a) have become a preferred alternative for high-
resolution image synthesis. These models work by
progressively creating realistic images through mul-
tiple denoising steps. Advances in controllable dif-
fusion models, such as Stable Diffusion, have intro-
duced additional control signals during the data gen-
eration process. This innovation allows for more ac-
curate and targeted data generation, enabling users
to control specific attributes of the output. This in-
creased flexibility and utility extend to various appli-
cations, including the VTON domain.
Therefore, we aim to achieve realistic and con-
trollable VTON results by incorporating the source
human image, target garment image, and target pose
image as conditions into the denoising stages of the
controllable diffusion model architecture.
2.2 Pose-Guided Person Image
Synthesis
Pose-Guided Person Image Synthesis is a technique
where a new image of a person in a different pose
is generated from an existing image. This task was
initially introduced by Ma et al., who generated new
images of individuals in different poses using an ad-
versarial approach to enhance image quality.
Early techniques focused on separating pose and
appearance features by learning pose-irrelevant information, and consequently struggled with complex tex-
tures and detailed appearances. To address this issue,
auxiliary information like parsing maps and heatmaps
was incorporated, improving generation quality by
breaking down images into semantically meaningful
parts. Recent advancements have further refined spa-
tial alignment between pose and appearance using dif-
fusion models like PIDM (Bhunia et al., 2023) and
PoCoLD (Han et al., 2023), which mitigate the in-
stability of GANs and enhance high-resolution image
synthesis. These advanced methods focus on spatial
correspondence through cross-attention mechanisms,
though they risk overfitting by merely aligning source
appearance to the target pose.
Our approach, however, is more efficient and end-
to-end, freezing most parameters and avoiding re-
liance on image-text pairs, making it more adaptable
for dynamic VTON tasks.
2.3 Coarse-to-Fine Latent Diffusion for
Pose-Guided Person Image
Synthesis
PIDM (Bhunia et al., 2023) and PoCoLD (Han et al.,
2023) enhance the alignment between the source im-
age and the target pose to generate more realistic tex-
tures, but they struggle to achieve a high-level seman-
tic understanding of human images. This shortfall of-
ten results in overfitting and poor generalization, es-
pecially when synthesizing exaggerated poses that de-
viate significantly from the source image or are un-
common in the training data.
Lu et al. address these issues with a Coarse-to-
Fine Latent Diffusion (CFLD) (Lu et al., 2024) ap-
proach, which integrates a perception-refined decoder
and a hybrid-granularity attention module to improve
semantic understanding and texture detail.
The perception-refined decoder captures the perception of the source human image by randomly initializing sets of learnable queries and refining them progressively through successive decoder layers. The output produced by the decoder is regarded as the coarse-grained prompt for the source image description, which focuses on the common semantics shared across all the different person images, such as gender and age.
The hybrid-granularity attention module effectively
encodes multi-scale fine-grained human appearance
features as bias terms. These bias terms then aug-
ment the coarse-grained prompt generated by the
Perception-Refined Decoder. This method helps with
the alignment of the source image and the target pose
by using only the necessary fine-grained details of the
coarse-grained prompt. This enhances the quality of
the output images by improving generalization.
Unlike previous methods, this novel architecture
does not depend on image-caption pairs, making it
more adaptable for Pose-Guided Person Image Syn-
thesis tasks.
3 METHOD
3.1 Preliminary
Our approach leverages the principles of latent dif-
fusion models to achieve high-quality image gener-
ation. The framework consists of two main compo-
nents: a Variational Autoencoder (VAE) and a UNet-
based denoising model. The VAE is responsible for
transforming images from the raw pixel space into a
lower-dimensional latent space (Kingma et al., 2014).
This transformation facilitates efficient manipulation
and processing of the images.
In the diffusion model, the generation process in-
volves a sequence of steps where noise is progres-
sively added to a latent variable, and then gradually
removed to produce a clean image. This process is in-
spired by the Denoising Diffusion Probabilistic Mod-
els (DDPM) (Ho et al., 2020b). The forward diffusion
process adds Gaussian noise to the initial latent variable $z_0$ at each timestep $t$, resulting in a noisy latent $z_t$. The relationship is given by:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, \quad (1)$$

where $\varepsilon \sim \mathcal{N}(0, I)$ represents the Gaussian noise, and $\bar{\alpha}_t$ is a variance schedule controlling the amount of noise added at each step. The reverse process uses a neural network to predict and subtract the noise, effectively denoising the latent variables to reconstruct the image.
The UNet-based model (Ronneberger et al., 2015)
plays a crucial role in this denoising process. It is
trained to predict the noise present in the noisy latents $z_t$, conditioned on the timestep $t$. The objective is to minimize the mean squared error between the predicted noise and the actual noise, formalized as:

$$\mathcal{L}_{mse} = \mathbb{E}_{z_0, \varepsilon, t}\left\| \varepsilon - \varepsilon_\theta(z_t, t) \right\|^2. \quad (2)$$
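As a minimal sketch of Eqs. (1) and (2), the following PyTorch snippet (where `eps_model` is a placeholder for the UNet denoiser and `alpha_bar` is an assumed precomputed cumulative noise schedule) shows how a noisy latent is formed and how the noise-prediction MSE is computed; it is an illustration, not the ODPG training code.

```python
import torch

def forward_diffuse(z0, t, alpha_bar):
    """Eq. (1): z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(z0)                    # Gaussian noise eps ~ N(0, I)
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)        # schedule value per sample
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps
    return z_t, eps

def mse_loss(eps_model, z0, alpha_bar, num_steps=1000):
    """Eq. (2): MSE between the added noise and the denoiser's prediction."""
    t = torch.randint(0, num_steps, (z0.shape[0],), device=z0.device)
    z_t, eps = forward_diffuse(z0, t, alpha_bar)
    eps_pred = eps_model(z_t, t)                  # placeholder denoiser call
    return torch.mean((eps - eps_pred) ** 2)
```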
3.2 Our Approach
Method Overview. As illustrated in Fig 1, we intro-
duce a conditioning mechanism that integrates gar-
ment, appearance, and pose information into the
image synthesis process. In Section 3.2.1, we
propose a multi-scale feature extraction approach.
Subsequently, in Section 3.2.2, we present a bias-
augmented query attention mechanism to propagate
the extracted features through a U-Net-based diffu-
sion model. Finally, in Section 3.2.3, we discuss a
training strategy designed to improve generalization
and reduce dependency on the training data.
3.2.1 Multi-Scale Feature Extraction
Building upon CFLD for Pose-Guided Person Im-
age Synthesis, which generates fine-grained pose-
transferred human images by adding pose and appear-
ance conditions during the denoising process, we ex-
tend this approach to include garment, appearance,
and pose conditions in the dynamic VTON process.
To achieve this, we first extract multi-scale feature maps from the Source Human, Target Garment, and Target Pose images. Specifically, both the Source Human image and the Target Garment image are passed through a Swin Transformer, yielding multi-scale feature maps for the source, $F_s = [f_{s1}, f_{s2}, f_{s3}]$, and for the garment, $F_g = [f_{g1}, f_{g2}, f_{g3}, f_{g4}]$. Simultaneously, we extract keypoints from the Target Pose image using OpenPose, a pre-trained pose estimation model, and process them through ResNet blocks to obtain multi-scale pose feature maps, $F_p = [f_{p1}, f_{p2}, f_{p3}]$.
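The sketch below illustrates only the shape of this extraction step: `MultiScaleEncoder` is a stand-in for the Swin Transformer (source and garment branches) and the ResNet pose blocks, and the channel widths, keypoint-channel count, and resolutions are assumed values rather than ones reported here.

```python
import torch
import torch.nn as nn

class MultiScaleEncoder(nn.Module):
    """Placeholder pyramid encoder: each stage halves the resolution and
    contributes one feature map, mimicking the multi-scale outputs of the
    Swin Transformer (source/garment) or the ResNet pose blocks."""
    def __init__(self, in_ch, widths):
        super().__init__()
        chans = [in_ch] + list(widths)
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, 2, 1), nn.GELU())
            for i in range(len(widths))
        )

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats

# Assumed channel widths and 18 keypoint heatmap channels (OpenPose-style).
source_enc  = MultiScaleEncoder(3,  [96, 192, 384])       # F_s = [f_s1, f_s2, f_s3]
garment_enc = MultiScaleEncoder(3,  [96, 192, 384, 768])  # F_g = [f_g1, ..., f_g4]
pose_enc    = MultiScaleEncoder(18, [64, 128, 256])       # F_p = [f_p1, f_p2, f_p3]

src     = torch.randn(1, 3, 256, 256)
garment = torch.randn(1, 3, 256, 256)
pose    = torch.randn(1, 18, 256, 256)   # rasterized OpenPose keypoints

F_s, F_g, F_p = source_enc(src), garment_enc(garment), pose_enc(pose)
print([f.shape for f in F_g])            # four progressively coarser garment maps
```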
Figure 1: Overall Pipeline of ODPG : The source image features are extracted at three different scales using a Swin Trans-
former. Similarly, the garment image passes through a Swin Transformer, with features extracted at four scales. The fourth-
scale feature is passed through a transformer decoder with learnable queries, which outputs general garment information to
condition the diffusion UNet structure at each downsample and upsample block. Multi-scale features from both the source and
garment images are combined via attention mechanisms, with fine details added as conditions in the upsampling blocks. Ad-
ditionally, pose information, processed through a ResNet, provides coarse guidance, conditioning the downsampling blocks.
3.2.2 Bias Augmented Query Attention
In our UNet-based diffusion model, we use a learnable vector as the initial query $Q_{learn}$ for the attention mechanism. During the denoising process, this query interacts with keys and values corresponding to the target garment via cross-attention. By incorporating pose, appearance, and garment conditions as biases to the query $Q_{learn}$ and applying the attention output to the noisy latent, the model effectively generates natural garment features that align with the source human and target pose.
To initiate this process, we first flatten the coarsest garment feature map $F_g^4$ and pass it through a Transformer decoder. This step generates $F_g^{out}$, which captures general garment information. Based on this feature map, we then compute the keys $K$ and values $V$, which are used throughout the UNet's denoising process:

$$K = W_k^l F_g^{out}, \quad V = W_v^l F_g^{out} \quad (3)$$

Here, $W_k^l$ and $W_v^l$ are learnable projection matrices corresponding to the $l$-th scale of denoising blocks. Through cross-attention between the query $Q_{learn}$ and these keys $K$ and values $V$, we aim to extract garment features that correspond to the target pose. To achieve this, multi-scale pose feature maps $[f_{p1}, f_{p2}, f_{p3}]$ are added as biases to the query $Q_{learn}$ within the downsampling blocks. Consequently, the samples processed through these layers effectively incorporate both the pose and garment information, ensuring that the generated garment aligns naturally with the target pose.
In the upsampling blocks, we simultaneously integrate garment and appearance information by utilizing the appearance encoder $\phi_A$, as proposed in CFLD. The fine-grained appearance encoder $\phi_A$ is designed to capture detailed texture information and consists of multiple Transformer layers, including a zero convolution layer at both the beginning and end to stabilize training. In our method, garment features $F_g^l$ and appearance features $F_s^l$, extracted at each scale, are passed through the appearance encoder $\phi_A$ to calculate the corresponding biases:

$$B = \phi_A(F_s^l, F_g^l) \quad (4)$$

These biases are then incorporated into the query $Q_{learn}$, adjusting it to reflect appearance and garment information. The overall attention mechanism can be summarized by the following equation:

$$F_o^l = \mathrm{softmax}\!\left(\frac{(Q_{learn} + B)K^T}{\sqrt{d}}\right)V. \quad (5)$$
Here, the attention output $F_o^l$ from the $l$-th layer is then added to the noisy latent, guiding the denois-
ing process at each step. Through the entire denois-
ing process described above, the VTON sample, hav-
ing undergone all the upsampling and downsampling
blocks, effectively integrates the source appearance,
target garment, and target pose information. This
integration results in a realistic synthesized image.
By utilizing appearance and pose information as the
query to extract corresponding garment features, the
VTON output incorporates natural clothing details,
such as wrinkles and fit, that emerge when a person
of a specific body shape wears the garment and takes
on a specific pose.
Moreover, this attention mechanism not only en-
hances the realism of the synthesized images but
also reduces the model’s dependency on four-paired
datasets. In the following section, we explore how
this approach allows us to train the model effectively
using existing three-paired datasets, thereby simpli-
fying data requirements for dynamic VTON architec-
tures.
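To make the attention flow of Eqs. (3)-(5) concrete, here is a minimal PyTorch sketch, assuming a hypothetical query count and feature dimension: a learnable query is augmented with a bias B (pose features in the downsampling blocks, or the appearance-encoder output φ_A(F_s, F_g) in the upsampling blocks) and attends over keys and values projected from the flattened garment tokens F_g^out.

```python
import math
import torch
import torch.nn as nn

class BiasAugmentedQueryAttention(nn.Module):
    """Sketch of Eqs. (3)-(5): (Q_learn + B) attends over K, V projected
    from the flattened garment decoder output; the result F_o^l would be
    added to the noisy latent at the corresponding UNet block."""
    def __init__(self, num_queries=16, dim=256):
        super().__init__()
        self.q_learn = nn.Parameter(torch.randn(num_queries, dim))  # Q_learn
        self.w_k = nn.Linear(dim, dim, bias=False)                  # W_k^l
        self.w_v = nn.Linear(dim, dim, bias=False)                  # W_v^l
        self.dim = dim

    def forward(self, f_g_out, bias):
        # f_g_out: (B, N_g, dim) garment tokens; bias: (B, num_queries, dim)
        k = self.w_k(f_g_out)                                       # Eq. (3)
        v = self.w_v(f_g_out)
        q = self.q_learn.unsqueeze(0) + bias                        # Q_learn + B
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.dim), dim=-1)
        return attn @ v                                             # F_o^l, Eq. (5)

# Hypothetical shapes: batch of 2, 64 garment tokens, 16 queries of width 256.
layer = BiasAugmentedQueryAttention()
f_o = layer(torch.randn(2, 64, 256), torch.randn(2, 16, 256))
print(f_o.shape)  # torch.Size([2, 16, 256])
```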
3.2.3 Enhancing Generalization and Reducing
Data Dependency
Training a VTON model that simultaneously trans-
forms both garment and pose typically requires a four-
paired dataset, including the source human image, tar-
get garment image, target pose image, and a ground
truth image showing the source person wearing the
target garment in the target pose. However, existing
datasets, such as DeepFashion (Liu et al., 2016) and
FashionTryOn (Zheng et al., 2019a), are three-paired
datasets. In these datasets, the source human, pose-
transferred human, and target garment all feature the
same garment, which may appear to limit the dynamic
VTON model’s ability to generalize to new garments.
Previous methods focus on either pose transfer or gar-
ment transfer, and thus lack comprehensive pairing
for transformations in both aspects.
Our Bias Augmented Query Attention mechanism
enables the model to learn complex relationships be-
tween the source appearance, target pose, and target
garment, enhancing its ability to generalize to new
combinations of appearances, poses, and garments,
thereby reducing the need for a four-paired dataset.
Specifically, the attention mechanism integrates the
source appearance and target pose as biases into the
learnable query $Q_{learn}$, guiding the model to extract
relevant garment features during the denoising pro-
cess. This approach allows the model to synthesize
realistic images of the source person wearing differ-
ent, previously unseen garments in new poses, with-
out requiring a ground truth image that combines all
transformations.
We acknowledge that in our training setup, the
source person is wearing the target garment, which
may seem to limit the model’s ability to generalize to
unseen garments. However, our experimental results
demonstrate that the model does not merely replicate
the input; instead, it effectively transfers and inte-
grates garment features based on the conditioned ap-
pearance and pose without memorizing specific gar-
ment instances.
To validate the generalization capability of our
model, we tested it with source person images wear-
ing garments different from the target garment and
in various poses. The model successfully synthe-
sized realistic images of the source person wearing
new, previously unseen garments in new poses. This
indicates that our Bias Augmented Query Attention
mechanism enables the model to focus on the interac-
tions between garments, poses, and appearances, ef-
fectively generalizing beyond the training data.
This approach reduces the dependency on four-
paired datasets, allowing us to leverage existing three-
paired datasets such as DeepFashion (Liu et al., 2016)
and FashionTryOn (Zheng et al., 2019a), which con-
tain only three components: the source human image,
the garment image, and the target posed human image
with the same garment.
In summary, our attention mechanism enhances
the model’s ability to generalize to new combinations
of appearances, poses, and garments without relying
on exhaustive paired datasets. This alleviates the data
requirements for training dynamic VTON architec-
tures.
4 OPTIMIZATION
To optimize the training process, we utilize a self-
reconstruction mechanism between the source image
and itself. The reconstruction loss is formulated as:
$$\mathcal{L}_{rec} = \mathbb{E}_{z_0, x_s, x_{sp}, X_{sg}, \varepsilon, t}\left\| \varepsilon - \varepsilon_\theta(z_t, t, x_s, x_{sp}, X_{sg}) \right\|_2^2,$$

where $z_0 = \mathcal{E}(x_s)$ denotes the latent representation of the source image, and $z_t$ is the perturbed latent state produced from $z_0$ at timestep $t$. The total loss function is defined as:

$$\mathcal{L}_{overall} = \mathcal{L}_{mse} + \mathcal{L}_{rec}.$$
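As a compact sketch of the objective above (assuming `eps_model` accepts optional source appearance, source pose, and source garment conditions, which is an illustrative signature rather than the actual interface):

```python
import torch

def total_loss(eps_model, z_t, t, eps, x_s, x_sp, x_sg):
    """Sketch of L_overall = L_mse + L_rec; the conditional call signature
    of eps_model is a placeholder for illustration."""
    eps_uncond = eps_model(z_t, t)                    # L_mse term, Eq. (2)
    eps_cond   = eps_model(z_t, t, x_s, x_sp, x_sg)   # self-reconstruction term
    l_mse = torch.mean((eps - eps_uncond) ** 2)
    l_rec = torch.mean((eps - eps_cond) ** 2)
    return l_mse + l_rec
```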
To further enhance the training process, we adopt
the cumulative classifier-free guidance to strengthen
the source appearance, target pose features and gar-
ment features simultaneously. Following the method-
ology presented in InstructPix2Pix (Brooks et al.,
2023), where both image and textual conditionings
are randomly dropped during training to allow for
conditional or unconditional denoising, we employ
a similar mechanism with our multi-conditional inputs (pose, appearance, and garment). By doing so, we ensure flexibility and control in guiding the model's output based on each feature. The modified classifier-free guidance is expressed as:
$$\begin{aligned}
\varepsilon_t ={}& \varepsilon_\theta(z_t, t, \varnothing, \varnothing) \\
&+ w_{pose}\left(\varepsilon_\theta(z_t, t, \varnothing, x_{tp}) - \varepsilon_\theta(z_t, t, \varnothing, \varnothing)\right) \\
&+ w_{app}\left(\varepsilon_\theta(z_t, t, x_s, x_{tp}) - \varepsilon_\theta(z_t, t, \varnothing, x_{tp})\right) \\
&+ w_{garment}\left(\varepsilon_\theta(z_t, t, x_g, x_{tp}) - \varepsilon_\theta(z_t, t, \varnothing, x_{tp})\right),
\end{aligned} \quad (6)$$
where $w_{pose}$, $w_{app}$, and $w_{garment}$ are weighting factors for the pose, appearance, and garment guidance, respectively. Based on this noise prediction, we utilized the DDIM scheduler in the sampling process, performing 50 denoising steps. As a result, the synthesized images successfully reflect the embedded conditions of the Source Human Image, Target Pose Image, and Target Garment Image.
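The sketch below mirrors the structure of Eq. (6) directly; passing `None` for a dropped condition and the default guidance weights are placeholders rather than the paper's actual interface or values. The combined prediction would then be handed to a DDIM-style scheduler for the 50 denoising steps described above.

```python
def cumulative_cfg(eps_model, z_t, t, x_s, x_g, x_tp,
                   w_pose=1.5, w_app=1.5, w_garment=1.5):
    """Sketch of Eq. (6): combine unconditional, pose-, appearance-, and
    garment-guided noise predictions with separate weights."""
    e_uncond  = eps_model(z_t, t, None, None)   # eps_theta(z_t, t, 0, 0)
    e_pose    = eps_model(z_t, t, None, x_tp)   # pose condition only
    e_app     = eps_model(z_t, t, x_s,  x_tp)   # appearance + pose
    e_garment = eps_model(z_t, t, x_g,  x_tp)   # garment + pose
    return (e_uncond
            + w_pose    * (e_pose    - e_uncond)
            + w_app     * (e_app     - e_pose)
            + w_garment * (e_garment - e_pose))
```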
5 EXPERIMENT
5.1 Dataset Description
Our experiment was conducted on the large-scale In-
Shop Clothes Retrieval benchmark of DeepFashion
(Liu et al., 2016). The original DeepFashion database
contains over 300,000 cross-pose and cross-domain
image pairs. However, since our model’s goal is to
synthesize VTON images based on new poses and
garments, we extracted a dataset consisting of source
human images, target pose human images, and tar-
get garment images. Furthermore, we utilized only
256×176 low-resolution images to verify that our
model is capable of generating fine details even un-
der limited resolution and constrained computational
power. Consequently, the final extracted DeepFash-
ion dataset consists of 6,014 images. As this dataset
alone is insufficient to fully validate our model’s per-
formance, we also utilized the FashionTryOn (Zheng
et al., 2019a). This dataset consists of 21,724 im-
ages, including garment images and human model im-
ages in different poses, with a resolution of 192×256.
We extracted joint keypoints from the model images
using OpenPose and subsequently trained our three-
image-conditioned architecture using both the Fash-
ionTryOn and the extracted DeepFashion datasets.
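Purely to illustrate the three-paired setup described above, the sketch below shows a hypothetical loader; the directory layout, file naming, and keypoint storage format are invented for the example and do not reflect how DeepFashion or FashionTryOn are actually distributed.

```python
import os
import torch
from PIL import Image
from torch.utils.data import Dataset

class TryOnTriplets(Dataset):
    """Hypothetical three-paired dataset: source human image, target garment
    image, and target-pose keypoints pre-extracted with OpenPose."""
    def __init__(self, root, transform=None):
        self.root = root
        self.ids = sorted(os.listdir(os.path.join(root, "source")))
        self.transform = transform

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, i):
        name = self.ids[i]
        src  = Image.open(os.path.join(self.root, "source", name)).convert("RGB")
        gar  = Image.open(os.path.join(self.root, "garment", name)).convert("RGB")
        pose = torch.load(os.path.join(self.root, "pose", name + ".pt"))
        if self.transform is not None:
            src, gar = self.transform(src), self.transform(gar)
        return src, gar, pose
```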
5.2 Implementation Details
We developed our method using the PyTorch frame-
work, incorporating HuggingFace Diffusers based on
Stable Diffusion v1.5. For preprocessing, the source, garment, and pose images are resized to 256×256 pixels.
Training was performed for 30 epochs with the
Adam optimizer. The training process utilized a sin-
gle NVIDIA A100 GPU, running for approximately
30 hours, with a batch size configured to 24.
The key aspects of our model configuration are as follows (see the configuration sketch after this list):
UNet Configuration: We use a pretrained UNet
model, focusing on training specific blocks (in-
dexed from 3 to 11). Cross-attention mechanisms
are selectively applied, with only the key and
value layers being trainable.
Pose Guidance: Pose information is downscaled
by a factor of 4, with defined channels for both
pose and input data.
Appearance Guidance: Convolutional layers are
configured with kernel and stride sizes of 1, and
attention residual blocks are indexed from 3 to 11
to effectively capture detailed features.
Classifier-Free Guidance: During training, con-
ditions are dropped with a probability of 20%.
Garment Loss Weight: The garment loss weight
is set to 1.0 to ensure balanced training between
the source and garment features.
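These settings can be collected into a single configuration object; the sketch below is a hypothetical summary whose key names are illustrative and do not correspond to an actual ODPG or Diffusers configuration file.

```python
# Hypothetical configuration mirroring the settings listed above.
odpg_config = {
    "base_model": "stable-diffusion-v1-5",
    "trainable_unet_blocks": list(range(3, 12)),       # blocks indexed 3..11
    "cross_attention_trainable": ["key", "value"],     # only K/V layers train
    "pose_downscale_factor": 4,
    "appearance_conv": {"kernel_size": 1, "stride": 1},
    "appearance_attn_blocks": list(range(3, 12)),
    "cfg_condition_drop_prob": 0.2,
    "garment_loss_weight": 1.0,
    "image_size": (256, 256),
    "epochs": 30,
    "batch_size": 24,
    "optimizer": "Adam",
}
```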
Figure 2: Qualitative results showcasing the performance of our model. The images illustrate the ability of our model to
retrieve relevant items across different poses and domains.
5.3 Qualitative Analysis
As illustrated in Fig 2, we attempted to implicitly syn-
thesize the Source Human (col 1) with the target pose
(col 2) and target garment (col 3) through the denois-
ing UNet.
The synthesized results in the output column (col
4) of both the left and right panels demonstrate that our
end-to-end architecture successfully integrates gar-
ment and pose information into multiple dynamic-
posed VTON results.
The garment features effectively deliver detailed
patterns, wrinkles, textures, and colors of the target
garment into the overall denoising process. Addi-
tionally, the pose features encoded by several ResNet
blocks are successfully reflected in all samples.
5.4 Quantitative Analysis
We evaluated our model using FID, SSIM, and LPIPS
to compare its performance against other models.
These metrics are well-suited to assess the overall
quality, structural similarity, and perceptual diversity
of generated images. However, given that our model
is an end-to-end multi-conditioning approach capa-
ble of simultaneously altering both pose and garment,
there is an inherent trade-off in metrics when com-
pared to models that either focus solely on garment
manipulation or adopt multi-stage processes for pose
and garment modifications.
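As a hedged sketch of how these metrics can be computed, the snippet below uses torchmetrics (assuming a recent version with the FID, SSIM, and LPIPS modules and their optional dependencies installed); the exact preprocessing, image ranges, and evaluation protocol used for the reported numbers may differ.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

fid   = FrechetInceptionDistance(feature=2048, normalize=True)  # float images in [0, 1]
ssim  = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

real = torch.rand(4, 3, 256, 176)   # placeholder batch of real images
fake = torch.rand(4, 3, 256, 176)   # placeholder batch of generated images

fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:  ", fid.compute().item())
print("SSIM: ", ssim(fake, real).item())
print("LPIPS:", lpips(fake, real).item())
```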
The results of our experiments, as presented in Ta-
ble 1, provide a comparison between our multi-condi-
tioning end-to-end model (ODPG) and two different
types of models on the FashionTryOn dataset: those
that condition only on the garment and those that use
multi-stage training to modify both the pose and gar-
ment.
Garment-only models such as VITON-HD (Choi
et al., 2021), GP-VTON (Xie et al., 2023), and CP-
VTON (Wang et al., 2018) performed well in metrics
like SSIM and LPIPS, since they only focus on condi-
tioning the garment. Our model, however, handles the
more complex task of changing both the garment and
the pose in a single end-to-end process. Despite this,
ODPG achieved reasonable SSIM and LPIPS scores
in comparison. Notably, our model outperformed all
other models in FID, which indicates that the gener-
ated images from ODPG are closer to real images in
terms of overall quality and distribution.
The second set of models, such as FashionTryOn
(Zheng et al., 2019b), Wang et al. (Wang et al.,
2020), and style-VTON (Islam et al., 2024), rely on
multi-stage training, where the pose and garment are
handled in separate steps. FashionTryOn utilizes a
bidirectional GAN in multiple stages to enhance both
pose and garment deformation, while Wang et al. em-
ploy distinct training steps for parsing transforma-
tion, spatial alignment, and face refinement. Simi-
larly, style-VTON uses a two-step approach, where
a spatial transformation network first aligns the gar-
ment, followed by a UNet to refine details.
In contrast, our model’s end-to-end nature simpli-
fies the pipeline while minimizing trade-off in perfor-
mance. Although the SSIM and LPIPS metrics im-
ply slightly lower performance of our model com-
pared to some multi-stage models, ODPG demon-
strated competitive performance, especially excelling
in FID. This suggests that our model effectively cap-
tures the visual realism and diversity of the generated
images while maintaining the benefits of an end-to-
end structure, without the need for separate stages.
Table 1: Quantitative comparisons of two types of models
on the FashionTryOn dataset: models that change only the
garment, and models that adapt to both pose and garment.
The higher the better for SSIM, and the lower the better for
LPIPS and FID.
Method SSIM LPIPS FID
Pose unchanged (garment only changed)
VITON-HD 0.853 0.187 44.25
GP-VTON 0.724 0.384 66.01
CP-VTON 0.739 0.159 56.23
Pose and garment both changed
FashionTryOn 0.699 0.211 53.911
Wang et al. 0.695 0.152 44.306
style-VTON 0.759 0.131 39.341
ODPG (ours, end-to-end) 0.583 0.288 33.625
5.5 Comparison with Other Models
As illustrated in Fig 3, we compared our ODPG
model with several existing models, including IMAG-
Dressing (Shen et al., 2024), IDM-VTON (Choi et al.,
2024), and OOTDiffusion (Xu et al., 2024). A
common issue among these previous virtual try-on
(VTON) methods is their approach of masking the
person and making the garment fit onto the exist-
ing silhouette. This technique often fails to accurately
reflect the garment’s features and fit, especially when
the source image depicts a person wearing thick or
layered clothing.
For instance, overlaying a short-sleeved shirt onto
a person initially wearing bulky attire can result in the
new garment appearing oversized and ill-fitting, mim-
icking the bulkiness of the original clothing. Simi-
larly, if the person is wearing a hooded sweatshirt and
we attempt to dress them in a non-hooded garment,
remnants of the hood may awkwardly appear at the
back of the neck. These problems arise because the
models ambiguously combine the size and shape of
the original input garment with the new one, leading
to unrealistic distortions and artifacts.
While some models like IMAGDressing attempt
to incorporate pose information, they often struggle
to preserve the input person’s appearance accurately,
sometimes altering individual-specific traits or even
the perceived gender. Other models such as IDM-
VTON and OOTDiffusion maintain the person’s char-
acteristics better but still encounter issues like occa-
sional sub-optimal fitting, and some challenges adapt-
ing to pose changes.
In contrast, our ODPG model effectively ad-
dresses these challenges by first integrating pose and
garment information during the downsampling stage
and then adding appearance details in the upsampling
stage. This approach allows the model to adjust the
new garment to the person’s body structure, regard-
less of the original clothing’s size or shape. As a re-
sult, the new garment achieves a more natural and re-
alistic fit, while maintaining consistency in the overall
appearance.
These advancements highlight the strengths of our
ODPG model in achieving a more accurate and realis-
tic integration of pose, garment, and appearance fea-
tures, overcoming the common limitations observed
in existing models.
Figure 3: Comparison of different models: IMAGDressing,
IDM-VTON, OOTD, and our ODPG model.
5.6 Ablation Study
5.6.1 Appearance Encoder Bias
We demonstrate the effectiveness of our method,
which adds two bias augmented queries—one from
the source human image and one from the garment
image—by conducting experiments with different in-
put variations to the appearance encoder. We con-
ducted five experiments: i) appearance only, ii) 2x
appearance, iii) garment only, iv) 2x garment, and v)
appearance + garment. Figure 4 demonstrates that the
additional bias in the queries enhances feature focus.
The appearance-only case captures appearance details
more effectively than the garment-only case, but the garment features are only weakly reflected.
In particular, the analysis of the third row in Fig 4
shows that when only appearance bias is added, the
garment feature is almost excluded. Doubling the
appearance bias (2x appearance) leads to improved
preservation of the source image’s appearance, in-
cluding finer details such as shoe color, which are
not as well captured with a standard appearance bias.
Conversely, when only garment bias is applied, the
target garment is accurately represented, but other
features, such as the original pants color, are not pre-
served.
Scaling the garment bias by 2 ensures that the target garment is well-represented, but still leads to the loss of other key features, such as the hairstyle. In contrast, applying both appearance and garment biases results in a balanced outcome where the target garment is accurately reflected while preserving non-garment features (e.g., pants color, hair color, and skin tone), which are critical for achieving realistic VTON results.
Figure 4: The results of five experiments demonstrating the impact of different input variations to the appearance encoder. The experiments include (i) appearance only, (ii) 2x appearance, (iii) garment only, (iv) 2x garment, and (v) appearance + garment. The figure highlights how additional bias in queries improves the focus of the features, with the best results achieved by balancing both appearance and garment inputs.
This experiment demonstrates that the latent space
can be effectively controlled to balance the repre-
sentation of both garment and appearance features.
By adjusting the garment and appearance biases, the
method maintains the necessary garment modifica-
tions while preserving the source appearance, result-
ing in optimal visual fidelity.
5.6.2 Explicit Garment Synthesis
In Fig 5, we present the results of an experiment
comparing the effects of applying gray masking to
the upper body region of source images against our
method without masking. In Figure 5(a), we observed
that for garments with complex textures or patterns,
such as striped clothing, applying the gray mask led
to a simplified version of the patterns, resulting in
fewer stripes or less intricate designs. Additionally,
as shown in Figure 5(b), we also noticed slight color
distortions in the masked images.
Through our evaluation, we found that the approach without gray masking produced better results. When gray masking is applied to the upper-body region of the source image, it removes detailed information: not only is detailed garment information such as color and patterns lost, but the appearance information is also contaminated. This happens because our attention mechanism computes cross-attention with two bias queries derived from the source human and garment images. When masking is applied, the gray mask, which occupies most of the appearance region, causes the model to extract unrelated information. This hinders the denoising process and distorts the color and patterns of the output image. Additionally, the similarity between the gray mask and the background causes further confusion for the model, hampering the learning process and reducing output quality.
Figure 5: Impact of gray masking on upper body segmentation, showing how masking removes garment details and contami-
nates appearance information, leading to distorted color and patterns in the output due to the cross-attention mechanism using
biased queries from the source and garment images.
6 CONCLUSION AND FUTURE
WORKS
In this work, we introduce Outfitting Diffusion with
Pose-Guided Conditioning (ODPG), a novel vir-
tual try-on (VTON) framework that seamlessly inte-
grates garment, pose, and appearance features into
a diffusion-based U-Net architecture. By leverag-
ing a bias-augmented learnable query mechanism, our
model effectively generates realistic VTON images
without the need for explicit garment warping. It cap-
tures fine-grained details such as garment textures, fit,
and natural alignment with dynamic poses. Our ap-
proach simplifies the virtual try-on pipeline by elim-
inating multi-stage processes, offering an efficient
end-to-end solution that maintains high visual qual-
ity. Our experimental results demonstrate the model’s
ability to generalize beyond the need for fully paired
datasets, reducing dependency on exhaustive data re-
quirements while still achieving impressive perfor-
mance in generating realistic try-on results.
We are considering several promising directions
for future research. One key area is extending the
model to handle video inputs, enabling continuous
and dynamic garment and pose transitions in video-
based virtual try-on scenarios. This capability could
provide more interactive and engaging user experi-
ences. Additionally, the implicit learning strategy
proposed in our model might be adapted to other do-
mains that face data limitations or dependency chal-
lenges. By effectively addressing data scarcity, our
method has the potential to be applied in various fields
where obtaining extensive paired datasets is difficult.
These future directions highlight the versatility and
potential of our model for broader applications in both
virtual try-on and other data-driven tasks.
ACKNOWLEDGEMENT
This study was conducted with support from the Cre-
ative Challenger Program (CCP) activity funding pro-
vided by the Teaching and Learning Enhancement
Team at Korea University in 2024.
REFERENCES
Bhunia, A. K., Khan, S., Cholakkal, H., Anwer, R. M.,
Laaksonen, J., Shah, M., and Khan, F. S. (2023). Per-
son image synthesis via denoising diffusion model. In
CVPR, pages 5968–5976.
Brooks, T., Holynski, A., and Efros, A. A. (2023). In-
structpix2pix: Learning to follow image editing in-
structions.
Choi, S., Park, S., Lee, M., and Choo, J. (2021). Viton-
hd: High-resolution virtual try-on via misalignment-
aware normalization.
Choi, Y., Kwak, S., Lee, K., Choi, H., and Shin, J. (2024).
Improving diffusion models for authentic virtual try-
on in the wild.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., and Bengio, Y. (2014).
Generative adversarial nets. In Advances in Neural
Information Processing Systems.
Gou, J., Sun, S., Zhang, J., Si, J., Qian, C., and Zhang,
L. (2023). Taming the power of diffusion models for
high-quality virtual try-on with appearance flow. In
Proceedings of the 31st ACM International Confer-
ence on Multimedia.
Han, X., Zhu, X., Deng, J., Song, Y.-Z., and Xiang, T.
(2023). Controllable person image synthesis with pose
constrained latent diffusion.
Ho, J., Jain, A., and Abbeel, P. (2020a). Denoising diffusion
probabilistic models. In Advances in Neural Infor-
mation Processing Systems, volume 33, pages 6840–
6851.
Ho, J., Jain, A., and Abbeel, P. (2020b). Denoising diffusion
probabilistic models. In Advances in Neural Informa-
tion Processing Systems.
Islam, T., Miron, A., Liu, X., and Li, Y. (2024). Stylevton:
A multi-pose virtual try-on with identity and clothing
detail preservation. Neurocomputing, 594.
Kim, J., Gu, G., Park, M., Park, S., and Choo, J. (2023).
Stableviton: Learning semantic correspondence with
latent diffusion model for virtual try-on. In arXiv
preprint arXiv:2312.01725.
Kingma, D. P. and Welling, M. (2014). Auto-encoding
variational bayes. In International Conference on
Learning Representations.
Liu, Z., Luo, P., Qiu, S., Wang, X., and Tang, X. (2016).
Deepfashion: Powering robust clothes recognition and
retrieval with rich annotations. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 1096–1104.
Lu, Y., Zhang, M., Ma, A. J., Xie, X., and Lai, J. (2024).
Coarse-to-fine latent diffusion for pose-guided person
image synthesis. In arXiv.org. Retrieved from https://arxiv.org/abs/2402.18078.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and
Ommer, B. (2022). High-resolution image synthesis
with latent diffusion models. In CVPR, pages 10684–
10695.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-
net: Convolutional networks for biomedical image
segmentation. In International Conference on Med-
ical Image Computing and Computer-Assisted Inter-
vention (MICCAI).
Shen, F., Jiang, X., He, X., Ye, H., Wang, C., Du, X., Li, Z.,
and Tang, J. (2024). Imagdressing-v1: Customizable
virtual dressing.
Wang, B., Zheng, H., Liang, X., Chen, Y., Lin, L., and
Yang, M. (2018). Toward characteristic-preserving
image-based virtual try-on network. In Proceed-
ings of the European Conference on Computer Vision
(ECCV).
Wang, J., Sha, T., Zhang, W., Li, Z.-J., and Mei, T. (2020).
Down to the last detail: Virtual try-on with fine-
grained details. In Proceedings of the 28th ACM Inter-
national Conference on Multimedia, pages 466–474.
Xie, Z., Huang, Z., Dong, X., Zhao, F., Dong, H., Zhang,
X., Zhu, F., and Liang, X. (2023). Gp-vton: Towards
general purpose virtual try-on via collaborative local-
flow global-parsing learning. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR).
Xu, Y., Gu, T., Chen, W., and Chen, C. (2024). Ootdiffu-
sion: Outfitting fusion based latent diffusion for con-
trollable virtual try-on.
Zheng, N., Song, X., Chen, Z., Hu, L., Cao, D., and Nie,
L. (2019a). Virtually trying on new clothing with ar-
bitrary poses. In Proceedings of the 27th ACM Inter-
national Conference on Multimedia (MM ’19), pages
2253–2261.
Zheng, N., Song, X., Chen, Z., Hu, L., Cao, D., and Nie,
L. (2019b). Virtually trying on new clothing with ar-
bitrary poses. In Proceedings of the 27th ACM Inter-
national Conference on Multimedia (MM ’19), pages
2253–2261.