
volution layer at both the beginning and end to stabilize training. In our method, garment features $F^l_g$ and appearance features $F^l_s$, extracted at each scale, are passed through the appearance encoder $\phi_A$ to calculate the corresponding biases:

$$B = \phi_A(F^l_s, F^l_g) \tag{4}$$
These biases are then incorporated into the query $Q_{\mathrm{learn}}$, adjusting it to reflect appearance and garment information. The overall attention mechanism can be summarized by the following equation:

$$F^l_o = \mathrm{softmax}\!\left(\frac{(Q_{\mathrm{learn}} + B)\,K^{T}}{\sqrt{d}}\right) V. \tag{5}$$
The attention output $F^l_o$ from the $l$-th layer is
then added to the noisy latent, guiding the denois-
ing process at each step. Over the course of this denoising process, the VTON sample passes through all the upsampling and downsampling blocks and thereby integrates the source appearance, target garment, and target pose information, resulting in a realistic synthesized image.
By utilizing appearance and pose information as the
query to extract corresponding garment features, the
VTON output incorporates natural clothing details,
such as wrinkles and fit, that emerge when a person
of a specific body shape wears the garment and takes
on a specific pose.
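For concreteness, a minimal PyTorch sketch of how Eqs. (4) and (5) can be combined at a single scale is given below. The module structure, the MLP form of the appearance encoder $\phi_A$, and the choice of deriving keys and values from the garment features are simplifying assumptions made for illustration rather than a verbatim description of our implementation.

    import torch
    import torch.nn as nn

    class BiasAugmentedQueryAttention(nn.Module):
        # Minimal sketch of Bias Augmented Query Attention (Eqs. 4 and 5).
        # The MLP form of phi_A and the use of garment features for keys and
        # values are illustrative assumptions.
        def __init__(self, dim, num_tokens):
            super().__init__()
            # Learnable query Q_learn, shared across samples.
            self.q_learn = nn.Parameter(torch.randn(num_tokens, dim))
            # Appearance encoder phi_A: maps concatenated appearance and
            # garment features at scale l to the bias B of Eq. (4).
            self.phi_a = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
            self.to_k = nn.Linear(dim, dim, bias=False)
            self.to_v = nn.Linear(dim, dim, bias=False)

        def forward(self, f_s, f_g):
            # f_s: appearance features F^l_s, shape (batch, num_tokens, dim)
            # f_g: garment features F^l_g, shape (batch, num_tokens, dim)
            bias = self.phi_a(torch.cat([f_s, f_g], dim=-1))  # Eq. (4)
            q = self.q_learn.unsqueeze(0) + bias               # Q_learn + B
            k, v = self.to_k(f_g), self.to_v(f_g)              # keys/values from F^l_g
            attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
            return attn @ v                                    # F^l_o of Eq. (5)

Because the bias is recomputed from $F^l_s$ and $F^l_g$ at every scale, the same learnable query adapts to different appearance and garment combinations.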
Moreover, this attention mechanism not only en-
hances the realism of the synthesized images but
also reduces the model’s dependency on four-paired
datasets. In the following section, we explore how
this approach allows us to train the model effectively
using existing three-paired datasets, thereby simpli-
fying data requirements for dynamic VTON architec-
tures.
3.2.3 Enhancing Generalization and Reducing
Data Dependency
Training a VTON model that simultaneously trans-
forms both garment and pose typically requires a four-
paired dataset, including the source human image, tar-
get garment image, target pose image, and a ground
truth image showing the source person wearing the
target garment in the target pose. However, existing
datasets, such as DeepFashion (Liu et al., 2016) and
FashionTryOn (Zheng et al., 2019a), are three-paired
datasets. In these datasets, the source human, pose-
transferred human, and target garment all feature the
same garment, which may appear to limit the dynamic
VTON model’s ability to generalize to new garments.
Previous methods focus on either pose transfer or gar-
ment transfer, and thus lack comprehensive pairing
for transformations in both aspects.
Our Bias Augmented Query Attention mechanism
enables the model to learn complex relationships be-
tween the source appearance, target pose, and target
garment, enhancing its ability to generalize to new
combinations of appearances, poses, and garments,
thereby reducing the need for a four-paired dataset.
Specifically, the attention mechanism integrates the
source appearance and target pose as biases into the
learnable query $Q_{\mathrm{learn}}$, guiding the model to extract
relevant garment features during the denoising pro-
cess. This approach allows the model to synthesize
realistic images of the source person wearing differ-
ent, previously unseen garments in new poses, with-
out requiring a ground truth image that combines all
transformations.
We acknowledge that in our training setup, the
source person is wearing the target garment, which
may seem to limit the model’s ability to generalize to
unseen garments. However, our experimental results
demonstrate that the model does not merely replicate
the input; instead, it effectively transfers and inte-
grates garment features based on the conditioned ap-
pearance and pose without memorizing specific gar-
ment instances.
To validate the generalization capability of our
model, we tested it with source person images wear-
ing garments different from the target garment and
in various poses. The model successfully synthe-
sized realistic images of the source person wearing
new, previously unseen garments in new poses. This
indicates that our Bias Augmented Query Attention
mechanism enables the model to focus on the interac-
tions between garments, poses, and appearances, ef-
fectively generalizing beyond the training data.
This approach reduces the dependency on four-
paired datasets, allowing us to leverage existing three-
paired datasets such as DeepFashion (Liu et al., 2016)
and FashionTryOn (Zheng et al., 2019a), which con-
tain only three components: the source human image,
the garment image, and the target posed human image
with the same garment.
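As an illustration, one training sample drawn from such a three-paired dataset can be organized as follows; the field names are hypothetical and do not correspond to the annotation format of either dataset.

    # Hypothetical layout of a three-paired training sample; field names are
    # illustrative, not the annotation format of DeepFashion or FashionTryOn.
    sample = {
        "source_person": "person_0001_pose_a.jpg",  # source human image
        "garment": "garment_0001.jpg",              # garment worn by the source person
        "target_person": "person_0001_pose_b.jpg",  # same person and garment, target pose
    }
    # The target-pose image serves as the denoising target, while the garment
    # image and source appearance condition the attention biases during training.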
In summary, our attention mechanism enhances
the model’s ability to generalize to new combinations
of appearances, poses, and garments without relying
on exhaustive paired datasets. This alleviates the data
requirements for training dynamic VTON architec-
tures.