
Kim, G., Kwon, T., and Ye, J. C. (2022). DiffusionCLIP:
Text-guided diffusion models for robust image manip-
ulation. In CVPR, pages 2426–2435.
Lee, S., Gu, G., Park, S., Choi, S., and Choo, J. (2022).
High-resolution virtual try-on with misalignment and
occlusion-handled conditions. In ECCV, pages 204–
219.
Li, P., Xu, Y., Wei, Y., and Yang, Y. (2020). Self-correction
for human parsing. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 44(6):3260–3271.
Lin, A., Zhao, N., Ning, S., Qiu, Y., Wang, B., and Han, X.
(2023). Fashiontex: Controllable virtual try-on with
text and texture. In ACM SIGGRAPH Conference Pro-
ceedings, pages 56:1–56:9. ACM.
Oldfield, J., Tzelepis, C., Panagakis, Y., Nicolaou, M. A.,
and Patras, I. (2023). PandA: Unsupervised learning
of parts and appearances in the feature maps of GANs.
In ICLR.
Parmar, G., Singh, K. K., Zhang, R., Li, Y., Lu, J., and
Zhu, J. (2023). Zero-shot image-to-image translation.
arXiv preprint arXiv:2302.03027.
Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., and
Lischinski, D. (2021). StyleCLIP: Text-driven manip-
ulation of stylegan imagery. In ICCV, pages 2085–
2094.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., et al. (2021). Learning transferable visual models
from natural language supervision. In ICML, pages
8748–8763.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and
Chen, M. (2022). Hierarchical text-conditional im-
age generation with clip latents. arXiv preprint
arXiv:2204.06125.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and
Ommer, B. (2022). High-resolution image synthesis
with latent diffusion models. In CVPR, pages 10684–
10695.
Shen, Y., Gu, J., Tang, X., and Zhou, B. (2020). Interpreting
the latent space of GANs for semantic face editing. In
CVPR, pages 9240–9249.
Shen, Y. and Zhou, B. (2021). Closed-form factorization
of latent semantics in GANs. In CVPR, pages 1532–
1540.
Song, D., Li, T., Mao, Z., and Liu, A.-A. (2020). SP-
VITON: shape-preserving image-based virtual try-
on network. Multimedia Tools and Applications,
79:33757–33769.
Spingarn, N., Banner, R., and Michaeli, T. (2021). GAN
“steerability” without optimization. In ICLR.
Tov, O., Alaluf, Y., Nitzan, Y., Patashnik, O., and Cohen-Or,
D. (2021). Designing an encoder for stylegan image
manipulation. ACM Transactions on Graphics (TOG),
40(4):1–14.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. In NeurIPS.
Voynov, A. and Babenko, A. (2020). Unsupervised discov-
ery of interpretable directions in the GAN latent space.
In ICML, pages 9786–9796.
Wang, B., Zheng, H., Liang, X., Chen, Y., Lin, L., and
Yang, M. (2018). Toward characteristic-preserving
image-based virtual try-on network. In ECCV, pages
589–604.
Wang, H., Lin, G., del Molino, A. G., Wang, A., Yuan,
Z., Miao, C., and Feng, J. (2022). ManiCLIP: Multi-
attribute face manipulation from text. arXiv preprint
arXiv:2210.00445.
Wei, T., Chen, D., Zhou, W., Liao, J., Tan, Z., Yuan, L.,
Zhang, W., and Yu, N. (2022). HairCLIP: Design
your hair by text and reference image. In CVPR, pages
18072–18081.
Wright, L. (2019). Ranger - a synergistic op-
timizer. https://github.com/lessw2020/
Ranger-Deep-Learning-Optimizer.
Wu, Z., Lischinski, D., and Shechtman, E. (2021).
Stylespace analysis: Disentangled controls for Style-
GAN image generation. In CVPR, pages 12863–
12872.
Xia, W., Yang, Y., Xue, J.-H., and Wu, B. (2021). Tedi-
GAN: Text-guided diverse face image generation and
manipulation. In CVPR, pages 2256–2265.
Yang, H., Chai, L., Wen, Q., Zhao, S., Sun, Z., and He,
S. (2021). Discovering interpretable latent space di-
rections of GANs beyond binary attributes. In CVPR,
pages 12177–12185.
Yang, H., Zhang, R., Guo, X., Liu, W., Zuo, W., and Luo,
P. (2020). Towards photo-realistic virtual try-on by
adaptively generating-preserving image content. In
CVPR, pages 7850–7859.
Yu, R., Wang, X., and Xie, X. (2019). VTNFP: An image-
based virtual try-on network with body and clothing
feature preservation. In ICCV, pages 10511–10520.
Yüksel, O. K., Simsar, E., Er, E. G., and Yanardag, P.
(2021). LatentCLR: A contrastive learning approach
for unsupervised discovery of interpretable directions.
In ICCV, pages 14243–14252.
Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A.
(2019). Self-attention generative adversarial net-
works. In ICML, pages 7354–7363.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang,
O. (2018). The unreasonable effectiveness of deep
features as a perceptual metric. In CVPR.
Zhu, J., Feng, R., Shen, Y., Zhao, D., Zha, Z., Zhou, J., and
Chen, Q. (2021). Low-rank subspaces in GANs. In
NeurIPS.
APPENDIX
Hyperparameters. Our method used the pre-
trained StyleGAN-Human (Fu et al., 2022) model,
which has the structure of StyleGAN2 (Karras et al.,
2020) with a modification to output 256×512 images.
We used a truncation value of ψ = 0.7 to generate im-
ages for training and testing. The StyleGAN-Human
model consists of a total of 16 layers, which are di-
vided into three stages (i.e., coarse, middle, fine) with
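
As a reference for this setup, the listing below is a minimal sketch of how images can be sampled at ψ = 0.7, assuming the StyleGAN2-ADA PyTorch interface (dnnlib, legacy, G.mapping, G.synthesis) that the released StyleGAN-Human checkpoints follow; the checkpoint path is a placeholder.

import torch
import dnnlib   # utilities shipped with the StyleGAN2-ADA PyTorch codebase
import legacy   # checkpoint loader from the same codebase

device = torch.device('cuda')

# Load a pre-trained StyleGAN-Human generator (path is a placeholder).
with dnnlib.util.open_url('stylegan_human_v2_512.pkl') as f:
    G = legacy.load_network_pkl(f)['G_ema'].to(device)

# Sample a latent code and map it to W+ with truncation psi = 0.7.
z = torch.randn([1, G.z_dim], device=device)
w = G.mapping(z, None, truncation_psi=0.7)   # shape: [1, 16, 512] for a 16-layer generator

# Synthesize the image; each of the 16 synthesis layers consumes one w code.
img = G.synthesis(w, noise_mode='const')     # [1, 3, H, W] with values in [-1, 1]
img = (img.clamp(-1, 1) + 1) * 127.5         # map to [0, 255] for saving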