
environments. This advancement improves the user
experience and opens new possibilities for using 3D
avatars in games, social networks, professional training,
and virtual meetings.
ACKNOWLEDGMENT
This research was supported by the National Science
and Technology Council (NSTC), Taiwan, under
grant NSTC 113-2640-E-194-001.
VISAPP 2025 - 20th International Conference on Computer Vision Theory and Applications