
image synthesis. In International Conference on Learning Representations.
Han, I., Yang, S., Kwon, T., and Ye, J. C. (2023). Highly personalized text embedding for image manipulation by stable diffusion. arXiv preprint arXiv:2303.08767.
Haque, A., Tancik, M., Efros, A., Holynski, A., and Kanazawa, A. (2023). Instruct-nerf2nerf: Editing 3d scenes with instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
Ho, J. and Salimans, T. (2021). Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.
Hu, M., Zheng, J., Liu, D., Zheng, C., Wang, C., Tao, D., and Cham, T.-J. (2023). Cocktail: Mixing multi-modality controls for text-conditional image generation. arXiv preprint arXiv:2306.00964.
Huang, L., Chen, D., Liu, Y., Shen, Y., Zhao, D., and Zhou, J. (2023). Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778.
Kamata, H., Sakuma, Y., Hayakawa, A., Ishii, M., and Narihira, T. (2023). Instruct 3d-to-3d: Text instruction guided 3d-to-3d conversion. arXiv preprint arXiv:2303.15780.
Karras, T., Laine, S., and Aila, T. (2021). A style-based generator architecture for generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(12):4217–4228.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020). Analyzing and improving the image quality of StyleGAN. In CVPR.
Li, J., Li, D., Savarese, S., and Hoi, S. (2023a). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.
Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., and Lee, Y. J. (2023b). Gligen: Open-set grounded text-to-image generation. arXiv preprint arXiv:2301.07093.
Liu, R., Wu, R., Hoorick, B. V., Tokmakov, P., Zakharov, S., and Vondrick, C. (2023). Zero-1-to-3: Zero-shot one image to 3d object.
Midjourney.com (2022). Midjourney. https://www.midjourney.com.
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. (2020). Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV.
Mou, C., Wang, X., Xie, L., Zhang, J., Qi, Z., Shan, Y., and Qie, X. (2023). T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453.
Or-El, R., Luo, X., Shan, M., Shechtman, E., Park, J. J., and Kemelmacher-Shlizerman, I. (2022). Stylesdf: High-resolution 3d-consistent image and geometry generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13503–13513.
Poole, B., Jain, A., Barron, J. T., and Mildenhall, B. (2022). Dreamfusion: Text-to-3d using 2d diffusion. arXiv.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. (2021). Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695.
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. (2023). Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., et al. (2022). Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487.
Schwarz, K., Liao, Y., Niemeyer, M., and Geiger, A. (2020). Graf: Generative radiance fields for 3d-aware image synthesis. Advances in Neural Information Processing Systems, 33:20154–20166.
Seo, J., Jang, W., Kwak, M.-S., Ko, J., Kim, H., Kim, J., Kim, J.-H., Lee, J., and Kim, S. (2023). Let 2d diffusion model know 3d-consistency for robust text-to-3d generation. arXiv preprint arXiv:2303.07937.
Serengil, S. I. and Ozpinar, A. (2020). Lightface: A hybrid deep face recognition framework. In 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), pages 23–27. IEEE.
Tang, J., Wang, T., Zhang, B., Zhang, T., Yi, R., Ma, L., and Chen, D. (2023). Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior.
Wang, C., Chai, M., He, M., Chen, D., and Liao, J. (2022). Clip-nerf: Text-and-image driven manipulation of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3835–3844.
Wang, C., Jiang, R., Chai, M., He, M., Chen, D., and Liao, J. (2023). Nerf-art: Text-driven neural radiance fields stylization. IEEE Transactions on Visualization and Computer Graphics, pages 1–15.
Wang, W., Yang, S., Xu, J., and Liu, J. (2020). Consistent video style transfer via relaxation and regularization. IEEE Transactions on Image Processing, 29:9125–9139.
Xia, W. and Xue, J.-H. (2022). A survey on 3d-aware image synthesis. arXiv preprint arXiv:2210.14267.
Zhang, K., Riegler, G., Snavely, N., and Koltun, V. (2020). Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492.
Zhang, L. and Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543.
Zhou, P., Xie, L., Ni, B., and Tian, Q. (2021). Cips-3d: A 3d-aware generator of gans based on conditionally-independent pixel synthesis. arXiv preprint arXiv:2110.09788.