
ation for music cover images. arXiv preprint
arXiv:2205.07301.
Esser, P., Rombach, R., and Ommer, B. (2021). Tam-
ing transformers for high-resolution image synthesis.
In Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition, pages 12873–
12883.
Frans, K., Soros, L. B., and Witkowski, O. (2021). Clip-
draw: Exploring text-to-drawing synthesis through
language-image encoders. CoRR, abs/2106.14843.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion
probabilistic models. Advances in Neural Information
Processing Systems, 33:6840–6851.
Jain, A., Xie, A., and Abbeel, P. (2022). VectorFusion:
Text-to-SVG by abstracting pixel-based diffusion mod-
els. arXiv preprint arXiv:2211.11319.
Kingma, D. P. and Welling, M. (2019). An introduc-
tion to variational autoencoders. arXiv preprint
arXiv:1906.02691.
Li, T.-M., Lukáč, M., Gharbi, M., and Ragan-Kelley, J.
(2020). Differentiable vector graphics rasterization
for editing and learning. ACM Trans. Graph. (Proc.
SIGGRAPH Asia), 39(6):193:1–193:15.
Lopes, R. G., Ha, D., Eck, D., and Shlens, J. (2019). A
learned representation for scalable vector graphics.
CoRR, abs/1904.02632.
Ma, X., Zhou, Y., Xu, X., Sun, B., Filev, V., Orlov, N.,
Fu, Y., and Shi, H. (2022). Towards layer-wise image
vectorization. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 16314–16323.
Nash, C., Ganin, Y., Eslami, S. A., and Battaglia, P. (2020).
Polygen: An autoregressive generative model of 3d
meshes. In International conference on machine
learning, pages 7220–7229. PMLR.
Nishina, K. and Matsui, Y. (2024). SVGEditBench: A
benchmark dataset for quantitative assessment of
LLM's SVG editing capabilities.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., Krueger, G., and Sutskever, I. (2021). Learning
transferable visual models from natural language su-
pervision. CoRR, abs/2103.00020.
Reddy, P., Gharbi, M., Lukac, M., and Mitra, N. J. (2021).
Im2vec: Synthesizing vector graphics without vector
supervision. arXiv preprint arXiv:2102.02798.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and
Ommer, B. (2022). High-resolution image synthesis
with latent diffusion models. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 10684–10695.
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Den-
ton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi,
S. S., Lopes, R. G., et al. (2022). Photorealistic text-
to-image diffusion models with deep language under-
standing. arXiv preprint arXiv:2205.11487.
Schaldenbrand, P., Liu, Z., and Oh, J. (2022). Styleclip-
draw: Coupling content and style in text-to-drawing
translation. arXiv preprint arXiv:2202.12362.
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C.,
Wightman, R., Cherti, M., Coombes, T., Katta, A.,
Mullis, C., Wortsman, M., et al. (2022). Laion-5b: An
open large-scale dataset for training next generation
image-text models. arXiv preprint arXiv:2210.08402.
Shen, I.-C. and Chen, B.-Y. (2021). Clipgen: A deep
generative model for clipart vectorization and synthe-
sis. IEEE Transactions on Visualization and Com-
puter Graphics, 28(12):4211–4224.
Tian, X. and Günther, T. (2022). A survey of smooth vector
graphics: Recent advances in representation, creation,
rasterization and image vectorization. IEEE Transac-
tions on Visualization and Computer Graphics.
Timofeenko, B. A., Efimova, V., and Filchenkov, A. A.
(2023). Vector graphics generation with LLMs: ap-
proaches and models. Zapiski Nauchnykh Seminarov
POMI, 530:24–37.
Van Den Oord, A., Vinyals, O., et al. (2017). Neural discrete
representation learning. Advances in neural informa-
tion processing systems, 30.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Wang, Y. and Lian, Z. (2021). Deepvecfont: synthesizing
high-quality vector fonts via dual-modality learning.
ACM Transactions on Graphics (TOG), 40(6):1–15.
Wu, H., Shen, S., Hu, Q., Zhang, X., Zhang, Y., and Wang,
Y. (2024). Megafusion: Extend diffusion models to-
wards higher-resolution image generation without fur-
ther tuning.
Wu, R., Su, W., Ma, K., and Liao, J. (2023). Iconshop:
Text-guided vector icon synthesis with autoregressive
transformers.
Xing, X., Zhou, H., Wang, C., Zhang, J., Xu, D., and Yu, Q.
(2024). SVGDreamer: Text guided SVG generation with
diffusion model. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 4546–4555.
Xu, X., Wang, Z., Zhang, E., Wang, K., and Shi, H.
(2022). Versatile diffusion: Text, images and vari-
ations all in one diffusion model. arXiv preprint
arXiv:2211.08332.
Xu, Z. and Wall, E. (2024). Exploring the capability of LLMs
in performing low-level visual analytic tasks on SVG
data visualizations.
Yan, S. (2023). Redualsvg: Refined scalable vector graph-
ics generation. In International Conference on Artifi-
cial Neural Networks, pages 87–98. Springer.
Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang,
Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K.,
et al. (2022). Scaling autoregressive models for
content-rich text-to-image generation. arXiv preprint
arXiv:2206.10789.
Zhang, P., Zhao, N., and Liao, J. (2023). Text-guided vector
graphics customization.
Zou, B., Cai, M., Zhang, J., and Lee, Y. J. (2024). Vgbench:
Evaluating large language models on vector graphics
understanding and generation.
VectorWeaver: Transformers-Based Diffusion Model for Vector Graphics Generation