VectorWeaver: Transformers-Based Diffusion Model for Vector
Graphics Generation
Ivan Jarsky, Maxim Kuzin, Valeria Efimova, Viacheslav Shalamov and Andrey Filchenkov
ITMO University, Kronverksky Pr. 49, St. Petersburg, Russia
Keywords: Vector Graphics, Image Generation, Diffusion Models, Transformer.
Abstract:
Diffusion models generate realistic results for raster images. However, vector image generation is not as successful because of significant differences in image structure. Unlike raster images, vector ones consist of paths that are described by their coordinates, colors, and stroke widths. The number of paths that need to be generated is unknown in advance. We tackle the vector image synthesis problem by developing a new diffusion-based model architecture, which we call VectorWeaver, that includes two transformer-based stacked encoders and two transformer-based stacked decoders. To train the model, we collected a dataset of vector images from public resources; however, its size was not sufficient. To enrich and enlarge it, we propose new augmentation operations specific to vector images. We also designed a dedicated loss function, which allows generating objects with smooth contours and without artifacts. Qualitative experiments demonstrate the superiority and computational efficiency of the proposed model compared to existing vector image generation methods. The vector image generation code is available at https://github.com/CTLab-ITMO/VGLib/tree/main/VectorWeaver.
1 INTRODUCTION
Image generation methods have been developing rapidly in recent years. New diffusion-based (Saharia et al., 2022; Xu et al., 2022; Wu et al., 2024; Du et al., 2023) and transformer-based (Chang et al., 2022; Yu et al., 2022) approaches have achieved excellent generation quality. However, training these models requires hundreds of GPU days to obtain images of 1024x1024 pixels.
An alternative to raster images is vector graphics, whose scalability allows achieving a sufficient image resolution. However, until recently it was only possible to generate vector images containing points or thin curves (Frans et al., 2021; Li et al., 2020). In contrast, artists usually draw vector images with color-filled shapes. Despite serious differences between vector
graphics and raster graphics, VectorFusion (Jain et al.,
2022) based on Stable Diffusion (Rombach et al.,
2022) was proposed to generate a raster image and
then vectorize it using Layer-wise Image Vectoriza-
tion (LIVE) approach (Ma et al., 2022). The main disadvantages of LIVE vectorization are its long generation time and the need to tune hyperparameters for each case. At the same time, other vectorization approaches also have serious disadvantages, such as non-optimal shape generation, i.e., an inappropriate number of shapes and their redundant complexity (Dziuba et al., 2023). Therefore, we choose a different approach to vector image generation that does not refer to the raster domain.
Vector image generation and text-to-image synthesis in vector format are challenged by the complete difference in image structure between the raster and vector domains. While raster images are a matrix of pixels, vector images contain a set of shapes that need to be displayed on the canvas. One of the most popular vector graphics formats is SVG, which is based on the XML text format. However, generating such images directly in text format, for example, using an LLM, is a very difficult task (Timofeenko et al., 2023), especially because of the requirement to conform to a regular file structure with
Figure 1: (a) The general scheme of VectorWeaver. Four transformer models are trained, two for encoding and two for decoding. We denote embeddings with colored rectangles in the scheme. Note that in our model, diffusion is used in the innermost latent space. (b) Images unconditionally generated by the proposed custom-trained VectorWeaver model. Generation of a random latent vector is performed using the diffusion model; the entire image is synthesized using the VAE Decoder (SVG and Path Decoders).
many shapes and their attributes. Currently, there is no model that can generate images that are diverse and high-quality in terms of both structure and content. Therefore, in our research we focus on
generating images consisting only of shapes formed
from Bézier curves with their attributes. A non-trivial issue remains: selecting both the number of shapes placed on the canvas and the number of curves each shape should consist of. This excessive variability and uncertainty create instability when generating images. Another problem is the low prevalence of vector graphics, which results in a small amount of high-quality and detailed vector images in the public domain.
In this paper, we address the aforementioned problems and make the following contributions:
• We propose VectorWeaver, a new computationally efficient architecture for vector image generation based on a diffusion process and a transformer-based VAE capable of processing path sequences of arbitrary length. The model achieves competitive performance on the task of unconditional vector image generation.
• We do not use vectorization approaches or differentiable rasterization. Instead, we adapt existing loss functions to the vector graphics domain. We also involve an adversarial loss to improve the smoothness of the generated images.
• We propose augmentation operations specific to vector graphics.
2 RELATED WORK
2.1 Vector Graphics and Image Vectorization
The basic idea of vector graphics is to store images
as a sequence of shapes to be drawn on the can-
vas. SVG is the most popular vector graphics format,
which defines the image using the XML markup. The
main tags describing figures are <circle>, <ellipse>,
<rect>, <polygon>, but the most universal tag cov-
ering all these types of shapes is <path>, an arbi-
trary line. To draw such a shape, commands moveTo,
lineTo, cubicCurveTo, and others with the corre-
sponding coordinates are used. Thus, the task of gen-
erating a vector image is reduced to generating a se-
quence of path tags, each of which is a sequence of
these commands.
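To make this concrete, the snippet below writes a minimal SVG file containing a single <path> built from moveTo (M), cubicCurveTo (C), and lineTo (L) commands. The coordinates and fill color are arbitrary illustrative values chosen for this sketch, not taken from the paper.

```python
# Minimal illustration of the <path> structure described above.
# Coordinates and the fill color are arbitrary example values.
path_d = (
    "M 10 80 "                 # moveTo: start a subpath at (10, 80)
    "C 30 10, 70 10, 90 80 "   # cubicCurveTo: two control points, then the endpoint
    "L 50 95 "                 # lineTo: straight segment to (50, 95)
    "Z"                        # close the shape
)

svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">\n'
    f'  <path d="{path_d}" fill="#3a7bd5"/>\n'
    '</svg>\n'
)

with open("example.svg", "w") as f:
    f.write(svg)
```

Generating a vector image thus amounts to producing a set of such path strings together with their attributes.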
Vector graphics is mainly used to sketch simple, often abstract images; in design, the most demanded examples are logos, fonts, and icons. The key feature and advantage of vector images over bitmaps is that quality is preserved regardless of scale. However, rendering vector graphics slows down in the presence of a large number of details and figures; therefore, generated images should be quite compact and not overcomplicated.
An important milestone in the vector graphics do-
main was DiffVG (Li et al., 2020), a library for the
differentiable rasterization of vector images. The
main contribution of DiffVG is transparent: after rasterizing vector images, we can use bitmap losses to process them. DiffVG proves its efficiency by offering VAE and GAN models, both generating the parameters of a vector image whose rasterization is similar to the input image. However, these two models only allow generating simple vector fonts; generating more complex images might require additional losses and approaches.
One of the first works using DiffVG was the
Im2Vec vectorizer (Reddy et al., 2021). The pro-
posed VAE is a modification of the model proposed
earlier in DiffVG. The authors paid special attention
to the generation of closed figures and implemented
a new method that samples and sequentially deforms
the base circle. The authors claimed that collecting large datasets of vector images is no longer necessary; however, experiments show that this claim holds true only for simple images consisting of a very small number of shapes.
The LIVE vectorizer (Ma et al., 2022) also ex-
tended opportunities of DiffVG. It iteratively adds
a new path to the image, which is further sequen-
tially optimized. This makes it possible to obtain vector analogs of raster images containing a specified number of paths. However, the approach is very time-consuming, and vectorization of even a simple image can take tens of minutes or more.
Another iterative approach is used in the CLIPGen
vectorizer (Shen and Chen, 2021). The idea is to train
a class-guided model to predict if a new layer with
generated parameters should be added to the current
vector image. However, this approach only works for
very simple images and performs vectorization very
slowly.
An overview of vectorization methods is provided
in the article (Dziuba et al., 2023). The authors
find that modern approaches are either highly time-
consuming or not versatile. Therefore, we conclude that vectorization is undesirable when generating vector images.
2.2 Vector Graphics Generation
Variational autoencoders (VAEs) (Kingma and Welling, 2019) were among the first popular architectures for generating raster graphics. To memorize and sample objects, it was originally proposed to predict the mean and standard deviation, which define a continuous normally distributed latent space. However, VQ-VAE (Van Den Oord et al., 2017) showed that additional quantization helps to improve convergence without losing the diversity of the generated images.
The advent of techniques for generating raster im-
ages was followed by the emergence of methods for
synthesizing vector graphics. Among them, VAEs
were chosen as the main architecture for generating
vector graphics.
One of the first and simplest models is SVG-VAE (Lopes et al., 2019), designed to generate fonts. A character is represented as exactly one <path>, and thus the task is to generate only one sequence of path commands. To memorize the drawing style of the characters, the authors trained a conditional VAE. A second model generates the command sequence of the output symbol, taking the previously predicted style of the reference character and the label of the new character as input.
To generate images containing several shapes, the
DeepSVG architecture was proposed (Carlier et al.,
2020); it is also a VAE using a two-step method of
encoding and decoding an image. The first step en-
codes each image path, then the second step encodes
the set of previously encoded paths. After that, decod-
ing is performed twice in the reverse order. DeepSVG allowed generating images containing several shapes, as well as memorizing and synthesizing various shape parameters, for example, color, opacity, and stroke width. A serious disadvantage of these approaches is that they strongly depend on the size of the training dataset, the collection of which is a non-trivial task. The authors train an autoencoder that translates vector parameters into a latent space without using rasterization. They represent each segment as a token in a transformer (Vaswani et al., 2017). However, only the generation of small vector images (with a maximum of five paths) was presented; therefore, the model is poorly applicable to the general generation task.
In DeepVecFont (Wang and Lian, 2021), the authors proposed to additionally generate a raster image and, at the end, iteratively optimize the resulting vector image to bring it closer to its raster analog. This approach helps straighten uneven lines in the vector image.
One of the attempts to use a GAN for vector graphics synthesis was CoverGAN (Efimova et al., 2022), designed to generate a vector cover image based on the user's music track and given emotions. The authors developed the idea of Im2Vec and used not only circles but also ovals, triangles, and other regular shapes as basic primitives to increase the variety of the results.
Iterative optimization has become another direction in developing models for generating vector images. CLIPDraw (Frans et al., 2021) and StyleCLIPDraw (Schaldenbrand et al., 2022) consider generation and style transfer according to a given text description as a sequential process of changing the current vector image, starting with a randomly generated one. The CLIP model (Radford et al., 2021) is used as a loss function, which allows evaluating the text-to-image relationship. The generated images are drawn with “strokes” and contain a lot of shapes, which is unacceptable in vector graphics.
There is also a paper (Yan, 2023) that attempts to combine ideas from DeepSVG and DeepVecFont. It proposes to generate an image not only with a two-level VAE, but also to simultaneously generate a raster copy of it and, finally, fine-tune the vector image against the raster one. However, qualitative results were obtained only for simple font images.
The models discussed in this section lack suffi-
cient quality. Furthermore, the use of DiffVG imposes
significant limitations due to its instability and the
inability to optimize image batches simultaneously.
However, we find the concept of DeepSVG, which
employs a two-tier VAE, to be quite intriguing, and
we incorporate it into our work.
2.3 Image Generation with Diffusion
Models
Diffusion models (Ho et al., 2020) are the state of the art for generating raster images (Saharia et al., 2022). Their idea is to train a model to remove noise added to images. The training consists of two stages. At the first stage, called the forward pass, a small amount of noise is added to the image; this operation is repeated a large number of times (usually T = 1000). The second stage, called the backward pass, predicts the original image from the noisy one.
After training, we can feed the model with random
noise and after T steps of denoising obtain realistic
images.
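For illustration, the sketch below shows a generic DDPM training step in the spirit of (Ho et al., 2020): a timestep is sampled, noise is added to a clean sample via the closed-form forward process, and the network is trained to predict that noise. The linear beta schedule and the model(x_t, t) interface are assumptions of this sketch, not details reported in the cited papers.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta_t)

def ddpm_training_loss(model, x0):
    """One DDPM training step (sketch): regress the noise injected at a random timestep."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    # Forward pass in closed form: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    # Backward pass is learned: the model predicts the injected noise
    return F.mse_loss(model(x_t, t), eps)
```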
The idea behind Latent Diffusion (Rombach et al., 2022) is that it is unnecessary to train the model to remove noise directly in the image space. Instead, we can train an autoencoder once and then operate in its latent space rather than in the image space.
This approach has been applied to generate vector images in VectorFusion (Jain et al., 2022). It uses Stable Diffusion to generate a raster image, then vectorizes it using LIVE (Ma et al., 2022), and finally fine-tunes it with a score distillation sampling loss.
Despite the increased generation quality, the problems remain the same as in CLIPDraw, because the Stable Diffusion model is also trained on raster images, which results in an image drawn “with strokes” (homogeneous areas are covered with several shapes, which is considered low-quality in vector graphics). Moreover, this method inherits all drawbacks of the LIVE method, such as high image generation time and the requirement to pre-set hyperparameters.
Another work (Zhang et al., 2023) applies Stable Diffusion together with vectorization methods in order to change an input vector image according to a given text description. Using Stable Diffusion, the method first generates a target bitmap based on the text condition; then, based on dual features extracted from deep networks, the input vector image is adapted to the target bitmap, and a final optimization of this vector image is performed. Besides being narrowly focused and unable to generate vector images unconditionally or from a textual condition alone, the method relies on a pre-trained raster diffusion model, which leads to problems similar to those arising with vectorization approaches.
One of the latest works that uses the diffusion process is SVGDreamer (Xing et al., 2024), an optimization-based method for creating vector images from text prompts. The authors introduce a semantic-driven image vectorization method (SIVE), whose aim is to separate background and foreground objects well and to optimize them with the guidance of attention maps obtained from a text-to-image diffusion model. After that, Vectorized Particle-based Score Distillation (VPSD) is used to fine-tune the previously generated image. The authors show pleasing results; however, the resulting images contain too many shapes, and the whole pipeline is computationally heavy. The main problem with the approach is the use of vectorization, which is difficult to configure correctly for automatic use.
Nevertheless, the use of diffusion approaches has
significantly improved the quality of vector image
generation. We also use a diffusion model in our
work.
2.4 Vector Image Generation with
LLMs
A promising approach to generating vector graphics
may be the use of large language models (LLMs).
Since a vector image in SVG format is a text file, it
would be possible to train a language model to gen-
erate such files. There are studies (Nishina and Mat-
sui, 2024; Zou et al., 2024; Xu and Wall, 2024; Cai
et al., 2024) devoted to evaluating the quality of SVG
image generation using open, well-known pre-trained
chat models. Their conclusion is that the quality of image generation is strongly influenced by prompt engineering, but at the same time it is almost impossible to achieve high-quality results without serious fine-tuning of the models for vector image generation.
We found only one paper (Wu et al., 2023) dedicated to SVG generation using an LLM. The authors trained a transformer-decoder model to generate SVG text tokens in an autoregressive way based on a text condition. Although the authors demonstrate good results in their work, we observed that the model likely memorized the dataset on which it was trained, and it shows poor results on unseen data. As a result, even for simple text queries like ”horse”, ”flower”, or ”dog”, we obtained distorted images. The model struggles to comprehend complex text queries and to generate diverse images. Nevertheless, we believe
that this approach holds great promise; however, it
requires a substantial high-quality dataset, which is
currently lacking.
3 METHOD
Our approach aims to avoid vector image rasterization and thus the problem of “drawing with strokes” (when vector shapes are unclosed or many of them form one homogeneous area) by using transformers to train an autoencoder and a diffusion model that generates image embeddings in a latent space. The main task here is to train a high-quality autoencoder, because the task of training a diffusion model to generate embeddings in a latent space has already been solved (Rombach et al., 2022). To train the autoencoder, we need to collect a dataset large enough to train the model, create augmentations and loss functions for the vector graphics domain, and develop an architecture optimized for fast training.
In Section 3.1, we provide a detailed overview of
our data processing. We then discuss in Section 3.2
our approach as a whole, followed by an in-depth ex-
amination of its components: the encoding of paths
and entire images in Section 3.3, as well as the pre-
diction and decoding of paths based on the internal
representation in Section 3.4. Subsequently, we delve
into the details of applying the latent diffusion model
in Section 3.5 and in Section 3.6 discuss the loss func-
tions implemented in our study.
3.1 Dataset Collection
One of the main problems in the vector graphics do-
main is the lack of large open collections of data.
While Stable Diffusion was trained on the LAION dataset (Schuhmann et al., 2022) consisting of five billion images, papers on the generation of vector images use fewer than one million images. For example, DeepSVG was trained on the SVG-Icons8 dataset (Carlier et al., 2020) consisting of 100,000 images.
To train the initial version of our model, we need
images with a maximum of 20 paths, because al-
most all typical vector icons satisfy this condition.
We cannot use vectorizers for the following reasons: (1) they generate too many paths (Dziuba et al., 2023) for simple images (software algorithms (Tian and Günther, 2022)); (2) they are extremely time-consuming (LIVE, CLIPGen); (3) our experiments with Im2Vec showed that it is very unstable and unable to memorize and vectorize images containing more than 10 shapes.
To collect the largest possible dataset, we searched for images from open sources on the Internet. Most of the search results are giant maps, diagrams, and other images containing thousands of paths. They are too complicated and not of interest for generation. We managed to collect 10M images, most of which were unsuitable for high-quality image generation as they contain curved lines, fonts, diagrams, and other clipart.
Therefore, we train our autoencoder model in two stages: first, we use all the collected data (with augmentations) to train the model to operate with vector shapes in a basic way; second, we employ only high-quality data for fine-tuning.
However, the amount of collected data is still not
enough to train such a model. To overcome this, we
created the following augmentations to expand our
dataset:
1. Image mixing (one path from each of 20 images),
2. Random shuffling of paths within an image,
3. Adding anchor points on existing paths,
4. Reversing the order of anchor points in paths,
5. Cyclic shift of anchor points in cyclic paths,
6. Random color changes,
7. Adding random noise to the coordinates of anchor points.
Most of these augmentations spoil the images, but they should help the model better understand the structure of vector images and avoid overfitting; a few of these operations are sketched below. In addition, we used randomly generated images to enlarge the dataset; their effect is considered in the Discussion section.
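The following is a minimal sketch of augmentations (5), (6), and (7), assuming a path is stored as an (n, 2) array of anchor-point coordinates plus an RGB color; this data layout is an assumption of the sketch, not the exact format used in our pipeline.

```python
import numpy as np

def cyclic_shift(points, k=None):
    """Augmentation (5): cyclically shift the anchor points of a cyclic path.
    `points` is an (n, 2) array of coordinates."""
    if k is None:
        k = np.random.randint(len(points))
    return np.roll(points, shift=k, axis=0)

def random_color(_rgb):
    """Augmentation (6): replace the path color with a random RGB value."""
    return np.random.randint(0, 256, size=3)

def add_coordinate_noise(points, scale=0.5):
    """Augmentation (7): jitter anchor-point coordinates with Gaussian noise."""
    return points + np.random.normal(0.0, scale, size=points.shape)
```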
The proportions of data in the dataset were as follows: approximately 80% consisted of augmented images, 5% were random images, and the remaining 15% were original vector images. This balanced approach helps enhance the model's robustness by providing a diverse set of training examples, which is crucial for improving performance and reducing overfitting.
To fine-tune the model, we additionally collected 0.5M images from open trusted sources (https://freesvg.org/ and https://www.svgrepo.com/collections/). At the fine-tuning stage, we used only augmentations that do not change the final image, namely (3), (4), and (5).
3.2 Autoencoder
We follow the two-stage encoding and decoding approach of DeepSVG (Carlier et al., 2020). First, we transform each path of the image into a corresponding embedding using the Path Encoder. Then we combine these embeddings using the SVG Encoder and obtain a representation of the entire image. The decoding process is performed in reverse order. All paths at the encoding and decoding stages are processed independently. We train two transformers, one for encoding and decoding paths (Path Encoder/Decoder) and one for processing the entire image representation (SVG Encoder/Decoder). The general scheme of our model training is presented in Fig. 1a. We call the model VectorWeaver because the second transformer “weaves” the resulting vector image from curve embeddings.
3.3 Encoding Process
Path Encoder. To encode a path, we must first represent all its anchor point coordinates as tokens. Although the coordinates are float values, a common practice is to discretize them and use them with a special dictionary, as in (Nash et al., 2020; Esser et al., 2021). Inspired by this approach, we divide each axis into N = 64 equal segments. Following this discretization, we train an embedding dictionary; namely, we employ two dictionaries, the first one for x coordinates and the second for y coordinates. Besides, we use the float value of the actual coordinate position within its segment so as not to lose information about the real location of the point (float in-block coordinates).
To create a token, we use the positions (indices) of the anchor points inside their path and the position (index) of this path in the SVG file. These discrete indices are mapped to continuous learnable vectors using dictionaries. All features (discretized coordinate tokens, float in-block values, and their positional encodings) are passed through linear layers, resulting in the final token.
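The sketch below illustrates this tokenization: each coordinate is mapped to a discrete bin index looked up in a learnable dictionary, and the float in-block offset is concatenated before a linear projection. Embedding dimensions and layer names are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class CoordTokenizer(nn.Module):
    """Sketch of the coordinate tokenization described above: per-axis bin
    dictionaries plus the float in-block offsets, fused into a single token."""

    def __init__(self, n_bins=64, dim=128):
        super().__init__()
        self.n_bins = n_bins
        self.x_dict = nn.Embedding(n_bins, dim)   # dictionary for x-coordinate bins
        self.y_dict = nn.Embedding(n_bins, dim)   # dictionary for y-coordinate bins
        self.proj = nn.Linear(2 * dim + 2, dim)   # fuse bin embeddings + 2 float offsets

    def forward(self, xy):
        # xy: (..., 2) coordinates normalized to [0, 1)
        scaled = xy.clamp(0, 1 - 1e-6) * self.n_bins
        idx = scaled.long()                       # discrete segment index per axis
        frac = scaled - idx.float()               # float position inside the segment
        token = torch.cat(
            [self.x_dict(idx[..., 0]), self.y_dict(idx[..., 1]), frac], dim=-1
        )
        return self.proj(token)
```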
Additionally, we provide the transformer with information about the fill attribute of the path. To do this, we train three dictionaries of tokens, each consisting of 255 embeddings. Using them, we translate the RGB color into a set of three tokens for the transformer. The overall path encoding scheme is shown in Fig. 2. Summarizing, in the path encoder we use the following types of tokens: segment coordinate tokens, positional encodings of segments, color encodings (RGB), and a token for the path position in the whole image.
We use a transformer encoder here to enrich these tokens with information about each other. After enriching all the embeddings via the transformer encoder, we combine them into one common path embedding using concatenation and a small network of linear layers.
SVG Encoder. Having passed all n paths through the Path Encoder, we obtain embeddings of all paths. It remains to combine them into a common embedding for the entire image. To do this, we use a transformer decoder (using a decoder here is simply an implementation detail for generating two tokens; an encoder could be used instead), having previously added positional embeddings to the tokens, as shown in Fig. 3.
The SVG encoder predicts two embeddings, µ and σ, that are the parameters of the normal distribution from which the final embedding of the image is sampled. We use the standard reparameterization trick, which also helps the diffusion model perform better. The diffusion model removes noise from the embedding of the image; without the reparameterization trick, it may not do this perfectly and may miss the real latent embedding. The reparameterization trick, however, guarantees that some area around each embedding µ is also decoded into the image we need. Therefore, during the training of the diffusion model, we set σ = 0 in order to better predict the central vectors of the latent distribution.
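A minimal sketch of this sampling step is given below; the flag for zeroing σ when preparing targets for the diffusion model follows the description above, while the function name and the log-σ parameterization are assumptions.

```python
import torch

def sample_latent(mu, log_sigma, for_diffusion=False):
    """Reparameterization trick: z = mu + sigma * eps.
    When preparing targets for the diffusion model we set sigma = 0,
    so the central vector of the latent distribution is used."""
    if for_diffusion:
        return mu                       # sigma = 0: take the distribution center
    eps = torch.randn_like(mu)
    return mu + log_sigma.exp() * eps   # standard VAE sampling
```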
3.4 Decoding Process
SVG Decoder. First of all, we need to obtain informa-
tion about each path from the embedding of the entire
image (sampled from the latent space). Moreover, we
need to understand how many paths the original im-
age has.
To solve this problem, the transformer predicts not only the embeddings of the paths but also the probability of their existence (technically, we use and train a small network on each generated path embedding for that). We impose a large loss on this part, which severely penalizes the model for discarding real paths or drawing non-existent ones.
The SVG decoder is a regular transformer de-
coder, which should solve a one-to-many problem, as
presented in Fig. 4a.
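A sketch of the existence head is shown below: a small network maps each decoded path embedding to a logit, and a heavily weighted binary cross-entropy penalizes dropping real paths or hallucinating new ones. The layer sizes and the weight value are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PathExistenceHead(nn.Module):
    """Predicts, for each decoded path embedding, whether the path should be drawn."""

    def __init__(self, dim=128, weight=10.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.weight = weight  # large weight: mistakes here are penalized severely

    def forward(self, path_emb, exists):
        # path_emb: (batch, max_paths, dim); exists: (batch, max_paths) in {0, 1}
        logits = self.net(path_emb).squeeze(-1)
        return self.weight * F.binary_cross_entropy_with_logits(logits, exists.float())
```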
Path Decoder. The remaining part is the prediction of the vector image parameters. The embedding obtained from the SVG Decoder is transformed into a fixed set of tokens. We also utilize positional encoding for these tokens. The sequence of tokens is passed to a transformer encoder in order to enrich each token with information from its neighbors. In the end, we use a pair of linear layers to recover the original vector image attributes.
Figure 2: Path encoding scheme. The figure shows which characteristics of a shape we use and how we transform them into an embedding of the entire shape. Namely, we encode as tokens the discretized shape coordinates, their float in-block coordinates, the shape color, and the shape position in the entire image. Additionally, we utilize positional encoding for the coordinates.
Figure 3: SVG encoding scheme. A single transformer combines several shape embeddings into the parameters of the normal distribution that defines the entire image.
3.5 Latent Diffusion
After training the autoencoder, we separately train an
unconditional diffusion model that operates in the latent space of the trained VAE. We generated and sampled target embeddings of the images from our dataset using the VAE Encoder (Path and SVG Encoders) and trained the diffusion model to predict these embeddings from their noised equivalents. We used a 15-layer DenseNet for this task.
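A sketch of one training step of this latent diffusion model is given below: images are encoded once with the frozen VAE encoder (with σ = 0), noise is added to the embeddings, and the denoiser is trained to predict the clean embedding. The `vae_encoder` and `denoiser` interfaces and the reuse of a precomputed noise schedule are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def latent_diffusion_step(denoiser, vae_encoder, svg_batch, optimizer,
                          alphas_cumprod, T=1000):
    """One training step of the latent diffusion model (sketch)."""
    with torch.no_grad():
        z0 = vae_encoder(svg_batch, for_diffusion=True)  # sigma = 0: central latents
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    a_bar = alphas_cumprod.to(z0.device)[t].unsqueeze(-1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * torch.randn_like(z0)
    pred = denoiser(z_t, t)                 # e.g., a dense denoising network
    loss = F.mse_loss(pred, z0)             # predict the clean embedding
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```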
Notably, all diffusion models make it easy to add generation conditioned on prompts. For instance, for text-based generation, it is enough to feed text features to the input of the network that removes noise from the vector embeddings. However, because our dataset does not have high-quality text descriptions, text-conditioned generation with our model remains unresolved. We plan to solve this issue, for example, by expanding our dataset with text descriptions derived from raster vision-language models. Therefore, at the moment, generation with our model is unconditional, and only one noisy vector is supplied.
Figure 4: Blocks of the decoding layer. (a) SVG decoding scheme. This block transforms the latent code into path embeddings with flags that indicate whether a path should be visible. (b) Path decoding scheme. This layer follows the SVG Decoder block and predicts the path parameters (coordinates and colors) from each generated path embedding.
3.6 Losses
As mentioned above, we impose a cross-entropy loss $L_{CE}$ with a large weight on determining the number of paths in the decoder. In fact, this is a very simple task for the model, and this loss converges already at the initial steps.
To preserve the image structure, we impose an MSE loss $L^{MSE}_{x,y}$ on the coordinates, thereby forcing the model to give the same output as it receives at the input. However, we found that this approach alone is not good enough and requires improvement, such as the use of other loss functions, because the lines in the generated images were not precise enough.
This effect is easier to illustrate on raster images. Training a raster convolutional autoencoder with the MSE loss results in blurry images: it is easier for the model to learn a smooth transition instead of a sharp color change at object boundaries, while the MSE loss remains quite low. Similarly, in the case of vector images, long straight lines can be made a little crooked without affecting the loss greatly. Such images are discussed in the Discussion section.
To solve the problem of crooked contours, we used a standard trick: adding an adversarial discriminator. The idea is to add a loss $L_{adv}$ from a scoring model to the overall loss; the discriminator is adversarially trained to distinguish the input of the autoencoder from its output. The discriminator easily detects, from smooth transitions, that a given image is fake, thus motivating the autoencoder to generate sharper images. Such an approach was also applied in the original Stable Diffusion.
The discriminator architecture is similar to the introduced VAE Encoder (Path and SVG Encoders). The only difference is that it lacks discretization modules, in order to maintain backpropagation through the outputs of the model. Such an encoder has a weaker generalizing ability and learns worse, but it easily detects errors to which the MSE loss is blind.
We also added an MSE loss for color detection, $L^{MSE}_{color}$, and the KL loss for the VAE, $L_{KL}$. As a result, the total loss is:

$$L = \lambda \cdot L_{CE} + L^{MSE}_{x,y} + L_{adv} + L^{MSE}_{color} + L_{KL}. \qquad (1)$$
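A trivial sketch of how these terms are combined is shown below; the individual losses are assumed to be computed elsewhere, and the value of lambda is a placeholder, since the exact weight is not reported here.

```python
def total_loss(l_ce, l_mse_xy, l_adv, l_mse_color, l_kl, lam=10.0):
    """Total objective from Eq. (1); `lam` weights the path-count cross-entropy.
    The default value of `lam` is an arbitrary illustrative placeholder."""
    return lam * l_ce + l_mse_xy + l_adv + l_mse_color + l_kl
```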
4 EXPERIMENTS AND RESULTS
4.1 Experimental Evaluation
The training of our model, VectorWeaver, was performed on 8 NVIDIA Tesla A100 GPUs with 80 GB of memory for 14 days. The fine-tuning was performed using 2 NVIDIA Tesla A100 GPUs with 32 GB of memory for one day. The AdamW optimization algorithm was chosen for training with a cosine learning rate schedule of 5e-6. The generation results are presented in Fig. 1b and Fig. 5. The diffusion process is presented in Fig. 7.
We compared our model with DeepSVG, trained on our vector image dataset, and with VectorFusion, which needs no training. We measured the basic generation metrics FID and Inception Score (IS). We also conducted a survey asking participants to rate the vector images generated by the proposed model, DeepSVG, and VectorFusion on a scale of 1 to 5 (1 stands for completely inappropriate, 5 stands for a perfect fit). 100 assessors took part in the survey. The results of vector image generation with the three methods are presented in Fig. 6. Moreover, we compared the execution times of these methods. The comparison results are presented in Tab. 1. Additionally, we present examples of image interpolation provided by our model in Fig. 8.
Due to the lack of colors and crooked contours,
DeepSVG was always considered by assessors as the
worst option. Since VectorFusion generates realistic
and complex images, assessors sometimes rate it bet-
ter than the proposed method.
Figure 5: 100 randomly selected images generated by VectorWeaver.
Table 1: Comparison of DeepSVG, VectorFusion, and our method using IS and FID metrics, normalized survey results, and algorithm running times.

Method       | IS   | FID  | Assessors' score | Avg time per img, s
DeepSVG      | 1.53 | 30.7 | 0.19 ± 0.03      | 0.06
VectorFusion | 1.2  | 15.1 | 0.48 ± 0.05      | 1800
VectorWeaver | 1.8  | 4.3  | 0.79 ± 0.03      | 3
However, its generated images often contain artifacts, and the drawing is obtained “with strokes”. In 90% of cases, assessors called our generation the best. We also measured
the Inception Score. The results of all three models
turned out to be quite low because the Inception net-
work was trained on a raster dataset. However, our
result was the best apparently due to a smaller num-
ber of artifacts and greater simplicity of the images.
We obtained similar results with the FID metric.
In comparison with DeepSVG, the main advantage of our model is the quality of generation. VectorFusion generates complex vector images with a large number of “strokes”, but it takes much longer. In the paper (Jain et al., 2022), the authors stated that it takes 30 minutes to generate one image. In our case, the diffusion model inference and decoder application require only 3 seconds on average.
4.2 Discussion
Training Without the Discriminator Loss. At the initial stage, we trained our model without the adversarial loss. As seen in Fig. 9, the results are noticeably more crooked than those generated with the adversarial loss. The discriminator sees uneven joints on the lines and detects such an image as synthetic.
However, such images look more human-made, as if a person drew them in an image editor. The model can thus be used as a generator of images in this specific style.
Training on Random Images. In Subsection 3.1,
we mentioned adding random images to the dataset
(the coordinates of the paths are random noise). They
help prevent the model’s overfitting because it is
forced to remember them as well. If the model starts
to generalize random images poorly, then the loss on
this part of the dataset grows. In practice, 5 percent of
the images in our dataset were generated randomly.
Figure 6: Qualitative comparison of vector image generation results using our VectorWeaver model, DeepSVG, and VectorFusion for the text queries "Ship", "Coffee cup", "Flag", "Eiffel", and "Flower".
Figure 7: The diffusion process. Each line corresponds to a generation of a single image. We start with random noise in a
latent space and remove it until we come to a good result. We produce 1000 steps and show every hundredth.
Limitations. At the moment, our method oper-
ates with images consisting of no more than 20 paths
defined by no more than 20 anchor points. The selec-
tion of the number of paths was dictated by the lim-
ited computational resources. However, we believe
that our approach would generalize to more complicated images if the model were trained with more computational resources.
Figure 8: Example of interpolation with our model between two randomly generated images, from left to right.
Figure 9: 100 examples of generation without the adversarial loss.
Unfortunately, it is currently difficult to assemble a large, high-quality dataset of text-image pairs; therefore, we made the generation unconditional. However, once such a dataset is assembled, one can easily add text-conditioned generation by adding the CLIP (Radford et al., 2021) model as a text feature extractor for the diffusion model.
5 CONCLUSION
In this work, we have proposed a method for generating vector images based on a diffusion process and stacked transformer-based autoencoders. Our method, VectorWeaver, operates exclusively within the vector image domain. As a result, it avoids many of the challenges associated with transforming images between raster and vector formats. Moreover, to enlarge the training dataset, we have proposed vector image augmentations. We have demonstrated that our model is capable of generating vector images consisting of an arbitrary number of paths and compared it with DeepSVG and VectorFusion. The comparison showed that our model generates smoother vector images and does so 600 times faster than VectorFusion. We trained the proposed model to generate colored icons, but it can be trained to generate more complex images.
In the future, we plan to add more augmentations,
make the model text-conditional and to train it with
more resources. We expect the proposed architecture
to serve as a strong baseline for further research in the
little-explored field of vector graphics.
ACKNOWLEDGEMENTS
The research was supported by ITMO University, project 623097 ”Development of libraries containing perspective machine learning methods”.
REFERENCES
Cai, M., Huang, Z., Li, Y., Ojha, U., Wang, H., and Lee,
Y. J. (2024). Leveraging large language models for
scalable vector graphics-driven image understanding.
Carlier, A., Danelljan, M., Alahi, A., and Timofte,
R. (2020). Deepsvg: A hierarchical generative
network for vector graphics animation. CoRR,
abs/2007.11301.
Chang, H., Zhang, H., Jiang, L., Liu, C., and Freeman,
W. T. (2022). Maskgit: Masked generative image
transformer. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 11315–11325.
Du, R., Chang, D., Hospedales, T., Song, Y.-Z., and Ma, Z.
(2023). Demofusion: Democratising high-resolution
image generation with no $$$.
Dziuba, M., Jarsky, I., Efimova, V., and Filchenkov, A.
(2023). Image vectorization: a review. arXiv preprint
arXiv:2306.06441.
Efimova, V., Jarsky, I., Bizyaev, I., and Filchenkov,
A. (2022). Conditional vector graphics gener-
ation for music cover images. arXiv preprint
arXiv:2205.07301.
Esser, P., Rombach, R., and Ommer, B. (2021). Tam-
ing transformers for high-resolution image synthesis.
In Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition, pages 12873–
12883.
Frans, K., Soros, L. B., and Witkowski, O. (2021). Clip-
draw: Exploring text-to-drawing synthesis through
language-image encoders. CoRR, abs/2106.14843.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion
probabilistic models. Advances in Neural Information
Processing Systems, 33:6840–6851.
Jain, A., Xie, A., and Abbeel, P. (2022). Vectorfusion:
Text-to-svg by abstracting pixel-based diffusion mod-
els. arXiv preprint arXiv:2211.11319.
Kingma, D. P. and Welling, M. (2019). An introduc-
tion to variational autoencoders. arXiv preprint
arXiv:1906.02691.
Li, T.-M., Lukáč, M., Michaël, G., and Ragan-Kelley, J. (2020). Differentiable vector graphics rasterization for editing and learning. ACM Trans. Graph. (Proc. SIGGRAPH Asia), 39(6):193:1–193:15.
Lopes, R. G., Ha, D., Eck, D., and Shlens, J. (2019). A
learned representation for scalable vector graphics.
CoRR, abs/1904.02632.
Ma, X., Zhou, Y., Xu, X., Sun, B., Filev, V., Orlov, N.,
Fu, Y., and Shi, H. (2022). Towards layer-wise image
vectorization. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 16314–16323.
Nash, C., Ganin, Y., Eslami, S. A., and Battaglia, P. (2020).
Polygen: An autoregressive generative model of 3d
meshes. In International conference on machine
learning, pages 7220–7229. PMLR.
Nishina, K. and Matsui, Y. (2024). Svgeditbench: A bench-
mark dataset for quantitative assessment of llm’s svg
editing capabilities.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., Krueger, G., and Sutskever, I. (2021). Learning
transferable visual models from natural language su-
pervision. CoRR, abs/2103.00020.
Reddy, P., Gharbi, M., Lukac, M., and Mitra, N. J. (2021).
Im2vec: Synthesizing vector graphics without vector
supervision. arXiv preprint arXiv:2102.02798.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and
Ommer, B. (2022). High-resolution image synthesis
with latent diffusion models. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 10684–10695.
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Den-
ton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi,
S. S., Lopes, R. G., et al. (2022). Photorealistic text-
to-image diffusion models with deep language under-
standing. arXiv preprint arXiv:2205.11487.
Schaldenbrand, P., Liu, Z., and Oh, J. (2022). Styleclip-
draw: Coupling content and style in text-to-drawing
translation. arXiv preprint arXiv:2202.12362.
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C.,
Wightman, R., Cherti, M., Coombes, T., Katta, A.,
Mullis, C., Wortsman, M., et al. (2022). Laion-5b: An
open large-scale dataset for training next generation
image-text models. arXiv preprint arXiv:2210.08402.
Shen, I.-C. and Chen, B.-Y. (2021). Clipgen: A deep
generative model for clipart vectorization and synthe-
sis. IEEE Transactions on Visualization and Com-
puter Graphics, 28(12):4211–4224.
Tian, X. and Günther, T. (2022). A survey of smooth vector
graphics: Recent advances in representation, creation,
rasterization and image vectorization. IEEE Transac-
tions on Visualization and Computer Graphics.
Timofeenko, B. A., Efimova, V., and Filchenkov, A. A.
(2023). Vector graphics generation with llms: ap-
proaches and models. Zapiski Nauchnykh Seminarov
POMI, 530(0):24–37.
Van Den Oord, A., Vinyals, O., et al. (2017). Neural discrete
representation learning. Advances in neural informa-
tion processing systems, 30.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Wang, Y. and Lian, Z. (2021). Deepvecfont: synthesizing
high-quality vector fonts via dual-modality learning.
ACM Transactions on Graphics (TOG), 40(6):1–15.
Wu, H., Shen, S., Hu, Q., Zhang, X., Zhang, Y., and Wang,
Y. (2024). Megafusion: Extend diffusion models to-
wards higher-resolution image generation without fur-
ther tuning.
Wu, R., Su, W., Ma, K., and Liao, J. (2023). Iconshop:
Text-guided vector icon synthesis with autoregressive
transformers.
Xing, X., Zhou, H., Wang, C., Zhang, J., Xu, D., and Yu, Q.
(2024). Svgdreamer: Text guided svg generation with
diffusion model. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 4546–4555.
Xu, X., Wang, Z., Zhang, E., Wang, K., and Shi, H.
(2022). Versatile diffusion: Text, images and vari-
ations all in one diffusion model. arXiv preprint
arXiv:2211.08332.
Xu, Z. and Wall, E. (2024). Exploring the capability of llms
in performing low-level visual analytic tasks on svg
data visualizations.
Yan, S. (2023). Redualsvg: Refined scalable vector graph-
ics generation. In International Conference on Artifi-
cial Neural Networks, pages 87–98. Springer.
Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang,
Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K.,
et al. (2022). Scaling autoregressive models for
content-rich text-to-image generation. arXiv preprint
arXiv:2206.10789.
Zhang, P., Zhao, N., and Liao, J. (2023). Text-guided vector
graphics customization.
Zou, B., Cai, M., Zhang, J., and Lee, Y. J. (2024). Vgbench:
Evaluating large language models on vector graphics
understanding and generation.