REFERENCES
Carlier, A., Danelljan, M., Alahi, A., and Timofte, R. (2020). DeepSVG: A hierarchical generative network for vector graphics animation. Advances in Neural Information Processing Systems, 33:16351–16361.
Deng, Y., Tang, F., Dong, W., Ma, C., Pan, X., Wang, L., and Xu, C. (2022). StyTr²: Image style transfer with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11326–11336.
Dumoulin, V., Shlens, J., and Kudlur, M. (2016). A learned representation for artistic style. arXiv preprint arXiv:1610.07629.
Efimova, V., Jarsky, I., Bizyaev, I., and Filchenkov, A. (2022). Conditional vector graphics generation for music cover images. arXiv preprint arXiv:2205.07301.
Frans, K., Soros, L. B., and Witkowski, O. (2021). CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. arXiv preprint arXiv:2106.14843.
Gatys, L. A., Ecker, A. S., and Bethge, M. (2015). A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576.
Jing, Y., Yang, Y., Feng, Z., Ye, J., Yu, Y., and Song, M. (2019). Neural style transfer: A review. IEEE Transactions on Visualization and Computer Graphics, 26(11):3365–3385.
Johnson, J., Alahi, A., and Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer.
Kettunen, M., Härkönen, E., and Lehtinen, J. (2019). E-LPIPS: Robust perceptual image similarity via random transformation ensembles. arXiv preprint arXiv:1906.03973.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90.
Kwon, G. and Ye, J. C. (2022). CLIPstyler: Image style transfer with a single text condition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18062–18071.
Li, C. and Wand, M. (2016). Combining Markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2479–2486.
Li, T.-M., Lukáč, M., Gharbi, M., and Ragan-Kelley, J. (2020). Differentiable vector graphics rasterization for editing and learning. ACM Transactions on Graphics (TOG), 39(6):1–15.
Li, Y., Wang, N., Liu, J., and Hou, X. (2017). Demystifying neural style transfer. arXiv preprint arXiv:1701.01036.
Ma, X., Zhou, Y., Xu, X., Sun, B., Filev, V., Orlov, N., Fu, Y., and Shi, H. (2022). Towards layer-wise image vectorization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16314–16323.
Park, D. Y. and Lee, K. H. (2019). Arbitrary style transfer with style-attentional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5880–5888.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
Reddy, P., Gharbi, M., Lukáč, M., and Mitra, N. J. (2021). Im2Vec: Synthesizing vector graphics without vector supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7342–7351.
Schaldenbrand, P., Liu, Z., and Oh, J. (2022). StyleCLIPDraw: Coupling content and style in text-to-drawing translation. arXiv preprint arXiv:2202.12362.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Ulyanov, D., Lebedev, V., Vedaldi, A., and Lempitsky, V. (2016). Texture networks: Feed-forward synthesis of textures and stylized images. arXiv preprint arXiv:1603.03417.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wang, Y. and Lian, Z. (2021). DeepVecFont: Synthesizing high-quality vector fonts via dual-modality learning. ACM Transactions on Graphics (TOG), 40(6):1–15.
Yoo, J., Uh, Y., Chun, S., Kang, B., and Ha, J.-W. (2019). Photorealistic style transfer via wavelet transforms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9036–9045.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595.
Zhang, Y., Tang, F., Dong, W., Huang, H., Ma, C., Lee, T.-Y., and Xu, C. (2022). Domain enhanced arbitrary image style transfer via contrastive learning. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–8.