other one with synthetic images). The best classification rates were achieved using only real-world data (91%), followed by only synthetic data (81-83%), while the worst results were obtained with mixed data (50-76%). A comparison of the frequency spectra of the synthetic images showed that each generator exhibits unique model-specific characteristics (a model-specific fingerprint). This fingerprint is likely one of the reasons for the degraded performance of the classifiers trained on synthetic data and, even more so, on mixed data. In any case, the experiments clearly demonstrate that, despite substantial progress in the field of synthesis, synthetic training data should always be used with caution, especially when synthetic data is used to replace the data of only a single class. There is always the risk that a deep-learning-based classifier learns to differentiate based on model-specific characteristics and not, as intended, based on class-specific features.
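Such a spectral comparison can be reproduced with a short NumPy sketch. The function below is only an illustration under our own assumptions (the function name, the grayscale (N, H, W) input format, and the normalisation are not taken from the experimental code): it averages log-magnitude FFT spectra over a set of images so that recurring generator-specific frequency patterns stand out when two such averages are subtracted.

```python
import numpy as np

def mean_log_spectrum(images):
    """Average log-magnitude 2D FFT spectrum of a stack of grayscale images.

    `images`: array of shape (N, H, W) with pixel values in [0, 1].
    """
    spectra = []
    for img in images:
        # Subtract the image mean so the DC peak does not dominate the spectrum.
        f = np.fft.fftshift(np.fft.fft2(img - img.mean()))
        spectra.append(np.log1p(np.abs(f)))
    return np.mean(np.stack(spectra), axis=0)

# Subtracting the average spectrum of real images from that of a generator's
# output highlights the model-specific fingerprint, e.g.:
# fingerprint = mean_log_spectrum(synthetic_imgs) - mean_log_spectrum(real_imgs)
```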
The employed FP removal techniques did not work as intended and were unable to bridge the gap between real and synthetic data. In future work, we therefore plan to develop better FP removal techniques.
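As an illustration of what such a removal step might look like, the sketch below applies generic spectral filtering; it is not the technique evaluated in the paper, and the function name, the `strength` parameter, and the reuse of the fingerprint estimate from the previous sketch are assumptions. It attenuates the estimated excess magnitude of a generator's spectrum for a single image.

```python
import numpy as np

def suppress_fingerprint(img, fingerprint, strength=1.0):
    """Attenuate an estimated model-specific spectral pattern in one image.

    `img`: (H, W) grayscale image with values in [0, 1].
    `fingerprint`: (H, W) excess log-magnitude spectrum of the generator,
    e.g. the difference computed in the previous sketch.
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    mag, phase = np.abs(f), np.angle(f)
    # Shrink the magnitude only where the fingerprint indicates excess energy.
    attenuation = np.exp(-strength * np.clip(fingerprint, 0.0, None))
    f_filtered = mag * attenuation * np.exp(1j * phase)
    img_filtered = np.real(np.fft.ifft2(np.fft.ifftshift(f_filtered)))
    return np.clip(img_filtered, 0.0, 1.0)
```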
ACKNOWLEDGEMENTS
This work has been partially supported by the Salzburg State Government within the Science and Innovation Strategy Salzburg 2025 (WISS 2025) under the project AIIV Salzburg (Artificial Intelligence in Industrial Vision), project no. 20102-F2100737-FPR.