Towards Adversarially Robust AI-Generated Image Detection
Annan Zou
Department of Computer Science, Vanderbilt University, Nashville, U.S.A.
Keywords: Artificial Intelligence Generated Content (AIGC), Adversarial Robustness, Image Classification.
Abstract: Over the last few years, Artificial Intelligence Generated Content (AIGC) technology has rapidly matured and garnered public attention due to its ease of use and the quality of its results. However, these same characteristics give forged images generated by AIGC technology a high potential for misuse and negative social consequences. While AI-based tools can identify AI-generated images with reasonable accuracy, these models do not account for adversarial robustness, i.e., resistance against intentional attacks. This paper empirically evaluates the adversarial robustness of several existing AIGC detection models under selected attack setups. Overall, it is found that perturbations of the source images that are unnoticeable to the naked eye consistently cause sharp drops in performance across all of the models in question. This study proposes constructing a Convolutional Neural Network (CNN) based AIGC classifier with additional adversarial training that uses a combination of transformation-based and ℓ∞-bounded adversarial examples constructed from existing AIGC data. Both clean and adversarial datasets are used to test the resulting model. The results show markedly improved robustness against the adversarial attack techniques described above while maintaining comparable accuracy on clean datasets.
1 INTRODUCTION
Artificial Intelligence Generated Content (AIGC) refers to content created by artificial-intelligence-based systems. The topic of AIGC has garnered enormous public and academic attention over the past few years due to its rapid improvement, proliferation, and commercial adoption. AIGC systems can produce various content types from different inputs, but text-to-image generation models in particular have been a focus of controversy due to their ample potential for misuse. Some worry that they threaten the art industry and stifle creativity in the visual arts, while others point out that they can be used to disseminate misinformation by generating photorealistic scenes of non-existent events (Z. Sha et al., 2022).
Moreover, text-to-image AIGC has been, and still is, improving at an impressive pace, to the point where some worry it will soon become altogether impossible for human eyes to distinguish AI-generated images from natural photographs. This uncertain future necessitates the development of tools that can reliably discern AI-generated images.
Several works have focused on AIGC image classification. Bird & Lotfi used a customized Convolutional Neural Network (CNN) with Gradient-weighted Class Activation Mapping (Grad-CAM) to offer a more explainable approach to classifying AI-generated images and generalizing AIGC artifacts (J. Bird Jordan, and L. Ahmad, 2023). Xi et al. developed a novel dual-stream network for pure image classification that synthesizes AIGC artifacts in the high- and low-frequency regions of a given image through a cross-attention module (Z. Xi et al., 2023). All of these models achieve impressive classification accuracy that far exceeds that of conventional forensic models. However, no studies have focused on AIGC classifiers' adversarial robustness, i.e., how well they perform on inputs intentionally engineered to misguide or otherwise disrupt their classification (A. Madry et al., 2017).
All of the methods above rely on CNN architectures, which previous studies have shown to be especially susceptible to adversarial attacks such as Projected Gradient Descent (PGD) (A. Madry et al., 2017), (L. Engstrom et al., 2019).
This paper aims to address these issues by (1)
assessing the adversarial robustness of existing
models and (2) creating a model that can achieve a
higher degree of robustness against both adversarial attacks and naturally occurring transformations. The first goal is achieved by combining the CIFAKE database developed by Bird & Lotfi (J. Bird Jordan, and L. Ahmad, 2023), adversarial perturbation methods from the Adversarial Robustness Toolbox, and input transformation methodology inspired by Engstrom et al. to generate an adversarial dataset based on the CIFAKE database, and then testing the performance of existing models on that dataset (L. Engstrom et al., 2019), (M. I. Nicolae, and M. Sinn, 2018). The second goal is achieved by constructing a CNN-based model that attains adversarial robustness through data augmentation and adversarial training. To verify that goal (2) is fulfilled, this study comprehensively evaluates the resulting model's performance on the clean CIFAKE database and on the aforementioned adversarially perturbed datasets, including comparative and ablation studies. In summary, the contributions of this paper include:
• Constructed an adversarial dataset of real and T2I (text-to-image) AI-generated images using different adversarial perturbation techniques, totaling 120,000 images split into two 60,000-image groups corresponding to clean and attacked input data.
• Evaluated the adversarial robustness of existing models using the adversarial datasets and preprocessing scripts that perform rotations and translations. This evaluation shows that existing models are vulnerable to both adversarial perturbation and preprocessing.
• Introduced an adversarially robust model for T2I AIGC detection using a combination of preprocessors and adversarial training. Using the aforementioned adversarial datasets, this model is verified to have superior adversarial and spatial robustness compared with existing models while maintaining comparable accuracy on clean datasets.
2 METHOD
This section is divided into three parts. The first part (Sections 2.1-2.3) describes the source of the data, the bounded adversarial perturbation methods used to generate the adversarial datasets, and the adversarial spatial transformation procedures. The second part (Section 2.4) briefly describes the models used in the robustness evaluation and the evaluation methods and metrics. The third part (Section 2.5) describes, in detail, the overall architecture of the adversarially robust AIGC detection model, including the preprocessors, the classifier proper, the adversarial training procedures, and the loss function.
2.1 Base Dataset
This paper uses the CIFAKE database, which consists of 120,000 images, as its base dataset. 60,000 of these are photographic images taken from the CIFAR-10 dataset, a database widely used for image classification tasks (J. Bird Jordan, and L. Ahmad, 2023). These are 32 × 32 RGB images of real subjects divided into ten classes. The other 60,000 are RGB images generated by Stable Diffusion v1.4, a popular, publicly available T2I model that uses latent diffusion to generate synthetic images. The AIGC portion is formatted the same way as the CIFAR-10 data: ten classes of 32 × 32 RGB images of objects. For this study, the dataset is divided into two equal-sized subsets, each consisting of 30,000 pairs of natural and AIGC images. Within each subset, 83.3% (25,000 pairs) of the images are used for training and 16.7% (5,000 pairs) for testing.
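For illustration, a split laid out this way can be loaded with standard PyTorch tooling. The sketch below assumes the CIFAKE images have been unpacked into class-labelled directories; the paths and batch size are placeholders rather than values prescribed by the paper.

```python
# Minimal sketch of loading the CIFAKE train/test split with torchvision.
# Assumes a layout such as data/train/{REAL,FAKE} and data/test/{REAL,FAKE};
# the folder names and batch size are illustrative only.
import torch
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()  # 32 x 32 RGB images scaled to [0, 1]

train_set = datasets.ImageFolder("data/train", transform=to_tensor)
test_set = datasets.ImageFolder("data/test", transform=to_tensor)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)
```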
2.2 Gradient-Based Adversarial
Methods
Adversarial attacks on classifiers can be framed as optimization problems that maximize the classifier's loss while keeping the perturbation to the input small. Formally, given a classifier $f$ that maps an input $x$ to a label $y$ under loss $\mathcal{L}$, an adversarial attack seeks a perturbation $\delta$ such that

$$\max_{\delta} \; \mathcal{L}\big(f(x + \delta),\, y\big) \quad \text{subject to} \quad \|\delta\|_{p} \le \epsilon \qquad (1)$$

where $\|\cdot\|_{p}$ is an $\ell_{p}$ norm and $\epsilon$ is the given perturbation budget.
Many state-of-the-art adversarial attacks build on the Fast Gradient Sign Method (FGSM), first proposed by Goodfellow, Shlens, and Szegedy in 2014, where the perturbed input $x'$ is given by (I. J. Goodfellow et al., 2014):

$$x' = x + \epsilon \cdot \operatorname{sign}\big(\nabla_{x} \mathcal{L}(\theta, x, y)\big) \qquad (2)$$

where $\nabla_{x} \mathcal{L}(\theta, x, y)$ is the gradient of the original model's loss with respect to the input $x$, evaluated at model parameters $\theta$ and label $y$. An improved and considerably more powerful derivative of FGSM is an iterative version that breaks the problem down into
several smaller maximization steps of size $\alpha$, repeated for a fixed number of iterations. This variation, known as Projected Gradient Descent (PGD), is formalized by the expression

$$x^{(t+1)} = \Pi_{x+S}\Big( x^{(t)} + \alpha \cdot \operatorname{sign}\big( \nabla_{x} \mathcal{L}(\theta, x^{(t)}, y) \big) \Big) \qquad (3)$$

where $\Pi_{x+S}$ denotes the projection operator that projects each iterate back onto the constraint set $S = \{\delta : \|\delta\|_{\infty} \le \epsilon\}$, thereby clipping $\delta$ to the interval $[-\epsilon, \epsilon]$.
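To make the update in Eq. (3) concrete, the following is a minimal, self-contained PyTorch sketch of an ℓ∞-bounded PGD attack; the hyperparameter values are illustrative rather than the exact configuration used in this paper.

```python
# Illustrative l-infinity PGD attack implementing the update in Eq. (3);
# eps, alpha, and steps are placeholder values, not the paper's settings.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.033, alpha=0.008, steps=10):
    """Return adversarial examples x + delta with ||delta||_inf <= eps."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()        # gradient-sign ascent step
            delta.clamp_(-eps, eps)                   # project back onto the eps-ball
            delta.data = (x + delta).clamp(0, 1) - x  # keep pixels inside [0, 1]
        delta.grad.zero_()
    return (x + delta).detach()
```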
For this study's purposes, the Auto-PGD attack first introduced by Croce and Hein, as implemented in the Adversarial Robustness Toolbox (ART) (M. I. Nicolae, and M. Sinn, 2018), (F. Croce, and M. Hein, 2022), is chosen. Its key improvement over the base PGD attack is the ability to dynamically adapt its step size to the progress of the optimization, allowing it to use larger step sizes to find good starting points across the whole attack space and smaller step sizes for a more aggressive search of local maxima (F. Croce, and M. Hein, 2022). It achieves this by setting a number of checkpoints at which it decides whether the step size should be halved; whenever the step size is halved, the search restarts from the best parameters found so far. An iteration count of 500 is used when generating the adversarial dataset.
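As a usage illustration, the attack can be driven through ART roughly as follows. This is a hedged sketch: the argument names follow recent ART releases and may differ slightly between versions, and the untrained ResNet-18 merely stands in for a trained target classifier.

```python
# Sketch of generating an Auto-PGD adversarial copy of a batch with ART.
import numpy as np
import torch.nn as nn
from torchvision.models import resnet18
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import AutoProjectedGradientDescent

model = resnet18(num_classes=2)  # stand-in for a trained AIGC detector
classifier = PyTorchClassifier(
    model=model,
    loss=nn.CrossEntropyLoss(),
    input_shape=(3, 32, 32),
    nb_classes=2,
    clip_values=(0.0, 1.0),
)

attack = AutoProjectedGradientDescent(
    estimator=classifier,
    norm=np.inf,
    eps=0.033,      # perturbation budget quoted in the evaluation tables
    max_iter=500,   # iteration count stated above
    batch_size=128,
)

x_clean = np.random.rand(8, 3, 32, 32).astype(np.float32)  # dummy stand-in batch
x_adv = attack.generate(x=x_clean)
```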
2.3 Adversarial Spatial
Transformations
Apart from gradient-based methods, Engstrom et al. proposed an alternative view of adversarial perturbations; specifically, they questioned the concept of a "perturbation budget" defined solely in terms of $\ell_{p}$ norms as the metric of image similarity (L. Engstrom et al., 2019). They argue that human perception often judges images with large $\ell_{p}$-norm differences as visually similar, in particular images that have undergone small rotation or translation operations. The optimization view of this spatial-transformation-based adversarial perturbation is given by

$$\max_{\delta u, \, \delta v, \, \theta} \; \mathcal{L}\big(f(T_{\delta u, \delta v, \theta}(x)),\, y\big) \qquad (4)$$

where each pixel at position $(u, v)$ in the given image undergoes the spatial operation $T$, a rotation by angle $\theta$ followed by a translation by $(\delta u, \delta v)$:

$$\begin{bmatrix} u' \\ v' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} + \begin{bmatrix} \delta u \\ \delta v \end{bmatrix} \qquad (5)$$
The authors proposed several methods of solving this maximization problem: (1) first-order ascent along the gradient of the loss function from a random starting choice of parameters, (2) a grid search over all possible combinations in the attack parameter space, and (3) generating k different random choices of attack parameters and searching among these. Engstrom et al. concluded that the third method, dubbed the worst-of-k method, achieves a balance between computational cost and loss maximization, while having the advantage of not requiring full knowledge of the target model's loss function, unlike gradient-based attacks (L. Engstrom et al., 2019).
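A minimal sketch of the worst-of-k procedure using standard torchvision affine transforms is given below; the `worst_of_k` helper and its default parameter values are illustrative (this paper uses 15° and 3 pixels at evaluation time and 30° and 5 pixels during training, as described later).

```python
# Worst-of-k spatial attack: draw k random rotation/translation parameters,
# apply them, and keep (per image) the transform that maximises the loss.
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def worst_of_k(model, x, y, k=10, max_rot=15.0, max_trans=3):
    """x: image batch (N, C, H, W); returns the worst-case transformed batch."""
    best_x = x
    best_loss = torch.full((x.size(0),), -float("inf"), device=x.device)
    for _ in range(k):
        angle = float(torch.empty(1).uniform_(-max_rot, max_rot))
        dx = int(torch.randint(-max_trans, max_trans + 1, (1,)))
        dy = int(torch.randint(-max_trans, max_trans + 1, (1,)))
        x_t = TF.affine(x, angle=angle, translate=[dx, dy], scale=1.0, shear=0.0)
        with torch.no_grad():
            loss = F.cross_entropy(model(x_t), y, reduction="none")
        keep = (loss > best_loss).view(-1, 1, 1, 1)   # per-image selection mask
        best_x = torch.where(keep, x_t, best_x)
        best_loss = torch.maximum(loss, best_loss)
    return best_x
```

For simplicity this sketch samples one transform per round for the whole batch while still selecting the worst case per image; sampling per image would match the attack more closely at a higher cost.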
2.4 Target Models of Robustness
Evaluation
Four models that can be considered state-of-the-art in AIGC detection are evaluated: the ResNet-18 residual image classification network, the customized lightweight CNN architecture by Bird & Lotfi, the cross-attention enhanced dual-stream network proposed by Xi et al., and an ensemble-based computer-generated (CG) image detection network developed by Quan et al. that employs a modified FGSM adversarial training method similar to the ones described here (Z. Sha et al., 2022), (J. Bird Jordan, and L. Ahmad, 2023), (Z. Xi et al., 2023), (W. Quan et al., 2020). All of these models use cross-entropy loss as their loss function.
2.5 Adversarial Training
This section introduces the author's approach to training an adversarially robust AI-generated image detection model. The ResNet-18 architecture is used as the base classifier for this task (Z. Sha et al., 2022). The core of the training stage is the Fast Adversarial Training technique introduced by Wong, Rice, and Kolter (E. Wong et al., 2020). Theoretically, adversarial training is a minimax, or saddle point, problem:

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\|\delta\|_{\infty} \le \epsilon} \mathcal{L}\big(f_{\theta}(x + \delta),\, y\big) \Big] \qquad (6)$$
The training technique of Wong, Rice, and Kolter is based on an FGSM adversary (E. Wong et al., 2020). While basic FGSM adversarial training had previously been found not to be empirically robust against PGD attacks, Fast Adversarial Training uses random, non-zero initialization of the FGSM perturbations to achieve robustness on par with PGD adversarial training while being computationally much less costly, since it avoids the iterative inner loop (E. Wong et al., 2020).
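A minimal sketch of one such training step, assuming an ℓ∞ budget and illustrative hyperparameters (Wong, Rice, and Kolter suggest a step size somewhat larger than the budget), could look as follows; in this paper's full procedure the inputs are additionally subjected to the worst-of-k spatial transformation and the stochastic augmentation described below, which the sketch omits.

```python
# One Fast Adversarial Training step: FGSM with a random non-zero initialisation.
# eps and alpha are illustrative placeholders, not the paper's exact values.
import torch
import torch.nn.functional as F

def fast_at_step(model, optimizer, x, y, eps=0.033, alpha=0.04):
    # Random start inside the l-infinity ball of radius eps
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    loss.backward()
    with torch.no_grad():
        delta += alpha * delta.grad.sign()        # single FGSM step
        delta.clamp_(-eps, eps)                   # stay inside the eps-ball
        delta.data = (x + delta).clamp(0, 1) - x  # keep pixels valid
    # Update the model on the resulting adversarial example
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(x + delta.detach()), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```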
Besides the gradient-based adversarial technique, the aforementioned adversarial spatial transformations are also introduced into the adversarial example generation process, since the two types of attacks have been shown to be orthogonal to each other and their effects roughly additive.
Engstrom et al. proposed using the same worst-of-k method described above to generate adversarial training samples, and showed that allowing extra degrees of translation and rotation during training helps the model generalize across different attack landscapes while hardly affecting clean accuracy (L. Engstrom et al., 2019). Hence, the choice parameter k is set to 10, and the maximum rotation and translation are set to 30° and 5 pixels, respectively.
Finally, inspired by Wang et al., all training data are augmented with a 20% probability of receiving either a Gaussian blur with σ ~ Uniform[0, 3] or JPEG compression at a quality drawn uniformly from a fixed range (R. Wang et al., 2019).
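One possible reading of this augmentation, sketched at the PIL level, is shown below: 20% of images receive one of the two corruptions, chosen at random. The JPEG quality range is a placeholder, since the exact range is not reproduced here.

```python
# Stochastic augmentation: with 20% probability, apply either a Gaussian blur
# (sigma ~ Uniform[0, 3]) or a JPEG round-trip at a randomly drawn quality.
import io
import random
from PIL import Image, ImageFilter

def augment(img: Image.Image) -> Image.Image:
    if random.random() >= 0.2:
        return img  # 80% of images pass through unchanged
    if random.random() < 0.5:
        sigma = random.uniform(0.0, 3.0)
        return img.filter(ImageFilter.GaussianBlur(radius=sigma))
    quality = random.randint(30, 95)  # assumed range, for illustration only
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```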
3 RESULTS
This section presents the results of the robustness evaluation of the aforementioned models, as well as the evaluation of the proposed model's performance.
3.1 Robustness Evaluation of Existing
Models
The adversarial datasets are generated under a white-box attack scheme, in which each target model's gradients are fed into the Auto-PGD procedure. For this purpose, the base dataset is copied for each target model, and each copy is attacked individually using the gradient information from that target.
A toggleable preprocessing script is used to perform a worst-of-k attack on input images prior to classification. The choice parameter k is fixed at 10, and the maximum rotation and translation parameters are set to 15° and 20% (3 pixels) in any given direction, respectively.
According to Engstrom et al., spatial transformations and gradient-based attacks occupy orthogonal attack spaces and reduce classification accuracy in an additive manner (L. Engstrom et al., 2019). Therefore, this paper applies both attack models individually and then in combination. The results are compared to the accuracy obtained on the clean CIFAKE dataset. The results of the evaluation are presented in Table 1.
As shown in Table 1, the classifiers universally experience significant accuracy degradation under any form of adversarial attack. The gradient-based Auto-PGD attack significantly outperforms Worst-of-10 in reducing classification accuracy, and the two degradation effects are indeed roughly additive. The networks by Xi et al. and Quan et al. show higher natural accuracy as well as higher robustness against spatial-transformation-based adversarial attacks. The ENet model in particular shows significantly higher robustness against Auto-PGD attacks than the model by Xi et al. because it incorporates gradient-based adversarial training. Nevertheless, the model by Xi et al. still displays a higher degree of resistance to PGD attacks than the "simpler" CNN models.
3.2 Robustness Evaluation of the
Proposed Model
This section applies the same methodology described above to evaluate the model trained with the forms of adversarial training proposed in this paper. Table 2 displays the accuracy of the proposed model, along with ablation experiments that apply each adversarial training method individually and that disable the Gaussian-blur and JPEG-artifact data augmentation of Wang et al. (R. Wang et al., 2019).
The proposed model shows significantly improved adversarial robustness against both PGD attacks and adversarial spatial transformations compared with any of the existing models described above. Its natural accuracy is comparable to that of the unmodified ResNet-18 model but lower than that of the cross-attention-based models described above.
The ablation study of removing either adversarial training component clearly highlights the orthogonal nature of the spatial transformation attacks and the gradient-based perturbations. The absence of either training module completely nullifies the resistance to the corresponding attack and degrades the combined adversarial robustness accordingly. However, removing the additional augmentation step did not significantly affect the adversarial robustness of the complete model and slightly increased its natural accuracy. This result goes against the common expectation that harder training data typically yields more accurate classifiers.
4 DISCUSSION
Rodriguez et al. proposed that more complex deep learning models are more susceptible to adversarial perturbation attacks. However, the results shown in Table 1 suggest otherwise. While the Model-Centric ENet showed a higher degree of robustness due to its adversarial training, even the non-adversarially trained dual-stream network by Xi et al. is surprisingly more robust against the Auto-PGD attack than either the plain ResNet-18 or Bird & Lotfi's network, which has only six convolutional layers in total (J. Bird Jordan, and L. Ahmad, 2023), (Z. Xi et al., 2023), (W. Quan et al., 2020).
Table 1: Comparative Accuracy of Evaluated Models Under Different Adversarial Attacks.

| Defense Model | Attack Type | Accuracy |
|---|---|---|
| ResNet-18 | Natural | 84.40% |
| | Auto-PGD (ε = 0.033) | 4.10% |
| | Worst-of-10 transformation | 31.10% |
| | APGD + W-10 | 2.20% |
| Bird & Lotfi (J. Bird Jordan, and L. Ahmad, 2023) | Natural | 83.30% |
| | Auto-PGD (ε = 0.033) | 3.70% |
| | Worst-of-10 transformation | 28.20% |
| | APGD + W-10 | 2.80% |
| Model-Centric ENet (W. Quan et al., 2020) | Natural | 92.70% |
| | Auto-PGD (ε = 0.033) | 39.30% |
| | Worst-of-10 transformation | 43.10% |
| | APGD + W-10 | 31.50% |
| Xi et al. (Z. Xi et al., 2023) | Natural | 93.30% |
| | Auto-PGD (ε = 0.033) | 18.00% |
| | Worst-of-10 transformation | 45.40% |
| | APGD + W-10 | 13.10% |
Table 2: Proposed Model's Accuracy Under Different Adversarial Attacks.

| Training Mode | Attack Type | Accuracy |
|---|---|---|
| Fast Adversarial Training + W-10 | Natural | 82.40% |
| | Auto-PGD (ε = 0.033) | 56.10% |
| | Worst-of-10 transformation | 79.80% |
| | APGD + W-10 | 55.20% |
| Worst-of-10 augmentation only | Natural | 86.00% |
| | Auto-PGD (ε = 0.033) | 5.90% |
| | Worst-of-10 transformation | 85.50% |
| | APGD + W-10 | 7.10% |
| Fast Adversarial Training | Natural | 82.90% |
| | Auto-PGD (ε = 0.033) | 58.50% |
| | Worst-of-10 transformation | 25.10% |
| | APGD + W-10 | 41.20% |
| Fast Adversarial Training + W-10 (No Aug) | Natural | 82.80% |
| | Auto-PGD (ε = 0.033) | 55.50% |
| | Worst-of-10 transformation | 81.00% |
| | APGD + W-10 | 53.90% |
This discrepancy might have stemmed from the difference in the nature of the tasks. The study by Rodriguez et al. focused on medical image detection, where the relevant features are more concentrated and more closely aligned with human perception. In that setting, deeper neural networks might create unnecessarily complex decision boundaries that are more sensitive to adversarial perturbations; in other words, they come close to overfitting. However, unlike
traditional image forgery techniques or shape
classification tasks, artifacts of AI-generated images
are not limited to high-frequency areas or primary
features, and there is in fact evidence of major
differences in the overall statistical distribution of the
image (J. Bird Jordan, and L. Ahmad, 2023), (Z. Xi et
al., 2023). These two factors might mean that AIGC
detection tasks necessitate deeper networks for better
extraction of latent features since it is harder to
determine whether a given feature is robust or strongly
relevant to the prediction outcome. Consequently,
higher complexity models learned for AIGC detection
might be less susceptible to adversarial perturbations
of small magnitude/budget. However, the accuracy
drop caused by adversarial examples is still severe on
the more complex models and warrants actual robust
training techniques.
Time and computational constraints are the main limitations of this study. The work was performed by an individual researcher using Google Colab instances with Tesla T4 GPUs, which necessitated choices such as using 32 × 32 color images from CIFAR-10 and the ResNet-18 architecture. These choices keep training tractable but might cause the resulting model to have difficulty generalizing across different AIGC models. In particular, 32 × 32 pixels is a significantly lower resolution than that of the majority of natural and T2I AI-generated images currently available on the internet. This may explain the lower prediction accuracy, as there is less room for AI-generation artifacts, such as the ones hypothesized by Sha et al., to manifest (Z. Sha et al., 2022). However, increasing the resolution of the training samples or the model depth would multiplicatively increase all training costs, including T2I image generation, adversarial attacks, and general model training. With fewer resource constraints, investigating the interplay of adversarial training, cross-attention-based ensemble models, and higher-resolution samples would be a promising future direction, as the latter two factors have been shown to improve the natural accuracy of models (J. Bird Jordan, and L. Ahmad, 2023), (W. Quan et al., 2020).
5 CONCLUSION
This study focuses on the adversarial robustness of models that detect AI-generated images. It aims to (1) evaluate the adversarial robustness of existing models and (2) construct a model that achieves a higher degree of robustness against adversarial attacks, whether gradient-based or spatial-transformation-based. For purpose (1), several state-of-the-art AIGC detection models are evaluated against both PGD attacks and adversarial translations and rotations. Both attacks prove highly effective at reducing the classification accuracy of all models. For purpose (2), adversarial training and data augmentation are combined with a convolutional image classifier, yielding a model with an improved degree of robustness against both kinds of adversarial attacks while preserving the accuracy of the base classifier.
In conclusion, this study demonstrates the susceptibility of CNN-based AIGC detection models to adversarial attacks and the possibility of enhancing these models' robustness with adversarial training. As AIGC technology continues to improve and proliferate at an unprecedented pace, AI-based classification technology might be the best tool for combating its abuse. Based on this paper's results, future AIGC detection models should also take adversarial robustness into consideration, especially when it comes to distinguishing what is real from what is fake.
REFERENCES
Z. Sha, Z. Li, N. Yu, and Y. Zhang, De-fake: Detection and attribution of fake images generated by text-to-image diffusion models, arXiv preprint arXiv:2210.06998, 2022.
J. Bird Jordan, and L. Ahmad, CIFAKE: Image Classifica-
tion and Explainable Identification of AI-Generated
Synthetic Images, arXiv preprint arXiv: 2303.14126,
2023.
Z. Xi, W. Huang, K. Wei, W. Luo, and P. Zheng, AI-Generated Image Detection using a Cross-Attention Enhanced Dual-Stream Network, arXiv preprint arXiv:2306.07005, 2023.
A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A.
Vladu, Towards deep learning models resistant to adver-
sarial attacks, arXiv preprint arXiv:1706.06083, 2017.
L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry,
Exploring the landscape of spatial robustness, In Inter-
national conference on machine learning, PMLR, 2019,
pp. 1802-1811.
M. I. Nicolae, and M. Sinn, Adversarial Robustness Toolbox v1.2.0, arXiv preprint arXiv:1807.01069, 2018.
I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and
harnessing adversarial examples, arXiv preprint
arXiv:1412.6572, 2014.
F. Croce, and M. Hein, Reliable evaluation of adversarial
robustness with an ensemble of diverse parameter-free
attacks, In International conference on machine learning,
2022, pp. 2206-2216.
W. Quan, K. Wang, D. M. Yan, X. Zhang, and D. Pellerin, Learn with diversity and from harder samples: Improving the generalization of CNN-based detection of computer-generated images, Forensic Science International: Digital Investigation, vol. 35, p. 301023, 2020.
E. Wong, L. Rice, and J. Z. Kolter, Fast is better than free:
Revisiting adversarial training, arXiv preprint
arXiv:2001.03994, 2020.
R. Wang, L. Ma, F. Juefei-Xu, X. Xie, J. Wang, and Y. Liu,
Fakespotter: A simple baseline for spotting AI-synthe-
sized fake faces, arXiv preprint arXiv:1909.06122, 2019.