Using Machine Learning to Distinguish Human-Written from
Machine-Generated Creative Fiction
Andrea Cristina McGlinchey¹ and Peter J. Barclay² (https://orcid.org/0009-0002-7369-232X)
¹Lumerate, Canada
²School of Computing, Engineering & the Built Environment, Edinburgh Napier University, Scotland
Keywords:
Large Language Models, Generative AI, Machine Learning, Classifier Systems, Fake Text Detection.
Abstract:
Following the universal availability of generative AI systems with the release of ChatGPT, automatic detection
of deceptive text created by Large Language Models has focused on domains such as academic plagiarism and
“fake news”. However, generative AI also poses a threat to the livelihood of creative writers, and perhaps
to literary culture in general, through reduction in quality of published material. Training a Large Language
Model on writers’ output to generate “sham books” in a particular style seems to constitute a new form of
plagiarism. This problem has been little researched. In this study, we trained Machine Learning classifier
models to distinguish short samples of human-written from machine-generated creative fiction, focusing on
classic detective novels. Our results show that a Naïve Bayes and a Multi-Layer Perceptron classifier achieved
a high degree of success (accuracy > 95%), significantly outperforming human judges (accuracy < 55%).
This approach worked well with short text samples (around 100 words), which previous research has shown
to be difficult to classify. We have deployed an online proof-of-concept classifier tool, AI Detective, as a first
step towards developing lightweight and reliable applications for use by editors and publishers, with the aim
of protecting the economic and cultural contribution of human authors.
1 INTRODUCTION
Generative AI has made remarkable advances in re-
cent years, and is now widely available with the re-
lease of ChatGPT and similar systems. Along with
many beneficial uses, there are diverse concerns for
misuse, including generation of incorrect or harmful
advice (Oviedo-Trespalacios et al., 2023), propaga-
tion of biases (Feng et al., 2023), creation of deepfake
video, fake news and fake product reviews (Botha and
Pieterse, 2020), and various forms of plagiarism, es-
pecially in scientific and other academic fields (Odri
and Yoon, 2023). Previous research has investigated
techniques for identifying artificially generated text in
these domains, with the aim of mitigating the societal
harm from such misuse (see Section 2.3).
Generative AI also presents a threat to the liveli-
hood of writers and other creative artists, and may de-
value their work. Models are often trained on writers’
outputs, without their permission, and then the mod-
els can be used to generate similar content.
We might characterise this problem as "AI-mediated plagiarism": rather than taking or modi-
fying authors’ work directly, a bad actor can create
content using a generative AI trained on the authors’
work. Increasing awareness of the issue is signalled
by developments such as the New York Times’ an-
nouncement in 2023 that it was suing OpenAI and
Microsoft for copyright infringement (Grynbaum and
Mac, 2023).
We note a gap in the literature around develop-
ing detection tools in the area of creative fiction, and
broach this problem by investigating whether Ma-
chine Learning (ML) models can reliably distinguish
between short text samples from human-written nov-
els and similar text automatically generated.
Our early results show a good level of success,
working with text excerpts from classic detective nov-
els, and suggest that relatively simple classifiers can
outperform humans in identifying automatically gen-
erated creative prose. Moreover, the approach works
well with short text samples, which previous studies
found difficult to classify (see Section 7.2).
2 BACKGROUND
2.1 Advances in Text Generation
Recent increases in data availability and computing
power have facilitated approaches to automatic text
generation based on neural network models and deep
learning (Goyal et al., 2023).
A neural network is trained by optimising the
weights and biases based on the observed vs. desired
outputs (Bas et al., 2022). As networks are not re-
stricted to pre-existing patterns, the text generated by
these models can be more “creative” according to the
underlying semantic relationships (Pandey and Roy,
2023).
The introduction of Transformer models in 2017
represented another major advance (Vaswani, 2017).
The architecture of a Transformer consists of an en-
coder and a decoder. The encoder block contains a
multi-head self-attention layer; the decoder block also
has a cross-attention layer, enabling it to use the out-
put of the encoder as context for text generation (Han
et al., 2021). The attention component of Transform-
ers underlies their success in text generation, as it en-
ables a language model to decipher the correct mean-
ing of a word using its context. For example, when
“it” is used in a sentence, the model can better in-
terpret what is being referred to. These types of text
generation models learn from huge amounts of data,
which enables them to generate high quality output.
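For reference, the scaled dot-product attention underlying this mechanism (Vaswani, 2017) can be written as

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \]

where Q, K and V are the query, key and value matrices and d_k is the key dimension.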
Increases in scale led to Large Language Models
(LLMs), a notable early example being BERT from
Google, which used the Transformer architecture to
achieve dramatic improvements over earlier models
(Devlin et al., 2019).
OpenAI’s GPT-1 was developed in 2018, and ver-
sion 3.5 was released to the public in 2022 via the
ChatGPT interface, making this technology univer-
sally available. GPT-3 showed its ability to create text
that was seemingly indistinguishable from human-
written text (Pandey and Roy, 2023). The initial ver-
sion of GPT used 110 million learning parameters,
and this number greatly increased with each version.
GPT-2 used 1.5 billion and GPT-3 used 175 billion
(Floridi and Chiriatti, 2020). The number of parame-
ters used for GPT-4 has been speculated to be approx-
imately 100 trillion. Now ChatGPT has the ability to
write human-like essays, news articles, and academic
papers, as well as to complete text summarisation in
multiple languages (Zaitsu and Jin, 2023).
Many other companies have also launched LLMs
with similar capabilities, including GitHub’s and Mi-
crosoft’s Copilot, Meta’s LLama, Anthropic’s Claude,
and Google’s Gemini.
2.2 Text Generation: Concerns for
Misuse
ChatGPT is freely available and widely adopted, rais-
ing the possibility of harm from inaccurate responses,
or deliberate misuse by bad actors generating mis-
leading texts.
One concern regarding text generated by LLMs
is quality. As responses are based on statistical pat-
terns and correlations found in large datasets, they
can at times be irrelevant, nonsensical or offensive
(Wach et al., 2023); different models can vary re-
garding what is considered inaccurate or offensive
(Feng et al., 2023). As LLMs are pre-trained on
large datasets which include opinions and perspec-
tives, there is a risk of the introduction of biases for
downstream models (Barclay and Sami, 2024). More
ominously, ChatGPT can write fake news at scale,
a task which was previously labour-intensive. This
makes it easy to create media supporting or discredit-
ing certain views, political regimes, products or com-
panies (Koplin, 2023).
As awareness of these concerns increased, re-
search has focused on ways to distinguish human-
written from artificially generated text.
There has also been concern in creative industries
that generative AI could be used to replicate artists’
and writers’ work, and possibly replace them. In
2023, members of the Writers Guild of America went
on strike for 148 days seeking an agreement on pro-
tections regarding the use of AI in the television, film,
and online media industries (Salamon, 2024).
2.3 Related Work
Numerous methods have been employed for the cre-
ation of AI-content detectors in other domains, in-
cluding zero-shot classifiers, fine-tuning Neural Lan-
guage Models (NLMs), as well as specialised classi-
fiers trained from scratch (Jawahar et al., 2020).
OpenAI, the creator of ChatGPT, launched its own
AI classifier to identify AI-generated text; however,
they removed availability in July 2023 owing to its
low accuracy (see https://platform.openai.com). A
study which reviewed five AI-content detection tools
(OpenAI, Writer, Copyleaks, GPTZero, and Cross-
Plag) observed high variability across the tools, and
a reduced ability to detect content from a more re-
cent version of GPT. Comparing GPT-4 with GPT-
3.5 content, three of the five tools could find 100% of
GPT-3.5 content, but none achieved this level of de-
tection for GPT-4 – the highest result was 80% by the
OpenAI Classifier; then one detector managed 40%
and the rest only 20%. This suggests that AI detection
tools will have to evolve in response to the increasing
sophistication of LLMs (Elkhatat et al., 2023).
The majority of zero-shot detectors evaluate the
average per-token log probability of the generated
text. A commercially available tool which uses zero-
shot classification, GPTZero, has been found, in a
study using medical texts, to have an accuracy of 80%
in identifying AI-generated texts (Habibzadeh, 2023).
Another example of a zero-shot detector is
“DetectGPT”, which uses minor rewrites of text and
then compares the log probability of the original text with that of the rewrites. The resulting "Perturbation Discrepancy" tends towards zero for human-written text, whereas AI-generated text is expected to give relatively larger values (Mitchell et al., 2023).
DetectGPT was found to perform better than other
existing zero-shot methods at detecting AI-generated
fake news articles (Mitchell et al., 2023). However,
a more recent study reports having outperformed De-
tectGPT by leveraging the log rank information (Su
et al., 2023).
Detectors using RoBERTa, which is based on the
pre-trained model BERT, have been referred to as be-
ing “state-of-the-art” for AI text detection (Crothers
et al., 2023). The success of this approach is re-
ported in a study which used a supervised Multi-
Layer Perceptron (MLP) algorithm to train RoBERTa,
achieving an accuracy of over 97% on the test dataset.
The dataset was created using URLs shared on Red-
dit, which were passed through GPT-3.5 Turbo for
rephrasing. This study also achieved similar results
using the Text-to-Text Transfer Transformer model
(T5) as a starting point (Chen et al., 2023). Another
study based on an Amazon reviews dataset also found
the RoBERTa model gave a 97% accuracy, the highest
result among those tested (Puttarattanamanee et al.,
2023).
An alternative approach for detecting whether text
has been created by a human or an AI is by train-
ing a purpose-built Machine Learning model. One
study looked at medical abstracts and radiology re-
ports using several different ML techniques, includ-
ing text perplexity, and singular and multiple decision
trees. Perplexity is defined as the exponentiation of
the entropy of the text, giving an intuitive measure of
the uncertainty in word choice. This study also used
one LLM, a pre-trained BERT model. Their results
showed that ChatGPT created text which had a rela-
tively lower perplexity (was more predictable) com-
pared with human-written text. Nonetheless, perplex-
ity gave the worst results compared with the other
methods in this study, (with similar poor results for
the singular decision tree). The multiple decision tree
achieved a percentage F1 score of nearly 90% for
both datasets. The pre-trained BERT model still out-
performed all other approaches with F1 scores above
95% for both datasets (Liao et al., 2023).
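For reference, perplexity as defined above (the exponentiation of the per-token entropy) can be written, for a text of N tokens w_1, ..., w_N under a language model p, as

\[ \mathrm{PPL}(w_1, \ldots, w_N) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\right), \]

so lower perplexity corresponds to more predictable text, consistent with the observation above that ChatGPT output is more predictable than human writing.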
Another study, by Islam et al., tested eleven differ-
ent Machine Learning models, with a multiple tree
model performing best with an accuracy of 77%. The
dataset for this study used a combination of news ar-
ticles from CNN and data scraped from the question-
and-answer website Quora (Islam et al., 2023).
Support Vector Machine (SVM) models have
demonstrated success in distinguishing between
human-written and AI-generated text. A study which
looked at identifying fake news on social media
achieved 98% accuracy, the highest result for all mod-
els tested within the experiments, using SVM algo-
rithms (Sudhakar and Kaliyamurthie, 2024). An-
other study, investigating SVM models to detect fake
news, also achieved an accuracy of approximately
98% when the title and first 1000 characters of the
article were used in testing (Altman et al., 2021).
An experiment comparing logistic regression with
a Naïve Bayes algorithm, where the dataset fo-
cused on identifying fake news, found the logistic
regression algorithm performed better with an accu-
racy of 98.7% (Sudhakar and Kaliyamurthie, 2022).
The use of statistical-based techniques has also been
shown as a promising approach, such as “GPT-Who”,
which uses a logistic regression classifier to map
extracted Uniform Information Density (UID) fea-
tures. The hypothesis of UID is that humans prefer
evenly spreading information without sudden peaks
and troughs. This detector performed better than
other statistical based detectors, and at the same level
as fine-tuned transformer-based methods (Venkatra-
man et al., 2023).
2.4 Detection of Automatically
Generated Creative Fiction
Prior studies addressed various domains includ-
ing: medical-related text (Hamed and Wu, 2023;
Habibzadeh, 2023), student submissions (Orenstrakh
et al., 2023; Walters, 2023; Elkhatat et al., 2023),
and news articles (Islam et al., 2023; Mitchell et al.,
2023), as well as popular websites with user contribu-
tions (Islam et al., 2023; Chen et al., 2023).
Despite concerns about the misuse of AI in the creative community, we note a considerable lack of research on detecting artificially generated creative fiction. (We use the term "creative fiction" as we do not distinguish "genre" fiction from "literary" fiction; here we do not consider other forms of creative writing such as lyrics, screenplays, and poetry.) A number of studies have investigated use of
generative AI to assist writers (Ippolito et al., 2022;
Landa-Blanco et al., 2023; Guo et al., 2024; Gero,
2023; Stojanovic et al., 2023). While this may raise
literary questions regarding quality of writing, and
philosophical considerations of the nature of human
creativity, it does not directly threaten the livelihood
of creative writers. Wholesale generation of entire
novels is another matter. In 2024, the Authors’ Guild
noted the prevalence of “sham books” for sale on
Amazon (The Authors Guild, 2024), and some high-
profile stars signed an open letter requesting technol-
ogy companies to pledge protection for human artists’
work (Robins-Early, 2024).
We do not expect that detectors trained on other
domains will work well with creative fiction; indeed,
one study demonstrated a 20% reduction in effective-
ness across different sources within a single domain
(fake news) (Janicka et al., 2019).
The only study found in the literature address-
ing the detection of artificially generated creative fic-
tion was the Ghostbuster study (which also included
two other domains) (Verma et al., 2023). However,
this study did not use published novels. The creative
writing dataset was created from writing prompts and
their associated stories from Reddit. When there was
no prompt available for a story, first ChatGPT was
given the story and asked to create a prompt, then that
prompt was used to re-create the story.
Ghostbuster is a sophisticated and effective model,
created in three steps: each document was first fed
into weaker language models to retrieve word proba-
bilities, which were then combined into a set of fea-
tures by searching over vector and scalar functions.
The resulting features were used in a linear classi-
fier. In the full Ghostbuster trial, the creative writing
dataset performed well (F1 = 98.4%), but worse than the two other, more commonly studied types of datasets, fake
news and student essays (both F1 = 99.5%). This sug-
gests that it may be more difficult to distinguish AI-
generated creative writing compared with other con-
tent types, underlining the need for further research in
this area.
Additionally, this study attempted to address brit-
tleness seen in other detectors’ inability to generalise
across different LLMs. Ghostbuster outperformed
other models tested, but still showed a 6.8% F1 de-
crease when analysing text from another LLM com-
pared to ChatGPT.
2.5 Aims of this Research
Having noted the concerns of the creative community,
and the lack of research on detecting artificially gen-
erated creative fiction, we investigate the use of ML
classifiers for this task, using only short samples of
text. We take this approach because:
- Earlier research has shown that ML models can be effective in detecting other types of AI-generated content.
- As LLMs are rapidly evolving, and often proprietary, we do not wish to depend on the technology we are trying to detect.
- A useful detector should be able to run independently with relatively low resources, so it could be deployed easily in the workflow of reviewers, editors, and publishers.
Additionally, we make a first comparison of the
effectiveness of a ML-based classifier versus human
judgment in identifying artificially generated prose.
3 EXPERIMENTAL METHOD
Our broad approach is to chop human-written detec-
tive novels into a sequence of excerpts, then to gen-
erate similar texts using ChatGPT-3.5 Turbo, both by
rewriting existing excerpts and by using a customised
prompt only (with no excerpt provided as an exam-
ple). The generated texts undergo just enough data
preparation to ensure there are no obvious “tells” flag-
ging their provenance. We then attempt to train ML
classifier systems to distinguish the human-written
from the machine-generated texts. We also compare
the accuracy of the best classifiers with samples taken
from two unseen novels, one by a different author, for
insight into how well the models can generalise. We
further compare the classifiers’ success with that of 19
human judges who had attempted to make the same
determination for a small selection of text samples in
an online quiz.
We focus on short text samples, as these have
proved difficult to detect in prior studies, and with
a view to eventually creating a lightweight tool that
could spot-check individual sections of text during the
editing/publishing process.
3.1 Datasets and Data Preparation
For the human-written prose, we used out-of-
copyright novels by Agatha Christie from Project
Gutenberg (https://www.gutenberg.org), as these are
well-known, and the language is not too old-
fashioned. Three novels ("The Murder on the Links", "Poirot Investigates" and "The Man in the Brown Suit") were used in the base (human-written) dataset, creating 1424 excerpts; three more ("The Mysterious Affair at Styles", "The Big Four" and "The Secret Adversary") were added in the extended base dataset (2713 excerpts). Furthermore, one unseen Agatha Christie novel ("The Secret of Chimneys") and one novel by another writer, Dorothy L. Sayers ("Whose Body?"), were used to investigate the ability of the classifier to generalise to unseen but broadly similar text.
A Python script was used to create our base dataset, by chopping the novels into a sequence of excerpts of the desired length, always terminating on a full stop (terminating on other symbols such as ? and ! did not give reliable results), and removing extraneous text such as page numbers. In these experiments, all excerpts were of length "approximately 100 words", as described below. The texts were vectorised without pre-processing using Scikit-Learn's CountVectorizer.
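As an illustration, a minimal sketch of this chopping step is shown below; the 600-character target and the function name are our own assumptions rather than the exact script used in this study.

def chop_into_excerpts(novel_text, target_chars=600):
    """Split a novel into excerpts of roughly target_chars characters,
    always ending each excerpt on a full stop."""
    # Split on full stops only; other terminators (? and !) proved unreliable.
    sentences = [s.strip() + "." for s in novel_text.split(".") if s.strip()]
    excerpts, current = [], ""
    for sentence in sentences:
        current = (current + " " + sentence).strip()
        if len(current) >= target_chars:
            excerpts.append(current)
            current = ""
    return excerpts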
The AI-generated texts were produced in two
ways: (1) by asking ChatGPT to rewrite a sample
from the base dataset of novel excerpts, and (2) by
asking it to write a similar text based on a prompt,
with no sample text provided. Using a Python script,
all requests were sent to OpenAI’s API in a ran-
domised order; one text excerpt was generated or
rewritten per API request. We expect more robust re-
sults from the rewrites datasets, where we were able
to generate many more texts than in the prompt-only
datasets, but we are still able to compare the two
approaches. For detection experiments, the gener-
ated texts in each dataset were mixed with the same
number of human-written excerpts randomly selected
from a base dataset.
To enable a fair comparison of the models, 80%
of the base dataset was used as the training / test set –
this was split as 70% training and 30% test, and the
other 20% was held back as an unseen validation set.
The generated datasets used are summarised in Table
1.
3.2 OpenAI API Settings
The temperature setting and prompt wording used for
the AI-generated text were derived by extensive trial-
and-error over many iterations, arriving at a prompt
which produced text with no obvious tells that it had
been artificially generated. We noted that too long or
specific a prompt sometimes resulted in some require-
ments being ignored. Lower temperatures generated
less varied text; in particular, rewritten excerpts often
just had a few word substitutions. Too high a temper-
ature resulted in text too unlike the target material.
The final OpenAI API settings are shown below:
model = "gpt-3.5-turbo-0125"
temperature = 0.7
prompt = "You will take the role of an author of crime novels. A text excerpt will be provided, you have to review it for number of space characters and key details. Create a new text excerpt which contains the same key details but appears structurally different to the original. The new text must have approximately the same number of spaces as the original. Only return the new text passage. Do not include place holders, line breaks or any other text except the new passage. Text excerpt:"
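For illustration, a minimal sketch of how a single rewrite request might be issued with these settings, assuming the official openai Python client (the batching, randomisation and error handling of the actual script are omitted):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rewrite_excerpt(excerpt, prompt):
    # One excerpt is generated or rewritten per API request, as described above.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        temperature=0.7,
        messages=[{"role": "user", "content": prompt + " " + excerpt}],
    )
    return response.choices[0].message.content

Here, prompt would hold the prompt text shown above.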
3.3 Text Excerpt Length Distribution
The lengths of excerpts in the base datasets vary
somewhat due to the requirement of separating
chunks at a full stop. The OpenAI API did not prove
accurate in generating texts of a required length, al-
though there was some modest improvement by re-
questing a particular number of spaces, rather than
number of words. Moreover, the API showed a pref-
erence towards generating shorter texts rather than
longer. Figure 1 shows how the distribution of char-
acter length is significantly different between the base
and generated datasets when the expectation was to
target approximately 600 characters (100 words).
With a difference in mean length of 42 characters,
and a large jump in standard deviation (68 vs. 120),
the length distribution raised a problem as it could
bias the classification. This issue was addressed in
two steps. For the human-written text preparation,
the text was cut (at the nearest full stop) to a ran-
domly selected length within a defined range; and for
all datasets, outliers were removed to reduce the varia-
tions in length. The resultant “balanced” datasets had
excerpts of the same mean length (563 characters) and
much closer standard deviations (61 vs. 81). All ex-
periments reported here were run on these balanced
datasets.
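A minimal sketch of the outlier-removal step, assuming the excerpts are held in a pandas DataFrame with a "text" column; the length bounds shown are illustrative choices of our own, not the exact values used:

import pandas as pd

def remove_length_outliers(df, low=400, high=750):
    # Keep only excerpts whose character length falls inside the accepted range.
    lengths = df["text"].str.len()
    return df[(lengths >= low) & (lengths <= high)].reset_index(drop=True)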
3.4 Models Tested
We tested six ML models from Scikit Learn
(https://scikit-learn.org), based on their earlier use in
detecting generated text in other domains. These
were: Support Vector Machine, Logistic Regression,
Random Forest, MLP Classifier, Decision Tree and Naïve Bayes. The inclusion of a single decision tree is to allow comparison with the Random Forest (multiple decision trees), and the decision to include Naïve Bayes was based on its known ability in text classification tasks, where it is computationally efficient and often exhibits good predictive performance (Chen et al., 2009).
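A minimal sketch of the comparison pipeline under these choices, assuming the excerpts and labels (0 = human, 1 = AI) have already been loaded into the lists named below; all estimators use Scikit-Learn defaults at this stage:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score

# train_texts/test_texts and train_labels/test_labels are assumed to be prepared as above.
vectorizer = CountVectorizer()  # vectorisation without further pre-processing
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
    "MLP Classifier": MLPClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": MultinomialNB(),
}
for name, model in models.items():
    model.fit(X_train, train_labels)
    predictions = model.predict(X_test)
    print(name, accuracy_score(test_labels, predictions), f1_score(test_labels, predictions))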
Table 1: Description of generated data sets.

AC3Train (Rewrites, 1595 examples): Training data separated out from a base data set created using 3 Agatha Christie books, where the human text was approx. 100-word excerpts split out from the novels and the AI text was created using rewrites of the human text via the OpenAI API.

AC3Test (Rewrites, 683 examples): Test data separated out from a base data set created using 3 Agatha Christie books, where the human text was approx. 100-word excerpts split out from the novels and the AI text was created using rewrites of the human text via the OpenAI API.

AC3Unseen (Rewrites, 570 examples): Unseen data separated out from a base data set created using 3 Agatha Christie books, where the human text was approx. 100-word excerpts split out from the novels and the AI text was created using rewrites of the human text via the OpenAI API.

AC6Train (Rewrites, 3038 examples): Training data separated out from a base data set created using 6 Agatha Christie books, where the human text was approx. 100-word excerpts split out from the novels and the AI text was created using rewrites of the human text via the OpenAI API.

AC6Test (Rewrites, 1302 examples): Test data separated out from a base data set created using 6 Agatha Christie books, where the human text was approx. 100-word excerpts split out from the novels and the AI text was created using rewrites of the human text via the OpenAI API.

AC6Unseen (Rewrites, 1086 examples): Unseen data separated out from a base data set created using 6 Agatha Christie books, where the human text was approx. 100-word excerpts split out from the novels and the AI text was created using rewrites of the human text via the OpenAI API.

ChatGPTAC1 (Prompt-only, no text provided; 20 examples): ChatGPT was used with the prompt "please write a story about a detective in the style of agatha christie"; after each response from ChatGPT, another prompt was sent to ask for another story until there was enough text to create 10 text excerpts of approx. 100 words. For this dataset, 10 text excerpts from the AC3 data set were used as human text samples.

ChatGPTGC1 (Prompt-only, no text provided; 24 examples): ChatGPT was used to request a generic crime novel from the same time period (1920s) to be written. This time it was written in chapters, and prompts were used to encourage ChatGPT to keep writing until enough text was available. This dataset has 12 approx. 100-word text excerpts from AI generation, and human text from another dataset (DLS1) was used to provide 12 human-written samples.

DAC1 (Rewrites, 200 examples): Created using a different Agatha Christie novel, "The Secret of Chimneys". This dataset was created in the same way as the other datasets for this project: the text was split into approx. 100-word chunks to the closest full sentence and the text excerpts were sent through the OpenAI API to be rewritten. 100 random samples were extracted from the original human set and from the results from the API.

DLS1 (Rewrites, 200 examples): "Whose Body?", a Lord Peter Wimsey novel by Dorothy L. Sayers (1923). This is a novel from a crime author which was written around the same time as the Agatha Christie novels. Again, 100 samples from the original human text were used along with 100 results from the OpenAI API, completely randomised.
Figure 1: Distribution of character lengths for Human-written and AI-generated text.
3.5 Experimental Runs
An initial experiment on all six models with the
three-novel datasets identified the MLP and the Naïve
Bayes models as the best performing. These two
models were then optimised and carried forward for
testing against all the AI-generated datasets, with a
follow-on run for these two models where the training
data was increased to six novels. Lastly, to investi-
gate generalisability of the classifiers, we tested them
against excerpts from two previously unseen novels,
one by Agatha Christie, and one of similar style from
a different author, Dorothy L Sayers.
4 EXPERIMENTAL RESULTS
4.1 Model Comparison
Table 2 summarises results comparing the models
trained on the AC3Train dataset and tested using the
AC3Test and AC3Unseen datasets.
The MLP Classifier and Naïve Bayes models per-
formed best overall, while the SVM and Logistic Re-
gression models also gave good results. As expected,
the decision tree model performed the worst with all
results lower than 80%; the random forest model per-
formed better, but still significantly below the other
models.
Based on the results for the AC3Test dataset, it
would be expected that the MLP Classifier would per-
form better than the Naïve Bayes for the AC3Unseen
dataset. The results, however, were the same for both
models. On inspection, it was found that generally the
same samples in the AC3Unseen set were mislabelled
by both models.
The MLP Classifier and Naïve Bayes models were
tuned and carried forward for further experimenta-
tion. Of the many tunable hyper-parameters of the
MLP model, the only change that improved accuracy
was setting the hidden layer sizes to one layer with
155 units: this enabled the accuracy to exceed the
95% mark, and the F1 score also improved slightly. For the Naïve Bayes model, the highest accuracy
for the AC3Test data set was achieved by the multino-
mial algorithm using an alpha value of 0.7.
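For reference, the tuned configurations described above correspond to the following Scikit-Learn estimators (a sketch; parameters not mentioned in the text are left at their defaults):

from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

tuned_nb = MultinomialNB(alpha=0.7)                   # multinomial Naive Bayes with alpha = 0.7
tuned_mlp = MLPClassifier(hidden_layer_sizes=(155,))  # a single hidden layer of 155 units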
The results of the tuned models are shown in Table
3. Both classifiers show similar results on the rewrites
datasets AC3Test and AC3Unseen, with more varia-
tion in the results for the prompt-only datasets Chat-
GPTGC1 and ChatGPTAC1. We note that for rewrit-
ten texts, precision exceeds recall, while the inverse
obtains for the prompt-only datasets. This im-
plies some qualitative difference resulting from the
two methods used for generating the texts with GPT.
4.2 Six-Novel Data Experiments
The results of the experiments using increased train-
ing data with the optimised models are shown in Ta-
ble 4. For the Naïve Bayes model, all but one test set
saw an increase in scores with the six-novel dataset
AC6Train, and for the MLP Classifier model, an im-
provement in several tests was also observed.
The larger training dataset significantly improved
the discrimination of both models. The average
accuracy across all tests for the MLP Classifier is
now 96.09% (previously 92.76%) compared with the
Naïve Bayes which now has an average accuracy of
96.05% (previously 94.34%). Looking at the per-
centage F1 scores, the MLP Classifier achieved an
average of 96.02% (previously 92.74%) compared
with the Naïve Bayes which now has an average F1
score of 95.94% (previously 94.11%). Both classi-
fiers achieved 100% accuracy for the ChatGPTAC1
dataset, rising from 85% and 95% previously. Al-
though this is still a small test dataset, reaching the
100% score as a result of doubling the training data
gives some confidence in the result.
However, two datasets did show minor reductions
in accuracy. The only model whose score was reduced
by more than 1% was Naïve Bayes, falling to 95.83%
on the prompted “generic crime novel” dataset Chat-
GPTGC1. This may be a result of overfitting on the
previous smaller dataset. However, the improvements across the other test sets show that the larger training dataset is clearly beneficial overall.
Table 2: Model comparison using six models and three-novel datasets.
Model AC3Test Accuracy AC3Unseen Accuracy AC3Test F1 Score AC3Unseen F1 Score
Naïve Bayes Multinomial 93.42% 92.98% 92.63% 92.39%
SVM 92.84% 92.52% 91.75% 91.77%
Logistic Regression 92.98% 92.73% 91.40% 91.54%
Decision Tree 71.93% 71.00% 76.49% 77.59%
MLP Classifier 94.74% 94.50% 92.81% 92.67%
Random Forest 88.01% 86.90% 90.00% 89.84%
Table 3: Results from the two top-performing models following optimisation and trained using data from three novels.
Dataset Model Accuracy Precision Recall F1
AC3Test MLP Classifier 95.03% 96.89% 92.86% 94.83%
AC3Test Naïve Bayes 93.86% 97.12% 90.18% 93.52%
AC3Unseen MLP Classifier 92.28% 94.14% 90.18% 92.11%
AC3Unseen Naïve Bayes 92.28% 94.80% 89.47% 92.06%
ChatGPTGC1 MLP Classifier 95.83% 92.31% 100.00% 96.00%
ChatGPTGC1 Naïve Bayes 100.00% 100.00% 100.00% 100.00%
ChatGPTAC1 MLP Classifier 85.00% 81.82% 90.00% 85.71%
ChatGPTAC1 Naïve Bayes 95.00% 90.91% 100.00% 95.24%
4.3 Generalisation Experiments
Testing against previously unseen novels gave encour-
aging results, shown in Table 5. For the DAC1 dataset
(an unseen novel by the same author), accuracy was
over 90% on all runs; the MLP Classifier performed
better than Naïve Bayes, with recall being the main is-
sue for both models. For the DLS1 dataset (an unseen
novel by a different author), the results were compa-
rable or better for both models compared with seen
novels. For the MLP Classifier, the accuracy of the
DLS1 dataset at 95.41% is extremely close to that of
the AC6Unseen set (95.67%). For the Naïve Bayes
model, the DLS1 dataset at 95.92% accuracy actually
performs better than the AC6Unseen set (95.03%).
Although this higher score on an unseen novel may
not be significant, it is clear that the models can gen-
eralise well, at least over the same genre of creative
fiction.
4.4 Model Runtimes
The time for completion for the Naïve Bayes model
was 5 seconds for the original training/test/validation
set and 8 seconds for the increased training dataset.
However, the MLP Classifier took considerably
longer with 38 seconds for the original set and 61 sec-
onds for the larger set.
As the results of the two optimised classifiers were
close in terms of accuracy and F1 scores, the much
faster run time for the Naïve Bayes model recom-
mends this as the better selection for the classification
tool described in Section 6.
4.5 ChatGPT-4o
Since our original experiments, ChatGPT-4 has been
released. We conducted a mini-experiment, using an
unchanged training set and new test sets of randomly
selected novel excerpts. There was a slight drop in
performance when using “gpt-4o-mini” (average ac-
curacy 94.25% compared with the original 95.03%).
A larger drop was observed using “GPT-4o” (average
accuracy 89.25%). We will conduct more extensive
experiments, but we note the drop in performance is
modest given the huge number of parameters in GPT
version 4.
5 DETECTION BY HUMAN
JUDGES
To determine whether the excerpts generated by
ChatGPT could easily be detected upon read-
ing, a small quiz was set up in Google Doc-
uments. This displayed 10 text excerpts that
were either taken from the human-written nov-
els or the AI-rewritten text. A total of 19 peo-
ple completed the quiz, which is available here:
https://forms.gle/JhApKWkC9CAHXRmo8. Only 10
examples were included, to encourage completion, as
the main purpose of the quiz was just to ensure no
obvious tells had been overlooked during data prepa-
ration, allowing too-easy classification of the texts. A
larger survey would be required for robust analysis;
nonetheless, we see some interesting results.
Table 4: Results from the two top-performing models trained using data created from six novels (increased training data).
Dataset Model Accuracy Precision Recall F1
AC6Test MLP Classifier 94.62% 95.97% 92.98% 94.45%
AC6Test Naïve Bayes 95.01% 98.48% 91.26% 94.74%
AC6Unseen MLP Classifier 95.67% 97.51% 93.74% 95.59%
AC6Unseen Naïve Bayes 95.03% 97.48% 92.45% 94.90%
ChatGPTGC1 MLP Classifier 95.83% 92.31% 100.00% 96.00%
ChatGPTGC1 Naïve Bayes 95.83% 92.31% 100.00% 96.00%
ChatGPTAC1 MLP Classifier 100.00% 100.00% 100.00% 100.00%
ChatGPTAC1 Naïve Bayes 100.00% 100.00% 100.00% 100.00%
Table 5: Experimental results from the two top performing models on unseen novels.
Dataset TrainingSet Model Accuracy Precision Recall F1
DAC1 AC3Train MLP Classifier 92.50% 98.85% 86.00% 91.98%
DAC1 AC3Train Naïve Bayes 90.50% 98.80% 82.00% 89.62%
DLS1 AC3Train MLP Classifier 95.92% 98.91% 92.86% 95.79%
DLS1 AC3Train Naïve Bayes 94.39% 96.77% 91.84% 94.24%
DAC1 AC6Train Naïve Bayes 94.50% 100.00% 89.00% 94.18%
DAC1 AC6Train MLP Classifier 95.00% 98.91% 91.00% 94.79%
DLS1 AC6Train Naïve Bayes 95.92% 97.87% 93.88% 95.83%
DLS1 AC6Train MLP Classifier 95.41% 97.85% 92.86% 95.29%
Figure 2: Distribution of points scored on a Human vs. AI
quiz.
The judges’ scores were normally distributed with
a median score of 4 and a mean of 4.42, as shown in
Figure 2. Thus, the humans showed little ability to
identify the AI-generated texts.
With only 19 judges, we do not have sufficient
data to determine if the results are no better than
chance, but a one-tailed t-test provides good evidence
that the respondents’ ability to distinguish between
the text samples is under 55% in accuracy (t(18) =
2.5, p-value = 0.01, one-sided 95% confidence interval upper bound: 52%), and it is certainly far below the ability
of the ML classifiers.
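A sketch of the corresponding test, assuming the 19 quiz scores (out of 10) are available as a list; SciPy's one-sample t-test with the "less" alternative matches the one-tailed setup described above:

from scipy import stats

def test_below_55_percent(scores):
    # H0: mean score >= 5.5 (i.e. 55% accuracy); H1: mean score < 5.5.
    return stats.ttest_1samp(scores, popmean=5.5, alternative="less")

# Example usage: result = test_below_55_percent(judge_scores); result.statistic, result.pvalue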
These results are in line with the literature: for ex-
ample, Clark et al. report that untrained human evaluators score no better than chance when trying to identify text from GPT-3 in other domains, with results rising
to 55% with practice (Clark et al., 2021).
6 CLASSIFIER TOOL
The Naïve Bayes model was used to create an online
classifier tool, AI Detective, for general experimenta-
tion. The tool is available at: https://tinyurl.com/ai-detective. This tool used the AC6Train data for train-
ing, and can accept unseen test data and attempt to
classify the input as human-written vs. AI-generated.
The tool can also be used to experiment with different
training and/or test sets: instructions for use are pro-
vided within the Google Colab script. All our code
and data are available for replication and experimen-
tation from the links at this URL.
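As an illustration of what the tool does with a pasted excerpt, a minimal sketch is shown below; it assumes a vectorizer and a Naïve Bayes model already fitted on the AC6Train data, as in the pipeline sketched earlier, and that class 1 denotes AI-generated text:

def classify_excerpt(excerpt, vectorizer, model):
    # Vectorise the pasted excerpt and report the more likely class with its probability.
    features = vectorizer.transform([excerpt])
    p_ai = model.predict_proba(features)[0][1]
    return ("AI-generated", p_ai) if p_ai >= 0.5 else ("Human-written", 1 - p_ai)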
7 DISCUSSION
7.1 Summary and Review of Findings
Of the six Machine Learning models tested, four gave
promising results, with the top two models optimised
and evaluated with multiple test sets. The results
from the best performing models, Naïve Bayes and
MLP Classifier, show that it is possible to detect AI-
generated creative writing with high accuracy using
only short text samples. The accuracy scores for
the Naïve Bayes model ranged from 94.5% to 100%,
with an average of 96.05%. The MLP Classifier per-
formed with similar accuracy, ranging from 94.62%
to 100% with an average of 96.09%. However, the
Naïve Bayes model was much more efficient, running
in 8 seconds compared with over 1 minute. The average percentage F1 score across all test sets for the Naïve Bayes model was 95.94%, with a range of F1
scores from 94.18% to 100%.
The larger reduction in scores for the MLP Clas-
sifier on the ChatGPTAC1 dataset was unexpected as
this model had been performing well. This suggests
that the training data may be more tightly fitted on the
MLP Classifier model, and therefore it cannot gener-
alise as well as the Naïve Bayes model.
Looking at the recall results, the Naïve Bayes
model was able to identify all instances of AI text
generated directly from prompts (i.e. not rewritten),
and the MLP Classifier only missed one sample over-
all. For both Naïve Bayes and MLP Classifier algo-
rithms, it was more common for AI-generated text to
be incorrectly predicted as human text than vice versa.
Comparing the metrics for the rewritten versus the
prompt-only test datasets, precision and recall show
opposite trends. For text generated from scratch, the
models have higher recall and are therefore identify-
ing more instances of AI text; however, in the rewrites
datasets, the precision is higher. We speculate that
the rewrites may still mirror elements of human text,
making them harder to identify.
The results from both top models show that they
can generalise well to other novels from the same time
period and genre. For the MLP Classifier, the ac-
curacy of the DLS1 dataset, based on the Dorothy
L Sayers novel, was 95.41%, very close to that of
the AC6Unseen set at 95.67%. For the Naïve Bayes
model, the DLS1 dataset at 95.92% outperforms the
AC6Unseen set at 95.03% accuracy. How well this
generalisation can extend to other styles of writing re-
mains to be investigated.
7.2 Comparison with Other Work
One study utilising a Naïve Bayes model achieved
an accuracy of 94.85% using fake political news data
(Sudhakar and Kaliyamurthie, 2022), somewhat bet-
ter than our AC3 datasets, but outperformed by our
AC6 datasets. Moreover, our results for the MLP
classifier are superior to a previous study which only
achieved 72% (Liao et al., 2023), where the dataset
was created from news and social media content. It
would be interesting to investigate whether differ-
ent classifiers perform better with particular styles of
writing.
Ghostbuster, the only other study to include
datasets derived from creative fiction, achieved an F1
score of 98.4% in this domain (Verma et al., 2023).
While this is higher than the F1 scores reported here,
Ghostbuster’s score is over texts of all lengths. Their
in-domain F1 score drops to around 85% for texts of
100 tokens, and the authors state that performance
“may be unreliable for documents with 100 to-
kens”. We note also that OpenAI recommended that
AI detectors should use a minimum of 1000 charac-
ters for reliability (see https://platform.openai.com).
Our results suggest therefore that our ML classifier
models may perform better on short text samples, at
least within one style of creative fiction.
Moreover, the Ghostbuster study did not use pub-
lished novels, and their approach was more similar
to the prompt-only datasets tested here. Our average
F1 score for the prompt-only datasets, shown in Ta-
ble 4, reached 98% and therefore is almost as high as
Ghostbuster, despite their using texts of greater aver-
age length.
7.3 Limitations of this Study
In this pilot study, the ML classifiers have been
trained on only one author, and mostly tested on the
same author, with limited testing on one other author
from the same genre and time period.
Moreover, although many LLMs are now avail-
able, only ChatGPT has been used for the AI-
generated texts.
Experiments with more data would increase the
robustness of our results, especially for the prompt-
only datasets. Furthermore, achieving an effective
prompt was harder than expected, and further prompt
engineering may be required to improve the quality of
the AI-generated texts.
Owing to the difficulties in generating texts of a
stipulated length, there is still some difference in the
variance of text length in the datasets used, and other
data-preparation artefacts may remain which could
aid the classifiers. For example, as ChatGPT was re-
quested to review the details of each text passage and
use it to create a new one, the AI-generated rewrites
may appear more “self-contained” compared with the
chunked off parts of a novel appearing in the base
datasets.
We noted also that the AI-generated rewrites
sometimes introduced named entities that were not
present in the original text. Such artefacts of the data
preparation process may have assisted the classifiers
in identifying the automatically generated texts.
At present we have limited evidence that the clas-
sifiers will generalise to other, similar styles of writ-
ing, but it is unclear how far a classifier can generalise
before retraining would be required. These issues will
be addressed in future work.
8 FUTURE WORK AND
CONCLUSION
Here we present only preliminary results. It remains
to run further tests with more systematically gen-
erated ChatGPT prompts, with larger prompt-only
datasets, with texts of different lengths, and with fur-
ther tuning of the models. We intend especially to
investigate the effect of gradually generalising our un-
seen test data sets to creative fiction increasingly dif-
ferent in style from the training data.
Both the MLP and Naïve Bayes classifiers
achieved better accuracy with the prompt-only test
data, compared with the rewritten AC6Unseen
dataset. While this may be an artefact of the differ-
ing sizes of the datasets, the implication is that it may
be easier for the classifiers to identify wholly gener-
ated rather than rewritten text, which we may expect
to mirror the human-authored excerpts more closely.
This is worthy of further investigation.
Expanded experiments with human judges are
also warranted, collecting more data to establish sig-
nificance, and with attention to what characterises ex-
cerpts that are particularly easy/hard to detect, and
to whether particular individuals (perhaps avid read-
ers) have a higher than average ability to distinguish
human-written from AI-generated text. A more devel-
oped version of the classifier tool may support train-
ing humans in this skill.
Nonetheless, our early results indicate that it is
possible for a low-resource classifier to detect artifi-
cially generated creative fiction with a high degree of
accuracy, based only on a short sample of the text.
This opens the door to the construction and deploy-
ment of easily used tools for editors and publishers,
helping to protect the economic and cultural contri-
bution of human writers in the age of generative AI.
REFERENCES
Altman, B., Kim, R., Spear, R., and Spychalski, J. (2021).
Detecting Fake News Using Support Vector Ma-
chines. https://tinyurl.com/altman-2021.
Barclay, P. and Sami, A. (2024). Investigating Mark-
ers and Drivers of Gender Bias in Machine Transla-
tions. In Proceedings of SANER24, pages 455–464,
Rovaniemi, Finland. IEEE.
Bas, A., Topal, M. O., Duman, C., and van Heerden, I.
(2022). A brief history of deep learning-based text
generation. In 2022 International Conference on
Computer and Applications (ICCA), pages 1–4.
Botha, J. and Pieterse, H. (2020). Fake news and deepfakes:
A dangerous threat for 21st century information secu-
rity. In ICCWS 2020 15th International Conference on
Cyber Warfare and Security. Academic Conferences
and publishing limited, page 57.
Chen, J., Huang, H., Tian, S., and Qu, Y. (2009). Feature
selection for text classification with Naïve Bayes. Expert Systems with Applications, 36(3):5432–5435.
Chen, Y., Kang, H., Zhai, V., Li, L., Singh, R.,
and Raj, B. (2023). GPT-sentinel: Distin-
guishing human and ChatGPT generated content.
http://arxiv.org/abs/2305.07969.
Clark, E., August, T., Serrano, S., Haduong, N., Gururan-
gan, S., and Smith, N. A. (2021). All That’s ’Human’
Is Not Gold: Evaluating Human Evaluation of Gener-
ated Text. http://arxiv.org/abs/2107.00061.
Crothers, E. N., Japkowicz, N., and Viktor, H. L. (2023).
Machine-generated text: A comprehensive survey of
threat models and detection methods. IEEE Access,
11:70977–71002.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova,
K. (2019). BERT: Pre-training of deep bidi-
rectional transformers for language understanding.
http://arxiv.org/abs/1810.04805.
Elkhatat, A. M., Elsaid, K., and Almeer, S. (2023). Evaluat-
ing the efficacy of AI content detection tools in differ-
entiating between human and AI-generated text. Inter-
national Journal for Educational Integrity, 19(1):1–
16. Number: 1 Publisher: BioMed Central.
Feng, S., Park, C. Y., Liu, Y., and Tsvetkov, Y. (2023). From
pretraining data to language models to downstream
tasks: Tracking the trails of political biases leading
to unfair NLP models. In Proceedings of the 61st
Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 11737–
11762. Association for Computational Linguistics.
Floridi, L. and Chiriatti, M. (2020). GPT-3: Its nature,
scope, limits, and consequences. Minds and Ma-
chines, 30(4):681–694.
Gero, K. I. (2023). AI and the Writer: How Language Mod-
els Support Creative Writers. Columbia University.
Goyal, R., Kumar, P., and Singh, V. P. (2023). A systematic
survey on automated text generation tools and tech-
niques: application, evaluation, and challenges. Mul-
timedia Tools and Applications, 82(28):43089–43144.
Grynbaum, M. M. and Mac, R. (2023). The Times sues OpenAI and Microsoft over A.I. use of copyrighted work. The New York Times.
Guo, A., Pataranutaporn, P., and Maes, P. (2024). Explor-
ing the Impact of AI Value Alignment in Collaborative
Ideation: Effects on Perception, Ownership, and Out-
put. http://arxiv.org/abs/2402.12814.
Habibzadeh, F. (2023). GPTZero performance in identify-
ing artificial intelligence-generated medical texts: A
preliminary study. Journal of Korean Medical Sci-
ence, 38(38):e319.
Hamed, A. A. and Wu, X. (2023). Improving de-
tection of ChatGPT-generated fake science
using real publication text: Introducing xFake-
Bibs a supervised-learning network algorithm.
http://arxiv.org/abs/2308.11767.
Han, X., Zhang, Z., Ding, N., Gu, Y., Liu, X., Huo, Y., Qiu,
J., Yao, Y., Zhang, A., Zhang, L., Han, W., Huang,
M., Jin, Q., Lan, Y., Liu, Y., Liu, Z., Lu, Z., Qiu, X.,
Song, R., Tang, J., Wen, J.-R., Yuan, J., Zhao, W. X.,
and Zhu, J. (2021). Pre-trained models: Past, present
and future. AI Open, 2:225–250.
Ippolito, D., Yuan, A., Coenen, A., and Burnam, S.
(2022). Creative Writing with an AI-Powered Writ-
ing Assistant: Perspectives from Professional Writers.
http://arxiv.org/abs/2211.05030.
Islam, N., Sutradhar, D., Noor, H., Raya, J. T., Maisha,
M. T., and Farid, D. M. (2023). Distinguishing human
generated text from ChatGPT generated text using
machine learning. http://arxiv.org/abs/2306.01761.
Janicka, M., Pszona, M., and Wawer, A. (2019). Cross-domain failures of fake news detection. Computación y Sistemas, 23(3):1089–1097. Publisher: Instituto Politécnico Nacional, Centro de Investigación en Computación.
Jawahar, G., Abdul-Mageed, M., and Lakshmanan,
L. V. S. (2020). Automatic detection of
machine generated text: A critical survey.
http://arxiv.org/abs/2011.01314.
Koplin, J. J. (2023). Dual-use implications of AI text gen-
eration. Ethics and Information Technology, 25(2):32.
Landa-Blanco, M., Flores, M. A., and Mercado, M. (2023).
Human vs. AI Authorship: Does it Matter in Evaluat-
ing Creative Writing? A Pilot Study Using ChatGPT.
https://tinyurl.com/landa-2023.
Liao, W., Liu, Z., Dai, H., Xu, S., Wu, Z., Zhang, Y., Huang,
X., Zhu, D., Cai, H., Liu, T., and Li, X. (2023). Differ-
entiate ChatGPT-generated and human-written medi-
cal texts. http://arxiv.org/abs/2304.11567.
Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., and
Finn, C. (2023). DetectGPT: Zero-shot machine-
generated text detection using probability curvature.
http://arxiv.org/abs/2301.11305.
Odri, G.-A. and Yoon, D. J. Y. (2023). Detecting genera-
tive artificial intelligence in scientific articles: evasion
techniques and implications for scientific integrity.
Orthopaedics & Traumatology: Surgery & Research,
109(8):103706.
Orenstrakh, M. S., Karnalim, O., Suarez, C. A., and Liut,
M. (2023). Detecting LLM-generated text in com-
puting education: A comparative study for ChatGPT
cases. http://arxiv.org/abs/2307.07411.
Oviedo-Trespalacios, O., Peden, A. E., Cole-Hunter, T.,
Costantini, A., Haghani, M., Rod, J. E., Kelly, S.,
Torkamaan, H., Tariq, A., and Newton, J. D. A.
(2023). The risks of using ChatGPT to obtain common
safety-related information and advice. Safety science,
167:106244. Publisher: Elsevier.
Pandey, A. K. and Roy, S. S. (2023). Natural language gen-
eration using sequential models: A survey. Neural
Processing Letters, 55(6):7709–7742.
Puttarattanamanee, M., Boongasame, L., and Thammarak,
K. (2023). A comparative study of sentiment analy-
sis methods for detecting fake reviews in e-commerce.
HighTech and Innovation Journal, 4(2):349–363.
Number: 2.
Robins-Early, N. (2024). Billie Eilish, Nicki Minaj, Ste-
vie Wonder and more musicians demand protection
against AI. https://tinyurl.com/robins-2024.
Salamon, E. (2024). Negotiating Technological Change:
How Media Unions Navigate Artificial Intelligence
in Journalism. Journalism & Communication Mono-
graphs, 26(2):159–163.
Stojanovic, L., Radojcic, V., Savic, S., Sandro, A., and
Cvetkovic, D. S. (2023). The Influence of Artificial
Intelligence on Creative Writing: Exploring the Syn-
ergy between AI and Creative Authorship. Interna-
tional Journal of Engineering Inventions.
Su, J., Zhuo, T. Y., Wang, D., and Nakov, P.
(2023). DetectLLM: Leveraging log rank informa-
tion for zero-shot detection of machine-generated text.
http://arxiv.org/abs/2306.05540.
Sudhakar, M. and Kaliyamurthie, K. P. (2022). Effective
prediction of fake news using two machine learning
algorithms. Measurement: Sensors, 24:100495.
Sudhakar, M. and Kaliyamurthie, K. P. (2024). Detection
of fake news from social media using support vector
machine learning algorithms. Measurement: Sensors,
32:101028.
The Authors Guild (2024). AI Is Driving a New Surge of Sham "Books" on Amazon. https://tinyurl.com/sham-2024.
Vaswani, A. (2017). Attention is all you need. Advances in
Neural Information Processing Systems.
Venkatraman, S., Uchendu, A., and Lee, D.
(2023). GPT-who: An information density-
based machine-generated text detector.
http://arxiv.org/abs/2310.06202.
Verma, V., Fleisig, E., Tomlin, N., and Klein, D. (2023).
Ghostbuster: Detecting text ghostwritten by large lan-
guage models. In Proceedings of the 2024 Conference
of the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies (Volume 1: Long Papers), pages 1702–1717.
Association for Computational Linguistics.
Wach, K., Doanh Duong, C., Ejdys, J., Kazlauskaitė, R.,
Korzynski, P., Mazurek, G., Paliszkiewicz, J., and
Ziemba, E. (2023). The dark side of generative arti-
ficial intelligence: A critical analysis of controversies
and risks of ChatGPT. Entrepreneurial Business and
Economics Review, pages 7–30. Publisher: Cracow University of Economics.
Walters, W. H. (2023). The effectiveness of software de-
signed to detect AI-generated writing: A comparison
of 16 AI text detectors. Open Information Science,
7(1). Publisher: De Gruyter Open Access.
Zaitsu, W. and Jin, M. (2023). Distinguishing
ChatGPT(-3.5, -4)-generated and human-written pa-
pers through Japanese stylometric analysis. PLoS
One, 18(8):e0288453.