UNCOVER: Identifying AI Generated News Articles
by Linguistic Analysis and Visualization

Lucas Liebe¹, Jannis Baum¹, Tilman Schütze¹, Tim Cech², Willy Scheibel¹ and Jürgen Döllner¹

¹ University of Potsdam, Digital Engineering Faculty, Hasso Plattner Institute, Germany
² University of Potsdam, Digital Engineering Faculty, Germany

ORCID: Lucas Liebe https://orcid.org/0009-0004-9252-4764, Jannis Baum https://orcid.org/0009-0008-8874-0279, Tilman Schütze https://orcid.org/0009-0007-3321-9489, Tim Cech https://orcid.org/0000-0001-8688-2419, Willy Scheibel https://orcid.org/0000-0002-7885-9857, Jürgen Döllner https://orcid.org/0000-0002-8981-8583
Keywords:
Explainable AI, Text Generation, Linguistic Text Analysis, Topic Modeling, Entity Recognition, Stylometry.
Abstract:
Text synthesis tools are becoming increasingly popular and better at mimicking human language. In trust-sensitive decisions, such as plagiarism and fraud detection, identifying AI-generated texts poses particular difficulties: decisions need to be made explainable to ensure trust and accountability. To support users in identifying AI-generated texts, we propose the tool UNCOVER. The tool analyzes texts through three explainable linguistic approaches: stylometric writing style analysis, topic modeling, and entity recognition. The tool outputs a prediction together with visualizations of the analysis. We evaluate the tool on news articles by means of the accuracy of the prediction and an expert study with 13 participants. The final prediction is based on the classification of the stylometric and topic evolution analyses. It achieved an accuracy of 70.4 % and a weighted F1-score of 85.6 %. The participants preferred to base their assessment on the prediction and the topic graph. In contrast, they found the entity recognition to be an ineffective indicator. Moreover, five participants highlighted the explainable aspects of UNCOVER, and overall the participants achieved 69 % accuracy. Eight participants expressed interest in continuing to use UNCOVER for identifying AI-generated texts.
1 INTRODUCTION
In recent years, artificial intelligence has become able
to generate texts that are similar to those written by
humans. While readers could easily recognize if a
computer wrote a text just a few years ago, today’s
systems are getting increasingly better at producing
convincing content, leading to new challenges (Floridi
and Chiriatti, 2020). One of the most impactful sys-
tems is ChatGPT (https://openai.com/blog/chatgpt) by OpenAI, due to the great public attention it has gained (Lund and Wang, 2023). This release sparked many discussions in the media about finding ways of identifying generated texts in the areas of: (1) application of text generators in many
daily-use systems like translators (https://www.forbes.com/sites/bernardmarr/2023/03/01/the-best-examples-of-what-you-can-do-with-chatgpt/), (2) evaluation of students' writing skills (https://abcnews.go.com/Health/wireStory/explainer-chatgpt-schools-blocking-96269407), and (3) validity of news sources (https://www.wired.com/story/ai-write-disinformation-dupe-human-readers/). This led to regulators considering halting AI development until strategies for how to deal with such technologies are developed (https://www.forbes.com/sites/jackkelly/2023/06/05/artificial-intelligence-is-getting-regulated/).
To tackle the issue,
OpenAI released a prototype of a black-box tool to identify texts from multiple generative models, which correctly identifies 26 % of AI-generated texts (Jan Hendrik Kirchner, 2023).
Aside from the relatively low accuracy, end-users should not be required to put their trust in a black-box solution over which they have no control (Rudin, 2019). A well-explainable tool can build trust by providing a deeper understanding. Moreover, for such sensitive decisions, humans need to be part of
a trustworthy solution to enable introspection. Visu-
ally intuitive and convincing explanations will make
services in this domain more accessible than lengthy
textual explanations.
This work applies explainable linguistic analysis
to the task of identifying AI-generated text to offer an
in-depth linguistic comparison of AI-generated and
human-written texts. To achieve this, we introduce
UNCOVER, which employs stylometric approaches,
topic modeling, and entity recognition to analyze the
linguistic features of news articles. We apply stylometry, a concept already used successfully to differentiate human authors, by implementing best practices from this field in an explainable way. For
the topic modeling approach of UNCOVER, we pro-
pose the “Topic Evolution Model” (TEM), which we derived from the “Topic Flow Model” by Churchill et al.
(2018). In addition to re-implementing the original,
we made various adjustments to their model and de-
veloped a visualization for the resulting topic graph.
TEM removes the requirement of large numbers of documents in each temporal period and is optimized to work with small corpora overall, instead of just small documents. The entity recognition component mostly consists of coreference resolution and its visualization.
To evaluate UNCOVER, we conduct an expert study in which 13 participants evaluated the tool on multiple usability aspects. Further, we introduce and
test a novel AI-news data set for public benchmark-
ing. The data set consists of training and evalua-
tion data to compute metrics of accuracy. This data
set is, to the best of our knowledge, the first pub-
licly available, medium-sized-text data set featuring
AI-generated news articles.
First, this work reviews related work for the
identification of AI-generated text in subsection 2.1
and linguistic approaches to text analysis in subsec-
tion 2.2. The proposed tool consists of multiple
components: a stylometric component, discussed in
subsection 3.1, topic modeling, discussed in subsec-
tion 3.2, a prediction based on both of these compo-
nents, explained in subsection 3.3, and entity recogni-
tion, found in subsection 3.4. We introduce our self-
generated dataset in subsection 4.1 and the conducted
expert study in subsection 4.2. Each of the four components (stylometry, topic modeling, prediction, and entity recognition) is evaluated in subsection 4.3, subsection 4.4, subsection 4.5, and subsection 4.6, respectively.
validity, and possible negative impacts on society are
discussed in section 5. We conclude this paper in sec-
tion 6.
2 RELATED WORK
The tool UNCOVER builds upon work in the areas of
(1) identification of AI-generated texts and (2) lin-
guistic features, coherence analysis and authorship at-
tribution.
2.1 Identifying AI Generated Text
OpenAI presented a service applicable to various
generation models, where they achieved classifica-
tion performance of 26 % AI texts correctly classi-
fied as such (Jan Hendrik Kirchner, 2023). Many
other commercial classification tools are available on-
line achieving good results, for instance gowinston.ai,
contentatscale.ai, and gptzero.me. Other research ap-
proaches, like the Giant Language Model Test Room
(GLTR) (Gehrmann et al., 2019), GROVER (Zellers
et al., 2019), and a fine-tuned Bidirectional Encoder
Representations from Transformers (BERT) (Ippolito
et al., 2019) are trained to recognize the output of a
singular model. GLTR provides a human-in-the-loop solution to help users make informed decisions instead of providing a prediction itself (Gehrmann et al., 2019). BERT models achieve the best accuracy of the mentioned approaches (Ippolito et al., 2019). However, this method has not been re-implemented and was only evaluated on GPT-2; newer or multiple AI generators have not been tested, making it unclear how the approach would perform today. UNCOVER aims to correctly classify texts from Large Language Models (LLMs) in general, which makes the task more complex. AI-generated texts have also attracted attention in the generation of fake news. Further research aims to detect such news through data mining techniques (Shu et al., 2017; Zhang and Ghorbani, 2020). However, this approach is not sufficient for UNCOVER, since we aim to identify AI-generated texts independently of the stated facts.
2.2 Linguistic Text Analysis
By analyzing and schematically describing the con-
tents of a text, linguistic text analysis can improve
Natural Language Understanding (Zhang and Wang,
2022). We have identified three approaches relevant
to our use case.
Stylometry. Human authors can be differenti-
ated by the statistical distribution of “Style Mark-
ers” (Houvardas and Stamatatos, 2006). “Style mark-
ers” are n-grams of textual features that can consist
of characters, words, or part-of-speech tags. Houvardas and Stamatatos (2006) found that 3-grams (trigrams), 4-grams, and 5-grams contain the most information for successful author identification and further highlighted trigrams for shorter texts. Posadas-Durán et al. (2017) introduced an algorithm for extracting syntactic n-grams from sentences using a dependency tree. These were found to perform better than most other n-grams, with character n-grams in second place (Ríos-Toledo et al., 2022). However, the proposed measures have not been tested on AI generators. UNCOVER uses character and syntactic trigrams for its analysis.
Topic Modeling. The variety of topic models is
ever-growing, and many of the more recent models
employ less explainable AI methods such as neural
networks (Churchill and Singh, 2022). Models based
on these methods are not applicable to UNCOVER
due to its constraint of being explainable. Another
branch of topic modeling research has focused on cre-
ating graph-based methods, which are more explain-
able by nature. A more recent graph-based model is the Topic Flow Model, which is used to produce seman-
tic graphs that describe how topics change throughout
defined temporal periods of the text corpus (Churchill
et al., 2018). While the Topic Flow Model was created
to work with short documents within the corpus’ pe-
riods, the number of documents in each period needs
to be large for it to produce relevant output.
Entity Recognition. Named Entities (NE) have
been found to positively influence the performance
of Machine Learning Systems that require context
information (Zhang and Wang, 2022). The Stan-
ford Named Entity Recognizer uses multiple machine
learning sequence models and rule-based components
to label 12 different NE classes (Finkel et al., 2005).
Entity Grids are a way of representing a text to capture
the location in which NE occur and can be a measure
of coherence (Mohiuddin et al., 2018). One method
of achieving this is called coreference resolution, which finds all expressions in a text that refer to the same entity (Clark and Manning, 2016). Because Language Models generate text word by word, we expect to find observable abnormalities in AI-generated text.
3 APPROACH
UNCOVER uses the linguistic approaches of stylome-
try, topic modeling, and entity recognition. The final
tool and its results are presented in a web interface available through the project's GitHub repository (https://github.com/hpicgs/unCover).
Figure 1: Syntactic dependency tree for “Victor sat at the counter on a plush red stool”.
3.1 Stylometry
UNCOVER employs character and syntactic trigrams.
The trigrams are obtained from the text as it is, i.e.
without preprocessing, since techniques such as char-
acter normalization or stemming would disrupt char-
acter trigrams (Chen and Manning, 2014). The distri-
bution of syntactic trigrams is extracted from depen-
dency trees generated by the Stanford NLP Depen-
dency Parser (Chen and Manning, 2014) (Figure 1).
Counting all discovered character triples creates the
character trigram distribution. Both distributions are
trimmed to only include the top 100 most common
trigrams. One logistic regression model is trained for
each author and distribution. For classification, mod-
els are split into Humans and AI and the highest score
of each group is compared against a threshold. If both
groups or no group hits the threshold, the stylometry
component cannot decide on the author. This case is
represented as an “unsure” result. If only one of the
groups passes the threshold, the text is classified to be
written by that group.
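As a concrete illustration, the following sketch implements the character-trigram variant of this pipeline; scikit-learn is assumed for the logistic regression, and the function names, shared-vocabulary handling, and the 0.5 decision threshold are illustrative assumptions rather than the exact UNCOVER implementation (the syntactic-trigram models would follow the same pattern with dependency-based features).

```python
# Sketch of the stylometry component (character trigrams only); the
# shared vocabulary of 100 trigrams and all names are assumptions.
from collections import Counter
from sklearn.linear_model import LogisticRegression

TOP_K = 100  # only the 100 most common trigrams are kept (as in the paper)

def char_trigram_distribution(text: str) -> dict[str, float]:
    """Relative frequencies of the TOP_K most common character trigrams."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    top = counts.most_common(TOP_K)
    total = sum(c for _, c in top) or 1
    return {gram: c / total for gram, c in top}

def features(text: str, vocabulary: list[str]) -> list[float]:
    """Project a text's trigram distribution onto a fixed vocabulary."""
    dist = char_trigram_distribution(text)
    return [dist.get(gram, 0.0) for gram in vocabulary]

def train_author_models(texts_by_author: dict[str, list[str]],
                        vocabulary: list[str]) -> dict[str, LogisticRegression]:
    """One binary logistic regression per author: 'written by X or not?'"""
    items = [(a, t) for a, ts in texts_by_author.items() for t in ts]
    X = [features(t, vocabulary) for _, t in items]
    return {
        author: LogisticRegression(max_iter=1000).fit(
            X, [1 if a == author else 0 for a, _ in items])
        for author in texts_by_author
    }

def classify(text: str, models: dict[str, LogisticRegression],
             humans: list[str], ais: list[str],
             vocabulary: list[str], threshold: float = 0.5) -> str:
    """Compare the best score of each author group against a threshold."""
    x = [features(text, vocabulary)]
    best = lambda group: max(models[a].predict_proba(x)[0][1] for a in group)
    human_hit, ai_hit = best(humans) >= threshold, best(ais) >= threshold
    if human_hit == ai_hit:  # both or neither group passes the threshold
        return "unsure"
    return "human" if human_hit else "machine"
```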
3.2 Topic Modeling
In coherent articles, the covered topics are introduced, change slightly to cover different aspects, and then either get replaced by a different topic or evolve into a connected and advanced theme. UN-
COVER analyzes and describes this specific trait. The
patterns found when analyzing the topics are then
used to differentiate human-written and AI-generated
texts. To illustrate how this is achieved by UNCOVER,
we will first introduce Topic Evolution Model (TEM),
then go into detail on how it is integrated into the tool,
and finally explain how texts are classified based on
TEM’s output.
3.2.1 Topic Evolution Model
UNCOVER's most important requirement for topic
modeling is being able to generate an explainable
overview of the change of topics in a single news ar-
ticle. Churchill et al. (2018) have introduced Topic
Flow Model (TFM) as a graph-based model to ana-
UNCOVER: Identifying AI Generated News Articles by Linguistic Analysis and Visualization
41
lyze changes to topics over time in a large corpus con-
sisting of multiple temporal periods, each containing
many documents. Based on this work, we have de-
veloped Topic Evolution Model (TEM) to fit our re-
quirements of working with significantly smaller cor-
pora such as a single news article, where single para-
graphs make up the periods, each consisting of indi-
vidual sentences as documents. In the following, we
will illustrate TEM’s main differences to TFM.
First Period. Churchill et al. (2018) introduce
nutrition (nut), and energy values for words in the
corpus, given by the following equations
$nut(w)_p = (1 - c) + c \cdot \frac{tf(w)_p}{tf(\hat{w}_p)_p}$ (1)

$energy(w)_p = \sum_{i=1}^{p-1} \frac{1}{i} \left( nut(w)_p^2 - nut(w)_{p-i}^2 \right)$ (2)

where $w$ and $p$ are the given word and period, $\hat{w}_p$ is the most common word in $p$, $tf(w)_p$ is the term frequency of $w$ in $p$, and $c \in [0, 1]$ is a tuning parameter.
By definition, the energy (Equation 2) of all words in
the first period in TFM is equal to 0. This may lead to
all words falling through the energy threshold, mean-
ing no emerging words are found. TEM instead sets
the energy of words in the first period equal to their
squared nutrition to allow for the existence of emerg-
ing words in the first period.
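As a minimal sketch of these two quantities and TEM's first-period rule, assuming one term-frequency map per period and treating the nutrition of words absent from earlier periods as 0 (both are assumptions about the data layout):

```python
# Sketch of nutrition (Eq. 1) and energy (Eq. 2) per period.
from collections import Counter

def nutrition(tf: Counter, c: float = 0.5) -> dict[str, float]:
    """Eq. 1: nutrition of each word, relative to the most common word."""
    max_tf = max(tf.values())
    return {w: (1 - c) + c * tf[w] / max_tf for w in tf}

def energy(nut_history: list[dict[str, float]]) -> dict[str, float]:
    """Eq. 2 for the latest period p, with TEM's first-period rule."""
    p = len(nut_history)
    current = nut_history[-1]
    if p == 1:
        # TFM's sum is empty here (energy 0); TEM instead uses the
        # squared nutrition so emerging words can exist in period 1
        return {w: n ** 2 for w, n in current.items()}
    return {
        w: sum((current[w] ** 2 - nut_history[p - 1 - i].get(w, 0.0) ** 2) / i
               for i in range(1, p))
        for w in current
    }
```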
Flood Words. TFM classifies all words that appear
in at least half of all documents in a period as flood
words (Churchill et al., 2018), an a posteriori alterna-
tive of stop words that should be ignored in the anal-
ysis. Therefore, all words are flood words when a
period only has two documents. To enable processing
paragraphs with only two sentences, TEM classifies
words as flood only when they appear in more than
half of all documents.
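The difference is a single strict inequality; as a sketch, with documents assumed to be token lists:

```python
# Flood-word test: TFM uses "at least half" (>=), which floods every
# word in a two-document period; TEM uses "more than half" (>).
def is_flood_word(word: str, documents: list[list[str]]) -> bool:
    return sum(word in doc for doc in documents) > len(documents) / 2
```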
Correlation. For all pairs of words in a period,
TFM applies the following formulas for the term correlation $c^t_{k,z}$ of word $k$ to word $z$ at time $t$:

$x^t_{k,z} := \frac{n_{\{k,z\}} / (n_{\{k\}} - n_{\{k,z\}})}{(n_{\{z\}} - n_{\{k,z\}}) / (|D_t| - n_{\{z\}} - n_{\{k\}} + n_{\{k,z\}})}$ (3)

$c^t_{k,z} = \log x^t_{k,z} \cdot \left( \frac{n_{\{k,z\}}}{n_{\{k\}}} - \frac{n_{\{z\}} - n_{\{k,z\}}}{|D_t| - n_{\{k\}}} \right)$ (4)

where $n_A$ is the number of documents all words in $A$ co-occur in, and $|D_t|$ is the number of documents in period $t$ (Churchill et al., 2018).
This formula has an edge case where division by 0
occurs when two terms in a period only co-occur and
never occur in a document by themselves.

Figure 2: Example of an AI-generated Topic Evolution graph by GPT-3.

The way
this is handled by TFM is unknown. When the num-
ber of documents in a period is low, this edge case
is common. TEM solves this by recursively merging
strictly co-occurring terms into the same node in the
semantic graph. This way, the correlation between
strictly co-occurring terms never has to be evaluated.
All terms of the node are accounted for in the for-
mula, to consider nodes with multiple terms during
topic distance evaluation.
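A direct transcription of Equations 3 and 4 with this guard in mind might look as follows; the counts are per period, and the sketch assumes that degenerate counts (including strictly co-occurring pairs) have already been removed by the merging step:

```python
# Term correlation (Eqs. 3 and 4) for one pair of words in one period.
# n_k, n_z: documents containing k (resp. z); n_kz: documents with both
# (assumed > 0); n_docs: |D_t|. Strictly co-occurring terms
# (n_k == n_z == n_kz) are merged into one node beforehand, so the
# denominators stay nonzero.
import math

def correlation(n_k: int, n_z: int, n_kz: int, n_docs: int) -> float:
    odds = ((n_kz / (n_k - n_kz))
            / ((n_z - n_kz) / (n_docs - n_z - n_k + n_kz)))
    return math.log(odds) * (n_kz / n_k - (n_z - n_kz) / (n_docs - n_k))
```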
Topic Distance. TFM matches newly discovered
topics to existing themes solely based on the re-
currence of a single leader term that describes the
theme (Churchill et al., 2018). When the number of
documents in multiple periods is small, so is the to-
tal number of terms in each of these periods. This
makes it likely that a leader term doesn’t reoccur in a
topic, even when the theme persists. To be able to rec-
ognize existing themes despite the leader term’s po-
tential absence, TEM reemploys the measurement of
topic distance by comparing all new emerging topics
with the predecessor period's topics. The topic distance between a pair of topics is given by Churchill et al. (2018) as

$td_{t_1,t_2} = \frac{\min(|t_1 \setminus t_2|, |t_2 \setminus t_1|)}{|t_1 \cap t_2|}$, (5)

with $t_1$ and $t_2$ being the sets of words in the two topics.
TEM can process corpora containing periods with
as few as two documents, and therefore meets UNCOVER's requirements by allowing it to process indi-
vidual articles. TEM generates a list of periods. Each
period contains a list of topics, which in turn have a
list of words and a theme identifier. Discovered topics
that are sufficiently similar to an existing topic receive
the same theme identifier as the existing topic, which
describes the topic evolution.
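As a sketch, the topic distance of Equation 5 (as reconstructed above) for topics represented as sets of stems might read:

```python
# Topic distance (Eq. 5); treating disjoint topics as maximally
# distant is an assumption about the edge case.
def topic_distance(t1: set[str], t2: set[str]) -> float:
    common = len(t1 & t2)
    if common == 0:
        return float("inf")
    return min(len(t1 - t2), len(t2 - t1)) / common
```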
3.2.2 Integration into UNCOVER
To analyze a text with TEM, UNCOVER first splits the
text into paragraphs by considering newline charac-
ters, and each paragraph into sentences by consider-
ing sentence termination characters such as periods or
question marks. Should a paragraph consist of only a
single sentence, it is merged with its predecessor. This
ensures that each paragraph is made up of at least two sentences and meets the requirements for
periods of documents in TEM’s input. Before the text
is passed into TEM, all non-alphanumerical charac-
ters and stop-words are removed, all letters are trans-
formed into lowercase, and all words are stemmed.
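A sketch of this preprocessing, assuming NLTK for stop words and stemming (the exact sentence-splitting rules are simplified):

```python
# Split a text into TEM periods (paragraphs) of documents (sentences),
# then lowercase, strip non-alphanumerics and stop words, and stem.
# Requires: nltk.download("stopwords")
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stem = PorterStemmer().stem

def to_periods(text: str) -> list[list[list[str]]]:
    periods: list[list[str]] = []
    for paragraph in filter(str.strip, text.split("\n")):
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        if len(sentences) < 2 and periods:
            periods[-1].extend(sentences)  # merge single-sentence paragraph
        else:
            periods.append(sentences)
    return [[[stem(w) for w in re.findall(r"[a-z0-9]+", s.lower())
              if w not in STOP]
             for s in period]
            for period in periods]
```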
TEM’s output is finally visualized with a vertical,
directed graph. The lists of words for each topic in a
period are placed in horizontally aligned nodes, and
these rows are aligned from top to bottom. The nodes
of topics with a theme identifier that has occurred in
the predecessor period are connected with a directed
edge to the respective predecessor topic. Figure 2
shows one example of such topic evolution graphs.
3.2.3 Classification
Aside from serving as a user-directed visualization,
the discussed graph is also used for automatic clas-
sification with a multinomial logistic regression on
the following Topic Evolution connectivity metrics
(TEcm):
1. The absolute value of 1 minus the ratio of the number of distinct theme identifiers to the total number of topics
2. The ratio of the number of topics carrying the most common theme identifier to the total number of topics
3. The number of periods that have at least one incoming edge, i.e., a topic with a predecessor topic, divided by the total number of periods minus 1
4. The ratio of the longest chain of connected periods to the total number of periods
Each of the connectivity metrics represents a different interpretation of the connectedness of the graph, scaled between 0 and 1, with 0 being the least and 1 being the most connected; a sketch of these metrics follows below.
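The four metrics can be sketched as follows, assuming TEM's output is available as a list of periods whose topics carry a theme identifier and a flag for having a predecessor (the data layout is an assumption):

```python
# Topic Evolution connectivity metrics (TEcm); assumes a non-empty graph.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Topic:
    theme: int              # theme identifier assigned by TEM
    has_predecessor: bool   # linked to a topic in the previous period?

def connectivity_metrics(periods: list[list[Topic]]) -> list[float]:
    topics = [t for period in periods for t in period]
    themes = Counter(t.theme for t in topics)
    m1 = abs(1 - len(themes) / len(topics))
    m2 = themes.most_common(1)[0][1] / len(topics)
    # which periods (after the first) have at least one incoming edge?
    connected = [any(t.has_predecessor for t in p) for p in periods]
    m3 = sum(connected[1:]) / max(len(periods) - 1, 1)
    longest = run = 1  # longest chain of consecutively connected periods
    for c in connected[1:]:
        run = run + 1 if c else 1
        longest = max(longest, run)
    m4 = longest / len(periods)
    return [m1, m2, m3, m4]
```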
3.3 Final Prediction
The prediction of stylometry and the prediction of the Topic Evolution connectivity metrics (TEcm) are combined into a single final output through a basic decision tree. To begin with, we output the shared prediction if both components agree. If the stylometry result is uncertain or TEcm's confidence level is over 80 %, we immediately output the TEcm classification. If stylometry predicted the text to be AI-generated, we output this decision if TEcm's confidence level is below 70 %. However, if the text is predicted to be human-written by stylometry, we decrease the threshold to 60 % to minimize misclassifying human authors. Finally, if TEcm's confidence exceeds the respective threshold, we present an uncertain outcome. Figure 3 shows one example of a final prediction of an AI-generated text presented to the user after running the full analysis.
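The decision tree can be summarized in a few lines; the label strings and the exact confidence handling are assumptions about the implementation:

```python
# Final prediction combining the stylometry label ("machine", "human",
# or "unsure") with the TEcm label and its confidence in [0, 1].
def final_prediction(stylo: str, tecm: str, tecm_conf: float) -> str:
    if stylo == tecm:
        return stylo                      # both components agree
    if stylo == "unsure" or tecm_conf > 0.8:
        return tecm                       # defer to TEcm
    # disagreement: trust stylometry while TEcm is not confident; the
    # lower threshold for "human" minimizes misclassified human authors
    threshold = 0.7 if stylo == "machine" else 0.6
    return stylo if tecm_conf < threshold else "unsure"
```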
3.4 Entity Recognition
A coherently written text should introduce and men-
tion entities in an orderly way that does not con-
fuse its readers. Therefore, especially the change and
occurrence of different entities may offer a relevant
indicator for an AI generator’s weaknesses. Stan-
ford’s CoreNLP Parser offers the ability to recognize
and track entities in multiple sentences (Finkel et al.,
2005). For entity recognition, UNCOVER uses the
parser with its default parameters. For coreference
resolution, CoreNLP offers various models, out of
which UNCOVER uses the most accurate neural net-
work. The output is visually represented as a vertical
stacked bar chart, where each sentence is displayed
together with a stack of bars (see Figure 4). Each bar
Figure 3: Screenshot covering an example classification.
represents an entity mentioned in a rolling window that consists of the respective sentence, its predecessor, and its successor. The bars are sized according to the number of times the entity was mentioned in the corresponding rolling window.

Figure 4: An example of an entity occurrences diagram. The visualization may be challenging to comprehend because the used parser is not perfect.
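The data behind the chart can be sketched as follows, assuming coreference chains are given as sentence indices per entity (the chart rendering itself is omitted):

```python
# Per-sentence entity counts over a rolling window of the sentence,
# its predecessor, and its successor; the input layout is an assumption.
def rolling_window_counts(chains: dict[str, list[int]],
                          n_sentences: int) -> list[dict[str, int]]:
    rows = []
    for s in range(n_sentences):
        window = {s - 1, s, s + 1}
        counts = {entity: sum(i in window for i in mentions)
                  for entity, mentions in chains.items()}
        rows.append({e: c for e, c in counts.items() if c > 0})
    return rows
```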
4 EVALUATION
We evaluate all three components of UNCOVER re-
garding accuracy and explainability. To reflect the differing impact of the answer types, the weighted F1-score is calculated with a half penalty for unsure answers and a full penalty for opposite answers.
4.1 Dataset
Our final datasets consist of scraped news articles written by human authors and generated news articles sourced between March and May 2023. In total, the training dataset contains 2837 news articles by five different authors scraped from the Guardian (www.theguardian.com) as a large, trustworthy news outlet. It also comprises 2400 articles that were created using five distinct LLM queries, with 600 articles generated for each query. To write news articles, GPT-2 (Radford et al., 2019) was given news titles as prompts. Similarly, we utilized GPT-3 (Brown et al., 2020) twice: first, we tasked it in the same manner as GPT-2, and second, we provided the full-text news article and asked GPT-3 to rewrite it as a news article. This approach allowed us to evaluate possible divergences in the results with queries containing different amounts of context. To match its input style, GROVER (Zellers et al., 2019) was given full news articles and titles, along with the original URL and other mock data. The original news articles were scraped from Google News using search queries on similar topics as the Guardian articles.
To ensure the accuracy of our components in real-
world scenarios, we devised a separate dataset for
testing purposes. We included 200 articles each from
human authors, GPT-2, GPT-3, and GROVER. The
GPT-3 articles comprised 100 title-only and 100 full-text generated articles. For this test dataset, we chose ten general news queries like “environmental concern”, scraped the top 15 results of Google News, and generated articles as described for the training dataset. This time, pages that could not be scraped with our system or that had excessively long articles were skipped entirely, to ensure all classes contained the same 200 sources.
4.2 Expert Study
As another means of evaluation, an expert study was set up to evaluate UNCOVER's practical use and explainability with 13 participants from an academic machine learning or Natural Language Processing background. They were provided with ten unique random texts from the dataset and the corresponding UNCOVER output and asked to classify the texts while using the tool. The participants were questioned about the use and understandability of UNCOVER and its individual visualizations. On average, the participants classified texts correctly in 69 % of cases. As to the usefulness of UNCOVER, half of the participants stated that UNCOVER's output changed
Table 1: Assessment results from the classification of news articles using character trigrams (a), syntactic trigrams (b), combined stylometry (c), topic evolution connectivity metrics (d), and final prediction (e). Columns give the true source of the texts, rows the assigned label.

                              GPT-2   GPT-3   GROVER   Human
(a) Character    Machine       80 %    58 %    24 %     14 %
    Trigrams     Human          6 %    18 %    37 %     54 %
                 Unsure        14 %    24 %    39 %     32 %

(b) Syntactic    Machine       67 %    70 %    76 %     57 %
    Trigrams     Human          6 %     8 %     7 %     19 %
                 Unsure        27 %    22 %    17 %     24 %

(c) Combined     Machine       86 %    69 %    53 %     32 %
    Stylometry   Human          6 %    12 %    14 %     33 %
                 Unsure         8 %    19 %    33 %     35 %

(d) TEcm         Machine       84 %    81 %    55 %     52 %
                 Human         16 %    19 %    45 %     48 %
                 Unsure         0 %     0 %     0 %      0 %

(e) Final        Machine       91 %    85 %    62 %     35 %
    Metric       Human          1 %     4 %    21 %     45 %
                 Unsure         8 %    11 %    17 %     20 %
their mind about the origin of the text two or more
times out of ten and most of them would want to use
the tool again next time. Before being questioned
about it or informed of the goal, five participants high-
lighted the explainable aspects of the tool. More de-
tailed descriptions of the results can be found in the following subsections or in the Appendix (subsection A.1).
4.3 Stylometry
For the training of a model, the training dataset was
split into 80 % training and 20 % validation data. The
final accuracy of the stylometry regression on the val-
idation data was 73 % with less than 5 % of human-
written texts predicted as AI-generated. This changed
on the test dataset, where the model combining both
trigrams achieves a total accuracy of 59.3% and
weighted F1-score of 80.66% (see Table 1c). Syn-
tactic trigrams alone had an accuracy of 57.6 % (see
Table 1b) and performed slightly better than charac-
ter trigrams with an accuracy of 53.4 % (see Table 1a).
However, since syntactic trigrams performed much worse on texts labeled as human-written, they cannot be considered better overall. A result of the expert study was that the stylometry approach, as a main part of UNCOVER's prediction, had the highest influence on the participants' decisions when classifying texts. Since a logistic regression model was used for classification and the same 100 trigrams were used as features for every trained regression, the classifications remain explainable to technical users, as the model contains a relevance value for each trigram.
4.4 Topic Modeling
For testing, we conducted Mann-Whitney
tests (Nachar, 2008) on all pairs between
GPT-2 (Radford et al., 2019), GROVER (Zellers
et al., 2019), GPT-3 (Brown et al., 2020)
(two groups with different prompt styles), and
humans (three different authors). All of these
tests proved a significantly different mean between
the groups with p < 5%, with the exception of
connectivity metric 3 between GPT-2 and humans,
and connectivity metric 2 between GROVER and
humans. Figure 5 shows density plots comparing
the connectivity metrics for human-written texts
(solid line) to those generated by different AI models
(dashed lines).
The logistic regression classification based on
these TEcm achieves an overall accuracy of 67.4 %
and weighted F1-score of 72.77%. A more detailed
analysis of the results can be seen in Table 1d. On the
validation dataset, we achieved an average accuracy of 77.98 % using 5-fold cross-validation. Compared to stylometry, TEcm performs slightly better on both datasets. However, it also does not manage to generalize better, losing accuracy on the test dataset as well. The smaller F1-score and the many wrong predictions on human authors are mainly caused by the lack of certainty in the TEcm classification.
Figure 2 shows an example of a Topic Evolution
graph. This graph predominantly consists of a sin-
gle topic evolving throughout the majority of tem-
poral periods (article paragraphs). We have found
the Topic Evolution graphs to give clear insights into
how themes develop over the course of the analyzed
articles. Graphs for human-written articles tend to
have multiple distinct sections of a few periods in
length that are internally very connected. In contrast,
graphs for AI-generated articles often contain a sin-
gle evolving topic that spans most of the article and
the majority of nodes. This finding is supported by
the conducted expert study, with participants judging
the graphs to have a median “understandable” (4/6)
clarity and having an average higher influence on the
participants’ decision than the entity diagram with a
median of “strong” (5/6).
Figure 5: Density plots for the different Topic Evolution connectivity metrics (TEcm), comparing humans (solid line) to different AI generators (dashed lines), with mean values shown as vertical lines: (a) connectivity metric 1, (b) connectivity metric 2, (c) connectivity metric 3, (d) connectivity metric 4.
4.5 Final Prediction
The combined result of TEcm and Stylometry regres-
sion achieved an accuracy of 70.4 % and weighted F1-
score of 85.65%. 14.2 % of the classifications are un-
sure and 15.8 % are assigned to the wrong label. In
Table 1e the results are presented in more detail. The
participants in the expert study described this aspect as the best component of UNCOVER, because it contains the most direct indication of the author of a text.
Four participants said that they would use the tool in
the future only because of the prediction. The out-
put of UNCOVER aligned with the intuition of eight
participants on self-chosen articles and led to three
participants questioning their intuition.
4.6 Entity Diagram
From our experience, this approach was not success-
ful at distinguishing AI-generated text from human-
written text. We found that many misclassifications
occurred due to inaccuracies of the Stanford NLP
parser. This behavior leads to an explosion of newly
introduced entities in texts, making it harder to find
patterns for human authors and Language Models.
While the Named Entity Recognizer itself is only
based on explainable machine learning models, the
used coreference resolution model is a neural network. Therefore, it is the only one of UNCOVER's components that is considered non-explainable. Figure 4 shows UNCOVER's visualization of entity occurrence.
UNCOVER's visualization of the Stanford NLP parser's output is overall hard to take in, because of the number of sentences in an article and the number
of entities used. This was also noticed by the partici-
pants in the expert study. Eleven of 13 subjects noted
that the diagram does not produce perceivable differ-
ences in entity occurrence patterns based on whether
the text was written by a human or generated by AI.
The Entity Diagram also turned out to be the worst
rated component in the tool on clarity and helpful-
ness. While, in the context of our study, the total
number of entities used per sentence may give some
indication of complexity, it is not possible to reliably
distinguish AI-generated texts from human-generated
ones. Even though the participants disliked this com-
ponent for the task of classification, three of them said
they would want to use it for other tasks. For instance,
one subject said that it is incredibly helpful to identify
parts of the text that are interesting to him.
5 DISCUSSION
UNCOVER can differentiate human from machine
authors. It achieves a high classification accuracy
on multiple state-of-the-art generation models, while
not requiring a pre-trained Large Language Model
(LLM). For instance, OpenAI's classifier correctly classified only 26 % of AI texts (Jan Hendrik Kirchner, 2023), while UNCOVER achieves 70.4 %. In addi-
tion, through the provided visualization on topic evo-
lution, a user can better analyze structures inside a
text and make an informed decision.
As we consider it most undesirable to make false claims about human authors, we set up the training to reduce the occurrence of incorrectly classified human texts, which is reflected in our weighted F1-score of 85.65 %. However, the highest error rate is found among human authors in the final evaluation. This is at least partly because the human-written texts in the test data set consisted of randomly crawled news articles, whereas the models were trained to specifically recognize texts written by five authors from the same publisher, who presumably share a similar writing style.
When we compare the classification performance
of different models, we can see that they differ in their
ability to resemble human-written texts based on the
concept used. For instance, GPT-2 can be more read-
ily identified when using character trigrams, whereas
GPT-3 unexpectedly is identified better than GPT-2
when using syntactic trigrams. This highlights the
need for AI-detection methods to incorporate multi-
ple text analysis concepts and cover a wide range of
aspects. In the evaluation of GPT-3 articles, we combined the two generation queries into one accuracy figure, as no differences between their data were found in our experiments.
Through the expert study, Named Entity Recognition (NER) has not been proven to benefit the goal of UNCOVER, but it still offers further insights into the text. Other researchers have also pointed out that NER needs to be improved as a separate component to achieve more consistent results (Zhang and Wang, 2022). The expert study showed that the tool overall builds trust with users, making them more confident in their decisions.
Achieved Explainability. The primary objective of
UNCOVER was to provide clear explanations for a
system’s decisions, using visual aids that enable users
to make informed choices. While one visualization
has been found to be useful in an expert study, more
and better visualizations need to be developed to im-
prove explainability. Nevertheless, we have achieved
complete technical explainability of all crucial com-
ponents. The decision-making process of UNCOVER
is based on a decision tree that employs explainable
metrics. Experts familiar with logistic regression can
interpret the trained models and understand the tool’s
decision based on this. The topic modeling met-
rics can be derived from the topic graph visualiza-
tion. However, due to the vast number of parameters
and the complexity of graph algorithms, the decision-
making process cannot be fully comprehended by an
average user.
Limitations. UNCOVER currently only works for English texts, and its approaches might lose accuracy quickly because of the rapid advances made by LLMs. New model releases, like GPT-4 (OpenAI, 2023), or fine-tuned systems that are developed to break our analysis are an ever-existing threat. We already documented differences between the classification results of the models used and might likewise experience worse performance on other models that have not been included in training. When evaluating the performance on human-authored texts, we can already see this effect: authors seen in training are recognized with much better accuracy. On the same subject, the tool was only evaluated on self-generated news from three LLMs and needs further evaluation on more data to assess how well its performance generalizes.
Threats to Validity. A potential drawback of the re-
ported findings is that the accuracy calculations were
based on data generated through simplistic prompts.
We compensated for this effect by providing two different prompts to GPT-3 in the generation of the dataset, thereby including multiple levels of complexity.
Newer models can take specific task queries to fol-
low a specific story line which would make the topic
modeling classification more difficult. Similarly, style
transfer is a concept where LLMs mimic a certain writ-
ing style (de Rivero et al., 2021), which would make
the detection via stylometry impossible if applied suc-
cessfully. Further, we assumed that news outlet texts
on the top of Google News are written by humans
when creating test data, which may not be accurate
anymore and impact our human classification accu-
racy. The tool also was only evaluated on English
texts, specifically news articles, which carry specific
UNCOVER: Identifying AI Generated News Articles by Linguistic Analysis and Visualization
47
language traits, which can differ a lot between lan-
guages and text types. Therefore, the evaluations are
not generalizable on a larger scale. The reported per-
formance has also not been compared to other tools
available online, making it difficult to judge its effi-
ciency.
The conducted study has limited generalization potential, since only 13 participants took part. Further, it only questioned participants studying towards a degree in computer science at the same institution with prior experience in artificial intelligence. Therefore, the results of the study are subject to sampling bias and not generalizable. Also, most participants self-applied to the study because they are interested in the area of Natural Language Processing; however, accepting these effects was necessary to achieve a larger number
of participants. Another factor is the bias found in
UNCOVER's visualizations, as they have been created
to visualize effects that we observed. By evaluating
them in an expert study we found that people build
similar intuitions, which could have been influenced
by the style of presenting the visualizations.
Negative Societal Impact. The accuracy of the
identifier is not high enough to trust the output but
may influence people’s opinions. This effect can
lead to problems when deploying such services. The
proposed tool can potentially bring injustice to au-
thors with writing patterns similar to generative mod-
els. For instance, non-native authors could struggle to write as coherently as native writers. These authors could face wrongful social judgment, being accused of not having written their publications themselves. In the same way, authors who use generators might get exposed by such tools.
6 CONCLUSIONS
We presented UNCOVER, a tool that uses concepts of
linguistic text analysis to distinguish between human-
written and AI-generated news articles. The con-
cepts considered are stylometry, topic modeling, and
entity recognition. For topic modeling, we intro-
duced Topic Evolution Model (TEM). A final classifi-
cation and two visualizations are shown to the user of
UNCOVER inside a web interface.
We evaluated the tool on news articles by means
of accuracy of the prediction and an expert study with
13 participants. Stylometry was found to overall be
able to identify AI-authors. TEM is very successful
in describing topics and their development, while also
being a good measure for theme coherence. In the
study, participants preferred to base their assessment
on the prediction and the topic graph, while rating
the entity recognition as the least effective indicator.
Eight participants expressed interest in continuing to
use UNCOVER for identifying AI-generated texts and
five participants highlighted the explainable aspects.
Because we observed common inconsistencies in
how entities occur in AI-generated texts, we think en-
tity recognition can become helpful if developed fur-
ther. Therefore, we will look into an improvement in
coreference resolution and different visualizations of
this component. During the development of the tool,
new LLMs have already been released. UNCOVER
should be evaluated with models like GPT-4 (OpenAI,
2023) and Google’s PaLM 2 (Anil et al., 2023). Be-
sides the evaluation on different models, we only eval-
uated the performance on news articles. In the future,
the accuracy should be evaluated on differently sized
and differently structured texts. Other linguistic ap-
proaches, like sentiment analysis could be tested and
added to the tool. Finally, a larger user study should
be considered to complement the positive findings of
this work.
To ensure reproducibility of the tool and results, the code is published open-source (https://github.com/hpicgs/unCover) together with our generated news dataset.
ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their feed-
back. This work was partially funded by the Ger-
man Federal Ministry for Education and Research
(BMBF) through grant 01IS22062 (“AI research
group FFS-AI”).
REFERENCES
Anil, R., Dai, A. M., Firat, O., Johnson, M., et al.
(2023). PaLM 2 technical report. arXiv CoRR, cs.CL.
arXiv:2305.10403.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,
G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.,
Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E.,
Litwin, M., Gray, S., Chess, B., Clark, J., Berner,
C., McCandlish, S., Radford, A., Sutskever, I., and
Amodei, D. (2020). Language models are few-shot
learners. In Proc. Neural Information Processing Sys-
tems, NeurIPS ’20, pages 1877–1901. Curran Asso-
ciates, Inc.
Chen, D. and Manning, C. D. (2014). A fast and accurate
dependency parser using neural networks. In Proc.
SIGDAT Conference on Empirical Methods in Natural
Language Processing, EMNLP ’14, pages 740–750.
ACL.
Churchill, R. and Singh, L. (2022). The evolution of topic
modeling. ACM Computing Surveys, 54(10):215:1–
35.
Churchill, R., Singh, L., and Kirov, C. (2018). A temporal
topic model for noisy mediums. In Proc. 22nd Pacific-
Asia Conference on Knowledge Discovery and Data
Mining, PAKDD ’18, pages 42–53. Springer.
Clark, K. and Manning, C. D. (2016). Deep reinforce-
ment learning for mention-ranking coreference mod-
els. In Proc. SIGDAT Conference on Empirical Meth-
ods on Natural Language Processing, EMNLP ’16,
pages 2256–2262. ACL.
de Rivero, M., Tirado, C., and Ugarte, W. (2021). For-
malStyler: GPT based model for formal style transfer
based on formality and meaning preservation. In Proc.
13th International Conference on Knowledge Discov-
ery and Information Retrieval, KDIR ’21, pages 48–
56. SciTePress.
Finkel, J. R., Grenager, T., and Manning, C. D. (2005).
Incorporating non-local information into information
extraction systems by Gibbs sampling. In Proc. 43rd
Annual Meeting of the Association for Computational
Linguistics, ACL ’05, pages 363–370. ACL.
Floridi, L. and Chiriatti, M. (2020). GPT-3: Its nature,
scope, limits, and consequences. Springer Minds and
Machines, 30:681–694.
Gehrmann, S., Strobelt, H., and Rush, A. M. (2019). GLTR:
Statistical detection and visualization of generated
text. In Proc. 57th Annual Meeting of the Associa-
tion for Computational Linguistics: System Demon-
strations, ACL ’19, pages 111–116. ACL.
Houvardas, J. and Stamatatos, E. (2006). N-gram feature
selection for authorship identification. In Proc. 12th
International Conference on Artificial Intelligence:
Methodols, Systems, and Applications, AIMSA ’06,
pages 77–86. Springer.
Ippolito, D., Duckworth, D., Callison-Burch, C., and Eck,
D. (2019). Automatic detection of generated text is
easiest when humans are fooled. In Proc. 58th An-
nual Meeting of the Association for Computational
Linguistics, ACL ’19, pages 1808–1822. Association
for Computational Linguistics.
Jan Hendrik Kirchner, Lama Ahmad, Scott Aaronson, and Jan Leike (2023). OpenAI AI classifier. https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text. Last accessed: 2023-03.
Lund, B. D. and Wang, T. (2023). Chatting about ChatGPT:
how may AI and GPT impact academia and libraries?
Library Hi Tech News, 40(3):26–29.
Mohiuddin, T., Joty, S., and Nguyen, D. T. (2018). Coher-
ence modeling of asynchronous conversations: A neu-
ral entity grid approach. In Proc. 56th Annual Meet-
ing of the Association for Computational Linguistics
Volume 1: Long Papers, ACL ’18, pages 558–568.
ACL.
Nachar, N. (2008). The Mann-Whitney U: A test for assess-
ing whether two independent samples come from the
same distribution. Tutorials in Quantitative Methods
for Psychology, 4(1):13–20.
OpenAI (2023). GPT-4 technical report. arXiv CoRR,
cs.CL. arXiv:2303.08774v3.
Posadas-Durán, J.-P., Sidorov, G., Gómez-Adorno, H., Batyrshin, I., Mirasol-Meléndez, E., Posadas-Durán, G., and Chanona-Hernández, L. (2017). Algorithm for extraction of subtrees of a sentence dependency parse tree. Acta Polytechnica Hungarica, 14(3):79–98.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D.,
Sutskever, I., et al. (2019). Language models are un-
supervised multitask learners. Technical report, Ope-
nAI.
Ríos-Toledo, G., Posadas-Durán, J. P. F., Sidorov, G., and Castro-Sánchez, N. A. (2022). Detection of changes in literary writing style using n-grams as style markers and supervised machine learning. PLOS ONE, 17(7):1–24.
Rudin, C. (2019). Stop explaining black box machine learn-
ing models for high stakes decisions and use inter-
pretable models instead. Nature Machine Intelligence,
1(5):206–215.
Shu, K., Sliva, A., Wang, S., Tang, J., and Liu, H. (2017).
Fake news detection on social media: A data mining
perspective. ACM SIGKDD Explorations Newsletter,
19(1):22–36.
Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi,
A., Roesner, F., and Choi, Y. (2019). Defending
against neural fake news. In Proc. Neural Information Processing Systems, NeurIPS '19. Curran Associates, Inc.
Zhang, C. and Wang, J. (2022). Tag-Set-Sequence learn-
ing for generating question-answer pairs. In Proc.
14th International Conference on Knowledge Discov-
ery and Information Retrieval, KDIR ’22, pages 138–
147. SciTePress.
Zhang, X. and Ghorbani, A. A. (2020). An overview of on-
line fake news: Characterization, detection, and dis-
cussion. Elsevier Information Processing & Manage-
ment, 57(2):102025:1–26.
APPENDIX
Expert Study Design
The expert study with 13 participants was conducted
using Google Forms (https://docs.google.com/forms/d/1-dLyWXKx01stPUcldRJ35dovfQmpuRyxcdtIqSj6bjk/prefill) to collect the given answers and
guide the participants through the process. The partic-
ipants were observed by an instructor in person during
the time of the study to capture direct comments on
the components. Before the study, participants were
told that the efficiency of the tool was to be evaluated
and the background in explainable AI was not men-
tioned explicitly. During the session, instructors min-
imized their communication with the participants, ex-
Table 2: Answers to “How strong did each component influence your decision?” (a) and “Please rate the visualizations based on their clarity.” (b).

(a)              not at all   very little   little   strong   stronger   very greatly
Prediction            0            2           2        4         5           0
Topic Graph           1            1           2        7         2           0
Entity Diagram        2            2           7        2         0           0

(b)              no clue   confusing   difficult   understandable   easy   very easy
Topic Graph          0          0           1              5          6         1
Entity Diagram       0          2           6              4          1         0
cept for a question-and-answer session in the middle of the session. A 14th expert was used to evaluate the design of the study before collecting the final results. After this first execution, we adapted some unclear instructions, added more information, and changed the layout of the examples.
Detailed Course of the Study. The first step asks
participants for their participant ID to reconstruct the
results. The ID could not be matched to participants' names, only to the test data in the main section. Then
participants indicate their experience with machine
learning and Natural Language Processing on a scale
from 1 (unexperienced) to 6 (expert). For machine
learning, we observed one participant entering 2,
three participants entering 4 and 5 each, and six par-
ticipants entering 3. In regards to Natural Language
Processing, one participant entered 1, three partici-
pants entered 3 and 4 each, and six participants en-
tered 2.
In the second step, UNCOVER‘s components are
introduced in brief texts to test how easily participants
can understand the tool. The introduction is written to explain how the components work and what information they offer, in three texts of less than 130 words each, and contains two pictures. To leave the participants unbiased, common patterns and other information that, in our experience, helps to separate human-written and AI-generated texts were left out. Afterward,
participants were questioned on their understanding
of the components. Entity Diagram and overall Pre-
diction both achieved one “perfect” vote and six votes
each for “good” and “better”. Topic Graph received
nine “better” and four “good” votes.
The third step gives participants time to ask the instructors comprehension questions; six did not use this opportunity and continued further on their own.
The main part of the study takes place in the fourth step, where participants have to evaluate ten unique texts that were randomly chosen from our dataset. Five participants remarked on the good explainability of their own accord while working on this task. Most participants correctly classified seven of the given texts, rarely choosing unsure, achieving a total accuracy of 69 %. They then explained their decision process and what they looked out for. The participants used the text a lot, paying attention to similar sentence structures, synonyms and choice of words, and punctuation. Only five participants named components of UNCOVER, but they claimed to discover patterns and found them helpful for classifying the examples. Next, each participant analyzed one text that they had chosen themselves and given to us before the start of the study. In eight cases, the analysis of the self-chosen text matched the participant's own perspective on the text's author.
In the fifth and final step of the study we asked
participants how often they changed their opinion on
a text based on UNCOVER. Six participants said two
times, three said one time, two said four times, and
zero and seven times were answered by one partici-
pant each. In the next question, participants indicated
how much each component influenced their decision.
The results are shown in Table 2, which also shows the ratings on the clarity of the visualizations.
The last question in the questionnaire asked the par-
ticipants how likely they would want to use this tool
on their own on a scale from 0 (never) to 6 (always).
To this, one participant answered 2, five participants
chose 5, four participants chose 3, and three selected
4. Three participants who gave a score of 3 or below
stated that they believe their own estimation of a text
author is sufficient.
After the study, we held a debriefing with the par-
ticipants to give them the opportunity to ask further
questions about the tool. This part showed that the
participants had a great interest in the functionality of the tool and enjoyed using it.