Classifying Human-Generated and AI-Generated Election Claims in Social Media

Alphaeus Dmonte (1,a), Marcos Zampieri (1,b), Kevin Lybarger (1,c), Massimiliano Albanese (1,d) and Genya Coulter (2)

1 George Mason University, U.S.A.
2 OSET Institute, U.S.A.
Keywords: AI-Generated Content, Misinformation, Elections, LLMs, Authorship Attribution.
Abstract:
Politics is one of the most prevalent topics discussed on social media platforms, particularly during major
election cycles, where users engage in conversations about candidates and electoral processes. Malicious ac-
tors may use this opportunity to disseminate misinformation to undermine trust in the electoral process. The
emergence of Large Language Models (LLMs) exacerbates this issue by enabling malicious actors to generate
misinformation at an unprecedented scale. Artificial intelligence (AI)-generated content is often indistinguish-
able from authentic user content, raising concerns about the integrity of information on social networks. In
this paper, we present a novel taxonomy for characterizing election-related claims. This taxonomy provides an
instrument for analyzing election-related claims, with granular categories related to jurisdiction, equipment,
processes, and the nature of claims. We introduce ElectAI, a novel benchmark dataset comprising 9,900 tweets,
each labeled as human- or AI-generated. We annotated a subset of 1,550 tweets using the proposed taxonomy
to capture the characteristics of election-related claims. We explored the capabilities of LLMs in extracting the
taxonomy attributes and trained various machine learning models using ElectAI to distinguish between human-
and AI-generated posts and identify the specific LLM variant.
1 INTRODUCTION
The widespread use of social media has fundamen-
tally changed political discourse with a direct impact
on elections. Social media enables candidates and po-
litical entities to directly communicate with the elec-
torate through platforms such as X (formerly Twit-
ter), profoundly shaping engagement strategies. This
direct communication channel facilitates the rapid
propagation of claims about election processes, in-
cluding claims of fraud, by politicians and their sup-
porters. Claims of fraud that are intentionally false
or not fully supported by credible evidence can neg-
atively impact election processes and compromise
the integrity of elections. As a result, elections in
many countries, including the United States (US),
have become more contentious in recent years (Schaeffer, 2020).
a https://orcid.org/0009-0009-7896-5834
b https://orcid.org/0000-0002-2346-3847
c https://orcid.org/0000-0001-5798-2664
d https://orcid.org/0000-0002-3207-5810
The emergence of Large Language Models
(LLMs) has revolutionized content generation across
various domains. LLMs are capable of generating flu-
ent and syntactically correct text that is virtually indis-
tinguishable from human-generated text. However, LLMs often
generate factually incorrect statements and, in some
cases, may hallucinate (Huang et al., 2023). For ex-
ample, Galactica (Taylor et al., 2022), an LLM de-
veloped by Meta to summarize research papers and
generate Wikipedia articles, was found to generate
biased and incorrect content (Heaven, 2022). Chat-
GPT was also found to produce biased and inaccurate
outputs (Wach et al., 2023). Furthermore, malicious
actors can leverage the power of LLMs to intention-
ally produce uninformed claims and misinformation and spread them on social media (Zhou et al., 2023).
The convergence of advanced LLMs and the
widespread use of social media creates the perfect
conditions for the spread of election misinformation.
In this paper, we introduce a comprehensive taxon-
omy to further our understanding of human- and AI-
generated election claims in social media. The tax-
onomy features several dimensions that often char-
acterize election claims, such as jurisdiction, elec-
tion infrastructure, and the nature of claims. We use
the taxonomy to annotate ElectAI, the first benchmark
dataset for election claim understanding. ElectAI con-
tains a mix of 9,900 human- and AI-generated tweets
in English. We present several experiments on claim
characterization and authorship identification using
ElectAI with the objective of exploring the following
research questions:
RQ1: How well do state-of-the-art LLMs under-
stand the various attributes of election claims?
RQ2: How do humans and machines compare in
their ability to distinguish between human- and
AI-generated election claims in social media?
Our main contributions are as follows:
1. We propose a novel taxonomy to characterize
and understand election-related claims and the at-
tributes associated with these claims.
2. We present ElectAI, the first benchmark dataset
for election claims. The dataset comprises both
human- and AI-generated tweets annotated with
the proposed taxonomy. We make ElectAI freely available to the research community at https://languagetechnologylab.github.io/ElectAI/.
3. We present an evaluation of four LLMs (Llama-
2, Mistral, Falcon, and Flan-T5) with respect to
claim characterization on the ElectAI dataset.
4. We evaluate the capability of multiple models and
humans to discriminate between human- and AI-
generated posts in the ElectAI dataset.
The remainder of the paper is organized as fol-
lows. Section 2 discusses the importance of fair and
secure elections, motivating our choice of US elec-
tions as a case study, whereas Section 3 discusses
related work. Then, Section 4 introduces the pro-
posed taxonomy and Section 5 presents the ElectAI
dataset. Next, Sections 6 and 7 present our evaluation
of claim understanding and authorship attribution, re-
spectively. Finally, Section 8 provides some conclud-
ing remarks and future research directions.
2 MOTIVATION
Fair and secure elections are the backbone of democ-
racy, and the spread of misinformation on social me-
dia can undermine their integrity. In 2017, former
US Secretary of the Department of Homeland Secu-
rity (DHS), Jeh Johnson, declared that “election in-
frastructure in this country should be designated as a
subsector of the existing Government Facilities crit-
ical infrastructure sector. Given the vital role elec-
tions play in this country, certain systems and assets
of election infrastructure meet the definition of crit-
ical infrastructure, in fact and in law.” (Department
of Homeland Security, 2017) As a result, US elec-
tion infrastructure is considered critical infrastructure,
as much as energy, transportation, or other systems
critical to national security. Threats to election se-
curity can erode voter trust, delegitimize elections,
and weaken democratic institutions. The spread of
misinformation, disinformation, and malinformation
(MDM) is a growing threat to fair and secure elec-
tions. MDM campaigns can target individual polit-
ical candidates, local or state administrative entities,
or even manufacturers of voting equipment. MDM
campaigns erode trust in election processes and can
also have significant financial consequences.
While the proposed approach has been demon-
strated in the context of US elections, it can be eas-
ily generalized to create and study election-related
datasets in any language or geopolitical context.
American elections provide an interesting case study
as they are inherently different from almost any type
of election in other countries. With nearly 10,000
unique election jurisdictions, no other country has the
level of decentralization that US elections do, and no
other nation cedes as much control of federal elec-
tions to state and local bodies. As of 2020, over 260
million Americans were registered to vote. In juris-
dictions such as Los Angeles County, California, vot-
ers will vote for more contests and candidates in one
general election than the average voter overseas might
vote for in their lifetime.
As elections are conducted at the local level, there
is great variability in the infrastructure and technical
expertise of counties and towns within states. While
the decentralization of elections avoids single points
of failure, it requires each jurisdiction to manage and
secure an extremely complex information technol-
ogy system that includes voter registration databases,
electronic poll books, vote-capture devices, optical
scanners, vote tallying devices, election night report-
ing systems, and network technologies to securely
transfer data. The complexity and diversity of this
infrastructure further compound the challenges in as-
sessing election-related claims. Thus, the US is the
ideal test case for ElectAI.
Another dimension contributing to the uniqueness
of US elections is the emphasis on freedom of expres-
sion, allowing anyone to say virtually anything in any
public forum. The United States of America is the
world’s longest-lasting representative democracy, and
it has served as a model to countless other nations in-
spired by the uniquely American ideals of indepen-
dence, self-determination, and free expression. For
voters who live in the United States, there are very
few restrictions on political speech by design. The
First Amendment to the U.S. Constitution enshrines
strong historic protections of freedom of expression,
whether it is speech by individuals or by the press,
free practice of religion, the right of assembly, as well
as the right to petition the government for a redress of
grievances.
At the nexus of these widely admired but difficult-to-sustain ideals is the social media platform now known as X (formerly Twitter), which sits squarely at the intersection of First Amendment rights and protections. While not
as large as competing platforms such as TikTok or
Facebook, it is fair to postulate that the cumulative
impact of documenting current geopolitical events in
real time with comparatively few barriers to entry has
permanently enshrined X into the American lexicon.
Elections, and the manner in which information about
them is disseminated via social media channels, have
been fundamentally transformed by the X platform.
More Americans get their election information via X
than they do from previously trusted sources such as
local newspapers or political parties. X offers an un-
precedented level of instantaneous access to a voter’s
local election officials, as well as direct communi-
cation with state and federal authorities. Failing to
quantify the impact of AI-generated election misin-
formation on social media users now will place media literacy, public confidence in elections, and, potentially, the stability of democratic processes themselves in imminent danger. By design, ElectAI allows
for a narrowing or broadening of scope, without lim-
iting its application to just one field of research.
3 RELATED WORK
There have been several studies addressing automatic
verification of claims, rumor detection, and several
other tasks related to the detection and mitigation of
misinformation. Chen et al. (Chen et al., 2022) proposed a three-stage pipeline for claim verification that (i) encodes the claims, (ii) retrieves relevant documents that may contain supporting evidence, and (iii) finds appropriate evidence in them. Yang
et al. (Yang et al., 2022) propose a Multi-Instance
Learning (MIL) approach for rumor and stance de-
tection. Barbera (Barbera, 2018) explored mis-
information spread and its contributing factors in the
context of the 2016 US Presidential elections.
Recent developments in fact verification systems
have significantly improved the accuracy and effi-
ciency of automatic claim verification in various set-
tings. Hu et al. (Hu et al., 2023) proposed a fact-
verification approach based on training an evidence
retriever and a claim verifier with the retrieved evi-
dence. The approach considers both the faithfulness (reflecting the model’s decision-making process) and the plausibility (being convincing to humans) of the retrieved evidence.
While the great majority of studies on fact verifi-
cation are on English texts, an approach for Ara-
bic claim verification in social media was proposed
by Sheikh Ali et al. (Sheikh Ali et al., 2023). In the proposed pipeline,
they first identify a claim presented in a post and
then evaluate the trustworthiness of the claim. Multi-
modal fact-checking has also been explored in a few
studies such as those by Yao et al. (Yao et al., 2023)
and by Sun et al. (Sun et al., 2023), who introduced
one of the datasets described next.
There have been various datasets developed for
claim understanding and verification. One of the most
widely-used datasets is FEVER (Thorne et al., 2018),
developed for fact extraction and verification. An-
other dataset developed by Sun et al. (Sun et al., 2023)
targets multi-modal misinformation in the medical
domain. The dataset contains multi-modal instances
from news and tweets as well as instances generated
by LLMs. A recent study by Zhou et al. (Zhou et al., 2023) tar-
gets COVID-19 misinformation. The authors gener-
ate COVID-19-related misinformation using GPT-3
and they analyze the linguistic features of human- and
AI-generated texts. The analysis indicates significant
differences between human- and AI-generated texts
concerning the writing style, emotional tone, use of
specific keywords, use of informal language, and sev-
eral other features.
To the best of our knowledge, no other dataset has
been created to address election claim understanding
thus far. Furthermore, only a few datasets contain-
ing human- and AI-generated content have been cu-
rated for claim understanding. This motivates us to
introduce ElectAI and make it freely available to the
research community.
4 TAXONOMY
The proposed taxonomy, shown in Figure 1, was de-
veloped by identifying the most common set of at-
tributes characterizing social media discourse about
elections, and it was validated by subject matter ex-
perts in election administration. Two of the authors
are regular attendees of an annual US-based election-
focused conference bringing together academia, gov-
ernment, industry, and non-profit sector, and one au-
thor is a social media expert affiliated with a nonpar-
Figure 1: The election claim taxonomy. A Statement branches into Jurisdiction (County, State, Federal), Infrastructure (Equipment: Machines, Ballots, Others; Processes: Voter Registration, Vote Counting, Others), and Claim of Fraud (Corruption, Illegal Voting, Others).
tisan election technology research organization work-
ing to increase confidence in elections.
The proposed taxonomy intentionally employs
unambiguous naming conventions, with classification
structures that can be commonly understood by a
diverse group of academic researchers and election
stakeholders, and applied to elections globally. While identical terms may not be employed
worldwide, the foundational concepts of election ju-
risdiction, equipment, processes, and claims are com-
monly used by the election community globally and
are highly relevant topics to the rapidly evolving field
of AI and social media research. The proposed tax-
onomy will assist in the creation of a working set
of comparative standards for future AI research en-
deavors while facilitating the efficient collection of
quantifiable, high-quality training data for future ma-
chine learning initiatives. The race to develop data-
driven methods for protecting election administrators
and the voting public from the increasingly danger-
ous aftereffects of social media disinformation (par-
ticularly MDM generated by AI models) will be crit-
ical during the 2024 Presidential Election, when un-
precedented amounts of malicious election-related so-
cial media content pose the risk of derailing American
elections via ideological capture.
4.1 Jurisdiction
The Jurisdiction attribute refers to the government en-
tity conducting the election and includes three subcat-
egories:
County: Refers to a specific county, such as
Fairfax County, Los Angeles County, or Monroe
County.
State: Refers to a specific state, such as New York
(NY), Virginia (VA), or California (CA).
Federal: Refers to federal elections, namely pres-
idential elections.
4.2 Infrastructure
The Infrastructure attribute encompasses an array of
components employed to carry out elections and in-
cludes the Equipment and Processes subcategories.
Equipment: Equipment refers to any system that is
used to cast or count votes and includes three subcat-
egories:
Machines: Refers to electronic voting equip-
ment, such as direct recording electronic systems
(DREs), ballot marking devices (BMDs), scan-
ners, and electronic poll books (e-poll books).
Ballots: Refers to paper ballots, including mail-in
ballots.
Other: Refers to any other type of voting equip-
ment not captured by the Machines or Ballots sub-
categories.
Processes: Processes refers to activities conducted dur-
ing the administration of elections and includes three
subcategories:
Voter Registration: Refers to the process of reg-
istering eligible voters in a centrally managed
voter registration database.
Vote Counting: Refers to the process of tally-
ing and tabulating votes cast in an election. This
includes verifying the authenticity of ballots and
systematically counting the votes to determine the
outcome of the election.
SECRYPT 2024 - 21st International Conference on Security and Cryptography
240
Other: Refers to any other election process not
captured by the Voter Registration or Vote Count-
ing subcategories.
4.3 Claim of Fraud
The Claim of Fraud attribute refers to claims of
election-related fraud in a media post and includes
three subcategories:
Corruption: Refers to claims of corruption (e.g.,
a candidate offering money or favors in exchange
for votes).
Illegal Voting: Refers to claims of illegal vot-
ing (e.g., votes cast by illegal aliens or individuals
voting multiple times).
Other: Refers to any other claims of fraud not
captured by the Corruption or Illegal Voting sub-
categories (e.g., ballot tampering or coercion tac-
tics).
5 THE ElectAI DATASET
5.1 Data Collection and Generation
The ElectAI dataset comprises a total of 9,900 tweets
that were either human- or AI-generated. All tweets
include a label indicating whether they are created by
a human or AI, and all AI-generated tweets include a
label indicating the LLM variant that was used to cre-
ate them. A subset of 1,550 tweets was annotated us-
ing the proposed taxonomy, as reported in Table 1, in-
cluding 850 human-generated and 700 AI-generated
tweets.
Table 1: Summary of annotated posts.

                          Human    AI   Total
First Annotation Round      600     0     600
Second Annotation Round     250   700     950
Total                       850   700   1,550
To populate the human-generated part of the
dataset, we initially sampled the existing Voter-
Fraud2020 dataset (Abilov et al., 2021), a multimodal
dataset consisting of posts about potential events
during the 2020 US presidential elections. Since
the VoterFraud2020 dataset also includes tweets that
are general statements, we use specific keywords (e.g., jurisdiction, equipment, process, device, scanner, counting, registration, and corruption) to identify and sample 850 tweets relevant to our task. We
use the proposed taxonomy to create narrative frames,
as described in recent work on COVID-19 (Zhou
et al., 2023), to generate 700 synthetic tweets us-
ing LLMs. These tweets were generated after the
first round of annotations. The taxonomy attributes
were used to generate tweets that spanned a diverse
range of election-related topics and mimicked human-
generated tweets. We use the following LLMs to gen-
erate the tweets.
Falcon-7B-Instruct. (Penedo et al., 2023) hence-
forth Falcon, is a decoder-only model fine-tuned with
instruct and chat datasets. This model was adapted
from the GPT-3 (Brown et al., 2020) model, with dif-
ferences in the positional embeddings, attention, and
decoder-block components. The base model Falcon-
7B, from which this model was fine-tuned, outperforms other open-source LLMs such as MPT-7B and
RedPajama, among others. The limitation of this
model is that it was mostly trained on English data,
and hence, it does not perform well in other lan-
guages.
Llama-2-7B-Chat. (Touvron et al., 2023) hence-
forth Llama-2, is an auto-regressive language model
with an optimized transformer architecture. This model
was optimized for dialogue use cases. The model
was trained using publicly available online data. The
model outperforms most other open-source chat mod-
els and has a performance similar to models like Chat-
GPT. This model, however, works best only for En-
glish.
Mistral-7B-Instruct-v0.2. (Jiang et al., 2023)
henceforth Mistral, is a transformer-based large lan-
guage model that couples grouped-query attention with sliding-window attention. The model uses a byte-fallback to-
kenizer. The model outperforms several open-source
LLMs including Llama-2-13B on various NLP tasks.
The following prompt was used to generate the
tweets.
Question: Write {# of tweets} unique tweets us-
ing informal language and without the use of opin-
ion statements, declarative statements, and call-to-
action statements, about {claim of fraud} caused
by {infrastructure} in {state}. You may choose
the actual county in the state mentioned. You may
include actual websites, people’s names, and user-
names.
Answer:
The prompt includes the claim of fraud, infras-
tructure, and state attributes of the taxonomy. The
labels from the taxonomy were substituted for prompt
attributes: equipment or processes for infrastructure, corruption or illegal voting for claim of fraud, and specific states, such as Arizona or Virginia, for state. We only use state-level juris-
dictions as this better mimics the human tweets. How-
ever, the prompt includes instructions to include the
county jurisdiction for some tweets. The prompt also
includes specific instructions regarding the number of
tweets to be generated. We use this dataset consist-
ing of 1,550 human- and AI-generated tweets as an
evaluation dataset for the claim understanding task,
where we extract the claim-related attributes from the
tweets.
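As an illustration, the following sketch shows how the prompt template above could be instantiated and submitted to one of the generators, assuming the Hugging Face transformers library and the public Mistral-7B-Instruct-v0.2 checkpoint. The decoding settings are our assumptions, as the paper does not report them.

# Sketch of the synthetic-tweet generation step (decoding settings assumed).
from transformers import pipeline

PROMPT = (
    "Question: Write {n} unique tweets using informal language and "
    "without the use of opinion statements, declarative statements, and "
    "call-to-action statements, about {fraud} caused by {infrastructure} "
    "in {state}. You may choose the actual county in the state mentioned. "
    "You may include actual websites, people's names, and usernames.\n"
    "Answer:"
)

generator = pipeline("text-generation",
                     model="mistralai/Mistral-7B-Instruct-v0.2",
                     device_map="auto")

prompt = PROMPT.format(n=10, fraud="illegal voting",
                       infrastructure="equipment", state="Arizona")
result = generator(prompt, max_new_tokens=512, do_sample=True,
                   temperature=0.8, return_full_text=False)
print(result[0]["generated_text"])  # generated tweets, to be split and cleaned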
In addition to exploring the extraction of claim-
related attributes from human- and AI-generated
tweets, we explored the ability of humans and ma-
chine learning models to distinguish between human-
and AI-generated tweets. To facilitate this explo-
ration, we generate additional tweets with Llama-
2, Falcon, and Mistral using the prompt described
above. For the human-generated portion, we sam-
ple the VoterFraud2020 dataset. Our training dataset
for this authorship attribution task consists of 8,000
tweets, each labeled as human, llama, falcon, or mistral, depending on its origin. Table 2 shows example in-
stances from the training dataset.
Table 2: Example tweets from the training dataset.

Tweet: @RealNews: Why are people trying to vote twice in Arkansas? Isn’t that illegal? #VoterFraud #BaldFacedLie
Label: human

Tweet: Breaking news out of Guilford County! It looks like someone has been tampering with ballots! Stay vigilant and follow @integritywatchdog for updates. #GuilfordCounty #ElectionCorruption #BallotTampering
Label: llama

Tweet: Voting rights in #Iowa are in jeopardy! Anyone with information about voter registration irregularities or suspicious activity should contact @Iowa Election immediately - @InformedIowan
Label: falcon

Tweet: You know what’s not cool? Voter registration irregularities in Kent County, Rhode Island. Hopefully the Board of Elections can sort this out before it’s too late! #CleanVoterRolls
Label: mistral
The claim characterization dataset is reused as a
test dataset for the authorship attribution task. How-
ever, for this task, we re-annotate each tweet of the
test dataset with a single label, depending on the
model used to generate the tweet. We generate an additional 350 tweets using the Mistral model to be included in the test dataset. These 350 tweets are not annotated using the taxonomy, as the Mistral model was not publicly available at the time of annotation; however, we include them in the authorship attribution task so that all labels are represented.
5.2 Annotation
We first developed a preliminary taxonomy that cap-
tured information related to jurisdiction, infrastruc-
ture, processes, and claims of fraud, similar to the
final taxonomy presented in Figure 1. The prelimi-
nary taxonomy had more granular attributes (36 leaf
nodes) than the taxonomy presented in Figure 1 (12
leaf nodes). Using this preliminary taxonomy, a team
of undergraduate and graduate students annotated 600
of the 850 human-generated tweets sampled from the
VoterFraud2020 dataset. Each student annotated a
sample of 50 tweets. Each tweet was annotated by
up to three annotators, and a majority vote was used
to adjudicate discrepancies and create the gold refer-
ence standard. We used this initial annotation effort
to refine and consolidate the preliminary taxonomy to
create the taxonomy shown in Figure 1.
Table 3 presents the inter-annotator agreement
(IAA) for the taxonomy attributes, as well as the over-
all score.
Table 3: IAA, measured as F1. Classes included in the
claim understanding experiments are presented in bold.
Label F1
Jurisdiction - State 0.86
Jurisdiction - County 0.66
Jurisdiction - Federal 0.28
Equipment - Machines 0.79
Equipment - Ballots 0.70
Equipment - Other 0.04
Processes - Voter Registration 0.16
Processes - Vote Counting 0.53
Processes - Other 0.21
Claim of Fraud - Corruption 0.68
Claim of Fraud - Illegal Voting 0.65
Claim of Fraud - Other 0.45
Overall 0.59
We calculate IAA for the doubly annotated sam-
ples using F1-score, by holding one annotator as the
reference and the other annotator as the prediction.
We use F1 to provide a metric that can be compared
with our classification results. We then used the re-
vised taxonomy and guidelines to annotate an addi-
tional 250 human-generated tweets along with the 700
AI-generated ones, for a total of 950 tweets in round 2.
Figure 2: Annotation statistics based on the two rounds of annotation.
Each tweet was annotated by up to three anno-
tators. The data were annotated by undergraduate
and graduate students in information technology and
computer science. We provided detailed guidelines
crafted based on the taxonomy, and the tweets were
assigned a label using a majority vote.
As can be seen in Table 3, there is substantial to high agreement for the labels Jurisdiction - State, Jurisdiction - County, Equipment - Machines, Equipment - Ballots, Claim of Fraud - Corruption, and Claim of Fraud - Illegal Voting. However, the labels Jurisdiction - Federal, Equipment - Other, Processes - Voter Registration, Processes - Vote Counting, Processes - Other, and Claim of Fraud - Other show only slight to fair agreement. The overall F1-score of 0.59 indicates moderate agreement between the annotators across all labels, which can be attributed to the low agreement on some of them.
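A minimal sketch of the pairwise IAA computation described above: annotator A's binary decisions for one taxonomy label are treated as the reference and annotator B's as the prediction, so the agreement score is directly comparable to classification F1. The example arrays are hypothetical.

# Pairwise IAA as F1: one annotator is the reference, the other the prediction.
from sklearn.metrics import f1_score

ann_a = [1, 0, 1, 1, 0, 1]  # annotator A, e.g., "Jurisdiction - State"
ann_b = [1, 0, 0, 1, 0, 1]  # annotator B, same tweets and label

print(f"IAA (F1) = {f1_score(ann_a, ann_b):.2f}")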
5.3 Data Statistics
The 600 tweets from the first annotation effort were
combined with the 950 tweets annotated using the re-
fined taxonomy, to get a dataset with 1,550 annotated
tweets, that we use for the claim understanding task.
To maintain consistency between the annotations, we
map the annotations from the first annotation effort to
the refined taxonomy. Figure 2 shows the statistics of
the dataset for each label. Most of the tweets mention Jurisdiction - State, whereas few mention Processes - Voter Registration, Equipment - Other, or Jurisdiction - Federal. Mentions of the various claims of fraud are almost evenly distributed across the tweets. We observed that some tweets mention more than one type of claim of fraud, and we made a similar observation for other categories, such as Equipment and Jurisdiction.
The authorship dataset consists of tweets with multi-class labels indicating the tweet author: human, llama, falcon, or mistral. Table 4 shows the num-
ber of instances for each of the labels in the training
and the test sets.
Table 4: Label distribution for instances in the training and
test datasets.
Dataset llama falcon mistral human Total
Train 2000 2000 2000 2000 8000
Test 365 310 350 875 1900
6 CLAIM CHARACTERIZATION
In this section, we report the performance of multiple
LLMs in extracting the taxonomy attributes from the
ElectAI dataset.
6.1 Large Language Models
Recently, LLMs have been widely used for several
Natural Language Processing (NLP) tasks, includ-
ing text generation and text classification. We use
the three LLMs described in Section 5.1, as well as
the Flan-T5 model described below, for claim under-
standing (extraction of taxonomy attributes).
Flan-T5-XL. (Chung et al., 2024) henceforth Flan-
T5, is a language model based on the T5 (Raffel
et al., 2020) model, which is a Text-to-Text trans-
former model. This model was fine-tuned on over 1,000 tasks to improve zero-shot and few-shot performance. The model is one of the few LLMs with
support for languages other than English.
6.2 Approach
We approach the election claim understanding task as
a question-answering (Q&A) task, using LLMs in an
in-context learning setting. In this Q&A setting, the
input prompt is a question focused on a single tax-
onomy attribute, along with the target tweet, and the
output is a yes/no response, indicating whether the
specific attribute is relevant to the tweet. The set of
questions that span the taxonomy attributes is shown
in Table 5. In initial experimentation, we found that
the taxonomy attributes associated with Other, like
Jurisdiction-Other, were ambiguous and challenging
for the models to extract. In mapping the taxonomy
attributes to the questions in Table 5, we omitted the
attributes associated with Other and instead included
the associated parent label. For example, Jurisdiction-
Other was replaced with a broader Jurisdiction la-
bel indicating whether any jurisdiction information is
present (yes vs. no). We explored zero-shot learning,
where no example inputs and outputs are provided in
the prompt, and few-shot learning, where 3-5 input-
output pairs are provided as examples in the prompt.
However, the few-shot experimentation did not im-
prove performance, so we only present the zero-shot
results. Information extraction performance is evalu-
ated using the F1-score. The following prompt was
used to elicit model predictions.
Tweet: {tweet}
Answer the following question with a yes or a no.
Question: {question}
Answer:
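The sketch below illustrates this zero-shot Q&A setup with the Hugging Face transformers pipeline; greedy decoding and the "starts with yes" answer normalization are our assumptions, as the paper does not specify them.

# Zero-shot Q&A over taxonomy attributes: one prompt per question per tweet.
from transformers import pipeline

qa = pipeline("text-generation",
              model="mistralai/Mistral-7B-Instruct-v0.2",
              device_map="auto")

TEMPLATE = ("Tweet: {tweet}\n"
            "Answer the following question with a yes or a no.\n"
            "Question: {question}\n"
            "Answer:")

def predict(tweet, question):
    # Map the model's free-text reply to a binary yes/no prediction.
    prompt = TEMPLATE.format(tweet=tweet, question=question)
    reply = qa(prompt, max_new_tokens=5, do_sample=False,
               return_full_text=False)[0]["generated_text"]
    return reply.strip().lower().startswith("yes")

print(predict("Ballot tampering reported in Maricopa County! #ballots",
              "Does the tweet mention ballots or related equipment?"))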
6.3 Results and Discussion
Table 6 shows the F1 scores for the LLMs evaluated
using the annotated benchmark dataset.
Mistral achieved the best overall performance at
0.667 F1, followed by the Flan-T5 model at 0.661
F1. Falcon achieved the lowest performance among
the evaluated models at 0.504 F1. We also show the
F1 scores for each of the labels used. Apart from the Falcon model, all the models performed better at identifying the Jurisdiction and Jurisdiction - State labels. The Mistral and Flan-T5 models have a
Table 5: The input questions to the zero-shot prompts. Each
question is based on an associated label from the taxonomy.
ID Question
1 Does the tweet mention a jurisdiction?
2 Does the tweet mention a state election?
3 Does the tweet mention a county election?
4 Does the tweet mention the federal election?
5 Does the tweet mention any election equipment?
6 Does the tweet mention electronic voting equipment machines?
7 Does the tweet mention ballots or related equipment?
8 Does the tweet mention any election-related process?
9 Does the tweet mention the vote counting process?
10 Does the tweet mention any election-related claim of fraud?
11 Does the tweet mention corruption in elections?
12 Does the tweet mention illegal voting?
Table 6: Zero-shot in-context learning results. We report
the F1 scores for individual labels as well as the macro-F1
score for each of the models.
Label                        Llama-2  Mistral  Falcon  Flan-T5
Jurisdiction 0.828 0.851 0.554 0.898
Jurisdiction - State 0.865 0.923 0.602 0.933
Jurisdiction - County 0.562 0.852 0.705 0.824
Jurisdiction - Federal 0.239 0.290 0.263 0.277
Equipment 0.526 0.429 0.427 0.524
Equipment - Machines 0.720 0.904 0.692 0.870
Equipment - Ballots 0.510 0.609 0.540 0.549
Processes 0.598 0.606 0.398 0.607
Processes - Vote Counting 0.539 0.644 0.536 0.419
Claim of Fraud 0.897 0.758 0.513 0.917
Claim of Fraud - Corruption 0.452 0.523 0.497 0.544
Claim of Fraud - Illegal Voting 0.577 0.611 0.317 0.571
Overall 0.610 0.667 0.504 0.661
better performance for the Equipment - Machines label than the Falcon and Llama-2 models.
However, the performance of all the models in identi-
fying the Jurisdiction-Federal label is low. Our anal-
ysis suggests that most tweets do not contain explicit
references to federal elections and that federal juris-
diction is typically implicit. Performance for labels
like Equipment and Processes is low. Some tweets
may mention equipment, like voting machines, bal-
lots, or other election equipment. However, the mod-
els struggle to identify this information from a given
tweet. The tweets are annotated for particular labels
even if some information is implicit. Hashtags are
also considered during the annotation process. We
observe that some LLMs are unable to identify this
implicit information. For example, if the word “bal-
lots” is preceded by a hashtag (“#ballots”), the LLMs
are unable to accurately identify the tweets referenc-
ing ballots. Another issue is the LLM’s ability to un-
derstand and extract information from a given text.
For example, the LLMs may label a tweet as men-
tioning a machine, even though voting machines are
not mentioned in the tweet. Similarly, the LLMs may
be unable to detect some specific terms and phrases
like the name of a county if the word ’county’ is not
explicitly mentioned. These issues affect the perfor-
mance of the LLMs, as evident from the results.
7 AUTHORSHIP ATTRIBUTION
In this section, we report on our evaluation of the authorship attribution task, where the objective is to identify the author of a tweet, whether a human or a specific LLM.
7.1 Classification Models
We conducted experiments with various models as
outlined below.
Random Forest. (Breiman, 2001) is a supervised
machine-learning algorithm based on decision trees.
It aggregates the outputs of multiple decision trees
to produce the final output. Our experimentation in-
volves utilizing two types of input features: Term
Frequency - Inverse Document Frequency (TF-IDF)
and Word2Vec embeddings (Mikolov et al., 2013a;
Mikolov et al., 2013b; Mikolov et al., 2013c), where
we add the individual word vectors for each word in
the sentence to output a sentence vector.
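A small sketch of this additive sentence-vector construction, assuming gensim's Word2Vec and a toy tweet corpus; the paper does not state which Word2Vec vectors were used, so training on the corpus itself is our illustrative choice.

# Sum the Word2Vec vectors of a tweet's tokens into one sentence vector.
import numpy as np
from gensim.models import Word2Vec

corpus = [["ballot", "tampering", "in", "guilford", "county"],
          ["voter", "registration", "issues", "in", "iowa"]]
w2v = Word2Vec(sentences=corpus, vector_size=100, min_count=1, seed=0)

def sentence_vector(tokens, model):
    """Sum the word vectors of in-vocabulary tokens (zeros if none)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([sentence_vector(s, w2v) for s in corpus])
print(X.shape)  # (2, 100)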
Bidirectional Encoder Representations from
Transformers (BERT). (Devlin et al., 2019) is
a transformer-based model that has become ubiq-
uitous in NLP. It has demonstrated state-of-the-art
performance across various NLP tasks. In our
study, we fine-tuned the BERT-base model with a
multi-class classification objective. The BERT-base
model is pre-trained on a substantial corpus of data,
contributing to its robustness and effectiveness.
Robustly Optimized BERT Pretraining Approach
(RoBERTa). (Liu et al., 2019) is a variation of BERT that uses a dynamic masking strategy, which improves feature learning. Similar to BERT,
we fine-tune this model for multi-class classification.
7.2 Approach
We conducted multi-class classification to distinguish
between human- and AI-generated tweets and resolve
the specific LLM author. Three models, as described
previously, were either trained or fine-tuned for this
task. The Random Forest classifier was trained us-
ing two sets of features: TF-IDF and Word2Vec
embeddings. We utilized grid search to determine
the optimal values for the two hyperparameters, N-
estimators and Max Depth. Specifically, we ex-
plored values of N-estimators=[50,100,200] and Max
Depth=[20,40,50]. For the BERT and RoBERTa mod-
els, we employed the AdamW optimizer with a learn-
ing rate of 1e-5 and trained the models for 3 epochs. All models were trained or fine-tuned on the training data and evaluated on the test dataset. We report the F1 score for each model.
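A sketch of the Random Forest grid search with TF-IDF features, using scikit-learn and the hyperparameter grids stated above; the training texts and labels are stand-ins for the 8,000-tweet training set.

# Random Forest + TF-IDF with the grid search described above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Stand-in data; replace with the 8,000 training tweets and their labels.
train_texts = [f"placeholder tweet number {i} about ballots" for i in range(20)]
train_labels = ["human", "llama", "falcon", "mistral"] * 5

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(random_state=0)),
])
grid = GridSearchCV(
    clf,
    param_grid={"rf__n_estimators": [50, 100, 200],
                "rf__max_depth": [20, 40, 50]},
    scoring="f1_macro",
)
grid.fit(train_texts, train_labels)
# On the real data, the best TF-IDF configuration reported below is
# n_estimators=200 and max_depth=50.
print(grid.best_params_)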
7.3 Turing Test
We performed a Turing test to evaluate how well AI-
generated tweets mimicked human-generated tweets.
During the second phase of annotations, the annota-
tors were asked to label the tweets as human- or AI-
generated. These annotations were compared with the
actual labels for the tweets. The overall accuracy in
identifying the human- and AI-generated tweets was 36.5%. We also measured the annotators’ accuracy on the AI-generated tweets alone: only about 19% of the AI-generated tweets were correctly identified as such. These results indicate that the AI-generated tweets closely resemble human-written tweets. We observe specific
patterns in the tweets generated by specific LLMs.
Identifying these patterns and writing styles becomes
straightforward if these tweets are presented consec-
utively. However, it may be difficult to identify these
patterns if the tweets are presented randomly, espe-
cially with the human-written tweets included among
the AI-generated tweets.
7.4 Results and Discussion
The best Random Forest classifier with TF-IDF vectors had N-estimators=200 and Max Depth=50, while the best Random Forest classifier with Word2Vec embeddings had N-estimators=200 and Max Depth=40. Both the BERT and RoBERTa models achieved their maximum training F1 score at epoch 2.
Table 7 shows the F1 scores for each of the in-
dividual labels as well as the overall model. The
transformer-based models, BERT and RoBERTa,
achieve higher performance than the Random For-
est models overall. All models are better at identifying the human-generated tweets, achieving a higher F1-score on them than on the tweets generated using LLMs. The high performance of the models, especially the transformer-based ones, can be attributed to the similarity among the tweets generated by each LLM. We observe that tweets generated by each LLM tend to be homo-
Table 7: Model performance for the Authorship Attribution task. We report the F1 scores for each label as well as the
macro-F1 scores for each of the models.
Model                                     human   llama   falcon  mistral  Overall
Random Forest + TF-IDF 0.982 0.638 0.752 0.861 0.837
Random Forest + Word2Vec Embeddings 0.926 0.518 0.580 0.617 0.660
BERT 0.991 0.964 0.975 0.953 0.971
RoBERTa 0.998 0.974 0.987 0.975 0.983
Figure 3: Clusters of the embeddings used to train the machine learning models for the authorship attribution task: (a) TF-IDF vectors, (b) Word2Vec embeddings, (c) BERT embeddings, (d) RoBERTa embeddings.
geneous in sentence structure, vocabulary, phrasing,
and other linguistic features. This makes it easier for
the machine learning models to classify the tweets for
the authorship attribution task, as they are exposed to
a large number of tweets from each LLM in training.
To better understand the linguistic similarities
and differences across the human and AI-generated
tweets, we clustered the tweets based on the em-
beddings used to train the models, as shown in Fig-
ure 3. For the transformer models, we concatenated
the last four hidden layers of the pre-trained BERT
and RoBERTa models to create sentence embeddings,
based on prior work (Devlin et al., 2019). Note
that these embeddings were generated using the pre-
trained models, not the fine-tuned versions from our
experimentation. We use t-SNE (Van der Maaten and
Hinton, 2008) to convert the high-dimensional em-
beddings to two-dimensional vectors. These are then
plotted based on their respective author labels. We ob-
serve that clusters are formed based on the LLMs used
to generate the tweets, and these clusters are separa-
ble for the transformer-based models. Similarly, for
the TF-IDF embeddings, the human-generated tweets
form a separate cluster and are separable from the
others. However, Word2Vec embeddings form small
clusters, where the authors are not as easily sepa-
rated. The separability of the clusters using the TF-
IDF, BERT, and RoBERTa embeddings demonstrates
that each LLM uses a distinct voice, which makes
it relatively easy to resolve the tweet author. The tweets contain inherent patterns that humans cannot readily perceive, but machine-learning models can identify these patterns and separate the tweets. These patterns do become increasingly apparent to humans when they are shown a greater quantity of tweets generated by a specific LLM.
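A sketch of this visualization pipeline, assuming the pre-trained bert-base-uncased checkpoint; pooling the concatenated layers at the [CLS] position is our assumption, as the paper only states that the last four hidden layers were concatenated.

# Concatenate BERT's last four hidden layers, then project to 2-D with t-SNE.
import torch
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased",
                                  output_hidden_states=True)

tweets = ["Ballot tampering in Guilford County!",
          "Voter registration issues in Iowa."]  # stand-in examples

with torch.no_grad():
    enc = tokenizer(tweets, padding=True, truncation=True,
                    return_tensors="pt")
    hidden = model(**enc).hidden_states           # tuple of 13 layer outputs
    last4 = torch.cat(hidden[-4:], dim=-1)        # (batch, seq, 4*768)
    emb = last4[:, 0, :].numpy()                  # [CLS] position (assumed pooling)

coords = TSNE(n_components=2, perplexity=1.0,
              random_state=0).fit_transform(emb)  # 2-D points to plot
print(coords.shape)  # (2, 2)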
8 CONCLUSIONS
We presented a novel taxonomy developed to further
our understanding of election claims in social me-
dia. We used this taxonomy to curate ElectAI, the first
benchmark dataset for election claim understanding
containing tweets generated by humans and AI. Using
ElectAI, we performed various experiments on claim
characterization answering RQ1 and authorship
attribution of human- vs. AI-generated tweets an-
swering RQ2.
With respect to RQ1, our results showed that al-
though LLMs have achieved state-of-the-art perfor-
mance on several NLP tasks, these models have a
moderate performance on claim understanding in an
in-context learning setting. Among the models tested,
we showed that Mistral performed best with an overall F1 score of 0.667, while Falcon performed worst at 0.504. With respect to RQ2,
we showed that humans perform very poorly in dis-
criminating between human- and AI-generated con-
tent. The results of a Turing Test show that humans
can correctly discriminate human- from AI-generated content with just over 36% accuracy. Computational mod-
els, on the other hand, perform very well on this task
with a RoBERTa model achieving a 0.983 F1 score in
identifying the authorship of tweets in a four-class ex-
periment with tweets generated by humans and three
different LLMs.
As part of our future work, we plan to annotate
more instances of the ElectAI dataset using the pro-
posed taxonomy and conduct experiments on a larger
scale. We plan to evaluate various approaches that
can potentially improve the performance of the mod-
els, such as few-shot learning, instruction fine-tuning,
and chain-of-thought prompting. Finally, the work
presented here is a first step toward identifying mis-
information in election claims. We further plan to ex-
tend this work to identify misinformation and verify
the claims using fact-verification approaches.
REFERENCES
Abilov, A., Hua, Y., Matatov, H., Amir, O., and Naaman,
M. (2021). VoterFraud2020: a multi-modal dataset of election fraud claims on Twitter.
Barbera, P. (2018). Explaining the spread of misinformation
on social media: evidence from the 2016 US presiden-
tial election. APSA Comparative Politics Newsletter.
Breiman, L. (2001). Random forests. Machine learning,
45:5–32.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., et al. (2020). Language models are few-
shot learners. Advances in neural information pro-
cessing systems, 33:1877–1901.
Chen, J., Zhang, R., Guo, J., Fan, Y., and Cheng, X. (2022).
Gere: Generative evidence retrieval for fact verifica-
tion. In Proceedings of the 45th International ACM
SIGIR Conference on Research and Development in
Information Retrieval, pages 2184–2189.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fe-
dus, W., Li, Y., Wang, X., Dehghani, M., Brahma,
S., et al. (2024). Scaling instruction-finetuned lan-
guage models. Journal of Machine Learning Re-
search, 25(70):1–53.
Department of Homeland Security (2017). Statement by
Secretary Jeh Johnson on the designation of election
infrastructure as a critical infrastructure subsector.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proceed-
ings of the 2019 Conference of the North American
Chapter of the Association for Computational Lin-
guistics: Human Language Technologies, Volume 1,
pages 4171–4186.
Heaven, W. D. (2022). Why Meta’s latest large language model survived only three days online. MIT Technology Review.
Hu, X., Hong, Z., Guo, Z., Wen, L., and Yu, P. (2023).
Read it twice: Towards faithfully interpretable fact
verification by revisiting evidence. In Proceedings
of the 46th International ACM SIGIR Conference on
Research and Development in Information Retrieval,
pages 2319–2323.
Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H.,
Chen, Q., Peng, W., Feng, X., Qin, B., et al. (2023).
A survey on hallucination in large language models:
Principles, taxonomy, challenges, and open questions.
arXiv preprint arXiv:2311.05232.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C.,
Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel,
G., Lample, G., Saulnier, L., et al. (2023). Mistral 7b.
arXiv preprint arXiv:2310.06825.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,
V. (2019). Roberta: A robustly optimized BERT pre-
training approach. arXiv preprint arXiv:1907.11692.
Mikolov, T., Chen, K., Corrado, G. S., and Dean, J. (2013a).
Efficient estimation of word representations in vector
space. In International Conference on Learning Rep-
resentations.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013b). Distributed representations of words
and phrases and their compositionality. Advances in
neural information processing systems, 26.
Mikolov, T., Yih, W.-t., and Zweig, G. (2013c). Linguis-
tic regularities in continuous space word representa-
tions. In Proceedings of the 2013 Conference of the
North American chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, pages 746–751.
Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cap-
pelli, A., Alobeidli, H., Pannier, B., Almazrouei,
E., and Launay, J. (2023). The RefinedWeb dataset
for Falcon LLM: outperforming curated corpora with
web data, and web data only. arXiv preprint
arXiv:2306.01116.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020).
Exploring the limits of transfer learning with a uni-
fied text-to-text transformer. The Journal of Machine
Learning Research, 21(1):5485–5551.
Schaeffer, K. (2020). Far more Americans see ‘very strong’
partisan conflicts now than in the last two presidential
election years. Pew Research Center.
Sheikh Ali, Z., Mansour, W., Haouari, F., Hasanain, M.,
Elsayed, T., and Al-Ali, A. (2023). Tahaqqaq: a real-
time system for assisting twitter users in arabic claim
verification. In Proceedings of the 46th international
ACM SIGIR conference on research and development
in information retrieval, pages 3170–3174.
Sun, Y., He, J., Lei, S., Cui, L., and Lu, C.-T. (2023). Med-
mmhl: A multi-modal dataset for detecting human-
and llm-generated misinformation in the medical do-
main. arXiv preprint arXiv:2306.08871.
Taylor, R., Kardas, M., Cucurull, G., Scialom, T.,
Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V.,
and Stojnic, R. (2022). Galactica: A large language
model for science. arXiv preprint arXiv:2211.09085.
Thorne, J., Vlachos, A., Christodoulopoulos, C., and Mittal,
A. (2018). FEVER: a large-scale dataset for fact ex-
traction and verification. In Proceedings of the 2018
Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Lan-
guage Technologies, Volume 1, pages 809–819.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi,
A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava,
P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C.,
Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu,
J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal,
N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H.,
Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I.,
Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T.,
Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X.,
Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poul-
ton, A., Reizenstein, J., Rungta, R., Saladi, K., Schel-
ten, A., Silva, R., Smith, E. M., Subramanian, R.,
Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan,
J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A.,
Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R.,
Edunov, S., and Scialom, T. (2023). Llama 2: Open
foundation and fine-tuned chat models. arXiv preprint
arXiv:2307.09288.
Van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).
Wach, K., Duong, C. D., Ejdys, J., Kazlauskaitė, R.,
Korzynski, P., Mazurek, G., Paliszkiewicz, J., and
Ziemba, E. (2023). The dark side of generative arti-
ficial intelligence: A critical analysis of controversies
and risks of ChatGPT. Entrepreneurial Business and
Economics Review, 11(2):7–24.
Yang, R., Ma, J., Lin, H., and Gao, W. (2022). A weakly
supervised propagation model for rumor verification
and stance detection with multiple instance learning.
In Proceedings of the 45th International ACM SIGIR
Conference on Research and Development in Informa-
tion Retrieval, pages 1761–1772.
Yao, B. M., Shah, A., Sun, L., Cho, J.-H., and Huang, L.
(2023). End-to-end multimodal fact-checking and ex-
planation generation: A challenging dataset and mod-
els. In Proceedings of the 46th International ACM
SIGIR Conference on Research and Development in
Information Retrieval, pages 2733–2743.
Zhou, J., Zhang, Y., Luo, Q., Parker, A. G., and De Choud-
hury, M. (2023). Synthetic lies: Understanding AI-
generated misinformation and evaluating algorithmic
and human solutions. In Proceedings of the 2023 CHI
Conference on Human Factors in Computing Systems,
pages 1–20.