approach to generate text, leading to the Analogy
Materials Corpus (AMC) used in this paper. Finally,
we compare the human and machine AMC texts using
sentiment analysis.
This paper is structured as follows. First, we
discuss the background for comparing human- and
LLM-generated text. We describe the HC3 corpus (and
each of its constituent sub-collections) before
describing how the human portion of the AMC corpus
was compiled. We then detail how an LLM was used
to generate analogous texts before presenting an
analysis of the AMC texts.
We briefly describe our system before
presenting and analysing our results, first on the
HC3 corpus and then on the AMC. Finally,
some conclusions and future work are discussed.
2 BACKGROUND
Widespread adoption of LLMs since ChatGPT has
raised concerns about their output and the presence of
any hidden biases therein. Studies of LLMs have
shown that the larger and more powerful models possess
some surprising abilities, such as the ability to
interpret analogical comparisons (Webb et al., 2023),
and an ability in terms of Theory of
Mind (Strachan et al., 2024). Ichien and Holyoak (2024)
evaluated text that they were confident was not
included in any LLM training data to assess GPT-
4's ability to detect and explain any contained
metaphors. We did not follow this effort to ensure the
novelty of the query text, as we wish to better reflect
typical usage of these LLMs, which includes a
combination of novel and familiar text in each query.
We argue that analogies offer a better mechanism
to explore the sentiment of machine-generated text.
While the question-and-answer scenario restricts the
range of possible responses to a prompt, generating
novel analogies in contrast opens a much wider range
of response types and topics. The semantic restriction
that a question imposes on the range of possible
answers is in effect removed by asking the LLM
to generate a comparable story, one that requires,
or is even founded upon, a reasonable semantic
distance between the original and generated stories.
Thus, we argue that generating analogous stories
to a presented text imposes fewer constraints on the
responses and thereby uncovers a more faithful
reflection of the contents and biases contained within
the LLM itself. Later in this paper we
detail the Analogy Materials Corpus (AMC), which
contains parallel, analogous pairs of human- and
machine-generated English texts.
3 PARALLEL CORPORA OF
HUMAN AND MACHINE TEXT
This section describes two corpora of parallel human-
and machine-generated text. The first is the pre-existing
Human ChatGPT Comparison Corpus (HC3)
(Guo et al., 2023), which was produced under a
question-answer scenario by recording comparable
answers to a given list of questions.
3.1 HC3 Corpus
The Human ChatGPT Comparison Corpus (HC3) is a
collection of 24,322 questions, 58,546
human answers, and 26,903 ChatGPT answers (Guo
et al., 2023). The corpus contains paired responses
from both human experts and ChatGPT, allowing
comparison of broad trends in the ability of each to
generate text. Questions were grouped according to
theme, covering open-domain, financial, medical,
legal, and psychological areas. Their lexical analysis
showed that ChatGPT uses more NOUN, VERB,
DET, ADJ, AUX, CCONJ, and PART words, while
using fewer ADV and PUNCT words.
Sentiment analysis of the text in (Guo et al., 2023)
used a version of RoBERTa that was fine-tuned on a
Twitter corpus. Additionally, their sentiment analysis
focused on the collection as a whole and did not
examine the individual sub-collections. Limitations
of the previous work include difficulty in reproducing
the results (because of the fine-tuning) and difficulty in
benchmarking results against more established
sentiment analysis models. Our sentiment analysis
uses VADER (Hutto and Gilbert, 2014), one of the most
widely used sentiment analysis models. This is
compared with the newer TextBlob (Loria, 2018)
model.
Table 1: Word count on the HC3 texts.
The HC3_medical group contains texts with
strongly positive and strongly negative sentiments,
while the finance collection is dominated by neutral
sentiment.
The human texts contained an average of 120.5
words, while the GPT texts were approximately twice
that length, at 241.3 words.
KDIR 2024 - 16th International Conference on Knowledge Discovery and Information Retrieval