designed to resemble summarization. In this task,
crucial sentences are deliberately removed or masked
from an input document, and the model generates
these sentences as a single output sequence from the remaining sentences. This approach is akin to
producing an extractive summary (Zhang Jingqing,
2020). Furthermore, Pegasus has demonstrated state-
of-the-art performance in summarization across all 12
downstream tasks, as assessed by metrics like
ROUGE and human evaluations.
(4) T5-Base_GNAD is a fine-tuned variant that achieves the following results on the evaluation set: Loss (2.1025), Rouge-1 (27.5357), Rouge-2 (8.5623), Rouge-L (19.1508), Rougelsum (23.9029), and Generation Length (52.7253).
Automatic summarization is a pivotal challenge within NLP, involving both language comprehension (such as discerning the vital content components) and content generation (aggregating and rephrasing the identified content to produce a summary) (Sreyan Ghosh, 2022).
Summaries fall into two categories: extractive summarization and abstractive summarization. Extractive summarization creates a summary that is a subset of the original text, containing only words that appear in the original, while abstractive summarization may contain new phrases and sentences that do not appear in the source text.
To the best of our knowledge, the vast majority of researchers have used text datasets (such as Wikipedia, CNN/DM, and the TAC dataset) to evaluate the information-extraction capability of LLMs (Li Liuqing, Cabrera-Diego), but few have evaluated the text summarization produced by ChatGPT and other AI tools (e.g., Claude) or language models on materials from different subject fields (e.g., Agricultural Science, Physics, Chemistry, Computer Science). We selected ten subject fields based on Web of Science (WOS) Categories and then collected five highly cited theses in each field, chosen by peer researchers for their high-quality abstracts and analytical content.
2 METHODOLOGY
This paper aims to conduct research utilizing English
abstracts from various fields, ranging from
Agricultural Science to Philosophy & Religion. Our
metric of choice is the Jensen-Shannon divergence ($D_{JS}$), which has exhibited a strong correlation with manual evaluation methods such as Pyramid, Coverage, and Responsiveness in predicting system rankings (Louis Annie, Saggion H.).
Before delving into the details of the Jensen-
Shannon divergence (JS divergence), it is important
to introduce the concept of Kullback-Leibler
divergence (KL divergence) (Kullback S. 1951). KL
divergence is an information-theoretic measure that
quantifies the dissimilarity between two probability
distributions over the same event space. Within
information theory, KL divergence can be interpreted as the information loss incurred when messages drawn from one distribution are encoded using a second distribution. In the context of summary evaluation, this corresponds to encoding the source document using the word distribution of a summary produced by an Automatic Text Summarization (ATS) system. Consider two probability
distributions, P and Q, where P represents the
distribution of words in the source document and Q
represents the distribution in the candidate summary.
The Kullback-Leibler (KL) divergence is defined as
follows:
$$D_{KL}(P \| Q) = \sum_{w} P_{w} \log_2 \frac{P_{w}}{Q_{w}} \qquad (1)$$
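For concreteness, Equation (1) can be computed directly from two term-probability mappings. The following Python snippet is a minimal sketch; the function name and dictionary-based layout are assumptions for illustration, not the implementation used in this study:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum over terms w of P_w * log2(P_w / Q_w) (Eq. 1).

    p and q map each term w to its probability. Every term with
    p[w] > 0 must also have q[w] > 0, otherwise the sum diverges,
    which is exactly the limitation discussed below.
    """
    return sum(p_w * math.log2(p_w / q[w]) for w, p_w in p.items() if p_w > 0)
```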
While the resulting values of KL divergence are always non-negative, it lacks the symmetric property ($D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$), fails to satisfy the triangle inequality, and tends to yield divergent values (Thomas M. Cover, 2012). To address these limitations, Lin et al. (Lin C.Y. 2006) proposed the use of the Jensen-Shannon divergence ($D_{JS}$) to measure information loss between two documents. The $D_{JS}$ is formally defined by Equation (2):
$$D_{JS}(P \| Q) = \frac{1}{2} \sum_{w} \left[ P_{w} \log_2 \frac{2 P_{w}}{P_{w} + Q_{w}} + Q_{w} \log_2 \frac{2 Q_{w}}{P_{w} + Q_{w}} \right] \qquad (2)$$
In Equation (2), $P_{w}$ represents the probability of term $w$ in the source document, while $Q_{w}$ represents the probability of term $w$ in the candidate summary. The probability of each term $w$ is computed using Equation (3):
$$p_{w} = \frac{C_{w} + \delta}{N + \delta \times B} \qquad (3)$$
Here $C_{w}$ is the count of word $w$ and $N$ is the number of tokens. Specifically, we set $\delta = 1 \times 10^{-10}$ and $B = |V|$, where $V$ is the set of all distinct terms obtained from the source document and the candidate summary.
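Combining Equations (2) and (3), the score for one document–summary pair can be computed along the following lines. This is a minimal Python sketch; the function names and the simple whitespace tokenization are assumptions for illustration rather than the exact code used in this study:

```python
import math
from collections import Counter

DELTA = 1e-10  # the smoothing constant delta from Equation (3)

def smoothed_distribution(tokens, vocab):
    """Eq. (3): p_w = (C_w + delta) / (N + delta * B), with B = |V|."""
    counts = Counter(tokens)
    n, b = len(tokens), len(vocab)
    return {w: (counts[w] + DELTA) / (n + DELTA * b) for w in vocab}

def js_divergence(p, q):
    """Eq. (2): half the sum of the two KL divergences to the mixture."""
    total = 0.0
    for w in set(p) | set(q):
        p_w, q_w = p.get(w, 0.0), q.get(w, 0.0)
        if p_w > 0:
            total += 0.5 * p_w * math.log2(2 * p_w / (p_w + q_w))
        if q_w > 0:
            total += 0.5 * q_w * math.log2(2 * q_w / (p_w + q_w))
    return total
```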
Based on this evaluation method, we applied our
corpora to the AI models, allowing them to
summarize the abstract texts. Specifically, we
instructed the models to generate a summary of approximately n words (where n is 30% of the source abstract's word count) to prevent excessive rephrasing. We collected all the generated summaries and used Python to calculate the $D_{JS}$ scores.
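A hypothetical end-to-end scoring step, reusing the smoothed_distribution and js_divergence sketches above, would look as follows; the placeholder texts and the length calculation are illustrative:

```python
source_text = "..."   # full abstract text from one of the collected papers
summary_text = "..."  # summary returned by the AI tool under evaluation

src_tokens = source_text.lower().split()
sum_tokens = summary_text.lower().split()

# Requested summary length: roughly 30% of the abstract's word count.
n_target = round(0.3 * len(src_tokens))

# Joint vocabulary V of the source document and the candidate summary.
vocab = set(src_tokens) | set(sum_tokens)
p = smoothed_distribution(src_tokens, vocab)  # source document (P)
q = smoothed_distribution(sum_tokens, vocab)  # candidate summary (Q)

print(f"target length: {n_target} words, D_JS = {js_divergence(p, q):.4f}")
```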
Since certain large language models (LLMs) lacked dedicated websites with chat-box interfaces, we