Hallucinations in LLMs and Resolving Them: A Holistic Approach
Rajarshi Biswas, Sourav Dutta and Dirk Werth
August-Wilhelm Scheer Institute, Uni-Campus D 5 1, 66123 Saarbrücken, Germany
{firstname.lastname}@aws-institut.de
Keywords:
Natural Language Processing, Natural Language Generation, Generative AI.
Abstract:
Generative artificial intelligence has recently attracted tremendous interest across industry and academia, leading to rapid growth. Advances in model architectures, training datasets and large-scale computing enable impressive generative capabilities in text, computer vision and other domains. However, the generative process suffers from challenging artifacts that can cause confusion, create risks or compromise security. In this paper, we explore in detail the problem of inconsistent or hallucinated generation in natural language generation (NLG). We define the problem and survey current techniques for its detection, measurement and mitigation across five tasks: abstractive summarization, question answering, dialogue generation, machine translation, and named entity recognition combined with information retrieval.
1 INTRODUCTION
The emergence of powerful large language models (LLMs) based on deep neural architectures such as Transformers, BERT and GPT is enabling generative artificial intelligence to achieve impressive feats and attract unprecedented attention across the board. Natural language generation (NLG) is one of the primary yet most challenging generative tasks in natural language processing, and it is the focus of LLMs. NLG comprises a wide variety of tasks, such as coherent text generation, summarization, dialogue generation, question answering and translation, that have witnessed rapid growth in the last decade. However, the significant progress in NLG is accompanied by challenges such as a lack of diversity in surface realization, loss of context, and inconsistent or hallucinated generation.
In this work, we concentrate on analyzing hallucinated generation for five major downstream tasks in NLG: abstractive summarization, question answering, dialogue generation, machine translation, and named entity recognition combined with information retrieval. Hallucination is a form of degeneracy that demands attention from the research community. It is a serious issue with generative models in NLG and refers to situations in which the model generates inconsistent or nonsensical text that contradicts the source material, context or objective. It is important to study this phenomenon because generative models like LLMs are being widely adopted in critical services such as health care and banking, where hallucinations can severely limit the performance of deployed models and degrade the quality of service. Moreover, hallucinations can jeopardise the safety of these applications, leading to loss of trust and serious damage. For example, an inconsistent response in a banking application can trigger an incorrect transaction and cause a loss of funds; more seriously, a hallucinated response from an LLM in the health sector can lead to wrong medication or a drug overdose, threatening the life of a patient.
As a consequence, efforts are being made in the community to understand the issue of hallucination, or inconsistent generation, in NLG. However, most studies are directed towards machine translation and text summarization. This leaves a gap in understanding the problem of hallucination from a broader perspective that spans different tasks. In this work we therefore survey the current literature across the five NLG tasks mentioned above. We believe that studying the problem across different tasks leads to a deeper understanding, the formation of a unified view, and the identification of global trends in hallucinated generation. Furthermore, we also discuss different ideas for mitigating inconsistent generation in the five NLG tasks studied.
We organize the rest of the paper as follows. Section 2 describes the different variants of hallucination in NLG and the factors contributing to it.
Sections 3 to 7 survey current efforts in understanding hallucination in abstractive summarization, question answering, dialogue generation, machine translation, and named entity recognition combined with information retrieval. Section 8 discusses different ways of resolving the problem of hallucinations through holistic as well as task-specific measures. Section 9 then discusses potential future research directions for managing hallucinations more effectively. Finally, we summarize our findings in Section 10.
2 HALLUCINATION: VARIANTS
AND CONTRIBUTING
FACTORS
In this section, we briefly describe the different variants of hallucination and the factors contributing to it. In the context of natural language processing, hallucination is defined as automatically generated content that is nonsensical or unfaithful to the source content (Filippova, 2020; Maynez et al., 2020; Parikh et al., 2020; Zhou et al., 2020). Depending on the task, prior work divides it into two categories, intrinsic and extrinsic hallucination (Dziri et al., 2021; Huang et al., 2021; Maynez et al., 2020). In the first category, the generated output directly contradicts the input source and can therefore be classified as erroneous generation. Extrinsic hallucination refers to generations that cannot be verified against the source content; such output is not necessarily incorrect, but it is still problematic and poses a safety risk. The primary factors contributing to hallucination in NLG are the data sources and model training choices. On the data side, heuristic data collection (Lebret et al., 2016; Wiseman et al., 2017; Parikh et al., 2020; Wang, 2020) or tasks that require diversity in the generations, e.g., open-domain dialogue generation in a subjective tone (Rashkin et al., 2021), lead to source-output divergence. This divergence is one of the key contributing factors behind hallucination. Training-related factors causing hallucination include faulty representation learning, erroneous decoding, exposure bias and parametric-knowledge bias. For instance, an encoder learning wrong correlations (Li et al., 2018; Feng et al., 2020) or building a faulty understanding of the input (Parikh et al., 2020) can lead to inconsistent generations. Similarly, attending to the wrong part of the encoded information, or decoding strategies aimed at improving diversity, can result in hallucination (Tian et al., 2019). The problem of exposure bias (Bengio et al., 2015; Ranzato et al., 2015), i.e., the disparity between decoding during training and during inference, also leads to inconsistency: maximum-likelihood training conditions next-token prediction on ground-truth prefixes, whereas inference conditions it on the model's self-generated history (He et al., 2021).
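To make the exposure-bias discussion concrete, the following minimal Python sketch illustrates scheduled sampling (Bengio et al., 2015), one common remedy: during training, the decoder is sometimes fed its own previous prediction instead of the ground-truth prefix. The `predict_next` function is a hypothetical placeholder for a model's decoding step, not part of any cited system.

```python
import random

def predict_next(prefix):
    # Hypothetical model step: would return the model's most likely next token.
    return "<model_token>"

def decode_with_scheduled_sampling(target_tokens, teacher_forcing_prob=0.75):
    """Build decoder inputs, mixing gold tokens and self-generated tokens."""
    prefix, decoder_inputs = [], []
    for gold_token in target_tokens:
        decoder_inputs.append(list(prefix))        # what the decoder conditions on at this step
        use_gold = random.random() < teacher_forcing_prob
        next_token = gold_token if use_gold else predict_next(prefix)
        prefix.append(next_token)                  # feed either the gold or the model token forward
    return decoder_inputs

# With probability 0.5 each prefix position comes from the model itself,
# exposing the decoder to its own (possibly erroneous) history during training.
print(decode_with_scheduled_sampling(["the", "cat", "sat"], teacher_forcing_prob=0.5))
```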
3 ABSTRACTIVE
SUMMARIZATION
Hallucination: In NLP, abstractive summarization refers to the task of generating a short, concise summary that captures the relevant details of the source text (Yu et al., 2021). Even though neural approaches have achieved much success on this task, recent studies find that they generate inconsistent or hallucinated content (Falke et al., 2019; Maynez et al., 2020). Moreover, it has been observed that generated summaries containing a large number of inconsistencies can still obtain very high ROUGE scores. These findings underscore the importance of studying the problem of hallucination in this task.
Measurement: The degree of inconsistency in generated summaries is measured using metrics that are mostly model based. These can be categorized into unsupervised and semi-supervised metrics. The unsupervised metrics can be further classified into information extraction based, natural language inference based, and question-answering based metrics. Information extraction based methods extract details in the form of relation tuples from both the source and the generated summary in order to verify factual accuracy. In a similar vein, question-answering based metrics measure factual consistency between source and output by generating pertinent questions that are expected to produce similar answers from both. In general, these metrics follow three steps: question generation from the generated output, answer extraction from the source and the output, and scoring the agreement between the answers obtained from the two. In contrast, natural language inference based metrics assume that a faithful summary should be entailed by the source document.
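As an illustration of the three-step question-answering based metrics described above, the sketch below assumes a Hugging Face extractive QA pipeline is available; `generate_questions` is a hypothetical placeholder for a question-generation model, and the token-level F1 scoring is a simplification of what published metrics use.

```python
from collections import Counter
from transformers import pipeline

qa = pipeline("question-answering")  # default extractive QA model from Hugging Face

def generate_questions(summary):
    # Placeholder: a question-generation model would produce questions about facts in the summary.
    return ["Who announced the merger?", "When was it announced?"]

def token_f1(pred, gold):
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def qa_consistency(source, summary):
    scores = []
    for question in generate_questions(summary):              # step 1: questions from the output
        ans_src = qa(question=question, context=source)        # step 2: answer from the source...
        ans_sum = qa(question=question, context=summary)       # ...and from the summary
        scores.append(token_f1(ans_sum["answer"], ans_src["answer"]))  # step 3: score agreement
    return sum(scores) / len(scores) if scores else 0.0
```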
Resolution: Practitioners in abstractive summarization use various techniques to cope with this issue. For example, graph neural networks are used in (Zhu et al., 2021) to encode facts from the source text, and reward functions are integrated in (Huang et al., 2020) to better capture interactions between entities in the source. External knowledge, obtained by embedding facts from Wikipedia, is also used in (Gunel et al., 2020) to improve factual consistency.
The work in (Aralikatte et al., 2021) proposes a focus-attention mechanism that encourages decoders to generate tokens related to the facts or topic of the source. Keeping with attention-based methods, the work in (Cao et al., 2018) uses a dual-attention sequence-to-sequence framework to ensure that generated summaries take into account both the source text and the facts extracted from it. Contrastive learning is used in (Cao and Wang, 2021) to enable models to distinguish between positive ground-truth summaries and automatically generated negative summaries containing factual inconsistencies or hallucinations. In addition, post-processing is employed to remove inconsistent facts from the generated summaries.
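To illustrate the contrastive-learning idea of (Cao and Wang, 2021), the sketch below shows one possible InfoNCE-style objective over summary embeddings. It is a rough approximation under the assumption that an encoder produces fixed-size embeddings for the source, the reference summary and perturbed negative summaries; it is not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(src_emb, pos_emb, neg_embs, temperature=0.1):
    """src_emb, pos_emb: shape (d,); neg_embs: shape (n_neg, d)."""
    candidates = torch.cat([pos_emb.unsqueeze(0), neg_embs], dim=0)       # (1 + n_neg, d)
    sims = F.cosine_similarity(src_emb.unsqueeze(0), candidates, dim=-1)  # similarity to each candidate
    logits = sims / temperature
    target = torch.tensor([0])            # index 0 is the faithful (positive) summary
    return F.cross_entropy(logits.unsqueeze(0), target)

# Toy embeddings with illustrative dimensions; in practice these come from the summarization encoder.
d = 16
src = torch.randn(d, requires_grad=True)
pos = torch.randn(d, requires_grad=True)
negs = torch.randn(4, d, requires_grad=True)
loss = contrastive_loss(src, pos, negs)
loss.backward()  # in training, this term would be added to the usual MLE loss
print(loss.item())
```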
4 QUESTION-ANSWERING
Hallucination: Generative question answering is gaining prominence with the growing success of generative artificial intelligence. It is more powerful and effective than first-generation question-answering systems, which merely tried to find facts in the source text that answer the question. The objective of generative question answering is to frame more detailed and complete answers, which may require gathering information from across the source. As a result, the system sometimes needs to consult multiple source documents, since a single document may not contain all the information needed to frame a definitive answer. However, this process can induce hallucination as an adverse side effect, since some of these documents may contain extraneous or contradictory information. The closest thing to a definition of hallucination in generative question answering is semantic drift (Li et al., 2021), which describes how a generated answer drifts away from the correct answer during generation. Beyond this, the majority of works in this area rely on human evaluation of the factual correctness of generated answers as a measure of inconsistency.
Measurement: Hallucination in generative question answering is measured using the Semantic Overlap metric (Sellam et al., 2020), a BERT-based metric that correlates with human judgment. Factual correctness is also used to measure the consistency between generated text and the source document using information extraction (Zhang et al., 2020a). An automatic question-answering based metric has been proposed (Durmus et al., 2020; Wang et al., 2020) for measuring consistency in generated summaries. In this approach, question-answer pairs are first created from the generated summary using a question generation model. Subsequently, a model extracts answers to those questions from the source document. If the answers do not match, the generated summary is regarded as unfaithful. The same technique is also used to measure hallucination in generative question answering. Apart from these metrics, human evaluation is frequently used in this field to assess the consistency or faithfulness of generated answers. Human evaluation is also often used to complement automatic n-gram overlap metrics such as BLEU, ROUGE and METEOR, as these correlate poorly with human judgments.
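The Semantic Overlap metric above is the learned, BERT-based BLEURT; as a rough stand-in only, the sketch below scores a generated answer against a reference with sentence-embedding cosine similarity using the sentence-transformers library. The checkpoint name is just a commonly used example, not the metric from the cited work.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose sentence encoder

def semantic_overlap(generated: str, reference: str) -> float:
    # Embed both texts and return their cosine similarity as a crude overlap score.
    emb = model.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

print(semantic_overlap(
    "The treaty was signed in 1992 by twelve member states.",
    "Twelve member states signed the treaty in 1992.",
))
```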
Resolution: Techniques for resolving hallucination in generative question answering concentrate on leveraging external knowledge bases and information resources to improve the factual correctness or faithfulness of the generated answers. One approach (Bi et al., 2019) generates answers by accumulating information from multiple sources, such as knowledge bases, passages, the vocabulary and the questions themselves. A neural model is used in (Yin et al., 2016) to generate answers to factoid questions using information from a knowledge base. More recent approaches (Fan et al., 2019) build an individual knowledge graph for every question, condensing information and reducing redundancy in order to tackle hallucination. Another method (Li et al., 2021) extracts a rationale for the answer during encoding and biases the decoder to generate the answer from both the rationale and the actual input. To reduce hallucination in the answers, the authors in (Krishna et al., 2021) propose a sparse attention-based transformer model as the answer generator for effectively handling the retrieved documents; it models long-range dependencies using local attention and mini-batch k-means clustering. Similarly, to mitigate hallucination, (Su et al., 2022) proposes a framework that jointly models answer generation and machine reading: the machine reading module supplies salient answer-related information to the generation model, improving the faithfulness of the generated answer.
5 DIALOGUE GENERATION
Hallucination: Dialogue generation is probably the most widely adopted generation task in natural language processing, with wide-ranging applications such as chatbots and voice assistants. It can be broadly categorized into task-specific and open-domain dialogue generation. In task-specific dialogue, responses are expected
to contain specific information, while in open-domain dialogue an engaging response is desired, without too much repetition of the conversational history over relatively long conversations. Owing to this nature, the tolerance for hallucination is higher in this task than in other generation tasks. Hallucination in dialogue generation is considered intrinsic if specific information is absent from, or misrepresented in, the generated response. If the generated conversation is not firmly grounded in hard facts and cannot easily be verified against knowledge bases or the conversational history, it is termed extrinsic hallucination. In our work, we discuss the problems related to open-domain dialogue generation, as it is more relevant to modern dialogue systems built on state-of-the-art LLMs trained on huge amounts of data. In open-domain dialogue systems there are broadly two sources of hallucination. First, responses that contradict previous responses from the same system, leading to inconsistency (Li et al., 2020; Welleck et al., 2019; Zhang et al., 2021a) or incoherence (Beyer et al., 2021; Dziri et al., 2019), are termed self-inconsistency. Second, when the system generates responses that are inconsistent with an external source, e.g., factually incorrect responses, this is termed external inconsistency (Mielke et al., 2022; Roller et al., 2021). Another factor influencing inconsistency in open-domain dialogue generation is a lack of consistency in the persona or character assumed by the dialogue system, which often leads to contradictions and in turn to hallucinations. As a result, there is research (Hancock et al., 2019; Mazaré et al., 2018; Yavuz et al., 2019; Zhang et al., 2020b) on developing persona-consistent systems with the help of suitable datasets (Dinan et al., 2019a; Zhang et al., 2018). Additionally, there are works in open-domain dialogue generation that use external knowledge bases and graphs to generate informative responses (Dinan et al., 2019b; Zhou et al., 2018). Hallucination in such systems is treated as factual inconsistency and has received an equal amount of attention from the dialogue generation community (Dziri et al., 2021; Rashkin et al., 2021; Santhanam et al., 2021; Shuster et al., 2021).
Measurement: Evaluating hallucination in open-domain dialogue generation is still an open problem, as there is no standard metric for measuring it. Dialogue systems such as chatbots are often evaluated in terms of factual correctness or consistency. Automated metrics used for this purpose include Knowledge F1 and Rare F1 (Shuster et al., 2021), both of which are statistics based, while other techniques are model based. Knowledge F1 relies on datasets in which the knowledge is labeled, i.e., gold-standard knowledge sentences that a person consulted during dataset collection, and measures the overlap between the generated response and these gold knowledge sentences. The metric thus tries to capture whether generated responses make use of the available knowledge and therefore make sense. Rare F1 computes the F1 score only over infrequent words in the dataset, in order to negate the influence of common unigrams. However, overlap-based metrics cannot provide a comprehensive evaluation, since the same semantic meaning can be expressed through a wide variety of surface realizations. To address this, different model-based techniques have been proposed for measuring consistency, for example natural language inference (NLI) (Dziri et al., 2019; Welleck et al., 2019), learnable evaluation metrics (Zhang et al., 2021b), or an additional test for measuring coherence (Beyer et al., 2021). These methods offer more flexibility and can handle generations with different surface realizations.
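The overlap-based dialogue metrics above can be sketched in a few lines of plain Python; whitespace tokenization and the rarity threshold are simplifications of the setup in (Shuster et al., 2021).

```python
from collections import Counter

def unigram_f1(pred_tokens, gold_tokens):
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * p * r / (p + r)

def knowledge_f1(response, gold_knowledge):
    # Unigram F1 between the generated response and the gold knowledge sentence.
    return unigram_f1(response.lower().split(), gold_knowledge.lower().split())

def rare_f1(response, gold_knowledge, corpus_counts, max_count=100):
    # Same F1, but restricted to tokens that are infrequent in the training corpus.
    rare = lambda toks: [t for t in toks if corpus_counts.get(t, 0) <= max_count]
    return unigram_f1(rare(response.lower().split()), rare(gold_knowledge.lower().split()))

corpus_counts = Counter("the treaty was signed the treaty entered into force".split())
print(knowledge_f1("the treaty was signed in 1992", "the treaty was signed in rome in 1992"))
print(rare_f1("the treaty was signed in 1992", "the treaty was signed in rome in 1992", corpus_counts))
```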
Resolution: The problem of hallucination in open-domain dialogue generation can be mitigated using different techniques. One way is to act on the training data. The authors in (Shen et al., 2021) propose a measure based on features of dialogue quality which can be used to remove low-scoring samples from the training set, which in turn improves self-consistency. Retrieval is used to augment dialogue generation approaches such as knowledge-grounded dialogue, where it performs knowledge selection and helps to reduce hallucination substantially (Shuster et al., 2021). Control codes concatenated with the dialogue inputs are proposed in (Rashkin et al., 2021) for reducing hallucination, making the model more aware of how its generations rely on evidence grounded in knowledge. Improved dialogue modeling techniques have also been studied for reducing hallucination during generation, e.g., the use of inductive attention in transformer-based dialogue models (Wu et al., 2021).
6 MACHINE TRANSLATION
Hallucination: Machine translation (MT) refers to
the automatic conversion of text from one language
into another, aiming for both grammatical accuracy
and semantic fidelity (Bahdanau, 2014). While neural
machine translation (NMT) models have dramatically
improved translation quality, particularly with the
Hallucinations in LLMs and Resolving Them: A Holistic Approach
107
advent of transformer-based architectures (Vaswani
et al., 2017), they are still prone to generating hal-
lucinations. These hallucinations occur when the sys-
tem introduces information that is not present in the
source text, or mistranslates critical content, leading
to outputs that may seem fluent but are semantically
incorrect or inconsistent (Raunak et al., 2021; Müller et al., 2020). These errors are particularly prevalent
in low-resource language pairs and in cases where the
model overfits to patterns in the training data. Hal-
lucinations in machine translation can severely im-
pact the reliability of translations, especially in criti-
cal domains such as legal, medical, or technical fields,
where accuracy is paramount (Raunak et al., 2021).
Measurement: Evaluating hallucinations in machine
translation poses a unique challenge, as traditional
metrics like BLEU (Papineni et al., 2002) or ME-
TEOR (Banerjee and Lavie, 2005), which compare
the machine output to reference translations, may not
effectively capture the degree of hallucination. Re-
cent studies have proposed new approaches to better
measure hallucinations, including both model-based
and human evaluation metrics. One common ap-
proach involves using adequacy-based human eval-
uation, where human annotators judge how well
the translation aligns with the source content (Spe-
cia et al., 2011). For automated measurement, source-target alignment techniques can identify mistranslations or extraneous information by comparing how the source and the output align, thereby ensuring fidelity between the input and output sequences (He et al., 2016). NLI (natural language inference) model-based metrics (Zhou et al., 2021), primarily aimed at fact-checking and aligning generated text, can detect hallucinations in generated outputs. Such methods compare the translated content (hypothesis) against the source (premise) for
contradictions or factual inaccuracies. Another ap-
proach uses confidence-based filtering, where low-
confidence outputs from the translation model are
flagged as potentially hallucinatory (Tu et al., 2017).
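As a hedged illustration of the NLI-based checks described above, the sketch below feeds a premise and a hypothesis to an off-the-shelf NLI classifier and flags a high contradiction probability. It assumes the roberta-large-mnli checkpoint is available and that the hypothesis has already been mapped into the premise's language (e.g., via back-translation), since a monolingual NLI model cannot compare across languages.

```python
from transformers import pipeline

# Assumed checkpoint; any MNLI-style classifier with ENTAILMENT/NEUTRAL/CONTRADICTION labels would do.
nli = pipeline("text-classification", model="roberta-large-mnli", top_k=None)

def contradiction_score(premise: str, hypothesis: str) -> float:
    result = nli({"text": premise, "text_pair": hypothesis})
    scores = result[0] if isinstance(result[0], list) else result  # handle both output shapes
    return next(s["score"] for s in scores if s["label"].upper() == "CONTRADICTION")

if contradiction_score("The meeting was postponed to Friday.",
                       "The meeting took place on Monday as planned.") > 0.5:
    print("possible hallucination or mistranslation")
```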
Resolution: Addressing hallucinations in machine
translation involves both improving the underlying
model architecture and leveraging external resources.
One promising approach is data augmentation, par-
ticularly for low-resource languages, which can help
mitigate hallucinations caused by insufficient train-
ing data (Sennrich et al., 2016). In addition, back-
translation, where the model translates target lan-
guage sentences back into the source language and
compares them to the original text, has been used to
reduce inconsistencies (Edunov et al., 2018). Other
efforts focus on improving the attention mechanisms
within transformers. For example, coverage mecha-
nisms have been employed to ensure that every part of
the source sentence is attended to during translation,
reducing the likelihood that the model will “invent”
content not present in the source (Tu et al., 2016). Incorporating external knowledge bases has also been explored: integrating knowledge embeddings into tasks such as translation helps maintain factual consistency and reduces the risk of hallucination, especially in technical or specialized content (Wang et al., 2021). Moreover, the use of reinforce-
ment learning for sequence prediction tasks, includ-
ing NMT, shows how reward functions can be tai-
lored to encourage factual accuracy, reducing issues
like hallucination during translation (Bahdanau et al.,
2022). Finally, post-editing techniques, where hu-
man editors review and correct translations, are often
employed in high-stakes scenarios to ensure final out-
put quality, especially when dealing with critical con-
tent (Toral et al., 2018).
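A minimal round-trip check in the spirit of back-translation can serve as a cheap hallucination signal. The sketch below assumes the Helsinki-NLP opus-mt English-German checkpoints are available and uses a crude unigram-overlap score rather than a proper adequacy metric.

```python
from transformers import pipeline

# Assumed checkpoints for the two translation directions.
en_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def round_trip_overlap(source_en: str) -> float:
    translated = en_de(source_en)[0]["translation_text"]   # translate into the target language
    back = de_en(translated)[0]["translation_text"]         # translate back into the source language
    src_tokens, back_tokens = set(source_en.lower().split()), set(back.lower().split())
    return len(src_tokens & back_tokens) / max(len(src_tokens), 1)

score = round_trip_overlap("The patient should take two tablets after every meal.")
print(f"round-trip overlap: {score:.2f}")  # low values suggest content was lost or invented
```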
7 NAMED ENTITY
RECOGNITION AND
INFORMATION RETRIEVAL
Hallucination: Named Entity Recognition (NER) is
a fundamental NLP task aimed at identifying and
classifying proper nouns such as people, organiza-
tions, and locations within a text (Lample et al.,
2016). Despite significant progress with neural mod-
els, these systems can still exhibit hallucinations,
where entities are misclassified or incorrectly gener-
ated. For example, models might mistakenly recog-
nize a non-existent entity or mislabel a correct entity
due to insufficient context or model limitations (Su
et al., 2024). This misclassification can impact appli-
cations relying on accurate entity identification, such
as information extraction and semantic search. In In-
formation Retrieval (IR), the objective is to retrieve
documents or data that are relevant to a user’s query
(Schütze et al., 2008). Although neural IR models
have improved the relevance and ranking of retrieved
results, they can occasionally retrieve documents that
are irrelevant or hallucinated, meaning the retrieved
results do not genuinely align with the user’s query
intent (Nogueira and Cho, 2019; James and Kannan,
2017). These hallucinated results can stem from over-
fitting on training data or from inadequacies in the
query-document matching process.
Measurement: To measure hallucinations in NER,
various evaluation metrics are employed. Precision,
recall, and F1-score are commonly used to compare
the predicted entities against a gold standard anno-
tated dataset. Precision measures the proportion of
correctly identified entities out of all entities iden-
tified by the model, recall measures the proportion
of correctly identified entities out of all entities that
should have been identified, and F1-score provides
a balance between precision and recall. Unsuper-
vised metrics also play a role, such as entity linking,
where entities recognized by the model are matched
against external knowledge bases to verify their cor-
rectness. Cross-document consistency checks can
further identify discrepancies by ensuring that enti-
ties are consistently recognized across multiple doc-
uments (Jiang et al., 2016). For IR, effectiveness is
measured through metrics such as Precision@K, Re-
call@K, and Mean Reciprocal Rank (MRR). Pre-
cision@K measures the proportion of relevant docu-
ments among the top K retrieved documents, while
Recall@K assesses the proportion of relevant docu-
ments retrieved within the top K results. MRR eval-
uates the rank of the first relevant document in the
list. Additionally, query-document relevance scoring,
which involves assessing the alignment between the
query and the retrieved documents, and external val-
idation against curated datasets are used to gauge re-
trieval accuracy and address issues of hallucination
(Schütze et al., 2008; Nogueira and Cho, 2019).
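The standard metrics above are straightforward to compute; the following plain-Python sketch treats NER outputs as sets of (span, label) pairs and IR results as ranked lists of document ids, which is a simplification of full evaluation protocols such as CoNLL-style scoring.

```python
def ner_prf(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def precision_at_k(ranked, relevant, k):
    return len([d for d in ranked[:k] if d in relevant]) / k

def recall_at_k(ranked, relevant, k):
    return len([d for d in ranked[:k] if d in relevant]) / max(len(relevant), 1)

def mean_reciprocal_rank(rankings, relevant_sets):
    reciprocal_ranks = []
    for ranked, relevant in zip(rankings, relevant_sets):
        rank = next((i + 1 for i, d in enumerate(ranked) if d in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

print(ner_prf([("Berlin", "LOC"), ("Acme", "ORG")], [("Berlin", "LOC"), ("ACME Corp", "ORG")]))
print(mean_reciprocal_rank([["d3", "d1", "d7"]], [{"d1"}]))  # -> 0.5
```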
Resolution: Addressing hallucinations in NER in-
volves several advanced techniques. Contextual em-
beddings from models such as BERT (Devlin et al.,
2019) capture richer semantic information by provid-
ing context-dependent representations of words. This
approach improves the accuracy of entity recognition
by understanding the context in which entities appear.
Multi-task learning, which involves training models
on related tasks simultaneously, helps enhance entity
recognition by leveraging additional sources of infor-
mation (McCann et al., 2017). Integrating external
knowledge sources like knowledge graphs can also
reduce hallucinations by grounding the entity recog-
nition process in real-world data (He et al., 2020).
In IR, techniques to mitigate hallucination include
employing advanced retrieval architectures such as
dense retrievers and cross-encoder models. Dense re-
trievers use dense vector representations for query-
document matching, which improves the relevance
ranking of retrieved documents (Nogueira and Cho,
2019). Cross-encoder models, which jointly encode
the query and documents, further refine retrieval by
capturing complex relationships between them. Addi-
tionally, incorporating user feedback and techniques
like query expansion, where additional terms or con-
text are added to the query, helps refine retrieval re-
sults and address issues of hallucination (Azad and
Deepak, 2019).
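As a hedged sketch of the dense-retrieval idea, the code below embeds a query and a small document collection with a bi-encoder from the sentence-transformers library and ranks documents by cosine similarity. The checkpoint name is only an example; production systems would add an approximate-nearest-neighbour index and cross-encoder re-ranking.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example bi-encoder checkpoint

documents = [
    "Aspirin is commonly used to reduce fever and relieve mild pain.",
    "The Treaty of Rome was signed in 1957.",
    "Photosynthesis converts light energy into chemical energy in plants.",
]
doc_emb = encoder.encode(documents, convert_to_tensor=True)

def retrieve(query: str, k: int = 2):
    q_emb = encoder.encode([query], convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_emb, top_k=k)[0]   # cosine-similarity ranking
    return [(documents[h["corpus_id"]], h["score"]) for h in hits]

print(retrieve("What drug helps with a headache?"))
```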
8 APPROACHES TO RESOLVING
HALLUCINATIONS
The motivation behind this paper stems from the
growing reliance on Large Language Models (LLMs)
across a wide range of NLP tasks. While these models
have demonstrated remarkable advancements, they
also introduce a critical challenge: hallucinations.
Across tasks like abstractive summarization, ques-
tion answering, dialog generation, machine transla-
tion, NER, and information retrieval, hallucinations
manifest in various forms, from generating factual
inaccuracies to retrieving irrelevant or fabricated in-
formation. Despite significant progress in mitigating
these issues, hallucination remains a pervasive prob-
lem that compromises the reliability of LLMs in real-
world applications (Ji et al., 2023). The primary mo-
tivation for this paper is the need for a comprehen-
sive, cross-task analysis of hallucinations in LLMs.
While hallucinations in specific tasks such as summa-
rization or machine translation have been studied in
isolation (Raunak et al., 2021), there has been little
effort to systematically explore hallucinations across
multiple NLP tasks, each with its unique characteris-
tics and challenges. This paper aims to fill that gap
by providing a detailed investigation into the nature
of hallucinations in five distinct tasks, as well as out-
lining the current methods to detect and resolve them.
Our contribution is twofold: (1) a consolidated review of hallucination across different NLP tasks, and (2) a proposal of task-agnostic and task-specific approaches to resolving hallucinations, thereby providing a framework for future research.
8.1 Holistic Approach
While techniques that are task-specific, such as exter-
nal knowledge integration (Zhu et al., 2021) or using
better reward mechanisms (Chen et al., 2023), have
shown promise, we propose a more holistic approach
that could benefit all tasks:
Improving Model Interpretability: A crucial chal-
lenge is the black-box nature of LLMs, which makes
hallucinations difficult to predict or prevent. Im-
plementing interpretability mechanisms like atten-
tion visualization or rule-based model auditing can
help identify when and why hallucinations occur (Be-
linkov and Glass, 2019). Models like BERT, GPT,
and their variants could be enhanced with transpar-
ent architectures that allow for more insight into their
decision-making process, especially in tasks prone to
hallucination, like dialog generation and summariza-
tion (Ribeiro et al., 2016).
Task-Agnostic Regularization: Regularization tech-
niques, like fact-checking or constraint-based gen-
eration, should be applied consistently across tasks.
For example, incorporating external knowledge bases,
such as Wikipedia or structured databases, can help
ground generated outputs in factual information,
thereby reducing hallucination in both generative
(summarization, QA) and retrieval-based tasks (IR,
NER) (Petroni et al., 2019). This approach prevents
the model from generating content that strays too far
from verifiable truth, creating a safeguard against fab-
ricated information.
Adaptive Fine-Tuning for Specific Tasks: Although
LLMs are designed to generalize across tasks, fine-
tuning them on domain-specific data can significantly
reduce hallucinations. In tasks like machine transla-
tion and information retrieval, training models on spe-
cialized datasets and including domain-relevant enti-
ties can lead to more accurate and contextually ap-
propriate outputs (Sun et al., 2023). This reduces the
likelihood of hallucinating irrelevant or incorrect in-
formation, particularly when the task demands high
precision.
Evaluation and Feedback Mechanisms: One con-
sistent theme across tasks is the need for robust eval-
uation metrics. ROUGE, BLEU, and MRR are of-
ten insufficient to detect hallucinations because they
focus on fluency and surface-level similarities (Hon-
ovich et al., 2022). We suggest augmenting these met-
rics with fact-based or entity-level verification mech-
anisms. For instance, in question answering, auto-
matic fact-checking systems could be integrated to
score models on factual consistency, while in summa-
rization and translation, knowledge graphs could be
employed to cross-validate entity relationships (Cao
et al., 2020).
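One simple instance of the entity-level verification suggested above is to flag named entities in the generated text that never appear in the source. The sketch below uses spaCy for entity extraction (assuming the en_core_web_sm model is installed) and deliberately naive string matching.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm

def unsupported_entities(source: str, generated: str):
    # Entities in the generated text whose surface form does not occur in the source.
    source_lower = source.lower()
    return [ent.text for ent in nlp(generated).ents
            if ent.text.lower() not in source_lower]

source = "Acme Corp reported a 5% rise in quarterly revenue."
generated = "Acme Corp, led by CEO Jane Smith, reported a 5% rise in revenue in Berlin."
print(unsupported_entities(source, generated))  # e.g. ['Jane Smith', 'Berlin']
```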
8.2 Task-Specific Considerations
Certain tasks, due to their inherent complexity and the
nature of the data they process, require tailored so-
lutions to effectively mitigate hallucinations. These
solutions address the unique challenges of each task,
allowing models to generate more accurate and con-
textually appropriate outputs.
Named Entity Recognition (NER): NER systems
are prone to hallucinations when they mislabel en-
tities or identify non-existent ones, especially in do-
mains where new entities frequently emerge, such as
healthcare, finance, or geopolitics. Grounding NER
models in dynamic, real-world knowledge bases, such
as Wikidata or domain-specific databases, can help
ensure that entity identification remains accurate and
up-to-date (Hu et al., 2022). By continuously updat-
ing the knowledge base and training the model on
evolving data, hallucinations can be reduced as the
system remains aware of the latest entities and their
relationships. Furthermore, integrating context-aware
mechanisms, where entity recognition adapts based
on sentence-level or document-level context, can im-
prove accuracy and minimize misidentifications, par-
ticularly in ambiguous scenarios where multiple enti-
ties are involved.
Machine Translation: Machine translation systems
are susceptible to hallucinations, particularly when
translating between languages with significant struc-
tural differences or when translating low-resource
languages. Ensuring linguistic consistency across lan-
guages is crucial for reducing hallucinations. One
approach is incorporating post-editing frameworks
where human translators verify and correct machine-
generated translations, thereby maintaining transla-
tion quality and factual accuracy. In addition, con-
trastive learning techniques, which explicitly train
the model to recognize and avoid incorrect or out-
of-context translations, can help minimize semantic
drift—the phenomenon where the translation strays
from the intended meaning (Raunak et al., 2021).
This can be particularly useful when translating spe-
cialized texts, such as legal or medical documents,
where precision is paramount.
Dialog Generation: Hallucinations in dialog gener-
ation often result in models producing off-topic, in-
coherent, or factually incorrect responses. One of the
primary challenges is maintaining the consistency and
coherence of conversations over multiple turns. In-
tegrating persona mechanisms—where the model is
conditioned on a set of attributes or knowledge about
the user—can help ground responses in the user’s
context, reducing the likelihood of irrelevant or in-
consistent replies (Zhang et al., 2020b). Addition-
ally, context memory mechanisms, which allow the
model to retain and reference information from ear-
lier in the conversation, can ensure that subsequent
responses stay coherent and relevant. By maintaining
a memory of the dialog history, models can avoid in-
troducing new, unrelated information that could lead
to hallucination.
9 FUTURE WORK
Moving forward, we envision research focusing on
hybrid models that combine symbolic reasoning with
deep learning. This could address hallucinations by
introducing structured knowledge into the generative
process (Chen et al., 2020). Additionally, cross-
lingual hallucination detection in translation tasks and
further exploration into self-supervised fact-checking
methods for QA and summarization will likely en-
hance model robustness. Ultimately, addressing hal-
lucinations requires a concerted effort that combines
advances in model architectures, training strategies,
and evaluation techniques. Our work highlights the
importance of a unified approach to tackling halluci-
nations in LLMs, with the aim of developing models
that are not only powerful but also reliable and trust-
worthy (Schick and Schütze, 2021).
10 CONCLUSION
In this paper, we explored the challenge of hallucinations in Large Language Models (LLMs) across five key NLP tasks: abstractive summarization, question answering, dialog generation, machine translation, and named entity recognition combined with information retrieval.
Despite advances in these tasks, hallucinations remain
a persistent problem, undermining model reliability.
We provided a comprehensive review of task-specific
manifestations, metrics, and methods to address hal-
lucinations, and proposed a unified framework that
emphasizes interpretability, regularization, and fine-
tuning. Moving forward, addressing hallucinations
will be crucial for improving the trustworthiness and
applicability of LLMs in real-world scenarios.
ACKNOWLEDGEMENTS
This research is funded in part by the “Bundesministerium für Wirtschaft und Klimaschutz” within the project “MERLOT”, which was funded under the project reference 68GX21008K.
REFERENCES
Aralikatte, R., Narayan, S., Maynez, J., Rothe, S., and Mc-
Donald, R. (2021). Focus attention: Promoting faith-
fulness and diversity in summarization. In Zong, C.,
Xia, F., Li, W., and Navigli, R., editors, Proceed-
ings of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th Interna-
tional Joint Conference on Natural Language Pro-
cessing (Volume 1: Long Papers), pages 6078–6095.
Association for Computational Linguistics.
Azad, H. K. and Deepak, A. (2019). Query expansion tech-
niques for information retrieval: a survey. Information
Processing & Management, 56(5):1698–1735.
Bahdanau, D. (2014). Neural machine translation by
jointly learning to align and translate. arXiv preprint
arXiv:1409.0473.
Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R.,
Pineau, J., Courville, A., and Bengio, Y. (2022). An
actor-critic algorithm for sequence prediction. In In-
ternational Conference on Learning Representations.
Banerjee, S. and Lavie, A. (2005). Meteor: An automatic
metric for mt evaluation with improved correlation
with human judgments. In Proceedings of the acl
workshop on intrinsic and extrinsic evaluation mea-
sures for machine translation and/or summarization,
pages 65–72.
Belinkov, Y. and Glass, J. (2019). Analysis methods in neu-
ral language processing: A survey. Transactions of the
Association for Computational Linguistics, 7:49–72.
Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015).
Scheduled sampling for sequence prediction with re-
current neural networks. Advances in neural informa-
tion processing systems, 28.
Beyer, A., Loáiciga, S., and Schlangen, D. (2021). Is in-
coherence surprising? targeted evaluation of coher-
ence prediction from language models. In Toutanova,
K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D.,
Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T.,
and Zhou, Y., editors, Proceedings of the 2021 Con-
ference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human Lan-
guage Technologies, pages 4164–4173. Association
for Computational Linguistics.
Bi, B., Wu, C., Yan, M., Wang, W., Xia, J., and Li, C.
(2019). Incorporating external knowledge into ma-
chine reading for generative question answering. In
Inui, K., Jiang, J., Ng, V., and Wan, X., editors, Pro-
ceedings of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 2521–2530, Hong
Kong, China. Association for Computational Linguis-
tics.
Cao, M., Dong, Y., Wu, J., and Cheung, J. C. K. (2020).
Factual error correction for abstractive summarization
models. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing
(EMNLP), pages 6251–6258.
Cao, S. and Wang, L. (2021). CLIFF: Contrastive learn-
ing for improving faithfulness and factuality in ab-
stractive summarization. In Moens, M.-F., Huang, X.,
Specia, L., and Yih, S. W.-t., editors, Proceedings of
the 2021 Conference on Empirical Methods in Natural
Language Processing, pages 6633–6649. Association
for Computational Linguistics.
Cao, Z., Wei, F., Li, W., and Li, S. (2018). Faithful to the
original: fact-aware neural abstractive summarization.
In Proceedings of the Thirty-Second AAAI Confer-
ence on Artificial Intelligence and Thirtieth Innovative
Applications of Artificial Intelligence Conference and
Eighth AAAI Symposium on Educational Advances in
Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18.
AAAI Press.
Chen, T., Wang, X., Yue, T., Bai, X., Le, C. X., and Wang,
W. (2023). Enhancing abstractive summarization with
extracted knowledge graphs and multi-source trans-
formers. Applied Sciences, 13(13):7753.
Chen, W., Su, Y., Yan, X., and Wang, W. Y. (2020).
Kgpt: Knowledge-grounded pre-training for data-to-
text generation. In Proceedings of the 2020 Confer-
ence on Empirical Methods in Natural Language Pro-
cessing (EMNLP), pages 8635–8648.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. In Proceedings
of the 2019 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and
Short Papers), pages 4171–4186.
Dinan, E., Logacheva, V., Malykh, V., Miller, A. H., Shus-
ter, K., Urbanek, J., Kiela, D., Szlam, A., Serban, I. V.,
Lowe, R., Prabhumoye, S., Black, A. W., Rudnicky,
A. I., Williams, J. D., Pineau, J., Burtsev, M., and
Weston, J. (2019a). The second conversational intel-
ligence challenge (convai2). The Springer Series on
Challenges in Machine Learning, pages 187–208.
Dinan, E., Roller, S., Shuster, K., Fan, A., Auli, M., and
Weston, J. (2019b). Wizard of wikipedia: Knowledge-
powered conversational agents. In International Con-
ference on Learning Representations.
Durmus, E., He, H., and Diab, M. (2020). FEQA: A ques-
tion answering evaluation framework for faithfulness
assessment in abstractive summarization. In Jurafsky,
D., Chai, J., Schluter, N., and Tetreault, J., editors,
Proceedings of the 58th Annual Meeting of the As-
sociation for Computational Linguistics, pages 5055–
5070. Association for Computational Linguistics.
Dziri, N., Kamalloo, E., Mathewson, K., and Zaiane, O.
(2019). Evaluating coherence in dialogue systems us-
ing entailment. In Burstein, J., Doran, C., and Solorio,
T., editors, Proceedings of the 2019 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, Volume 1 (Long and Short Papers), pages 3806–
3812. Association for Computational Linguistics.
Dziri, N., Madotto, A., Zaïane, O., and Bose, A. J. (2021).
Neural path hunter: Reducing hallucination in dia-
logue systems via path grounding. In Moens, M.-F.,
Huang, X., Specia, L., and Yih, S. W.-t., editors, Pro-
ceedings of the 2021 Conference on Empirical Meth-
ods in Natural Language Processing, pages 2197–
2214. Association for Computational Linguistics.
Edunov, S., Ott, M., Auli, M., and Grangier, D. (2018). Un-
derstanding back-translation at scale. In Proceedings
of the 2018 Conference on Empirical Methods in Nat-
ural Language Processing, pages 489–500.
Falke, T., Ribeiro, L. F., Utama, P. A., Dagan, I., and
Gurevych, I. (2019). Ranking generated summaries
by correctness: An interesting but challenging appli-
cation for natural language inference. In Proceedings
of the 57th annual meeting of the association for com-
putational linguistics, pages 2214–2220.
Fan, A., Gardent, C., Braud, C., and Bordes, A. (2019).
Using local knowledge graph construction to scale
Seq2Seq models to multi-document inputs. In Inui,
K., Jiang, J., Ng, V., and Wan, X., editors, Proceed-
ings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 4186–4196, Hong
Kong, China. Association for Computational Linguis-
tics.
Feng, Y., Xie, W., Gu, S., Shao, C., Zhang, W., Yang, Z.,
and Yu, D. (2020). Modeling fluency and faithfulness
for diverse neural machine translation. In Proceed-
ings of the AAAI Conference on Artificial Intelligence,
volume 34, pages 59–66.
Filippova, K. (2020). Controlled hallucinations: Learning
to generate faithfully from noisy data. arXiv preprint
arXiv:2010.05873.
Gunel, B., Zhu, C., Zeng, M., and Huang, X. (2020).
Mind the facts: Knowledge-boosted coherent abstrac-
tive text summarization. ArXiv, abs/2006.15435.
Hancock, B., Bordes, A., Mazare, P.-E., and Weston, J.
(2019). Learning from dialogue after deployment:
Feed yourself, chatbot! In Korhonen, A., Traum,
D., and Màrquez, L., editors, Proceedings of the 57th
Annual Meeting of the Association for Computational
Linguistics, pages 3667–3684. Association for Com-
putational Linguistics.
He, D., Xia, Y., Qin, T., Wang, L., Yu, N., Liu, T.-Y., and
Ma, W.-Y. (2016). Dual learning for machine transla-
tion. Advances in neural information processing sys-
tems, 29.
He, Q., Wu, L., Yin, Y., and Cai, H. (2020). Knowledge-
graph augmented word representations for named en-
tity recognition. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 34, pages
7919–7926.
He, T., Zhang, J., Zhou, Z., and Glass, J. (2021). Exposure
bias versus self-recovery: Are distortions really incre-
mental for autoregressive text generation? In Pro-
ceedings of the 2021 Conference on Empirical Meth-
ods in Natural Language Processing, pages 5087–
5102. Association for Computational Linguistics.
Honovich, O., Aharoni, R., Herzig, J., Taitelbaum, H., Kuk-
liansy, D., Cohen, V., Scialom, T., Szpektor, I., Has-
sidim, A., and Matias, Y. (2022). True: Re-evaluating
factual consistency evaluation. In Proceedings of the
2022 Conference of the North American Chapter of
the Association for Computational Linguistics: Hu-
man Language Technologies, pages 3905–3920.
Hu, W., He, L., Ma, H., Wang, K., and Xiao, J. (2022).
Kgner: Improving chinese named entity recognition
by bert infused with the knowledge graph. Applied
Sciences, 12(15):7702.
Huang, L., Wu, L., and Wang, L. (2020). Knowledge graph-
augmented abstractive summarization with semantic-
driven cloze reward. In Jurafsky, D., Chai, J., Schluter,
N., and Tetreault, J., editors, Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics, pages 5094–5107. Association for Com-
putational Linguistics.
Huang, Y., Feng, X., Feng, X., and Qin, B. (2021). The fac-
tual inconsistency problem in abstractive text summa-
rization: A survey. arXiv preprint arXiv:2104.14839.
James, N. T. and Kannan, R. (2017). A survey on infor-
mation retrieval models, techniques and applications.
International Journals of Advanced Research in Com-
puter Science and Software Engineering ISSN.
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E.,
Bang, Y. J., Madotto, A., and Fung, P. (2023). Survey
of hallucination in natural language generation. ACM
Computing Surveys, 55(12):1–38.
Jiang, R., Banchs, R. E., and Li, H. (2016). Evaluating
and combining name entity recognition systems. In
Proceedings of the sixth named entity workshop, pages
21–27.
Krishna, K., Roy, A., and Iyyer, M. (2021). Hur-
dles to progress in long-form question answering.
In Toutanova, K., Rumshisky, A., Zettlemoyer, L.,
Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R.,
Chakraborty, T., and Zhou, Y., editors, Proceedings of
the 2021 Conference of the North American Chapter
of the Association for Computational Linguistics: Hu-
man Language Technologies, pages 4940–4957. Asso-
ciation for Computational Linguistics.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami,
K., and Dyer, C. (2016). Neural architectures for
named entity recognition. In Proceedings of the 2016
Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Lan-
guage Technologies, pages 260–270.
Lebret, R., Grangier, D., and Auli, M. (2016). Neural text
generation from structured data with application to the
biography domain. arXiv preprint arXiv:1603.07771.
Li, C., Bi, B., Yan, M., Wang, W., and Huang, S. (2021).
Addressing semantic drift in generative question an-
swering with auxiliary extraction. In Zong, C., Xia,
F., Li, W., and Navigli, R., editors, Proceedings of the
59th Annual Meeting of the Association for Compu-
tational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume
2: Short Papers), pages 942–947. Association for
Computational Linguistics.
Li, H., Zhu, J., Zhang, J., and Zong, C. (2018). Ensure the
correctness of the summary: Incorporate entailment
knowledge into abstractive sentence summarization.
In Proceedings of the 27th international conference
on computational linguistics, pages 1430–1441.
Li, M., Roller, S., Kulikov, I., Welleck, S., Boureau, Y.-
L., Cho, K., and Weston, J. (2020). Don’t say that!
making inconsistent dialogue unlikely with unlikeli-
hood training. In Jurafsky, D., Chai, J., Schluter, N.,
and Tetreault, J., editors, Proceedings of the 58th An-
nual Meeting of the Association for Computational
Linguistics, pages 4715–4728. Association for Com-
putational Linguistics.
Maynez, J., Narayan, S., Bohnet, B., and McDonald, R.
(2020). On faithfulness and factuality in abstractive
summarization. In Jurafsky, D., Chai, J., Schluter,
N., and Tetreault, J., editors, Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics, pages 1906–1919. Association for Com-
putational Linguistics.
Mazaré, P.-E., Humeau, S., Raison, M., and Bordes, A.
(2018). Training millions of personalized dialogue
agents. In Riloff, E., Chiang, D., Hockenmaier, J., and
Tsujii, J., editors, Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Process-
ing, pages 2775–2779. Association for Computational
Linguistics.
McCann, B., Bradbury, J., Xiong, C., and Socher, R. (2017).
Learned in translation: Contextualized word vectors.
Advances in neural information processing systems,
30.
Mielke, S. J., Szlam, A., Dinan, E., and Boureau, Y.-
L. (2022). Reducing conversational agents’ over-
confidence through linguistic calibration. Transac-
tions of the Association for Computational Linguis-
tics, 10:857–872.
Müller, M., Gonzales, A. R., and Sennrich, R. (2020). Do-
main robustness in neural machine translation. In Pro-
ceedings of the 14th Conference of the Association for
Machine Translation in the Americas (Volume 1: Re-
search Track), pages 151–164.
Nogueira, R. and Cho, K. (2019). Passage re-ranking with
bert. arXiv preprint arXiv:1901.04085.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
Bleu: a method for automatic evaluation of machine
translation. In Proceedings of the 40th annual meet-
ing of the Association for Computational Linguistics,
pages 311–318.
Parikh, A., Wang, X., Gehrmann, S., Faruqui, M., Dhin-
gra, B., Yang, D., and Das, D. (2020). ToTTo: A
controlled table-to-text generation dataset. In Webber,
B., Cohn, T., He, Y., and Liu, Y., editors, Proceed-
ings of the 2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pages 1173–
1186. Association for Computational Linguistics.
Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin,
A., Wu, Y., and Miller, A. (2019). Language models as
knowledge bases? In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-IJCNLP),
pages 2463–2473.
Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2015).
Sequence level training with recurrent neural net-
works. arXiv preprint arXiv:1511.06732.
Rashkin, H., Reitter, D., Tomar, G. S., and Das, D. (2021).
Increasing faithfulness in knowledge-grounded dia-
logue with controllable features. In Zong, C., Xia,
F., Li, W., and Navigli, R., editors, Proceedings of the
59th Annual Meeting of the Association for Compu-
tational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Vol-
ume 1: Long Papers), pages 704–718. Association for
Computational Linguistics.
Raunak, V., Menezes, A., and Junczys-Dowmunt, M.
(2021). The curious case of hallucinations in neural
machine translation. In Toutanova, K., Rumshisky,
A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I.,
Bethard, S., Cotterell, R., Chakraborty, T., and Zhou,
Y., editors, Proceedings of the 2021 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, pages 1172–1183, Online. Association for Com-
putational Linguistics.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any
classifier. In Proceedings of the 22nd ACM SIGKDD
international conference on knowledge discovery and
data mining, pages 1135–1144.
Roller, S., Dinan, E., Goyal, N., Ju, D., Williamson, M.,
Liu, Y., Xu, J., Ott, M., Smith, E. M., Boureau, Y.-L.,
and Weston, J. (2021). Recipes for building an open-
domain chatbot. In Merlo, P., Tiedemann, J., and Tsar-
faty, R., editors, Proceedings of the 16th Conference
of the European Chapter of the Association for Com-
putational Linguistics: Main Volume, pages 300–325.
Association for Computational Linguistics.
Santhanam, S., Hedayatnia, B., Gella, S., Padmakumar,
A., Kim, S., Liu, Y., and Hakkani-Tür, D. Z. (2021).
Rome was built in 1776: A case study on factual cor-
rectness in knowledge-grounded response generation.
ArXiv, abs/2110.05456.
Schick, T. and Schütze, H. (2021). Exploiting cloze-
questions for few-shot text classification and natural
language inference. In Proceedings of the 16th Con-
ference of the European Chapter of the Association
for Computational Linguistics: Main Volume, pages
255–269.
Schütze, H., Manning, C. D., and Raghavan, P. (2008). In-
troduction to information retrieval, volume 39. Cam-
bridge University Press Cambridge.
Sellam, T., Das, D., and Parikh, A. (2020). BLEURT:
Learning robust metrics for text generation. In Ju-
rafsky, D., Chai, J., Schluter, N., and Tetreault, J.,
editors, Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, pages
7881–7892. Association for Computational Linguis-
tics.
Sennrich, R., Haddow, B., and Birch, A. (2016). Improving
neural machine translation models with monolingual
data. In Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics (Volume 1:
Long Papers), pages 86–96.
Shen, L., Zhan, H., Shen, X., Chen, H., Zhao, X., and Zhu,
X. (2021). Identifying untrustworthy samples: Data
filtering for open-domain dialogues with bayesian op-
timization. In Proceedings of the 30th ACM Interna-
tional Conference on Information & Knowledge Man-
agement, page 1598–1608. Association for Comput-
ing Machinery.
Shuster, K., Poff, S., Chen, M., Kiela, D., and Weston, J.
(2021). Retrieval augmentation reduces hallucination
in conversation. In Moens, M.-F., Huang, X., Spe-
cia, L., and Yih, S. W.-t., editors, Findings of the
Association for Computational Linguistics: EMNLP
2021, pages 3784–3803. Association for Computa-
tional Linguistics.
Specia, L., Hajlaoui, N., Hallett, C., and Aziz, W. (2011).
Predicting machine translation adequacy. In Proceed-
ings of Machine Translation Summit XIII: Papers.
Su, D., Li, X., Zhang, J., Shang, L., Jiang, X., Liu, Q.,
and Fung, P. (2022). Read before generate! faithful
long form question answering with machine reading.
In Muresan, S., Nakov, P., and Villavicencio, A., ed-
itors, Findings of the Association for Computational
Linguistics: ACL 2022, pages 744–756, Dublin, Ire-
land. Association for Computational Linguistics.
Su, W., Tang, Y., Ai, Q., Wang, C., Wu, Z., and Liu, Y.
(2024). Mitigating entity-level hallucination in large
language models. arXiv preprint arXiv:2407.09417.
Sun, W., Shi, Z., Gao, S., Ren, P., de Rijke, M., and Ren, Z.
(2023). Contrastive learning reduces hallucination in
conversations. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 37, pages 13618–
13626.
Tian, R., Narayan, S., Sellam, T., and Parikh, A. P.
(2019). Sticking to the facts: Confident decoding
for faithful data-to-text generation. arXiv preprint
arXiv:1910.08684.
Toral, A., Wieling, M., and Way, A. (2018). Post-editing
effort of a novel with statistical and neural machine
translation. Frontiers in Digital Humanities, 5:9.
Tu, Z., Liu, Y., Shang, L., Liu, X., and Li, H. (2017). Neural
machine translation with reconstruction. In Proceed-
ings of the AAAI Conference on Artificial Intelligence,
volume 31.
Tu, Z., Lu, Z., Liu, Y., Liu, X., and Li, H. (2016). Modeling
coverage for neural machine translation. In Proceed-
ings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 76–85.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in Neural
Information Processing Systems, 30.
Wang, A., Cho, K., and Lewis, M. (2020). Asking and an-
swering questions to evaluate the factual consistency
of summaries. In Jurafsky, D., Chai, J., Schluter, N.,
and Tetreault, J., editors, Proceedings of the 58th An-
nual Meeting of the Association for Computational
Linguistics, pages 5008–5020. Association for Com-
putational Linguistics.
Wang, H. (2020). Revisiting challenges in data-to-text
generation with fact grounding. arXiv preprint
arXiv:2001.03830.
Wang, X., Gao, T., Zhu, Z., Zhang, Z., Liu, Z., Li, J., and
Tang, J. (2021). KEPLER: A unified model for knowl-
edge embedding and pre-trained language representa-
tion. Transactions of the Association for Computa-
tional Linguistics, 9:176–194.
Welleck, S., Weston, J., Szlam, A., and Cho, K. (2019). Dialogue natural language inference. In Korhonen, A., Traum, D., and Màrquez, L., editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3731–3741, Florence, Italy. Association for Computational Linguistics.
Wiseman, S., Shieber, S. M., and Rush, A. M. (2017). Chal-
lenges in data-to-document generation. arXiv preprint
arXiv:1707.08052.
Wu, Z., Galley, M., Brockett, C., Zhang, Y., Gao, X., Quirk,
C., Koncel-Kedziorski, R., Gao, J., Hajishirzi, H., Os-
tendorf, M., and Dolan, B. (2021). A controllable
model of grounded response generation. Proceed-
ings of the AAAI Conference on Artificial Intelligence,
35:14085–14093.
Yavuz, S., Rastogi, A., Chao, G.-L., and Hakkani-Tur, D.
(2019). DeepCopy: Grounded response generation
with hierarchical pointer networks. In Nakamura, S.,
Gasic, M., Zukerman, I., Skantze, G., Nakano, M.,
Papangelis, A., Ultes, S., and Yoshino, K., editors,
Proceedings of the 20th Annual SIGdial Meeting on
Discourse and Dialogue, pages 122–132. Association
for Computational Linguistics.
Yin, J., Jiang, X., Lu, Z., Shang, L., Li, H., and Li, X. (2016). Neural generative question answering. In Iyyer, M., He, H., Boyd-Graber, J., and Daumé III, H., editors, Proceedings of the Workshop on Human-Computer Question Answering, pages 36–42, San Diego, California. Association for Computational Linguistics.
Yu, T., Liu, Z., and Fung, P. (2021). AdaptSum: To-
wards low-resource domain adaptation for abstrac-
tive summarization. In Toutanova, K., Rumshisky,
A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I.,
Bethard, S., Cotterell, R., Chakraborty, T., and Zhou,
Y., editors, Proceedings of the 2021 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, pages 5892–5904. Association for Compu-
tational Linguistics.
Zhang, C., Lee, G., D’Haro, L. F., and Li, H. (2021a). D-
score: Holistic dialogue evaluation without reference.
IEEE/ACM Transactions on Audio, Speech, and Lan-
guage Processing, 29:2502–2516.
Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., and
Weston, J. (2018). Personalizing dialogue agents: I
have a dog, do you have pets too? In Gurevych, I.
and Miyao, Y., editors, Proceedings of the 56th An-
nual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 2204–
2213. Association for Computational Linguistics.
Zhang, Y., Merck, D., Tsai, E., Manning, C. D., and Lan-
glotz, C. (2020a). Optimizing the factual correct-
ness of a summary: A study of summarizing radiol-
ogy reports. In Jurafsky, D., Chai, J., Schluter, N.,
and Tetreault, J., editors, Proceedings of the 58th An-
nual Meeting of the Association for Computational
Linguistics, pages 5108–5120. Association for Com-
putational Linguistics.
Zhang, Y., Sun, S., Galley, M., Chen, Y.-C., Brockett, C.,
Gao, X., Gao, J., Liu, J., and Dolan, B. (2020b).
DialoGPT: Large-scale generative pre-training for
conversational response generation. In Celikyilmaz,
A. and Wen, T.-H., editors, Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics: System Demonstrations, pages 270–278.
Association for Computational Linguistics.
Zhou, B., Richardson, K., Ning, Q., Khot, T., Sabharwal,
A., and Roth, D. (2021). Temporal reasoning on im-
plicit events from distant supervision. In Proceedings
of the 2021 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, pages 1361–1371.
Zhou, C., Neubig, G., Gu, J., Diab, M., Guzman, P., Zettle-
moyer, L., and Ghazvininejad, M. (2020). Detecting
hallucinated content in conditional neural sequence
generation. arXiv preprint arXiv:2011.02593.
Zhou, K., Prabhumoye, S., and Black, A. W. (2018). A
dataset for document grounded conversations. In
Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii,
J., editors, Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing,
pages 708–713. Association for Computational Lin-
guistics.
Zhu, C., Hinthorn, W., Xu, R., Zeng, Q., Zeng, M., Huang,
X., and Jiang, M. (2021). Enhancing factual con-
sistency of abstractive summarization. In Toutanova,
K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D.,
Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T.,
and Zhou, Y., editors, Proceedings of the 2021 Con-
ference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human Lan-
guage Technologies, pages 718–733. Association for
Computational Linguistics.