Hallucinations in LLMs and Resolving Them: A Holistic Approach
Rajarshi Biswas, Sourav Dutta and Dirk Werth
August-Wilhelm Scheer Institute, Uni-Campus D 5 1, 66123 Saarbrücken, Germany
{firstname.lastname}@aws-institut.de
Keywords:
Natural Language Processing, Natural Language Generation, Generative AI.
Abstract:
Generative artificial intelligence has recently attracted tremendous interest across industry and academia, leading to rapid growth. Advances in model architectures, training datasets and large-scale computing enable impressive generative capabilities in text, computer vision and other domains. However, the generative process suffers from challenging artifacts that can cause confusion, create risks or compromise security. In this paper, we explore in detail the problem of inconsistent or hallucinated generation in natural language generation (NLG). We define the problem and survey current techniques for its detection, measurement and mitigation across five tasks: abstractive summarization, question answering, dialogue generation, machine translation, and named entity recognition combined with information retrieval.
1 INTRODUCTION
The emergence of powerful large language models (LLMs) based on deep neural architectures such as Transformers, BERT and GPT is enabling generative artificial intelligence to achieve impressive feats and attract unprecedented attention across the board. Natural language generation (NLG) is one of the primary yet most challenging generative tasks in natural language processing, and it is the focus of LLMs. NLG comprises a wide variety of tasks, such as coherent text generation, summarization, dialogue generation, question answering and translation, that have witnessed rapid growth in the last decade. However, the significant progress in NLG is accompanied by challenges such as a lack of diversity in surface realization, loss of context, and inconsistent or hallucinated generation.
In this work, we concentrate on analyzing hallucinated generation for five major downstream tasks in NLG: abstractive summarization, question answering, dialogue generation, machine translation, and named entity recognition combined with information retrieval. Hallucination is a form of degeneracy that demands attention from the research community. It is a serious issue with generative models in NLG and refers to situations in which the model generates inconsistent or nonsensical text that contradicts the source material, context or objective. It is important to study this phenomenon because generative models like LLMs are being widely adopted in critical services such as health care and banking, where hallucinations can severely limit the performance of deployed models and degrade the quality of service. Moreover, hallucinations can jeopardise the safety of these applications, leading to loss of trust and serious damage. For example, an inconsistent response in a banking application can trigger an incorrect transaction and cause a loss of funds; more seriously, a hallucinated response from an LLM in the health sector can lead to wrong medication or a drug overdose, threatening the life of a patient.
As a consequence, efforts are being made in the community to understand the issue of hallucination, or inconsistent generation, in NLG. However, most studies are directed towards machine translation and text summarization. This leaves a gap in understanding the problem of hallucination from a broader perspective that spans different tasks. In this work we therefore survey the current literature across the five NLG tasks mentioned above. We believe that studying the problem across different tasks leads to a deeper understanding, the formation of a unified view, and the identification of global trends in hallucinated generation. Furthermore, we also discuss different ideas for mitigating inconsistent generation in the five NLG tasks studied.
We organize the rest of the paper as follows. Section 2 describes the different variants of hallucination in NLG and the factors contributing to it.
Sections 3 to 7 survey current efforts in understanding hallucination in abstractive summarization, question answering, dialogue generation, machine translation, and named entity recognition combined with information retrieval. Section 8 discusses different ways of resolving the problem of hallucinations through holistic as well as task-specific measures. Section 9 then discusses potential future research directions for managing hallucinations more effectively. Finally, we summarize our findings in Section 10.
2 HALLUCINATION: VARIANTS
AND CONTRIBUTING
FACTORS
In this section, we briefly describe the different variants of hallucination and the factors contributing to it. In the context of natural language processing, hallucination is defined as automatically generated content that is nonsensical or unfaithful to the source content (Filippova, 2020; Maynez et al., 2020; Parikh et al., 2020; Zhou et al., 2020). Depending on the task, prior work divides it into two categories, intrinsic and extrinsic hallucination (Dziri et al., 2021; Huang et al., 2021; Maynez et al., 2020). In the first category, the generated output directly contradicts the input source and can therefore be classified as erroneous generation. Extrinsic hallucination refers to generations that cannot be verified against the source content; such output is not necessarily incorrect, but it is still problematic and poses a safety risk. The primary factors contributing to hallucination in NLG are the data sources and model training choices. On the data side, heuristic data collection (Lebret et al., 2016; Wiseman et al., 2017; Parikh et al., 2020; Wang, 2020) or tasks that require diversity in the generations, e.g., open-domain dialogue generation in a subjective tone (Rashkin et al., 2021), lead to source-output divergence. This divergence is one of the key contributing factors behind hallucination. Training-related factors causing hallucination include faulty representation learning, erroneous decoding, exposure bias and parametric-knowledge bias. For instance, an encoder learning wrong correlations (Li et al., 2018; Feng et al., 2020) or building a faulty understanding of the input (Parikh et al., 2020) can lead to inconsistent generations. Similarly, attending to the wrong part of the encoded information, or decoding strategies aimed at improving diversity, can result in hallucination (Tian et al., 2019). The problem of exposure bias (Bengio et al., 2015; Ranzato et al., 2015), i.e., the disparity between decoding during training and during inference, also leads to inconsistency: maximum-likelihood training conditions next-token prediction on ground-truth prefixes, whereas inference conditions it on the model's self-generated history (He et al., 2021).
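To make the exposure-bias discussion concrete, the following minimal Python sketch illustrates scheduled sampling (Bengio et al., 2015), one common remedy: during training, the decoder is sometimes fed its own previous prediction instead of the ground-truth prefix. The `predict_next` function is a hypothetical placeholder for a model's decoding step, not part of any cited system.

```python
import random

def predict_next(prefix):
    # Hypothetical model step: would return the model's most likely next token.
    return "<model_token>"

def decode_with_scheduled_sampling(target_tokens, teacher_forcing_prob=0.75):
    """Build decoder inputs, mixing gold tokens and self-generated tokens."""
    prefix, decoder_inputs = [], []
    for gold_token in target_tokens:
        decoder_inputs.append(list(prefix))        # what the decoder conditions on at this step
        use_gold = random.random() < teacher_forcing_prob
        next_token = gold_token if use_gold else predict_next(prefix)
        prefix.append(next_token)                  # feed either the gold or the model token forward
    return decoder_inputs

# With probability 0.5 each prefix position comes from the model itself,
# exposing the decoder to its own (possibly erroneous) history during training.
print(decode_with_scheduled_sampling(["the", "cat", "sat"], teacher_forcing_prob=0.5))
```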
3 ABSTRACTIVE
SUMMARIZATION
Hallucination: In NLP, abstractive summarization refers to the task of generating a short, concise summary that captures the relevant details of the source text (Yu et al., 2021). Even though neural approaches have achieved much success on this task, recent studies find that they generate inconsistent or hallucinated content (Falke et al., 2019; Maynez et al., 2020). Moreover, it has been observed that generated summaries containing a large number of inconsistencies can still obtain very high ROUGE scores. These findings underscore the importance of studying the problem of hallucination in this task.
Measurement: The degree of inconsistency in generated summaries is measured using metrics that are mostly model based. These can be categorized into unsupervised and semi-supervised metrics. The unsupervised metrics can be further classified into information extraction based, natural language inference based, and question-answering based metrics. Information extraction based methods extract details in the form of relation tuples from both the source and the generated summary in order to verify factual accuracy. In a similar vein, question-answering based metrics measure factual consistency between source and output by generating pertinent questions that are expected to produce similar answers from both. In general, these metrics follow three steps: question generation from the generated output, answer extraction from the source and the output, and scoring the agreement between the answers obtained from the two. In contrast, natural language inference based metrics assume that a faithful summary should be entailed by the source document.
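As an illustration of the three-step question-answering based metrics described above, the sketch below assumes a Hugging Face extractive QA pipeline is available; `generate_questions` is a hypothetical placeholder for a question-generation model, and the token-level F1 scoring is a simplification of what published metrics use.

```python
from collections import Counter
from transformers import pipeline

qa = pipeline("question-answering")  # default extractive QA model from Hugging Face

def generate_questions(summary):
    # Placeholder: a question-generation model would produce questions about facts in the summary.
    return ["Who announced the merger?", "When was it announced?"]

def token_f1(pred, gold):
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def qa_consistency(source, summary):
    scores = []
    for question in generate_questions(summary):              # step 1: questions from the output
        ans_src = qa(question=question, context=source)        # step 2: answer from the source...
        ans_sum = qa(question=question, context=summary)       # ...and from the summary
        scores.append(token_f1(ans_sum["answer"], ans_src["answer"]))  # step 3: score agreement
    return sum(scores) / len(scores) if scores else 0.0
```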
Resolution: Practitioners in abstractive summarization use various techniques to cope with this issue. For example, graph neural networks are used in (Zhu et al., 2021) to encode facts from the source text, and reward functions are integrated in (Huang et al., 2020) to better capture interactions between entities in the source. External knowledge, obtained by embedding facts from Wikipedia, is also used in (Gunel et al., 2020) to improve factual consistency.
The work in (Aralikatte et al., 2021) proposes a focus-attention mechanism that encourages decoders to generate tokens related to the facts or topic of the source. Keeping with attention-based methods, the work in (Cao et al., 2018) uses a dual-attention sequence-to-sequence framework to ensure that generated summaries take into account both the source text and the facts extracted from it. Contrastive learning is used in (Cao and Wang, 2021) to enable models to distinguish between positive ground-truth summaries and automatically generated negative summaries containing factual inconsistencies or hallucinations. In addition, post-processing is employed to remove inconsistent facts from the generated summaries.
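To illustrate the contrastive-learning idea of (Cao and Wang, 2021), the sketch below shows one possible InfoNCE-style objective over summary embeddings. It is a rough approximation under the assumption that an encoder produces fixed-size embeddings for the source, the reference summary and perturbed negative summaries; it is not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(src_emb, pos_emb, neg_embs, temperature=0.1):
    """src_emb, pos_emb: shape (d,); neg_embs: shape (n_neg, d)."""
    candidates = torch.cat([pos_emb.unsqueeze(0), neg_embs], dim=0)       # (1 + n_neg, d)
    sims = F.cosine_similarity(src_emb.unsqueeze(0), candidates, dim=-1)  # similarity to each candidate
    logits = sims / temperature
    target = torch.tensor([0])            # index 0 is the faithful (positive) summary
    return F.cross_entropy(logits.unsqueeze(0), target)

# Toy embeddings with illustrative dimensions; in practice these come from the summarization encoder.
d = 16
src = torch.randn(d, requires_grad=True)
pos = torch.randn(d, requires_grad=True)
negs = torch.randn(4, d, requires_grad=True)
loss = contrastive_loss(src, pos, negs)
loss.backward()  # in training, this term would be added to the usual MLE loss
print(loss.item())
```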
4 QUESTION-ANSWERING
Hallucination: Generative question answering is gaining prominence with the growing success of generative artificial intelligence. It is more powerful and effective than first-generation question-answering systems, which merely tried to find facts in the source text that answer the question. The objective of generative question answering is to frame more detailed and complete answers, which may require gathering information from across the source. As a result, the system sometimes needs to consult multiple source documents, since a single document may not contain all the information needed to frame a definitive answer. However, this process can induce hallucination as an adverse side effect, since some of these documents may contain extraneous or contradictory information. The closest thing to a definition of hallucination in generative question answering is semantic drift (Li et al., 2021), which describes how a generated answer drifts away from the correct answer during generation. Beyond this, the majority of works in this area rely on human evaluation of the factual correctness of generated answers as a measure of inconsistency.
Measurement: Hallucination in generative question answering is measured using the Semantic Overlap metric (Sellam et al., 2020), a BERT-based metric that correlates with human judgment. Factual correctness is also used to measure the consistency between generated text and the source document using information extraction (Zhang et al., 2020a). An automatic question-answering based metric has been proposed (Durmus et al., 2020; Wang et al., 2020) for measuring consistency in generated summaries. In this approach, question-answer pairs are first created from the generated summary using a question generation model. Subsequently, a model extracts answers to those questions from the source document. If the answers do not match, the generated summary is regarded as unfaithful. The same technique is also used to measure hallucination in generative question answering. Apart from these metrics, human evaluation is frequently used in this field to assess the consistency or faithfulness of generated answers. Human evaluation is also often used to complement automatic n-gram overlap metrics such as BLEU, ROUGE and METEOR, as these correlate poorly with human judgments.
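The Semantic Overlap metric above is the learned, BERT-based BLEURT; as a rough stand-in only, the sketch below scores a generated answer against a reference with sentence-embedding cosine similarity using the sentence-transformers library. The checkpoint name is just a commonly used example, not the metric from the cited work.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose sentence encoder

def semantic_overlap(generated: str, reference: str) -> float:
    # Embed both texts and return their cosine similarity as a crude overlap score.
    emb = model.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

print(semantic_overlap(
    "The treaty was signed in 1992 by twelve member states.",
    "Twelve member states signed the treaty in 1992.",
))
```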
Resolution: Techniques for resolving hallucination in generative question answering concentrate on leveraging external knowledge bases and information resources to improve the factual correctness or faithfulness of the generated answers. One approach (Bi et al., 2019) generates answers by accumulating information from multiple sources, such as knowledge bases, passages, the vocabulary and the questions themselves. A neural model is used in (Yin et al., 2016) to generate answers to factoid questions using information from a knowledge base. More recent approaches (Fan et al., 2019) build an individual knowledge graph for every question, condensing information and reducing redundancy in order to tackle hallucination. Another method (Li et al., 2021) extracts a rationale for the answer during encoding and biases the decoder to generate the answer from both the rationale and the actual input. To reduce hallucination in the answers, the authors in (Krishna et al., 2021) propose a sparse attention-based transformer model as the answer generator for effectively handling the retrieved documents; it models long-range dependencies using local attention and mini-batch k-means clustering. Similarly, to mitigate hallucination, (Su et al., 2022) proposes a framework that jointly models answer generation and machine reading: the machine reading module supplies salient answer-related information to the generation model, improving the faithfulness of the generated answer.
5 DIALOGUE GENERATION
Hallucination: Dialogue generation is probably the most widely adopted generation task in natural language processing, with wide-ranging applications such as chatbots and voice assistants. It can be broadly categorized into task-specific and open-domain dialogue generation. In task-specific dialogue, responses are expected
to contain specific information, while in open-domain dialogue an engaging response is desired, without too much repetition of the conversational history over relatively long conversations. Owing to this nature, the tolerance for hallucination is higher in this task than in other generation tasks. Hallucination in dialogue generation is considered intrinsic if specific information is absent from, or misrepresented in, the generated response. If the generated conversation is not firmly grounded in hard facts and cannot easily be verified against knowledge bases or the conversational history, it is termed extrinsic hallucination. In our work, we discuss the problems related to open-domain dialogue generation, as it is more relevant to modern dialogue systems built on state-of-the-art LLMs trained on huge amounts of data. In open-domain dialogue systems there are broadly two sources of hallucination. First, responses that contradict previous responses from the same system, leading to inconsistency (Li et al., 2020; Welleck et al., 2019; Zhang et al., 2021a) or incoherence (Beyer et al., 2021; Dziri et al., 2019), are termed self-inconsistency. Second, when the system generates responses that are inconsistent with an external source, e.g., factually incorrect responses, this is termed external inconsistency (Mielke et al., 2022; Roller et al., 2021). Another factor influencing inconsistency in open-domain dialogue generation is a lack of consistency in the persona or character assumed by the dialogue system, which often leads to contradictions and in turn to hallucinations. As a result, there is research (Hancock et al., 2019; Mazaré et al., 2018; Yavuz et al., 2019; Zhang et al., 2020b) on developing persona-consistent systems with the help of suitable datasets (Dinan et al., 2019a; Zhang et al., 2018). Additionally, there are works in open-domain dialogue generation that use external knowledge bases and graphs to generate informative responses (Dinan et al., 2019b; Zhou et al., 2018). Hallucination in such systems is treated as factual inconsistency and has received an equal amount of attention from the dialogue generation community (Dziri et al., 2021; Rashkin et al., 2021; Santhanam et al., 2021; Shuster et al., 2021).
Measurement: Evaluating hallucination in open-domain dialogue generation is still an open problem, as there is no standard metric for measuring it. Dialogue systems such as chatbots are often evaluated in terms of factual correctness or consistency. Automated metrics used for this purpose include Knowledge F1 and Rare F1 (Shuster et al., 2021), both of which are statistics based, while other techniques are model based. Knowledge F1 relies on datasets in which the knowledge is labeled, i.e., gold-standard knowledge sentences that a person consulted during dataset collection, and measures the overlap between the generated response and these gold knowledge sentences. The metric thus tries to capture whether generated responses make use of the available knowledge and therefore make sense. Rare F1 computes the F1 score only over infrequent words in the dataset, in order to negate the influence of common unigrams. However, overlap-based metrics cannot provide a comprehensive evaluation, since the same semantic meaning can be expressed through a wide variety of surface realizations. To address this, different model-based techniques have been proposed for measuring consistency, for example natural language inference (NLI) (Dziri et al., 2019; Welleck et al., 2019), learnable evaluation metrics (Zhang et al., 2021b), or an additional test for measuring coherence (Beyer et al., 2021). These methods offer more flexibility and can handle generations with different surface realizations.
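The overlap-based dialogue metrics above can be sketched in a few lines of plain Python; whitespace tokenization and the rarity threshold are simplifications of the setup in (Shuster et al., 2021).

```python
from collections import Counter

def unigram_f1(pred_tokens, gold_tokens):
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * p * r / (p + r)

def knowledge_f1(response, gold_knowledge):
    # Unigram F1 between the generated response and the gold knowledge sentence.
    return unigram_f1(response.lower().split(), gold_knowledge.lower().split())

def rare_f1(response, gold_knowledge, corpus_counts, max_count=100):
    # Same F1, but restricted to tokens that are infrequent in the training corpus.
    rare = lambda toks: [t for t in toks if corpus_counts.get(t, 0) <= max_count]
    return unigram_f1(rare(response.lower().split()), rare(gold_knowledge.lower().split()))

corpus_counts = Counter("the treaty was signed the treaty entered into force".split())
print(knowledge_f1("the treaty was signed in 1992", "the treaty was signed in rome in 1992"))
print(rare_f1("the treaty was signed in 1992", "the treaty was signed in rome in 1992", corpus_counts))
```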
Resolution: The problem of hallucination in open-domain dialogue generation can be mitigated using different techniques. One way is to act on the training data. The authors in (Shen et al., 2021) propose a measure based on features of dialogue quality which can be used to remove low-scoring samples from the training set, which in turn improves self-consistency. Retrieval is used to augment dialogue generation approaches such as knowledge-grounded dialogue, where it performs knowledge selection and helps to reduce hallucination substantially (Shuster et al., 2021). Control codes concatenated with the dialogue inputs are proposed in (Rashkin et al., 2021) for reducing hallucination, making the model more aware of how its generations rely on evidence grounded in knowledge. Improved dialogue modeling techniques have also been studied for reducing hallucination during generation, e.g., the use of inductive attention in transformer-based dialogue models (Wu et al., 2021).
6 MACHINE TRANSLATION
Hallucination: Machine translation (MT) refers to
the automatic conversion of text from one language
into another, aiming for both grammatical accuracy
and semantic fidelity (Bahdanau, 2014). While neural
machine translation (NMT) models have dramatically
improved translation quality, particularly with the
Hallucinations in LLMs and Resolving Them: A Holistic Approach
107
advent of transformer-based architectures (Vaswani
et al., 2017), they are still prone to generating hal-
lucinations. These hallucinations occur when the sys-
tem introduces information that is not present in the
source text, or mistranslates critical content, leading
to outputs that may seem fluent but are semantically
incorrect or inconsistent (Raunak et al., 2021; Müller et al., 2020). These errors are particularly prevalent
in low-resource language pairs and in cases where the
model overfits to patterns in the training data. Hal-
lucinations in machine translation can severely im-
pact the reliability of translations, especially in criti-
cal domains such as legal, medical, or technical fields,
where accuracy is paramount (Raunak et al., 2021).
Measurement: Evaluating hallucinations in machine
translation poses a unique challenge, as traditional
metrics like BLEU (Papineni et al., 2002) or ME-
TEOR (Banerjee and Lavie, 2005), which compare
the machine output to reference translations, may not
effectively capture the degree of hallucination. Re-
cent studies have proposed new approaches to better
measure hallucinations, including both model-based
and human evaluation metrics. One common ap-
proach involves using adequacy-based human eval-
uation, where human annotators judge how well
the translation aligns with the source content (Spe-
cia et al., 2011). For automated measurement, source-target alignment techniques can identify mistranslations or extraneous information by comparing how the source and the output align, thereby ensuring fidelity between the input and output sequences (He et al., 2016). NLI (natural language inference) model-based metrics (Zhou et al., 2021), primarily aimed at fact-checking and aligning generated text, can detect hallucinations in generated outputs. Such methods compare the translated content (hypothesis) against the source (premise) for
contradictions or factual inaccuracies. Another ap-
proach uses confidence-based filtering, where low-
confidence outputs from the translation model are
flagged as potentially hallucinatory (Tu et al., 2017).
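As a hedged illustration of the NLI-based checks described above, the sketch below feeds a premise and a hypothesis to an off-the-shelf NLI classifier and flags a high contradiction probability. It assumes the roberta-large-mnli checkpoint is available and that the hypothesis has already been mapped into the premise's language (e.g., via back-translation), since a monolingual NLI model cannot compare across languages.

```python
from transformers import pipeline

# Assumed checkpoint; any MNLI-style classifier with ENTAILMENT/NEUTRAL/CONTRADICTION labels would do.
nli = pipeline("text-classification", model="roberta-large-mnli", top_k=None)

def contradiction_score(premise: str, hypothesis: str) -> float:
    result = nli({"text": premise, "text_pair": hypothesis})
    scores = result[0] if isinstance(result[0], list) else result  # handle both output shapes
    return next(s["score"] for s in scores if s["label"].upper() == "CONTRADICTION")

if contradiction_score("The meeting was postponed to Friday.",
                       "The meeting took place on Monday as planned.") > 0.5:
    print("possible hallucination or mistranslation")
```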
Resolution: Addressing hallucinations in machine
translation involves both improving the underlying
model architecture and leveraging external resources.
One promising approach is data augmentation, par-
ticularly for low-resource languages, which can help
mitigate hallucinations caused by insufficient train-
ing data (Sennrich et al., 2016). In addition, back-
translation, where the model translates target lan-
guage sentences back into the source language and
compares them to the original text, has been used to
reduce inconsistencies (Edunov et al., 2018). Other
efforts focus on improving the attention mechanisms
within transformers. For example, coverage mecha-
nisms have been employed to ensure that every part of
the source sentence is attended to during translation,
reducing the likelihood that the model will “invent”
content not present in the source (Tu et al., 2016). Incorporating external knowledge bases has also been explored: integrating knowledge embeddings into tasks such as translation helps maintain factual consistency and reduces the risk of hallucination, especially in technical or specialized content (Wang et al., 2021). Moreover, the use of reinforce-
ment learning for sequence prediction tasks, includ-
ing NMT, shows how reward functions can be tai-
lored to encourage factual accuracy, reducing issues
like hallucination during translation (Bahdanau et al.,
2022). Finally, post-editing techniques, where hu-
man editors review and correct translations, are often
employed in high-stakes scenarios to ensure final out-
put quality, especially when dealing with critical con-
tent (Toral et al., 2018).
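A minimal round-trip check in the spirit of back-translation can serve as a cheap hallucination signal. The sketch below assumes the Helsinki-NLP opus-mt English-German checkpoints are available and uses a crude unigram-overlap score rather than a proper adequacy metric.

```python
from transformers import pipeline

# Assumed checkpoints for the two translation directions.
en_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def round_trip_overlap(source_en: str) -> float:
    translated = en_de(source_en)[0]["translation_text"]   # translate into the target language
    back = de_en(translated)[0]["translation_text"]         # translate back into the source language
    src_tokens, back_tokens = set(source_en.lower().split()), set(back.lower().split())
    return len(src_tokens & back_tokens) / max(len(src_tokens), 1)

score = round_trip_overlap("The patient should take two tablets after every meal.")
print(f"round-trip overlap: {score:.2f}")  # low values suggest content was lost or invented
```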
7 NAMED ENTITY
RECOGNITION AND
INFORMATION RETRIEVAL
Hallucination: Named Entity Recognition (NER) is
a fundamental NLP task aimed at identifying and
classifying proper nouns such as people, organiza-
tions, and locations within a text (Lample et al.,
2016). Despite significant progress with neural mod-
els, these systems can still exhibit hallucinations,
where entities are misclassified or incorrectly gener-
ated. For example, models might mistakenly recog-
nize a non-existent entity or mislabel a correct entity
due to insufficient context or model limitations (Su
et al., 2024). This misclassification can impact appli-
cations relying on accurate entity identification, such
as information extraction and semantic search. In In-
formation Retrieval (IR), the objective is to retrieve
documents or data that are relevant to a user’s query
(Schütze et al., 2008). Although neural IR models
have improved the relevance and ranking of retrieved
results, they can occasionally retrieve documents that
are irrelevant or hallucinated, meaning the retrieved
results do not genuinely align with the user’s query
intent (Nogueira and Cho, 2019; James and Kannan,
2017). These hallucinated results can stem from over-
fitting on training data or from inadequacies in the
query-document matching process.
Measurement: To measure hallucinations in NER,
various evaluation metrics are employed. Precision,
recall, and F1-score are commonly used to compare
the predicted entities against a gold standard anno-
tated dataset. Precision measures the proportion of
correctly identified entities out of all entities iden-
tified by the model, recall measures the proportion
of correctly identified entities out of all entities that
should have been identified, and F1-score provides
a balance between precision and recall. Unsuper-
vised metrics also play a role, such as entity linking,
where entities recognized by the model are matched
against external knowledge bases to verify their cor-
rectness. Cross-document consistency checks can
further identify discrepancies by ensuring that enti-
ties are consistently recognized across multiple doc-
uments (Jiang et al., 2016). For IR, effectiveness is
measured through metrics such as Precision@K, Re-
call@K, and Mean Reciprocal Rank (MRR). Pre-
cision@K measures the proportion of relevant docu-
ments among the top K retrieved documents, while
Recall@K assesses the proportion of relevant docu-
ments retrieved within the top K results. MRR eval-
uates the rank of the first relevant document in the
list. Additionally, query-document relevance scoring,
which involves assessing the alignment between the
query and the retrieved documents, and external val-
idation against curated datasets are used to gauge re-
trieval accuracy and address issues of hallucination
(Schütze et al., 2008; Nogueira and Cho, 2019).
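The standard metrics above are straightforward to compute; the following plain-Python sketch treats NER outputs as sets of (span, label) pairs and IR results as ranked lists of document ids, which is a simplification of full evaluation protocols such as CoNLL-style scoring.

```python
def ner_prf(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def precision_at_k(ranked, relevant, k):
    return len([d for d in ranked[:k] if d in relevant]) / k

def recall_at_k(ranked, relevant, k):
    return len([d for d in ranked[:k] if d in relevant]) / max(len(relevant), 1)

def mean_reciprocal_rank(rankings, relevant_sets):
    reciprocal_ranks = []
    for ranked, relevant in zip(rankings, relevant_sets):
        rank = next((i + 1 for i, d in enumerate(ranked) if d in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

print(ner_prf([("Berlin", "LOC"), ("Acme", "ORG")], [("Berlin", "LOC"), ("ACME Corp", "ORG")]))
print(mean_reciprocal_rank([["d3", "d1", "d7"]], [{"d1"}]))  # -> 0.5
```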
Resolution: Addressing hallucinations in NER in-
volves several advanced techniques. Contextual em-
beddings from models such as BERT (Devlin et al.,
2019) capture richer semantic information by provid-
ing context-dependent representations of words. This
approach improves the accuracy of entity recognition
by understanding the context in which entities appear.
Multi-task learning, which involves training models
on related tasks simultaneously, helps enhance entity
recognition by leveraging additional sources of infor-
mation (McCann et al., 2017). Integrating external
knowledge sources like knowledge graphs can also
reduce hallucinations by grounding the entity recog-
nition process in real-world data (He et al., 2020).
In IR, techniques to mitigate hallucination include
employing advanced retrieval architectures such as
dense retrievers and cross-encoder models. Dense re-
trievers use dense vector representations for query-
document matching, which improves the relevance
ranking of retrieved documents (Nogueira and Cho,
2019). Cross-encoder models, which jointly encode
the query and documents, further refine retrieval by
capturing complex relationships between them. Addi-
tionally, incorporating user feedback and techniques
like query expansion, where additional terms or con-
text are added to the query, helps refine retrieval re-
sults and address issues of hallucination (Azad and
Deepak, 2019).
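As a hedged sketch of the dense-retrieval idea, the code below embeds a query and a small document collection with a bi-encoder from the sentence-transformers library and ranks documents by cosine similarity. The checkpoint name is only an example; production systems would add an approximate-nearest-neighbour index and cross-encoder re-ranking.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example bi-encoder checkpoint

documents = [
    "Aspirin is commonly used to reduce fever and relieve mild pain.",
    "The Treaty of Rome was signed in 1957.",
    "Photosynthesis converts light energy into chemical energy in plants.",
]
doc_emb = encoder.encode(documents, convert_to_tensor=True)

def retrieve(query: str, k: int = 2):
    q_emb = encoder.encode([query], convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_emb, top_k=k)[0]   # cosine-similarity ranking
    return [(documents[h["corpus_id"]], h["score"]) for h in hits]

print(retrieve("What drug helps with a headache?"))
```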
8 APPROACHES TO RESOLVING
HALLUCINATIONS
The motivation behind this paper stems from the
growing reliance on Large Language Models (LLMs)
across a wide range of NLP tasks. While these models
have demonstrated remarkable advancements, they
also introduce a critical challenge: hallucinations.
Across tasks like abstractive summarization, ques-
tion answering, dialog generation, machine transla-
tion, NER, and information retrieval, hallucinations
manifest in various forms, from generating factual
inaccuracies to retrieving irrelevant or fabricated in-
formation. Despite significant progress in mitigating
these issues, hallucination remains a pervasive prob-
lem that compromises the reliability of LLMs in real-
world applications (Ji et al., 2023). The primary mo-
tivation for this paper is the need for a comprehen-
sive, cross-task analysis of hallucinations in LLMs.
While hallucinations in specific tasks such as summa-
rization or machine translation have been studied in
isolation (Raunak et al., 2021), there has been little
effort to systematically explore hallucinations across
multiple NLP tasks, each with its unique characteris-
tics and challenges. This paper aims to fill that gap
by providing a detailed investigation into the nature
of hallucinations in five distinct tasks, as well as out-
lining the current methods to detect and resolve them.
Our contribution is twofold: (1) a consolidated review of hallucination across different NLP tasks, and (2) a proposal of task-agnostic and task-specific approaches to resolving hallucinations, thereby providing a framework for future research.
8.1 Holistic Approach
While techniques that are task-specific, such as exter-
nal knowledge integration (Zhu et al., 2021) or using
better reward mechanisms (Chen et al., 2023), have
shown promise, we propose a more holistic approach
that could benefit all tasks:
Improving Model Interpretability: A crucial chal-
lenge is the black-box nature of LLMs, which makes
hallucinations difficult to predict or prevent. Im-
plementing interpretability mechanisms like atten-
tion visualization or rule-based model auditing can
help identify when and why hallucinations occur (Be-
linkov and Glass, 2019). Models like BERT, GPT,
and their variants could be enhanced with transpar-
ent architectures that allow for more insight into their
decision-making process, especially in tasks prone to
hallucination, like dialog generation and summariza-
tion (Ribeiro et al., 2016).
Task-Agnostic Regularization: Regularization tech-
niques, like fact-checking or constraint-based gen-
eration, should be applied consistently across tasks.
For example, incorporating external knowledge bases,
such as Wikipedia or structured databases, can help
ground generated outputs in factual information,
thereby reducing hallucination in both generative
(summarization, QA) and retrieval-based tasks (IR,
NER) (Petroni et al., 2019). This approach prevents
the model from generating content that strays too far
from verifiable truth, creating a safeguard against fab-
ricated information.
Adaptive Fine-Tuning for Specific Tasks: Although
LLMs are designed to generalize across tasks, fine-
tuning them on domain-specific data can significantly
reduce hallucinations. In tasks like machine transla-
tion and information retrieval, training models on spe-
cialized datasets and including domain-relevant enti-
ties can lead to more accurate and contextually ap-
propriate outputs (Sun et al., 2023). This reduces the
likelihood of hallucinating irrelevant or incorrect in-
formation, particularly when the task demands high
precision.
Evaluation and Feedback Mechanisms: One con-
sistent theme across tasks is the need for robust eval-
uation metrics. ROUGE, BLEU, and MRR are of-
ten insufficient to detect hallucinations because they
focus on fluency and surface-level similarities (Hon-
ovich et al., 2022). We suggest augmenting these met-
rics with fact-based or entity-level verification mech-
anisms. For instance, in question answering, auto-
matic fact-checking systems could be integrated to
score models on factual consistency, while in summa-
rization and translation, knowledge graphs could be
employed to cross-validate entity relationships (Cao
et al., 2020).
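One simple instance of the entity-level verification suggested above is to flag named entities in the generated text that never appear in the source. The sketch below uses spaCy for entity extraction (assuming the en_core_web_sm model is installed) and deliberately naive string matching.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm

def unsupported_entities(source: str, generated: str):
    # Entities in the generated text whose surface form does not occur in the source.
    source_lower = source.lower()
    return [ent.text for ent in nlp(generated).ents
            if ent.text.lower() not in source_lower]

source = "Acme Corp reported a 5% rise in quarterly revenue."
generated = "Acme Corp, led by CEO Jane Smith, reported a 5% rise in revenue in Berlin."
print(unsupported_entities(source, generated))  # e.g. ['Jane Smith', 'Berlin']
```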
8.2 Task-Specific Considerations
Certain tasks, due to their inherent complexity and the
nature of the data they process, require tailored so-
lutions to effectively mitigate hallucinations. These
solutions address the unique challenges of each task,
allowing models to generate more accurate and con-
textually appropriate outputs.
Named Entity Recognition (NER): NER systems
are prone to hallucinations when they mislabel en-
tities or identify non-existent ones, especially in do-
mains where new entities frequently emerge, such as
healthcare, finance, or geopolitics. Grounding NER
models in dynamic, real-world knowledge bases, such
as Wikidata or domain-specific databases, can help
ensure that entity identification remains accurate and
up-to-date (Hu et al., 2022). By continuously updat-
ing the knowledge base and training the model on
evolving data, hallucinations can be reduced as the
system remains aware of the latest entities and their
relationships. Furthermore, integrating context-aware
mechanisms, where entity recognition adapts based
on sentence-level or document-level context, can im-
prove accuracy and minimize misidentifications, par-
ticularly in ambiguous scenarios where multiple enti-
ties are involved.
Machine Translation: Machine translation systems
are susceptible to hallucinations, particularly when
translating between languages with significant struc-
tural differences or when translating low-resource
languages. Ensuring linguistic consistency across lan-
guages is crucial for reducing hallucinations. One
approach is incorporating post-editing frameworks
where human translators verify and correct machine-
generated translations, thereby maintaining transla-
tion quality and factual accuracy. In addition, con-
trastive learning techniques, which explicitly train
the model to recognize and avoid incorrect or out-
of-context translations, can help minimize semantic
drift—the phenomenon where the translation strays
from the intended meaning (Raunak et al., 2021).
This can be particularly useful when translating spe-
cialized texts, such as legal or medical documents,
where precision is paramount.
Dialog Generation: Hallucinations in dialog gener-
ation often result in models producing off-topic, in-
coherent, or factually incorrect responses. One of the
primary challenges is maintaining the consistency and
coherence of conversations over multiple turns. In-
tegrating persona mechanisms—where the model is
conditioned on a set of attributes or knowledge about
the user—can help ground responses in the user’s
context, reducing the likelihood of irrelevant or in-
consistent replies (Zhang et al., 2020b). Addition-
ally, context memory mechanisms, which allow the
model to retain and reference information from ear-
lier in the conversation, can ensure that subsequent
responses stay coherent and relevant. By maintaining
a memory of the dialog history, models can avoid in-
troducing new, unrelated information that could lead
to hallucination.
9 FUTURE WORK
Moving forward, we envision research focusing on
hybrid models that combine symbolic reasoning with
deep learning. This could address hallucinations by
introducing structured knowledge into the generative
process (Chen et al., 2020). Additionally, cross-
lingual hallucination detection in translation tasks and
further exploration into self-supervised fact-checking
methods for QA and summarization will likely en-
hance model robustness. Ultimately, addressing hal-
lucinations requires a concerted effort that combines
advances in model architectures, training strategies,
and evaluation techniques. Our work highlights the
importance of a unified approach to tackling halluci-
nations in LLMs, with the aim of developing models
that are not only powerful but also reliable and trust-
worthy (Schick and Schütze, 2021).
10 CONCLUSION
In this paper, we explored the challenge of hallucinations in Large Language Models (LLMs) across five key NLP tasks: abstractive summarization, question answering, dialog generation, machine translation, and named entity recognition combined with information retrieval.
Despite advances in these tasks, hallucinations remain
a persistent problem, undermining model reliability.
We provided a comprehensive review of task-specific
manifestations, metrics, and methods to address hal-
lucinations, and proposed a unified framework that
emphasizes interpretability, regularization, and fine-
tuning. Moving forward, addressing hallucinations
will be crucial for improving the trustworthiness and
applicability of LLMs in real-world scenarios.
ACKNOWLEDGEMENTS
This research is funded in part by the “Bundesministerium für Wirtschaft und Klimaschutz” within the project “MERLOT”, which was funded under the project reference 68GX21008K.
REFERENCES
Aralikatte, R., Narayan, S., Maynez, J., Rothe, S., and Mc-
Donald, R. (2021). Focus attention: Promoting faith-
fulness and diversity in summarization. In Zong, C.,
Xia, F., Li, W., and Navigli, R., editors, Proceed-
ings of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th Interna-
tional Joint Conference on Natural Language Pro-
cessing (Volume 1: Long Papers), pages 6078–6095.
Association for Computational Linguistics.
Azad, H. K. and Deepak, A. (2019). Query expansion tech-
niques for information retrieval: a survey. Information
Processing & Management, 56(5):1698–1735.
Bahdanau, D. (2014). Neural machine translation by
jointly learning to align and translate. arXiv preprint
arXiv:1409.0473.
Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R.,
Pineau, J., Courville, A., and Bengio, Y. (2022). An
actor-critic algorithm for sequence prediction. In In-
ternational Conference on Learning Representations.
Banerjee, S. and Lavie, A. (2005). Meteor: An automatic
metric for mt evaluation with improved correlation
with human judgments. In Proceedings of the acl
workshop on intrinsic and extrinsic evaluation mea-
sures for machine translation and/or summarization,
pages 65–72.
Belinkov, Y. and Glass, J. (2019). Analysis methods in neu-
ral language processing: A survey. Transactions of the
Association for Computational Linguistics, 7:49–72.
Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015).
Scheduled sampling for sequence prediction with re-
current neural networks. Advances in neural informa-
tion processing systems, 28.
Beyer, A., Loáiciga, S., and Schlangen, D. (2021). Is in-
coherence surprising? targeted evaluation of coher-
ence prediction from language models. In Toutanova,
K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D.,
Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T.,
and Zhou, Y., editors, Proceedings of the 2021 Con-
ference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human Lan-
guage Technologies, pages 4164–4173. Association
for Computational Linguistics.
Bi, B., Wu, C., Yan, M., Wang, W., Xia, J., and Li, C.
(2019). Incorporating external knowledge into ma-
chine reading for generative question answering. In
Inui, K., Jiang, J., Ng, V., and Wan, X., editors, Pro-
ceedings of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 2521–2530, Hong
Kong, China. Association for Computational Linguis-
tics.
Cao, M., Dong, Y., Wu, J., and Cheung, J. C. K. (2020).
Factual error correction for abstractive summarization
models. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing
(EMNLP), pages 6251–6258.
Cao, S. and Wang, L. (2021). CLIFF: Contrastive learn-
ing for improving faithfulness and factuality in ab-
stractive summarization. In Moens, M.-F., Huang, X.,
Specia, L., and Yih, S. W.-t., editors, Proceedings of
the 2021 Conference on Empirical Methods in Natural
Language Processing, pages 6633–6649. Association
for Computational Linguistics.
Cao, Z., Wei, F., Li, W., and Li, S. (2018). Faithful to the
original: fact-aware neural abstractive summarization.
In Proceedings of the Thirty-Second AAAI Confer-
ence on Artificial Intelligence and Thirtieth Innovative
Applications of Artificial Intelligence Conference and
Eighth AAAI Symposium on Educational Advances in
Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18.
AAAI Press.
Chen, T., Wang, X., Yue, T., Bai, X., Le, C. X., and Wang,
W. (2023). Enhancing abstractive summarization with
extracted knowledge graphs and multi-source trans-
formers. Applied Sciences, 13(13):7753.
Chen, W., Su, Y., Yan, X., and Wang, W. Y. (2020).
Kgpt: Knowledge-grounded pre-training for data-to-
text generation. In Proceedings of the 2020 Confer-
ence on Empirical Methods in Natural Language Pro-
cessing (EMNLP), pages 8635–8648.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. In Proceedings
of the 2019 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and
Short Papers), pages 4171–4186.
Dinan, E., Logacheva, V., Malykh, V., Miller, A. H., Shus-
ter, K., Urbanek, J., Kiela, D., Szlam, A., Serban, I. V.,
Lowe, R., Prabhumoye, S., Black, A. W., Rudnicky,
A. I., Williams, J. D., Pineau, J., Burtsev, M., and
Weston, J. (2019a). The second conversational intel-
ligence challenge (convai2). The Springer Series on
Challenges in Machine Learning, pages 187–208.
Dinan, E., Roller, S., Shuster, K., Fan, A., Auli, M., and
Weston, J. (2019b). Wizard of wikipedia: Knowledge-
powered conversational agents. In International Con-
ference on Learning Representations.
Durmus, E., He, H., and Diab, M. (2020). FEQA: A ques-
tion answering evaluation framework for faithfulness
assessment in abstractive summarization. In Jurafsky,
D., Chai, J., Schluter, N., and Tetreault, J., editors,
Proceedings of the 58th Annual Meeting of the As-
sociation for Computational Linguistics, pages 5055–
5070. Association for Computational Linguistics.
Dziri, N., Kamalloo, E., Mathewson, K., and Zaiane, O.
(2019). Evaluating coherence in dialogue systems us-
ing entailment. In Burstein, J., Doran, C., and Solorio,
T., editors, Proceedings of the 2019 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, Volume 1 (Long and Short Papers), pages 3806–
3812. Association for Computational Linguistics.
Dziri, N., Madotto, A., Zaïane, O., and Bose, A. J. (2021).
Neural path hunter: Reducing hallucination in dia-
logue systems via path grounding. In Moens, M.-F.,
Huang, X., Specia, L., and Yih, S. W.-t., editors, Pro-
ceedings of the 2021 Conference on Empirical Meth-
ods in Natural Language Processing, pages 2197–
2214. Association for Computational Linguistics.
Edunov, S., Ott, M., Auli, M., and Grangier, D. (2018). Un-
derstanding back-translation at scale. In Proceedings
of the 2018 Conference on Empirical Methods in Nat-
ural Language Processing, pages 489–500.
Falke, T., Ribeiro, L. F., Utama, P. A., Dagan, I., and
Gurevych, I. (2019). Ranking generated summaries
by correctness: An interesting but challenging appli-
cation for natural language inference. In Proceedings
of the 57th annual meeting of the association for com-
putational linguistics, pages 2214–2220.
Fan, A., Gardent, C., Braud, C., and Bordes, A. (2019).
Using local knowledge graph construction to scale
Seq2Seq models to multi-document inputs. In Inui,
K., Jiang, J., Ng, V., and Wan, X., editors, Proceed-
ings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 4186–4196, Hong
Kong, China. Association for Computational Linguis-
tics.
Feng, Y., Xie, W., Gu, S., Shao, C., Zhang, W., Yang, Z.,
and Yu, D. (2020). Modeling fluency and faithfulness
for diverse neural machine translation. In Proceed-
ings of the AAAI Conference on Artificial Intelligence,
volume 34, pages 59–66.
Filippova, K. (2020). Controlled hallucinations: Learning
to generate faithfully from noisy data. arXiv preprint
arXiv:2010.05873.
Gunel, B., Zhu, C., Zeng, M., and Huang, X. (2020).
Mind the facts: Knowledge-boosted coherent abstrac-
tive text summarization. ArXiv, abs/2006.15435.
Hancock, B., Bordes, A., Mazare, P.-E., and Weston, J.
(2019). Learning from dialogue after deployment:
Feed yourself, chatbot! In Korhonen, A., Traum,
D., and Màrquez, L., editors, Proceedings of the 57th
Annual Meeting of the Association for Computational
Linguistics, pages 3667–3684. Association for Com-
putational Linguistics.
He, D., Xia, Y., Qin, T., Wang, L., Yu, N., Liu, T.-Y., and
Ma, W.-Y. (2016). Dual learning for machine transla-
tion. Advances in neural information processing sys-
tems, 29.
He, Q., Wu, L., Yin, Y., and Cai, H. (2020). Knowledge-
graph augmented word representations for named en-
tity recognition. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 34, pages
7919–7926.
He, T., Zhang, J., Zhou, Z., and Glass, J. (2021). Exposure
bias versus self-recovery: Are distortions really incre-
mental for autoregressive text generation? In Pro-
ceedings of the 2021 Conference on Empirical Meth-
ods in Natural Language Processing, pages 5087–
5102. Association for Computational Linguistics.
Honovich, O., Aharoni, R., Herzig, J., Taitelbaum, H., Kuk-
liansy, D., Cohen, V., Scialom, T., Szpektor, I., Has-
sidim, A., and Matias, Y. (2022). True: Re-evaluating
factual consistency evaluation. In Proceedings of the
2022 Conference of the North American Chapter of
the Association for Computational Linguistics: Hu-
man Language Technologies, pages 3905–3920.
Hu, W., He, L., Ma, H., Wang, K., and Xiao, J. (2022).
Kgner: Improving chinese named entity recognition
by bert infused with the knowledge graph. Applied
Sciences, 12(15):7702.
Huang, L., Wu, L., and Wang, L. (2020). Knowledge graph-
augmented abstractive summarization with semantic-
driven cloze reward. In Jurafsky, D., Chai, J., Schluter,
N., and Tetreault, J., editors, Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics, pages 5094–5107. Association for Com-
putational Linguistics.
Huang, Y., Feng, X., Feng, X., and Qin, B. (2021). The fac-
tual inconsistency problem in abstractive text summa-
rization: A survey. arXiv preprint arXiv:2104.14839.
James, N. T. and Kannan, R. (2017). A survey on infor-
mation retrieval models, techniques and applications.
International Journals of Advanced Research in Com-
puter Science and Software Engineering ISSN.
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E.,
Bang, Y. J., Madotto, A., and Fung, P. (2023). Survey
of hallucination in natural language generation. ACM
Computing Surveys, 55(12):1–38.
Jiang, R., Banchs, R. E., and Li, H. (2016). Evaluating
and combining name entity recognition systems. In
Proceedings of the sixth named entity workshop, pages
21–27.
Krishna, K., Roy, A., and Iyyer, M. (2021). Hur-
dles to progress in long-form question answering.
In Toutanova, K., Rumshisky, A., Zettlemoyer, L.,
Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R.,
Chakraborty, T., and Zhou, Y., editors, Proceedings of
the 2021 Conference of the North American Chapter
of the Association for Computational Linguistics: Hu-
man Language Technologies, pages 4940–4957. Asso-
ciation for Computational Linguistics.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami,
K., and Dyer, C. (2016). Neural architectures for
named entity recognition. In Proceedings of the 2016
Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Lan-
guage Technologies, pages 260–270.
Lebret, R., Grangier, D., and Auli, M. (2016). Neural text
generation from structured data with application to the
biography domain. arXiv preprint arXiv:1603.07771.
Li, C., Bi, B., Yan, M., Wang, W., and Huang, S. (2021).
Addressing semantic drift in generative question an-
swering with auxiliary extraction. In Zong, C., Xia,
F., Li, W., and Navigli, R., editors, Proceedings of the
59th Annual Meeting of the Association for Compu-
tational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume
2: Short Papers), pages 942–947. Association for
Computational Linguistics.
Li, H., Zhu, J., Zhang, J., and Zong, C. (2018). Ensure the
correctness of the summary: Incorporate entailment
knowledge into abstractive sentence summarization.
In Proceedings of the 27th international conference
on computational linguistics, pages 1430–1441.
Li, M., Roller, S., Kulikov, I., Welleck, S., Boureau, Y.-
L., Cho, K., and Weston, J. (2020). Don’t say that!
making inconsistent dialogue unlikely with unlikeli-
hood training. In Jurafsky, D., Chai, J., Schluter, N.,
and Tetreault, J., editors, Proceedings of the 58th An-
nual Meeting of the Association for Computational
Linguistics, pages 4715–4728. Association for Com-
putational Linguistics.
Maynez, J., Narayan, S., Bohnet, B., and McDonald, R.
(2020). On faithfulness and factuality in abstractive
summarization. In Jurafsky, D., Chai, J., Schluter,
N., and Tetreault, J., editors, Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics, pages 1906–1919. Association for Com-
putational Linguistics.
Mazaré, P.-E., Humeau, S., Raison, M., and Bordes, A.
(2018). Training millions of personalized dialogue
agents. In Riloff, E., Chiang, D., Hockenmaier, J., and
Tsujii, J., editors, Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Process-
ing, pages 2775–2779. Association for Computational
Linguistics.
McCann, B., Bradbury, J., Xiong, C., and Socher, R. (2017).
Learned in translation: Contextualized word vectors.
Advances in neural information processing systems,
30.
Mielke, S. J., Szlam, A., Dinan, E., and Boureau, Y.-
L. (2022). Reducing conversational agents’ over-
confidence through linguistic calibration. Transac-
tions of the Association for Computational Linguis-
tics, 10:857–872.
Müller, M., Gonzales, A. R., and Sennrich, R. (2020). Do-
main robustness in neural machine translation. In Pro-
ceedings of the 14th Conference of the Association for
Machine Translation in the Americas (Volume 1: Re-
search Track), pages 151–164.
Nogueira, R. and Cho, K. (2019). Passage re-ranking with
bert. arXiv preprint arXiv:1901.04085.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
Bleu: a method for automatic evaluation of machine
translation. In Proceedings of the 40th annual meet-
ing of the Association for Computational Linguistics,
pages 311–318.
Parikh, A., Wang, X., Gehrmann, S., Faruqui, M., Dhin-
gra, B., Yang, D., and Das, D. (2020). ToTTo: A
controlled table-to-text generation dataset. In Webber,
B., Cohn, T., He, Y., and Liu, Y., editors, Proceed-
ings of the 2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pages 1173–
1186. Association for Computational Linguistics.
Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin,
A., Wu, Y., and Miller, A. (2019). Language models as
knowledge bases? In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-IJCNLP),
pages 2463–2473.
Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2015).
Sequence level training with recurrent neural net-
works. arXiv preprint arXiv:1511.06732.
Rashkin, H., Reitter, D., Tomar, G. S., and Das, D. (2021).
Increasing faithfulness in knowledge-grounded dia-
logue with controllable features. In Zong, C., Xia,
F., Li, W., and Navigli, R., editors, Proceedings of the
59th Annual Meeting of the Association for Compu-
tational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Vol-
ume 1: Long Papers), pages 704–718. Association for
Computational Linguistics.
Raunak, V., Menezes, A., and Junczys-Dowmunt, M.
(2021). The curious case of hallucinations in neural
machine translation. In Toutanova, K., Rumshisky,
A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I.,
Bethard, S., Cotterell, R., Chakraborty, T., and Zhou,
Y., editors, Proceedings of the 2021 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, pages 1172–1183, Online. Association for Com-
putational Linguistics.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any
classifier. In Proceedings of the 22nd ACM SIGKDD
international conference on knowledge discovery and
data mining, pages 1135–1144.
Roller, S., Dinan, E., Goyal, N., Ju, D., Williamson, M.,
Liu, Y., Xu, J., Ott, M., Smith, E. M., Boureau, Y.-L.,
and Weston, J. (2021). Recipes for building an open-
domain chatbot. In Merlo, P., Tiedemann, J., and Tsar-
faty, R., editors, Proceedings of the 16th Conference
of the European Chapter of the Association for Com-
putational Linguistics: Main Volume, pages 300–325.
Association for Computational Linguistics.
Santhanam, S., Hedayatnia, B., Gella, S., Padmakumar,
A., Kim, S., Liu, Y., and Hakkani-Tür, D. Z. (2021).
Rome was built in 1776: A case study on factual cor-
rectness in knowledge-grounded response generation.
ArXiv, abs/2110.05456.
Schick, T. and Schütze, H. (2021). Exploiting cloze-
questions for few-shot text classification and natural
language inference. In Proceedings of the 16th Con-
ference of the European Chapter of the Association
for Computational Linguistics: Main Volume, pages
255–269.
Schütze, H., Manning, C. D., and Raghavan, P. (2008). In-
troduction to information retrieval, volume 39. Cam-
bridge University Press Cambridge.
Sellam, T., Das, D., and Parikh, A. (2020). BLEURT:
Learning robust metrics for text generation. In Ju-
rafsky, D., Chai, J., Schluter, N., and Tetreault, J.,
editors, Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, pages
7881–7892. Association for Computational Linguis-
tics.
Sennrich, R., Haddow, B., and Birch, A. (2016). Improving
neural machine translation models with monolingual
data. In Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics (Volume 1:
Long Papers), pages 86–96.
Shen, L., Zhan, H., Shen, X., Chen, H., Zhao, X., and Zhu,
X. (2021). Identifying untrustworthy samples: Data
filtering for open-domain dialogues with bayesian op-
timization. In Proceedings of the 30th ACM Interna-
tional Conference on Information & Knowledge Man-
agement, page 1598–1608. Association for Comput-
ing Machinery.
Shuster, K., Poff, S., Chen, M., Kiela, D., and Weston, J.
(2021). Retrieval augmentation reduces hallucination
in conversation. In Moens, M.-F., Huang, X., Spe-
cia, L., and Yih, S. W.-t., editors, Findings of the
Association for Computational Linguistics: EMNLP
2021, pages 3784–3803. Association for Computa-
tional Linguistics.
Specia, L., Hajlaoui, N., Hallett, C., and Aziz, W. (2011).
Predicting machine translation adequacy. In Proceed-
ings of Machine Translation Summit XIII: Papers.
Su, D., Li, X., Zhang, J., Shang, L., Jiang, X., Liu, Q.,
and Fung, P. (2022). Read before generate! faithful
long form question answering with machine reading.
In Muresan, S., Nakov, P., and Villavicencio, A., ed-
itors, Findings of the Association for Computational
Linguistics: ACL 2022, pages 744–756, Dublin, Ire-
land. Association for Computational Linguistics.
Su, W., Tang, Y., Ai, Q., Wang, C., Wu, Z., and Liu, Y.
(2024). Mitigating entity-level hallucination in large
language models. arXiv preprint arXiv:2407.09417.
Sun, W., Shi, Z., Gao, S., Ren, P., de Rijke, M., and Ren, Z.
(2023). Contrastive learning reduces hallucination in
conversations. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 37, pages 13618–
13626.
Tian, R., Narayan, S., Sellam, T., and Parikh, A. P.
(2019). Sticking to the facts: Confident decoding
for faithful data-to-text generation. arXiv preprint
arXiv:1910.08684.
Toral, A., Wieling, M., and Way, A. (2018). Post-editing
effort of a novel with statistical and neural machine
translation. Frontiers in Digital Humanities, 5:9.
Tu, Z., Liu, Y., Shang, L., Liu, X., and Li, H. (2017). Neural
machine translation with reconstruction. In Proceed-
ings of the AAAI Conference on Artificial Intelligence,
volume 31.
Tu, Z., Lu, Z., Liu, Y., Liu, X., and Li, H. (2016). Modeling
coverage for neural machine translation. In Proceed-
ings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 76–85.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in Neural
Information Processing Systems, 30.
Wang, A., Cho, K., and Lewis, M. (2020). Asking and an-
swering questions to evaluate the factual consistency
of summaries. In Jurafsky, D., Chai, J., Schluter, N.,
and Tetreault, J., editors, Proceedings of the 58th An-
nual Meeting of the Association for Computational
Linguistics, pages 5008–5020. Association for Com-
putational Linguistics.
Wang, H. (2020). Revisiting challenges in data-to-text
generation with fact grounding. arXiv preprint
arXiv:2001.03830.
Wang, X., Gao, T., Zhu, Z., Zhang, Z., Liu, Z., Li, J., and
Tang, J. (2021). KEPLER: A unified model for knowl-
edge embedding and pre-trained language representa-
tion. Transactions of the Association for Computa-
tional Linguistics, 9:176–194.
Welleck, S., Weston, J., Szlam, A., and Cho, K. (2019). Dialogue natural language inference. In Korhonen, A., Traum, D., and Màrquez, L., editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3731–3741, Florence, Italy. Association for Computational Linguistics.
Wiseman, S., Shieber, S. M., and Rush, A. M. (2017). Chal-
lenges in data-to-document generation. arXiv preprint
arXiv:1707.08052.
Wu, Z., Galley, M., Brockett, C., Zhang, Y., Gao, X., Quirk,
C., Koncel-Kedziorski, R., Gao, J., Hajishirzi, H., Os-
tendorf, M., and Dolan, B. (2021). A controllable
model of grounded response generation. Proceed-
ings of the AAAI Conference on Artificial Intelligence,
35:14085–14093.
Yavuz, S., Rastogi, A., Chao, G.-L., and Hakkani-Tur, D.
(2019). DeepCopy: Grounded response generation
with hierarchical pointer networks. In Nakamura, S.,
Gasic, M., Zukerman, I., Skantze, G., Nakano, M.,
Papangelis, A., Ultes, S., and Yoshino, K., editors,
Proceedings of the 20th Annual SIGdial Meeting on
Discourse and Dialogue, pages 122–132. Association
for Computational Linguistics.
Yin, J., Jiang, X., Lu, Z., Shang, L., Li, H., and Li, X. (2016). Neural generative question answering. In Iyyer, M., He, H., Boyd-Graber, J., and Daumé III, H., editors, Proceedings of the Workshop on Human-Computer Question Answering, pages 36–42, San Diego, California. Association for Computational Linguistics.
Yu, T., Liu, Z., and Fung, P. (2021). AdaptSum: To-
wards low-resource domain adaptation for abstrac-
tive summarization. In Toutanova, K., Rumshisky,
A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I.,
Bethard, S., Cotterell, R., Chakraborty, T., and Zhou,
Y., editors, Proceedings of the 2021 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, pages 5892–5904. Association for Compu-
tational Linguistics.
Zhang, C., Lee, G., D’Haro, L. F., and Li, H. (2021a). D-
score: Holistic dialogue evaluation without reference.
IEEE/ACM Transactions on Audio, Speech, and Lan-
guage Processing, 29:2502–2516.
Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., and
Weston, J. (2018). Personalizing dialogue agents: I
have a dog, do you have pets too? In Gurevych, I.
and Miyao, Y., editors, Proceedings of the 56th An-
nual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 2204–
2213. Association for Computational Linguistics.
Zhang, Y., Merck, D., Tsai, E., Manning, C. D., and Lan-
glotz, C. (2020a). Optimizing the factual correct-
ness of a summary: A study of summarizing radiol-
ogy reports. In Jurafsky, D., Chai, J., Schluter, N.,
and Tetreault, J., editors, Proceedings of the 58th An-
nual Meeting of the Association for Computational
Linguistics, pages 5108–5120. Association for Com-
putational Linguistics.
Zhang, Y., Sun, S., Galley, M., Chen, Y.-C., Brockett, C.,
Gao, X., Gao, J., Liu, J., and Dolan, B. (2020b).
DialoGPT: Large-scale generative pre-training for
conversational response generation. In Celikyilmaz,
A. and Wen, T.-H., editors, Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics: System Demonstrations, pages 270–278.
Association for Computational Linguistics.
Zhou, B., Richardson, K., Ning, Q., Khot, T., Sabharwal,
A., and Roth, D. (2021). Temporal reasoning on im-
plicit events from distant supervision. In Proceedings
of the 2021 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, pages 1361–1371.
Zhou, C., Neubig, G., Gu, J., Diab, M., Guzman, P., Zettle-
moyer, L., and Ghazvininejad, M. (2020). Detecting
hallucinated content in conditional neural sequence
generation. arXiv preprint arXiv:2011.02593.
Zhou, K., Prabhumoye, S., and Black, A. W. (2018). A
dataset for document grounded conversations. In
Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii,
J., editors, Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing,
pages 708–713. Association for Computational Lin-
guistics.
Zhu, C., Hinthorn, W., Xu, R., Zeng, Q., Zeng, M., Huang,
X., and Jiang, M. (2021). Enhancing factual con-
sistency of abstractive summarization. In Toutanova,
K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D.,
Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T.,
and Zhou, Y., editors, Proceedings of the 2021 Con-
ference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human Lan-
guage Technologies, pages 718–733. Association for
Computational Linguistics.