Investigating Answer Validation Using Noise Identification and
Classification in Goal-Oriented Dialogues
Sara Mirabi¹ (https://orcid.org/0009-0007-8541-2539), Bahadorreza Ofoghi¹, John Yearwood¹, Diego Molla-Aliod² and Vicky Mak-Hau¹
¹School of Information Technology, Deakin University, Melbourne, Australia
²School of Computing, Macquarie University, Sydney, Australia
{s.mirabi, b.ofoghi, john.yearwood, vicky.mak}@deakin.edu.au, diego.molla-aliod@mq.edu.au
Keywords:
Dialogue Systems, Multi-Agent Conversational Systems, Noisy Answers, Answer Validation, Error Detection,
Linear Programming, Optimization.
Abstract:
Goal-oriented conversational systems based on large language models (LLMs) provide the potential capability
to gather the necessary requirements for solving tasks or developing solutions. However, in real-world scenarios,
non-expert users may respond incorrectly to dialogue questions, which can impede the system’s effectiveness
in eliciting accurate information. This paper presents a novel approach to detecting and categorizing noisy
answers in goal-oriented conversations, with a focus on modeling linear programming problems. Using a
current LLM, Gemini, we develop multi-agent synthetic conversations based on problem statements from the
benchmark optimization modeling dataset NL4Opt and deliberately inject noisy answers into the generated
dialogues. Our experiments show that the LLM is not sufficiently equipped to detect noisy answers; in almost
59% of the cases where a noisy answer occurs, it continues the conversation without any attempt at resolving
the noise. Thus, we also propose a two-step answer validation method for
the identification and classification of noisy answers. Our findings demonstrate that while some LLM and
non-LLM-based models perform well in detecting answer inaccuracies, there is a need for further improvements
in classifying noisy answers into fine-grained stress types.
1 INTRODUCTION
Nowadays, large language models (LLMs) enable non-
expert users to seek and retrieve information efficiently.
However, interaction with LLMs is not without chal-
lenges; non-expert users may provide incorrect infor-
mation or noise in conversations due to a lack of
knowledge or misunderstanding, which can mislead
information seeking and decrease the overall perfor-
mance of LLM-based systems.
Noise in conversations can arise from various
sources, such as ambiguities, misunderstandings, or
intentional misdirection, leading to irrelevant or in-
accurate information. While LLMs have enhanced
information accessibility and interaction, assuring the
quality of user input and detecting, identifying, and
rectifying various error types remain areas that demand
more exploration.
In this work, we focus on noise in a goal-oriented
conversational system based on linear programming
(LP) problems. LP is a technique for solving optimiza-
tion problems (OPs). However, the intricate nature of
OPs and the lack of mathematical expertise can make
it challenging for manufacturers and business owners
to provide accurate input for modeling such problems
independently. In OPs, the necessary components typ-
ically include decision variables, which represent the
choices available for making decisions; objective func-
tions, which define the goal or the objective to be
optimized, e.g., maximization or minimization; and
constraints, which are the requirements and limitations
that the solution must satisfy (Savic, 2002).
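As a brief illustration (a hypothetical toy instance constructed for this text, not drawn verbatim from NL4Opt), a farmer choosing how many acres x and y of two crops to plant, with per-acre revenues of $500 and $600, at most 80 acres of land, and an assumed labor budget of 120 days at 1.5 and 3 days of labor per acre respectively, could be modeled as the LP

\begin{align*}
\text{maximize}\quad & 500x + 600y\\
\text{subject to}\quad & x + y \le 80,\\
& 1.5x + 3y \le 120,\\
& x \ge 0,\; y \ge 0,
\end{align*}

where x and y are the decision variables, the revenue expression is the objective function, and the land and labor limits are the constraints.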
To analyze the behavior of the questioning agent in
the presence of noise and its ability to detect incorrect
responses, we generated goal-oriented conversations
considering noisy answers by simulating interactions
between a mathematically-aware agent and a business
client with an optimization need. The optimization
problems from the NL4Opt dataset (Ramamonjison
et al., 2023) provide the basis for the client’s responses.
To generate these conversations, our methodology em-
ploys prompt engineering with three primary agents.
The first agent is tasked with generating queries to
elicit all necessary components, the second agent gen-
erates accurate answers derived directly from the text
descriptions in NL4Opt, and the third agent injects
deliberate noise into the conversations randomly from
within six categories including incomplete, too much
information, non-sequitur, confusion, inconsistent, and
exaggerated answers, mimicking potential inaccura-
cies in real-world user responses. We also introduce a
novel automated answer validation system that shifts
the focus from solely validating LLM-generated re-
sponses to evaluating the accuracy of user-side inputs
within a conversation, while classifying noisy user in-
puts into the six predefined error categories. Figure 1
shows the process of generating conversations and val-
idating answers using the answer validation system.

Figure 1: A snippet of the generated conversation for an NL4Opt problem statement that includes a generated question by the question generator (QG), a correct answer generated by the answer generator (AG), and a noisy answer generated by the noise generator (NG). The noisy answer is detected using a correctness checker and classified with a noise classifier agent.
The main contributions of this work therefore include:
We generate a dataset consisting of 1,101 dialogues
that include noisy answers mimicking real-world
user behavior with Gemini (Reid et al., 2024), and
manually annotate these noisy answers by three
annotators to ensure the correctness of the noise
generation process, where correctness refers to
the accuracy and relevancy of the intended noise
within its respective category.
We assess the LLM’s capabilities in detecting noisy
responses to questions and taking rectification
steps to elicit correct information.
We propose an automated answer validation sys-
tem and assess the strengths and weaknesses of
different models in conversational noise detection
and classification us-
ing Gemini (Reid et al., 2024), Mixtral (Jiang et al.,
2024a), Llama (Touvron et al., 2023) and non-
LLM models, i.e., BERT (Kenton and Toutanova,
2019), RoBERTa (Liu, 2019), and LSTM (Staude-
meyer and Morris, 2019).
The rest of this paper is organized as follows. In
Section 2, we present the related work. Section 3
outlines the proposed methodology, including dataset
generation, human annotation, LLM awareness analy-
sis, and answer validation. The results of our proposed
methods are discussed in Section 4. Section 5 provides
concluding insights, and future work is discussed in
Section 6.
2 RELATED WORK
Answer validation has been highlighted across systems,
from traditional question answering (QA) to modern
LLM-based conversational systems. In one of the early
works, Zhang and Zhang (2003) introduced a logic-
based approach for validating answers in a Chinese QA
system by utilizing lexical knowledge and logic form
transformation (LFT). Harabagiu and Hickl (2006)
focused on filtering out noisy candidates that failed
to meet minimal entailment conditions and improving
the ranking of potential answers.
The answer validation exercise (AVE) conference
in 2006 introduced the answer validator (AV) task to
validate the correctness of the answers based on text-
hypothesis pairs (Peñas et al., 2006). Ofoghi et al.
(2009) utilized frame semantics in combination with
named entity-based analysis for answer identification
and selection. Pakray et al. (2011) presented a step-
by-step hybrid approach that combines information
retrieval with machine reading techniques to validate
and rank answers.
Durmus et al. (2020) fine-tuned a BERT model
to extract answers from context and compare them
to gold answers, detecting hallucinations. Dziri et al.
(2021) proposed a generate-then-refine approach using
knowledge graphs to reduce and detect LLM hallu-
cinations with token-level critic and k-hop subgraph
queries. Konigari et al. (2021) fine-tuned an XLNet-
base model to classify utterances and identify off-topic
deviations and contradictions. Pan et al. (2021) de-
veloped a model to detect real, misleading, or contra-
dictory information. Yu and Sagae (2021) used two
specialized classifiers based on RoBERTa to detect
safety and consistency. Jiang et al. (2024b) proposed
PedCoT, a zero-shot method using pedagogical prin-
ciples and CoT prompts to identify reasoning errors
in LLMs. Chen et al. (2024) introduced a framework
to evaluate four under-explored biases (gender, misin-
formation, beauty, authority) using LLMs and human
judgment.
Despite significant advancements in conversational
answer validation, especially with the emergence of
LLMs for detecting answer-related issues such as hal-
lucinations, off-topic content, bias, misleading and
contradictory contexts, the LLMs’ awareness of in-
correct responses and their capabilities in performing
multiple classifications of noisy answers remain un-
explored. Therefore, we have developed a new frame-
work that studies LLMs’ abilities to detect noise in
conversations and provides an approach to answer val-
idation, addressing the complexities of detecting noisy
answers and their categories.
3 METHODS
3.1 Generating Conversations in a Noisy Environment
In 2022, the first NL4Opt dataset for LP problem for-
mulation was released, consisting of 1,101 problem
statements (Ramamonjison et al., 2023). The NL4Opt
dataset provides a collection of textual problem de-
scriptions specifically designed to test the development
of techniques in business, manufacturing, transporta-
tion, and other industries for mapping natural language
descriptions into mathematical models. With reference
to this dataset, we have automated the generation of
1,101 conversations using Gemini as shown in Figure 2.
We utilized three primary agents based on Gemini-1.5
Flash to generate these conversations:
i. Question Generator Agent (QG): This agent asks
precise goal-oriented questions through prompt engi-
neering and zero-shot learning to elicit essential details
for modeling LP problems. It focuses on the objective
function, decision variables, and constraints, similar to
the work by Abdullin et al. (2023). The QG agent does
not access the problem statement directly but relies on
the designed prompt and iterative user interactions.
ii. Answer Generator Agent (AG): This agent im-
personates a real non-mathematician user to simulate
realistic interactions with the QG agent using prompt
engineering and zero-shot learning, generating accu-
rate responses. These answers are directly derived
from the NL4Opt problem statements, which serve
as the primary knowledge source for the simulated
user to provide a reliable and domain-specific basis for
generating meaningful and precise answers, similar to
the work in (Abdullin et al., 2023). If the requested
information is not in the problem statement, the agent
states it does not know.

Figure 2: Generating up to 10 rounds of initial conversation per problem sampled from NL4Opt using multiple agents consisting of the Question Generator (QG), Answer Generator (AG), and Noise Generator (NG). Conversation history (CH) is compared with the original problem statements (PS) to determine whether any information remains missing, resulting in the continuation or termination of the conversations.
iii. Noise Generator Agent (NG): This agent gen-
erates noisy answers in dialogue turns at random in-
tervals using prompt engineering and few-shot learn-
ing. Noisy answers are generated in a randomly selected
category and constitute approximately 10% of all
answers in a conversation. This diversity in answer
types aids in further evaluating the system’s ability to
identify and classify erroneous inputs. Based on a pilot
study, we categorize noisy answers into six types of
noise that could result in infeasible or inaccurate LP
models:
1. Incomplete (INCM): The answer does not form a complete sentence and is unfinished or abrupt.
2. Too Much Information (TMI): The answer includes both correct and excessive details not directly relevant to the question.
3. Non-Sequitur (NS): The answer presents unrelated facts in a nonsensical or humorous manner.
4. Confusion (CNF): The answer contradicts itself or provides information that conflicts with the previous answers in the current conversation history.
5. Inconsistent (INCN): The answer contains a mathematical error, such as an illogical proportion calculation or a misinterpretation of units.
6. Exaggerated (EXG): The answer uses unrealistic or implausible numerical values.
The conversations proceed for up to 10 rounds or until
they reach a natural ending. Once all nec-
essary components for the OP appear to be collected,
a Summarization Agent summarizes the information
from the AG and NG agents and compares it with the
original problem statement using prompt engineering
and zero-shot learning. If any components are found
missing, the QG and AG agents proceed with the di-
alogue to retrieve the missing details, with no further
noisy answers generated, as the goal is to finalize the
conversation.
Appendix A provides the prompts for generating
questions, answers, and noisy answers. Appendix B in-
cludes an example of multi-agent dialogue generation,
considering noisy answers.
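A minimal sketch of this generation loop is given below, assuming the google-generativeai Python client; the prompt constants, the transcript/ask helpers, and the missing-information check are simplified placeholders for the prompts in Appendix A and the Summarization Agent, not the exact implementation used in this work.

import random
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical credential setup
model = genai.GenerativeModel("gemini-1.5-flash")

NOISE_TYPES = ["incomplete", "too much information", "non-sequitur",
               "confusion", "inconsistent", "exaggerated"]

# Placeholder prompts standing in for the full prompts in Appendix A.
QG_PROMPT = "You are OptiGem; ask one concise question about a missing LP component."
AG_PROMPT = "Answer only from the problem statement; say you do not know otherwise."
NG_PROMPTS = {t: f"Give a noisy answer of type '{t}' to the last question." for t in NOISE_TYPES}
SUMM_PROMPT = "Summarize the gathered LP components and state whether anything is missing."

def transcript(history):
    # Flatten the dialogue history into a plain-text exchange.
    return "\n".join(f"{speaker}: {text}" for speaker, text in history)

def ask(*parts):
    # Single Gemini call over a concatenated prompt and context.
    return model.generate_content("\n\n".join(parts)).text

def generate_conversation(problem_statement, max_rounds=10, noise_rate=0.10):
    history = []
    for _ in range(max_rounds):
        # QG never sees the problem statement, only its prompt and the history.
        question = ask(QG_PROMPT, transcript(history))
        if random.random() < noise_rate:
            # NG injects a noisy answer of a randomly chosen category.
            answer = ask(NG_PROMPTS[random.choice(NOISE_TYPES)], transcript(history), question)
        else:
            # AG answers strictly from the NL4Opt problem statement.
            answer = ask(AG_PROMPT, problem_statement, transcript(history), question)
        history.append(("Agent", question))
        history.append(("Client", answer))
    # Summarization step: continue (noise-free) until nothing is reported missing.
    summary = ask(SUMM_PROMPT, problem_statement, transcript(history))
    while "missing" in summary.lower() and "no missing" not in summary.lower():
        question = ask(QG_PROMPT, transcript(history))
        answer = ask(AG_PROMPT, problem_statement, transcript(history), question)
        history.append(("Agent", question))
        history.append(("Client", answer))
        summary = ask(SUMM_PROMPT, problem_statement, transcript(history))
    return history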
3.2 Human Annotation
When analyzing the generated conversations, we found
that some generated noisy answers were not accurately
categorized in the intended noise category. Thus, we
decided to manually annotate the noisy answers in the
dataset. Annotating these noisy answers is particularly
challenging due to the complexity of task-oriented dia-
logue systems, where it is often difficult to differentiate
between multiple types of noisy responses. Therefore,
the annotation process demands detailed definitions of
noise types and a high level of expertise.
In our annotation process, one annotator labeled
the entire dataset, comprising 1,076 noisy answers.
The discrepancies in noise classes between the annota-
tor’s labels and those generated by NG were identified
and corrected. To verify the consistency of the first an-
notator’s annotations, two additional annotators each
labeled 142 randomly drawn noisy answers selected
from the 1,076 noisy answers with no overlaps, en-
suring balanced noise categories. The total sample
count of n = 142 × 2 = 284 was reached based on Equation 1 (Daniel, 1978), a statistical formula for calculating the minimum number of essential samples to ensure the correct assessment of the reliability of annotation on a finite population, where the population size N = 1,076, confidence level Z = 1.96, population proportion p = 0.5, and margin of error E = 0.05.

n = \frac{N \times Z^2 \times p \times (1-p)}{(N-1) \times E^2 + Z^2 \times p \times (1-p)}    (1)
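Substituting these values gives, as a quick arithmetic check added here for clarity,

n = \frac{1076 \times 1.96^2 \times 0.25}{1075 \times 0.05^2 + 1.96^2 \times 0.25} = \frac{1033.4}{2.6875 + 0.9604} \approx 283.3,

which is rounded up to the 284 annotated samples (142 per additional annotator).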
Table 1 shows Cohen’s Kappa (Cohen, 1960) val-
ues when comparing the annotations of Annotator 1
with those of Annotator 2 and Annotator 3, sepa-
rately. The results show a high level of “Almost Per-
fect” (McHugh, 2012) agreement between the pairs of
annotators. Based on these strong agreement measures,
the labels provided by Annotator 1 (who corrected the
labels of the entire Gemini-generated conversation set)
were considered the ground truth for this study, ensur-
ing the reliability and consistency of the annotation
process.
Table 1: Cohen's Kappa analysis of inter-annotator agreement between pairs of annotators where two separate sets of 142 conversations were annotated by each annotator pair.

Comparison                      Cohen's Kappa
Annotator 1 vs. Annotator 2     0.83
Annotator 1 vs. Annotator 3     0.90
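For reference, the agreement values can be computed with scikit-learn's cohen_kappa_score (a minimal sketch over made-up labels; the actual annotation files are not included here):

from sklearn.metrics import cohen_kappa_score

# Hypothetical per-answer category labels from two annotators over the same sample.
annotator_1 = ["INCM", "TMI", "NS", "CNF", "INCN", "EXG", "Correct", "NS"]
annotator_2 = ["INCM", "TMI", "NS", "CNF", "INCN", "EXG", "Correct", "CNF"]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_1, annotator_2):.2f}")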
Comparing the distribution of noisy answers anno-
tated by Annotator 1 and those generated by Gemini
revealed two key findings: i) some noisy answers from
Gemini, mostly in the confusion and inconsistent cate-
gories, were accurate and relabeled as correct, and ii)
some generated noisy answers were misclassified and
reassigned correct labels during annotation. Figure 3
demonstrates the discrepancy measures between these
classifications, where 62 noisy answers were labeled
as correct answers after the annotation process.
Figure 3: Comparison of Gemini-generated and annotated noise categories. The "Correct" label was added after manual annotation, which resulted in some non-noisy answers. (Annotated counts: INCM 156, TMI 186, NS 222, CNF 143, INCN 131, EXG 176, Correct 62; Gemini-generated counts: INCM 177, TMI 194, NS 188, CNF 161, INCN 165, EXG 191, Correct 0.)
3.3 QG LLM Awareness Analysis
In a natural conversation, the injection of noise forces
the questioner to either seek confirmation or provide
a concise clarification regarding the question to make
it answerable. In this study, we first focused on ana-
lyzing the first question posed by the QG agent after
detecting a noisy answer to determine its immediate
action. However, as the analysis progressed, we broad-
ened our scope and examined all subsequent questions
following the injection of noise. We closely monitored
the behavior of the QG agent to assess its ability to
detect when a prior response contained a noisy an-
swer. Our primary objective was to determine whether
the QG agent chose to rephrase and repeat the origi-
nal question, seek additional clarification to address
any ambiguity, or request confirmation of the previ-
ously provided response. We manually examined and
counted every question generated after each instance
of a noisy answer in the generated conversations to find
the proportion of times the QG agent was able
to take corrective steps.
3.4 Answer Validation
For each round of dialogue, the current question, cor-
responding answer, and the entire conversation history
were considered for answer validation. By incorporat-
ing the conversation history, which consists of all prior
responses, the validity and accuracy of the current an-
swers are assessed against the context provided by the
previous responses.
We aim to ensure that the answers are not only
relevant to the questions posed but also free from am-
biguities or inconsistencies and align with the overall
coherence of the dialogue. Figure 4 shows the over-
all functioning of the answer validator, focusing on
the detection and classification of noisy answers and
assessing the quality of the conversation.
Figure 4: The conversation history, current answer, and current question are fed into a correctness checker to assess accuracy. If incorrect, a noise classifier identifies the type of error.
3.4.1 Correctness Checker
This stage functions as a binary classifier, assessing
answer accuracy by determining whether each answer
addresses the question and aligns with the conversation history. In
LLM-based models, the agent relies on predefined
instructions and rules, established through prompt en-
gineering with few-shot learning, to label answers as
correct or incorrect. To understand the reasoning be-
hind each label, we refine prompts so the agent ex-
plains its decisions. These explanations are crucial for
modifying prompts and ensuring alignment with our
strategy. In the case of non-LLM-based models BERT,
RoBERTa, and LSTM, correctness checking involves
data preprocessing, including stopword removal and
lemmatization. After preprocessing, binary classifica-
tion determines whether an answer is correct or noisy
based on conversation history.
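A minimal sketch of the non-LLM correctness checking step is shown below, assuming a roberta-base sequence classifier that has already been fine-tuned on (conversation history, question, answer) examples with binary labels; the checkpoint path and the label-to-class mapping are hypothetical, and the stopword removal and lemmatization steps are omitted for brevity.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical path to a roberta-base checkpoint fine-tuned for correct/noisy labels.
CHECKPOINT = "path/to/roberta-correctness-checker"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

def check_correctness(history: str, question: str, answer: str) -> str:
    # Pack the dialogue context and the current turn into one input sequence.
    text = f"History: {history} Question: {question} Answer: {answer}"
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumed mapping: class 0 = correct, class 1 = noisy.
    return "correct" if logits.argmax(dim=-1).item() == 0 else "noisy"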
3.4.2 Noise Classifier
If the correctness checker agent determines an answer
is incorrect in LLM-based models, the noise classifier
agent performs multi-way classification to categorize
the noise into the most probable of the predefined cat-
egories (incomplete, too much information,
non-sequitur, confusion, inconsistent, and exaggerated)
using few-shot learning. This process involves identi-
fying patterns and rules through prompt engineering.
Additionally, the agent provides detailed explanations
for each noisy answer identified, improving clarity and
reasoning in the decision-making process.
In non-LLM-based models, once an answer is clas-
sified as incorrect, fine-tuned models predict the noisy
answer category based on learned labels.
Below is an example of answer validation and clas-
sification for an inconsistent noisy answer:
Agent: Now, are there any limits or restrictions on
how many of each dog you can train?
Client: Yes, I can only train 1.5 dogs at a time.
Correctness Checker: Incorrect
Reason: The answer includes a mathematical issue.
Classifier: Inconsistent
Reason: The answer contains a mathematical error.
The statement “I can only train 1.5 dogs at a time”
is inconsistent as it is not possible to train a fraction
of a dog.
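For the LLM-based classifier, a sketch of the few-shot prompting step is shown below, again assuming the google-generativeai client; the abbreviated prompt stands in for the full few-shot instructions, and the example answer is borrowed from the inconsistent-noise examples in Appendix A.

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical credential setup
model = genai.GenerativeModel("gemini-1.5-flash")

FEW_SHOT = """Classify the client's answer into exactly one noise category and explain why.
Categories: incomplete, too much information, non-sequitur, confusion, inconsistent, exaggerated.

Example:
Question: How many cows are on your farm?
Answer: There are 35.6 cows grazing in the field.
Category: inconsistent
Reason: A fractional number of cows is a mathematical impossibility.
"""

def classify_noise(history: str, question: str, answer: str) -> str:
    prompt = (FEW_SHOT
              + f"\nConversation history: {history}"
              + f"\nQuestion: {question}\nAnswer: {answer}"
              + "\nCategory and reason:")
    return model.generate_content(prompt).text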
4 EVALUATION
4.1 Conversation Characteristics
Table 2 presents the statistics regarding the generated
dataset consisting of a total of 1,101 dialogues. The
number of rounds initiated by the agent exceeds those
by the client because the QG initiates two additional
rounds in each dialogue: one to request waiting during
the summarization process and another to deliver a
farewell message. It can be seen that approximately
10% of the answers have been labeled as noisy, and a
small proportion of conversations (about 8%)
were identified as containing no noise.
Table 2: The statistics of the Gemini-generated and annotated dialogue sets. Note: * indicates gold standard after manual annotation.

Answers without noise                    9252, 9314*
Number of conversations with noise       1076, 1014*
Number of conversations without noise    25, 87*
Total conversations                      1101
Total QG agent rounds                    12530
Total AG/NG agent rounds                 10328
We analyzed the impact of noisy answers on the
dialogues from different angles, and the key findings
are outlined as follows.
The large number of conversations with noisy
answers (1,014 out of a total of 1,101 dialogues),
compared to 87 conversations without noisy an-
swers, indicates a significant prevalence of noise in
the dataset. From the results in Table 3 regarding the
lengths of conversations, it is evident that conversa-
tions without noisy answers are more straightforward.
As the answers in these conversations are accurate,
there is no need for additional clarification or follow-
ups, making the conversations shorter and more direct.
Table 3: Average number of dialogue rounds within the Gemini-generated conversations.

Conversation type            Average rounds
No noise included            16.1
Noisy answers included       21.4
Through the process of comparing the original
noise labels with the annotated labels (Figure 3), we
found that Gemini encounters difficulties in accurately
generating noisy answers in certain categories. The in-
consistent category, which is expected to include math-
ematical inconsistencies, and the confusion category,
which involves answers that contradict the current re-
sponse or prior statements, have been misclassified.
During the annotation process, we observed that some
of the answers generated in these two categories were
indeed correct and contained no noise. Additionally,
the annotation process revealed that some answers ini-
tially classified as too much information, exaggerated,
and incomplete should instead be categorized as non-
sequitur.
4.2 LLM Awareness of Noise
To evaluate the QG agent’s ability to detect noisy an-
swers, we considered that while the agent was not
explicitly aware of the types of noisy answers, it was
expected to detect whether the answer failed to ade-
quately fulfill the requirements related to the compo-
nents of the OP in dialogue. As shown in Table 4, the
QG agent successfully detected noise and requested
clarification or confirmation in only 40.8% of the cases,
and failed to detect the noise and respond accordingly
in 59.1% of the cases. Particularly with noise types like incon-
sistent information (119 cases), the QG agent mostly
failed to detect mathematical errors.
4.3 Answer Validation
Our evaluation of noisy answer identification is con-
ducted in two phases: the detection of noisy answers
and the classification of their specific types. As de-
tailed in Table 5, the initial phase, acting as a bi-
nary classification method, reveals that predictions
for non-noisy instances achieved higher performance
compared to noisy cases, highlighting the complexity
of detecting noise. Pretrained models like RoBERTa
(roberta-base) and BERT (bert-base-uncased) outper-
form LSTM, Mixtral8x7B, Llama-2-7B, and Gemini-
1.5-flash.
The NL4Opt dataset has been pre-split into the
disjoint sets of training (713 instances), development
(99 instances), and test set (289 instances). While the
conversations generated on the basis of the test set
were used for testing all of the models, the conversa-
tions based on the training set were utilized for training
non-LLM models, LSTM, BERT, and RoBERTa. The
models trained on the training conversations labeled
the test data based on the patterns learned during
the training/fine-tuning phase, which, as expected,
made their performance superior by approxi-
mately 8%.
Table 4: Overview of the noise detection status within the Gemini-generated noisy dataset based on the behavior of the QG agent, broken down by noise category.

Detection Status      INCM  TMI  NS   CNF  INCN  EXG  Total
Detected by QG        100   63   136  49   12    54   414 (40.8%)
Not Detected by QG    56    123  86   94   119   122  600 (59.1%)

Table 5: Performance metrics of the different models for the accuracy classification of answers into noisy and non-noisy classes.

Metric      Model       Noisy   Non-Noisy   Macro avg.
Precision   Gemini1.5   0.745   0.963       0.854
            Mixtral     0.607   0.956       0.782
            Llama2      0.608   0.980       0.794
            BERT        0.871   0.983       0.927
            RoBERTa     0.945   0.984       0.964
            LSTM        0.814   0.968       0.891
Recall      Gemini1.5   0.639   0.977       0.808
            Mixtral     0.582   0.960       0.771
            Llama2      0.823   0.944       0.884
            BERT        0.842   0.987       0.915
            RoBERTa     0.850   0.994       0.922
            LSTM        0.691   0.983       0.837
F1-Score    Gemini1.5   0.688   0.970       0.830
            Mixtral     0.594   0.958       0.776
            Llama2      0.699   0.962       0.831
            BERT        0.856   0.985       0.921
            RoBERTa     0.895   0.989       0.942
            LSTM        0.747   0.975       0.861

Regarding noise classification, Table 6 provides a
detailed comparison of the performance of six differ-
ent models including Gemini-1.5-flash, Mixtral8x7B,
Llama-2-7B, BERT (bert-base-uncased), RoBERTa
(roberta-base), and LSTM across the previously-
mentioned six noise categories. RoBERTa demon-
strates high precision across multiple noise categories,
particularly excelling in non-sequitur and confusion.
Its recall values indicate strong detection of noisy instances, es-
pecially within the exaggerated and inconsistent cat-
egories. We observed RoBERTa’s low performance
in recall for the confusion category. While
it accurately classified responses when the confusion
was expressed negatively in a message, it struggled
to identify cases where the prior response was denied
in negative language, indicating that its classification
strongly depends on the impact of negative expres-
sions.
BERT achieves the highest precision in the incom-
plete and inconsistent categories. It also shows a high
recall value in the incomplete and confusion categories.
LSTM struggles with detecting confusion and inconsis-
tent noise, with scores of zero in both categories. Gemini
tends to have the highest performance in precision for
detecting too much information and exaggerated cat-
egories. While Gemini excels in specific categories,
Mixtral and Llama show unsatisfactory results, mak-
ing them less reliable. The LLMs have the lowest
performances when it comes to detecting incomplete
answers.
We analyzed the reasoning and explanations pro-
vided by the LLMs in identifying each type of noisy
answer and found that they frequently fail to deliver
accurate justifications for specific noise types. Clas-
sifying a single noise category among multiple noise
types introduces complexity. The overlapping char-
acteristics of different noise categories, with subtle
distinctions between them, make it difficult for these
models to distinguish between noise types. As a result,
the models often misclassify noisy answers, highlight-
ing a limitation in their current ability to handle noise
classification using their acquired world knowledge
and in the absence of fine-tuning for LLMs.
A total of 224 responses with correct ground truth
labels, misclassified as noisy by the answer validation
system, were reviewed. Of these, 22 were correctly
validated as noisy and the ground truth labels were
incorrect. Most correctly classified samples involved
questions about quantities, where responses were irrel-
evant and addressed cost or profit instead.
Figure 5 summarizes the classification confusion
matrices of the six models for six different types of
noise in the test dataset, demonstrating the deficiencies
of LLMs and non-LLM-based models in distinguish-
ing overlapping noisy answers. Table 7 presents the
number of actual noise instances in the test dataset as
a reference for the values in the confusion matrices.
For LLMs, Gemini frequently misclassified the
confusion noise category as incomplete. Mixtral mis-
classified the confusion and exaggerated noise cat-
egories as inconsistent. Llama misclassified non-
sequiturs as inconsistent and the confusion noise cate-
gory as non-sequitur and incomplete while showing a
tendency to misclassify exaggerated, incomplete, and
inconsistent noisy answers as non-sequiturs too. From
the cumulative results for LLMs, it is evident that
the majority of misclassification dispersion occurred
among the incomplete, non-sequitur, and inconsistent
categories.
For non-LLM-based models, BERT misclassified
actual non-sequitur noise as incomplete. A similar
misclassification pattern was observed in RoBERTa,
where non-sequitur noise was predicted as incomplete.
LSTM exhibited significant challenges in detecting
non-sequitur noise, misclassifying it across various
categories, such as incomplete, too much information,
and exaggerated. The cumulative results of non-LLM-
based models show that misclassification occurred on
incomplete, non-sequitur, and exaggerated answers.
Table 6: Precision, recall, f1-score, and macro averages across noise categories for different models.
Metric Model INCM TMI NS CNF INCN EXG Macro avg.
Precision Gemini1.5 0.239 0.944 0.753 0.307 0.642 0.971 0.643
Mixtral 0.225 0.750 0.745 0.434 0.208 0.652 0.502
Llama2 0.254 0.783 0.227 0.619 0.223 0.829 0.489
BERT 0.767 0.724 0.777 0.600 0.840 0.936 0.774
RoBERTa 0.756 0.891 0.894 0.867 0.735 0.923 0.844
LSTM 0.267 0.787 0.426 0.000 0.000 0.542 0.337
Recall Gemini1.5 0.306 0.302 0.860 0.353 0.500 0.680 0.500
Mixtral 0.250 0.396 0.667 0.294 0.389 0.600 0.433
Llama2 0.444 0.340 0.509 0.382 0.583 0.580 0.473
BERT 0.917 0.792 0.737 0.529 0.583 0.880 0.740
RoBERTa 0.889 0.774 0.737 0.417 0.694 0.960 0.745
LSTM 0.417 0.906 0.456 0.000 0.000 0.640 0.403
F1-Score Gemini1.5 0.268 0.457 0.803 0.328 0.562 0.800 0.536
Mixtral 0.237 0.519 0.704 0.351 0.271 0.625 0.451
Llama2 0.323 0.474 0.314 0.473 0.323 0.682 0.431
BERT 0.835 0.757 0.756 0.563 0.689 0.907 0.751
RoBERTa 0.817 0.828 0.808 0.563 0.714 0.941 0.779
LSTM 0.325 0.842 0.441 0.000 0.000 0.587 0.366
Table 7: Ground-truth distribution of noisy answers per category in the test conversation set.
Noise type INCM TMI NS CNF INCN EXG Total
Number of instances 36 53 57 34 36 50 266
5 CONCLUDING REMARKS
This work focused on assessing LLMs’ abilities in
detecting imposed stress in the form of erroneous re-
sponses to questions and maintaining a correct path
in goal-oriented dialogues. For this, we developed a
framework for generating synthetic dialogues includ-
ing incorrect answers in six categories: incomplete,
too much information, non-sequitur, confusion, incon-
sistent, and exaggerated by the Gemini large language
model, which simulates natural error-prone conversa-
tional interactions. Our initial findings revealed that
the LLM-based questioning agent struggled to detect
these noisy responses in a large portion of the gener-
ated noisy conversations. This limitation highlighted
the necessity for an improved dialogue system, leading
us to the investigation and development of an answer
validation strategy as a separate agent/model. This
validator agent is designed to detect noisy responses
and further identify specific types of noise based on
our predefined set of rules. While the validator demon-
strated acceptable performance in distinguishing noise
from non-noise, its effectiveness in classifying the
specific noise types remains limited, even when using
some of the state-of-the-art LLMs and deep learning
models. Further improvement of the answer validator
thus represents a crucial step towards enhancing the
effectiveness of dialogue systems and the robustness
of the goal-oriented dialogue agent to imposed stress.
6 FUTURE WORK
We plan to further this work in several directions. First,
in this work, the answer validator operates offline,
functioning independently rather than as an interme-
diary agent between the question and answer agents.
While this offline approach allows for comprehensive
analysis and refinement, an online answer validator
could offer dynamic, real-time validation in the dia-
logue process.
Second, we introduced six noise categories in this
paper. Future work will also focus on refining
and expanding these categories to better align with
the specific challenges associated with optimization
problems. Additionally, we will examine whether and
how expanding these categories will impact the noise
detection process.
Third, improvements are necessary, especially in
the context of 6-way noise classification. Applying
methods like Chain-of-Thought reasoning (Wei et al.,
2022), which breaks down complex problems into
smaller, more manageable steps, may lead to more accurate
classification. Additionally, fine-tuning LLMs on the
specific noise types may improve the model’s ability
to distinguish between several noise categories.
Figure 5: Confusion matrices summarizing the actual noise categories compared against the predicted categories. The
SUM LLM and SUM NonLLM matrices represent the sum matrices of the relevant groups of LLM-based and non-LLM-based
models, respectively.
REFERENCES
Abdullin, Y., Molla, D., Ofoghi, B., Yearwood, J., and Li,
Q. (2023). Synthetic dialogue dataset generation using
LLM agents. In Gehrmann, S., Wang, A., Sedoc, J.,
Clark, E., Dhole, K., Chandu, K. R., Santus, E., and
Sedghamiz, H., editors, Proceedings of the Third Work-
shop on Natural Language Generation, Evaluation, and
Metrics (GEM), pages 181–191, Singapore. Association
for Computational Linguistics.
Chen, G. H., Chen, S., Liu, Z., Jiang, F., and Wang, B. (2024).
Humans or LLMs as the judge? a study on judgement bias.
In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N., editors,
Proceedings of the 2024 Conference on Empirical Meth-
ods in Natural Language Processing, pages 8301–8327,
Miami, Florida, USA. Association for Computational Lin-
guistics.
Cohen, J. (1960). A coefficient of agreement for nomi-
nal scales. Educational and psychological measurement,
20(1):37–46.
Daniel, W. W. (1978). Biostatistics: a foundation for analy-
sis in the health sciences, volume 129. Wiley.
Durmus, E., He, H., and Diab, M. (2020). FEQA: A question
answering evaluation framework for faithfulness assess-
ment in abstractive summarization. In Jurafsky, D., Chai,
J., Schluter, N., and Tetreault, J., editors, Proceedings of
the 58th Annual Meeting of the Association for Computa-
tional Linguistics, pages 5055–5070, Online. Association
for Computational Linguistics.
Dziri, N., Madotto, A., Zaïane, O., and Bose, A. J. (2021).
Neural path hunter: Reducing hallucination in dialogue
systems via path grounding. In Moens, M.-F., Huang,
X., Specia, L., and Yih, S. W.-t., editors, Proceedings
of the 2021 Conference on Empirical Methods in Natu-
ral Language Processing, pages 2197–2214, Online and
Punta Cana, Dominican Republic. Association for Com-
putational Linguistics.
Harabagiu, S. and Hickl, A. (2006). Methods for using
textual entailment in open-domain question answering.
In Proceedings of the 21st International Conference on
Computational Linguistics and 44th Annual Meeting of
the Association for Computational Linguistics, pages 905–
912.
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary,
B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna,
E. B., Bressand, F., et al. (2024a). Mixtral of experts.
arXiv preprint arXiv:2401.04088.
Jiang, Z., Peng, H., Feng, S., Li, F., and Li, D. (2024b). Llms
can find mathematical reasoning mistakes by pedagogical
chain-of-thought. In Larson, K., editor, Proceedings of the
Thirty-Third International Joint Conference on Artificial
Intelligence, IJCAI-24, pages 3439–3447. International
Joint Conferences on Artificial Intelligence Organization.
Main Track.
Kenton, J. D. M.-W. C. and Toutanova, L. K. (2019). BERT:
Pre-training of deep bidirectional transformers for lan-
guage understanding. In Proceedings of NAACL-HLT, vol-
ume 1, page 2.
Konigari, R., Ramola, S., Alluri, V. V., and Shrivastava, M.
(2021). Topic shift detection for mixed initiative response.
In Proceedings of the 22nd Annual Meeting of the Special
Interest Group on Discourse and Dialogue, pages 161–
166.
Liu, Y. (2019). Roberta: A robustly optimized bert pretrain-
ing approach. arXiv preprint arXiv:1907.11692.
McHugh, M. (2012). Interrater reliability: The kappa statistic. Biochemia Medica: časopis Hrvatskoga društva medicinskih biokemičara / HDMB, 22:276–82.
Ofoghi, B., Yearwood, J., and Ma, L. (2009). The impact of
frame semantic annotation levels, frame-alignment tech-
niques, and fusion methods on factoid answer processing.
J. Am. Soc. Inf. Sci. Technol., 60(2):247–263.
Pakray, P., Bhaskar, P., Banerjee, S., Pal, B. C., Bandyopad-
hyay, S., and Gelbukh, A. F. (2011). A hybrid question an-
swering system based on information retrieval and answer
validation. In CLEF (Notebook Papers/Labs/Workshop),
volume 96.
Pan, L., Chen, W., Kan, M.-Y., and Wang, W. Y. (2021). Con-
traqa: Question answering under contradicting contexts.
ArXiv.
Peñas, A., Rodrigo, A., Sama, V., and Verdejo, M. (2006).
Overview of the answer validation exercise 2006. In
Evaluation of Multilingual and Multi-modal Information
Retrieval, volume 1172, pages 257–264.
Ramamonjison, R., Yu, T., Li, R., Li, H., Carenini, G., Ghad-
dar, B., He, S., Mostajabdaveh, M., Banitalebi-Dehkordi,
A., Zhou, Z., et al. (2023). Nl4Opt competition: Formulat-
ing optimization problems based on their natural language
descriptions. In NeurIPS 2022 Competition Track, pages
189–203. PMLR.
Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lilli-
crap, T., Alayrac, J.-b., Soricut, R., Lazaridou, A., Firat,
O., Schrittwieser, J., et al. (2024). Gemini 1.5: Unlocking
multimodal understanding across millions of tokens of
context. arXiv preprint arXiv:2403.05530.
Savic, D. (2002). Single-objective vs. multiobjective optimi-
sation for integrated decision support. Proceedings of the
First Biennial Meeting of the International Environmental
Modelling and Software Society.
Staudemeyer, R. C. and Morris, E. R. (2019). Understanding
lstm–a tutorial into long short-term memory recurrent
neural networks. arXiv preprint arXiv:1909.09586.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi,
A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P.,
Bhosale, S., et al. (2023). Llama 2: Open foundation and
fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F.,
Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-
thought prompting elicits reasoning in large language
models. Advances in neural information processing sys-
tems, 35:24824–24837.
Yu, D. and Sagae, K. (2021). Automatically exposing prob-
lems with neural dialog models. In Moens, M.-F., Huang,
X., Specia, L., and Yih, S. W.-t., editors, Proceedings of
the 2021 Conference on Empirical Methods in Natural
Language Processing, pages 456–470, Online and Punta
Cana, Dominican Republic. Association for Computa-
tional Linguistics.
Zhang, Y. and Zhang, D. (2003). Enabling answer validation
by logic form reasoning in chinese question answering.
In International Conference on Natural Language Pro-
cessing and Knowledge Engineering, 2003. Proceedings.
2003, pages 275–280. IEEE.
APPENDIX A
This section provides prompts for generating questions,
correct answers, and noisy answers.
Question Generator Prompt
You are a chatbot called OptiGem, designed to help
users elicit information and formulate a complete opti-
mization problem statement.
The client is not a math expert and has no experience
with optimization problems.
Your goal is to gather the necessary details and map
them to a linear programming model.
Engage users by asking clear, concise, and sequential
questions to obtain the components of the problem.
The components are:1- Objective function 2- Decision
variables 3- Limitations and constraints 4- Additional
information.
Be creative in formulating your questions. Only one
component is allowed to be discussed per message.
Strictly avoid summarizing the gathered information
at any point during the conversation.
Think carefully to ensure, you gather all the necessary
details for the complete problem.
Pose a question based on the previous information that
will lead to identify a new constraint or a new key
parameter for the model.
Start the conversation with a friendly greeting, intro-
duce yourself, and ask about the user’s business.
If the user indicates that they have no additional in-
formation and all components are covered, end the
conversation with a polite farewell, such as: “It was
great working with you! Let me know if you have any
other optimization questions in the future.”
Answer Generator Prompt
You are an agent impersonating the business owner
described in the problem statement.
Act as if the details in the problem statement are your
personal knowledge.
Be polite and ensure that all information you provide is
accurate, concise, and derived solely from the problem
statement.
You are not a math expert and have no experience with
optimization problems.
If the problem statement lacks the requested informa-
tion, simply state that you do not know.
The ideal initial response should be as follows:
“Hello! I have a [business name], and I want to opti-
mize [objective function briefly].”
Noisy Answer Generator Prompt
Incomplete: You are a noisy agent whose purpose is
to generate noisy or distracting answers.
Incomplete: Ensure each response ends as an unfin-
ished thought or statement, cutting off abruptly to in-
dicate interruption or incompletion.
Example of incomplete answer (1):
Agent: What is your favorite color?
Client: My favorite color is...
Example of incomplete answer (2):
Agent: What type of music do you like?
Client: I really enjoy listening to...
Non-sequitur: You are a noisy agent with the pur-
pose of generating noisy or distracting answers.
Non-sequitur: Provide answers that have absolutely
no connection to the question. The more complex, the
better!
Non-sequitur only: Focus solely on generating an-
swers that have absolutely no connection to the topic
or question at hand.
Humor and absurdity: Make the responses as funny
and nonsensical as possible.
Complexity is key: Add irrelevant details.
Example of non-sequitur answer (1):
Agent: How do I solve this math equation?
Client: Speaking of equations, I hear the fish soup is
to die for.
Example of non-sequitur answer (2):
Agent: What are the steps to improve efficiency in my
work process?
Client: Did you know penguins propose with pebbles?
Romantic, isn’t it?
Confusion: You are a noisy agent with the
purpose of generating noisy or distracting answers.
Generate intentionally confusing, contradictory
answers based on the conversation history or the
current answer, creating a sense of uncertainty.
Example of confusion answer (1):
Agent: Do you want to improve efficiency in your
operations?
Client: Yes, I want to make my operations as
inefficient as possible.
Example of confusion answer (2):
Agent: Are there any constraints on your resources?
Client: I want to minimize resource usage while using
the maximum possible resources.
Exaggerated: You are a noisy agent with the pur-
pose of generating noisy or distracting answers. Pro-
vide exaggerated, outrageous, and unreasonable nu-
merical values.
The exaggerated values can be positive or negative.
Example of exaggerated answer (1):
Agent: How many cups of coffee do you drink in a
day?
Client: Well, on a typical day, I consume approxi-
mately enough coffee to fill a swimming pool, then
add 1 billion gallons for good measure!
Example of exaggerated answer (2):
Agent: How many miles do you walk in a week?
Client: I walk about the distance from here to the moon
and back.
Too much information: You are a noisy agent
with the purpose of generating noisy or distracting an-
swers.
Provide detailed, excessive, and unnecessary informa-
tion in the answer.
Focus on giving an abundance of background knowl-
edge.
Example of too much information answer (1):
Agent: What is the maximum number of workers in
the company?
Client: Workers have been the backbone of compa-
nies since the industrial revolution, where they toiled
in factories under harsh conditions. Over time, labor
movements fought for better rights, leading to the es-
tablishment of labor unions and regulations. The 20th
century saw the rise of the modern workforce, with em-
ployees gaining rights such as minimum wage, safety
standards, and the 40-hour workweek. In recent years,
there’s been a shift towards remote work and the gig
economy, reshaping traditional employment structures.
Despite these changes, workers remain essential for
driving economic growth and innovation, shaping the
evolving landscape of employment.
Example of too much information answer (2):
Agent: How much does the company produce daily?
Client: The company produces approximately 10,000
units daily. Production levels depend on various fac-
tors such as demand, machinery efficiency, and work-
force capacity. Our advanced manufacturing process,
combined with strict quality controls, ensures consis-
tent output. This volume helps us meet market needs
while maintaining operational efficiency.
Inconsistent: You are a noisy agent with the pur-
pose of generating noisy or distracting answers.
Provide answers with incorrect basic mathematical
calculations, such as illogical proportion calculations,
misinterpretation of units, or applying fractional units
to physical objects.
Misinterpretation of units: Incorrect conversion of
measurement units, such as stating weight in liters or
hours in kilometers.
Illogical proportion calculations: Use proportions that
don’t add up logically, such as percentages exceeding
100%.
Applying fractional units to physical objects: Use frac-
tional or decimal units with objects that should be
whole, like “3.7 cars” or “250.5 employees.”
Examples:
Illogical proportion calculation example (1):
Agent: Could you tell me which industries you’re cur-
rently investing in?
Client: I am investing in 35% airline industry and 85%
railway industry.
Illogical proportion calculation example (2):
Agent: How many workers are employed in the hospi-
tal?
Client: There are 210 nurses and 12 doctors, so the
total number of workers is 130.
Fractional units to physical objects example (1):
Agent: How many people attended the event?
Client: The event had 250.4 attendees.
Fractional units to physical objects example (2):
Agent: How many cows are on your farm?
Client: There are 35.6 cows grazing in the field.
Misinterpretation of units example (1):
Agent: How many books did your company publish
each day?
Client: My company published 30.5 square light-years
of books.
Misinterpretation of units example (2):
Agent: How many chairs did your company produce
this month?
Client: My company produced 42 megabytes of chairs.
APPENDIX B
This section includes a sample dialogue generated using Gemini that contains a noisy answer of the inconsistent category; the problem statement originates from NL4Opt (Ramamonjison et al., 2023).
Problem Statement: A dog school trains
labradors and golden retrievers to deliver newspaper.
A labrador can deliver 7 newspapers at a time and
requires 5 small bone treats for service. A golden re-
triever can deliver 10 newspapers at a time and requires
6 small bone treats per service. The school only has
1500 small bone treats available. In addition, at least
50 golden retrievers must be used and at most 60%
of the dogs can be labradors. How many of each dog
should be used to maximize the number of newspapers
that can be delivered?
Agent: Hello! I’m Optigem, your friendly opti-
mization assistant. What kind of business are you in?
Client: Hello! I have a dog school and I want to
optimize the number of newspapers delivered.
Agent: That’s great! So, you want to maximize the
number of newspapers delivered. To help me under-
stand, what are the things you can control to reach that
goal?
Client: Yes, I use labradors and golden retrievers.
Agent: Great! So, you use labradors and golden
retrievers. We’re trying to maximize the number of
newspapers delivered. To make this a math problem,
we need to figure out what decisions you make that
affect the number of newspapers delivered. What are
the things you can change or decide on that relate to
the dogs?
Client: I can decide how many labradors and how
many golden retrievers I use.
Agent: Now, are there any limits or restrictions on
how many of each dog you can train?
Client: Yes, I can only train 1.5 dogs at a time.
(Inconsistent)
Agent: That’s interesting! Let’s try to rephrase that.
Perhaps there are limits on the number of dogs you can
use, even if you could train more. What kind of limits
or restrictions do you have on the number of dogs you
can use?
Client: I only have 1500 small bone treats available.
Also, at least 50 golden retrievers must be used. And
at most 60% of the dogs can be labradors.
Agent: That’s great information! How much does
it cost to feed each type of dog for a single newspaper
delivery run?
Client: A labrador requires 5 small bone treats for
service, and a golden retriever requires 6.
Agent: That makes sense! Now, let’s talk about the
actual delivery process. How many newspapers can
each dog deliver per run?
Client: A labrador can deliver 7 newspapers at a
time and a golden retriever can deliver 10.
Agent: Okay, so we know how many newspapers
each dog can deliver. Is there any other information
about the dogs or the delivery process that is important
to know?
Client: I don’t know.
Agent: It was great working with you! I will come
back to you shortly.
Missing Information: No missing information was
found.
Agent: Thanks for your time, I’ll go ahead and
start working on your linear programming problems.