Intrinsic Evaluation of RAG Systems for Deep-Logic Questions*
Junyi (Edward) Hu, You Zhou and Jie Wang
Miner School of Computer & Information Sciences, University of Massachusetts, Lowell, MA, U.S.A.
{junyi_hu, you_zhou1}@student.uml.edu, jie_wang@uml.edu
ORCID: https://orcid.org/0000-0001-8524-0123 (J. Hu), https://orcid.org/0009-0005-0919-5793 (Y. Zhou), https://orcid.org/0000-0003-1483-2783 (J. Wang)
*This work was supported in part by Librum Technologies, Inc.
Keywords:
Retrieval Augmented Generation, Logical-Relation Correctness Ratio, Overall Performance Index.
Abstract:
We introduce the Overall Performance Index (OPI), an intrinsic metric to evaluate retrieval-augmented genera-
tion (RAG) mechanisms for applications involving deep-logic queries. OPI is computed as the harmonic mean
of two key metrics: the Logical-Relation Correctness Ratio and the average of BERT embedding similarity
scores between ground-truth and generated answers. We apply OPI to assess the performance of LangChain,
a popular RAG tool, using a logical relations classifier fine-tuned from GPT-4o on the RAG-Dataset-12000
from Hugging Face. Our findings show a strong correlation between BERT embedding similarity scores
and extrinsic evaluation scores. Among the commonly used retrievers, the cosine similarity retriever using
BERT-based embeddings outperforms others, while the Euclidean distance-based retriever exhibits the weak-
est performance. Furthermore, we demonstrate that combining multiple retrievers, either algorithmically or
by merging retrieved sentences, yields superior performance compared to using any single retriever alone.
1 INTRODUCTION
A RAG system typically consists of two major com-
ponents: Indexing and Retrieval. The former is re-
sponsible for indexing a reference text document be-
fore any queries are made to it. The latter is responsi-
ble for retrieving relevant data from the indexed docu-
ment in response to a query and passing that informa-
tion, along with the query, to a large language model
(LLM) to generate an answer. The Retrieval compo-
nent is typically a framework that supports a variety
of retrieval methods, each referred to as a retriever.
To assess the effectiveness of a retriever in un-
covering the logical relationship for an answer to a
query with respect to the reference document, we in-
troduce the Overall Performance Index (OPI). This
metric measures both the correctness of the answers
generated by an LLM and the accuracy of the logi-
cal relations produced by a classifier. The OPI is cal-
culated as the harmonic mean of the BERT embed-
ding similarity between ground-truth and generated
answers, and the logical-relation correctness ratio.
To demonstrate the effectiveness of the OPI met-
ric, we use the RAG-Dataset-12000 provided by Hug-
ging Face (D.H., 2024) as the training and testing
dataset. We fine-tune GPT-4o to construct a classi-
fier to generate logical relations between a query and
an answer, with respect to the reference document.
We then evaluate LangChain (LangChain, 2024), a
popular RAG tool, with seven common retrievers, ex-
tracting relevant sentences from the reference docu-
ment for each query. Using GPT-4o as the underlying
LLM, we generate an answer to the query and use the
fine-tuned GPT-4o classifier to generate a logical re-
lation.
To rank retrievers, we calculate the average OPI
score across all 13 logical relations provided in RAG-
Dataset-12000. We then use OPI to analyze the
strengths and weaknesses of individual retrievers.
Moreover, we demonstrate that several variations of
combining multiple retrievers, either algorithmically
or by merging retrieved sentences, outperform a sin-
gle retriever alone.
2 PRELIMINARIES
The technique of RAG was introduced by Lewis et al. (Lewis et al., 2020) a few years before the
widespread adoption of LLMs. The performance of
a RAG system relies on the quality of the underlying
retriever and the ability of the underlying LLM.
LangChain is a popular RAG tool, which divides
a reference document into overlapping text chunks of
equal size. The suffix of each chunk overlaps with the
prefix of the next.
To the best of our knowledge, no previous re-
search has comprehensively evaluated the perfor-
mance of RAG systems in the context of deep-logic
question answering.
Given below are seven common sentence retriev-
ers supported by LangChain:
DPS (dot-product similarity) converts a query and a text chunk into BERT-based (Devlin et al., 2018) embedding vectors and computes their dot product as a similarity score. It returns k chunks with the highest scores to the query. (DPS in LangChain is referred to as Cosine Similarity.)
kNN (k-Nearest Neighbors) in LangChain is the dot-product similarity normalized by the L2-norm, which is widely referred to as the cosine similarity.
It returns k chunks with the highest cosine similarity
scores to the query.
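To make the scoring concrete, the following minimal sketch (ours, not from the paper) ranks chunks by dot product or cosine similarity; the embedding vectors are assumed to come from any BERT-based sentence encoder.

import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=4, normalize=False):
    # Rank chunks by dot product (DPS) or, with normalize=True,
    # by cosine similarity (kNN in LangChain's terminology).
    q = np.asarray(query_vec, dtype=float)
    C = np.asarray(chunk_vecs, dtype=float)   # shape: (num_chunks, dim)
    if normalize:
        q = q / np.linalg.norm(q)
        C = C / np.linalg.norm(C, axis=1, keepdims=True)
    scores = C @ q                            # one similarity score per chunk
    return np.argsort(-scores)[:k]            # indices of the k best chunks

With normalize=False this computes DPS scores; with normalize=True it computes the cosine-similarity (kNN) scores described above.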
BM25 (Robertson and Zaragoza, 2009) is a prob-
abilistic information retrieval model that ranks docu-
ments based on the term frequency in a chunk and the
inverse chunk frequency in the reference document.
Let q be a query, T a chunk of text, f(t_i, T) the frequency of term t_i in T, |T| the size of T, avgTL the average chunk length, N the total number of chunks, and n(t_i) the number of chunks that contain t_i. Then BM25(q, T) is defined by

$$\mathrm{BM25}(q,T) = \sum_{i=1}^{n} \ln\!\left(\frac{N - n(t_i) + 0.5}{n(t_i) + 0.5} + 1\right) \cdot \frac{f(t_i,T)\cdot(\kappa + 1)}{f(t_i,T) + \kappa\cdot\left(1 - b + b\cdot\frac{|T|}{\mathrm{avgTL}}\right)},$$
where κ and b are parameters. Return k chunks of text
with the highest BM25 scores to the query.
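A direct transcription of this formula (a sketch under the paper's notation; the default values κ = 1.2 and b = 0.75 are our assumption, since the paper does not specify them):

import math
from collections import Counter

def bm25_score(query_terms, chunk_terms, all_chunk_terms, avg_len, kappa=1.2, b=0.75):
    # BM25(q, T) for one chunk T, given the tokenized terms of every chunk.
    N = len(all_chunk_terms)                          # total number of chunks
    tf = Counter(chunk_terms)                         # f(t_i, T)
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for terms in all_chunk_terms if t in terms)   # n(t_i)
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        denom = tf[t] + kappa * (1 - b + b * len(chunk_terms) / avg_len)
        score += idf * tf[t] * (kappa + 1) / denom
    return score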
SVM (Support Vector Machine) (Cortes and Vap-
nik, 1995) is a supervised learning model that finds
the hyperplane that best separates data points in a
dataset. To use SVM as a retriever, first represent
each chunk of text as a feature vector. This can be
done using word embeddings, TF-IDF, or any other
vectorization method. Then use the labeled dataset
to train an SVM model. Convert the query into the
same feature vector space as the chunks. Apply the
SVM model to the query vector to produce a score
that indicates how similar the query is to each chunk.
Extract k chunks with the highest scores.
TF-IDF (Sammut and Webb, 2011) measures the
importance of a word in a chunk of text relative to the
set of chunks in the reference document, combining
term frequency and inverse chunk frequency. In par-
ticular,
TF-IDF(t, T) = TF(t, T) × IDF(t),
where t is a term, T is a chunk, and IDF(t) is the
inverse chunk frequency of t. Given a query q, select
k chunks with the highest TF-IDF(q, T ) values.
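As an illustration (ours; the paper does not specify the implementation), such a retriever can be sketched with scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_top_k(query, chunks, k=4):
    # Fit IDF statistics on the chunk set, then rank chunks by their
    # TF-IDF similarity to the query.
    vectorizer = TfidfVectorizer()
    chunk_matrix = vectorizer.fit_transform(chunks)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, chunk_matrix)[0]
    return sorted(range(len(chunks)), key=lambda i: -scores[i])[:k]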
MMR (Carbonell and Goldstein, 1998) is a re-
trieval algorithm that balances relevance and diver-
sity in the selection of k chunks. It iteratively selects
chunks that are both relevant to the query and min-
imally redundant with respect to the chunks already
selected.
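A minimal version of this selection loop (our sketch; the trade-off weight lam is an assumption, as the paper does not give MMR parameter values):

import numpy as np

def mmr_select(query_vec, chunk_vecs, k=4, lam=0.5):
    # Greedily pick chunks that are relevant to the query (first term) yet
    # dissimilar to the chunks already selected (second term).
    q = query_vec / np.linalg.norm(query_vec)
    C = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    relevance = C @ q
    selected, candidates = [], list(range(len(C)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((float(C[i] @ C[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected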
EDI (Euclidean Distance) (Bishop, 2006) measures the straight-line distance between a query and a chunk, both represented as bag-of-words vectors. Return
k chunks with the shortest distance to the query.
A data point in RAG-Dataset-12000 contains the
following attributes: ‘context’, ‘question’, ‘answer’,
‘retrieved sentences’, ‘logical relation’, where ‘con-
text’ is the reference document. There are thirteen
categories of logical reasoning in the dataset. Their
names, abbreviations, descriptions, and the distribu-
tion of counts are presented in Table 1. All but the
last category involve deep logical reasoning, mean-
ing that arriving at the correct answer requires com-
plex, multi-step processes involving multiple con-
cepts, facts, or events extracted from the content. The
table includes eleven specific types of deep reasoning,
with an additional category for general deep reason-
ing, referred to as multi-hop reasoning.

Table 1: Information of logical relations.
3 OVERALL PERFORMANCE INDEX
Let A and LR denote, respectively, the ground-truth answer and the ground-truth logical relation with respect to the question Q, the context C, and the retrieved sentences S. Let A′ and LR′ denote, respectively, the answer and the logical relation generated by a RAG system with an LLM. We represent A and A′ using BERT embeddings and compute the cosine similarity of the embeddings.

For a given dataset D with respect to a particular logical relation LR, let BERTSim_D denote the average of the BERT similarity scores over all (A, A′) pairs, and let LRCR_D (logical-relation correctness ratio) denote the proportion of data points whose predicted logical relation matches LR. Namely,

$$\mathrm{LRCR}_D = \frac{|\{d \in D \mid LR = LR'\}|}{|D|}. \qquad (1)$$
The OPI for dataset D is defined by the following parameterized harmonic mean of BERTSim_D and LRCR_D, similar to defining the F-measure (Lewis and Gale, 1994):
$$\mathrm{OPI}(\beta)_D = \frac{(1+\beta^{2}) \cdot \mathrm{BERTSim}_D \cdot \mathrm{LRCR}_D}{\beta^{2} \cdot \mathrm{BERTSim}_D + \mathrm{LRCR}_D}. \qquad (2)$$
OPI(1)_D weighs answer accuracy and logical relation accuracy equally. OPI(β)_D weighs answer accuracy more heavily when β > 1 (e.g., β = 2), and weighs logical relation accuracy more heavily when β < 1 (e.g., β = 0.5).
When there is no confusion in the context, the sub-
script D is omitted. Denote OPI(1) as OPI-1, OPI(2)
as OPI-2, and OPI(0.5) as OPI-0.5.
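Both quantities, and hence the index itself, are simple to compute once the per-pair similarity scores and the predicted relations are available; a minimal sketch (ours):

def lrcr(true_relations, predicted_relations):
    # Proportion of data points whose predicted logical relation is correct, as in (1).
    correct = sum(t == p for t, p in zip(true_relations, predicted_relations))
    return correct / len(true_relations)

def opi(bertsim, lrcr_value, beta=1.0):
    # Parameterized harmonic mean of BERTSim and LRCR, as in (2).
    return ((1 + beta**2) * bertsim * lrcr_value) / (beta**2 * bertsim + lrcr_value)

# OPI-1, OPI-2, and OPI-0.5 correspond to beta = 1, 2, and 0.5, respectively.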
In addition to BERTSim, other metrics may be
used to measure the similarity between the generated
answer and the ground-truth answer, such as Hugging
Face’s MoverScore, as applied in the study of con-
tent significance distributions of text blocks in a doc-
ument (Zhou and Wang, 2023). We choose BERT-
Sim because MoverScore uses IDF to compute word
weights, which is better suited for extractive answers
but less appropriate for generative answers produced
by LLMs.
Experimental results show that the BERTSim met-
ric aligns well with the outcomes of extrinsic compar-
isons of the ground-truth answers with the generated
answers (see Section 4.2 for details).
In what follows, we will use OPI-1 as the default
intrinsic measure to study the performance of RAG
systems for answering deep-logic questions.
4 EVALUATION
As seen in Table 1, the data points in RAG-Dataset-
12000 are unevenly distributed across the 13 logical
relations, with significant disparities, such as only 106
data points in Fuzzy Reasoning compared to 6,920
data points in Direct Matching. To fine-tune GPT-4o
and construct a classifier for identifying logical rela-
tions, a balanced dataset is preferred. To achieve this,
we randomly select 100 data points from each logical
relation category, forming a new dataset called RAG-
QA-1300 that consists of 1,300 data points. This
dataset is then split with an 80-20 ratio to create a
training set and a test set.
Fine-tuning was performed by combining the con-
text, question, and answer from each data point into
a cohesive input text, labeled with its corresponding
logical relation. The process involved approximately
800 training steps, resulting in a validation loss of
10^-4. This specific checkpoint was selected for its
optimal performance.
The fine-tuned GPT-4o classifier for logical rela-
tions significantly improves the accuracy to 75.77%
on the test set, compared to 49.23% when using GPT-
4o out-of-the-box without fine-tuning.
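For concreteness, the conversion of a data point into a fine-tuning example can be sketched as follows (our illustration, assuming the chat-style JSONL format of the OpenAI fine-tuning API; the prompt wording is ours, not the authors'):

import json

def to_finetune_example(context, question, answer, logical_relation):
    # One JSONL line: the combined input text labeled with its logical relation.
    return json.dumps({
        "messages": [
            {"role": "system",
             "content": "Identify the logical relation between the question and "
                        "the answer with respect to the context."},
            {"role": "user",
             "content": f"Context: {context}\nQuestion: {question}\nAnswer: {answer}"},
            {"role": "assistant", "content": logical_relation},
        ]
    })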
We used LangChain with the seven common re-
trievers mentioned in Section 2. We used GPT-4o to
generate answers and the fine-tuned GPT-4o classifier
to generate logical relations. LangChain supports a
wide range of retrievers and allows for the seamless
integration of pre-trained LLMs.
4.1 Intrinsic Evaluation
We set the chunk size to 100 words and the chunk overlap to 20% in the LangChain settings, where
paragraph breaks, line breaks, periods, question
marks, and exclamation marks are set to be the sep-
arators. These settings were fed into the LangChain
function RecursiveCharacterTextSplitter to
split a reference document into chunks, where each
chunk contains up to 100 words, ending at a specified
separator to break naturally such that the chunk is as
large as possible, and adjacent chunks have a 20%
overlap.
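This chunking step can be reproduced roughly as follows (a sketch; note that RecursiveCharacterTextSplitter measures chunk_size in characters by default, so a word-based budget requires a custom length_function, and the import path may vary with the LangChain version):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ".", "?", "!"],   # paragraph/line breaks and sentence enders
    chunk_size=100,                             # up to 100 words per chunk
    chunk_overlap=20,                           # 20% of 100 words
    length_function=lambda text: len(text.split()),
)
chunks = splitter.split_text(reference_document)   # reference_document: the 'context' string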
We used the default settings for each retriever
to return four chunks in the context with the best
scores—highest for similarity and ranking measures,
smallest for distance measures—from the underlying
retriever as the most relevant to the query. We then
converted the four chunks extracted by the retriever
back into complete sentences as they appeared in the
original article. These sentences and the query were
then fed to GPT-4o to generate an answer. Moreover,
we instructed GPT-4o to determine the logical rela-
tionship for the answer with respect to the input text.
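One retrieval-and-generation step along these lines can be sketched as follows (our illustration: the embedding model, the FAISS vector store, the helper expand_to_sentences, and the prompt wording are all assumptions, not the authors' exact setup):

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from openai import OpenAI

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = FAISS.from_texts(chunks, embeddings)            # chunks from the splitter above
retriever = store.as_retriever(search_kwargs={"k": 4})  # return four chunks per query

query = "What is the debate about what Desmond said in the context?"
docs = retriever.get_relevant_documents(query)
sentences = expand_to_sentences(docs)   # assumed helper: map chunks back to full sentences

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": f"Using only the following sentences, answer the question.\n"
                          f"Sentences: {sentences}\nQuestion: {query}"}],
)
answer = reply.choices[0].message.content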
Table 2: Intrinsic comparisons across all logical relations, where “Retr” is an abbreviation of Retriever, “B” stands for BERTSim, “L” for LRCR, and “O-1” for OPI-1.

We consider the accuracy of the generated an-
swers and logical relations to be equally important.
Table 2 presents the evaluation results of OPI-1 on
the test data of RAG-QA-1300. The OPI-1 score with
respect to each retriever is calculated for each set of
data points of the same logical relation. The average
OPI-1 score for each retriever across all logical rela-
tions is calculated by
$$\mathrm{OPI\text{-}1} = \frac{(2/|L|) \cdot \sum_{\ell \in L} \mathrm{BERTSim}_{\ell} \cdot \sum_{\ell \in L} \mathrm{LRCR}_{\ell}}{\sum_{\ell \in L} \mathrm{BERTSim}_{\ell} + \sum_{\ell \in L} \mathrm{LRCR}_{\ell}}, \qquad (3)$$

where L is the set of the 13 logical relations, and BERTSim_ℓ and LRCR_ℓ denote, respectively, the corresponding BERTSim score and LRCR value for the logical relation ℓ.
An alternative is to compute the OPI-1 score for each logical relation and then average these scores across all relations. While this differs
slightly from Formula (3), the difference is minimal.
We prefer Formula (3) for practical efficiency, as it
bypasses the need to compute individual OPI-1 scores
for each logical relation when these scores are not
needed in applications, streamlining the process and
reducing unnecessary computations.
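In code, Formula (3) amounts to a harmonic mean of the summed per-relation scores (a sketch, given dictionaries keyed by logical relation):

def average_opi1(bertsim_by_relation, lrcr_by_relation):
    # Harmonic mean of the per-relation averages, as in Formula (3).
    relations = list(bertsim_by_relation)
    b_sum = sum(bertsim_by_relation[r] for r in relations)
    l_sum = sum(lrcr_by_relation[r] for r in relations)
    return (2 / len(relations)) * b_sum * l_sum / (b_sum + l_sum)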
4.2 Extrinsic Evaluation
The extrinsic evaluation uses a three-point (0-3-7) scoring system to score A′ for each pair (A, A′), where A is the ground-truth answer and A′ is the answer generated directly by GPT-4o based on the question Q and the extracted sentences in the corresponding data point.
The scoring system works as follows:
A′ receives 7 points if it exactly matches the meaning of A.

A′ receives 3 points if it partially matches the meaning of A, with or without extra information not contained in A.

A′ receives 0 points if there is no meaningful match.
This scoring system was designed to simplify extrin-
sic evaluation while widening the gaps between per-
fect, partial, and no matches, providing a more nu-
anced assessment compared to the traditional three-point (1-2-3) scoring system. The intention is to create
a ranking with greater discrepancy between scores,
enabling clearer differentiation of performance levels
and allowing for more meaningful comparisons.
Given below are examples of each scenario.
7 points: A′ receives 7 points if A and A′ have exactly the same meaning, with neither additional nor missing information. For example, the following A′ using DPS is a 7-point answer.
Q: What is the debate about what Desmond
said in the context?
A: The debate is about what Desmond said to
Locke, specifically whether he said “snowman”
or “silhouette”.
A′: The debate is about whether Desmond said “snowman” or “silhouette”.
3 points: A′ receives 3 points in either of the following two cases:

Case 1: Extra information. A′ matches the meaning of A, but includes additional information not present in A. For example, the following A′ using MMR receives 3 points because A′ contains the extra information that the Ford GT40 was produced in limited runs from 1964 to 1969.
Q: What was the reason behind the creation of
the Ford GT40?
A: The Ford GT40 was created as a result of
a grudge between Henry Ford II and Enzo Fer-
rari. After a failed attempt to acquire Ferrari,
Henry Ford II decided to compete against Fer-
rari in the prestigious 24 Hours of Le Mans
race. To break Ferrari’s dominance in the race,
Ford’s chief engineer, Roy Lunn, came up with
the concept of the GT40.
A′: The Ford GT40 was born out of a rivalry be-
tween Henry Ford II and Enzo Ferrari. After a
failed attempt to acquire Ferrari, Henry Ford II
resolved to challenge Ferrari in the prestigious
24 Hours of Le Mans race. To disrupt Ferrari’s
dominance, Ford’s chief engineer, Roy Lunn,
conceived the idea for the GT40. It was pro-
duced in limited runs from 1964 to 1969.
Case 2: Partial match. The meaning of A′ partially overlaps with the meaning of A, but not fully. For example, the following A′ generated using BM25 receives 3 points. Reason: A′ clearly leaves out
information that the Ford GT40 was conceived by
Ford’s chief engineer, Roy Lunn.
Q: What was the reason behind the creation of
the Ford GT40?
A: See Case 1 above.
A′: The reason behind the creation of the Ford
GT40 was to compete against Ferrari in rac-
ing events, as evidenced by Ford’s continued
efforts to improve the GT40 and best the Ital-
ians.
0 points: A′ receives 0 points if A and A′ are distinct from each other, with no overlap in meaning. For example, the following A′ generated through EDI receives 0 points.
Q: What was the reason behind the creation of
the Ford GT40?
A: See Case 1 above.
A′: The Ford GT40 was created to take full ad-
vantage of the benefits associated with a mid-
engine design, including a slinky aerodynamic
shape and benign handling characteristics.
Table 3 shows the average scores assigned by freelance annotators when comparing answers, together with the corresponding BERTSim scores. The integers in the row
below the row of evaluation scores represent the re-
spective rankings.
Table 3: Evaluation scores by extrinsic evaluation and intrinsic BERTSim metric with rankings, where “Extr” stands for “extrinsic evaluation” and “Intr” for “intrinsic evaluation”.
It is evident that the extrinsic evaluation scores
align well with the BERTSim scores, demonstrating
consistency in ranking. In particular, both evaluations
are in complete agreement for the 2nd, 5th, 6th, and
7th places, with only minor variations in the other
rankings. For instance, MMR is ranked 1st by extrin-
sic evaluation and 3rd by BERTSim, which is quite
close. Similarly, TF-IDF is ranked 2nd by extrinsic
evaluation and 4th by BERTSim. Notably, DPS, kNN,
and TF-IDF all share the 2nd rank in extrinsic evalu-
ation, likely due to the coarseness of human annota-
tion. Since DPS and kNN are essentially the same
measures, they should logically be ranked closer to
each other than to TF-IDF. Therefore, the extrinsic
rank of TF-IDF, while differing slightly from BERT-
Sim, can still be considered reasonably aligned. Over-
all, this suggests a strong correlation between the two
evaluation methods.
4.3 Combining Multiple Retrievers
LangChain supports combining multiple retrievers
into a new retriever. We use the default setting to
return four chunks for each combination. This ap-
proach diversifies the retrieved content from the ref-
erence document, potentially improving overall per-
formance.
Table 4: Evaluation results of various combinations of retrievers and sentences.
As examples, we combine all seven retrievers, de-
noted as A-Seven; three retrievers with the highest
OPI-1 scores: kNN, DPS, and TF-IDF, plus MMR for
its strength in balancing relevance and diversity, de-
noted as A-Four; and two retrievers with the highest
OPI-1 scores: kNN and DPS, denoted as A-Two.
We may also combine the sentences retrieved by
individual retrievers, removing any duplicates, and
use the remaining set of sentences with the corre-
sponding questions to generate answers and logical
relations. Let S-Seven, S-Four, and S-Two denote the
sets of sentences obtained this way by the correspond-
ing retrievers as in A-Seven, A-Four, and A-Two.
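Both combination styles can be sketched as follows (our illustration; we assume LangChain's EnsembleRetriever for the algorithmic combination, since the paper does not name the exact API, and knn_retriever, dps_retriever, and chunks_to_sentences are assumed to be defined elsewhere):

from langchain.retrievers import EnsembleRetriever

# Algorithmic combination (A-Two style): fuse the rankings of the kNN and DPS
# retrievers and keep four chunks overall.
a_two = EnsembleRetriever(retrievers=[knn_retriever, dps_retriever], weights=[0.5, 0.5])
a_two_chunks = a_two.get_relevant_documents(query)[:4]

# Sentence-level combination (S-Two style): run each retriever separately,
# map its chunks back to full sentences, then merge and deduplicate.
def sentence_union(retrievers, query, chunks_to_sentences):
    merged, seen = [], set()
    for r in retrievers:
        for doc in r.get_relevant_documents(query):
            for sent in chunks_to_sentences(doc.page_content):
                if sent not in seen:
                    seen.add(sent)
                    merged.append(sent)
    return merged

s_two_sentences = sentence_union([knn_retriever, dps_retriever], query, chunks_to_sentences)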
The experimental results of both types of combi-
nations are shown in Table 4.
5 ANALYSIS
We first analyze the performance of individual retriev-
ers, followed by examining the combinations of re-
trievers and the sentences retrieved by multiple re-
trievers.
5.1 Individual Retrievers
For each retriever, we first analyze the performance
for each logical relation individually and then assess
the overall performance across all logical relations.
5.1.1 Individual Logical Relation
We use the OPI-1 scores to help identify the strengths
and weaknesses of individual retrievers across the 13
logical relations. For example, as seen in Table 2,
almost all retrievers tend to perform the worst on ad-
versarial reasoning, followed by fuzzy reasoning. For
other logical relations, the performance of retrievers
varies, indicating that certain retrievers may be more
suited to specific types of reasoning tasks while strug-
gling with others. For example, EDI, the worst-performing retriever, which consistently ranks the lowest in both the extrinsic and intrinsic evaluations of answer accuracy (see Table 3), still performs best on deductive reasoning. This suggests that
while EDI may generally be less effective across var-
ious logical relations, it has a particular strength in
handling tasks that involve deductive reasoning. This
example highlights the nuanced performance of re-
trievers, where even a generally weaker retriever can
excel in specific logical tasks. This variability in performance underscores the importance of selecting an appropriate retriever.
5.1.2 Across all Logical Relations
The average OPI-1 scores provide a means to iden-
tify, across all 13 logical relations, which retrievers
are more suitable for specific tasks and which retriev-
ers should be avoided. For example, as shown in Ta-
ble 2, EDI has the lowest and SVM the second-lowest
average OPI-1 scores, indicating they should gener-
ally be avoided. This is likely due to the limitations of
the underlying features used to compute SVM scores
and the coarseness of L2-norms when representing
text chunks as bag-of-words vectors, which may fail
to capture the nuanced relationships required for deep
logical reasoning tasks.
On the other hand, kNN has the highest and DPS
the second-highest average OPI-1 scores, indicating
that these retrievers would be the best choices for an-
swering deep-logic questions. kNN (cosine similar-
ity) and DPS are similar measures, with kNN being a
normalized version of DPS, which explains their com-
parable performance. However, kNN takes slightly
more time to compute than DPS, as DPS is the fastest
among all seven retrievers—dot products are the sim-
plest and quickest to compute compared to the opera-
tions used by other retrievers.
The MMR retriever allows GPT-4o to generate
better answers across all logical relations, as shown
in Table 3. However, it does not perform as well in
producing the correct logical relations. This discrep-
ancy may be attributed to MMR’s focus on balanc-
ing relevance and diversity in retrieved content, which
improves answer quality but doesn’t necessarily align
with capturing accurate logical relations.
BM25 is in general more effective for retrieving
longer documents in a document corpus with the de-
fault parameter values for κ and b. However, to re-
trieve sentences from an article, it has been shown that BM25 should use different parameter values (Zhang et al., 2021). This explains why BM25 is the second worst at generating answers, as shown in Table 3 by both extrinsic and intrinsic evaluations. It is
not clear, however, why it produces a relatively higher
LRCR value.
TF-IDF’s performance falls in the middle range,
which is expected. As a frequency-based approach, it
may struggle to capture deeper semantic information,
but it remains relatively effective because it retains
lexical information, ensuring that important terms are
still emphasized in the retrieval process.
5.2 Performance of Various
Combinations
We first analyze the performance of combinations of
retrievers versus individual retrievers, followed by an
analysis of combining retrievers algorithmically ver-
sus combining sentences retrieved by individual re-
trievers within the combination.
5.2.1 Combinations vs. Individuals
It can be seen from Table 4 that A-Seven outperforms
A-Four, which in turn outperforms A-Two. A similar
ranking is observed with S-Seven, S-Four, and S-Two.
Moreover, both A-Seven and A-Four are substantially
better than the top performer, kNN, when only a sin-
gle retriever is used (see both Tables 2 and 4). A simi-
lar result is observed with S-Seven and S-Four, where
combining more retrieved sentences from different re-
trievers also enhances performance, reinforcing the
benefits of increased diversity in the retrieval process.
These results all confirm the earlier suggestion that
combining more retrievers generally enhances perfor-
mance in both algorithmic and sentence-based com-
binations, supporting the idea that diverse retrieval
methods contribute positively to the overall effective-
ness of the RAG system.
However, we also observe that some combina-
tions of retrievers may actually lead to poorer per-
formance compared to using the individual retrievers
alone. This is evident in the case of A-Two and S-
Two, the algorithmic and sentence-level combinations of kNN and DPS, both of which result in slightly lower average OPI-1 scores than kNN alone. This is probably due
to the fact that kNN and DPS are very similar mea-
sures, and combining them doesn’t significantly in-
crease diversity. Worse, the extra information pro-
vided through their combination seems to have led to
diminishing returns, negating the potential benefits of
combining retrievers to improve performance. This
phenomenon warrants further investigation.
Nevertheless, combining retrievers based on dif-
ferent retrieval methodologies could help increase di-
versity and, consequently, improve overall perfor-
mance. This is evident in the case of A-Seven and
S-Seven, which combine retrievers utilizing diverse
retrieval methods, as well as in A-Four and S-Four,
where MMR—a retrieval method that balances rel-
evance and diversity—complements kNN. By lever-
aging varied retrieval techniques, we can ensure that
a broader range of relevant content is retrieved, po-
tentially leading to greater accuracy and more robust
logical reasoning in the generated answers.
5.2.2 Combining Algorithms vs. Combining
Sentences
We compare the outcomes of combining retrievers at
the algorithm level versus the sentence level. Com-
bining retrievers at the algorithm level is a feature
supported by LangChain, which returns the same
default number of chunks before sentences are ex-
tracted. In contrast, combining retrievers at the
sentence level involves merging sentences retrieved
by individual retrievers, which may include more
sentences than the algorithmic combination, and so
should lead to a slightly better performance. This is
evident when comparing A-Four with S-Four and A-
Two with S-Two (see Table 4).
However, having more sentences may not always
lead to improvement, as it can introduce conflict-
ing information. This is evident when comparing A-
Seven with S-Seven, where S-Seven has a lower av-
erage OPI-1 score than A-Seven. This is likely be-
cause A-Seven has already saturated the useful sen-
tences, while S-Seven introduces additional sentences
that negatively impact the average OPI-1 score.
In summary, these analyses suggest that, when
combining appropriate retrievers, both algorithmic
and sentence-level approaches offer performance im-
provements, with each method providing distinct ad-
vantages in terms of retrieval diversity and the quality
of generated answers. Selecting appropriate retrievers
requires a deeper understanding of the underlying re-
trieval mechanisms, making this an interesting topic
for further investigation.
6 FINAL REMARKS
This paper presents an effective intrinsic evaluation method for assessing the performance of RAG systems on question answering that involves deep logical reasoning.
LangChain supports a wide range of retrievers
and allows users to integrate custom retrievers. Ad-
ditionally, there are numerous large language mod-
els (LLMs) such as the Gemini series (Google,
2024), LlaMA series (Meta, 2024), and Claude se-
ries (Claude AI, 2024), among others, as well as
various retrieval-augmented generation (RAG) tools
like LLAMAINDEX (LlamaIndex, 2024), HayStack
(Deepset, 2024), EmbedChain (EmbedChain, 2024),
and RAGatouille (AnswerDotAI, 2024). Evaluating
the performance of these models and tools, particu-
larly for answering deep-logic questions where iden-
tifying logical relations is essential, represents an in-
triguing direction for future research.
Regularly reporting the findings of such investi-
gations would significantly contribute to the advance-
ment of RAG technologies. Furthermore, we aim to
develop a tool that quantitatively assesses the depth of
logical relations in question-answering systems rela-
tive to the underlying context. This effort would ne-
cessitate the creation of a new dataset that annotates
the depth of each logical relation for every triple con-
sisting of a question, an answer, and a set of reference
sentences.
REFERENCES
AnswerDotAI (2024). Ragatouille. Accessed: 2024-09-06.
Bishop, C. M. (2006). Pattern Recognition and Machine
Learning. Springer.
Carbonell, J. and Goldstein, J. (1998). The use of
mmr, diversity-based reranking for reordering docu-
ments and producing summaries. Proceedings of the
21st annual international ACM SIGIR conference on
Research and development in information retrieval,
pages 335–336.
Claude AI (2024). Claude. Accessed: 2024-09-06.
Cortes, C. and Vapnik, V. (1995). Support-vector networks.
Machine Learning, 20(3):273–297.
Deepset (2024). Haystack - deepset. Accessed: 2024-09-
06.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
D.H., C. (2024). Rag dataset 12000.
EmbedChain (2024). Embedchain. Accessed: 2024-09-06.
Google (2024). Gemini - google. Accessed: 2024-09-06.
LangChain (2024). Langchain official website.
Lewis, D. D. and Gale, W. A. (1994). A study of f-measure
in information retrieval. In Proceedings of the Third
Annual Symposium on Document Analysis and Infor-
mation Retrieval (SDAIR’94), pages 187–199. Cite-
seer.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. arXiv preprint arXiv:2005.11401.
LlamaIndex (2024). Retrieval-augmented generation (rag)
- llamaindex. Accessed: 2024-09-06.
Meta (2024). Llama - meta. Accessed: 2024-09-06.
Robertson, S. and Zaragoza, H. (2009). The probabilistic
relevance framework: Bm25 and beyond. Founda-
tions and Trends in Information Retrieval, 3(4):333–
389.
Sammut, C. and Webb, G. I. (2011). TF–IDF. Springer.
Zhang, H., Zhou, Y., and Wang, J. (2021). Contextual net-
works and unsupervised ranking of sentences. In Pro-
ceedings of the 33rd IEEE International Conference
on Tools with Artificial Intelligence (ICTAI 2021).
Zhou, Y. and Wang, J. (2023). Content significance distribu-
tions of sub-text blocks in articles and its application
to article-organization assessment. In Proceedings of
the 15th Knowledge Discovery and Information Re-
trieval (KDIR 2023).