and the coarseness of L2-norms when representing text chunks as bag-of-words vectors, which may fail to capture the nuanced relationships required for deep logical reasoning tasks.
On the other hand, kNN has the highest and DPS the second-highest average OPI-1 scores, indicating that these retrievers are the best choices for answering deep-logic questions. kNN (cosine similarity) and DPS are similar measures, with kNN being a normalized version of DPS, which explains their comparable performance. However, kNN takes slightly more time to compute than DPS; indeed, DPS is the fastest of all seven retrievers, since dot products are the simplest and quickest operations compared with those used by the other retrievers.
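To make this relationship concrete, the following minimal sketch shows that kNN's cosine similarity is simply DPS applied after L2-normalization; the two extra norm computations account for its slightly higher cost. The 4-dimensional embeddings are hypothetical placeholders, not vectors from our experiments.

```python
import numpy as np

def dps(q: np.ndarray, d: np.ndarray) -> float:
    """Dot-product similarity: one multiply-accumulate pass over the vectors."""
    return float(np.dot(q, d))

def knn_cosine(q: np.ndarray, d: np.ndarray) -> float:
    """Cosine similarity: DPS on L2-normalized vectors, which adds two
    norm computations and a division on top of the dot product."""
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

# Hypothetical 4-dimensional embeddings (real embeddings are much larger).
q = np.array([0.3, 0.1, 0.9, 0.4])
d = np.array([0.2, 0.0, 1.1, 0.5])
print(dps(q, d), knn_cosine(q, d))
```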
The MMR retriever allows GPT-4o to generate better answers across all logical relations, as shown in Table 3. However, it does not perform as well in producing the correct logical relations. This discrepancy may be attributed to MMR's focus on balancing relevance and diversity in the retrieved content, which improves answer quality but does not necessarily align with capturing accurate logical relations.
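The relevance-diversity trade-off underlying MMR can be sketched as follows; the weight lam and the cosine similarity function are illustrative choices, not necessarily the exact configuration used in our experiments.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(query_vec, doc_vecs, k=3, lam=0.5):
    """Greedily pick the document most relevant to the query (first term)
    while least similar to documents already selected (second term)."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda i: lam * cosine(query_vec, doc_vecs[i])
            - (1 - lam) * max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            ),
        )
        selected.append(best)
        candidates.remove(best)
    return selected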
BM25 is in general more effective for retrieving longer documents from a document corpus with the default values of its parameters k1 and b. However, it has been shown that retrieving sentences from an article calls for different parameter values (Zhang et al., 2021). This explains why BM25 is the second worst at generating answers, as shown in Table 3, under both extrinsic and intrinsic evaluations. It is not clear, however, why it produces a relatively higher LRCR value.
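For reference, a minimal Okapi BM25 scorer is sketched below, making the roles of k1 (term-frequency saturation) and b (document-length normalization) explicit; the defaults k1 = 1.5 and b = 0.75 are commonly used values, not necessarily those of any particular implementation, and the tiny corpus is purely illustrative.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document against a query.
    k1 controls term-frequency saturation; b controls how strongly the
    score is normalized by document length relative to the corpus average."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

# Tiny illustrative corpus of tokenized "sentences".
corpus = [["deep", "logic", "reasoning"], ["retrieval", "augmented", "generation"]]
print(bm25_score(["logic"], corpus[0], corpus))
```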
TF-IDF’s performance falls in the middle range,
which is expected. As a frequency-based approach, it
may struggle to capture deeper semantic information,
but it remains relatively effective because it retains
lexical information, ensuring that important terms are
still emphasized in the retrieval process.
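A minimal TF-IDF retriever along these lines can be sketched with scikit-learn; the ranking is driven entirely by weighted lexical overlap, which illustrates both the emphasis on important terms and the absence of semantic modeling. The function name and parameters are illustrative, not taken from our implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_retrieve(query, sentences, k=3):
    """Rank sentences by cosine similarity of their TF-IDF vectors:
    weighted lexical overlap, with no semantic modeling of the text."""
    vectorizer = TfidfVectorizer()
    sentence_vecs = vectorizer.fit_transform(sentences)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, sentence_vecs).ravel()
    return [sentences[i] for i in scores.argsort()[::-1][:k]]
```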
5.2 Performance of Various Combinations
We first analyze the performance of combinations of retrievers versus individual retrievers, followed by an analysis of combining retrievers algorithmically versus combining sentences retrieved by the individual retrievers within the combination.
5.2.1 Combinations vs. Individuals
It can be seen from Table 4 that A-Seven outperforms A-Four, which in turn outperforms A-Two. A similar ranking is observed with S-Seven, S-Four, and S-Two. Moreover, both A-Seven and A-Four are substantially better than kNN, the top performer among individual retrievers (see Tables 2 and 4). A similar result is observed with S-Seven and S-Four, where combining more retrieved sentences from different retrievers also enhances performance, reinforcing the benefits of increased diversity in the retrieval process. These results confirm the earlier suggestion that combining more retrievers generally enhances performance in both algorithmic and sentence-based combinations, supporting the idea that diverse retrieval methods contribute positively to the overall effectiveness of the RAG system.
However, we also observe that some combinations of retrievers may actually lead to poorer performance than using the individual retrievers alone. This is evident in the case of A-Two and S-Two, the algorithmic and sentence-level combinations of kNN and DPS, both of which result in slightly lower average OPI-1 scores than kNN alone. This is probably because kNN and DPS are very similar measures, so combining them does not significantly increase diversity. Worse, the extra information provided by their combination appears to yield diminishing returns, negating the potential benefits of combining retrievers. This phenomenon warrants further investigation.
Nevertheless, combining retrievers based on different retrieval methodologies could help increase diversity and, consequently, improve overall performance. This is evident in the case of A-Seven and S-Seven, which combine retrievers utilizing diverse retrieval methods, as well as in A-Four and S-Four, where MMR, a retrieval method that balances relevance and diversity, complements kNN. By leveraging varied retrieval techniques, we can ensure that a broader range of relevant content is retrieved, potentially leading to greater accuracy and more robust logical reasoning in the generated answers.
5.2.2 Combining Algorithms vs. Combining Sentences
We compare the outcomes of combining retrievers at the algorithm level versus the sentence level. Combining retrievers at the algorithm level is a feature supported by LangChain, which returns the same default number of chunks before sentences are extracted. In contrast, combining retrievers at the sentence level merges the sentences retrieved by the individual retrievers, which may include more sentences than the algorithmic combination and should therefore lead to slightly better performance. This is evident when comparing A-Four with S-Four and A-Two with S-Two (see Table 4).
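The difference between the two strategies can be sketched as follows, assuming a recent LangChain version; EnsembleRetriever is LangChain's built-in mechanism for algorithm-level combination, while the sentence-level union is a deliberately simple illustrative merge. The retriever objects and the question variable are hypothetical placeholders.

```python
from langchain.retrievers import EnsembleRetriever

# Hypothetical placeholders: knn_retriever and mmr_retriever are assumed
# to be pre-built LangChain retrievers, and `question` is the query string.

# Algorithm-level combination: LangChain fuses the retrievers' rankings
# internally and returns a single fixed-size list of chunks.
ensemble = EnsembleRetriever(
    retrievers=[knn_retriever, mmr_retriever], weights=[0.5, 0.5]
)
algo_chunks = ensemble.invoke(question)

# Sentence-level combination: collect each retriever's chunks separately,
# split them into sentences, and take the deduplicated union, which can
# contain more sentences than the algorithm-level result.
def sentence_union(retrievers, question):
    seen, merged = set(), []
    for retriever in retrievers:
        for doc in retriever.invoke(question):
            for sentence in doc.page_content.split(". "):  # naive splitter
                if sentence not in seen:
                    seen.add(sentence)
                    merged.append(sentence)
    return merged

sentence_chunks = sentence_union([knn_retriever, mmr_retriever], question)
```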