Evaluation of Information Retrieval Models and Query Performance

Predictors for Amharic Adhoc Task

Tilahun Yeshambel

, Josiane Mothe

and Yaregal Assabie

IT Doctorial Program, Addis Ababa University, Addis Ababa, Ethiopia

INSPE, Univ. de Toulouse, IRIT, UMR5505 CNRS, Toulouse, France

Department of Computer Science, Addis Ababa University, Addis Ababa, Ethiopia

Keywords: Query Performance Prediction, Feature Analysis, Amharic, Adhoc Search, IR Models, Correlation.

Abstract: Query performance prediction (QPP) is the task of evaluating the quality of retrieval results of a query

within the context of a retrieval model. Although several research activities have been carried out on QPP

for many languages, query performance predictors have not yet been studied for the Amharic adhoc information

retrieval (IR) task. In this paper, we present the effect of various IR models on Amharic queries, and make

some analysis on the computed features for QPP methods from both Indri and Terrier indexes based on the

Amharic Adhoc Information Retrieval Test Collection (2AIRTC). We conducted various experiments to

attest the quality of Amharic queries and performance of IR models on 2AIRTC, which is a TREC-like test

collection. The correlation degree between predictors is used to measure the dependence between various

query performance predictors, or between a predictor and a retrieval score. Our findings show that the

Jelinek-Mercer model outperformed the BM25 and Dirichlet models. The findings also indicate that the correlation

matrices between the query-IDF predictors and the evaluation measures show very low Pearson correlation

coefficient values.

1 INTRODUCTION

QPP is an important task in IR and is the concern of

many researchers in the IR community. The retrieval

performances vary widely across different retrieval

systems for a single query. The retrieval

performance of an IR system is even inconsistent

over different IR test collections. For example,

adhoc retrieval effectiveness can vary across queries

for different retrieval methods. Adhoc retrieval is the

standard information retrieval task that aims to

provide relevant documents for a given query within

a given document corpus in ranked order based on

relevance to a query (Teufel, 2007). As a result of

ambiguities or errors in user queries, the

performances of many IR systems vary while

responding to different users’ queries (Shtok et al.,

2012). Query performance prediction deals

with the retrieval performance of an IR system for a

given query. The well-known QPP approaches are

pre-retrieval and post-retrieval strategies (Carmel

and Yom-Tov, 2010). The post-retrieval method

evaluates the difficulty of a query by analyzing the

top-ranked retrieval results in response to the

query whereas the pre-retrieval approach analyzes a

query and corpus prior to retrieval time and is based

on linguistic and statistical features of a query and

documents (Shtok et al., 2012). Linguistic

information (Mothe and Tanguy, 2005) and

statistical features such as term frequency-inverse

document frequency (TF-IDF), IDF and standard

deviation values (Cronen-Townsend et al., 2002) are

examples of pre-retrieval prediction methods

whereas query clarity (Cronen-Townsend et al.,

2002; Datta et al., 2022), robustness (Zhou and

Croft, 2006), normalized query commitment and

weighted information gain (Datta et al., 2022), and

retrieval scores (Tao and Wu, 2014) are well-known

post-retrieval prediction techniques. Unlike pre-retrieval

predictors, post-retrieval predictors are

computationally expensive but generally outperform

pre-retrieval predictors. Given a query and an IR system,

a QPP method computes a real-valued score that

indicates the effectiveness of a system for a given

query. The quality of a QPP method is usually

determined by evaluating the correlation between its

predicted effectiveness scores and the values of

some standard evaluation metric for a set of queries

Yeshambel, T., Mothe, J. and Assabie, Y.

Evaluation of Information Retrieval Models and Query Performance Predictors for Amharic Adhoc Task.

DOI: 10.5220/0012224000003598

In Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2023) - Volume 1: KDIR, pages 98-108

ISBN: 978-989-758-671-2; ISSN: 2184-3228

(Ganguly et al., 2022). The performance of a

predictor is usually measured by computing the

correlation coefficient between the output of a

predictor and the actual performance of queries on a

retrieval system. Pearson and Spearman correlation

coefficients are the two widely used statistical

measures.
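The evaluation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this study: the function names and the per-query values are hypothetical, and the Spearman coefficient is obtained by applying Pearson to average ranks.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance normalised by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Hypothetical predictor outputs and average precision values for five queries.
predicted = [0.2, 0.5, 0.1, 0.9, 0.4]
average_precision = [0.30, 0.45, 0.10, 0.80, 0.50]
print(pearson(predicted, average_precision))
print(spearman(predicted, average_precision))
```

Pearson is sensitive to the linear relationship between the raw values, whereas Spearman depends only on how the two variables order the queries.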

Effective QPP enables a retrieval system to

decide an appropriate action to be taken at the next

turn. Estimating the performance of queries

accurately is useful for various applications such as

managing resources, optimizing queries, managing

user experience, detecting queries with no relevant

content and automatic spell correction, performing

selective query expansion, selecting an effective

ranking algorithm for a query, estimating the

retrieval quality for the top-ranked results and

merging results in a distributed IR system (Akdere et

al., 2012). For example, if the retrieval effectiveness

of a query is good, then the query’s performance can

be further improved by an affirmative action such as

automatic query expansion; otherwise, the system

may ask the user for a refinement of the query.

QPP techniques have been explored and

analyzed on standard test collections for many

information retrieval systems (Pérez-Iglesias and

Araujo, 2010; Zendel et al., 2023). However, in spite

of recent progress made on the development of

Amharic IR (Yeshambel et al., 2022; Yeshambel et

al., 2023), no studies have been carried out to indicate

the retrieval effectiveness of various IR techniques

for the language. Therefore, the aim of this study is

to evaluate:

(i) the quality of some retrieval models on

an Amharic topic set;

(ii) the correlation between various QPP features;

and

(iii) the performance of some QPPs for estimating

the quality of retrieval results.

We take the first initiative to examine the impact

of some QPP methods on retrieval effectiveness

in an Amharic adhoc retrieval setting. Some prediction

methods considered as state-of-the-art have been

implemented in this study. We explore the

generalizability of the results from QPP methods for

Amharic adhoc search for estimating the retrieval

quality of an IR model based on top-rank lists. We

also present an analysis of the computed features

and performance measures on the 2AIRTC Amharic

collection created by Yeshambel et al. (2020c). The

correlation between various query performance

predictors and each predictor with the retrieval score

is also investigated using mean average precision

(MAP) and normalized discounted cumulative gain

(NDCG) metrics. We performed QPP by considering

query and document features. We conducted various

experiments with two types of features: pre-retrieval

query performance features, and learning-to-rank (Letor) features.

The rest of this paper is organized as follows.

Section 2 introduces the morphological features of

Amharic and the progress made on the

construction of resources for Amharic IR. Section 3

discusses related works on QPP methods. In Section

4, we present the experimental setup. Experimental

results along with detailed analysis of the features

are discussed in Section 5. Finally, we make our

conclusion in Section 6.

2 AMHARIC LANGUAGE

Amharic is the official language of the government

of Ethiopia. It is used as a mother tongue by

a substantial segment of the population of the country

and serves as the lingua franca of various

communities speaking other languages. It is the

second most spoken Semitic language in the world

after Arabic (Gambäck, 2012). The language uses its

own script for writing. The Amharic alphabet has 34

characters, each of which has 7 orders representing a

combination of a consonant with one of 7 vowels (Argaw and

Asker, 2006).

Owing to its Semitic characteristics, Amharic is

a morphologically complex language. Word formation

involves affixation, reduplication, and Semitic stem

interdigitation (Yimam, 2001). Words can be formed

either from stems or roots by adding affixes on

stems or by modifying the root itself internally.

Stems take different types of prefixes and suffixes to

form inflectional and derivational words. One of the

unique features of Amharic morphology is its

root-pattern morphology. For example,

Amharic verbs rely heavily on the arrangement of

consonants and vowels in order to encode different

morpho-syntactic properties. The root of an Amharic

verb is represented by consonants which carry the

semantic core of the verb. It is possible to create

many words by attaching different affixes to a stem.

For example, according to Abate and Assabie

(2014), it is possible to generate thousands of words

from a verbal root through a complex morphological

process. The possibility of having multiple

affixations on Amharic words may lead to

ambiguity. Thus, contextual analysis is important in

Amharic in order to understand the exact meaning of

some words. The morphological complexity of

Amharic along with its implication in Amharic IR is

discussed in detail by Yeshambel et al. (2022) and

Yeshambel et al. (2023).

Although many Amharic documents are

available in digital form, the language is still

regarded as under-resourced as only a few are usable.

Among the Amharic resources worthy of mention

are NLP corpora (Woldeyohannis and Meshesha,

2020; Yeshambel et al., 2021), Amharic IR test

collection (Yeshambel et al., 2020c), pre-trained

language models (Eshetu et al., 2020; Muhie et al.,

2021; Yeshambel et al., 2023), stopwords

(Alemayehu and Willett, 2002; Yeshambel et al.,

2020b), and word analyzers (Alemayehu and

Willett, 2002; Gasser, 2011; Abate and Assabie,

2014). Evaluations made on the aforementioned

resources indicate that most of them are not in

a usable form for Amharic IR tasks (Yeshambel et al.,

2020a). This shows that the status of research and

development on Amharic IR is far behind in

comparison to technologically advanced languages.

3 RELATED WORK

Zhao et al. (2008) proposed pre-retrieval predictors

based on two sources of information: the similarity

between a query and the underlying collection; and

the variability with which query terms occur in

documents. These predictors used collection and

term distribution statistics. The proposed predictors

were evaluated by investigating the correlation

between the predictors and the actual performance of

a retrieval system on each query. The finding

indicates that the proposed predictors provide more

consistent performance than existing pre-retrieval

methods across informational and navigational

search tasks on the Newswire and Web datasets.

Cronen-Townsend et al. (2002) developed a QPP

method by calculating the relative entropy between a

query language model and the corresponding

collection language model. Various experiments

were conducted to evaluate the correlation between

clarity score and average precision score for various

TREC Adhoc Track test collections and the title

section of TREC topics using the language

modelling. The obtained clarity score was used to

measure the coherence of language usage in

documents and its models to generate a query. The

finding indicates that clarity scores can measure the

ambiguity of a query with respect to a corpus and that they

correlate positively with average precision in a

variety of TREC datasets.
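For reference, the clarity score described above is commonly written as the Kullback-Leibler divergence between the query language model and the collection language model. The notation used here (vocabulary $V$, query model $P(w \mid Q)$ and collection model $P_{\mathrm{coll}}(w)$) is assumed for exposition:

```latex
\mathrm{Clarity}(Q) \;=\; \sum_{w \in V} P(w \mid Q)\,\log_2 \frac{P(w \mid Q)}{P_{\mathrm{coll}}(w)}
```

A query whose language model is close to the collection model yields a score near zero, which is interpreted as an ambiguous query, whereas a focused query yields a larger score.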

He and Ounis (2006) explored the effect of query

length, IDF, query scope, query clarity, standard

deviation of the IDF of the terms in a query, average

inverse collection TF (AvICTF) and the ratio of

maximum IDF value (IDF_max) to minimum IDF

value (IDF_min) of a query. The correlations of the

proposed predictors with query performance are

evaluated on various TREC collections using title,

description and narrative fields of the topic. Among

the six proposed predictors, a simplified clarity score

(SCS) and the AvICTF have the strongest

correlation with average precision for short queries.

The standard deviation of IDF, SCS and AvICTF are

the most correlated predictors with average precision

for normal and long queries. On the contrary, the use

of two statistically diverse document weighting

models does not have an impact on the overall

effectiveness of the proposed predictors.

Zendel et al. (2023) proposed entropy to

estimate the retrieval effectiveness of neural

re-ranking models. The idea is to measure the entropy

of the retrieval scores for a re-ranking model if there

is no training data or corpus related statistics. The

proposed QPP method was tested using Query

Likelihood and NeuralRanker on Robust04 adhoc

retrieval TREC collection. The NeuralRanker

slightly outperformed the query likelihood model

using the entropy predictor. For query likelihood, the

performance of entropy is similar to the standard

deviation. However, entropy outperformed standard deviation in the

case of the NeuralRanker retrieval approach.

Pérez-Iglesias and Araujo (2010) proposed QPP

evaluation frameworks to avoid some limitations of

correlation coefficients in evaluating QPP methods.

The two proposed evaluation frameworks are: (i) to

evaluate the performance of the method to detect

‘easy’ or ‘hard’ queries; and (ii) to make explicit the

accuracy of the method for different types of topics.

The proposed evaluation frameworks were tested

against a set of different prediction methods on a

standard TREC collection using the query likelihood

language modelling with a Dirichlet prior smoothing

parameter. The proposed predictor is used to classify

queries as hard, average and easy. The most accurate

prediction methods in terms of grouping predictions

are SCS and ScoreDesv. The Pearson and Kendall

correlation coefficients suggested an equivalent

performance for SCS and AvICTF methods.

While we see that QPP has been a subject of

research and development for various IR systems, to

the best of our knowledge, there is no such attempt for

the Amharic adhoc IR task. Thus, our study is aimed at

addressing this issue using the 2AIRTC test

collection.

KDIR 2023 - 15th International Conference on Knowledge Discovery and Information Retrieval

4 EXPERIMENTAL SETUP

In this study, we conducted various experiments to

investigate the effect of QPP on Amharic IR, the

performance of some widely used QPP methods and

the correlation between them. The details of the

experimental setup are presented as follows.

4.1 The Amharic IR System

We based our experiments on the Amharic IR

system developed by Yeshambel et al. (2020d). In

the system, both documents and queries pass

through similar pre-processing, morphological

analysis and stopword removal tasks. The pre-

processing task involves tokenization, character

normalization and removal of punctuation marks.

Morphological analysis was used to extract stems

and roots from surface forms of words. Stopwords

were removed from queries and documents using the

morpheme-based stopword list constructed by

Yeshambel et al. (2020b). Document indexing was

performed based on morphologically analyzed

words. We created three indexes using the retrieval

models BM25, language modelling with Dirichlet

smoothing (LMDir) and language modelling with

Jelinek-Mercer smoothing (LMJM).

The system performed matching between index

terms and morphologically analyzed query terms.

Finally, ranking was made based on the results

obtained from the matching task. Figure 1 shows the

architecture of the Amharic IR system.
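The pre-processing and stopword removal steps can be illustrated with the following sketch. It is not the implementation of the system described above: the homophone mapping is an assumed, partial normalisation table built from Unicode code points, the `preprocess` function and its stopword argument are hypothetical, and morphological analysis is omitted entirely.

```python
import re

# Homophone character families as (variant base, canonical base) code points.
# This table is an illustrative assumption, not the system's normalisation rules.
_HOMOPHONES = [
    (0x1210, 0x1200),  # HHA family -> HA family
    (0x1280, 0x1200),  # XA family -> HA family
    (0x1220, 0x1230),  # SZA family -> SA family
    (0x12D0, 0x12A0),  # PHARYNGEAL A family -> GLOTTAL A family
    (0x1340, 0x1338),  # TZA family -> TSA family
]
# The seven orders of each family occupy consecutive code points.
NORMALIZE = {src + i: dst + i for src, dst in _HOMOPHONES for i in range(7)}

# Ethiopic punctuation (U+1360 to U+1368) plus some ASCII punctuation.
PUNCT = re.compile("[\u1360-\u1368.,;:!?()'\"-]")

def preprocess(text, stopwords=frozenset()):
    """Normalise characters, strip punctuation, tokenise and drop stopwords."""
    text = text.translate(NORMALIZE)   # character normalisation
    text = PUNCT.sub(" ", text)        # punctuation removal
    tokens = text.split()              # whitespace tokenisation
    # Stopwords are matched after normalisation, so the list must be normalised too.
    return [t for t in tokens if t not in stopwords]

# Two words followed by the Ethiopic full stop (U+1362).
print(preprocess("\u1210\u1208 \u1230\u120b\u121d\u1362"))
```

Applying the same function to both documents and queries keeps index terms and query terms comparable, which is the property the matching step relies on.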

4.2 Test Collection

The 2AIRTC (Yeshambel et al., 2020c) is the only

publicly accessible and usable Amharic IR test

collection. We used 6,069 documents and 240

queries from this test collection. Relevance judgments are

binary, with 1 for relevant and 0 for irrelevant.

This collection has been used in some Amharic IR

studies. We used the titles of the topic set.

4.3 Features and Predictors

Various query-document, query-IDF and document

features have been used in many IR studies for QPP.

In this study, we have tested some pre-retrieval and

post-retrieval predictors which have performed well

in past evaluations. We used IDF together with TF since TF-IDF

has been used to rank retrieved documents based on

relevance to a given query and enables the retrieval

system to select documents relevant to the query

based on the TF-IDF score. Moreover, IDF statistics

have been used in many studies. Ranking is one of

the main tasks of adhoc retrieval and can be done in

various ways. In our case, we did it using Letor

features which are associated with query-document

pairs. The details of query-IDF, query-document

Letor features and their variants are presented in

Table 1.

Table 1: Indri query-IDF and Terrier query-document Letor features.

Feature | Variant
Query-IDF | IDF_Q1, IDF_Q3, IDF_mean, IDF_median, IDF_std, IDF_std2, IDF_sum, IDF_min and IDF_max
Query-document and document Letor | IFB2, In_expB2, In_expC2, InB2, InL2, Js_KLs, LemurTF_IDF, LGD, MDL2, ML2, PL2, TF and TF-IDF
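The query-IDF variants listed in Table 1 are summary statistics over the IDF values of the terms in a query. The sketch below shows one way to compute them under stated assumptions: IDF is taken as log(N/df), which may differ from the formula used by Indri, the quartile convention is one of several possible, and the document frequencies in the example are hypothetical. Only the collection size of 6,069 documents comes from this study.

```python
import math
import statistics

def query_idf_features(doc_freqs, num_docs):
    """Summarise the IDF values of a query's terms.

    doc_freqs: document frequency of each query term (all greater than 0).
    IDF is log(N / df) here; the exact variant is an assumption.
    """
    idfs = sorted(math.log(num_docs / df) for df in doc_freqs)
    if len(idfs) > 1:
        q1, _, q3 = statistics.quantiles(idfs, n=4, method="inclusive")
    else:
        q1 = q3 = idfs[0]  # quartiles collapse for a single-term query
    std = statistics.pstdev(idfs)
    return {
        "IDF_min": idfs[0],
        "IDF_Q1": q1,
        "IDF_median": statistics.median(idfs),
        "IDF_Q3": q3,
        "IDF_max": idfs[-1],
        "IDF_mean": statistics.fmean(idfs),
        "IDF_sum": math.fsum(idfs),
        "IDF_std": std,
        "IDF_std2": std ** 2,  # variance
    }

# Hypothetical document frequencies for a four-term query.
features = query_idf_features([12, 150, 900, 40], num_docs=6069)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Because all nine variants are derived from the same handful of IDF values, strong correlations between some of them (for example between the minimum and the first quartile) are to be expected for short queries.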

4.4 Retrieval Models

The analysis and performance measures were

performed using both the Indri and Terrier retrieval

systems. Indri uses BM25 and language model

methods. Terrier is an efficient, flexible, extensible,

effective and interactive IR platform. We used

the Terrier platform to compute Letor features.

IR models are one of the main components of

QPP. We employed three retrieval models: Okapi

BM25, language modelling with Dirichlet

smoothing and language modelling with Jelinek-

Mercer smoothing. The values of k1 and b in BM25

were set to 0.7 and 0.3, respectively. We used λ=0.6

for LMJM, and the value of the smoothing

parameter µ for LMDir was set to 1000.
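To make the role of these parameters concrete, the following sketch gives textbook per-term scoring functions for the three models with the parameter values reported above. It is not the Indri or Terrier implementation: the BM25 IDF variant and the convention that λ weights the collection model are assumptions, and the term statistics in the example are hypothetical.

```python
import math

K1, B = 0.7, 0.3   # BM25 parameters reported above
LAMBDA = 0.6       # Jelinek-Mercer smoothing parameter reported above
MU = 1000          # Dirichlet smoothing parameter reported above

def bm25_term(tf, df, doc_len, avg_doc_len, num_docs, k1=K1, b=B):
    """BM25 contribution of one query term (the IDF variant is an assumption)."""
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

def lm_jm_term(tf, doc_len, coll_prob, lam=LAMBDA):
    """Log-likelihood of one term with Jelinek-Mercer smoothing.

    Assumes lam is the weight of the collection model.
    """
    return math.log((1 - lam) * tf / doc_len + lam * coll_prob)

def lm_dirichlet_term(tf, doc_len, coll_prob, mu=MU):
    """Log-likelihood of one term with Dirichlet prior smoothing."""
    return math.log((tf + mu * coll_prob) / (doc_len + mu))

# Hypothetical statistics for one query term in one document.
print(bm25_term(tf=3, df=40, doc_len=250, avg_doc_len=300, num_docs=6069))
print(lm_jm_term(tf=3, doc_len=250, coll_prob=0.0005))
print(lm_dirichlet_term(tf=3, doc_len=250, coll_prob=0.0005))
```

In each case the score of a document for a query is the sum of the per-term values over the query terms. A low b reduces the effect of document length in BM25, and µ controls how strongly short documents are pulled towards the collection model in LMDir.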

4.5 Evaluation Metrics

We measured the ground truth effectiveness of the

system using the trec_eval toolkit to compute precision,

MAP, recall, NDCG, relative precision, set-based

measures, success and number-based metrics. The

prediction quality of QPP features was measured by

Pearson correlation between the ground truth

effectiveness values and the prediction values for the

queries. Pearson measures the strength of the linear

relationship between two variables. The number of

top-retrieved documents in all the pre-retrieval

predictors was selected from the top {5, 10, 15, 20,

30, 100, 200, 500, 1000} retrieval results.
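As an illustration of the ground truth measures, the sketch below computes average precision and NDCG for a single ranked list with binary relevance, as used in 2AIRTC. It is a simplified stand-in for trec_eval, not a replacement for it; the function names and the example ranking are hypothetical. MAP is the mean of the per-query average precision values.

```python
import math

def average_precision(ranked_rel, num_relevant):
    """Average precision for one query.

    ranked_rel: 0/1 relevance of each retrieved document, in rank order.
    num_relevant: number of relevant documents for the query in the judgments.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / num_relevant if num_relevant else 0.0

def ndcg(ranked_rel, num_relevant):
    """NDCG with binary gains and a log2 rank discount."""
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(ranked_rel, start=1))
    ideal = sum(1 / math.log2(rank + 1) for rank in range(1, num_relevant + 1))
    return dcg / ideal if ideal else 0.0

# Hypothetical ranking: relevant documents at ranks 1 and 3, three relevant overall.
ranking = [1, 0, 1, 0, 0]
print(average_precision(ranking, num_relevant=3))
print(ndcg(ranking, num_relevant=3))
```

These per-query values are the quantities against which each predictor's output is correlated.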

Figure 1: Architecture of Amharic IR system (Yeshambel et al., 2020d).

Table 2: Retrieval effectiveness of BM25, Dirichlet and Jelinek-Mercer models.

Model | P@5 | P@10 | P@15 | MAP | Rprec | bpref | recip_rank | NDCG | RNDCG | 11pt_avg
BM25 | 0.60 | 0.54 | 0.49 | 0.48 | 0.50 | 0.48 | 0.70 | 0.64 | 0.59 | 0.50
Dirichlet | 0.65 | 0.59 | 0.53 | 0.56 | 0.54 | 0.54 | 0.78 | 0.72 | 0.65 | 0.58
Jelinek-Mercer | 0.70 | 0.61 | 0.55 | 0.60 | 0.58 | 0.58 | 0.83 | 0.75 | 0.69 | 0.61

5 RESULT AND DISCUSSION

In this section, the retrieval performance of various

IR models, analysis on the correlation between

different query performance predictors and

comparisons between some pre-retrieval predictors

on Amharic IR test collections are presented.

5.1 Effectiveness of IR Models

The retrieval performances of BM25, Dirichlet and

Jelinek-Mercer retrieval models are measured by

issuing queries against the 2AIRTC. The overall

retrieval performances of the models are presented

in Table 2 where the best performances are depicted

in bold font. As shown in Table 2, the three retrieval

models differ noticeably from one another in terms of

retrieval results. It can be seen that the Jelinek-Mercer

model outperformed the other two models. The

correlation between the three retrieval models is

computed and presented in Figure 2. The minimum

and maximum extreme features of the three models

are presented in Figure 3. It can be seen from the

figures that the Jelinek-Mercer and Dirichlet features

have similar data distribution, with a negative skew

and few outliers having small values. They are also

strongly correlated with Pearson correlation

coefficient (PCC) of 0.95. The lowest correlation is

obtained between BM25 and Dirichlet models.

Concerning retrieval effectiveness, all the three

models have positive correlations on the Amharic IR

test collection.

5.2 Correlation Between Indri

Query-IDF Features

To examine the relation among IDF features, the

correlation between some query-IDF features was

computed from the Indri index. Figure 4 shows PCC

between IDF feature variants. Since the query-IDF

features were computed on queries, the data count of

query-IDF features is smaller than the query-

document features that are presented above.

Figure 2: Correlation of the three retrieval models.

Figure 3: Boxplots of min and max extremes for the three retrieval models.

Figure 4: Correlation between query-IDF QPP predictors.

The data counts of query-document and query-IDF features are 2,334 and 240, respectively. IDF_Q1 has weak negative correlations with the IDF_std2, IDF_std and IDF_median predictors. Similarly, IDF_min is weakly negatively correlated with IDF_median, IDF_std and IDF_std2. The correlation between every other pair of predictors is positive. IDF_sum has the widest data distribution, followed by the variance (IDF_std2). Most of the Indri query-IDF features have outliers with large values. Some of the strongly correlated features are:

• IDF_mean and IDF_Q3 with PCC of 0.82.

• IDF_std and IDF_std2 with PCC of 0.95.

• IDF_min and IDF_Q1 with PCC of 0.98.

• IDF_max and IDF_Q3 with PCC of 0.87.

For better understanding of the features, we

plotted score extremes of Indri query-IDF features.

We observed that IDF_median, IDF_std and

IDF_std2 have positive values.

5.3 Correlation Between Letor

Features

For Letor features, a subset of documents was

selected since not all the documents were needed

to create the training set. Then, feature extraction

was performed to represent each query-document

pair with the features. To investigate the

correlation between Letor predictors, we computed

query-document and document Letor features from

the Terrier index. In this case, some features are

poorly correlated while the others are strongly

correlated. The correlations between the more weakly correlated features are shown in Figure 5, where PCCs range from 0.63 to 0.87. The Letor predictors Ml2, Mdl2, Inb2, parameter-free hypergeometric model with Popper’s normalization (Dph) and divergence from independence model based on standardization (Dfiz) are relatively strongly correlated and have similar data distributions, as shown in Figure 6, with PCCs ranging from 0.90 to 1.00. As shown in Figures 6 and 7, the data distributions are nearly the same for each feature: all are positively skewed and have mostly large outlier values (with some small-value outliers for the Hiemstra feature). They are strongly correlated with PCCs ranging from 0.96 to 1.00.

Figure 5: Letor correlated features with different data distribution.

Figure 6: Strongly correlated Letor features with similar data distribution.

Figure 7: Correlation between Ml2, Mdl2, Inb2, Dph and Dfiz features.

Figure 8: Correlation of TF, inb2, dhl3, dl and dfic features.

The correlations between TF, Inb2, Dlh13, Dl and

divergence from independence model based on Chi-

square statistics (dfic) features are also computed

and shown in Figure 8 where the data distributions

are nearly the same for each feature that is positively

skewed, and they have more of large outlier values

with a few outliers having small values for the

document length (DL), parameter-free weighting

model (Dlh) and Dlh13 features. They are strongly

correlated with PCCs ranging from 0.88 to 0.99.

Moreover, we examined the correlations between

TF, Inb2, Dlh13, Dlh, Dl, Dfic, LemurTF-IDF and

BM25 divergence from randomness (Dfr) features.

Experimental results show that data distributions are

nearly the same for each feature. TF, Inb2 and Dfic

have more of large outlier values whereas the Dl,

Dlh and Dlh13 features have few outliers having

small values. All of these features are strongly

correlated with PCCs ranging from 0.88 to 0.99.

Using box plots of Terrier query-document and

document features, we observed that the query-

document and document features have the same

trend when considering their different variants.

5.4 Performance of Selected Predictors

The correlations between some selected QPP methods and the retrieval scores were computed and analysed. We compared the prediction quality of four pre-retrieval predictors, namely query length (QL), TF, IDF and TF-IDF. The PCC of each query prediction feature score against MAP and NDCG for the BM25 retrieval model is presented in Table 3. The table indicates the correlation of the IDF of query terms and some other query performance prediction features with average precision and NDCG in 2AIRTC. A correlation score of 1 indicates perfect agreement between a predictor score and the actual retrieval score in the rankings, whereas -1 indicates perfect inverse agreement between a predictor and a retrieval score (i.e. MAP and NDCG). All predictors except QL have a weak positive correlation with the MAP and NDCG of the queries. QL is weakly negatively correlated with the retrieval scores (i.e. MAP and NDCG). The average query length is 4. The top three best-performing predictors, in decreasing order of correlation, are TF-IDF, meanIDF and TF.

Table 3: Correlation of query performance predictors and retrieval scores.

Predictor | MAP | NDCG
IDFstd | 0.039501 | 0.03931
maxIDF | 0.030262 | 0.028146
MeanIDF | 0.250945 | 0.249829
TF | 0.130696 | 0.164911
TF-IDF | 0.339494 | 0.370119
QL | -0.15229 | -0.1304

5.5 Analyzing Retrieval Scores

The performance measures of various retrieval

scoring metrics were computed and analyzed as

shown in Figures 9 and 10. In Figure 9, we can see

strong correlations between Set_F, Set_P and

set_MAP measures with PCCs ranging from 0.95 to

0.97 for Terrier, and from 0.91 to 0.97 for both Indri

Dirichlet and Indri Jelinek-Mercer. There is also

correlation between the Set_relative_P and

Set_recall measures with PCC of 0.99 for Terrier

and 0.97 for both Indri Dirichlet and Jelinek-Mercer.

In Figure 10, we see that the success at 5 and 10

recommendations are correlated with PCC of 0.85

for both Terrier and Indri Jelinek-Mercer, and 0.76

for Indri Dirichlet.

Figure 9: Set-based measures.

Figure 10: Success measures.

From the experiments carried out to evaluate

various IR models, we see that Jelinek-Mercer

model outperformed the other models on the

2AIRTC corpus. Regarding the extracted features

from the Indri index, there is a high correlation

between Jelinek-Mercer and Dirichlet query-

document features, but generally low correlations

between the different query-IDF features, with few

of them strongly pairwise-correlated. Based on our

results, it is clear that some queries perform better

than others. Most of the query terms are

specific terms and are able to discriminate retrieval

results. Regarding the extracted features from the

Terrier index, most of them have high correlations

between the different features (except for three of

them). An interesting point of discussion is that both

query-document and document Letor features have

the same correlation matrix as well as the same data

distribution for each feature. This is because we

nearly obtained the same results for both query-

document and document Letor features. Regarding

the computed evaluation metrics from both Terrier

and Indri (for both Dirichlet and Jelinek-Mercer

smoothing) indexes, we can see very similar data

distributions and correlations for the precision,

NDCG, MAP, recall, relative precision, set-based,

success and number-based metrics. The correlation

matrices between the query-IDF predictors and the

evaluation measures show very low PCC values.

6 CONCLUSION

QPP is the task of estimating the retrieval quality of

an IR system for a given query. In this study, we

investigated the retrieval effectiveness of various IR

models, examined the quality of some query

performance predictors in IR task, and explored the

correlations between some predictors. We carried

out a range of experiments to analyze the effect of

QPP methods on 2AIRTC based on different IR

models and IR metrics. The performances of IR

models for queries were tested on 2AIRTC test

collection. Our findings reveal that the Jelinek-Mercer

model outperformed the other models and is more

strongly correlated with Dirichlet than with BM25. Strong

correlation is observed between TF, Inb2, Dlh13, Dl

and dfic features. Future work needs to focus on

developing large test collections for generalizing the

quality of a retrieval model on a given query and for

investigating the consistency of each query

performance predictor across test collections and

query types. Further investigation can be made on

the quality of more post-retrieval strategies and

automatic query expansion on all queries and

selective queries using QPP methods.

REFERENCES

Abate, M. and Assabie, Y. (2014). Development of

Amharic morphological analyzer using memory based

learning. In Proc. of the 9th Int. Conf. on Natural

Language Processing, Warsaw, pp. 1-13.

Akdere, M., Çetintemel, U., Riondato, M., Upfal, E.,

Zdonik, S. (2012). Learning-based query performance

modelling and prediction. In Proceedings of 2012

IEEE 28th International Conference on Data

Engineering Arlington, VA, USA, pp. 390-401.

Alemayehu, N. and Willett, P. (2002). Stemming of

Amharic words for information retrieval. Literary and

Linguistic Computing, 17(1), pp.1–17.

Argaw, A.A and Asker, L. (2006). Amharic-English

information retrieval. In: Peters, C., et al. Evaluation

of Multilingual and Multi-modal Information

Retrieval. CLEF 2006. Lecture Notes in Computer

Science, vol 4730. Springer, Berlin, Heidelberg.

Carmel, D. and Yom-Tov, E. (2010). Estimating the query

difficulty for information retrieval. Synthesis Lectures

Inf. Concepts Retrieval Serv, 2(1).

Cronen-Townsend, S., Zhou, Y. and Croft, W. B. (2002).

Predicting query performance. In: Järvelin, K.,

Beaulieu, M., Baeza-Yates, R.A., Myaeng, S. (eds.)

SIGIR 2002: In Proceedings of the 25th Annual

International ACM SIGIR Conference on Research

and Development in Information Retrieval, Tampere,

Finland, pp. 299-306.

Datta, S., Ganguly, D., Mitra, M. and Greene, D. (2022). A

relative information gain-based query performance

prediction framework with generated query variants.

ACM Transactions on Information Systems, 1(1).

Eshetu, A., Teshome, G. and Abebe, T. (2020). Learning

word and sub-word vectors for Amharic (Less

Resourced Language). Int. J. Adv. Eng. Res. Sci.

(IJAERS), 7(8), pp. 358-366.

Gambäck, B. (2012). Tagging and verifying an Amharic

news corpus. Workshop on Language Technology for

Normalisation of Less-Resourced Languages

(SALTMIL8/AfLaT2012), Istanbul, Turkey, pp. 79-84.

Ganguly, D., Datta, S., Mitra, M. and Greene, D. (2022).

An analysis of variations in the effectiveness of query

performance prediction. In 44th European Conference

on Information Retrieval (ECIR 2022), Stavanger,

Norway, pp. 215-229.

Gasser, M. (2011). Hornmorpho: A system for

morphological processing of Amharic, Afaan Oromo,

and Tigrinya. In Conference on Human Language

Technology for Development, Alexandria, Egypt, pp.

94-99.

He, B. and Ounis, I. (2006). Query performance

prediction. Information Systems, 31(7), pp. 585–594.

Mothe, J., Tanguy, L. (2005). Linguistic features to predict

query difficulty. In ACM Conference on Research and

Development in Information Retrieval, SIGIR,

Predicting query difficulty-methods and applications

workshop, Salvador de Bahia, Brazil, pp. 7-10.

Muhie, S., Ayele, A., Venkatesh, G., Gashaw, I. and

Biemann, C. (2021). Introducing various semantic

models for Amharic: Experimentation and evaluation

with multiple tasks and datasets. Future Internet,

13(11).

Shtok, A., Kurland, O., Carmel, D., Raiber, F., and

Markovits, G. (2012). Predicting query performance

by query-drift estimation. ACM Trans. Inf. Syst. 30(2).

Pérez-Iglesias, J. and Araujo, L. (2010). Evaluation of

query performance prediction methods by range. E.

Chavez and S. Lonardi (Eds.): SPIRE 2010, LNCS

6393, pp. 225–236.

Tao, Y. and Wu, S. (2014). Query performance prediction

by considering score magnitude and variance together.

In: Li, J., Wang, X.S., Garofalakis, M.N., Soboroff, I.,

Suel, T., Wang, M. (eds.). In Proceedings of the 23rd

ACM International Conference on Conference on

Information and Knowledge Management, CIKM

2014, Shanghai, China, pp. 1891-1894.

Teufel, S. (2007). An overview of evaluation methods in

TREC adhoc information retrieval and TREC question

answering. In Evaluation of Text and Speech

Systems, Dybkjær et al. (eds.), pp. 163-186, University of

Cambridge, United Kingdom, Springer.

Woldeyohannis, M. and Meshesha, M. (2020). Usable

Amharic text corpus for natural language processing

applications. Applied Corpus Linguistics, 2(3).

Yeshambel, T., Mothe, J. and Assabie, Y. (2020a).

Evaluation of corpora, resources and tools for Amharic

information retrieval. In Proceedings of ICAST2020,

Springer, Bahir Dar, Ethiopia, pp. 470-483.

Yeshambel, T., Mothe, J. and Assabie, Y. (2020b).

Construction of morpheme-based Amharic stopword

list for information retrieval system. In Proceedings of

ICAST2020, Springer, Bahir Dar, Ethiopia, pp. 484-498.

Yeshambel, T., Mothe, J. and Assabie, Y. (2020c).

2AIRTC: The Amharic Adhoc information retrieval

test collection. In Proceedings of CLEF 2020, LNCS

12260, Thessaloniki, Greece, pp. 55-66.

Yeshambel, T., Mothe, J. and Assabie, Y. (2020d).

Amharic document representation for adhoc retrieval.

In Proceedings of the 12th International Conference on

Knowledge Discovery, Knowledge Engineering and

Management, online, pp. 123-134.

Yeshambel, T., Mothe, J. and Assabie, Y. (2021).

Morphologically annotated Amharic text corpora. In

Proc. of 44th ACM SIGIR Conf. on Research and

Development in Information Retrieval, online

conference, Canada, pp. 2349–2355.

Yeshambel, T., Mothe, J., Assabie, Y. (2022).

Amharic Adhoc information retrieval system based on

morphological features. Applied Sciences, 12(3).

Yeshambel, T., Mothe, J., Assabie, Y. (2023). Learned

text representation for Amharic information retrieval

and natural language processing. Information, 14(3).

Zendel, O., Liu, B., Culpepper, J. and Scholer, F. (2023).

Entropy-based query performance prediction for

neural information retrieval systems. Query

Performance Prediction and Its Evaluation in New

Tasks, co-located with the 45th European Conference

on Information Retrieval (ECIR), Dublin, Ireland.

Zhao, Y., Scholer, F. and Tsegay, Y. (2008). Effective

pre-retrieval query performance prediction using

similarity and variability evidence. C. Macdonald et al.

(Eds.): ECIR 2008, LNCS 4956, pp. 52–64.

Zhou, Y. and Croft, W.B. (2006). Ranking robustness: a

novel framework to predict query performance. In:

Yu, P.S., Tsotras, V.J., Fox, E.A., Liu, B. (eds.) In

Proceedings of the 2006 ACM CIKM International

Conference on Information and Knowledge

Management, Arlington, Virginia, USA, 2006, pp.

567-574.

Yimam, B. (2001). Yamarigna sewasiw (Amharic

grammar). CASE. Addis Ababa, 2nd edition.

KDIR 2023 - 15th International Conference on Knowledge Discovery and Information Retrieval
