Personalization of Dataset Retrieval Results Using a Data Valuation Method
Malick Ebiele (1), Malika Bendechache (2), Eamonn Clinton and Rob Brennan (1)
ADAPT, School of Computer Science, University College Dublin, Belfield, Dublin, Ireland
School of Computer Science, University of Galway, Galway, Ireland
Tailte Éireann, Phoenix Park, Dublin, Ireland
Keywords: Data Valuation, Data Value, Personalized Data Value, Dataset Retrieval, Information Retrieval, Quantitative Data Valuation.
Abstract: In this paper, we propose a data valuation method for re-ranking Dataset Retrieval (DR) results.
Dataset retrieval is a specialization of Information Retrieval (IR) where instead of retrieving relevant doc-
uments, the information retrieval system returns a list of relevant datasets. To the best of our knowledge,
data valuation has not yet been applied to dataset retrieval. By leveraging metadata and users’ preferences,
we estimate the personal value of each dataset to facilitate dataset ranking and filtering. With two real users
(stakeholders) and four simulated users (users’ preferences generated using a uniform weight distribution), we
studied the user satisfaction rate. We define users’ satisfaction rate as the probability that users find the datasets
they seek in the top k = {5, 10} of the retrieval results. Previous studies of fairness in rankings (position bias)
have shown that the probability or the exposure rate of a document drops exponentially from the top 1 to the
top 10, from 100% to about 20%. Therefore, we calculated the Jaccard score@5 and Jaccard score@10 be-
tween our approach and other re-ranking options. It was found that there is a 42.24% and a 56.52% chance on
average that users will find the dataset they are seeking in the top 5 and top 10, respectively. The lowest chance
is 0% for the top 5 and 33.33% for the top 10, while the highest chance is 100% in both cases. The dataset
used in our experiments is a real-world dataset: the result of a query sent to a national mapping agency’s data
catalog. In the future, we plan to extend the experiments performed in this paper to publicly available
data catalogs.
Given rapidly rising data volumes, knowing which
data to keep and which to discard has become an
essential task. Data valuation has emerged as a
promising approach to tackle this problem (Even and
Shankaranarayanan (2005)). The primary focus of
data valuation research is the development of methodologies
for determining the value of data (Khokhlov
and Reznik; Laney; Qiu et al.; Turczyk et al.; Wang
et al.; Wang et al. (2020; 2017; 2017; 2007; 2021; 2020)).
Data valuation methods have been applied to data
management, machine learning, system security, and
energy (Khokhlov and Reznik; Turczyk et al.; Wang
et al.; Wang et al. (2020; 2007; 2021; 2020)). There
have been no previous attempts to apply data valua-
tion to dataset retrieval. Dataset retrieval is a special-
ization of information retrieval where instead of re-
trieving relevant documents the Information Retrieval
system returns a list of relevant datasets (Kunze and
Auer (2013)). Dataset retrieval systems return relevant datasets for a given query. The retrieved datasets are sorted alphabetically by name, by another metadata field such as creation date, or by a ranking algorithm built into the dataset retrieval technique. However, they do not consider the user’s preferences in terms of metadata. Some dataset retrieval software allows users to sort the results by a single metadata field at a time, such as creation date, usage, or last update, or to filter the results using Boolean operations. However, none of them allows users to sort the results by a combination of those metadata (see Equation 5 below). In this paper, we propose a metadata-based data valuation method that allows users to sort dataset retrieval results using a combination of metadata.
Position bias is the study of the relationship be-
tween the ranking or position of a retrieved document
and the exposure it receives (Agarwal et al.; Craswell
et al.; Jaenich et al.; Wang et al. (2019; 2008; 2024;
2018)). In other words, position bias is the study of
the probability of a document being consulted by a
user according to its position among the retrieved doc-
uments. Previous studies have shown that the proba-
bility or the exposure rate of a document drops ex-
ponentially from the top 1 to the top 10 and then
more logarithmically from the top 11 to the top 100
(Jaenich et al. (2024)). Jaenich et al. (2024) also
showed that the number of possible orderings of doc-
uments for rankings of size k = 1 ...100 grows expo-
nentially. Using a group of 6 job seekers as an ex-
ample, Singh and Joachims (2018) illustrated how a
small difference in relevance (used to order retrieved
documents or items) can lead to a large difference in
exposure (an opportunity) for the group of females.
They showed that a 0.03 difference in average rele-
vance (between the top 3 who are all male and the
bottom 3 who are all female) can result in a 0.32 difference
in average exposure. The difference in average
exposure (between the top 3 and the bottom 3) is thus
roughly 10 times the difference in average relevance.
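The amplification described above can be reproduced with a small numerical sketch. The relevance scores below and the exposure model (exposure at rank j equal to 1 / log2(1 + j)) are assumptions made for illustration; they are chosen so that the gaps match the 0.03 and 0.32 figures quoted above and are not taken from this paper.

```python
import math

# Hypothetical relevance scores for six candidates, already sorted by rank.
# These values are illustrative assumptions.
relevance = [0.82, 0.81, 0.80, 0.79, 0.78, 0.77]


def exposure(rank: int) -> float:
    """Assumed position-based exposure: 1 / log2(1 + rank), with rank >= 1."""
    return 1.0 / math.log2(1 + rank)


def mean(values):
    return sum(values) / len(values)


exposures = [exposure(rank) for rank in range(1, len(relevance) + 1)]

# Gap between the top 3 and the bottom 3, in relevance and in exposure.
relevance_gap = mean(relevance[:3]) - mean(relevance[3:])
exposure_gap = mean(exposures[:3]) - mean(exposures[3:])

print(f"relevance gap: {relevance_gap:.2f}")
print(f"exposure gap:  {exposure_gap:.2f}")
```

Under these assumptions, a near-flat relevance profile still produces a steep exposure profile, which is why the order of the first few results matters so much.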
The above studies show that putting the most rel-
evant information on top or providing a fair ranking
is crucial. Many fair ranking techniques have been
designed to attempt to solve the fairness problem in
rankings (Singh and Joachims; Zehlike et al.; Zehlike
et al. (2018; 2022a; 2022b)). To the best of our knowl-
edge, none of the existing ranking techniques inte-
grate the user’s preferences in the ranking algorithm
or use them as a post-retrieval step to re-rank the retrieved information. Here, we present a metadata-based data valuation technique that takes in the retrieved datasets’ metadata and the user’s preferences and outputs a re-ranking of the retrieved datasets. It is worth noting that, because data value is a relative measure, if a dataset $d_i$ is more valuable than $d_j$ considering the whole set of datasets $D$, then $d_i$ will always be ranked higher than $d_j$ considering $D$ or any subset of $D$ containing both datasets.
Many of the existing data valuation approaches
are subjective. This is due to the subject-dependent
nature of some dimensions (e.g. Utility dimen-
sion) that characterise data value (Attard and Brennan
(2018)) or the subject-dependent weighting tech-
niques (in the case of weighted averaging or sum-
ming) (Deng et al.; Odu (2023; 2019)). Subjective
metrics of data value dimensions (metadata are proxy
for data value dimensions, therefore usage metadata
and usage dimension mean the same thing) or weight-
ing techniques can only be defined by individual users
or experts based on their personal views, experiences,
and backgrounds. These are opposed to objective
metrics that can be determined precisely based on a
detailed analysis of the data or extracted from the data
infrastructure (Bodendorf et al. (2022)). Developing a fully objective data valuation model is therefore challenging: some dimensions are difficult to measure objectively, and experts can be expensive. We believe that, instead of generalizing subjective metrics and weighting techniques, it would be better to develop personalized data valuation models. The difference between subjective data value and personalized data value is that the former assumes that the same subjective metrics and weights can be applied to every user, whereas the latter requests the subjective metrics and weights from each user, as a representation of their preferences, to calculate a personal data value.
Choosing a suitable weighting technique is an ad-
ditional challenge for weighted approaches to data
valuation. For instance, usage-over-time, one of the
first data valuation methods, uses a weighting
technique based on recency (Chen (2005)). The
recency-based weighting technique is objective. The
only subjective decision is whether to assign
higher or lower weights to the more recent Usage
metadata. Chen (2005) assigned higher weights to
the more recent Usage metadata, which is logical
for their use case. In our case, the desired weighting
technique should be subjective, performant (have low
complexity for calculation), and straightforward for
the users to interact with.
The research question is: To what extent can
metadata-based data valuation methods improve the
results of dataset retrieval systems in terms of users’ satisfaction?
To answer this research question, we designed and
implemented a metadata-based data valuation method
and applied it to a dataset retrieval use case for a Na-
tional Mapping Agency. The goal is to improve the
users’ satisfaction by putting on top the datasets they
consider more valuable. This is done by taking into
account the customers’ dataset preferences to re-rank
the retrieved datasets.
The contributions of this paper are as follows:
• The first application of a metadata-based data valuation method to dataset retrieval.
• A personalized and interactive data valuation method; extant methods are mainly subjective approaches.
The remainder of this paper is structured as fol-
lows. Section 2 gives a description of the use case.
Section 3 describes the related work. Our proposed
metadata-based data valuation method is explained in
Section 4. Section 5 explains our experimental de-
sign. In Section 6, the experimental results are shown
and discussed. Finally, the conclusion and future
work are presented in Section 7.
2.1 Project Description
This data valuation project is part of an ongoing collaboration
between researchers from University College
Dublin (UCD) and Tailte Éireann (TE). Tailte
Éireann is Ireland’s state agency for property
registrations, property valuation and national mapping
services. It was established on 1 March 2023
from a merger of the Property Registration Author-
ity (PRA), the Valuation Office (VO) and Ordnance
Survey Ireland (OSI). The end goal of this collab-
oration is to design and implement a data valuation
method for TE’s datasets from the customer’s per-
spective. They would like to apply a metadata-based
data valuation to re-rank the results of a query sent
to their dataset retrieval platform. The data valuation
method should take into account the customers’ pref-
erences in terms of metadata. At this stage, the goal
is to design and implement a proof of concept.
2.2 Current Dataset Retrieval Process
Figure 1 below displays the current dataset retrieval process (in Blue) and our proposed personalized dataset retrieval process (in Green). In the current process, the user sends a query to the data catalog. The query is then processed and used to extract the relevant datasets from the data catalog. The retrieved datasets are finally formatted in a user-friendly way and sent to the user. In our proposed approach, simultaneously with or after sending the query, the user can specify their preferences in terms of the retrieved datasets’ metadata. The user preferences go through a validity test (to check that the provided weights are not all zero). The retrieved datasets’ metadata and the user preferences are then used to compute the value of each retrieved dataset. The calculated data value is finally used to re-rank the retrieved datasets before formatting them in a user-friendly way and sending them to the user. If no preferences are provided or if they are invalid, then the retrieved datasets are presented alphabetically.
Figure 1: Personalized datasets retrieval using a metadata-
based data valuation. In Blue is the current dataset retrieval
process. In Green are the additional steps we proposed to
personalize dataset retrieval results.
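The additional (Green) steps can be sketched as a small post-retrieval function. This is a minimal illustration under stated assumptions, not the deployed implementation: the function names, the dictionary layout, and the metadata values are hypothetical, the metadata are assumed to be already normalized to [0, 1], and ties are broken alphabetically.

```python
from typing import Dict, List, Optional

Metadata = Dict[str, float]  # assumed: normalized metadata values in [0, 1]


def preferences_are_valid(weights: Optional[Dict[str, float]]) -> bool:
    """Validity test: preferences are provided and not all weights are zero."""
    return bool(weights) and any(w > 0 for w in weights.values())


def rerank(datasets: Dict[str, Metadata],
           weights: Optional[Dict[str, float]]) -> List[str]:
    """Return dataset names in the order they would be shown to the user."""
    if not preferences_are_valid(weights):
        # Missing or invalid preferences: fall back to alphabetical order.
        return sorted(datasets)

    total = sum(weights.values())

    def value(name: str) -> float:
        # Weighted average of the metadata, using weights normalized to sum to 1.
        meta = datasets[name]
        return sum(w / total * meta[field] for field, w in weights.items())

    # Highest data value first; ties are broken alphabetically.
    return sorted(datasets, key=lambda name: (-value(name), name))


# Hypothetical normalized metadata for three retrieved datasets.
retrieved = {
    "itm/ortho": {"usage": 0.9, "currency": 0.2, "objects": 0.1},
    "itm/basemap public": {"usage": 0.4, "currency": 0.8, "objects": 0.7},
    "wm/digitalglobe": {"usage": 0.1, "currency": 0.5, "objects": 1.0},
}

personalized = rerank(retrieved, {"usage": 9, "currency": 0, "objects": 1})
fallback = rerank(retrieved, {"usage": 0, "currency": 0, "objects": 0})
```

Keeping the fallback inside the same function means the formatting step downstream does not need to know whether preferences were supplied.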
2.3 Information and Dataset Retrieval
Performance Metrics
The Jaccard index, also known as the Jaccard score, has been chosen to evaluate the users’ satisfaction. The Jaccard score measures the similarity between two finite sets and is defined as the size of their intersection divided by the size of their union (see Equation 1 below). The truncated Jaccard score at k (Jaccard score@k), which only focuses on the top k elements, is preferred for our use case. As shown in Section 1, only the top k elements (with k ≤ 10) are likely to be consulted. Therefore, focusing mainly on the top k elements makes sense.
However, the Jaccard score does not take positions into account, so the Normalized Discounted Cumulative Gain (NDCG) has also been calculated. NDCG is widely used and applies a discount function over the rank, while many other measures weight all positions uniformly (see Equation 2 below). It measures the degree to which our ranking matches other rankings.
NDCG and Jaccard score@k range between 0 and 1, with 1 being the optimal performance. We used the scikit-learn implementation of NDCG with the default parameters and our own implementation of Jaccard score@k in Python.
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \qquad (1)$$
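A minimal sketch of the truncated Jaccard score at k follows. It is not the implementation used in our experiments; the function name and the example rankings are assumptions made for illustration.

```python
from typing import Hashable, Sequence


def jaccard_at_k(ranking_a: Sequence[Hashable],
                 ranking_b: Sequence[Hashable],
                 k: int) -> float:
    """Jaccard score between the top-k items of two rankings (order ignored)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    if not union:
        # Convention chosen for this sketch: two empty rankings score 0.
        return 0.0
    return len(top_a & top_b) / len(union)


# Hypothetical rankings expressed as dataset IDs.
ours = [1, 2, 3, 4, 5, 6]
other = [2, 1, 7, 8, 3, 9]
score = jaccard_at_k(ours, other, k=5)  # 3 shared items out of 7 distinct ones
```

Note that when both rankings are permutations of the same 15 retrieved datasets, any two top-10 lists share at least 5 items, so Jaccard score@10 cannot fall below 5/15 ≈ 0.333.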
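$$\mathrm{NDCG}_D(f, S_n) = \frac{\mathrm{DCG}_D(f, S_n)}{\mathrm{IDCG}_D(S_n)}, \qquad \mathrm{DCG}_D(f, S_n) = \sum_{r=1}^{n} G\big(y^{f}_{(r)}\big)\, D(r), \qquad \mathrm{IDCG}_D(S_n) = \max_{f'} \mathrm{DCG}_D(f', S_n) = \mathrm{DCG}_D(f^{*}, S_n) \qquad (2)$$
with $D(r) = \frac{1}{\log_b(1 + r)}$ (inverse logarithm decay with base $b$) the discount function, $S_n$ is a dataset, $f$ is a ranking function, $f^{*}$ is the best ranking function on $S_n$, $G$ is the Gain, and $y^{f}_{(r)}$ is the relevance of the item that $f$ places at rank $r$. $\mathrm{DCG}_D(f, S_n)$ is the Discounted Cumulative Gain (DCG) of $f$ on $S_n$ with discount $D$, and $\mathrm{IDCG}_D(S_n)$ is the Ideal DCG.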
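The NDCG definition above can be sketched in a few lines of plain Python. This is an illustration only: the experiments rely on the scikit-learn implementation, whereas the function names below, the linear gain G(y) = y, and the default base b = 2 are assumptions of this sketch.

```python
import math
from typing import Sequence


def dcg(gains_in_rank_order: Sequence[float], base: float = 2.0) -> float:
    """Discounted cumulative gain with discount D(r) = 1 / log_b(1 + r)."""
    return sum(gain / math.log(1 + rank, base)
               for rank, gain in enumerate(gains_in_rank_order, start=1))


def ndcg(gains: Sequence[float], scores: Sequence[float],
         base: float = 2.0) -> float:
    """NDCG of the ranking induced by `scores`, judged against `gains`."""
    # Rank items by descending score (the ranking function f).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_gains = [gains[i] for i in order]
    # The ideal ranking sorts the items by their true gain.
    ideal = dcg(sorted(gains, reverse=True), base)
    return dcg(ranked_gains, base) / ideal if ideal > 0 else 0.0


# Hypothetical example: data values under one user's preferences act as gains,
# and the values produced by another re-ranking act as scores.
user_values = [3.0, 2.0, 1.0, 0.0]
same_order = ndcg(user_values, [0.9, 0.7, 0.4, 0.1])
reversed_order = ndcg(user_values, [0.1, 0.4, 0.7, 0.9])
```

A re-ranking that preserves the user's order scores 1, while any other order is penalized more heavily for mistakes near the top.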
This section describes the current state-of-the-art in-
formation and dataset retrieval approaches and their
limitations. Then, it highlights the challenges related
to weighted average approaches because the approach
proposed in this paper falls into that category.
3.1 Information and Dataset Retrieval
Tamine and Goeuriot (2021) define Information re-
trieval (IR) as a system that deals with the repre-
sentation, storage, organization and access to infor-
mation items. It has two main processes: Indexing
(which consists of building computable representa-
tions of content items using metadata) and Retrieval
(which consists of optimally matching queries to rel-
evant documents) (Tamine and Goeuriot (2021)). IR
models have evolved since the 1960s from Boolean
to Neural Networks (Lavrenko and Croft; Liu; Maron
and Kuhns; Mitra and Craswell; Robertson et al.;
Salton et al.; Salton and McGill; Tamine and Goeuriot
(2001; 2009; 1960; 2018; 1980; 1983; 1986; 2021)).
Hambarde and Proença (2023) argue that IR systems have two stages: retrieval and ranking. The retrieval stage consists of four main techniques: Conventional IR, Sparse IR, Dense IR, and Hybrid IR techniques. The last is any combination of the first three. The ranking stage consists of two main approaches: Learning To Rank and Deep Learning Based Ranking approaches. For more details on this categorization of IR techniques, please refer to Hambarde and Proença (2023).
Liu et al. (2020) argue that the IR research com-
munity has long agreed that major improvement of
search performance can only be achieved by taking
account of the users and their contexts, rather than
through developing new retrieval algorithms that have
reached a plateau. Three main approaches have been
employed to personalize IR results: Query expansion,
Result re-ranking, and Hybrid personalization tech-
niques (Liu et al. (2020)). Query expansion collects
additional information about user interest from het-
erogeneous sources, represents them by some terms,
and automatically adds these terms to the initial query
for a refined search (Bai et al.; Belkin et al.; Bian-
calana et al.; Bilenko et al.; Bouadjenek et al.; Budzik
and Hammond; Buscher et al.; Cai and de Rijke; Chen
and Ford; Chirita et al.; Jayarathna et al.; Kelly
et al.; Kraft et al. (2007; 2005; 2008; 2008; 2013;
1999; 2009; 2016; 1998; 2007; 2013; 2005; 2005)).
Result re-ranking techniques reorder search results
for users according to document relevance (Gauch
et al.; Liu et al.; Liu and Hoeber; Tanudjaja and
Mui; Wang et al. (2003; 2002; 2011; 2002; 2013)).
Hybrid techniques combine query expansion and re-
sult re-ranking; they outperform either one individu-
ally but are under-explored (Ferragina and Gulli; Lv
et al.; Pitkow et al.; Pretschner and Gauch; Shen et al.
(2005; 2006; 2002; 1999; 2005)).
Most re-ranking systems are not interactive. They rely on pre-set weighting criteria for re-ranking, giving heavier weights to documents that match user interests and pushing them to the top ranks (Liu et al.; Tanudjaja and Mui (2002; 2002)). The interactive ones present the top k documents to the users for feedback and then refine the ranking based on that feedback (Gauch et al.; Liu and Hoeber; Wang et al. (2003; 2011; 2013)).
Thus, interactive IR result re-ranking based on users’ preferences is under-explored. The approach proposed in this paper is an interactive dataset retrieval technique based on users’ preferences in terms of the retrieved datasets’ metadata.
3.2 Weighted Average Data Valuation
There were also previous attempts to calculate the
data value using weighted averaging of metadata de-
scribing data value dimensions (Chen; Ma and Zhang;
Qiu et al. (2005; 2019; 2017)). For instance, measur-
ing usage-over-time is one of the first data valuation
methods and it estimates data value with the weighted
averaging approach of Chen (Chen (2005)). It con-
sists of splitting the usage data into a series of time
slots, assigning a weight to each time slot, and then
computing the data value using the weighted aver-
age. The weights are the normalized recency weights.
The more recent time slots are assigned the higher
weights (Chen (2005)). Ma and Zhang (2019) ex-
tended the usage-over-time model by adding the age
and size dimensions. Their Multi-Factors Data Valu-
ation Method (MDV ) is a trade-off between dynamic
and static data value. The dynamic data value is the
usage-over-time model of Chen. The static data value
is the weighted average of the normalized age and
size. The weights of the age and size dimensions are
assigned subjectively by experts.
Qiu et al. (2017) used the Analytic Hierarchy Pro-
cess (AHP) which is a different weighting approach.
AHP requires a subjective rating of the input dimen-
sions in pairs. These pairwise comparisons are then
arranged in a matrix (the Judgement matrix, see Appendix 7), from which a final weighting of the dimensions is calculated. AHP is technically straightforward to implement and, more importantly, makes it possible to assess the transitivity consistency of the pairwise comparison matrix by assigning a consistency score to it. However, experts are still needed for the pairwise rating of the input dimensions. Qiu et al. (2017) use 6 dimensions in their model.
Those dimensions are: the size of the data (S), the
access interval (T ), the data read and write frequency
(F), the number of visits (C), the contents of the file
(D), and the potential value of the data (V ). For more
details on the dimensions used, please refer to Qiu
et al. (2017).
The challenge of applying weighted approaches
is the weighting technique. In our case, the desired
weighting technique must be straightforward for the
users to interact with and fast to compute as it is sup-
posed to be integrated into a live system for interac-
tive IR re-ranking. The weighting approach used in
this paper is detailed in Section 4.1.
To the best of our knowledge, the application of
a metadata-based data valuation approach to dataset
retrieval proposed in this study is unique. Also, none
of the studies described above validated the outputs of
their data valuation approaches. Our approach is val-
idated using preferences from two stakeholders and
four simulated users.
Our method has two main steps: first dimension meta-
data weight determination and then data value calcu-
lation. These are described below.
4.1 Weight Determination
The Analytic Hierarchy Process (AHP) was our first choice because of its sound mathematical basis (Saaty (1987)). However, it was challenging to apply because, instead of assigning a weight to each metadata or dimension, a pairwise comparison of the dimensions is needed (Saaty (1987)). For example: usage is twice as important as creation date, usage is 5 times more important than the number of spatial objects, or usage is half as important as currency. This exercise was difficult for the stakeholders who participated in the experiments. They reported being more comfortable with a rating-like weighting approach, e.g. the 1 to 5 rating mechanism used for websites or products. Also, AHP assumes that preferences are transitive and has a transitivity consistency test. Saaty (1987) advises discarding the weights deduced from the pairwise comparisons if the consistency ratio is greater than 0.1. Previous studies showed that preferences are not always transitive (Alós-Ferrer et al.; Alós-Ferrer and Garagnani; Fishburn; Gendin (2023; 2021; 1991; 1996)). Alós-Ferrer et al. (2023) show, using two preference datasets, that regardless of the initial assumptions, and even when the preferences are supposed to be transitive, up to 27.45% of individual preferences are non-transitive. We believe that assuming that all preferences are transitive implies ignoring some individual preferences. Therefore, we used a slider from 0 to 10 (with a step of 1) as the weight determination technique; the presence of a zero rating allows the individual to discard a particular metadata as not relevant to the use case or at that time. This approach proved straightforward and inclusive when tested during the interviews with the stakeholders. The only constraint in our weighting approach is that at least one of the provided weights should be non-zero.
4.2 Data Value Calculation
This is split into the following steps: Data preprocess-
ing and Data value calculation.
4.2.1 Data Preprocessing
As the collected metadata values have different scales, they must be normalized. The weights must also be normalized. For the Number of spatial objects metadata (see Table 1 below for the description), the values are divided by the maximum value. Usage is a time series (collected monthly from January 2017 to January 2023), so it is normalized by dividing each value by the maximum value of its month. The current Usage value is then the 6-month Exponential Moving Average (EMA). EMA is widely used in finance to capture stock and bond price trends while reducing noise such as sudden sharp moves. It was first introduced by Roberts (1959) (see Equation 3). The 6-month EMA was calculated using the Pandas implementation with default parameters. As for the creation date, we applied the probabilistic approach to calculating data currency with a decline rate of 20%. This approach was proposed by Heinrich and Klier (2011), and the data currency formula $Q_{Curr}(\omega, A)$ is shown in Equation 4 below, where $\omega$ is a value of the attribute $A$. The motivation is that the currency of information
does not solely depend on its age but also on whether
the information is likely to change over time or not.
For instance, a satellite image of a mountain range
might still be relevant even if the image is 30 years
old. On the other hand, a 10-year-old satellite image
of road networks might be outdated.
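$$\mathrm{EMA}_t(U, n) = \frac{\sum_{i=0}^{n} (1 - \alpha)^{i}\, U_{t-i}}{\sum_{i=0}^{n} (1 - \alpha)^{i}} \qquad (3)$$
where $t$ is the current time, $n$ the number of past periods, $U$ the time series of usage metadata, $U_t$ the usage metadata at time $t$, $\alpha$ ($0 < \alpha \le 1$) the smoothing factor, and $\mathrm{EMA}_t(U, n)$ the EMA of usage metadata at time $t$ considering $n$ previous periods.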
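As an illustration of the weighted form of the EMA described above, the following sketch computes it in plain Python. It is not the Pandas call used in our experiments. The function name is hypothetical, and deriving the smoothing factor from a 6-period span as alpha = 2 / (span + 1) is an assumption of this sketch; the text only states that default parameters were used.

```python
from typing import Sequence


def ema(usage: Sequence[float], alpha: float) -> float:
    """EMA of the latest value: sum((1-a)^i * U[t-i]) / sum((1-a)^i)."""
    if not usage:
        raise ValueError("usage must contain at least one value")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    numerator = denominator = 0.0
    # i = 0 is the most recent observation, i = 1 the one before it, and so on.
    for i, value in enumerate(reversed(usage)):
        weight = (1 - alpha) ** i
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Assumed mapping from a 6-month span to the smoothing factor.
span = 6
alpha = 2 / (span + 1)

# Hypothetical normalized monthly usage with one sharp spike.
monthly_usage = [0.20, 0.22, 0.21, 0.90, 0.23, 0.24]
smoothed = ema(monthly_usage, alpha)
```

Because the weights decay geometrically, the spike three months back still contributes to the current value, but far less than the most recent observations.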
$$Q_{Curr}(\omega, A) := \exp(-\mathrm{decline}(A) \cdot \mathrm{age}(\omega, A)) \qquad (4)$$
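A short numerical sketch of the currency measure follows, using the 20% decline rate stated above. Expressing age in years and the function name are assumptions of this sketch.

```python
import math


def currency(age_years: float, decline_rate: float = 0.2) -> float:
    """Probabilistic currency: exp(-decline * age); age in years is assumed."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    return math.exp(-decline_rate * age_years)


fresh = currency(0)       # a dataset created today
ten_years = currency(10)  # e.g. a 10-year-old image of road networks
thirty_years = currency(30)
```

With a single decline rate, currency depends on age alone. Capturing the mountain-range example above, where old data stays relevant, would require a lower decline rate for slowly changing attributes; the rate used per attribute is therefore a modelling choice.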
For the weights, the weight of each metadata has
been divided by the sum of the weights of all three
metadata per stakeholder.
4.2.2 Actual Data Value Calculation
The data value is then the weighted average of the metadata values, as given by Equation 5 below.
$$V(d_i) = w_U \times U_i + w_Q \times Q_i + w_O \times O_i, \qquad (5)$$
where $w_{\{U, Q, O\}} \in [0, 1]$ are the weights and $V(d_i) \in [0, 1]$ is the data value. U, Q, and O stand for Usage, Currency (derived from the Creation date; see Equation 4), and Number of Spatial Objects, respectively.
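As a worked example of this weighted average, the sketch below normalizes a set of raw slider weights and applies them to one dataset. The function name and the metadata values are hypothetical; the raw weights (10, 8, 5) are those of SH1 in Table 3, read here in the order Usage, Currency, Number of Spatial Objects, which is an assumption about the column order.

```python
from typing import Sequence


def data_value(raw_weights: Sequence[float],
               usage: float, currency: float, objects: float) -> float:
    """Weighted average of normalized metadata, with weights rescaled to sum to 1."""
    total = sum(raw_weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    w_u, w_q, w_o = (w / total for w in raw_weights)
    return w_u * usage + w_q * currency + w_o * objects


# Raw slider weights (0 to 10) and hypothetical normalized metadata values.
sh1_weights = (10, 8, 5)
value = data_value(sh1_weights, usage=0.6, currency=0.5, objects=0.2)
# (10 * 0.6 + 8 * 0.5 + 5 * 0.2) / 23 = 11 / 23
```

Since both the metadata and the normalized weights lie in [0, 1] and the weights sum to 1, the resulting data value also lies in [0, 1].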
Metadata / Data Value        Description
Usage                        Access counts. It measures how many times a given dataset has been accessed.
Creation date                Date the first version was made available to the users, or the last date it was updated.
Number of spatial objects    The number of geometric data (e.g. points, lines, polygons, paths) in the dataset. It is a domain-relevant measure of data volume and information content.
Figure 2 below shows the flowchart of our experi-
mental design. The experiments consist of re-ranking
dataset retrieval results using a metadata-based data
valuation technique. It has four main steps: Metadata
extraction, User preferences request, Data value cal-
culation, and Re-ranking of the retrieved datasets.
5.1 Metadata Extraction
This consists of extracting metadata from the data cat-
alog system. For this use case, only three metadata
types have been extracted from 15 datasets: creation
date, number of spatial objects, and usage. The 15
selected datasets are the results of a query sent to the
data catalog system; they are ordered alphabetically
by default.
5.2 User Preferences Request
For this use case, the user preferences have been
requested during interviews with three stakeholders.
The stakeholders included in this study are managers
within the mapping agency with data management re-
sponsibilities for at least 3 years each.
The main goal of each interview (15-20 minutes)
was to get the stakeholders to assign weights to each
metadata field. A slider from 0 to 10 (with a step of
1) is used to assign the weight to each metadata.
Table 3 shows the weights provided by each stake-
holder. Stakeholder 2 (SH2) provided an invalid set of
weights (all of the weights are zero) because all of the
metadata selected for this case study was irrelevant to
them. Therefore, the retrieved datasets will be alpha-
betically presented to Stakeholder 2.
5.3 Personal Data Value Calculation
The personal data value is calculated for each dataset
using the valid weights provided by stakeholders SH1
and SH3 and four randomly generated users’ prefer-
ences (using a uniform weight distribution) and Equa-
tion 5. The datasets are then ranked by data value.
The resulting personalized rankings are then com-
pared to the default alphabetic order, MDV and AHP-
based re-rankings. They were also compared to the
univariate rankings based on each metadata independently (Usage, Number of Spatial Objects, and Currency; the current IR/DR data catalog re-ranking options).
6.1 Comparison with Other Data
Valuation Approaches
In this Section, we compare our approach with other
data valuation approaches: Chen (2005)’s usage-over-
time, Ma and Zhang (2019)’s MDV, and Qiu et al.
(2017)’s AHP-based data valuation techniques.
6.1.1 Our Approach vs Usage-over-Time Model
To compute the usage-over-time data value (see Equation 6), we used a valuation period (vp) of 6 months, a lifestage length s of 1 month (usually set according to the granularity of the usage metadata, here monthly), N = 6 (N is the number of lifestages per valuation period), and x = 2 (x, together with N, regulates the slope of the weight distribution). Chen (2005) suggests that significantly flat (too large x or N) or steep (too small x or N) weight distributions should be avoided. Chen (2005) also advised that a valid valuation period for long-lived information should be at least a few months on a quarterly or semi-annual basis.
We chose x = 2 because, in the examples shown by Chen (2005) with N = 5, the weight distribution is too flat for x = 1.2, too steep for x = 3, and in between for x = 2.
We have 13 valuation periods with a length of 6
months for the first 12 periods and 1 month for the
last period. Therefore, for the last valuation period,
UT is equal to the collected usage data.
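$$V_{UT}(d) = \sum_{i=1}^{N} \big(w(i) \times f(U_i(d))\big), \qquad 0 \le f(U_i(d)) \le 1,$$
$$w(i) = \dots, \qquad \sum_{i=1}^{N} w(i) = 1, \qquad x \ge 1,$$
$$vp = [t - (N \times s),\, t] \qquad (6)$$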
Figure 3 below shows the Usage metadata trends
of the retrieved datasets (Figure 3a), the usage-over-
time (Figure 3b), and 6-month Exponential Moving
Average (EMA-6, Figure 3c). We can see that both
usage-over-time (as per Chen) and our proposed 6-
month EMA capture the main usage trends with re-
duced noise (steep highs and lows). The main differ-
ence is that the 6-month EMA reduces the effects of
the noise on the present values while the usage-over-
time removes them completely. EMA is preferred be-
cause it captures every movement while usage-over-
time fails for the same valuation period.
To make the graphs below and in the remainder of
this paper easy to read, Table 2 has been generated. It
maps each dataset to a unique ID. The dataset names
have been sorted alphabetically and an ID starting
from 1 has been assigned to them.
Table 2: Dataset IDs and Names Mapping.
IDs Datasets
1 ig/basemap premium
2 itm/6inch cassini
3 itm/basemap premium
4 itm/basemap public
5 itm/digitalglobe
6 itm/historic 25inch
7 itm/historic 6inch cl
8 itm/national high resolution imagery
9 itm/ortho
10 itm/ortho 2005
11 wm/basemap eire
12 wm/basemap ms public
13 wm/basemap premium
14 wm/basemap public
15 wm/digitalglobe
6.1.2 Our Approach vs MDV and AHP Data
Valuation Approaches
MDV (Ma and Zhang (2019)) is a natural extension of the usage-over-time model that adds the Age (Valuation Date minus Creation Date) and the Size metadata to the Usage metadata. MDV is calculated using Equation 7 below. The weights of the age ($w_A$) and the size ($w_S$), and the trade-off coefficient $k$, are set to $w_A = w_S = 0.5$ and $k = 0.2$, the same values as in the example presented by Ma and Zhang (2019). The Age and Size metadata are normalized using the MinMax scaler (Scikit-learn implementation). […] considered more valuable.
Figure 2: Experimental design for personalized metadata-based data valuation.
Figure 3: Comparison of usage-over-time with a 6-month EMA at capturing the usage metadata trends. Panels: (a) Usage metadata; (b) Usage-over-time; (c) 6-month Exponential Moving Average (EMA).
$$V = k\,V_{static} + (1 - k)\,V_{dynamic},$$
$$V_{static} = w_S \times f(S(d)) + w_A \times f(A(d)), \qquad 0 \le f(S(d)) \le 1, \quad 0 \le f(A(d)) \le 1,$$
$$V_{dynamic} = V_{UT}(d) \text{ (see Equation 6).} \qquad (7)$$
As we could not collect pairwise comparisons of the metadata from the stakeholders (see Section 4.1), we use the weights they provided (see Table 3) to produce proxy pairwise comparisons. The provided weights are summed per metadata type, and then the inverse of each sum is multiplied by the maximum of the sums (see Table 4 and Equation 8). The resulting pairwise comparison vector is used to fill out the AHP Judgement matrix using its reciprocity and transitivity properties (see Appendix 7).
$$V = \left[\frac{\max(w_U, w_Q, w_O)}{w_U},\ \frac{\max(w_U, w_Q, w_O)}{w_Q},\ \frac{\max(w_U, w_Q, w_O)}{w_O}\right] = \left[1,\ \frac{w_U}{w_Q},\ \frac{w_U}{w_O}\right], \qquad (8)$$
because $w_U = \max(w_U, w_Q, w_O)$, with $w_U \neq 0$, $w_Q \neq 0$, $w_O \neq 0$, where $w_U$, $w_Q$, and $w_O$ denote the summed weights of the three metadata types (see Table 4).
The pairwise comparison vector V is the first row vector of the judgement matrix P because the diagonal elements of P are equal to 1. From V, we can deduce the first column vector of P using its reciprocity property. The rest of the matrix P is then filled out using its transitivity property. For more details, see Appendix 7.
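The construction described above can be sketched as follows, starting from the summed weights reported in Table 4 (19, 17, 9). Variable names are hypothetical. Deriving the AHP weights by normalizing one column of the judgement matrix is a simplification chosen for this sketch; it coincides with the principal-eigenvector weights only because the matrix built this way is perfectly consistent.

```python
# Summed stakeholder weights per metadata type (values from Table 4).
sums = [19, 17, 9]

# Proxy pairwise comparison vector: maximum of the sums divided by each sum.
first_row = [max(sums) / s for s in sums]  # [1, 19/17, 19/9]

n = len(first_row)
# Reciprocity gives the first column, P[i][0] = 1 / P[0][i];
# transitivity gives the rest, P[i][j] = P[i][0] * P[0][j].
P = [[first_row[j] / first_row[i] for j in range(n)] for i in range(n)]

# For a perfectly consistent matrix, any normalized column yields the weights.
column_total = sum(P[i][0] for i in range(n))
ahp_weights = [P[i][0] / column_total for i in range(n)]

print([round(w, 4) for w in ahp_weights])
```

The resulting weights are proportional to the summed stakeholder weights (19/45, 17/45, 9/45), which agrees with the AHP weights row of Table 4.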
Figure 4 below shows the order in which the re-
trieved datasets are presented to the users based on
MDV, AHP, and ours (ties are broken using alpha-
betic order). Figures 4a and 4b display the order in
which the retrieved datasets are shown to all the users.
Figures 4c-4h show the order in which the results are
presented to each user according to their preferences.
One can see that the order is different from one user to
another and from each user to MDV and AHP-based re-rankings.
It can also be seen that the data value varies ac-
cording to the weights assigned to each metadata.
Therefore, we are going to measure the users’ satis-
faction rate in Section 6.2 below.
Footnote to the construction of P above: it also works when V is taken
as the first column vector of P instead of its first row vector. One
just needs to apply the reciprocity property of P and then its
transitivity property.

6.2 Users’ Satisfaction Evaluation
We define the user satisfaction rate as the probability
that users find the datasets they seek in the top
k ∈ {5, 10} of the retrieval results. Therefore, we
calculated the Jaccard score@5 and Jaccard score@10
between our approach and other re-ranking options.
We also computed NDCG, which measures the degree
to which the re-ranking based on users’ preferences
matches the other re-rankings.

Figure 4: Retrieved Datasets’ Re-ranking Based on MDV, AHP, and Ours.
Panels: (a) MDV; (b) AHP; (c) SH1 preferences; (d) SH3 preferences;
(e) User1 preferences; (f) User2 preferences; (g) User3 preferences;
(h) User4 preferences.
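To make the Jaccard score@k concrete, here is a minimal Python sketch. It assumes the score is the usual set-based Jaccard similarity between the top-k items of two rankings; the text does not spell out the exact computation (for instance, set overlap versus position-wise agreement), so this is an illustration rather than the authors' implementation. The function name and the dataset identifiers are hypothetical.

```python
def jaccard_at_k(ranking_a, ranking_b, k):
    """Jaccard similarity between the sets of the first k items of two rankings."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    if not union:
        return 0.0
    return len(top_a & top_b) / len(union)

# Hypothetical dataset identifiers, not the datasets used in the experiments.
personalised = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
baseline = ["d1", "d2", "d3", "d4", "d6", "d7", "d5"]

score = jaccard_at_k(personalised, baseline, k=5)
print(round(score, 4))  # 0.6667
```

Under this assumption, two top-k lists of k distinct items that share m items score m / (2k − m), so 4 shared items out of 5 give 4/6.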
Table 5 presents the evaluation results regarding
NDCG, Jaccard score@5, and Jaccard score@10 per
user. The highest and the lowest scores per user and
metric are highlighted in bold and red, respectively.
On average, there is a 42.24% and a 56.52% chance that
users will find the dataset they are seeking in the top 5
and top 10, respectively. The lowest chance is 0% for the
top 5 and 33.33% for the top 10, while the highest
chance is 100% in both cases. On average, the different
re-rankings match the users’ preferred ordering
81.81% of the time (the mean NDCG).
KDIR 2024 - 16th International Conference on Knowledge Discovery and Information Retrieval

Table 3: Dataset value dimension (metadata field) weights
provided by stakeholders. SH2 provided an invalid set of weights.
(SH) / Users
#Spatial Objects
SH1     10   8   5
SH3      9   9   4
User1    9   0   1
User2    7   1   7
User3    2   8   0
User4    0   4   2

Table 4: From stakeholders’ provided weights to AHP weights.
/ Age
#Spatial Objects
(Proxy for Size)
SH1                                  10       8       5
SH3                                   9       9       4
The sum of the provided weights      19      17       9
A pairwise comparison vector          1   19/17    19/9
AHP weights                      0.4222  0.3778     0.2

It can also be seen in Table 5 that the degree to
which a given re-ranking technique matches a user’s
preferred ordering does not predict the probability
of the user finding what they are seeking. For instance,
for SH1, 6month EMA got the highest NDCG
score. However, 6month EMA got the same Jaccard
score@5 as #Objects, MDV, and UT and a lower
Jaccard score@10 than #Objects.
This paper introduces a data valuation method that
can be used to re-rank dataset retrieval results. It
showed, using 12 datasets (the result of a query sent to
a data catalog) and 6 users (including two stakehold-
ers and 4 randomly generated using the uniform dis-
tribution of the weights), that there is only a 42.24%
and a 56.52% chance on average that users will find
the dataset they are seeking in the top 5 and top 10, re-
spectively. Users should find the information they are
seeking in the top 10 because, as shown by Jaenich
et al. (2024), the probability of a document being con-
sulted drops exponentially from the top 1 (100%) to
the top 10 (about 20%). In other words, if a document
is not in the top 10, its chances of being consulted are
less than 20%. It is important to re-rank retrieval re-
sults according to users’ interests because, in addition
to the query sent to a data catalog, users also have
preferences regarding the retrieved datasets’ properties
or metadata. In fact, Liu et al. (2020) argue that
IR scholars have agreed that major improvements
in search performance can only be achieved by
considering the users and their contexts, and thus their
preferences. This paper is a step in that direction by using
the users’ preferences to re-rank IR results.

Table 5: Evaluation Results.
User    Data Value            NDCG    Jaccard score@5  Jaccard score@10
SH1     #Objects              0.8035  0.6667           1.0000
SH1     6month EMA            0.8958  0.6667           0.6667
SH1     AHP (Qiu et al.)      0.7487  0.0000           0.3333
SH1     Alphabetic order      0.7506  0.4286           0.3333
SH1     Currency              0.8482  0.0000           0.3333
SH1     MDV (Ma and Zhang)    0.8445  0.6667           0.5385
SH1     UT (Chen)             0.8384  0.6667           0.6667

SH3     #Objects              0.8035  0.6667           1.0000
SH3     6month EMA            0.8958  0.6667           0.6667
SH3     AHP (Qiu et al.)      0.7487  0.0000           0.3333
SH3     Alphabetic order      0.7506  0.4286           0.3333
SH3     Currency              0.8482  0.0000           0.3333
SH3     MDV (Ma and Zhang)    0.8445  0.6667           0.5385
SH3     UT (Chen)             0.8384  0.6667           0.6667

User1   #Objects              0.7669  0.2500           0.5385
User1   6month EMA            0.8418  0.4286           0.5385
User1   AHP (Qiu et al.)      0.8170  0.4286           0.6667
User1   Alphabetic order      0.8199  0.2500           0.5385
User1   Currency              0.7846  0.4286           0.5385
User1   MDV (Ma and Zhang)    0.8540  0.4286           0.8182
User1   UT (Chen)             0.8320  0.4286           0.5385

User2   #Objects              0.8660  0.6667           0.8182
User2   6month EMA            0.8051  0.6667           0.6667
User2   AHP (Qiu et al.)      0.7857  0.0000           0.4286
User2   Alphabetic order      0.7692  0.4286           0.4286
User2   Currency              0.7215  0.0000           0.4286
User2   MDV (Ma and Zhang)    0.7524  0.6667           0.6667
User2   UT (Chen)             0.7493  0.6667           0.6667

User3   #Objects              0.9977  1.0000           1.0000
User3   6month EMA            0.8225  0.4286           0.6667
User3   AHP (Qiu et al.)      0.8040  0.0000           0.3333
User3   Alphabetic order      0.7571  0.4286           0.3333
User3   Currency              0.8128  0.0000           0.3333
User3   MDV (Ma and Zhang)    0.8174  0.4286           0.5385
User3   UT (Chen)             0.8239  0.4286           0.6667

User4   #Objects              0.8245  0.6667           0.6667
User4   6month EMA            0.9062  0.6667           0.8182
User4   AHP (Qiu et al.)      0.7434  0.0000           0.3333
User4   Alphabetic order      0.8174  0.4286           0.3333
User4   Currency              0.8891  0.0000           0.3333
User4   MDV (Ma and Zhang)    0.8623  0.6667           0.5385
User4   UT (Chen)             0.8556  0.6667           0.8182
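The averages quoted in the text can be re-derived from the rows of Table 5. The sketch below is only an arithmetic check: it copies the (NDCG, Jaccard score@5, Jaccard score@10) triples in the order the six blocks appear in the table and averages each column over all 42 rows. The variable names are illustrative.

```python
from statistics import mean

# (NDCG, Jaccard score@5, Jaccard score@10) rows copied from Table 5,
# one inner list per block of seven re-ranking options.
TABLE_5 = [
    [(0.8035, 0.6667, 1.0000), (0.8958, 0.6667, 0.6667), (0.7487, 0.0000, 0.3333),
     (0.7506, 0.4286, 0.3333), (0.8482, 0.0000, 0.3333), (0.8445, 0.6667, 0.5385),
     (0.8384, 0.6667, 0.6667)],
    [(0.8035, 0.6667, 1.0000), (0.8958, 0.6667, 0.6667), (0.7487, 0.0000, 0.3333),
     (0.7506, 0.4286, 0.3333), (0.8482, 0.0000, 0.3333), (0.8445, 0.6667, 0.5385),
     (0.8384, 0.6667, 0.6667)],
    [(0.7669, 0.2500, 0.5385), (0.8418, 0.4286, 0.5385), (0.8170, 0.4286, 0.6667),
     (0.8199, 0.2500, 0.5385), (0.7846, 0.4286, 0.5385), (0.8540, 0.4286, 0.8182),
     (0.8320, 0.4286, 0.5385)],
    [(0.8660, 0.6667, 0.8182), (0.8051, 0.6667, 0.6667), (0.7857, 0.0000, 0.4286),
     (0.7692, 0.4286, 0.4286), (0.7215, 0.0000, 0.4286), (0.7524, 0.6667, 0.6667),
     (0.7493, 0.6667, 0.6667)],
    [(0.9977, 1.0000, 1.0000), (0.8225, 0.4286, 0.6667), (0.8040, 0.0000, 0.3333),
     (0.7571, 0.4286, 0.3333), (0.8128, 0.0000, 0.3333), (0.8174, 0.4286, 0.5385),
     (0.8239, 0.4286, 0.6667)],
    [(0.8245, 0.6667, 0.6667), (0.9062, 0.6667, 0.8182), (0.7434, 0.0000, 0.3333),
     (0.8174, 0.4286, 0.3333), (0.8891, 0.0000, 0.3333), (0.8623, 0.6667, 0.5385),
     (0.8556, 0.6667, 0.8182)],
]

# Flatten the blocks and split the triples into one list per metric.
rows = [row for block in TABLE_5 for row in block]
ndcg, jaccard5, jaccard10 = (list(column) for column in zip(*rows))

for name, values in (("NDCG", ndcg), ("Jaccard@5", jaccard5), ("Jaccard@10", jaccard10)):
    print(f"{name}: mean={mean(values) * 100:.2f}% "
          f"min={min(values) * 100:.2f}% max={max(values) * 100:.2f}%")
```

The means come out at 81.81% (NDCG), 42.24% (Jaccard score@5), and 56.52% (Jaccard score@10), matching the figures reported in the text.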
In the future, we are planning to run a set
of queries on public data catalogs (e.g. Kaggle
) and collect the top k (k100) results sorted
by relevance and study the distribution of users’ sat-
isfaction through simulation.
This research has received funding from the ADAPT
Centre for Digital Content Technology, funded un-
der the SFI Research Centres Programme (Grant
13/RC/2106_P2), co-funded by the European Regional
Development Fund. For the purpose of Open
Access, the author has applied a CC BY public copy-
right licence to any Author Accepted Manuscript ver-
sion arising from this submission.
AHP Explained
AHP stands for Analytic Hierarchy Process and was
first introduced by Saaty (1987). It is used to cal-
culate the relative weights of the criteria in a multi-
criteria decision setting. For instance, a multi-criteria
decision consists of choosing the best dataset among
multiple datasets considering their currency, size, and
usage frequency, simultaneously.
AHP has 5 main components:
1. Criteria. Selection of the criteria to be considered
in the decision making.
2. Pairwise Comparisons of the Criteria. This
consists of comparing each criterion to all the
other criteria. There are n(n − 1)/2 pairwise
comparisons needed for n criteria.
3. Judgement Matrix P.
   • P is reciprocal: P(i, j) = 1/P(j, i).
   • The diagonal elements of P are equal to 1.
   • Each element of P is a strictly positive real number.
   • P(i, j) = 1 means criteria i and j are equivalent.
   • P(i, j) < 1 means criterion i is less important
     than criterion j.
   • P(i, j) > 1 means criterion i is more important
     than criterion j.
4. Criteria Weights. The weights are calculated using
the judgement matrix P. The details of the
calculation steps can be found in Qiu et al. (2017)
and Saaty (1987).
5. Consistency Ratio (CR). CR should be less than
or equal to 0.1 (10%). It measures the transitive
consistency of P.
   • Transitivity: if a = 2b and b = 3c, then a = 6c.
   • CR = 0 iff P is transitively consistent. Then
     P(i, j) = P(i, k) × P(k, j), for all i, j, and k.
With one row or column vector from the judge-
ment matrix P (a vector of n elements with at least
one element equal to 1), one can fill out the rest of
the judgement matrix P using its reciprocity and tran-
sitivity properties. This is how we derived the AHP
weights shown in Table 4.
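The fill-in procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the weights are obtained with the common column-normalisation (row-average) approximation, which is an assumption here since the text defers the exact calculation steps to Qiu et al. and Saaty. For a transitively consistent matrix, every normalised column is identical, so the approximation introduces no error.

```python
def judgement_matrix_from_first_row(first_row):
    """Fill a judgement matrix P from its first row.

    Reciprocity gives the first column, P(i, 1) = 1 / P(1, i);
    transitivity gives the rest, P(i, j) = P(i, 1) * P(1, j).
    """
    if first_row[0] != 1 or any(value <= 0 for value in first_row):
        raise ValueError("first row must start with 1 and be strictly positive")
    return [[(1 / p_1i) * p_1j for p_1j in first_row] for p_1i in first_row]

def ahp_weights(matrix):
    """Approximate AHP weights: normalise each column, then average each row."""
    n = len(matrix)
    column_sums = [sum(row[j] for row in matrix) for j in range(n)]
    return [sum(row[j] / column_sums[j] for j in range(n)) / n for row in matrix]

def is_transitively_consistent(matrix, tolerance=1e-9):
    """Check P(i, j) = P(i, k) * P(k, j) for all i, j, and k."""
    n = len(matrix)
    return all(
        abs(matrix[i][j] - matrix[i][k] * matrix[k][j]) <= tolerance
        for i in range(n) for j in range(n) for k in range(n)
    )

# Pairwise comparison vector from Table 4, used as the first row of P.
first_row = [1, 19 / 17, 19 / 9]
P = judgement_matrix_from_first_row(first_row)
weights = ahp_weights(P)

print([round(w, 4) for w in weights])  # [0.4222, 0.3778, 0.2]
print(is_transitively_consistent(P))   # True
```

The rounded weights agree with the AHP weights row of Table 4, and the matrix built this way is transitively consistent by construction, which corresponds to CR = 0.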