Hyperparameter Optimization for Search Relevance in E-Commerce
Manuel Dalcastagné and Giuseppe Di Fabbrizio
VUI, Inc., Boston, U.S.A.
Keywords: Hyperparameter Optimization, Differential Evolution, e-Commerce Search Relevance Optimization.
Abstract: The tuning of retrieval and ranking strategies in search engines is traditionally done manually by search experts in a time-consuming and often irreproducible process. A typical use case is field boosting in keyword-based search, where the ranking weights of different document fields are changed in a trial-and-error process to obtain what seems to be the best possible results on a set of manually picked user queries. Hyperparameter optimization (HPO) can automatically tune search engines' hyperparameters like field boosts and solve these problems. To the best of our knowledge, there has been little work in the research community regarding the application of HPO to search relevance in e-commerce. This work demonstrates the effectiveness of HPO techniques for optimizing the relevance of e-commerce search engines using a real-world dataset and evaluation setup, providing guidelines on key aspects to consider for the application of HPO to search relevance. Differential evolution (DE) optimization achieves up to 13% improvement in terms of NDCG@10 over baseline search configurations on a publicly available dataset.
1 INTRODUCTION
Modern e-commerce platforms rely on search engines
to help customers find relevant products from cata-
logs containing millions of items. Configuring these
platforms is challenging and requires carefully mod-
eling the query intent, product attributes, customer be-
havior, and other factors influencing relevance. Most
search engines have numerous hyperparameters that
can significantly impact both retrieval and ranking of
results. Traditionally, these options are tuned man-
ually in a time-consuming and often irreproducible
process as the queries, products, and customer pref-
erences evolve continuously over time.
In recent years, hyperparameter optimization
(HPO) techniques have been successfully used to automatically configure many types of algorithms as well
as complex machine learning models (Feurer and
Hutter, 2019; Eggensperger et al., 2019). HPO employs a class of methods usually called black-box or derivative-free, as no closed-form mathematical formulation of the objective function is necessary and the only requirement is a metric for numerical evaluation.
These techniques search through a multi-dimensional
space of possible hyperparameter configurations to
find the settings that optimize a performance metric
such as NDCG (Wang et al., 2013).
To the best of our knowledge, there has been little
work in the research community on e-commerce ap-
plications of HPO for search relevance. One notable
exception is the work by Cavalcante et al. (2020), who used Bayesian optimization to tune
the ranking function of a customer support search ap-
plication on a private dataset. However, their work
did not explore different query structures, field boost-
ing, query intent or query classification (Di Fabbrizio
et al., 2024). Our work is one of the first to systemat-
ically apply HPO techniques to optimize relevance in
e-commerce and to provide guidelines regarding the
application of HPO to this context.
The main contributions of this work are: 1) ap-
plication of differential evolution (DE) on a publicly
available e-commerce dataset for search relevance op-
timization; 2) analysis of the dataset’s label distribu-
tion impact on search relevance; 3) tuning of precision
and recall-oriented Elasticsearch queries, and variants
thereof, observing improvements up to 13% in terms
of NDCG@10; 4) insights into the impact of field
boosting, query structure, and query understanding on
relevance; 5) guidelines on key aspects to consider
when applying HPO to search relevance, such as the
characteristics of the search space, multifidelity, or the
use of multiple metrics for multi-objective optimiza-
tion.
The remainder of this paper is structured as fol-
lows. Section 2 provides the problem definition. Sec-
tion 3 introduces HPO and DE. Section 4 describes
the WANDS evaluation dataset. Section 5 presents
setup, results, and analysis of the experiments. Fi-
nally, Section 6 concludes the paper and outlines po-
tential future research.
2 PROBLEM DEFINITION
Users traditionally search by typing natural language
queries that define what they are looking for (user’s
intent). As a response, a search engine retrieves and
ranks a set of relevant documents from a corpus of
possibly multiple document types, whose specifics
are determined in dedicated document schemas. A
document type is represented as a collection of named
fields, also known as attributes or features, that are
employed to build ranking signals quantifying the rel-
evance of each field with respect to search queries.
2.1 Index Time and Query Time
Modern search engines index and query documents
at separate times, but decisions taken at index time
might impact both the performance and quality
of results retrieved at query time. At index time,
the fields of each document are analyzed and in-
dexed: each feature is divided into tokens, mapped
to a type (e.g., string, numeric, date), processed and
transformed into one or more fields that are indexed
(i.e., according to the signals to be modeled). Also,
details about the keyword and vector algorithms to be
used for ranking are usually defined at this point. For example, if using BM25 (Robertson and Zaragoza, 2009) as a ranking algorithm, its b and k_1 hyperparameters could be optimized during this phase.
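As an illustration of index-time tuning, the following sketch defines an Elasticsearch index with a custom BM25 similarity whose b and k_1 values could be exposed to the optimizer; the index and field names are hypothetical and the values shown are simply Elasticsearch's defaults:

# Hypothetical index definition exposing BM25's b and k1 as
# index-time hyperparameters; changing them requires reindexing.
index_body = {
    "settings": {
        "index": {
            "similarity": {
                "tuned_bm25": {"type": "BM25", "b": 0.75, "k1": 1.2}
            }
        }
    },
    "mappings": {
        "properties": {
            "product_name": {
                "type": "text",
                "analyzer": "english",
                "similarity": "tuned_bm25",
            }
        }
    },
}
# e.g., es.indices.create(index="products", **index_body)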
Although the application of HPO is possible at
both stages, doing so at index time is significantly
more expensive from a computational perspective, since changes to the index usually require the reindexing of
the whole corpus. This work focuses only on query-
time applications of HPO, but the same techniques
can be applied to optimize hyperparameters with im-
pact at index time.
2.2 HPO for Search Relevance
The application of HPO involves two steps. First, de-
fine the hyperparameters to tune (i.e., type, range, re-
lationships with other hyperparameters) and a budget
to spend for the optimization process (e.g., number
of function evaluations). Second, run an optimization
loop where a search algorithm iteratively explores the
space defined previously to find the best possible con-
figuration of the hyperparameters by using some user-
defined metric to evaluate each configuration.
In search relevance optimization (SRO), hyperpa-
rameters correspond to properties of the search engine
query (e.g., values of field boosts, type of logical op-
erators), while user-defined metrics are information
retrieval (IR) metrics like precision, recall, or NDCG.
Therefore, in order to evaluate a retrieval and ranking
strategy over a corpus of documents, a dataset should
contain a representative set of search queries and a
collection of sets of relevance labels, defining the rel-
evance of each document that could appear in the top
results of each user query.
More precisely, let D be a dataset of triplets (q, d, y), where q is a search query, d is a document, and y is a relevance label that defines the relevance of d for q, and let S be a search engine with a given index structure, whose output depends on a vector of hyperparameters θ ∈ Θ^t that defines an optimization search space of dimension t. The optimization goal is to heuristically find the best possible configuration θ* by using a training dataset D_train to estimate the performance of S during the optimization and a validation dataset D_val to prevent overfitting, so that θ* generalizes to a test dataset D_test that was not employed during the optimization. Ideally, all these datasets should be large enough to ensure statistically sound decisions. If D is not large enough, methods like k-fold cross-validation can be used to split the available data into folds to be combined as k training and test sets. Therefore, the performance of any θ is estimated as p_train,θ = S(θ, D_train) and, at the end of the optimization, the quality of θ* is estimated as p_test,θ* = S(θ*, D_test). Finally, to evaluate the contribution of the optimization, the whole process is repeated k times and the optimized performance of S is estimated as

$$\bar{p}_{\mathrm{test}} = \frac{1}{k} \sum_{i=1}^{k} S(\theta^*_i, D_{\mathrm{test},i}) \tag{1}$$

where θ*_i is the best configuration found at optimization i by using D_train,i as the training dataset and D_test,i as the test dataset from the i-th split. It is important to highlight that all splits are based on folds coming from the same initial randomized sampling process. As a result, the repeated estimation and averaging over multiple splits yields an estimate of the generalization error with lower variance (Kohavi, 1995).
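A minimal sketch of this evaluation protocol in Python, assuming a generic optimize routine (e.g., DE) and an illustrative search_engine_score function that runs the queries of a split against S under configuration θ and returns the chosen IR metric; neither name refers to a specific library:

import numpy as np
from sklearn.model_selection import KFold

def estimate_generalization(D, optimize, search_engine_score, k=5, seed=42):
    """Estimate the optimized performance of S as in Eq. (1): the average
    test score of the best configuration found on each of the k splits."""
    test_scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True,
                                     random_state=seed).split(D):
        D_train, D_test = D[train_idx], D[test_idx]
        # Inner loop: search Theta^t for the best configuration using
        # only the training split.
        theta_star = optimize(lambda theta: search_engine_score(theta, D_train))
        test_scores.append(search_engine_score(theta_star, D_test))
    return np.mean(test_scores)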
3 OPTIMIZATION
HPO algorithms are usually classified as model-free
(e.g., variants of stochastic search like differential
evolution) or model-based (e.g., Bayesian optimiza-
tion), where a model is used to estimate the response of the objective function to be optimized. Both
approaches have advantages and disadvantages, and
picking the right algorithm for the problem at hand
depends on multiple factors that include search space
characteristics (i.e., size, type of hyperparameters) or
latency requirements (Feurer and Hutter, 2019; Bischl
et al., 2023).
3.1 HPO Search Space and Latency
Due to the curse of dimensionality (Bellman, 1966),
the size of the search space has a large influence on
the optimization. The larger the size, the harder it is
for the algorithm to find well-performing configura-
tions of the hyperparameters. Furthermore, not all al-
gorithms are able to scale with the number of dimen-
sions. For example, standard Bayesian optimization
(BO) based on Gaussian processes is not usually effi-
cient on problems with more than 20 dimensions, but
it excels in continuous spaces (Eggensperger et al.,
2013; Frazier, 2018). In contrast, BO based on random forests and evolutionary algorithms like DE are not as efficient, but they are able to handle larger search spaces with mixed hyperparameters as well.
When performance evaluations are computation-
ally expensive, which can happen when the objective
function requires training on large datasets, it might
be helpful to consider multi-fidelity algorithms like
Successive Halving (Jamieson and Talwalkar, 2016)
or Hyperband (Li et al., 2017), which schedule the use of low-fidelity (less expensive) and high-fidelity (more expensive) evaluations during the optimization to spend the budget efficiently. For example, DEHB (Awad et al., 2021) combines differential evolution (DE), as the optimization algorithm that searches for θ, with a variant of Hyperband, performing better than the better-known BOHB (Falkner et al., 2018) on a wide range of problems, including the tuning of deep neural networks.
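A compact sketch of the Successive Halving scheme, where score_at is an illustrative function that evaluates a configuration at a given fidelity (e.g., a fraction of the training queries) and eta controls how aggressively configurations are discarded:

import numpy as np

def successive_halving(configs, score_at, min_budget=25, max_budget=400, eta=2):
    """Evaluate all configurations at a low fidelity, keep the best 1/eta
    fraction, and repeat with eta times the budget until one survives."""
    budget = min_budget
    while len(configs) > 1 and budget <= max_budget:
        scores = [score_at(theta, budget) for theta in configs]
        order = np.argsort(scores)[::-1]          # maximize the IR metric
        configs = [configs[i] for i in order[:max(1, len(configs) // eta)]]
        budget *= eta
    return configs[0]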
Optimization algorithms also differ in how well they can be parallelized. Model-free algorithms are usually more scalable, since model-based methods rely on a common model that must be iteratively updated. For
more details, refer to (Feurer and Hutter, 2019; Bischl
et al., 2023).
3.2 Differential Evolution
The optimization algorithm used in the experiments is
Differential evolution (Storn and Price, 1997), which
is an evolutionary algorithm inspired by the concepts
of biological evolution and natural selection, specifi-
cally by how the offspring inheriting the best traits of
a population evolve over generations.
At the beginning of the process, a population p_0 = (θ_1, ..., θ_n) is randomly sampled from Θ^t. Until some user-defined optimization budget b is consumed, DE works iteratively in three steps: mutation, crossover, and selection. During the mutation phase, each member θ of the population p_i at the current iteration i is evaluated by computing S(θ, D_train); then, a new set of n offspring is generated, where each new offspring θ_new results from applying a scaled perturbation to each dimension of a combination of randomly picked parents from p_i. A crossover operator combines each member of p_i with one of the new offspring θ_new by picking, for each dimension, with some probability, which value from the two vectors should be used for the mutant configuration θ_mutant. Finally, θ_mutant is compared with θ, and θ_mutant takes the place of θ if its quality is better.
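A compact sketch of the classic rand/1/bin DE variant on a continuous, box-constrained space (categorical and ordinal hyperparameters like those in Table 2 would additionally require an encoding step); score stands for S(θ, D_train) and all names are illustrative:

import numpy as np

def differential_evolution(score, bounds, n_pop=20, F=0.5, CR=0.9,
                           budget=400, seed=0):
    """Maximize `score` over the box defined by `bounds` (shape (t, 2))."""
    rng = np.random.default_rng(seed)
    low, high = bounds[:, 0], bounds[:, 1]
    t = len(bounds)
    pop = rng.uniform(low, high, size=(n_pop, t))   # initial population p_0
    fitness = np.array([score(theta) for theta in pop])
    evals = n_pop
    while evals < budget:
        for i in range(n_pop):
            # Mutation: perturb one random member with the scaled
            # difference of two other randomly picked members.
            a, b, c = pop[rng.choice(n_pop, size=3, replace=False)]
            offspring = np.clip(a + F * (b - c), low, high)
            # Crossover: pick, per dimension, the value of either the
            # current member or the offspring to form the mutant.
            mask = rng.random(t) < CR
            mask[rng.integers(t)] = True            # keep >= 1 mutated dim
            mutant = np.where(mask, offspring, pop[i])
            # Selection: the mutant replaces the member only if better.
            f_mutant = score(mutant)
            evals += 1
            if f_mutant > fitness[i]:
                pop[i], fitness[i] = mutant, f_mutant
            if evals >= budget:
                break
    return pop[np.argmax(fitness)]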
4 EVALUATION DATASET
The Wayfair Annotation DataSet (WANDS) is an
open-source e-commerce product dataset designed to
evaluate the relevancy of e-commerce product search
engines (Chen et al., 2022). As described in Table 1,
the WANDS dataset contains:
- 480 search queries sampled from real search logs of Wayfair, a major e-commerce retailer, with two features: query text and class. For example, a query like smart coffee table belongs to the Coffee & Cocktail Tables class. The queries were sampled in a stratified fashion to cover various dimensions such as popularity, seasonality, and whether they led to customer purchases. This ensures the query set is representative of real customer search behavior.
- 42,994 products sampled from Wayfair’s catalog, with nine features, of which only the following five textual features were used for field boosting in the experiments: product name, class, description, category hierarchy, and list of features. For example, a product named solid wood platform bed belongs to the Bed class within a category hierarchy like Furniture / Bedroom Furniture / Beds and has a list of features that contains information like color, material, size, or weight.
- For each query, Wayfair selected a set of potentially relevant products using a combination of customer click logs, lexical search systems, and neural retrieval models. Specifically, the dataset authors employed two strategies to construct the product pool:
  1. They leveraged user engagement data (clicks and add-to-cart events), hypothesizing that products users clicked on are a good approximation of potentially relevant products, while products users clicked on but did not add to the cart could be hard negatives or almost-relevant products.
  2. They further mined the product catalog using an open-source lexical search engine (Apache Solr) and a neural product retrieval system inspired by (Nigam et al., 2019). The two systems provide complementary ways to retrieve relevant products, removing the bias related to the use of a single lexical retrieval source. Moreover, this hybrid approach ensures the product set contains both obviously relevant products as well as more challenging cases that can help discriminate between different retrieval systems.
- 233,448 (query, product) pairs assigning one out of three relevance labels to the match of query and product: exact (1.0) if the product is completely relevant to the query, partial (0.5) if the product matches some but not all aspects of the query, and irrelevant (0.0) if the product is not relevant to the query.
Note that the statistics are based on the most recent version available on GitHub (https://github.com/wayfair/WANDS), which is slightly different from the version in (Chen et al., 2022).
A group of trained human annotators provided the
labels following a rigorous set of annotation guide-
lines. Each (query, product) pair was judged by up
to 3 annotators, and the ratings were aggregated us-
ing a majority vote. The WANDS dataset was con-
structed through multiple rounds of annotation and
refinement. The inter-annotator agreement, measured
by Cohen’s Kappa (Cohen, 1960), improved from a
moderate 0.467 in the initial months to a substantial
0.826 after a few iterations of guideline refinement
and annotator training. This indicates the dataset la-
bels are of high quality and consistency.
A key feature of WANDS is that it aims for completeness: for a given query, the dataset tries to include relevance labels for all the relevant products from the catalog subset, not just the top few results. This is achieved through an iterative “cross-
referencing” process during dataset construction that
identifies potentially relevant products that were not
covered in the initial labeling. Completeness is im-
portant for unbiased offline evaluation as it avoids
missing relevant products that could unfairly penalize
certain retrieval systems. The complete, multi-graded
relevance labels allow for a robust evaluation of the
ranking quality of search engines using metrics like
NDCG.
To evaluate the difficulty of the search relevance
task in the WANDS dataset, we analyzed the distribu-
tion of relevance labels (exact match, partial match,
irrelevant) across the queries. The goal was to un-
derstand how many queries have products labeled as
only exact matches, only partial matches, only irrele-
vant, or a mixture of these labels. This analysis pro-
vides insights into the difficulty of ranking the search
results for each query.
While one might assume that, on average, each query contains the same proportion of exact, partial, and irrelevant labels as the overall distribution in the dataset, we found that:
- 0 queries have products with only the Exact label, 24 queries have products with only the Partial label, and 1 query has products with only the Irrelevant label;
- 33 queries have products with only Exact and Partial labels, 11 queries have products with only Exact and Irrelevant labels, and 76 queries have products with only Irrelevant and Partial labels.
This analysis reveals that 25 queries do not have
an impact on NDCG, and 11 queries should have re-
sults that are relatively easy to rank. Around 100
queries are of medium difficulty, while the rest are
more challenging. However, the distribution of la-
bels across queries is not balanced, which is an im-
portant consideration for learning to rank (LtR) mod-
els (Goswami et al., 2018). If the number of labels
per type is imbalanced, the model may be more prone
to overfitting. For example, it is easier to achieve a good NDCG score on queries with a highly skewed distribution of exact and partial matches than on queries with a more balanced distribution of the two.
This analysis highlights the importance of consid-
ering the distribution of relevance labels when eval-
uating the difficulty of the search relevance task and
the potential impact on the performance of ranking
models. The WANDS dataset provides a diverse set
of queries with varying levels of difficulty, making it
a valuable resource for evaluating and comparing dif-
ferent search engines and ranking algorithms in the
e-commerce domain.
5 EXPERIMENTS AND RESULTS
Table 1: Summary of key data statistics about the WANDS dataset.

Number of queries: 480
Number of products: 42,994
Number of (query, product) relevance labels: 233,448
Relevance label scale: 0-2
Relevance label distribution: Exact: 25,614; Partial: 146,633; Irrelevant: 61,201
Search queries from: real search logs of Wayfair
Products sampled from: Wayfair’s catalog
Annotators per (query, product) pair: up to 3
Inter-annotator agreement (Cohen’s Kappa): 0.826

We provide experimental results to demonstrate how hyperparameter optimization can be leveraged to
automate solutions to many information retrieval
and search problems commonly encountered in e-
commerce. The focus is on optimizing hyperparam-
eters of search queries and ranking signals used by
search engines in keyword search. Elasticsearch is
used as an experimental framework, but the tech-
niques mentioned in this section are applicable to any
other engine that supports the manual tuning of its
components.
Experiments start from the observation that both TF-IDF and BM25 have limits as ranking strategies, which can be partially addressed through the optimization of field boosts. It is worth mentioning
that well-tuned boosts are critical not only to rank the
expected importance of different signals but also to
balance the range of the respective BM25 scores.
5.1 BM25 Limits and Field Boosting
Scores based on TF-IDF have some shortcomings,
which are partially solved by the BM25 formulation.
TF-IDF’s score for a term in a corpus is computed
as the product of term frequency and inverse docu-
ment frequency. A problem comes from the unconstrained impact of term frequency on the score (i.e., a document where a term appears n times is considered n times more relevant than a document where it appears once). Also, the length of a
document does not weight the relevance of its terms
(e.g., if a term appears once in a document contain-
ing 10 words, it is considered to be as relevant as if
the term appears once in a document containing 1000
words). BM25’s b parameter controls the degree to which document length normalizes the score, determining a penalty for documents longer than the average, while the influence of repeated terms on the score is saturated by BM25’s k_1 parameter.
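For reference, the standard BM25 formulation from (Robertson and Zaragoza, 2009), where f(q_i, D) is the frequency of term q_i in document D, |D| is the document length, and avgdl is the average document length in the corpus:

$$\text{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$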
Nonetheless, the scores of fields can be on differ-
ent scales due to distribution differences of frequen-
cies and document lengths and are, therefore, not di-
rectly comparable. Also, by definition, these scores are biased towards informativeness, often against users’ needs (i.e., rare matches within a document score higher, while users usually look for popular items).
Field boosting helps counterbalance the aforemen-
tioned problems by prioritizing and balancing signals
from different fields. In fact, a search query usually
contains more than one string and possibly multiple
concepts. It does not come as a surprise that the in-
formation required to return relevant results is often
stored in multiple fields.
Elasticsearch tries to solve some of TF-IDF’s problems by changing how token frequencies are combined to compute scores during a multi-field search, considering the frequencies coming from multiple fields at the same time. In particular, field-centric search (e.g., multi_match best_fields and most_fields) leans towards precision by promoting results which satisfy criteria based on the signals which are expected to match the user’s search, while term-centric search (e.g., cross_fields, combined_fields) leans towards recall by selecting all possibly relevant search results (Turnbull and Berryman, 2016). The use of either the conjunctive (AND) or the disjunctive (OR) operator further pushes these queries towards precision or recall, respectively.
The combination of recall-oriented and precision-oriented clauses in a stratified query improves the ranking of the results returned to the user (Turnbull and Berryman, 2016). In Elasticsearch, this can be achieved using a boolean query, which matches documents satisfying boolean combinations of other queries (e.g., multi_match queries), where some clauses provide a recall-oriented base score that is improved by other precision-oriented clauses. For example, the base score may come from a multi_match cross_fields query searching in all text fields, while other scores may come from multi_match best_fields or most_fields queries based on high-quality signals.
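The sketch below shows what such a stratified query could look like as an Elasticsearch request body built in Python; the WANDS-style field names, boosts, and parameter values are illustrative placeholders for the hyperparameters of Table 2, not the tuned configurations of the experiments:

# Hypothetical stratified query: a recall-oriented cross_fields base
# clause plus a precision-oriented best_fields clause; every boost,
# operator, and minimum_should_match value is a tunable hyperparameter.
stratified_query = {
    "query": {
        "bool": {
            "should": [
                {   # recall-oriented base score over all text fields
                    "multi_match": {
                        "query": "smart coffee table",
                        "type": "cross_fields",
                        "operator": "or",
                        "minimum_should_match": "20%",
                        "fields": ["product_name^3", "product_class^2",
                                   "product_description",
                                   "category_hierarchy", "product_features"],
                    }
                },
                {   # precision-oriented clause on high-quality signals
                    "multi_match": {
                        "query": "smart coffee table",
                        "type": "best_fields",
                        "operator": "and",
                        "tie_breaker": 0.3,
                        "fields": ["product_name^10", "product_class^5"],
                    }
                },
            ]
        }
    }
}
# e.g., hits = es.search(index="wands-products", body=stratified_query)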
5.2 User’s Intent
Understanding the user’s intent is another critical sig-
nal that significantly improves search relevance in e-
commerce. This involves classifying whether the user
is searching for a specific product category or asking
a broader, more informational question. By using a
machine learning model to predict the user’s intent
based on query structure, search patterns, and histori-
cal behavior, the system can adjust its ranking strategy
to deliver more relevant results. For example, when
seeking a specific product, search results can priori-
tize relevant items from the desired category. Con-
versely, if the user’s query is informational, the sys-
tem can prioritize results such as FAQs, reviews, or
other informative content. Incorporating intent pre-
diction into the search optimization process allows for
more accurate recommendations and a highly person-
alized shopping experience.
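As an illustration only (the experiments in Section 5.6 add a multi-field clause for this purpose), a predicted query class could be injected as an additional boosted clause on the product class field; the classifier, field name, and boost below are assumptions:

def add_intent_clause(query_body, predicted_class, class_boost=5.0):
    """Hypothetical sketch: boost products whose class matches the class
    predicted for the user query by an upstream intent classifier."""
    query_body["query"]["bool"]["should"].append({
        "match": {
            "product_class": {
                "query": predicted_class,   # e.g., "Coffee & Cocktail Tables"
                "boost": class_boost,       # tunable hyperparameter
            }
        }
    })
    return query_body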
5.3 Multi-Objective Optimization
In the experiments, we use NDCG@10 over the la-
beled dataset to evaluate the relevance performance
of a given search engine configuration θ. The nor-
malized discounted cumulative gain (NDCG) (Wang
et al., 2013) measures the relevance of the top-ranked
results, putting more emphasis on the relevance of re-
sults at higher ranks (Järvelin and Kekäläinen, 2000).
This aligns well with the behavior and preferences of e-commerce users, who tend to focus mainly on the first page of results. Still, while a single
metric is a good starting point for assessing the qual-
ity of search relevance performance, it might only tell
part of the story.
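For reference, a common formulation of the metric, where rel_i is the graded relevance label of the result at rank i and IDCG@k is the DCG@k of the ideal ordering:

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$$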
NDCG assumes that labeled documents are uni-
formly distributed in the ranked list, which is usually
untrue. In Section 4, we showed that even a well-built dataset like WANDS presents extreme situations where not all relevance labels are found for more than 100 user queries and, in some cases, only one class of relevance labels might be retrieved. A metric like
NDCG cannot detect such scenarios and would return
a perfect value even if some queries were evaluated,
for example, only on irrelevant documents. To ob-
tain robust evaluations, one should combine at least
an order-aware metric like NDCG or Mean Recipro-
cal Rank with an order-unaware metric like Precision
or Recall. For further details about these indicators or
variants thereof, please refer to (Valcarce et al., 2018).
The optimization of multiple equally important
but conflicting objectives is named multi-objective
optimization, where solutions that optimize all ob-
jectives simultaneously usually do not exist (Helfrich
et al., 2023). In this scenario, heuristic algorithms try to find efficient, non-dominated solutions with respect to the defined objectives. An alternative solution is to employ scalarization techniques to systematically approximate a multi-objective optimization problem as a regular single-objective optimization problem with the help of additional parameters such as weights, and to use regular optimization algorithms to solve the re-
sulting scalarization. For further details about multi-
objective HPO algorithms, please refer to (Feurer and
Hutter, 2019; Bischl et al., 2023).
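A minimal sketch of a weighted-sum scalarization that turns an order-aware and an order-unaware metric into a single objective for a regular HPO algorithm like DE; the metric evaluator and the weights are illustrative assumptions:

def scalarize(metrics, weights):
    """Combine multiple IR metrics into one objective via a weighted sum.
    `metrics` maps a configuration theta to a tuple of metric values,
    e.g., (NDCG@10, Recall@10); `weights` should sum to 1."""
    def objective(theta):
        return sum(w * m for w, m in zip(weights, metrics(theta)))
    return objective

# e.g., objective = scalarize(eval_ndcg_and_recall, (0.7, 0.3))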
5.4 Experimental Setup
Text fields from WANDS were indexed using Elastic-
search’s English analyzer, without any additional pre-
processing steps. In particular, all experiments were
run on Elasticsearch 8.8.2 and Python 3.10. To ensure
replicability and improved comparison of results, all
splits and optimization runs were carried out multi-
ple times with a common set of random seeds. This
ensured that the evaluations utilized to build estima-
tors were paired. In addition, we computed random
ranking values based on 5 repetitions, similar to how
k-fold cross-validation was employed with k = 5. Depending on the experiment, the search space size varies from 8 to 27 dimensions, and each optimization run is executed up to a budget b of 400 function evaluations.
Results on the test set are considered only for evalu-
ation purposes at b = 50, 100, 200, and 400. Unless
explicitly defined, the experiments’ optimized hyper-
parameters were defined as in Table 2. For further de-
tails about the role of these hyperparameters in multi-
field queries, please refer to Elasticsearch documen-
tation. Finally, the DE implementation used in the experiments is the default version available on GitHub (https://github.com/automl/DEHB) from the Python package created by the authors of DEHB.
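A sketch of how such paired evaluations can be produced, assuming every repetition draws its folds from one shared list of seeds so that all configurations are scored on identical splits; the seed values are placeholders:

import numpy as np
from sklearn.model_selection import KFold

SEEDS = [11, 22, 33, 44, 55]   # illustrative common seeds, one per repetition

def paired_splits(query_ids):
    """Yield (repetition, train_ids, test_ids) with a fixed seed per
    repetition, so every optimization run sees exactly the same folds."""
    query_ids = np.asarray(query_ids)
    for rep, seed in enumerate(SEEDS):
        kf = KFold(n_splits=5, shuffle=True, random_state=seed)
        for train_idx, test_idx in kf.split(query_ids):
            yield rep, query_ids[train_idx], query_ids[test_idx]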
5.5 Random and Standard Baselines
In this work, we consider both random and standard
ranking as baselines against which to evaluate the
contribution of HPO. The random ranking provides a
ranking baseline for the problem, by assigning to each
document from the set of results of a retrieval strategy
a pseudorandom number in the range [0,1]. As a re-
sult, it is possible to compute any performance met-
ric on the resulting ranked list. For example, if using
NDCG, higher values imply easier ranking problems.
Similar insights can be obtained by analyzing the distribution of relevance labels across the dataset.
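A sketch of the random baseline for a single query, assuming retrieved holds the (document, label) pairs returned by some retrieval strategy and using the linear-gain NDCG variant:

import math
import random

def random_ranking_ndcg(retrieved, k=10, seed=0):
    """Assign each retrieved document a pseudorandom score in [0, 1],
    rank by that score, and compute NDCG@k on the resulting list."""
    rng = random.Random(seed)
    ranked = sorted(retrieved, key=lambda _: rng.random())   # random order
    gains = [label for _, label in ranked[:k]]
    ideal = sorted((label for _, label in retrieved), reverse=True)[:k]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0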
Table 2: Hyperparameters used in the experiments.

Name | Type | Range | Default value
operator | categorical | {and, or} | or
type | categorical | {best_fields, most_fields, cross_fields} | none
minimum_should_match | ordinal | {0%, 20%, 40%, 60%, 80%, 100%} | none
tie_breaker | float | [0, 1] | 0
boost | float | [0, 100] | 1
Standard ranking quantifies how the standard config-
uration of a search engine’s ranking strategy performs
with respect to a completely random ranking strategy.
Unlike the general HPO scenario, where good initial configurations of hyperparameters are usually unknown, search engines come with default values that work well on average. As a consequence, the corresponding ranking performance should also be considered as a baseline.
5.6 Optimization Improvements
The contribution of the optimization to ranking strate-
gies is empirically estimated by showing the improve-
ment that DE is able to achieve with respect to the
standard ranking of multiple retrieval queries with
a fixed structure and increasing difficulty. Results
show that the optimization contributed, on average,
to an improvement of approximately 0.05 in terms of
NDCG@10 across 12 cases. The optimization strategy was also employed to fully optimize both the retrieval and ranking parts of each type of Elasticsearch query used in the experiments, reaching results comparable to those of its optimized counterparts with a fixed retrieval structure.
Results were built on three main types of Elasticsearch queries of increasing complexity. In the
first set of experiments (Table 3, top), basic types of
multi-field query are used distinctively in combina-
tion with both conjunctive and disjunctive operators.
On average, the optimization achieves an improve-
ment of 0.07. Once optimized, precision-oriented
queries achieve the same results and, therefore, only one of the two is considered in the following experiments. A Boolean query is employed
to build a stratified query in the second set of exper-
iments (Table 3, middle). On average, the optimiza-
tion achieves an improvement of 0.05. The best re-
sults are interestingly achieved by combining a recall-
oriented query based on the conjunctive operator and
a precision-oriented query based on the disjunctive
operator. In the third set of experiments (Table 3, bot-
tom), the best stratified query from the previous ex-
periments is extended with an additional multi-field
query that considers user intent. On average, the op-
timization achieves an improvement of 0.04, and the
introduction of user intent contributes approximately
0.03 - 0.04 with respect to the best results from the
previous sets of experiments.
5.7 Retrieval Relaxation Improvements
All results show that, on average, queries using
the conjunctive operator perform worse than queries
adopting the disjunctive operator. In particular, ran-
dom ranking results allow us to infer that performance
values can be improved by relaxing the matching re-
quirements and retrieving more potentially relevant
documents that could be otherwise excluded from fur-
ther ranking refinement. This behavior aligns with
modern multi-stage IR systems that rely on multiple
ranking phases, where the first phase focuses on recall and successive steps on precision (Dang et al., 2013; Zhou and Devlin, 2021).
6 CONCLUSIONS
This work demonstrates the potential for HPO tech-
niques to substantially improve the search relevance
of e-commerce engines with minimal human effort in
a reproducible and automatic process, providing in-
sights into the impact of field boosting, retrieval query
structure, and query understanding on relevance, as
well as guidelines on the application of HPO to search
relevance in e-commerce.
By leveraging the WANDS evaluation dataset and
DE as HPO algorithm, we automatically optimized
both retrieval and ranking strategies of Elasticsearch
queries, improving NDCG@10 up to 13% with re-
spect to baseline configurations. The introduction of the user’s intent in the search strategy, defined as the correspondence between the categories of the user query and the document, brought an improvement of up to 4%.
Finally, results showed that the relaxation of the re-
trieval strategy led to significantly better results. De-
fault search engine configurations leave significant
room for relevance improvements that can be unlocked with HPO, through a reproducible process that does not keep humans in the never-ending loop of manual search relevance optimization.

Table 3: Best results from the first set (top), the second set (middle), and the third set of experiments (bottom). All performance metrics are expressed as averaged NDCG@10 with standard deviation, and results with the highest average are in bold for each column.

Query type | Operator | Space size | Random | Standard | Optimized
cross_fields | OR | 8 | 0.53 ± 0.01 | 0.60 ± 0.00 | 0.73 ± 0.02
cross_fields | AND | 8 | 0.49 ± 0.00 | 0.52 ± 0.01 | 0.59 ± 0.02
best_fields | OR | 8 | 0.54 ± 0.01 | 0.60 ± 0.01 | 0.73 ± 0.01
best_fields | AND | 8 | 0.46 ± 0.00 | 0.48 ± 0.01 | 0.52 ± 0.04
most_fields | OR | 8 | 0.60 ± 0.00 | 0.69 ± 0.00 | 0.73 ± 0.01
most_fields | AND | 8 | 0.49 ± 0.00 | 0.51 ± 0.01 | 0.52 ± 0.04
optimized | optimized | 10 | / | / | 0.75 ± 0.02
stratified | OR, OR | 17 | 0.62 ± 0.00 | 0.71 ± 0.00 | 0.74 ± 0.02
stratified | OR, AND | 17 | 0.59 ± 0.00 | 0.64 ± 0.00 | 0.74 ± 0.03
stratified | AND, OR | 17 | 0.64 ± 0.00 | 0.72 ± 0.00 | 0.75 ± 0.02
stratified | AND, AND | 17 | 0.52 ± 0.00 | 0.55 ± 0.01 | 0.58 ± 0.02
optimized | optimized | 21 | / | / | 0.74 ± 0.02
stratified, most_fields | AND, OR, AND | 21 | 0.65 ± 0.00 | 0.74 ± 0.00 | 0.77 ± 0.02
stratified, cross_fields | AND, OR, OR | 21 | 0.65 ± 0.00 | 0.74 ± 0.00 | 0.78 ± 0.03
optimized | optimized | 27 | / | / | 0.78 ± 0.02
Picking the best algorithm for search relevance
optimization depends on various factors including the
size and type of hyperparameters, as well as multi-
fidelity and multi-objective requirements. Evolution-
ary algorithms like DE are capable of handling large
mixed search spaces, but unless the size of the search
space goes beyond hundreds of dimensions, random-
forests-based BO is another possible option. Further-
more, when performance evaluations are expensive
due to the need for large datasets, options such as
multi-fidelity HPO algorithms should be considered.
Finally, to obtain robust configurations, one should
consider multi-objective HPO algorithms to optimize
for both order-aware and order-unaware metrics, or to
create a scalarization of such metrics to apply regular
HPO algorithms like DE.
While the optimal configuration will vary for each
search application, this work establishes a general
framework, methodology, and best practices for ap-
plying HPO to improve search relevance. With the in-
creasing availability of easy-to-use HPO libraries and
their integration with popular search engines, we be-
lieve this is a highly promising direction to improve
the search experience for e-commerce customers with
less manual effort and greater reproducibility.
This work focuses on optimizing keyword-based
search, but it is worth noting the complementary role
of dense vector search using learned semantic rep-
resentations (Mitra and Craswell, 2018). In many
search use cases where user queries primarily con-
sist of named entities like product names or brands,
exact keyword matching remains critical and even
preferable. However, modern search engines offer
hybrid search capabilities that combine the strengths
of sparse keyword-based retrieval with dense vector
search. This hybrid approach is commonly used in
retrieval augmented generation (RAG) architectures,
as purely semantic search can miss obvious keyword
matches needed for accurate product retrieval in e-
commerce and for more strongly grounded factual
knowledge retrieval (Lewis et al., 2020).
Finally, several other promising avenues for future work in this area include:
- Exploration of the benefits that HPO can bring to hybrid search, such as improvements to the fine-tuning process of embedding models used in dense vector search or the configuration of other hyperparameters used in multi-stage IR systems;
- Application of multi-objective optimization to jointly optimize multiple metrics that measure different aspects of the results;
- Investigation of possible interactions as well as differences between HPO and LtR techniques for search relevance.
REFERENCES
Awad, N. H., Mallik, N., and Hutter, F. (2021). DEHB: Evo-
lutionary hyperband for scalable, robust and efficient
hyperparameter optimization. Proceedings of IJCAI.
Bellman, R. (1966). Dynamic programming. Science,
153(3731):34–37.
Bischl, B., Binder, M., Lang, M., Pielok, T., Richter,
J., Coors, S., Thomas, J., Ullmann, T., Becker, M.,
Boulesteix, A.-L., et al. (2023). Hyperparameter op-
timization: Foundations, algorithms, best practices,
and open challenges. Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery.
Cavalcante, L., Lima, U., Barbosa, L., Gomes, A. L., Éden Santana, and Martins, T. (2020). Improving Search Quality with Automatic Ranking Evaluation and Tuning. In Anais do XXXV Simpósio Brasileiro de Bancos de Dados, Brasil.
Chen, Y., Khrennikov, D., Ferrer, I., and Verberne, S.
(2022). WANDS: A Dataset for Web-based Product
Search. In European Conference on Information Re-
trieval, pages 61–75. Springer.
Cohen, J. (1960). A Coefficient of Agreement for Nominal
Scales. Educational and Psychological Measurement.
Dang, V., Bendersky, M., and Croft, W. B. (2013). Two-
stage learning to rank for information retrieval. In Ad-
vances in Information Retrieval. Springer.
Di Fabbrizio, G., Stepanov, E., and Tessaro, F. (2024).
Extreme Multi-label Query Classification for E-
commerce. In The SIGIR 2024 Workshop on eCom-
merce, Washington, D.C., USA.
Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., et al.
(2013). Towards an empirical foundation for assessing
bayesian optimization of hyperparameters. In NIPS
workshop on Bayesian Optimization in Theory and
Practice, Nevada. Curran Associates, Inc.
Eggensperger, K., Lindauer, M., and Hutter, F. (2019). Pit-
falls and best practices in algorithm configuration.
Journal of Artificial Intelligence Research, 64:861–
893.
Falkner, S., Klein, A., and Hutter, F. (2018). Bohb: Robust
and efficient hyperparameter optimization at scale. In
International conference on machine learning.
Feurer, M. and Hutter, F. (2019). Hyperparameter Opti-
mization, chapter 1, pages 3–38. Springer, Cham.
Frazier, P. I. (2018). Bayesian optimization. In Recent ad-
vances in optimization and modeling of contemporary
problems, pages 255–278. Informs.
Goswami, A., Zhai, C., and Mohapatra, P. (2018). Learning
to rank and discover for e-commerce search. In 14th
International Conference on Machine Learning and
Data Mining in Pattern Recognition (MLDM 2018),
pages 331–346, Germany. Springer.
Helfrich, S., Herzel, A., Ruzika, S., and Thielen, C. (2023).
Using scalarizations for the approximation of multiob-
jective optimization problems: towards a general the-
ory. Mathematical Methods of Operations Research,
pages 1–37.
Jamieson, K. and Talwalkar, A. (2016). Non-stochastic best
arm identification and hyperparameter optimization.
In Artificial intelligence and statistics.
Järvelin, K. and Kekäläinen, J. (2000). IR evaluation meth-
ods for retrieving highly relevant documents. In Pro-
ceedings of the 23rd Annual International Confer-
ence on Research and Development in Information
Retrieval, New York, NY, USA. Association for Com-
puting Machinery.
Kohavi, R. (1995). A study of cross-validation and boot-
strap for accuracy estimation and model selection. In
Proceedings of the 14th International Joint Confer-
ence on Artificial Intelligence - Volume 2, IJCAI’95,
page 1137–1143, San Francisco, CA, USA. Morgan
Kaufmann Publishers Inc.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin,
V., Goyal, N., et al. (2020). Retrieval-Augmented
Generation for Knowledge-Intensive NLP Tasks. In
Advances in Neural Information Processing Systems,
volume 33, pages 9459–9474. Curran Associates, Inc.
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and
Talwalkar, A. (2017). Hyperband: a novel bandit-
based approach to hyperparameter optimization. J.
Mach. Learn. Res., 18(1):6765–6816.
Mitra, B. and Craswell, N. (2018). An introduction to neu-
ral information retrieval. Foundations and Trends in
Information Retrieval.
Nigam, P., Song, Y., Mohan, V., Lakshman, V., Ding, W. A.,
Shingavi, A., Teo, C. H., Gu, H., and Yin, B. (2019).
Semantic product search. In Proceedings of KDD,
New York, NY, USA. Association for Computing Ma-
chinery.
Robertson, S. and Zaragoza, H. (2009). The probabilis-
tic relevance framework: BM25 and beyond. Found.
Trends Inf. Retr., 3(4):333–389.
Storn, R. and Price, K. (1997). Differential evolution - a
simple and efficient heuristic for global optimization
over continuous spaces. J. of Global Optimization,
11(4):341–359.
Turnbull, D. and Berryman, J. (2016). Relevant Search:
With applications for Solr and Elasticsearch. Man-
ning Publications Co., USA.
Valcarce, D., Bellogín, A., Parapar, J., and Castells, P.
(2018). On the robustness and discriminative power
of information retrieval metrics for top-n recommen-
dation. In Proceedings of the 12th ACM conference
on recommender systems, pages 260–268.
Wang, Y., Wang, L., Li, Y., He, D., and Liu, T. (2013). A
Theoretical Analysis of NDCG Type Ranking Mea-
sures. In COLT 2013 - The 26th Annual Conference
on Learning Theory.
Zhou, G. and Devlin, J. (2021). Multi-vector attention mod-
els for deep re-ranking. In Proceedings of the 2021
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 5452–5456.