all the arms in the Pareto front. Thus, exploitation now means the fair usage of the Pareto optimal arms.
This difference in the exploitation vs exploration trade-off is reflected in all aspects of Pareto MAB algorithmic design. There are two regret metrics for MOMAB algorithms Drugan and Nowe (2013). One performance metric, the Pareto projection regret, measures the number of times any Pareto optimal arm is used. The other performance metric, the Pareto variance regret, measures the variance in using all the Pareto optimal arms. Background information on MOMABs in general, and on Pareto MABs in particular, is given in Section 2.
We propose several Pareto MAB algorithms that extend the classical single objective MAB algorithms UCB1 and UCB2 Auer et al. (2002) to reward vectors. The proposed algorithms focus on either the exploitation or the exploration mechanism. We consider Pareto UCB1 Drugan and Nowe (2013) to be an exploratory variant of UCB1 because, each round, only one Pareto optimal arm is pulled. In Section 3, we propose an exploitative variant of the Pareto UCB1 algorithm where, each round, all the Pareto optimal arms are pulled. We show that the analytical properties, i.e. the upper bound on the Pareto projection regret, of the exploitative Pareto UCB1 improve on those of the exploratory variant because this bound is independent of the cardinality of the Pareto front.
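To make the difference between the two mechanisms concrete, the following Python sketch shows one round of an exploitative Pareto-UCB1-style selection: an index vector is built for every arm, and every arm on the Pareto front of these index vectors is pulled, whereas the exploratory variant of Drugan and Nowe (2013) would pull a single arm from that front, e.g. chosen uniformly at random. This is an illustrative sketch rather than the algorithm analysed in Section 3: the exploration bonus below is the plain UCB1 term applied to each objective (the constant inside the logarithm used by Pareto UCB1 differs slightly), and pull_arm is a hypothetical callback.

```python
import math
import numpy as np

def pareto_front(vectors):
    """Indices of vectors not dominated by any other vector (maximisation)."""
    dom = lambda a, b: np.all(b <= a) and np.any(b < a)  # a dominates b
    return [i for i, v in enumerate(vectors)
            if not any(dom(w, v) for j, w in enumerate(vectors) if j != i)]

def exploitative_pareto_ucb1_round(mean_rewards, counts, t, pull_arm):
    """One round: build an index vector per arm and pull *every* arm on the
    Pareto front of index vectors (exploitative mechanism).

    mean_rewards : (K, D) array of empirical mean reward vectors
    counts       : (K,)   float array of pulls per arm (assumed >= 1 after an
                          initialisation phase that pulls each arm once)
    t            : total number of pulls so far
    pull_arm(i)  : hypothetical callback returning a length-D reward vector
    """
    # Plain UCB1 exploration term added to every objective (simplifying assumption).
    bonus = np.sqrt(2.0 * math.log(t) / counts)
    indices = mean_rewards + bonus[:, None]
    for i in pareto_front(indices):
        reward = np.asarray(pull_arm(i), dtype=float)
        counts[i] += 1
        mean_rewards[i] += (reward - mean_rewards[i]) / counts[i]
```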
Section 4 proposes two multi-objective variants of UCB2 corresponding to the two exploitation vs exploration mechanisms described above. The exploitative Pareto UCB2 is an extension of UCB2 where, each epoch, all the Pareto optimal arms are pulled equally often. This algorithm is introduced in Section 4.1. The exploratory Pareto UCB2 algorithm, see Section 4.2, pulls a single Pareto optimal arm each epoch. We compute the upper bound on the Pareto projection regret for the exploitative Pareto UCB2 algorithm.
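Both variants inherit the epoch structure of single objective UCB2: when an arm is selected for its r-th epoch it is played τ(r+1) − τ(r) times, where τ(r) = ⌈(1+α)^r⌉. The sketch below illustrates one epoch of the exploitative mechanism under simplifying assumptions: the standard UCB2 exploration term a_{n,r} is applied to every objective, each arm keeps its own epoch counter, and pull_arm is a hypothetical callback; the exact scheduling analysed in Section 4.1 may differ.

```python
import math
import numpy as np

def tau(r, alpha):
    """UCB2 epoch length schedule: tau(r) = ceil((1 + alpha)^r)."""
    return math.ceil((1.0 + alpha) ** r)

def ucb2_bonus(n, r, alpha):
    """Standard UCB2 exploration term a_{n,r}, here added to every objective."""
    return math.sqrt((1.0 + alpha) * math.log(math.e * n / tau(r, alpha))
                     / (2.0 * tau(r, alpha)))

def pareto_front(vectors):
    """Indices of vectors not dominated by any other vector (maximisation)."""
    dom = lambda a, b: np.all(b <= a) and np.any(b < a)  # a dominates b
    return [i for i, v in enumerate(vectors)
            if not any(dom(w, v) for j, w in enumerate(vectors) if j != i)]

def exploitative_pareto_ucb2_epoch(mean_rewards, counts, epochs, n, alpha, pull_arm):
    """One epoch of the exploitative variant: every arm i on the Pareto front
    of index vectors is played tau(r_i + 1) - tau(r_i) times before its epoch
    counter r_i is incremented."""
    indices = np.array([mean_rewards[i] + ucb2_bonus(n, epochs[i], alpha)
                        for i in range(len(counts))])
    for i in pareto_front(indices):
        # At least one play per epoch, to avoid degenerate zero-length epochs
        # when alpha is very small.
        plays = max(1, tau(epochs[i] + 1, alpha) - tau(epochs[i], alpha))
        for _ in range(plays):
            reward = np.asarray(pull_arm(i), dtype=float)
            counts[i] += 1
            n += 1
            mean_rewards[i] += (reward - mean_rewards[i]) / counts[i]
        epochs[i] += 1
    return n  # updated total number of pulls
```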
Our motivating example is a bi-objective wet clutch Vaerenbergh et al. (2012), a system with one input that is characterised by a hard non-linearity when the piston of the clutch comes into contact with the friction plates. These clutches are typically used in power transmissions of off-road vehicles, which operate under strongly varying environmental conditions.
The validation experiments are carried out on a dedicated test bench, where an electro-motor drives a flywheel via a torque converter and two mechanical transmissions. The goal is to simultaneously optimise: i) the current profile of the electro-hydraulic valve, which controls the pressure of the oil to the clutch, and ii) the engagement time.
The output data is stochastic because the behaviour of the machine varies with the surrounding temperature, which cannot be controlled exactly. Section 5 experimentally compares the proposed MOMAB algorithms on a bi-objective Bernoulli reward distribution generated from the output solutions of the wet clutch.
Section 6 concludes the paper.
2 THE MULTI-OBJECTIVE MULTI-ARMED BANDITS PROBLEM
We consider the general case where a reward vector
can be better than another reward vector in one objec-
tive, and worse in another objective. Expected reward
vectors are compared according to the Pareto domi-
nance relation Zitzler et al. (2003).
The following dominance relations between two reward vectors µ and ν are used. A vector µ dominates another vector ν, denoted ν ≺ µ, if and only if there exists at least one objective o for which ν_o < µ_o and, for all other objectives j ≠ o, we have ν_j ≤ µ_j. A reward vector µ is incomparable with another vector ν, denoted ν ∥ µ, if and only if there exists at least one objective o for which ν_o < µ_o and there exists another objective j ≠ o for which ν_j > µ_j. Finally, the vector µ is non-dominated by ν, denoted ν ⊁ µ, if and only if there exists at least one objective o for which ν_o < µ_o. Let A* be the Pareto front, i.e. the set of arms that are non-dominated by any arm in A.
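These relations reduce to simple componentwise comparisons of the expected reward vectors. The Python sketch below (the function names are ours; rewards are assumed to be maximised) illustrates dominance, incomparability, non-dominance, and the resulting Pareto front A*.

```python
import numpy as np

def dominates(mu, nu):
    """True if mu dominates nu (nu ≺ mu): mu is at least as good in every
    objective and strictly better in at least one."""
    mu, nu = np.asarray(mu), np.asarray(nu)
    return bool(np.all(nu <= mu) and np.any(nu < mu))

def incomparable(mu, nu):
    """True if mu and nu are incomparable (nu ∥ mu): each vector is strictly
    better than the other in at least one objective."""
    mu, nu = np.asarray(mu), np.asarray(nu)
    return bool(np.any(nu < mu) and np.any(mu < nu))

def non_dominated(mu, nu):
    """True if mu is non-dominated by nu (nu ⊁ mu): nu is strictly worse than
    mu in at least one objective."""
    mu, nu = np.asarray(mu), np.asarray(nu)
    return bool(np.any(nu < mu))

def pareto_front(means):
    """Indices of the arms whose expected reward vectors are not dominated by
    any other arm, i.e. the set A*."""
    return [i for i, mu in enumerate(means)
            if not any(dominates(other, mu)
                       for j, other in enumerate(means) if j != i)]

# Example: arms 0 and 1 are incomparable and both Pareto optimal, arm 2 is dominated.
means = [np.array([0.8, 0.2]), np.array([0.3, 0.7]), np.array([0.2, 0.1])]
print(pareto_front(means))  # -> [0, 1]
```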
2.1 The Exploration vs Exploitation
Trade-off in Pareto MABs
A Pareto MAB algorithm selects an arm to play based on the previous plays and the obtained reward vectors, and it tries to maximise the total expected reward vector. The goal of a MOMAB algorithm is to simultaneously minimise the regret of not selecting the Pareto optimal arms and play all the arms in the Pareto front fairly.
In order to measure the performance of these algorithms, we define two Pareto regret metrics. The first regret metric measures the loss in pulling arms that are not Pareto optimal and is called the Pareto projection regret. The second metric, the Pareto variance regret, measures the variance² in pulling each arm from the Pareto front A*.

² Not to be confused with the variance of random variables.
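As a rough illustration of what the two metrics track, the sketch below computes two simple stand-ins from the pull counts: the share of pulls spent on arms outside A*, which is what the Pareto projection regret penalises, and the variance of the pull counts over A*, i.e. the unfairness of use that the Pareto variance regret penalises. These stand-ins are ours and are not the exact definitions of the metrics.

```python
import numpy as np

def pull_count_summaries(counts, pareto_set):
    """Illustrative stand-ins (not the exact regret definitions): the fraction
    of pulls spent outside the Pareto front and the variance of the pull
    counts over the Pareto front.

    counts     : (K,) number of pulls per arm
    pareto_set : set of indices of the Pareto optimal arms A*
    """
    counts = np.asarray(counts, dtype=float)
    non_optimal = [i for i in range(len(counts)) if i not in pareto_set]
    suboptimal_share = counts[non_optimal].sum() / counts.sum()
    fairness_variance = counts[sorted(pareto_set)].var()
    return suboptimal_share, fairness_variance
```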