all the arms in the Pareto front. Thus, exploitation now means the fair usage of the Pareto optimal arms.
This difference in the exploitation vs exploration trade-off is reflected in all aspects of Pareto MAB algorithmic design. There are two regret metrics for MOMAB algorithms Drugan and Nowe (2013). One performance metric, the Pareto projection regret, measures the number of times any Pareto optimal arm is used. The other performance metric, the Pareto variance regret, measures the variance in using all the Pareto optimal arms. Background information on MOMABs in general, and on Pareto MABs in particular, is given in Section 2.
We propose several Pareto MAB algorithms that extend the classical single objective MAB algorithms UCB1 and UCB2 Auer et al. (2002) to reward vectors. The proposed algorithms focus on either the exploitation or the exploration mechanism. We consider Pareto UCB1 Drugan and Nowe (2013) to be an exploratory variant of UCB1 because, each round, only one Pareto optimal arm is pulled. In Section 3, we propose an exploitative variant of the Pareto UCB1 algorithm where, each round, all the Pareto optimal arms are pulled. We show that the analytical properties, i.e. the upper bound on the Pareto projection regret, of the exploitative Pareto UCB1 improve on those of the exploratory variant because this bound is independent of the cardinality of the Pareto front.
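To make the difference between the two mechanisms concrete, the following Python sketch shows one round of an exploitative Pareto-UCB1-style selection: an index vector is built for every arm, and every arm on the Pareto front of these index vectors is pulled, whereas the exploratory variant of Drugan and Nowe (2013) would pull a single arm from that front, e.g. chosen uniformly at random. This is an illustrative sketch rather than the algorithm analysed in Section 3: the exploration bonus below is the plain UCB1 term applied to each objective (the constant inside the logarithm used by Pareto UCB1 differs slightly), and pull_arm is a hypothetical callback.

```python
import math
import numpy as np

def pareto_front(vectors):
    """Indices of vectors not dominated by any other vector (maximisation)."""
    dom = lambda a, b: np.all(b <= a) and np.any(b < a)  # a dominates b
    return [i for i, v in enumerate(vectors)
            if not any(dom(w, v) for j, w in enumerate(vectors) if j != i)]

def exploitative_pareto_ucb1_round(mean_rewards, counts, t, pull_arm):
    """One round: build an index vector per arm and pull *every* arm on the
    Pareto front of index vectors (exploitative mechanism).

    mean_rewards : (K, D) array of empirical mean reward vectors
    counts       : (K,)   float array of pulls per arm (assumed >= 1 after an
                          initialisation phase that pulls each arm once)
    t            : total number of pulls so far
    pull_arm(i)  : hypothetical callback returning a length-D reward vector
    """
    # Plain UCB1 exploration term added to every objective (simplifying assumption).
    bonus = np.sqrt(2.0 * math.log(t) / counts)
    indices = mean_rewards + bonus[:, None]
    for i in pareto_front(indices):
        reward = np.asarray(pull_arm(i), dtype=float)
        counts[i] += 1
        mean_rewards[i] += (reward - mean_rewards[i]) / counts[i]
```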
Section 4 proposes two multi-objective variants of UCB2 corresponding to the two exploitation vs exploration mechanisms described above. The exploitative Pareto UCB2 is an extension of UCB2 where, each epoch, all the Pareto optimal arms are pulled equally often. This algorithm is introduced in Section 4.1. The exploratory Pareto UCB2 algorithm, see Section 4.2, pulls a single Pareto optimal arm each epoch. We compute the upper bound on the Pareto projection regret for the exploitative Pareto UCB2 algorithm.
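Both variants inherit the epoch structure of single objective UCB2: when an arm is selected for its r-th epoch it is played τ(r+1) − τ(r) times, where τ(r) = ⌈(1+α)^r⌉. The sketch below illustrates one epoch of the exploitative mechanism under simplifying assumptions: the standard UCB2 exploration term a_{n,r} is applied to every objective, each arm keeps its own epoch counter, and pull_arm is a hypothetical callback; the exact scheduling analysed in Section 4.1 may differ.

```python
import math
import numpy as np

def tau(r, alpha):
    """UCB2 epoch length schedule: tau(r) = ceil((1 + alpha)^r)."""
    return math.ceil((1.0 + alpha) ** r)

def ucb2_bonus(n, r, alpha):
    """Standard UCB2 exploration term a_{n,r}, here added to every objective."""
    return math.sqrt((1.0 + alpha) * math.log(math.e * n / tau(r, alpha))
                     / (2.0 * tau(r, alpha)))

def pareto_front(vectors):
    """Indices of vectors not dominated by any other vector (maximisation)."""
    dom = lambda a, b: np.all(b <= a) and np.any(b < a)  # a dominates b
    return [i for i, v in enumerate(vectors)
            if not any(dom(w, v) for j, w in enumerate(vectors) if j != i)]

def exploitative_pareto_ucb2_epoch(mean_rewards, counts, epochs, n, alpha, pull_arm):
    """One epoch of the exploitative variant: every arm i on the Pareto front
    of index vectors is played tau(r_i + 1) - tau(r_i) times before its epoch
    counter r_i is incremented."""
    indices = np.array([mean_rewards[i] + ucb2_bonus(n, epochs[i], alpha)
                        for i in range(len(counts))])
    for i in pareto_front(indices):
        # At least one play per epoch, to avoid degenerate zero-length epochs
        # when alpha is very small.
        plays = max(1, tau(epochs[i] + 1, alpha) - tau(epochs[i], alpha))
        for _ in range(plays):
            reward = np.asarray(pull_arm(i), dtype=float)
            counts[i] += 1
            n += 1
            mean_rewards[i] += (reward - mean_rewards[i]) / counts[i]
        epochs[i] += 1
    return n  # updated total number of pulls
```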
Our motivating example is a bi-objective wet clutch Vaerenbergh et al. (2012), a system with one input that is characterised by a hard non-linearity when the piston of the clutch comes into contact with the friction plates. These clutches are typically used in power transmissions of off-road vehicles, which operate under strongly varying environmental conditions.
The validation experiments are carried out on a dedicated test bench, where an electro-motor drives a flywheel via a torque converter and two mechanical transmissions. The goal is to simultaneously optimise: i) the current profile of the electro-hydraulic valve, which controls the pressure of the oil to the clutch, and ii) the engagement time.
The output data is stochastic because the behaviour of the machine varies with the surrounding temperature, which cannot be controlled exactly. Section 5 experimentally compares the proposed MOMAB algorithms on a bi-objective Bernoulli reward distribution generated from the output solutions of the wet clutch.
Section 6 concludes the paper.
2 THE MULTI-OBJECTIVE MULTI-ARMED BANDITS PROBLEM
We consider the general case where a reward vector
can be better than another reward vector in one objec-
tive, and worse in another objective. Expected reward
vectors are compared according to the Pareto domi-
nance relation Zitzler et al. (2003).
The following dominance relations between two reward vectors µ and ν are used. A vector µ dominates another vector ν, denoted ν ≺ µ, if and only if there exists at least one objective o for which ν_o < µ_o and, for all other objectives j ≠ o, we have ν_j ≤ µ_j. A reward vector µ is incomparable with another vector ν, denoted ν ∥ µ, if and only if there exists at least one objective o for which ν_o < µ_o and there exists another objective j ≠ o for which ν_j > µ_j. Finally, the vector µ is non-dominated by ν, denoted ν ⊁ µ, if and only if there exists at least one objective o for which ν_o < µ_o. Let A* be the Pareto front, i.e. the set of arms that are non-dominated by any arm in A.
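These relations reduce to simple componentwise comparisons of the expected reward vectors. The Python sketch below (the function names are ours; rewards are assumed to be maximised) illustrates dominance, incomparability, non-dominance, and the resulting Pareto front A*.

```python
import numpy as np

def dominates(mu, nu):
    """True if mu dominates nu (nu ≺ mu): mu is at least as good in every
    objective and strictly better in at least one."""
    mu, nu = np.asarray(mu), np.asarray(nu)
    return bool(np.all(nu <= mu) and np.any(nu < mu))

def incomparable(mu, nu):
    """True if mu and nu are incomparable (nu ∥ mu): each vector is strictly
    better than the other in at least one objective."""
    mu, nu = np.asarray(mu), np.asarray(nu)
    return bool(np.any(nu < mu) and np.any(mu < nu))

def non_dominated(mu, nu):
    """True if mu is non-dominated by nu (nu ⊁ mu): nu is strictly worse than
    mu in at least one objective."""
    mu, nu = np.asarray(mu), np.asarray(nu)
    return bool(np.any(nu < mu))

def pareto_front(means):
    """Indices of the arms whose expected reward vectors are not dominated by
    any other arm, i.e. the set A*."""
    return [i for i, mu in enumerate(means)
            if not any(dominates(other, mu)
                       for j, other in enumerate(means) if j != i)]

# Example: arms 0 and 1 are incomparable and both Pareto optimal, arm 2 is dominated.
means = [np.array([0.8, 0.2]), np.array([0.3, 0.7]), np.array([0.2, 0.1])]
print(pareto_front(means))  # -> [0, 1]
```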
2.1 The Exploration vs Exploitation
Trade-off in Pareto MABs
A Pareto MAB algorithm selects an arm to play based on the previous plays and the obtained reward vectors, and it tries to maximise the total expected reward vector. The goal of a MOMAB algorithm is to simultaneously minimise the regret of not selecting the Pareto optimal arms and play all the arms in the Pareto front fairly.
In order to measure the performance of these algorithms, we define two Pareto regret metrics. The first regret metric measures the loss in pulling arms that are not Pareto optimal and is called the Pareto projection regret. The second metric, the Pareto variance regret, measures the variance² in pulling each arm from the Pareto front A*.

² Not to be confused with the variance of random variables.
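As a rough illustration of what the two metrics track, the sketch below computes two simple stand-ins from the pull counts: the share of pulls spent on arms outside A*, which is what the Pareto projection regret penalises, and the variance of the pull counts over A*, i.e. the unfairness of use that the Pareto variance regret penalises. These stand-ins are ours and are not the exact definitions of the metrics.

```python
import numpy as np

def pull_count_summaries(counts, pareto_set):
    """Illustrative stand-ins (not the exact regret definitions): the fraction
    of pulls spent outside the Pareto front and the variance of the pull
    counts over the Pareto front.

    counts     : (K,) number of pulls per arm
    pareto_set : set of indices of the Pareto optimal arms A*
    """
    counts = np.asarray(counts, dtype=float)
    non_optimal = [i for i in range(len(counts)) if i not in pareto_set]
    suboptimal_share = counts[non_optimal].sum() / counts.sum()
    fairness_variance = counts[sorted(pareto_set)].var()
    return suboptimal_share, fairness_variance
```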