EXPLOITING SIMILARITY INFORMATION IN REINFORCEMENT
LEARNING
Similarity Models for Multi-Armed Bandits and MDPs
Ronald Ortner
Lehrstuhl für Informationstechnologie, Montanuniversität Leoben, Austria
Keywords:
Reinforcement learning, Markov decision process, Multi-armed bandit, Similarity, Regret.
Abstract:
This paper considers reinforcement learning problems with additional similarity information. We start with
the simple setting of multi-armed bandits in which the learner knows for each arm its color, where it is
assumed that arms of the same color have close mean rewards. An algorithm is presented that shows that
this color information can be used to improve the dependency of online regret bounds on the number of
arms. Further, we discuss to what extent this approach can be extended to the more general case of Markov
decision processes. For the simplest case where the same color for actions means similar rewards and identical
transition probabilities, an algorithm and a corresponding online regret bound are given. For the general case
where the same color implies only close but not necessarily identical transition probabilities, we give upper and lower
bounds on the error caused by action aggregation with respect to the color information. These bounds also imply that
the general case is far more difficult to handle.
1 INTRODUCTION
Algorithms for reinforcement learning problems suffer from the curse of dimensionality when either the
action space or the state space is large. In contrast, in many of these problems humans have no
difficulties in learning, as they are able to structure the state space and the action space in a
favorable way. In many cases, this structural information concerns similarity of states and actions.
Here we investigate to what extent similarity information can be exploited to improve over the
performance in the case where no such information is given. Although
our main interest lies in Markov decision processes
(MDPs), we start with a multi-armed bandit problem
with a simple similarity model: For each arm there is
an additional color information available, where arms
of the same color are assumed to have close mean re-
wards, that is, these deviate by at most θ, a parameter
known to the learner. Indeed, a similar model has al-
ready been considered by (Pandey et al., 2007), who
also give a typical application to an ad-selection prob-
lem on webpages, where ads with similar content are
similarly attractive to the user and get comparable re-
ward (i.e., user clicks). Similarity information of the given kind also seems natural in many of the
other numerous applications of multi-armed bandits, such as routing, wireless networks, design of
experiments, or pricing (for references see e.g. (Kleinberg, 2005)).
In Section 2 below we present an algorithm that
is able to exploit color information, as the derived
bounds on the regret with respect to the best arm
show: While online regret bounds for ordinary bandit
problems (which usually are logarithmic in the num-
ber of steps taken) grow linearly with the number of
actions, with color information the total number of ac-
tions can be replaced with the number of colors plus
the number of arms with promising color.
In the subsequent Section 3 we consider the more
general setting of Markov decision processes where
color information for the actions is available. We
start by examining the simplest case where actions of the same color have similar rewards (again
measured by a parameter θ) and identical transition probabilities. For this setting we give an
adaptation of the UCRL2 algorithm of (Auer et al., 2009), for which we show regret bounds
demonstrating that, similarly to the bandit setting, the color information can be exploited to
obtain improved bounds.
When this setting is generalized so that actions with the same color have only similar but not
necessarily identical transition probabilities, things get more complicated. In Section 3.2, we
investigate action aggregation with respect to such colorings. We derive bounds
on the error caused by working on the aggregated in-
stead of the original MDP. Unlike in the simpler set-
tings where this error is trivially bounded by the pa-
rameter θ, the error can be arbitrarily large, depend-
ing on the (aggregated) MDP. This indicates that sim-
ilarity information regarding the transition probabili-
ties cannot be as well exploited as for rewards, which
is confirmed by an example at the end of Section 3
that shows that straightforward adaptations of the al-
gorithm for the simpler setting fail.
2 COLORED BANDITS
In a multi-armed bandit problem the learner has a fi-
nite set of arms A at his disposal. Choosing an arm a
from A gives a random reward bounded in the unit in-
terval [0,1] with mean r(a). As performance measure
for a learning algorithm one usually considers its re-
gret with respect to choosing the optimal arm at each
step. That is, setting τ(a) to be the number of steps
where arm a has been chosen (up to some finite hori-
zon T) the regret is defined as
$$\sum_{a \in A} \tau(a)\,\bigl(r^* - r(a)\bigr),$$
where $r^* = \max_{a} r(a)$ is the optimal mean reward.
The regret of established bandit algorithms such as
UCB1 (Auer et al., 2002) is logarithmic in the num-
ber of steps, but grows linearly with the number of
arms. This is also best possible (Mannor and Tsitsik-
lis, 2004).
Unlike in the general case, where the learner has
no information apart from A, here we are interested in
the question how given similarity information about
different arms can be exploited to improve regret
bounds with respect to the dependency on the number
of arms. That is, the learner additionally knows the color of each arm, given by a coloring function
$c : A \to C$ that assigns each arm in A a color from a given set of colors C. (We assume that the
function c is surjective, i.e., each color in C is assigned to an arm in A.) The
color gives some similarity information about the re-
wards of arms according to the following assumption.
Assumption 1. There is a $\theta > 0$ such that for each two arms $a, a' \in A$: If $c(a) = c(a')$ then $|r(a) - r(a')| < \theta$.
We assume that the learner knows the parame-
ter θ. This setting is similar to the one considered
in (Pandey et al., 2007). However, there it is as-
sumed that choosing an arm is a Bernoulli trial that
gives reward 1 with some success probability p and
reward 0 otherwise. Further, our Assumption 1 is re-
placed with the supposition that the success probabil-
ities p of arms of the same color are distributed ac-
cording to a common probability distribution.
2.1 Algorithm
An obvious idea is to adapt a standard bandit algo-
rithm to first choose a color c and then in a second step
to choose an arm with color c. This idea also underlies
the TLP algorithm of (Pandey et al., 2007). However
there is a problem with that direct approach when two colors are very close, as it takes
$\Omega\bigl(\frac{1}{\varepsilon^2}\bigr)$ steps to distinguish a distance of $\varepsilon$ between two
arms/colors (cf. the analysis of (Pandey et al., 2007), which does not derive regret bounds, but only
considers the convergence behavior of the TLP algorithm). Our algorithm (shown in Figure 1) does not
try to identify the best color $c^*$ but instead forms a set $C_t$ of good colors. A distance
parameter $\beta$ determines how large the distance between the best color $c^*$ and another color $c$
has to be in order to consider $c$ suboptimal and exclude it from $C_t$. Unlike (Pandey et al., 2007),
we do not maintain a single estimate value for each color $c$, but calculate a confidence interval for
each color that w.h.p. contains the mean reward of the best arm of color $c$.
2.2 Analysis
In order to derive an upper bound on the regret of
the algorithm, we will consider (i) for how many
steps suboptimal colors are included in $C_t$, and (ii) when $C_t$ contains only close to optimal
colors, how often will a suboptimal arm be chosen? Question (ii) is answered by the original UCB1
analysis taken from (Auer et al., 2002). For question (i) this has to be adapted. Let
$r^+(c) := \max_{a: c(a)=c} r(a)$ and $r^-(c) := \min_{a: c(a)=c} r(a)$. We assume that at each step $t$
of the algorithm
$$r^+(c) \;\geq\; \hat{r}_t(c) - \mathrm{conf}_t(c), \quad\text{and} \qquad (1)$$
$$r^-(c) \;\leq\; \hat{r}_t(c) + \mathrm{conf}_t(c) \qquad (2)$$
for each color $c$, and that
$$\hat{r}_t(a_t) \;\leq\; r(a_t) + \sqrt{\tfrac{\log(t^3/\delta)}{2 n_t(a_t)}}, \quad\text{and} \qquad (3)$$
$$\hat{r}_t(a^*) \;\geq\; r(a^*) - \sqrt{\tfrac{\log(t^3/\delta)}{2 n_t(a^*)}}, \qquad (4)$$
where $a^* = \operatorname{argmax}_{a \in A} r(a)$ is an arm with maximal mean reward $r^*$.
Application of Hoeffding's inequality shows that (1) as well as (2) holds with probability at least
$1 - \frac{\delta}{|C| t^3}$ for a fixed time step $t$, a fixed color $c$, and a fixed value of $n_t(c)$.
Thus, a union bound over all colors, all possible values of $n_t(c)$, and all $t$ shows that (1) and (2)
hold with probability at least $1 - 2\sum_t \frac{\delta}{t^2} > 1 - \frac{10}{3}\delta$ for all $t$
(using $\sum_{t \geq 1} t^{-2} = \pi^2/6$, so that $2\sum_t t^{-2} < \frac{10}{3}$). Similarly, (3)
and (4) hold with probability at least $1 - \frac{10}{3}\delta$ for all $t$.
Input: A confidence parameter $\delta \in (0,1)$, and a distance parameter $\beta \in (0,1)$.

Initialization: For each color $c \in C$ sample an action $a \in A$ with $c(a) = c$.

For time steps $t = 1, 2, \ldots$ do

Calculate confidence intervals $\hat{I}_t(c)$ for each color: For each color $c$ in $C$ calculate a confidence interval for $\max_{a: c(a)=c} r(a)$:
$$\hat{I}_t(c) := \bigl[\,\hat{r}_t(c) - \mathrm{conf}_t(c),\; \hat{r}_t(c) + \theta + \mathrm{conf}_t(c)\,\bigr],$$
where $\hat{r}_t(c) = \frac{1}{n_t(c)} \sum_{\tau < t:\, c(a_\tau)=c} r_\tau$ with $r_\tau$ being the random reward obtained at step $\tau$ for choosing arm $a_\tau$, $n_t(a)$ being the number of times action $a$ was chosen, and $n_t(c) := \sum_{a: c(a)=c} n_t(a)$ being the number of times an action with color $c$ has been chosen. Further,
$$\mathrm{conf}_t(c) := \sqrt{\tfrac{\log(|C| t^3 / \delta)}{2 n_t(c)}}.$$

Determine relevant colors $C_t$: Let $c_t := \operatorname{argmax}_{c \in C}\,\bigl\{\hat{r}_t(c) + \theta + \mathrm{conf}_t(c)\bigr\}$ be the color with maximal upper confidence bound value, and set $C_t := \bigl\{c \in C \,\big|\, \hat{I}_t(c) \cap \hat{I}_t(c_t) \neq \emptyset\bigr\}$.
If $\mathrm{conf}_t(c_t) \geq \beta/4$ and $\mathrm{conf}_t(c) < \beta/4$ for some $c \in C_t$, reset $C_t := \{c_t\}$.

Arm selection: Use UCB1 to choose an arm from $A_t := \{a \in A \,|\, c(a) \in C_t\}$, i.e., if there is an unsampled arm $a$ in $A_t$ choose $a$, otherwise choose
$$a_t := \operatorname{argmax}_{a \in A_t}\,\Bigl\{\hat{r}_t(a) + \sqrt{\tfrac{\log(t^3/\delta)}{2 n_t(a)}}\Bigr\},$$
where $\hat{r}_t(a) = \frac{1}{n_t(a)} \sum_{\tau < t:\, a_\tau = a} r_\tau$.

Figure 1: The colored bandits algorithm.
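To make the pseudocode of Figure 1 more concrete, the following is a minimal Python sketch of one possible implementation. It is not part of the paper; the class and method names (ColoredBandits, select_arm, update) are made up for illustration, and rewards are assumed to come from an arbitrary black-box returning values in [0,1].

```python
import math
from collections import defaultdict

class ColoredBandits:
    """Sketch of the colored bandits algorithm (Figure 1). Names are illustrative only.

    arms:  list of arm identifiers
    color: dict mapping each arm to its color
    theta: assumed bound on the reward gap within one color
    delta: confidence parameter, beta: distance parameter
    """
    def __init__(self, arms, color, theta, delta=0.05, beta=0.1):
        self.arms, self.color = arms, color
        self.colors = sorted(set(color.values()))
        self.theta, self.delta, self.beta = theta, delta, beta
        self.n = defaultdict(int)      # pulls per arm
        self.s = defaultdict(float)    # reward sum per arm
        self.t = 1

    def _conf(self, n, width_log):
        return math.sqrt(width_log / (2 * n))

    def select_arm(self):
        # Initialization: sample one arm of every color first.
        sampled_colors = {self.color[a] for a in self.arms if self.n[a] > 0}
        for c in self.colors:
            if c not in sampled_colors:
                return next(a for a in self.arms if self.color[a] == c)

        # Per-color statistics and confidence intervals.
        n_c, s_c = defaultdict(int), defaultdict(float)
        for a in self.arms:
            n_c[self.color[a]] += self.n[a]
            s_c[self.color[a]] += self.s[a]
        log_c = math.log(len(self.colors) * self.t ** 3 / self.delta)
        mean = {c: s_c[c] / n_c[c] for c in self.colors}
        conf = {c: self._conf(n_c[c], log_c) for c in self.colors}
        lower = {c: mean[c] - conf[c] for c in self.colors}
        upper = {c: mean[c] + self.theta + conf[c] for c in self.colors}

        # Relevant colors: intervals intersecting the interval of c_t.
        c_t = max(self.colors, key=lambda c: upper[c])
        C_t = [c for c in self.colors if upper[c] >= lower[c_t]]
        if conf[c_t] >= self.beta / 4 and any(conf[c] < self.beta / 4 for c in C_t):
            C_t = [c_t]

        # UCB1 on the arms of the relevant colors.
        A_t = [a for a in self.arms if self.color[a] in C_t]
        for a in A_t:                          # unsampled arm first
            if self.n[a] == 0:
                return a
        log_a = math.log(self.t ** 3 / self.delta)
        return max(A_t, key=lambda a: self.s[a] / self.n[a]
                                      + self._conf(self.n[a], log_a))

    def update(self, arm, reward):
        self.n[arm] += 1
        self.s[arm] += reward
        self.t += 1
```

A run would repeatedly call select_arm(), observe the reward of the chosen arm, and feed it back via update(); the per-color width used here matches $\mathrm{conf}_t(c)$ from Figure 1.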
Note that under assumptions (1) and (2) an optimal color $c^*$ (i.e., the color of an optimal arm $a^*$) is always in $C_t$, since
$$\hat{r}_t(c^*) + \mathrm{conf}_t(c^*) + \theta \;\geq\; r^-(c^*) + \theta \;\geq\; r^+(c^*) \;\geq\; r^+(c_t) \;\geq\; \hat{r}_t(c_t) - \mathrm{conf}_t(c_t).$$
Now, we establish sample complexity bounds both on (i) the number of times an arm of a color that is
$\beta$-far from the optimal color $c^*$ is chosen, and (ii) the number of times a suboptimal arm is
chosen from $C_t$ (assuming that $C_t$ contains only colors $\beta$-close to the optimal color). For
the bound on (ii), we may directly refer to (Auer et al., 2002), where it is shown that any suboptimal
arm $a$ is chosen at most $1 + \frac{8\log T}{(r^* - r(a))^2}$ times (w.h.p.). As playing such an arm
gives regret $r^* - r(a)$, this yields a bound of
$$\sum_{c \in C_\beta}\;\sum_{\substack{a: c(a)=c \\ r(a) < r^*}} \Bigl(1 + \frac{8\log T}{r^* - r(a)}\Bigr),$$
where $C_\beta := \{c \in C \,|\, r^* - r^+(c) \leq \beta + 2\theta\}$ is the set of colors that are
$\beta$-close to the optimal reward $r^*$.
For a bound on (i) we may easily adapt the mentioned proof as follows. Consider a $\beta$-bad color
$c \notin C_\beta$. Then $r^+(c) + \beta + 2\theta < r^*$. According to the algorithm, $c \in C_t$ only when
$$\hat{r}_t(c) + \mathrm{conf}_t(c) + \theta \;\geq\; \hat{r}_t(c_t) - \mathrm{conf}_t(c_t). \qquad (5)$$
Further, if $\mathrm{conf}_t(c) < \beta/4$ then $c \in C_t$ only in case $\mathrm{conf}_t(c_t) < \beta/4$, too.
But then we have from (1), (5), and the fact that $r^* \leq \hat{r}_t(c_t) + \mathrm{conf}_t(c_t) + \theta$ that
$$r^+(c) + \beta + 2\theta \;\geq\; \hat{r}_t(c) - \mathrm{conf}_t(c) + \beta + 2\theta \;\geq\; \hat{r}_t(c_t) - 2\,\mathrm{conf}_t(c) - \mathrm{conf}_t(c_t) + \beta + \theta \;\geq\; r^* - 2\,\mathrm{conf}_t(c_t) - 2\,\mathrm{conf}_t(c) + \beta \;>\; r^*,$$
contradicting our assumption that $c$ is a $\beta$-bad color. Hence, whenever $\mathrm{conf}_t(c) < \beta/4$
we have $c \notin C_t$, so that $c(a_t) = c$ at no more than
$\bigl\lceil \tfrac{8\log(|C|T^3/\delta)}{\beta^2} \bigr\rceil$ time steps.
Further, we have to consider the case when setting $C_t := \{c_t\}$, which may be a suboptimal choice as
well. However, this happens only when $\mathrm{conf}_t(c_t) \geq \beta/4$, that is, not more often than
$\bigl\lceil \tfrac{8\log(|C|T^3/\delta)}{\beta^2} \bigr\rceil$ times.
Summarizing (and also taking into account the regret
of the initialization), we get the following result.
Theorem 2. The regret of the colored bandits algorithm after $T$ steps is with probability at least
$1 - \frac{20}{3}\delta$ at most
$$|C| + 2|C|\Bigl\lceil \tfrac{8\log(|C|T^3/\delta)}{\beta^2} \Bigr\rceil \;+\; \sum_{\substack{a: c(a) \in C_\beta \\ r(a) < r^*}} \Bigl(1 + \frac{8\log T}{r^* - r(a)}\Bigr).$$
As the regret at each step is at most 1, we may simply sum up the error probabilities for failing
confidence intervals given in (1) and (3) to obtain a bound on the expected regret as well.
These bounds show that it is possible for the
learner to exploit color information in order to elimi-
nate the dependency on the total number of actions in
the respective regret bounds.
3 COLORED ACTION MDPS
We continue dealing with the natural generalization of
the problem to Markov decision processes. A Markov
decision process (MDP) is a tuple $M = \langle S, A, p, r\rangle$, where $S$ is a finite set of states
and $A$ is a finite set of actions. Unlike in the usual setting where in each state from $S$ each
action from $A$ is available, we consider that for each state $s$ there is a nonempty subset
$A(s) \subseteq A$ of actions available in $s$. Further, we assume that the sets $A(s)$ form a partition
of $A$, i.e., $A(s) \cap A(s') = \emptyset$ for $s \neq s'$, and $\bigcup_{s \in S} A(s) = A$. The
transition probabilities $p(s'|s,a)$ give the probability of reaching state $s'$ when choosing action
$a$ in state $s$, and the payoff distributions with mean $r(s,a)$ and support in $[0,1]$ specify the
random reward obtained for choosing action $a$ in state $s$.
We are interested in the undiscounted average reward $\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} r_t$,
where $r_t$ is the random reward obtained at step $t$. As possible strategies we consider (stationary)
policies $\pi : S \to A$ with $\pi(s) \in A(s)$. This is justified by the fact that there is always such
a policy $\pi^*$ which gives optimal reward (Puterman, 1994). Let $\rho(\pi)$ denote the expected
average reward of policy $\pi$. Then $\pi^*$ is an optimal policy if $\rho(\pi) \leq \rho(\pi^*) =: \rho^*$
for all policies $\pi$.
In the analysis we will also need some transition parameters of the MDP at hand. Thus let
$T(s'|M,\pi,s)$ be the first (random) time step in which state $s'$ is reached when policy $\pi$ is
executed on MDP $M$ with initial state $s$. Then we define the diameter of the MDP to be the average
time it takes to move from any state $s$ to any other state $s'$, using an appropriate policy, i.e.
$$D(M) := \max_{s \neq s' \in S}\; \min_{\pi: S \to A}\; \mathbb{E}\bigl[T(s'|M,\pi,s)\bigr].$$
We will consider only MDPs with finite diameter, which guarantees that there is always an optimal
policy that achieves optimal average reward $\rho^*$ independent of the initial state. Note that each
policy $\pi$ induces a Markov chain $M_\pi$ on $M$. If this Markov chain is ergodic (i.e., each state is
reachable from each other state after a finite number of steps), it has a state-independent stationary
distribution. The mixing time of a policy $\pi$ on an MDP $M$ with induced stationary distribution
$\mu_\pi$ is given by
$$\kappa_\pi(M) := \sum_{s' \in S} \mathbb{E}\bigl[T(s'|M,\pi,s)\bigr]\,\mu_\pi(s')$$
for an arbitrary $s \in S$. The definition is independent of the choice of $s$ as shown in
(Hunter, 2006). Finally, we remark that in case a policy $\pi$ induces an ergodic Markov chain on $M$
with stationary distribution $\mu_\pi$, the average reward of $\pi$ can be written as
$$\rho(\pi) = \sum_{s \in S} \mu_\pi(s)\, r(s, \pi(s)). \qquad (6)$$
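As a small illustration of these quantities, here is a hedged Python sketch (not part of the paper) that computes the stationary distribution of the Markov chain induced by a fixed policy and evaluates its average reward via (6). The transition matrix P_pi and reward vector r_pi are assumed to be given as NumPy arrays for an ergodic chain; the function names are made up for the example.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Stationary distribution mu of an ergodic chain with transition matrix P_pi.

    Solves mu P = mu together with the normalization sum(mu) = 1 by stacking
    the normalization equation onto (P^T - I) mu = 0 and solving in least squares.
    """
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def average_reward(P_pi, r_pi):
    """Average reward rho(pi) = sum_s mu_pi(s) r(s, pi(s)), cf. equation (6)."""
    return float(stationary_distribution(P_pi) @ r_pi)

# Example: a two-state chain with leave probabilities 0.1 and 0.5.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([1.0, 0.0])
print(average_reward(P, r))   # mu = (5/6, 1/6), so rho = 5/6
```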
3.1 When the Color Determines the
Transition Probabilities
The simplest case in the MDP scenario is when two
actions of the same color have identical transition
probabilities and close rewards, given a coloring function $c : A \to C$ on the action space.
Assumption 3. There is a $\theta > 0$ such that for each two actions $a, a' \in A(s)$ with $s \in S$: If $c(a) = c(a')$ then (i) $p(\cdot|s,a) = p(\cdot|s,a')$, and (ii) $|r(s,a) - r(s,a')| < \theta$.
We will try to exploit the color information only
for same-colored actions in the same state, so that
we may assume without loss of generality that actions
available in distinct states have distinct colors. Thus
the set of colors C is partitioned by the sets C(s) of
colors of actions that are available in state s. Fur-
ther, we have to distinguish between colors having
distinct transition probability distributions. Thus, we
write C(s, p) for the set of colors of actions available
in state s and having transition probabilities p(·).
3.1.1 Algorithm
The algorithm we propose (shown in Figure 2) is
a straightforward adaptation of the UCRL2 algo-
rithm (Auer et al., 2009). For the sake of simplicity,
we only consider the case where the transition struc-
ture of the MDP is known. The general case can be
handled analogously to (Auer et al., 2009). We use
the color information just as in the bandit case. That
is, in each state a set of promising colors is defined.
Then an optimal policy is calculated where the action
set is restricted to actions with promising colors, and
the actions’ rewards are set to their upper confidence
values. The algorithm proceeds in episodes $i$, and the chosen policy $\pi_i$ is executed until a state
$s$ is reached in which the action $\pi_i(s)$ has been played within the episode as often as before the
episode.
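To illustrate the color-filtering step, here is a hedged Python sketch (not from the paper) of how the promising colors and the restricted, optimistically rewarded action sets could be computed at the start of an episode. The dictionary-based data layout and the helper solve_optimistic_mdp (standing for an average-reward solver such as value iteration) are assumptions made for the example and are not shown.

```python
import math
from collections import defaultdict

def promising_actions(state_actions, color, trans_key, r_hat, n, t,
                      theta, delta, beta, num_colors):
    """Restrict each state's action set to actions of promising colors.

    state_actions: dict state -> list of available actions
    color:         dict action -> color
    trans_key:     dict action -> hashable id of its transition distribution
                   (actions of one color share the same id in this setting)
    r_hat, n:      dicts (state, action) -> empirical mean reward / visit count
    Returns dict state -> {action: optimistic reward} for the restricted sets.
    """
    def conf(cnt):
        # Counts of 0 are reset to 1, as in footnote 2 of Figure 2.
        return math.sqrt(7 * math.log(2 * num_colors * t / delta) / (2 * max(cnt, 1)))

    restricted = {}
    for s, actions in state_actions.items():
        # Group same-colored actions that share a transition distribution.
        groups = defaultdict(lambda: defaultdict(list))
        for a in actions:
            groups[trans_key[a]][color[a]].append(a)
        keep = set()
        for _p, by_color in groups.items():
            stats = {}
            for c, acts in by_color.items():
                n_c = sum(n[(s, a)] for a in acts)
                mean = sum(r_hat[(s, a)] * n[(s, a)] for a in acts) / max(n_c, 1)
                stats[c] = (mean - conf(n_c), mean + theta + conf(n_c), conf(n_c))
            c_best = max(stats, key=lambda c: stats[c][1])
            C_t = [c for c in stats if stats[c][1] >= stats[c_best][0]]
            if stats[c_best][2] >= beta / 4 and any(stats[c][2] < beta / 4 for c in C_t):
                C_t = [c_best]
            keep.update(a for c in C_t for a in by_color[c])
        # Optimistic rewards for the remaining actions (len(color) = |A|).
        restricted[s] = {
            a: r_hat[(s, a)]
               + math.sqrt(7 * math.log(2 * len(color) * t / delta)
                           / (2 * max(n[(s, a)], 1)))
            for a in actions if a in keep
        }
    return restricted

# An episode would then call an (assumed) average-reward solver on the known
# transition probabilities restricted to these action sets, e.g.:
# policy = solve_optimistic_mdp(transitions, promising_actions(...))
```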
3.1.2 Analysis
Again we are interested in the algorithm's regret after $T$ steps, defined as $T\rho^* - \sum_t r_t$.¹
Furthermore, we also consider the regret with respect to an $\varepsilon$-optimal policy, i.e., with
respect to $\rho^* - \varepsilon$ instead of $\rho^*$. The analysis of the algorithm's regret is a
combination of the respective proofs of Theorem 2 and of the logarithmic regret bounds in
(Auer et al., 2009). That is, we first establish a sample complexity bound on the number of steps in
episodes where
¹ Unlike in the bandit case, this regret definition also considers the deviations of the achieved
rewards from the mean rewards. Actually, the regret bounds for the bandit case can be adapted to this
alternative regret definition.
Input: A confidence parameter $\delta \in (0,1)$, and a distance parameter $\beta \in (0,1)$.
Notation: Let $t$ denote the current time step.

For episodes $i = 1, 2, \ldots$ do

Initialize episode $i$: Set $t_i := t$. Calculate estimates $\hat{r}_t(s,a)$ for the mean reward $r(s,a)$ for state-action pairs $(s,a)$ with $a \in A(s)$, and determine confidence intervals $\hat{I}_t(s,c)$ for each color as follows. For each state $s$, each transition probability distribution $p(\cdot)$ and each color $c$ in $C(s,p)$ calculate a confidence interval for $\max_{a: c(a)=c} r(s,a)$:
$$\hat{I}_t(s,c) := \bigl[\,\hat{r}_t(s,c) - \mathrm{conf}_t(s,c),\; \hat{r}_t(s,c) + \theta + \mathrm{conf}_t(s,c)\,\bigr],$$
where $\hat{r}_t(s,c) = \frac{1}{n_t(s,c)} \sum_{\tau < t:\, c(a_\tau)=c,\, s_\tau=s} r_\tau$ with $r_\tau$ being the random reward obtained at step $\tau$ for choosing action $a_\tau$ in state $s$, $n_t(s,a)$ being the number of times action $a$ was chosen in state $s$, and $n_t(s,c)$ being the number of times an action with color $c$ has been chosen in $s$.² Further,
$$\mathrm{conf}_t(s,c) := \sqrt{\tfrac{7\log(2|C|t/\delta)}{2 n_t(s,c)}}.$$

Determine relevant colors $C_t$: For each state $s$ and each $p(\cdot)$ let $c_t(s,p)$ be the color with maximal upper confidence bound value, i.e.,
$$c_t(s,p) := \operatorname{argmax}_{c \in C(s,p)}\,\bigl\{\hat{r}_t(s,c) + \theta + \mathrm{conf}_t(s,c)\bigr\}.$$
Set $C_t(s,p) := \bigl\{c \in C(s,p) \,\big|\, \hat{I}_t(s, c_t(s,p)) \cap \hat{I}_t(s,c) \neq \emptyset\bigr\}$. If $\mathrm{conf}_t(s, c_t(s,p)) \geq \beta/4$ and $\mathrm{conf}_t(s,c) < \beta/4$ for some $c \in C_t(s,p)$, reset $C_t(s,p) := \{c_t(s,p)\}$.

Policy selection and execution: Choose an optimal policy³ $\pi_i$ in the MDP with transition structure as given and action sets $A_t(s) := \{a \in A(s) \,|\, c(a) \in \bigcup_p C_t(s,p)\}$ with rewards
$$\tilde{r}_t(s,a) := \hat{r}_t(s,a) + \sqrt{\tfrac{7\log(2|A|t/\delta)}{2 n_t(s,a)}}.$$
Play $\pi_i$ as long as $n_t(s_t, \pi_i(s_t)) < 2\, n_{t_i}(s_t, \pi_i(s_t))$ in the current state $s_t$.

Figure 2: The colored MDP algorithm for MDPs with known transition structure and colored action set.

² If the color or action count is 0, reset it to 1.
³ Such an optimal policy can be calculated using ordinary value iteration (Puterman, 1994).
$\varepsilon$-suboptimal reward is received. Thus, let $T_\varepsilon$ be the number of steps in
episodes where the average per-step reward is less than $\rho^* - \varepsilon$, and let $M_\varepsilon$
be the respective indices of these episodes. Note that setting
$\Delta_i := \sum_{t=t_i}^{t_{i+1}-1} (\rho^* - r_t)$ to be the regret in episode $i$ we have that
$$\Delta_\varepsilon := \sum_{i \in M_\varepsilon} \Delta_i \;\geq\; \varepsilon T_\varepsilon. \qquad (7)$$
Having this lower bound on $\Delta_\varepsilon$, we now aim at an upper bound on $\Delta_\varepsilon$ in
terms of $T_\varepsilon$. These two bounds will then give us the desired regret bound. The derivation of
this upper bound is essentially the same as in the extended version of (Auer et al., 2009), so we do not
repeat it here and only state that it can be shown that
$$\Delta_\varepsilon \;\leq\; 1 + \sqrt{\tfrac{T_\varepsilon}{2}\log\tfrac{T}{\delta}} + 2D\sqrt{T_\varepsilon \log\tfrac{T}{\delta}} + D\cdot\#\text{episodes} + \sqrt{\log\tfrac{2|A|T}{\delta}}\;\sum_{s,a}\sqrt{n_\varepsilon(s,a)} \qquad (8)$$
with probability $1 - 3\delta$, where $n_\varepsilon(s,a)$ is the total number of times action $a$ is
chosen in $s$ in episodes in $M_\varepsilon$. Now we split $\sum_{s,a}\sqrt{n_\varepsilon(s,a)}$ into
one sum handling the actions of $\beta$-bad color $\notin C_\beta$ and another sum for all other
actions, where
$$C_\beta := \bigcup_{s,p}\bigl\{c \in C(s,p) \,\big|\, r^*(s,p) - r^+(c) \leq \beta + 2\theta\bigr\}$$
with $r^*(s,p) := \max_{a: c(a) \in C(s,p)} r(s,a)$ and $r^+(c) := \max_{a: c(a)=c} r(s,a)$. Then,
similarly to the bandit case, whenever the confidence interval $\mathrm{conf}_t(s,c)$ of such a
$\beta$-bad color $c$ is smaller than $\beta/4$ at the beginning of an episode, the respective color
will not be part of $C_t(s,p)$. Consequently, by definition of $\mathrm{conf}_t(s,c)$ the number of
times an action with that color is chosen in state $s$ is upper bounded by
$\frac{112\log(2|C|T)}{\beta^2}$, the additional factor 2 stemming from the fact that the confidence
intervals are only updated at the beginning of an episode (in which the number of times the respective
action is chosen may be doubled). Consequently, for any $\beta$-bad color $c \notin C_\beta$
$$\sum_{s,a: c(a)=c} n_\varepsilon(s,a) \;\leq\; \frac{112\log(2|C|T)}{\beta^2}, \qquad (9)$$
whence one obtains by Jensen's inequality that
$$\sum_{s,a}\sqrt{n_\varepsilon(s,a)} \;\leq\; \sqrt{|A_\beta| T_\varepsilon} + \frac{11}{\beta}\sqrt{|A||C|\log(2|C|T)},$$
where $A_\beta := \{a \in A \,|\, c(a) \in C_\beta\}$. This yields from (8)
}. This yields from (8)
that
ε
q
T
ε
2
log
T
δ
+ 2D
q
T
ε
log
T
δ
+ D·#episodes
+
q
|A
β
|T
ε
log
2|A|T
δ
+
11
β
q
|A||C|log
2|A|T
δ
log(2|C|T) + 1, (10)
so that it remains to upper bound the number of episodes. By the doubling criterion for episode
termination it is easy to see that in general there are not more than $|A|\log_2\frac{8T}{|A|}$
episodes (cf. Appendix A.2 of the extended version of (Auer et al., 2009) for details). However, again
considering actions of $\beta$-good color $\in C_\beta$ and others separately, according to (9) this
can be improved to a bound of $|A_\beta|\log\frac{8T}{|A_\beta|} + |A|\log\frac{896\log(2|C|T)}{|A|\beta^2}$.
Putting this into (10) we get in combination with (7)
$$T_\varepsilon \;\leq\; c_1\cdot\frac{\bigl(D^2 + |A_\beta|\bigr)\log\frac{T}{\delta}}{\varepsilon^2} \;+\; c_2\cdot\frac{\sqrt{|A||C|}\,\log\frac{T}{\delta}}{\varepsilon\beta} \;+\; c_3\cdot\frac{D|A_\beta|\log T + D|A|\log\log T}{\varepsilon}. \qquad (11)$$
As $\Delta_\varepsilon$ is an upper bound on the regret with respect to an $\varepsilon$-optimal policy,
we may plug (11) into (10) to obtain after some calculations the following result.
Theorem 4. The regret of the colored MDP algorithm with respect to an $\varepsilon$-optimal policy after
$T$ steps is with probability at least $1 - 3\delta$ upper bounded by (ignoring terms sublogarithmic in $T$)
$$c_1\cdot\frac{\bigl(D^2 + |A_\beta|\bigr)\log\frac{T}{\delta}}{\varepsilon} \;+\; c_2\cdot\frac{\sqrt{|A||C|}\,\log\frac{T}{\delta}}{\beta} \;+\; c_3\cdot\frac{\bigl(D + \sqrt{|A_\beta|}\bigr)\sqrt[4]{|A||C|}\,\log\frac{T}{\delta}}{\sqrt{\beta\varepsilon}} \;+\; D|A_\beta|\log T.$$
For sufficiently small $\varepsilon$, an $\varepsilon$-optimal policy is also optimal, which yields a
corresponding bound with $\varepsilon$ replaced by the difference between the optimal and the highest
suboptimal average reward, i.e., $g := \rho^* - \max_{\pi: \rho(\pi) < \rho^*} \rho(\pi)$.
Thus, as in the bandit case, the learner can benefit from the color information (as can be seen by
comparing the bounds to the case without color information, i.e., $|C| = 1$ and $A_\beta = A$). The
reason why there is still some dependency on the total number of actions is that the doubling criterion
for episode termination concerns the actions and not their colors. However, if the episode termination
criterion is adapted to apply to colors instead of actions, other parts of the proof no longer go through.
3.2 Action Aggregation
Now let us consider the case where actions of the same color have similar but not necessarily identical
transition probabilities. As before, we are only interested in similar actions that are available in the
same state. Thus we again assume for the sake of simplicity that only actions contained in the same set
$A(s)$ have the same color.
Assumption 5. There are $\theta_r, \theta_p > 0$ such that for each two actions $a, a' \in A(s)$ with $s \in S$: If $c(a) = c(a')$ then (i) $|r(s,a) - r(s,a')| < \theta_r$, and (ii) $\sum_{s' \in S} |p(s'|s,a) - p(s'|s,a')| < \theta_p$.
Unlike in the settings considered so far, it is by
no means clear what happens if one simply chooses
a representative of each color and works on the ag-
gregated MDP. In this section we derive error bounds
that answer this question.
Definition 6. Given an MDP $M = \langle S,A,p,r\rangle$ and a coloring function $c : A \to C$ for the
actions, an MDP $\widehat{M} = \langle S,C,\hat{p},\hat{r}\rangle$ is called an aggregation of $M$ with
respect to $c$ if for $a \in A(s)$ with $c(a) = c$:
$$\bigl|\hat{r}(s,c) - r(s,a)\bigr| < \theta_r, \quad\text{and}\quad \sum_{s' \in S} \bigl|\hat{p}(s'|s,c) - p(s'|s,a)\bigr| < \theta_p.$$
Thus, beside picking an arbitrary reference action $a$ for each color $c$, one may also set e.g.
$\hat{r}(s,c) := \frac{1}{|A_c|}\sum_{a \in A_c} r(s,a)$ and
$\hat{p}(s'|s,c) := \frac{1}{|A_c|}\sum_{a \in A_c} p(s'|s,a)$, where $A_c := \{a \in A \,|\, c(a) = c\}$.
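As an illustration of this averaging aggregation, the following hedged Python sketch (not from the paper) builds $\hat p$ and $\hat r$ from tabular MDP data; the dictionary-based data layout and the function name aggregate_mdp are assumptions made for the example.

```python
from collections import defaultdict

def aggregate_mdp(p, r, color):
    """Aggregate an MDP over same-colored actions by averaging (cf. Definition 6).

    p:     dict (s, a) -> dict s' -> transition probability
    r:     dict (s, a) -> mean reward
    color: dict a -> color (same-colored actions are assumed to share a state)
    Returns (p_hat, r_hat) indexed by (s, color).
    """
    members = defaultdict(list)           # (s, c) -> actions of color c in s
    for (s, a) in r:
        members[(s, color[a])].append(a)

    p_hat, r_hat = {}, {}
    for (s, c), acts in members.items():
        r_hat[(s, c)] = sum(r[(s, a)] for a in acts) / len(acts)
        probs = defaultdict(float)
        for a in acts:
            for s_next, q in p[(s, a)].items():
                probs[s_next] += q / len(acts)
        p_hat[(s, c)] = dict(probs)
    return p_hat, r_hat
```

Under Assumption 5 the averaged quantities deviate from those of each original action by less than $\theta_r$ and $\theta_p$, respectively, so this choice is indeed an aggregation in the sense of Definition 6.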
A policy $\hat{\pi}$ on $\widehat{M}$ is the aggregation of a policy $\pi$ on $M$ with respect to a
coloring function $c : A \to C$, if $c(\pi(s)) = \hat{\pi}(s)$ for all states $s$. In the following we
consider only ergodic MDPs, in which all policies induce ergodic Markov chains. However, if an
aggregation preserves the ergodicity structure of the MDP, the following results can be adapted to the
general case.
Theorem 7. Let $M = \langle S,A,p,r\rangle$ be an MDP and $\widehat{M} = \langle S,C,\hat{p},\hat{r}\rangle$
an aggregation of $M$ with respect to a coloring function $c : A \to C$. Then for each policy $\pi$ on $M$
and the aggregation $\hat{\pi}$ of $\pi$ on $\widehat{M}$ with respect to $c$, we have for the difference
of the average reward $\rho(\pi)$ of $\pi$ in $M$ and the average reward $\hat{\rho}(\hat{\pi})$ of
$\hat{\pi}$ in $\widehat{M}$
$$\bigl|\rho(\pi) - \hat{\rho}(\hat{\pi})\bigr| \;<\; \theta_r + (\kappa_\pi - 1)\,\theta_p,$$
where $\kappa_\pi$ is the mixing time of the Markov chain induced by $\pi$ on $M$.
As for the bounds on the error of state aggrega-
tion of (Ortner, 2007), in the proof of Theorem 7 we
use the following result of (Hunter, 2006) on pertur-
bations of Markov chains.
Theorem 8. (Hunter, 2006) Let $\mathcal{C}, \widetilde{\mathcal{C}}$ be two ergodic Markov chains on the
same state space $S$ with transition probabilities $p(\cdot,\cdot)$, $\tilde{p}(\cdot,\cdot)$ and
stationary distributions $\mu$, $\tilde{\mu}$, and let $\kappa_{\mathcal{C}}$ be the mixing time of
$\mathcal{C}$. Then
$$\|\mu - \tilde{\mu}\|_1 \;\leq\; (\kappa_{\mathcal{C}} - 1)\,\max_{s \in S} \sum_{s' \in S} \bigl|p(s,s') - \tilde{p}(s,s')\bigr|.$$
Proof of Theorem 7: Writing $\mu_\pi$ and $\mu_{\hat{\pi}}$ for the stationary distributions of $\pi$ on
$M$ and $\hat{\pi}$ on $\widehat{M}$, respectively, and abbreviating $r(s,\pi(s))$ with $r_\pi(s)$, we have
by (6)
$$\bigl|\rho(\pi) - \hat{\rho}(\hat{\pi})\bigr| = \Bigl|\sum_s \mu_\pi(s)\,r_\pi(s) - \sum_s \mu_{\hat{\pi}}(s)\,r_{\hat{\pi}}(s)\Bigr| \;\leq\; \sum_s \bigl|\mu_\pi(s) - \mu_{\hat{\pi}}(s)\bigr|\,r_\pi(s) + \sum_s \mu_{\hat{\pi}}(s)\,\bigl|r_{\hat{\pi}}(s) - r_\pi(s)\bigr|.$$
As $r_\pi(s) \leq 1$, using Theorem 8 together with our assumptions on aggregation gives
$$\bigl|\rho(\pi) - \hat{\rho}(\hat{\pi})\bigr| \;<\; \sum_{s \in S} \bigl|\mu_\pi(s) - \mu_{\hat{\pi}}(s)\bigr| + \sum_{s \in S} \mu_{\hat{\pi}}(s)\,\theta_r \;<\; (\kappa_\pi - 1)\,\theta_p + \theta_r.$$
Corollary 9. Let $\pi^*$ be an optimal policy on an MDP $M$ with optimal average reward $\rho^*$, and
let $\hat{\pi}^*$ be an optimal policy with optimal average reward $\hat{\rho}^*$ on an aggregation
$\widehat{M}$ of $M$ with respect to some coloring function $c : A \to C$. Then
(i) $|\rho^* - \hat{\rho}^*| < \theta_r + (\kappa_M - 1)\,\theta_p$,
(ii) $\rho^* < \rho(\hat{\pi}^*_e) + 2\theta_r + 2(\kappa_M - 1)\,\theta_p$,
where $\kappa_M := \max_\pi \kappa_\pi$, and $\hat{\pi}^*_e$ is any policy on $M$ such that $\hat{\pi}^*$
is the aggregation of $\hat{\pi}^*_e$ with respect to $c$.
Remark: As the role of $\mathcal{C}$ and $\widetilde{\mathcal{C}}$ in Theorem 8
is symmetric, Theorem 7 and Corollary 9 hold also
when the mixing time of M is replaced with the mix-
ing time of the aggregated MDP. Hence, the results
also hold for the minimum of the two mixing times.
The following theorem shows that the error in av-
erage reward indeed becomes arbitrarily large when
the mixing time approaches infinity.
Theorem 10. For each $\theta_p > 0$ and each sufficiently small $\eta > 0$ there is an MDP
$M = \langle S,A,p,r\rangle$ and a coloring $c : A \to C$ of the action space such that in each
aggregation $\widehat{M}$ of $M$ with respect to $c$ there is some policy $\pi$ on $M$ such that for the
respective aggregated policy $\hat{\pi}$ on $\widehat{M}$,
$$\bigl|\rho(\pi) - \hat{\rho}(\hat{\pi})\bigr| \;\geq\; 1 - \eta.$$
Proof. Fix some $\theta_p > 0$ and consider for $\delta \in (0, \frac{\theta_p}{2})$ an MDP with
$S = \{s_1, s_2\}$, $A(s_1) = \{a_1, a_2\}$, and $A(s_2) = \{a_3\}$. Define the transition probabilities
(cf. Figure 3) in $s_1$ as $p(s_1|s_1,a_1) = 1 - \delta$, $p(s_2|s_1,a_1) = \delta$,
$p(s_1|s_1,a_2) = 1 - \frac{\delta}{n^2}$, and $p(s_2|s_1,a_2) = \frac{\delta}{n^2}$, and those in $s_2$
as $p(s_1|s_2,a_3) = \frac{\delta}{n}$ and $p(s_2|s_2,a_3) = 1 - \frac{\delta}{n}$, where
$n \in \mathbb{N}$. Then the stationary distribution of the policy $\pi$ with $\pi(s_1) = a_2$ and
$\pi(s_2) = a_3$ is $\mu_\pi = \bigl(\frac{n}{n+1}, \frac{1}{n+1}\bigr)$, which for $n \to \infty$
converges to $(1,0)$. On the other hand, the policy $\pi'$ with $\pi'(s_1) = a_1$ and $\pi'(s_2) = a_3$
has stationary distribution $\mu_{\pi'} = \bigl(\frac{1}{n+1}, \frac{n}{n+1}\bigr)$, which for
$n \to \infty$ converges to $(0,1)$.
Figure 3: The MDP in the proof of Theorem 10. Solid arrows correspond to action $a_1$, dashed ones to $a_2$, and dotted ones to $a_3$.

Now, as $\delta < \frac{\theta_p}{2}$, a coloring function $c$ may assign (choosing an arbitrary
$\theta_r > 0$, cf. the choice of
the rewards below) the same color $c_1$ to actions $a_1$ and $a_2$ in $s_1$. We consider transition
probabilities $\hat{p}(\cdot|s_1,c_1)$ which are a convex combination of the respective probabilities
$p(\cdot|s_1,a_1)$ and $p(\cdot|s_1,a_2)$, i.e.
$$\hat{p}(s_2|s_1,c_1) = \lambda\delta + (1-\lambda)\tfrac{\delta}{n^2} \quad\text{and}\quad \hat{p}(s_1|s_1,c_1) = 1 - \lambda\delta - (1-\lambda)\tfrac{\delta}{n^2}$$
for $\lambda \in [0,1]$. (For transition probabilities outside this convex set the error will clearly be
larger.) Then the stationary distribution of the aggregated policy $\hat{\pi}$ with
$\hat{\pi}(s_1) = c_1$ and $\hat{\pi}(s_2) = a_3$ is
$\mu_{\hat{\pi}} = \bigl(\frac{n}{n^2\lambda + n - \lambda + 1}, \frac{n^2\lambda - \lambda + 1}{n^2\lambda + n - \lambda + 1}\bigr)$.
Thus, if $\lambda > 0$ then $\mu_{\hat{\pi}}$ converges to $(0,1)$ for $n \to \infty$. Otherwise, for
$\lambda = 0$ we have $\mu_{\hat{\pi}} = \bigl(\frac{n}{n+1}, \frac{1}{n+1}\bigr)$, which for
$n \to \infty$ converges to $(1,0)$. Now, setting the rewards $r(s_1,a_1) := r(s_1,a_2) := 1$ and
$r(s_2,a_3) := 0$ (and choosing appropriate rewards $\hat{r}$), we obtain an error arbitrarily close to 1
either with respect to $\pi$ or to $\pi'$, which proves the theorem.
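The construction can also be checked numerically. The following hedged Python snippet (not part of the paper) computes the stationary distributions of $\pi$, $\pi'$, and the aggregated policy for concrete values of n, δ, and λ, illustrating that the aggregation error approaches 1 as n grows.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a 2-state ergodic chain via the balance equation."""
    # mu1 * P[0,1] = mu2 * P[1,0]  =>  mu1/mu2 = P[1,0] / P[0,1]
    ratio = P[1, 0] / P[0, 1]
    mu1 = ratio / (1 + ratio)
    return np.array([mu1, 1 - mu1])

delta, n, lam = 0.1, 1000, 0.5
# Rewards: r(s1, .) = 1, r(s2, a3) = 0, so the average reward equals mu(s1).
P_pi      = np.array([[1 - delta / n**2, delta / n**2],        # pi uses a2 in s1
                      [delta / n,        1 - delta / n]])
P_pi_dash = np.array([[1 - delta,        delta],               # pi' uses a1 in s1
                      [delta / n,        1 - delta / n]])
p_hat = lam * delta + (1 - lam) * delta / n**2                 # aggregated color c1
P_agg = np.array([[1 - p_hat, p_hat],
                  [delta / n, 1 - delta / n]])

rho_pi, rho_dash, rho_hat = (stationary(P)[0] for P in (P_pi, P_pi_dash, P_agg))
print(rho_pi, rho_dash, rho_hat)
# For large n: rho_pi ~ 1, rho_dash ~ 0, and rho_hat ~ 0 (for lambda > 0),
# so |rho(pi) - rho_hat(pi_hat)| gets arbitrarily close to 1.
```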
The lower bound of Theorem 10 indicates that ex-
ploiting similarity of transition probabilities is harder
than for rewards. Here we confirm this by showing
that an optimistic algorithm in the style of UCRL2
fails already in very simple situations.
Example: Consider a two-state MDP with $S = \{s_1, s_2\}$, $A(s_1) = \{a_1, a_2\}$, and
$A(s_2) = \{a_3\}$ as shown in Figure 4. The transition probabilities in $s_1$ are set to
$p(s_1|s_1,a_1) = \frac{3}{4}$, $p(s_2|s_1,a_1) = \frac{1}{4}$, $p(s_1|s_1,a_2) = \frac{1}{2}$, and
$p(s_2|s_1,a_2) = \frac{1}{2}$. In state $s_2$ the transition probabilities are
$p(s_1|s_2,a_3) = p(s_2|s_2,a_3) = \frac{1}{2}$. The mean rewards are given by
$r(s_1,a_1) = \frac{1}{2}$, $r(s_1,a_2) = \frac{5}{12}$, and $r(s_2,a_3) = \frac{3}{4}$. Thus, setting
$\theta_p > 1$ and $\theta_r > \frac{1}{12}$ we may color actions $a_1$ and $a_2$ with the same color
$c_1$. The intervals of possible values of actions with color $c_1$ are then
$\tilde{p}(s_1|s_1,c_1) \in \bigl[\frac{1}{2}, \frac{3}{4}\bigr]$,
$\tilde{p}(s_2|s_1,c_1) \in \bigl[\frac{1}{4}, \frac{1}{2}\bigr]$, and
$\tilde{r}(s_1,c_1) \in \bigl[\frac{5}{12}, \frac{1}{2}\bigr]$.
An optimistic algorithm that handles this information as in a bounded parameter MDP (Tewari and
Bartlett, 2007) would now assume values $\tilde{p}(s_1|s_1,c_1) = \tilde{p}(s_2|s_1,c_1) = \frac{1}{2}$
and $\tilde{r}(s_1,c_1) = \frac{1}{2}$ in order to maximize the average reward of a policy playing an
action with color $c_1$ in state $s_1$, which then is
Figure 4: The MDP from the example. Solid arrows correspond to action $a_1$, dashed ones to $a_2$, and dotted ones to $a_3$. For each arrow the respective transition probability and the reward are displayed.
$\tilde{\rho} = \frac{1}{2}\cdot\frac{1}{2} + \frac{1}{2}\cdot\frac{3}{4} = \frac{5}{8}$, since the
stationary distribution in this case is obviously $\bigl(\frac{1}{2}, \frac{1}{2}\bigr)$. However, the
real achievable average reward for actions $a_1$ (which gives stationary distribution
$\bigl(\frac{2}{3}, \frac{1}{3}\bigr)$) and $a_2$ (with stationary distribution
$\bigl(\frac{1}{2}, \frac{1}{2}\bigr)$) of color $c_1$ is
$\rho(a_1) = \frac{2}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot\frac{3}{4} = \frac{7}{12}$ and
$\rho(a_2) = \frac{1}{2}\cdot\frac{5}{12} + \frac{1}{2}\cdot\frac{3}{4} = \frac{7}{12}$.
Hence, if we add another action $a_4$ to state $s_1$ with $p(s_1|s_1,a_4) = 1$ and
$r(s_1,a_4) \in \bigl(\frac{7}{12}, \frac{5}{8}\bigr)$, the optimal policy would choose $a_4$ in $s_1$.
However, as the algorithm expects the larger average reward of $\tilde{\rho}$, action $a_4$ would not be
chosen.
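For completeness, a short hedged Python check (not part of the paper) of the numbers in this example, comparing the optimistic value $\tilde{\rho}$ with the true average rewards of the two same-colored actions:

```python
from fractions import Fraction as F

def two_state_avg_reward(p_leave_s1, r_s1, r_s2=F(3, 4), p_leave_s2=F(1, 2)):
    """Average reward of the 2-state chain: stationary distribution times rewards."""
    mu1 = p_leave_s2 / (p_leave_s1 + p_leave_s2)   # balance: mu1 * p12 = mu2 * p21
    return mu1 * r_s1 + (1 - mu1) * r_s2

rho_a1 = two_state_avg_reward(F(1, 4), F(1, 2))    # true chain of a1: 7/12
rho_a2 = two_state_avg_reward(F(1, 2), F(5, 12))   # true chain of a2: 7/12
rho_opt = two_state_avg_reward(F(1, 2), F(1, 2))   # optimistic values:  5/8
print(rho_a1, rho_a2, rho_opt)                     # 7/12 7/12 5/8
# Any a4 with a self-loop and reward in (7/12, 5/8) beats both real options,
# yet the optimistic value 5/8 of color c1 keeps the algorithm from choosing it.
```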
4 DISCUSSION AND PROBLEMS
Our aim was to investigate the simplest possible similarity model for discrete bandits / action spaces. We
think that the shift of analysis from the bandit to the
MDP case may serve as a blueprint for much more
general settings where the action space of an MDP
is e.g. a metric space with Lipschitz condition, when
an appropriate bandit algorithm like the zooming al-
gorithm (Kleinberg et al., 2008) may be similarly
adapted. Indeed, this particular setting would be a
proper generalization of the scenario considered in
this paper, as the zooming algorithm can handle the
colored bandits case by defining a special metric d,
where $d(a,a') := \theta$ if $c(a) = c(a')$ and $d(a,a') := 1$ otherwise (although it is not quite clear whether the
same regret bounds are achievable). However, these
considerations need further investigation.
We have tried to exploit similarity information
only with respect to the actions. As many real-world
problems (also) have large state spaces, it is a natu-
ral question whether a similar approach would work
for coloring the state space of an MDP. However, it is
not quite clear how similarity for states could be used
in principle. The most natural thing to do would be to
choose in a state s (of color c) that has not been visited
before an action that proved to be successful in other
states of the same color. This would obviously lead to
some sort of state aggregation similarly to the action
aggregation concept considered in Section 3.2. How-
ever, for this setting it has already been shown, by lower bounds resembling that of Theorem 10, that
state aggregation may cause arbitrarily large error as well (Ortner, 2007). Still, regret bounds that consider mixing
time parameters of the MDP may be possible.
ACKNOWLEDGEMENTS
The author is grateful for the input of the anonymous
reviewers of an earlier version of (parts of) this pa-
per. In particular, the remark before Theorem 8 and
the ideas about the zooming algorithm are due to two
reviewers. This work was supported in part by the
Austrian Science Fund FWF (S9104-N13 SP4). The
research leading to these results has received funding
from the European Community’s Seventh Framework
Programme (FP7/2007-2013) under grant agreements
n° 216886 (PASCAL2), and n° 216529 (PinView).
This publication only reflects the authors’ views.
REFERENCES
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-
time analysis of the multi-armed bandit problem.
Mach. Learn., 47:235–256.
Auer, P., Jaksch, T., and Ortner, R. (2009). Near-optimal
regret bounds for reinforcement learning. In Adv. Neu-
ral Inf. Process. Syst. 21, pages 89–96. (full version
http://www.unileoben.ac.at/infotech/publications/
TR/CIT-2009-01.pdf).
Hunter, J. J. (2006). Mixing times with applications to
perturbed Markov chains. Linear Algebra Appl.,
417:108–123.
Kleinberg, R., Slivkins, A., and Upfal, E. (2008). Multi-
armed bandits in metric spaces. In Proceedings STOC
2008, pages 681–690.
Kleinberg, R. D. (2005). Nearly tight bounds for the
continuum-armed bandit problem. In Adv. Neural Inf.
Process. Syst. 17, pages 697–704.
Mannor, S. and Tsitsiklis, J. N. (2004). The sample com-
plexity of exploration in the multi-armed bandit prob-
lem. J. Mach. Learn. Res., 5:623–648.
Ortner, R. (2007). Pseudometrics for state aggregation in
average reward Markov decision processes. In Pro-
ceedings of ALT 2007, pages 373–387.
Pandey, S., Chakrabarti, D., and Agarwal, D. (2007). Multi-
armed bandit problems with dependent arms. In Pro-
ceedings of ICML 2007, pages 721–728.
Puterman, M. L. (1994). Markov Decision Processes: Dis-
crete Stochastic Dynamic Programming. John Wiley
& Sons, Inc., New York, NY, USA.
Tewari, A. and Bartlett, P. L. (2007). Bounded parameter
Markov decision processes with average reward crite-
rion. In Proceedings of COLT 2007, pages 263–277.