EXPLOITING SIMILARITY INFORMATION IN REINFORCEMENT
LEARNING
Similarity Models for Multi-Armed Bandits and MDPs
Ronald Ortner
Lehrstuhl für Informationstechnologie, Montanuniversität Leoben, Austria
Keywords:
Reinforcement learning, Markov decision process, Multi-armed bandit, Similarity, Regret.
Abstract:
This paper considers reinforcement learning problems with additional similarity information. We start with
the simple setting of multi-armed bandits in which the learner knows for each arm its color, where it is
assumed that arms of the same color have close mean rewards. An algorithm is presented that shows that
this color information can be used to improve the dependency of online regret bounds on the number of
arms. Further, we discuss to what extent this approach can be extended to the more general case of Markov
decision processes. For the simplest case where the same color for actions means similar rewards and identical
transition probabilities, an algorithm and a corresponding online regret bound are given. For the general case
where the same color implies only close but not necessarily identical transition probabilities, we give upper and lower
bounds on the error caused by action aggregation with respect to the color information. These bounds also imply that
the general case is far more difficult to handle.
1 INTRODUCTION
Algorithms for reinforcement learning problems suffer from the curse of dimensionality when either the
action space or the state space is large. In contrast, in many of these problems humans have no
difficulties in learning, as they are able to structure the state space and the action space in a
favorable way. In many cases, this structural information concerns similarity of states and actions.
Here we investigate to what extent similarity information can be exploited to improve over the
performance in the case where no such information is given. Although
our main interest lies in Markov decision processes
(MDPs), we start with a multi-armed bandit problem
with a simple similarity model: For each arm there is
an additional color information available, where arms
of the same color are assumed to have close mean re-
wards, that is, these deviate by at most θ, a parameter
known to the learner. Indeed, a similar model has al-
ready been considered by (Pandey et al., 2007), who
also give a typical application to an ad-selection prob-
lem on webpages, where ads with similar content are
similarly attractive to the user and get comparable re-
ward (i.e., user clicks). Similarity information of the given kind also seems natural in many of the
other numerous applications of multi-armed bandits, such as routing, wireless networks, design of
experiments, or pricing (for references see e.g. (Kleinberg, 2005)).
In Section 2 below we present an algorithm that
is able to exploit color information, as the derived
bounds on the regret with respect to the best arm
show: While online regret bounds for ordinary bandit
problems (which usually are logarithmic in the num-
ber of steps taken) grow linearly with the number of
actions, with color information the total number of ac-
tions can be replaced with the number of colors plus
the number of arms with promising color.
In the subsequent Section 3 we consider the more
general setting of Markov decision processes where
color information for the actions is available. We
start by examining the simplest case where actions of the same color have similar rewards (again
measured by a parameter θ) and identical transition probabilities. For this setting we give an
adaptation of the UCRL2 algorithm of (Auer et al., 2009), for which we show regret bounds
demonstrating that, similarly to the bandit setting, the color information can be exploited to
obtain improved bounds.
When this setting is generalized so that actions with the same color have only similar but not
necessarily identical transition probabilities, things get more complicated. In Section 3.2, we
investigate action aggregation with respect to such colorings. We derive bounds
on the error caused by working on the aggregated in-
stead of the original MDP. Unlike in the simpler set-
tings where this error is trivially bounded by the pa-
rameter θ, the error can be arbitrarily large, depend-
ing on the (aggregated) MDP. This indicates that sim-
ilarity information regarding the transition probabili-
ties cannot be as well exploited as for rewards, which
is confirmed by an example at the end of Section 3
that shows that straightforward adaptations of the al-
gorithm for the simpler setting fail.
2 COLORED BANDITS
In a multi-armed bandit problem the learner has a fi-
nite set of arms A at his disposal. Choosing an arm a
from A gives a random reward bounded in the unit in-
terval [0,1] with mean r(a). As performance measure
for a learning algorithm one usually considers its re-
gret with respect to choosing the optimal arm at each
step. That is, setting τ(a) to be the number of steps
where arm a has been chosen (up to some finite hori-
zon T) the regret is defined as
$$\sum_{a \in A} \tau(a)\,\bigl(r^* - r(a)\bigr),$$
where $r^* = \max_{a} r(a)$ is the optimal mean reward.
The regret of established bandit algorithms such as
UCB1 (Auer et al., 2002) is logarithmic in the num-
ber of steps, but grows linearly with the number of
arms. This is also best possible (Mannor and Tsitsik-
lis, 2004).
Unlike in the general case, where the learner has
no information apart from A, here we are interested in
the question how given similarity information about
different arms can be exploited to improve regret
bounds with respect to the dependency on the number
of arms. That is, the learner additionally knows the color of each arm, given by a coloring function
$c : A \to C$ that assigns each arm in A a color from a given set of colors C. (We assume that the
function c is surjective, i.e., each color in C is assigned to an arm in A.) The
color gives some similarity information about the re-
wards of arms according to the following assumption.
Assumption 1. There is a $\theta > 0$ such that for each two arms $a, a' \in A$: If $c(a) = c(a')$ then $|r(a) - r(a')| < \theta$.
We assume that the learner knows the parame-
ter θ. This setting is similar to the one considered
in (Pandey et al., 2007). However, there it is as-
sumed that choosing an arm is a Bernoulli trial that
gives reward 1 with some success probability p and
reward 0 otherwise. Further, our Assumption 1 is re-
placed with the supposition that the success probabil-
ities p of arms of the same color are distributed ac-
cording to a common probability distribution.
2.1 Algorithm
An obvious idea is to adapt a standard bandit algo-
rithm to first choose a color c and then in a second step
to choose an arm with color c. This idea also underlies
the TLP algorithm of (Pandey et al., 2007). However
there is a problem with that direct approach when two colors are very close, as it takes
$\Omega\bigl(\frac{1}{\varepsilon^2}\bigr)$ steps to distinguish a distance of $\varepsilon$ between two
arms/colors (cf. the analysis of (Pandey et al., 2007), which does not derive regret bounds, but only
considers the convergence behavior of the TLP algorithm). Our algorithm (shown in Figure 1) does not
try to identify the best color $c^*$ but instead forms a set $C_t$ of good colors. A distance
parameter $\beta$ determines how large the distance between the best color $c^*$ and another color $c$
has to be in order to consider $c$ suboptimal and exclude it from $C_t$. Unlike (Pandey et al., 2007),
we do not maintain a single estimate value for each color $c$, but calculate a confidence interval for
each color that w.h.p. contains the mean reward of the best arm of color $c$.
2.2 Analysis
In order to derive an upper bound on the regret of
the algorithm, we will consider (i) for how many
steps suboptimal colors are included in $C_t$, and (ii) when $C_t$ contains only close to optimal
colors, how often will a suboptimal arm be chosen? Question (ii) is answered by the original UCB1
analysis taken from (Auer et al., 2002). For question (i) this has to be adapted. Let
$r^+(c) := \max_{a: c(a)=c} r(a)$ and $r^-(c) := \min_{a: c(a)=c} r(a)$. We assume that at each step $t$
of the algorithm
$$r^+(c) \;\geq\; \hat{r}_t(c) - \mathrm{conf}_t(c), \quad\text{and} \qquad (1)$$
$$r^-(c) \;\leq\; \hat{r}_t(c) + \mathrm{conf}_t(c) \qquad (2)$$
for each color $c$, and that
$$\hat{r}_t(a_t) \;\leq\; r(a_t) + \sqrt{\tfrac{\log(t^3/\delta)}{2 n_t(a_t)}}, \quad\text{and} \qquad (3)$$
$$\hat{r}_t(a^*) \;\geq\; r(a^*) - \sqrt{\tfrac{\log(t^3/\delta)}{2 n_t(a^*)}}, \qquad (4)$$
where $a^* = \operatorname{argmax}_{a \in A} r(a)$ is an arm with maximal mean reward $r^*$.
Application of Hoeffding's inequality shows that (1) as well as (2) holds with probability at least
$1 - \frac{\delta}{|C| t^3}$ for a fixed time step $t$, a fixed color $c$, and a fixed value of $n_t(c)$.
Thus, a union bound over all colors, all possible values of $n_t(c)$, and all $t$ shows that (1) and (2)
hold with probability at least $1 - 2\sum_t \frac{\delta}{t^2} > 1 - \frac{10}{3}\delta$ for all $t$
(using $\sum_{t \geq 1} t^{-2} = \pi^2/6$, so that $2\sum_t t^{-2} < \frac{10}{3}$). Similarly, (3)
and (4) hold with probability at least $1 - \frac{10}{3}\delta$ for all $t$.
Input: A confidence parameter $\delta \in (0,1)$, and a distance parameter $\beta \in (0,1)$.

Initialization: For each color $c \in C$ sample an action $a \in A$ with $c(a) = c$.

For time steps $t = 1, 2, \ldots$ do

Calculate confidence intervals $\hat{I}_t(c)$ for each color: For each color $c$ in $C$ calculate a confidence interval for $\max_{a: c(a)=c} r(a)$:
$$\hat{I}_t(c) := \bigl[\,\hat{r}_t(c) - \mathrm{conf}_t(c),\; \hat{r}_t(c) + \theta + \mathrm{conf}_t(c)\,\bigr],$$
where $\hat{r}_t(c) = \frac{1}{n_t(c)} \sum_{\tau < t:\, c(a_\tau)=c} r_\tau$ with $r_\tau$ being the random reward obtained at step $\tau$ for choosing arm $a_\tau$, $n_t(a)$ being the number of times action $a$ was chosen, and $n_t(c) := \sum_{a: c(a)=c} n_t(a)$ being the number of times an action with color $c$ has been chosen. Further,
$$\mathrm{conf}_t(c) := \sqrt{\tfrac{\log(|C| t^3 / \delta)}{2 n_t(c)}}.$$

Determine relevant colors $C_t$: Let $c_t := \operatorname{argmax}_{c \in C}\,\bigl\{\hat{r}_t(c) + \theta + \mathrm{conf}_t(c)\bigr\}$ be the color with maximal upper confidence bound value, and set $C_t := \bigl\{c \in C \,\big|\, \hat{I}_t(c) \cap \hat{I}_t(c_t) \neq \emptyset\bigr\}$.
If $\mathrm{conf}_t(c_t) \geq \beta/4$ and $\mathrm{conf}_t(c) < \beta/4$ for some $c \in C_t$, reset $C_t := \{c_t\}$.

Arm selection: Use UCB1 to choose an arm from $A_t := \{a \in A \,|\, c(a) \in C_t\}$, i.e., if there is an unsampled arm $a$ in $A_t$ choose $a$, otherwise choose
$$a_t := \operatorname{argmax}_{a \in A_t}\,\Bigl\{\hat{r}_t(a) + \sqrt{\tfrac{\log(t^3/\delta)}{2 n_t(a)}}\Bigr\},$$
where $\hat{r}_t(a) = \frac{1}{n_t(a)} \sum_{\tau < t:\, a_\tau = a} r_\tau$.

Figure 1: The colored bandits algorithm.
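To make the pseudocode of Figure 1 more concrete, the following is a minimal Python sketch of one possible implementation. It is not part of the paper; the class and method names (ColoredBandits, select_arm, update) are made up for illustration, and rewards are assumed to come from an arbitrary black-box returning values in [0,1].

```python
import math
from collections import defaultdict

class ColoredBandits:
    """Sketch of the colored bandits algorithm (Figure 1). Names are illustrative only.

    arms:  list of arm identifiers
    color: dict mapping each arm to its color
    theta: assumed bound on the reward gap within one color
    delta: confidence parameter, beta: distance parameter
    """
    def __init__(self, arms, color, theta, delta=0.05, beta=0.1):
        self.arms, self.color = arms, color
        self.colors = sorted(set(color.values()))
        self.theta, self.delta, self.beta = theta, delta, beta
        self.n = defaultdict(int)      # pulls per arm
        self.s = defaultdict(float)    # reward sum per arm
        self.t = 1

    def _conf(self, n, width_log):
        return math.sqrt(width_log / (2 * n))

    def select_arm(self):
        # Initialization: sample one arm of every color first.
        sampled_colors = {self.color[a] for a in self.arms if self.n[a] > 0}
        for c in self.colors:
            if c not in sampled_colors:
                return next(a for a in self.arms if self.color[a] == c)

        # Per-color statistics and confidence intervals.
        n_c, s_c = defaultdict(int), defaultdict(float)
        for a in self.arms:
            n_c[self.color[a]] += self.n[a]
            s_c[self.color[a]] += self.s[a]
        log_c = math.log(len(self.colors) * self.t ** 3 / self.delta)
        mean = {c: s_c[c] / n_c[c] for c in self.colors}
        conf = {c: self._conf(n_c[c], log_c) for c in self.colors}
        lower = {c: mean[c] - conf[c] for c in self.colors}
        upper = {c: mean[c] + self.theta + conf[c] for c in self.colors}

        # Relevant colors: intervals intersecting the interval of c_t.
        c_t = max(self.colors, key=lambda c: upper[c])
        C_t = [c for c in self.colors if upper[c] >= lower[c_t]]
        if conf[c_t] >= self.beta / 4 and any(conf[c] < self.beta / 4 for c in C_t):
            C_t = [c_t]

        # UCB1 on the arms of the relevant colors.
        A_t = [a for a in self.arms if self.color[a] in C_t]
        for a in A_t:                          # unsampled arm first
            if self.n[a] == 0:
                return a
        log_a = math.log(self.t ** 3 / self.delta)
        return max(A_t, key=lambda a: self.s[a] / self.n[a]
                                      + self._conf(self.n[a], log_a))

    def update(self, arm, reward):
        self.n[arm] += 1
        self.s[arm] += reward
        self.t += 1
```

A run would repeatedly call select_arm(), observe the reward of the chosen arm, and feed it back via update(); the per-color width used here matches $\mathrm{conf}_t(c)$ from Figure 1.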
Note that under assumptions (1) and (2) an optimal color $c^*$ (i.e., the color of an optimal arm $a^*$) is always in $C_t$, since
$$\hat{r}_t(c^*) + \mathrm{conf}_t(c^*) + \theta \;\geq\; r^-(c^*) + \theta \;\geq\; r^+(c^*) \;\geq\; r^+(c_t) \;\geq\; \hat{r}_t(c_t) - \mathrm{conf}_t(c_t).$$
Now, we establish sample complexity bounds both on (i) the number of times an arm of a color that is
$\beta$-far from the optimal color $c^*$ is chosen, and (ii) the number of times a suboptimal arm is
chosen from $C_t$ (assuming that $C_t$ contains only colors $\beta$-close to the optimal color). For
the bound on (ii), we may directly refer to (Auer et al., 2002), where it is shown that any suboptimal
arm $a$ is chosen at most $1 + \frac{8\log T}{(r^* - r(a))^2}$ times (w.h.p.). As playing such an arm
gives regret $r^* - r(a)$, this yields a bound of
$$\sum_{c \in C_\beta}\;\sum_{\substack{a: c(a)=c \\ r(a) < r^*}} \Bigl(1 + \frac{8\log T}{r^* - r(a)}\Bigr),$$
where $C_\beta := \{c \in C \,|\, r^* - r^+(c) \leq \beta + 2\theta\}$ is the set of colors that are
$\beta$-close to the optimal reward $r^*$.
For a bound on (i) we may easily adapt the mentioned proof as follows. Consider a $\beta$-bad color
$c \notin C_\beta$. Then $r^+(c) + \beta + 2\theta < r^*$. According to the algorithm, $c \in C_t$ only when
$$\hat{r}_t(c) + \mathrm{conf}_t(c) + \theta \;\geq\; \hat{r}_t(c_t) - \mathrm{conf}_t(c_t). \qquad (5)$$
Further, if $\mathrm{conf}_t(c) < \beta/4$ then $c \in C_t$ only in case $\mathrm{conf}_t(c_t) < \beta/4$, too.
But then we have from (1), (5), and the fact that $r^* \leq \hat{r}_t(c_t) + \mathrm{conf}_t(c_t) + \theta$ that
$$r^+(c) + \beta + 2\theta \;\geq\; \hat{r}_t(c) - \mathrm{conf}_t(c) + \beta + 2\theta \;\geq\; \hat{r}_t(c_t) - 2\,\mathrm{conf}_t(c) - \mathrm{conf}_t(c_t) + \beta + \theta \;\geq\; r^* - 2\,\mathrm{conf}_t(c_t) - 2\,\mathrm{conf}_t(c) + \beta \;>\; r^*,$$
contradicting our assumption that $c$ is a $\beta$-bad color. Hence, whenever $\mathrm{conf}_t(c) < \beta/4$
we have $c \notin C_t$, so that $c(a_t) = c$ at no more than
$\bigl\lceil \tfrac{8\log(|C|T^3/\delta)}{\beta^2} \bigr\rceil$ time steps.
Further, we have to consider the case when setting $C_t := \{c_t\}$, which may be a suboptimal choice as
well. However, this happens only when $\mathrm{conf}_t(c_t) \geq \beta/4$, that is, not more often than
$\bigl\lceil \tfrac{8\log(|C|T^3/\delta)}{\beta^2} \bigr\rceil$ times.
Summarizing (and also taking into account the regret
of the initialization), we get the following result.
Theorem 2. The regret of the colored bandits algorithm after $T$ steps is with probability at least
$1 - \frac{20}{3}\delta$ at most
$$|C| + 2|C|\Bigl\lceil \tfrac{8\log(|C|T^3/\delta)}{\beta^2} \Bigr\rceil \;+\; \sum_{\substack{a: c(a) \in C_\beta \\ r(a) < r^*}} \Bigl(1 + \frac{8\log T}{r^* - r(a)}\Bigr).$$
As the regret at each step is at most 1, we may simply sum up the error probabilities for failing
confidence intervals given in (1) and (3) to obtain a bound on the expected regret as well.
These bounds show that it is possible for the
learner to exploit color information in order to elimi-
nate the dependency on the total number of actions in
the respective regret bounds.
3 COLORED ACTION MDPS
We continue dealing with the natural generalization of
the problem to Markov decision processes. A Markov
decision process (MDP) is a tuple $M = \langle S, A, p, r\rangle$, where $S$ is a finite set of states
and $A$ is a finite set of actions. Unlike in the usual setting where in each state from $S$ each
action from $A$ is available, we consider that for each state $s$ there is a nonempty subset
$A(s) \subseteq A$ of actions available in $s$. Further, we assume that the sets $A(s)$ form a partition
of $A$, i.e., $A(s) \cap A(s') = \emptyset$ for $s \neq s'$, and $\bigcup_{s \in S} A(s) = A$. The
transition probabilities $p(s'|s,a)$ give the probability of reaching state $s'$ when choosing action
$a$ in state $s$, and the payoff distributions with mean $r(s,a)$ and support in $[0,1]$ specify the
random reward obtained for choosing action $a$ in state $s$.
We are interested in the undiscounted average reward $\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} r_t$,
where $r_t$ is the random reward obtained at step $t$. As possible strategies we consider (stationary)
policies $\pi : S \to A$ with $\pi(s) \in A(s)$. This is justified by the fact that there is always such
a policy $\pi^*$ which gives optimal reward (Puterman, 1994). Let $\rho(\pi)$ denote the expected
average reward of policy $\pi$. Then $\pi^*$ is an optimal policy if $\rho(\pi) \leq \rho(\pi^*) =: \rho^*$
for all policies $\pi$.
In the analysis we will also need some transition parameters of the MDP at hand. Thus let
$T(s'|M,\pi,s)$ be the first (random) time step in which state $s'$ is reached when policy $\pi$ is
executed on MDP $M$ with initial state $s$. Then we define the diameter of the MDP to be the average
time it takes to move from any state $s$ to any other state $s'$, using an appropriate policy, i.e.
$$D(M) := \max_{s \neq s' \in S}\; \min_{\pi: S \to A}\; \mathbb{E}\bigl[T(s'|M,\pi,s)\bigr].$$
We will consider only MDPs with finite diameter, which guarantees that there is always an optimal
policy that achieves optimal average reward $\rho^*$ independent of the initial state. Note that each
policy $\pi$ induces a Markov chain $M_\pi$ on $M$. If this Markov chain is ergodic (i.e., each state is
reachable from each other state after a finite number of steps), it has a state-independent stationary
distribution. The mixing time of a policy $\pi$ on an MDP $M$ with induced stationary distribution
$\mu_\pi$ is given by
$$\kappa_\pi(M) := \sum_{s' \in S} \mathbb{E}\bigl[T(s'|M,\pi,s)\bigr]\,\mu_\pi(s')$$
for an arbitrary $s \in S$. The definition is independent of the choice of $s$ as shown in
(Hunter, 2006). Finally, we remark that in case a policy $\pi$ induces an ergodic Markov chain on $M$
with stationary distribution $\mu_\pi$, the average reward of $\pi$ can be written as
$$\rho(\pi) = \sum_{s \in S} \mu_\pi(s)\, r(s, \pi(s)). \qquad (6)$$
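As a small illustration of these quantities, here is a hedged Python sketch (not part of the paper) that computes the stationary distribution of the Markov chain induced by a fixed policy and evaluates its average reward via (6). The transition matrix P_pi and reward vector r_pi are assumed to be given as NumPy arrays for an ergodic chain; the function names are made up for the example.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Stationary distribution mu of an ergodic chain with transition matrix P_pi.

    Solves mu P = mu together with the normalization sum(mu) = 1 by stacking
    the normalization equation onto (P^T - I) mu = 0 and solving in least squares.
    """
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def average_reward(P_pi, r_pi):
    """Average reward rho(pi) = sum_s mu_pi(s) r(s, pi(s)), cf. equation (6)."""
    return float(stationary_distribution(P_pi) @ r_pi)

# Example: a two-state chain with leave probabilities 0.1 and 0.5.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([1.0, 0.0])
print(average_reward(P, r))   # mu = (5/6, 1/6), so rho = 5/6
```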
3.1 When the Color Determines the
Transition Probabilities
The simplest case in the MDP scenario is when two
actions of the same color have identical transition
probabilities and close rewards, given a coloring function $c : A \to C$ on the action space.
Assumption 3. There is a $\theta > 0$ such that for each two actions $a, a' \in A(s)$ with $s \in S$: If $c(a) = c(a')$ then (i) $p(\cdot|s,a) = p(\cdot|s,a')$, and (ii) $|r(s,a) - r(s,a')| < \theta$.
We will try to exploit the color information only
for same-colored actions in the same state, so that
we may assume without loss of generality that actions
available in distinct states have distinct colors. Thus
the set of colors C is partitioned by the sets C(s) of
colors of actions that are available in state s. Fur-
ther, we have to distinguish between colors having
distinct transition probability distributions. Thus, we
write C(s, p) for the set of colors of actions available
in state s and having transition probabilities p(·).
3.1.1 Algorithm
The algorithm we propose (shown in Figure 2) is
a straightforward adaptation of the UCRL2 algo-
rithm (Auer et al., 2009). For the sake of simplicity,
we only consider the case where the transition struc-
ture of the MDP is known. The general case can be
handled analogously to (Auer et al., 2009). We use
the color information just as in the bandit case. That
is, in each state a set of promising colors is defined.
Then an optimal policy is calculated where the action
set is restricted to actions with promising colors, and
the actions’ rewards are set to their upper confidence
values. The algorithm proceeds in episodes $i$, and the chosen policy $\pi_i$ is executed until a state
$s$ is reached in which the action $\pi_i(s)$ has been played within the episode as often as before the
episode.
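To illustrate the color-filtering step, here is a hedged Python sketch (not from the paper) of how the promising colors and the restricted, optimistically rewarded action sets could be computed at the start of an episode. The dictionary-based data layout and the helper solve_optimistic_mdp (standing for an average-reward solver such as value iteration) are assumptions made for the example and are not shown.

```python
import math
from collections import defaultdict

def promising_actions(state_actions, color, trans_key, r_hat, n, t,
                      theta, delta, beta, num_colors):
    """Restrict each state's action set to actions of promising colors.

    state_actions: dict state -> list of available actions
    color:         dict action -> color
    trans_key:     dict action -> hashable id of its transition distribution
                   (actions of one color share the same id in this setting)
    r_hat, n:      dicts (state, action) -> empirical mean reward / visit count
    Returns dict state -> {action: optimistic reward} for the restricted sets.
    """
    def conf(cnt):
        # Counts of 0 are reset to 1, as in footnote 2 of Figure 2.
        return math.sqrt(7 * math.log(2 * num_colors * t / delta) / (2 * max(cnt, 1)))

    restricted = {}
    for s, actions in state_actions.items():
        # Group same-colored actions that share a transition distribution.
        groups = defaultdict(lambda: defaultdict(list))
        for a in actions:
            groups[trans_key[a]][color[a]].append(a)
        keep = set()
        for _p, by_color in groups.items():
            stats = {}
            for c, acts in by_color.items():
                n_c = sum(n[(s, a)] for a in acts)
                mean = sum(r_hat[(s, a)] * n[(s, a)] for a in acts) / max(n_c, 1)
                stats[c] = (mean - conf(n_c), mean + theta + conf(n_c), conf(n_c))
            c_best = max(stats, key=lambda c: stats[c][1])
            C_t = [c for c in stats if stats[c][1] >= stats[c_best][0]]
            if stats[c_best][2] >= beta / 4 and any(stats[c][2] < beta / 4 for c in C_t):
                C_t = [c_best]
            keep.update(a for c in C_t for a in by_color[c])
        # Optimistic rewards for the remaining actions (len(color) = |A|).
        restricted[s] = {
            a: r_hat[(s, a)]
               + math.sqrt(7 * math.log(2 * len(color) * t / delta)
                           / (2 * max(n[(s, a)], 1)))
            for a in actions if a in keep
        }
    return restricted

# An episode would then call an (assumed) average-reward solver on the known
# transition probabilities restricted to these action sets, e.g.:
# policy = solve_optimistic_mdp(transitions, promising_actions(...))
```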
3.1.2 Analysis
Again we are interested in the algorithm's regret after $T$ steps, defined as $T\rho^* - \sum_t r_t$.¹
Furthermore, we also consider the regret with respect to an $\varepsilon$-optimal policy, i.e., with
respect to $\rho^* - \varepsilon$ instead of $\rho^*$. The analysis of the algorithm's regret is a
combination of the respective proofs of Theorem 2 and of the logarithmic regret bounds in
(Auer et al., 2009). That is, we first establish a sample complexity bound on the number of steps in
episodes where
¹ Unlike in the bandit case, this regret definition also considers the deviations of the achieved
rewards from the mean rewards. Actually, the regret bounds for the bandit case can be adapted to this
alternative regret definition.
Input: A confidence parameter $\delta \in (0,1)$, and a distance parameter $\beta \in (0,1)$.
Notation: Let $t$ denote the current time step.

For episodes $i = 1, 2, \ldots$ do

Initialize episode $i$: Set $t_i := t$. Calculate estimates $\hat{r}_t(s,a)$ for the mean reward $r(s,a)$ for state-action pairs $(s,a)$ with $a \in A(s)$, and determine confidence intervals $\hat{I}_t(s,c)$ for each color as follows. For each state $s$, each transition probability distribution $p(\cdot)$ and each color $c$ in $C(s,p)$ calculate a confidence interval for $\max_{a: c(a)=c} r(s,a)$:
$$\hat{I}_t(s,c) := \bigl[\,\hat{r}_t(s,c) - \mathrm{conf}_t(s,c),\; \hat{r}_t(s,c) + \theta + \mathrm{conf}_t(s,c)\,\bigr],$$
where $\hat{r}_t(s,c) = \frac{1}{n_t(s,c)} \sum_{\tau < t:\, c(a_\tau)=c,\, s_\tau=s} r_\tau$ with $r_\tau$ being the random reward obtained at step $\tau$ for choosing action $a_\tau$ in state $s$, $n_t(s,a)$ being the number of times action $a$ was chosen in state $s$, and $n_t(s,c)$ being the number of times an action with color $c$ has been chosen in $s$.² Further,
$$\mathrm{conf}_t(s,c) := \sqrt{\tfrac{7\log(2|C|t/\delta)}{2 n_t(s,c)}}.$$

Determine relevant colors $C_t$: For each state $s$ and each $p(\cdot)$ let $c_t(s,p)$ be the color with maximal upper confidence bound value, i.e.,
$$c_t(s,p) := \operatorname{argmax}_{c \in C(s,p)}\,\bigl\{\hat{r}_t(s,c) + \theta + \mathrm{conf}_t(s,c)\bigr\}.$$
Set $C_t(s,p) := \bigl\{c \in C(s,p) \,\big|\, \hat{I}_t(s, c_t(s,p)) \cap \hat{I}_t(s,c) \neq \emptyset\bigr\}$. If $\mathrm{conf}_t(s, c_t(s,p)) \geq \beta/4$ and $\mathrm{conf}_t(s,c) < \beta/4$ for some $c \in C_t(s,p)$, reset $C_t(s,p) := \{c_t(s,p)\}$.

Policy selection and execution: Choose an optimal policy³ $\pi_i$ in the MDP with transition structure as given and action sets $A_t(s) := \{a \in A(s) \,|\, c(a) \in \bigcup_p C_t(s,p)\}$ with rewards
$$\tilde{r}_t(s,a) := \hat{r}_t(s,a) + \sqrt{\tfrac{7\log(2|A|t/\delta)}{2 n_t(s,a)}}.$$
Play $\pi_i$ as long as $n_t(s_t, \pi_i(s_t)) < 2\, n_{t_i}(s_t, \pi_i(s_t))$ in the current state $s_t$.

Figure 2: The colored MDP algorithm for MDPs with known transition structure and colored action set.

² If the color or action count is 0, reset it to 1.
³ Such an optimal policy can be calculated using ordinary value iteration (Puterman, 1994).
$\varepsilon$-suboptimal reward is received. Thus, let $T_\varepsilon$ be the number of steps in
episodes where the average per-step reward is less than $\rho^* - \varepsilon$, and let $M_\varepsilon$
be the respective indices of these episodes. Note that setting
$\Delta_i := \sum_{t=t_i}^{t_{i+1}-1} (\rho^* - r_t)$ to be the regret in episode $i$ we have that
$$\Delta_\varepsilon := \sum_{i \in M_\varepsilon} \Delta_i \;\geq\; \varepsilon T_\varepsilon. \qquad (7)$$
Having this lower bound on $\Delta_\varepsilon$, we now aim at an upper bound on $\Delta_\varepsilon$ in
terms of $T_\varepsilon$. These two bounds will then give us the desired regret bound. The derivation of
this upper bound is essentially the same as in the extended version of (Auer et al., 2009), so we do not
repeat it here and only state that it can be shown that
$$\Delta_\varepsilon \;\leq\; 1 + \sqrt{\tfrac{T_\varepsilon}{2}\log\tfrac{T}{\delta}} + 2D\sqrt{T_\varepsilon \log\tfrac{T}{\delta}} + D\cdot\#\text{episodes} + \sqrt{\log\tfrac{2|A|T}{\delta}}\;\sum_{s,a}\sqrt{n_\varepsilon(s,a)} \qquad (8)$$
with probability $1 - 3\delta$, where $n_\varepsilon(s,a)$ is the total number of times action $a$ is
chosen in $s$ in episodes in $M_\varepsilon$. Now we split $\sum_{s,a}\sqrt{n_\varepsilon(s,a)}$ into
one sum handling the actions of $\beta$-bad color $\notin C_\beta$ and another sum for all other
actions, where
$$C_\beta := \bigcup_{s,p}\bigl\{c \in C(s,p) \,\big|\, r^*(s,p) - r^+(c) \leq \beta + 2\theta\bigr\}$$
with $r^*(s,p) := \max_{a: c(a) \in C(s,p)} r(s,a)$ and $r^+(c) := \max_{a: c(a)=c} r(s,a)$. Then,
similarly to the bandit case, whenever the confidence interval $\mathrm{conf}_t(s,c)$ of such a
$\beta$-bad color $c$ is smaller than $\beta/4$ at the beginning of an episode, the respective color
will not be part of $C_t(s,p)$. Consequently, by definition of $\mathrm{conf}_t(s,c)$ the number of
times an action with that color is chosen in state $s$ is upper bounded by
$\frac{112\log(2|C|T)}{\beta^2}$, the additional factor 2 stemming from the fact that the confidence
intervals are only updated at the beginning of an episode (in which the number of times the respective
action is chosen may be doubled). Consequently, for any $\beta$-bad color $c \notin C_\beta$
$$\sum_{s,a: c(a)=c} n_\varepsilon(s,a) \;\leq\; \frac{112\log(2|C|T)}{\beta^2}, \qquad (9)$$
whence one obtains by Jensen's inequality that
$$\sum_{s,a}\sqrt{n_\varepsilon(s,a)} \;\leq\; \sqrt{|A_\beta| T_\varepsilon} + \frac{11}{\beta}\sqrt{|A||C|\log(2|C|T)},$$
where $A_\beta := \{a \in A \,|\, c(a) \in C_\beta\}$. This yields from (8)
}. This yields from (8)
that
ε
q
T
ε
2
log
T
δ
+ 2D
q
T
ε
log
T
δ
+ D·#episodes
+
q
|A
β
|T
ε
log
2|A|T
δ
+
11
β
q
|A||C|log
2|A|T
δ
log(2|C|T) + 1, (10)
so that it remains to upper bound the number of episodes. By the doubling criterion for episode
termination it is easy to see that in general there are not more than $|A|\log_2\frac{8T}{|A|}$
episodes (cf. Appendix A.2 of the extended version of (Auer et al., 2009) for details). However, again
considering actions of $\beta$-good color $\in C_\beta$ and others separately, according to (9) this
can be improved to a bound of $|A_\beta|\log\frac{8T}{|A_\beta|} + |A|\log\frac{896\log(2|C|T)}{|A|\beta^2}$.
Putting this into (10) we get in combination with (7)
$$T_\varepsilon \;\leq\; c_1\cdot\frac{\bigl(D^2 + |A_\beta|\bigr)\log\frac{T}{\delta}}{\varepsilon^2} \;+\; c_2\cdot\frac{\sqrt{|A||C|}\,\log\frac{T}{\delta}}{\varepsilon\beta} \;+\; c_3\cdot\frac{D|A_\beta|\log T + D|A|\log\log T}{\varepsilon}. \qquad (11)$$
As $\Delta_\varepsilon$ is an upper bound on the regret with respect to an $\varepsilon$-optimal policy,
we may plug (11) into (10) to obtain after some calculations the following result.
Theorem 4. The regret of the colored MDP algorithm with respect to an $\varepsilon$-optimal policy after
$T$ steps is with probability at least $1 - 3\delta$ upper bounded by (ignoring terms sublogarithmic in $T$)
$$c_1\cdot\frac{\bigl(D^2 + |A_\beta|\bigr)\log\frac{T}{\delta}}{\varepsilon} \;+\; c_2\cdot\frac{\sqrt{|A||C|}\,\log\frac{T}{\delta}}{\beta} \;+\; c_3\cdot\frac{\bigl(D + \sqrt{|A_\beta|}\bigr)\sqrt[4]{|A||C|}\,\log\frac{T}{\delta}}{\sqrt{\beta\varepsilon}} \;+\; D|A_\beta|\log T.$$
For sufficiently small $\varepsilon$, an $\varepsilon$-optimal policy is also optimal, which yields a
corresponding bound with $\varepsilon$ replaced by the difference between the optimal and the highest
suboptimal average reward, i.e., $g := \rho^* - \max_{\pi: \rho(\pi) < \rho^*} \rho(\pi)$.
Thus, as in the bandit case, the learner can benefit from the color information (as can be seen by
comparing the bounds to the case without color information, i.e., $|C| = 1$ and $A_\beta = A$). The
reason why there is still some dependency on the total number of actions is that the doubling criterion
for episode termination concerns the actions and not their colors. However, if the episode termination
criterion is adapted to apply to colors instead of actions, other parts of the proof no longer go through.
3.2 Action Aggregation
Now let us consider the case where actions of the same color have similar but not necessarily identical
transition probabilities. As before, we are only interested in similar actions that are available in the
same state. Thus we again assume for the sake of simplicity that only actions contained in the same set
$A(s)$ have the same color.
Assumption 5. There are $\theta_r, \theta_p > 0$ such that for each two actions $a, a' \in A(s)$ with $s \in S$: If $c(a) = c(a')$ then (i) $|r(s,a) - r(s,a')| < \theta_r$, and (ii) $\sum_{s' \in S} |p(s'|s,a) - p(s'|s,a')| < \theta_p$.
Unlike in the settings considered so far, it is by
no means clear what happens if one simply chooses
a representative of each color and works on the ag-
gregated MDP. In this section we derive error bounds
that answer this question.
Definition 6. Given an MDP $M = \langle S,A,p,r\rangle$ and a coloring function $c : A \to C$ for the
actions, an MDP $\widehat{M} = \langle S,C,\hat{p},\hat{r}\rangle$ is called an aggregation of $M$ with
respect to $c$ if for $a \in A(s)$ with $c(a) = c$:
$$\bigl|\hat{r}(s,c) - r(s,a)\bigr| < \theta_r, \quad\text{and}\quad \sum_{s' \in S} \bigl|\hat{p}(s'|s,c) - p(s'|s,a)\bigr| < \theta_p.$$
Thus, beside picking an arbitrary reference action $a$ for each color $c$, one may also set e.g.
$\hat{r}(s,c) := \frac{1}{|A_c|}\sum_{a \in A_c} r(s,a)$ and
$\hat{p}(s'|s,c) := \frac{1}{|A_c|}\sum_{a \in A_c} p(s'|s,a)$, where $A_c := \{a \in A \,|\, c(a) = c\}$.
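As an illustration of this averaging aggregation, the following hedged Python sketch (not from the paper) builds $\hat p$ and $\hat r$ from tabular MDP data; the dictionary-based data layout and the function name aggregate_mdp are assumptions made for the example.

```python
from collections import defaultdict

def aggregate_mdp(p, r, color):
    """Aggregate an MDP over same-colored actions by averaging (cf. Definition 6).

    p:     dict (s, a) -> dict s' -> transition probability
    r:     dict (s, a) -> mean reward
    color: dict a -> color (same-colored actions are assumed to share a state)
    Returns (p_hat, r_hat) indexed by (s, color).
    """
    members = defaultdict(list)           # (s, c) -> actions of color c in s
    for (s, a) in r:
        members[(s, color[a])].append(a)

    p_hat, r_hat = {}, {}
    for (s, c), acts in members.items():
        r_hat[(s, c)] = sum(r[(s, a)] for a in acts) / len(acts)
        probs = defaultdict(float)
        for a in acts:
            for s_next, q in p[(s, a)].items():
                probs[s_next] += q / len(acts)
        p_hat[(s, c)] = dict(probs)
    return p_hat, r_hat
```

Under Assumption 5 the averaged quantities deviate from those of each original action by less than $\theta_r$ and $\theta_p$, respectively, so this choice is indeed an aggregation in the sense of Definition 6.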
A policy $\hat{\pi}$ on $\widehat{M}$ is the aggregation of a policy $\pi$ on $M$ with respect to a
coloring function $c : A \to C$, if $c(\pi(s)) = \hat{\pi}(s)$ for all states $s$. In the following we
consider only ergodic MDPs, in which all policies induce ergodic Markov chains. However, if an
aggregation preserves the ergodicity structure of the MDP, the following results can be adapted to the
general case.
Theorem 7. Let $M = \langle S,A,p,r\rangle$ be an MDP and $\widehat{M} = \langle S,C,\hat{p},\hat{r}\rangle$
an aggregation of $M$ with respect to a coloring function $c : A \to C$. Then for each policy $\pi$ on $M$
and the aggregation $\hat{\pi}$ of $\pi$ on $\widehat{M}$ with respect to $c$, we have for the difference
of the average reward $\rho(\pi)$ of $\pi$ in $M$ and the average reward $\hat{\rho}(\hat{\pi})$ of
$\hat{\pi}$ in $\widehat{M}$
$$\bigl|\rho(\pi) - \hat{\rho}(\hat{\pi})\bigr| \;<\; \theta_r + (\kappa_\pi - 1)\,\theta_p,$$
where $\kappa_\pi$ is the mixing time of the Markov chain induced by $\pi$ on $M$.
As for the bounds on the error of state aggrega-
tion of (Ortner, 2007), in the proof of Theorem 7 we
use the following result of (Hunter, 2006) on pertur-
bations of Markov chains.
Theorem 8. (Hunter, 2006) Let $\mathcal{C}, \widetilde{\mathcal{C}}$ be two ergodic Markov chains on the
same state space $S$ with transition probabilities $p(\cdot,\cdot)$, $\tilde{p}(\cdot,\cdot)$ and
stationary distributions $\mu$, $\tilde{\mu}$, and let $\kappa_{\mathcal{C}}$ be the mixing time of
$\mathcal{C}$. Then
$$\|\mu - \tilde{\mu}\|_1 \;\leq\; (\kappa_{\mathcal{C}} - 1)\,\max_{s \in S} \sum_{s' \in S} \bigl|p(s,s') - \tilde{p}(s,s')\bigr|.$$
Proof of Theorem 7: Writing $\mu_\pi$ and $\mu_{\hat{\pi}}$ for the stationary distributions of $\pi$ on
$M$ and $\hat{\pi}$ on $\widehat{M}$, respectively, and abbreviating $r(s,\pi(s))$ with $r_\pi(s)$, we have
by (6)
$$\bigl|\rho(\pi) - \hat{\rho}(\hat{\pi})\bigr| = \Bigl|\sum_s \mu_\pi(s)\,r_\pi(s) - \sum_s \mu_{\hat{\pi}}(s)\,r_{\hat{\pi}}(s)\Bigr| \;\leq\; \sum_s \bigl|\mu_\pi(s) - \mu_{\hat{\pi}}(s)\bigr|\,r_\pi(s) + \sum_s \mu_{\hat{\pi}}(s)\,\bigl|r_{\hat{\pi}}(s) - r_\pi(s)\bigr|.$$
As $r_\pi(s) \leq 1$, using Theorem 8 together with our assumptions on aggregation gives
$$\bigl|\rho(\pi) - \hat{\rho}(\hat{\pi})\bigr| \;<\; \sum_{s \in S} \bigl|\mu_\pi(s) - \mu_{\hat{\pi}}(s)\bigr| + \sum_{s \in S} \mu_{\hat{\pi}}(s)\,\theta_r \;<\; (\kappa_\pi - 1)\,\theta_p + \theta_r.$$
Corollary 9. Let $\pi^*$ be an optimal policy on an MDP $M$ with optimal average reward $\rho^*$, and
let $\hat{\pi}^*$ be an optimal policy with optimal average reward $\hat{\rho}^*$ on an aggregation
$\widehat{M}$ of $M$ with respect to some coloring function $c : A \to C$. Then
(i) $|\rho^* - \hat{\rho}^*| < \theta_r + (\kappa_M - 1)\,\theta_p$,
(ii) $\rho^* < \rho(\hat{\pi}^*_e) + 2\theta_r + 2(\kappa_M - 1)\,\theta_p$,
where $\kappa_M := \max_\pi \kappa_\pi$, and $\hat{\pi}^*_e$ is any policy on $M$ such that $\hat{\pi}^*$
is the aggregation of $\hat{\pi}^*_e$ with respect to $c$.
Remark: As the role of $\mathcal{C}$ and $\widetilde{\mathcal{C}}$ in Theorem 8
is symmetric, Theorem 7 and Corollary 9 hold also
when the mixing time of M is replaced with the mix-
ing time of the aggregated MDP. Hence, the results
also hold for the minimum of the two mixing times.
The following theorem shows that the error in av-
erage reward indeed becomes arbitrarily large when
the mixing time approaches infinity.
Theorem 10. For each $\theta_p > 0$ and each sufficiently small $\eta > 0$ there is an MDP
$M = \langle S,A,p,r\rangle$ and a coloring $c : A \to C$ of the action space such that in each
aggregation $\widehat{M}$ of $M$ with respect to $c$ there is some policy $\pi$ on $M$ such that for the
respective aggregated policy $\hat{\pi}$ on $\widehat{M}$,
$$\bigl|\rho(\pi) - \hat{\rho}(\hat{\pi})\bigr| \;\geq\; 1 - \eta.$$
Proof. Fix some $\theta_p > 0$ and consider for $\delta \in (0, \frac{\theta_p}{2})$ an MDP with
$S = \{s_1, s_2\}$, $A(s_1) = \{a_1, a_2\}$, and $A(s_2) = \{a_3\}$. Define the transition probabilities
(cf. Figure 3) in $s_1$ as $p(s_1|s_1,a_1) = 1 - \delta$, $p(s_2|s_1,a_1) = \delta$,
$p(s_1|s_1,a_2) = 1 - \frac{\delta}{n^2}$, and $p(s_2|s_1,a_2) = \frac{\delta}{n^2}$, and those in $s_2$
as $p(s_1|s_2,a_3) = \frac{\delta}{n}$ and $p(s_2|s_2,a_3) = 1 - \frac{\delta}{n}$, where
$n \in \mathbb{N}$. Then the stationary distribution of the policy $\pi$ with $\pi(s_1) = a_2$ and
$\pi(s_2) = a_3$ is $\mu_\pi = \bigl(\frac{n}{n+1}, \frac{1}{n+1}\bigr)$, which for $n \to \infty$
converges to $(1,0)$. On the other hand, the policy $\pi'$ with $\pi'(s_1) = a_1$ and $\pi'(s_2) = a_3$
has stationary distribution $\mu_{\pi'} = \bigl(\frac{1}{n+1}, \frac{n}{n+1}\bigr)$, which for
$n \to \infty$ converges to $(0,1)$.
Figure 3: The MDP in the proof of Theorem 10. Solid arrows correspond to action $a_1$, dashed ones to $a_2$, and dotted ones to $a_3$.

Now, as $\delta < \frac{\theta_p}{2}$, a coloring function $c$ may assign (choosing an arbitrary
$\theta_r > 0$, cf. the choice of
the rewards below) the same color $c_1$ to actions $a_1$ and $a_2$ in $s_1$. We consider transition
probabilities $\hat{p}(\cdot|s_1,c_1)$ which are a convex combination of the respective probabilities
$p(\cdot|s_1,a_1)$ and $p(\cdot|s_1,a_2)$, i.e.
$$\hat{p}(s_2|s_1,c_1) = \lambda\delta + (1-\lambda)\tfrac{\delta}{n^2} \quad\text{and}\quad \hat{p}(s_1|s_1,c_1) = 1 - \lambda\delta - (1-\lambda)\tfrac{\delta}{n^2}$$
for $\lambda \in [0,1]$. (For transition probabilities outside this convex set the error will clearly be
larger.) Then the stationary distribution of the aggregated policy $\hat{\pi}$ with
$\hat{\pi}(s_1) = c_1$ and $\hat{\pi}(s_2) = a_3$ is
$\mu_{\hat{\pi}} = \bigl(\frac{n}{n^2\lambda + n - \lambda + 1}, \frac{n^2\lambda - \lambda + 1}{n^2\lambda + n - \lambda + 1}\bigr)$.
Thus, if $\lambda > 0$ then $\mu_{\hat{\pi}}$ converges to $(0,1)$ for $n \to \infty$. Otherwise, for
$\lambda = 0$ we have $\mu_{\hat{\pi}} = \bigl(\frac{n}{n+1}, \frac{1}{n+1}\bigr)$, which for
$n \to \infty$ converges to $(1,0)$. Now, setting the rewards $r(s_1,a_1) := r(s_1,a_2) := 1$ and
$r(s_2,a_3) := 0$ (and choosing appropriate rewards $\hat{r}$), we obtain an error arbitrarily close to 1
either with respect to $\pi$ or to $\pi'$, which proves the theorem.
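The construction can also be checked numerically. The following hedged Python snippet (not part of the paper) computes the stationary distributions of $\pi$, $\pi'$, and the aggregated policy for concrete values of n, δ, and λ, illustrating that the aggregation error approaches 1 as n grows.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a 2-state ergodic chain via the balance equation."""
    # mu1 * P[0,1] = mu2 * P[1,0]  =>  mu1/mu2 = P[1,0] / P[0,1]
    ratio = P[1, 0] / P[0, 1]
    mu1 = ratio / (1 + ratio)
    return np.array([mu1, 1 - mu1])

delta, n, lam = 0.1, 1000, 0.5
# Rewards: r(s1, .) = 1, r(s2, a3) = 0, so the average reward equals mu(s1).
P_pi      = np.array([[1 - delta / n**2, delta / n**2],        # pi uses a2 in s1
                      [delta / n,        1 - delta / n]])
P_pi_dash = np.array([[1 - delta,        delta],               # pi' uses a1 in s1
                      [delta / n,        1 - delta / n]])
p_hat = lam * delta + (1 - lam) * delta / n**2                 # aggregated color c1
P_agg = np.array([[1 - p_hat, p_hat],
                  [delta / n, 1 - delta / n]])

rho_pi, rho_dash, rho_hat = (stationary(P)[0] for P in (P_pi, P_pi_dash, P_agg))
print(rho_pi, rho_dash, rho_hat)
# For large n: rho_pi ~ 1, rho_dash ~ 0, and rho_hat ~ 0 (for lambda > 0),
# so |rho(pi) - rho_hat(pi_hat)| gets arbitrarily close to 1.
```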
The lower bound of Theorem 10 indicates that ex-
ploiting similarity of transition probabilities is harder
than for rewards. Here we confirm this by showing
that an optimistic algorithm in the style of UCRL2
fails already in very simple situations.
Example: Consider a two-state MDP with $S = \{s_1, s_2\}$, $A(s_1) = \{a_1, a_2\}$, and
$A(s_2) = \{a_3\}$ as shown in Figure 4. The transition probabilities in $s_1$ are set to
$p(s_1|s_1,a_1) = \frac{3}{4}$, $p(s_2|s_1,a_1) = \frac{1}{4}$, $p(s_1|s_1,a_2) = \frac{1}{2}$, and
$p(s_2|s_1,a_2) = \frac{1}{2}$. In state $s_2$ the transition probabilities are
$p(s_1|s_2,a_3) = p(s_2|s_2,a_3) = \frac{1}{2}$. The mean rewards are given by
$r(s_1,a_1) = \frac{1}{2}$, $r(s_1,a_2) = \frac{5}{12}$, and $r(s_2,a_3) = \frac{3}{4}$. Thus, setting
$\theta_p > 1$ and $\theta_r > \frac{1}{12}$ we may color actions $a_1$ and $a_2$ with the same color
$c_1$. The intervals of possible values of actions with color $c_1$ are then
$\tilde{p}(s_1|s_1,c_1) \in \bigl[\frac{1}{2}, \frac{3}{4}\bigr]$,
$\tilde{p}(s_2|s_1,c_1) \in \bigl[\frac{1}{4}, \frac{1}{2}\bigr]$, and
$\tilde{r}(s_1,c_1) \in \bigl[\frac{5}{12}, \frac{1}{2}\bigr]$.
An optimistic algorithm that handles this information as in a bounded parameter MDP (Tewari and
Bartlett, 2007) would now assume values $\tilde{p}(s_1|s_1,c_1) = \tilde{p}(s_2|s_1,c_1) = \frac{1}{2}$
and $\tilde{r}(s_1,c_1) = \frac{1}{2}$ in order to maximize the average reward of a policy playing an
action with color $c_1$ in state $s_1$, which then is
Figure 4: The MDP from the example. Solid arrows correspond to action $a_1$, dashed ones to $a_2$, and dotted ones to $a_3$. For each arrow the respective transition probability and the reward are displayed.
$\tilde{\rho} = \frac{1}{2}\cdot\frac{1}{2} + \frac{1}{2}\cdot\frac{3}{4} = \frac{5}{8}$, since the
stationary distribution in this case is obviously $\bigl(\frac{1}{2}, \frac{1}{2}\bigr)$. However, the
real achievable average reward for actions $a_1$ (which gives stationary distribution
$\bigl(\frac{2}{3}, \frac{1}{3}\bigr)$) and $a_2$ (with stationary distribution
$\bigl(\frac{1}{2}, \frac{1}{2}\bigr)$) of color $c_1$ is
$\rho(a_1) = \frac{2}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot\frac{3}{4} = \frac{7}{12}$ and
$\rho(a_2) = \frac{1}{2}\cdot\frac{5}{12} + \frac{1}{2}\cdot\frac{3}{4} = \frac{7}{12}$.
Hence, if we add another action $a_4$ to state $s_1$ with $p(s_1|s_1,a_4) = 1$ and
$r(s_1,a_4) \in \bigl(\frac{7}{12}, \frac{5}{8}\bigr)$, the optimal policy would choose $a_4$ in $s_1$.
However, as the algorithm expects the larger average reward of $\tilde{\rho}$, action $a_4$ would not be
chosen.
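For completeness, a short hedged Python check (not part of the paper) of the numbers in this example, comparing the optimistic value $\tilde{\rho}$ with the true average rewards of the two same-colored actions:

```python
from fractions import Fraction as F

def two_state_avg_reward(p_leave_s1, r_s1, r_s2=F(3, 4), p_leave_s2=F(1, 2)):
    """Average reward of the 2-state chain: stationary distribution times rewards."""
    mu1 = p_leave_s2 / (p_leave_s1 + p_leave_s2)   # balance: mu1 * p12 = mu2 * p21
    return mu1 * r_s1 + (1 - mu1) * r_s2

rho_a1 = two_state_avg_reward(F(1, 4), F(1, 2))    # true chain of a1: 7/12
rho_a2 = two_state_avg_reward(F(1, 2), F(5, 12))   # true chain of a2: 7/12
rho_opt = two_state_avg_reward(F(1, 2), F(1, 2))   # optimistic values:  5/8
print(rho_a1, rho_a2, rho_opt)                     # 7/12 7/12 5/8
# Any a4 with a self-loop and reward in (7/12, 5/8) beats both real options,
# yet the optimistic value 5/8 of color c1 keeps the algorithm from choosing it.
```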
4 DISCUSSION AND PROBLEMS
Our aim was to investigate the simplest possible similarity model for discrete bandits / action spaces. We
think that the shift of analysis from the bandit to the
MDP case may serve as a blueprint for much more
general settings where the action space of an MDP
is e.g. a metric space with Lipschitz condition, when
an appropriate bandit algorithm like the zooming al-
gorithm (Kleinberg et al., 2008) may be similarly
adapted. Indeed, this particular setting would be a
proper generalization of the scenario considered in
this paper, as the zooming algorithm can handle the
colored bandits case by defining a special metric d,
where $d(a,a') := \theta$ if $c(a) = c(a')$ and $d(a,a') := 1$ otherwise (although it is not quite clear whether the
same regret bounds are achievable). However, these
considerations need further investigation.
We have tried to exploit similarity information
only with respect to the actions. As many real-world
problems (also) have large state spaces, it is a natu-
ral question whether a similar approach would work
for coloring the state space of an MDP. However, it is
not quite clear how similarity for states could be used
in principle. The most natural thing to do would be to
choose in a state s (of color c) that has not been visited
before an action that proved to be successful in other
states of the same color. This would obviously lead to
some sort of state aggregation similarly to the action
aggregation concept considered in Section 3.2. How-
ever, for this setting it has already been shown, by lower bounds resembling that of Theorem 10, that
state aggregation may cause arbitrarily large error as well (Ortner, 2007). Still, regret bounds that consider mixing
time parameters of the MDP may be possible.
ACKNOWLEDGEMENTS
The author is grateful for the input of the anonymous
reviewers of an earlier version of (parts of) this pa-
per. In particular, the remark before Theorem 8 and
the ideas about the zooming algorithm are due to two
reviewers. This work was supported in part by the
Austrian Science Fund FWF (S9104-N13 SP4). The
research leading to these results has received funding
from the European Community’s Seventh Framework
Programme (FP7/2007-2013) under grant agreements
n° 216886 (PASCAL2), and n° 216529 (PinView).
This publication only reflects the authors’ views.
REFERENCES
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-
time analysis of the multi-armed bandit problem.
Mach. Learn., 47:235–256.
Auer, P., Jaksch, T., and Ortner, R. (2009). Near-optimal
regret bounds for reinforcement learning. In Adv. Neu-
ral Inf. Process. Syst. 21, pages 89–96. (full version
http://www.unileoben.ac.at/infotech/publications/
TR/CIT-2009-01.pdf).
Hunter, J. J. (2006). Mixing times with applications to
perturbed Markov chains. Linear Algebra Appl.,
417:108–123.
Kleinberg, R., Slivkins, A., and Upfal, E. (2008). Multi-
armed bandits in metric spaces. In Proceedings STOC
2008, pages 681–690.
Kleinberg, R. D. (2005). Nearly tight bounds for the
continuum-armed bandit problem. In Adv. Neural Inf.
Process. Syst. 17, pages 697–704.
Mannor, S. and Tsitsiklis, J. N. (2004). The sample com-
plexity of exploration in the multi-armed bandit prob-
lem. J. Mach. Learn. Res., 5:623–648.
Ortner, R. (2007). Pseudometrics for state aggregation in
average reward Markov decision processes. In Pro-
ceedings of ALT 2007, pages 373–387.
Pandey, S., Chakrabarti, D., and Agarwal, D. (2007). Multi-
armed bandit problems with dependent arms. In Pro-
ceedings of ICML 2007, pages 721–728.
Puterman, M. L. (1994). Markov Decision Processes: Dis-
crete Stochastic Dynamic Programming. John Wiley
& Sons, Inc., New York, NY, USA.
Tewari, A. and Bartlett, P. L. (2007). Bounded parameter
Markov decision processes with average reward crite-
rion. In Proceedings of COLT 2007, pages 263–277.