decision process (MDP) is a tuple $M = \langle S, A, p, r \rangle$, where $S$ is a finite set of states and $A$ is a finite set of actions. Unlike in the usual setting, where in each state from $S$ each action from $A$ is available, we consider that for each state $s$ there is a nonempty subset $A(s) \subseteq A$ of actions available in $s$. Further, we assume that the sets $A(s)$ form a partition of $A$, i.e., $A(s) \cap A(s') = \emptyset$ for $s \neq s'$ and $\bigcup_{s \in S} A(s) = A$. The transition probabilities $p(s'|s,a)$ give the probability of reaching state $s'$ when choosing action $a$ in state $s$, and the payoff distributions with mean $r(s,a)$ and support in $[0,1]$ specify the random reward obtained for choosing action $a$ in state $s$.
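For concreteness, here is a minimal Python sketch of a container for such an MDP with state-dependent action sets; the class name `MDP` and its fields are our own illustrative choices and not part of the paper.

```python
import numpy as np

class MDP:
    """Illustrative container for an average-reward MDP with state-dependent
    action sets A(s) that partition the global action set A.

    actions[s]  -- list of (globally unique) action ids available in state s
    p[(s, a)]   -- transition distribution p(.|s, a) as an array of length |S|
    r[(s, a)]   -- mean reward r(s, a) in [0, 1]
    """

    def __init__(self, n_states, actions, p, r):
        self.n_states = n_states
        self.actions = actions
        self.p = p
        self.r = r

        # Sanity checks mirroring the assumptions in the text.
        all_actions = [a for s in range(n_states) for a in actions[s]]
        assert len(all_actions) == len(set(all_actions)), "the A(s) must be disjoint"
        assert all(len(actions[s]) > 0 for s in range(n_states)), "A(s) must be nonempty"
        for dist in p.values():
            assert abs(float(np.sum(dist)) - 1.0) < 1e-9, "p(.|s,a) must sum to 1"
```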
We are interested in the undiscounted average reward $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r_t$, where $r_t$ is the random reward obtained at step $t$. As possible strategies we consider (stationary) policies $\pi : S \to A$, where $\pi(s) \in A(s)$. This is justified by the fact that there is always such a policy $\pi^*$ which gives optimal reward (Puterman, 1994). Let $\rho(\pi)$ denote the expected average reward of policy $\pi$. Then $\pi^*$ is an optimal policy if $\rho(\pi) \leq \rho(\pi^*) =: \rho^*$ for all policies $\pi$.
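As a concrete reading of this criterion, the sketch below (assuming the hypothetical `MDP` container above and, purely for illustration, Bernoulli payoff distributions with mean $r(s,a)$) approximates the average reward of a stationary policy by a long finite rollout.

```python
import numpy as np

def empirical_average_reward(mdp, policy, s0=0, T=100_000, rng=None):
    """Approximate rho(pi) = lim_{T->inf} (1/T) sum_t r_t by a finite rollout.

    policy: dict mapping each state s to an action policy[s] in A(s).
    Rewards are drawn Bernoulli(r(s, a)) here purely for illustration; any
    payoff distribution with mean r(s, a) and support in [0, 1] would do.
    """
    rng = np.random.default_rng() if rng is None else rng
    s, total = s0, 0.0
    for _ in range(T):
        a = policy[s]
        total += float(rng.random() < mdp.r[(s, a)])      # Bernoulli reward
        s = rng.choice(mdp.n_states, p=mdp.p[(s, a)])     # sample next state
    return total / T
```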
In the analysis we will also need some transition parameters of the MDP at hand. Thus let $T(s'|M,\pi,s)$ be the first (random) time step in which state $s'$ is reached when policy $\pi$ is executed on MDP $M$ with initial state $s$. Then we define the diameter of the MDP to be the average time it takes to move from any state $s$ to any other state $s'$, using an appropriate policy, i.e.,
$$D(M) := \max_{s \neq s' \in S} \; \min_{\pi : S \to A} \mathbb{E}\,T(s'|M,\pi,s).$$
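For a fully specified MDP, the inner minimisation over policies is a stochastic shortest path problem, so $D(M)$ can be approximated roughly as in the following sketch (value iteration on expected hitting times; this reformulation is standard but is an addition of ours, not taken from the paper).

```python
import numpy as np

def diameter(mdp, max_iter=100_000, tol=1e-9):
    """Approximate D(M) = max_{s != s'} min_pi E[T(s' | M, pi, s)].

    For each target state, the optimal expected hitting time h(s) satisfies
    h(target) = 0 and h(s) = min_{a in A(s)} [1 + sum_x p(x|s,a) h(x)],
    which we solve by value iteration (converges for finite-diameter MDPs).
    """
    S = mdp.n_states
    D = 0.0
    for target in range(S):
        h = np.zeros(S)
        for _ in range(max_iter):
            h_new = np.zeros(S)
            for s in range(S):
                if s == target:
                    continue  # hitting time from the target itself is 0
                h_new[s] = min(1.0 + mdp.p[(s, a)] @ h for a in mdp.actions[s])
            if np.max(np.abs(h_new - h)) < tol:
                h = h_new
                break
            h = h_new
        D = max(D, float(h.max()))  # max over start states s != target
    return D
```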
We will consider only MDPs with finite diameter, which guarantees that there is always an optimal policy that achieves optimal average reward $\rho^*$ independent of the initial state. Note that each policy $\pi$ induces a Markov chain $M_\pi$ on $M$. If this Markov chain is ergodic (i.e., each state is reachable from each other state after a finite number of steps), it has a state-independent stationary distribution. The mixing time of a policy $\pi$ on an MDP $M$ with induced stationary distribution $\mu_\pi$ is given by
$$\kappa_\pi(M) := \sum_{s' \in S} \mathbb{E}\,T(s'|M,\pi,s)\, \mu_\pi(s')$$
for an arbitrary $s \in S$. The definition is independent of the choice of $s$, as shown in (Hunter, 2006).
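For a fully specified MDP and policy, $\kappa_\pi(M)$ can be computed from the induced chain, e.g. as in the sketch below (assuming the `MDP` container above; treating the hitting time of the start state itself as zero is our convention, and by the quoted result the value does not depend on $s$).

```python
import numpy as np

def mixing_time(mdp, policy, s=0):
    """Compute kappa_pi(M) = sum_{s'} E[T(s' | M, pi, s)] mu_pi(s').

    Assumes the chain induced by `policy` is ergodic. Expected hitting times
    are obtained from the linear system h(x) = 1 + sum_y P[x, y] h(y) for
    x != s', with h(s') = 0; the stationary distribution mu_pi is the left
    eigenvector of P for eigenvalue 1.
    """
    S = mdp.n_states
    P = np.array([mdp.p[(x, policy[x])] for x in range(S)])   # induced chain

    evals, evecs = np.linalg.eig(P.T)
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    mu = mu / mu.sum()

    kappa = 0.0
    for target in range(S):
        if target == s:
            continue  # hitting time of the start state itself taken as 0
        keep = [x for x in range(S) if x != target]
        Q = P[np.ix_(keep, keep)]          # chain restricted to non-target states
        h = np.linalg.solve(np.eye(S - 1) - Q, np.ones(S - 1))
        kappa += float(h[keep.index(s)]) * mu[target]
    return kappa
```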
Finally, we remark that in case a policy $\pi$ induces an ergodic Markov chain on $M$ with stationary distribution $\mu_\pi$, the average reward of $\pi$ can be written as
$$\rho(\pi) := \sum_{s \in S} \mu_\pi(s)\, r(s, \pi(s)). \qquad (6)$$
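Given the stationary distribution $\mu_\pi$ (obtained, e.g., as in the previous sketch), equation (6) amounts to a one-line evaluation:

```python
def average_reward(mdp, policy, mu):
    """Evaluate equation (6): rho(pi) = sum_s mu_pi(s) r(s, pi(s))."""
    return sum(mu[s] * mdp.r[(s, policy[s])] for s in range(mdp.n_states))
```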
3.1 When the Color Determines the Transition Probabilities

The simplest case in the MDP scenario is when two actions of the same color have identical transition probabilities and close rewards, given a coloring function $c : A \to C$ on the action space.
Assumption 3. There is a $\theta > 0$ such that for any two actions $a, a' \in A(s)$ with $s \in S$: if $c(a) = c(a')$, then (i) $p(\cdot|s,a) = p(\cdot|s,a')$, and (ii) $|r(s,a) - r(s,a')| < \theta$.
We will try to exploit the color information only for same-colored actions in the same state, so that we may assume without loss of generality that actions available in distinct states have distinct colors. Thus the set of colors $C$ is partitioned by the sets $C(s)$ of colors of actions that are available in state $s$. Further, we have to distinguish between colors having distinct transition probability distributions. Thus, we write $C(s,p)$ for the set of colors of actions available in state $s$ and having transition probabilities $p(\cdot)$.
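The bookkeeping just described can be made concrete as follows; the sketch groups the actions of each state by color and, for a fully specified MDP (which a learning agent would of course not have), verifies Assumption 3. The function name and the representation of the coloring $c$ as a dictionary are our own illustrative choices.

```python
import numpy as np
from collections import defaultdict

def color_classes(mdp, color, theta):
    """Group the actions available in each state by their color.

    color: dict mapping each action id to its color, i.e. the coloring c: A -> C.
    Since actions in distinct states may be assumed to have distinct colors,
    the returned per-state dictionaries also realise the partition of C into
    the sets C(s). The asserts check Assumption 3 for a known MDP.
    """
    classes = {}
    for s in range(mdp.n_states):
        by_color = defaultdict(list)
        for a in mdp.actions[s]:
            by_color[color[a]].append(a)
        for acts in by_color.values():
            a0 = acts[0]
            for a in acts[1:]:
                assert np.allclose(mdp.p[(s, a)], mdp.p[(s, a0)]), "violates (i)"
                assert abs(mdp.r[(s, a)] - mdp.r[(s, a0)]) < theta, "violates (ii)"
        classes[s] = dict(by_color)
    return classes
```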
3.1.1 Algorithm
The algorithm we propose (shown in Figure 2) is a straightforward adaptation of the UCRL2 algorithm (Auer et al., 2009). For the sake of simplicity, we only consider the case where the transition structure of the MDP is known. The general case can be handled analogously to (Auer et al., 2009). We use the color information just as in the bandit case. That is, in each state a set of promising colors is defined. Then an optimal policy is calculated where the action set is restricted to actions with promising colors, and the actions' rewards are set to their upper confidence values. The algorithm proceeds in episodes $i$, and the chosen policy $\pi_i$ is executed until a state $s$ is reached in which the action $\pi_i(s)$ has been played within the episode as often as before the episode.
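As a rough illustration of this episode structure, consider the skeleton below. It is not the algorithm of Figure 2: `promising_colors` and `optimistic_policy` are hypothetical placeholders for the color-selection and optimistic-planning steps described above, and the maintenance of reward estimates and confidence bounds is omitted; only the episode-termination rule is spelled out.

```python
from collections import defaultdict

def episodic_skeleton(mdp, T, promising_colors, optimistic_policy, rng):
    """Episode structure only; the learning steps are passed in as placeholders.

    promising_colors(counts)  -- hypothetical: returns per-state promising colors
    optimistic_policy(colors) -- hypothetical: returns a policy dict pi with
                                 pi[s] in A(s), restricted to promising colors
                                 and using upper confidence values as rewards
    rng                       -- a numpy Generator, e.g. np.random.default_rng()
    """
    counts = defaultdict(int)   # N(s, a): number of times a was played in s
    s, t = 0, 0
    while t < T:
        # --- start of episode i ---
        start_counts = dict(counts)        # visit counts before the episode
        episode_counts = defaultdict(int)  # visit counts within the episode
        pi = optimistic_policy(promising_colors(counts))
        while t < T:
            a = pi[s]
            # play a; reward observation and estimate updates omitted
            counts[(s, a)] += 1
            episode_counts[(s, a)] += 1
            s = rng.choice(mdp.n_states, p=mdp.p[(s, a)])  # known transitions
            t += 1
            # episode ends once we reach a state s in which pi(s) has been
            # played within the episode as often as before the episode
            # (the max(1, .) guards the zero-count case, as in UCRL2)
            if episode_counts[(s, pi[s])] >= max(1, start_counts.get((s, pi[s]), 0)):
                break
```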
3.1.2 Analysis
Again we are interested in the algorithm's regret after $T$ steps, defined as $T\rho^* - \sum_t r_t$.¹ Furthermore, we also consider the regret with respect to an $\varepsilon$-optimal policy, i.e., with respect to $\rho^* - \varepsilon$ instead of $\rho^*$. The analysis of the algorithm's regret is a combination of the respective proofs of Theorem 2 and of the logarithmic regret bounds in (Auer et al., 2009). That is, we first establish a sample complexity bound on the number of steps in episodes where

¹ Unlike in the bandit case, this regret definition also considers the deviations of the achieved rewards from the mean rewards. Actually, the regret bounds for the bandit case can be adapted to this alternative regret definition.