Recall that the quality of the primal ALP (eq. 4) solution is very sensitive to the choice of the primal basis H. Similarly, the quality of policies produced by the composite ALPs ((eq. 6) and (eq. 7)) depends greatly on the choice of both H and Q. However, as we show empirically below, the approach lends itself to an intuitive algorithm for constructing compact basis sets H and Q that yield high-quality solutions for the collision-avoidance domain.
Finally, note that while feasibility of the primal ALP (eq. 4) can be ensured by simply adding a constant function h_0 = 1 to the basis H (de Farias and Van Roy, 2003), it is slightly more difficult to ensure the feasibility of the composite ALP (eq. 6) (or the boundedness of (eq. 7)). In practice, for any primal basis H, boundedness and feasibility of the composite ALPs can be ensured by constructing a sufficiently large dual basis Q.
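To make the role of the constant feature concrete, the following is a minimal sketch of a generic primal ALP in the standard form of de Farias and Van Roy, with h_0 = 1 prepended to the basis; the exact formulation of (eq. 4), and all variable names below, are assumptions rather than the paper's definitions.

import numpy as np
from scipy.optimize import linprog

def solve_primal_alp(P, R, Phi, rho, gamma=0.95):
    # Minimal standard-form primal ALP sketch (assumed form, not necessarily eq. 4).
    # P:   (A, S, S) array, P[a, s, s2] = Pr(s2 | s, a)
    # R:   (S, A) reward matrix
    # Phi: (S, K) primal basis H as a feature matrix
    # rho: (S,) state-relevance weights
    A, S, _ = P.shape
    # Prepend the constant feature h_0 = 1; with gamma < 1 this keeps the
    # constraint set nonempty for any choice of the remaining features.
    Phi = np.hstack([np.ones((S, 1)), Phi])
    # Constraints: (Phi[s] - gamma * P[a, s] @ Phi) w >= R[s, a] for all (s, a).
    # linprog expects A_ub @ x <= b_ub, so both sides are negated.
    A_ub = np.vstack([-(Phi - gamma * P[a] @ Phi) for a in range(A)])
    b_ub = -R.T.reshape(-1)
    res = linprog(c=Phi.T @ rho, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * Phi.shape[1])
    return res.x  # weights of the basis functions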
3 COLLISION-AVOIDANCE MDP MODEL
We conducted experiments on several two-dimensional collision-avoidance scenarios, and the high-level results were consistent across the domains. To ground the discussion, we report our findings for a simplified model of the task of driving on a two-way street. We model the problem as a discrete-state MDP by using a grid-world representation of the road, with the x-y positions of all cars as the state features (the flat state space is given by their cross-product).
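As an illustration of this representation, here is a small sketch of the flat cross-product encoding; the grid dimensions and function names are hypothetical.

# Hypothetical flat encoding of the cross-product state space: each car
# occupies a cell (x, y) on a WIDTH x LENGTH grid, and a joint state is the
# tuple of all car positions packed into a single mixed-radix index.
WIDTH, LENGTH = 3, 10          # illustrative grid geometry
N_CELLS = WIDTH * LENGTH

def cell_index(x, y):
    return y * WIDTH + x

def flat_state(positions):
    # positions: list of (x, y) pairs, controlled car first.
    s = 0
    for x, y in positions:
        s = s * N_CELLS + cell_index(x, y)
    return s                   # integer in [0, N_CELLS ** num_cars)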
In this domain, we control one of the cars, and the goal is to find a policy that minimizes the aggregate probability of collisions with other cars. Each uncontrolled vehicle is modeled as strictly adhering to the right-hand-side driving convention. Within these bounds, the vehicles stochastically change lanes while drifting with varying speed in the direction of traffic.
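The paper does not spell out the exact transition probabilities, so the following sketch of an uncontrolled car's step is only an illustration: the lane-change probability, speed set, and the lane split implementing the right-hand-side convention are all assumed values.

import random

def step_uncontrolled_car(x, y, direction, width=3, length=10,
                          p_lane_change=0.2, speeds=(0, 1, 2)):
    # One stochastic step of an uncontrolled car (assumed dynamics);
    # direction is +1 or -1 along the road.
    # Lanes allowed under the right-hand-side convention (assumed split:
    # one half of the road per direction of travel).
    lo, hi = (0, width // 2 - 1) if direction < 0 else (width // 2, width - 1)
    if random.random() < p_lane_change:
        x += random.choice((-1, 1))
    x = min(max(x, lo), hi)
    # Drift in the direction of traffic with a randomly chosen speed;
    # cars re-enter at the other end of the grid in this toy sketch.
    y = (y + direction * random.choice(speeds)) % length
    return x, y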
This model can be naturally represented as a fac-
tored MDP. Indeed, the reward function lends itself
to a factored representation, because we only penal-
ize collisions with other cars, so the total reward can
be represented as a sum of local reward functions,
each one a function of the relative positions of the
controlled car and one of the uncontrolled cars.[1] The
transition function of the MDP also factors well, be-
cause each car moves mostly independently, so the
factored transition function can be represented as a
Bayesian network with each node depending on a
small number of world features.
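A minimal sketch of the pairwise reward decomposition described above (the penalty of -1 per collision is an assumed value):

def factored_reward(positions, collision_penalty=-1.0):
    # Total reward as a sum of local terms, one per uncontrolled car; each
    # term depends only on that car's position relative to the controlled car.
    ego = positions[0]            # controlled car, as in flat_state above
    return sum(collision_penalty if other == ego else 0.0
               for other in positions[1:])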
[1] We also experimented with other, more interesting domains and reward functions (e.g., roads with shoulders where moving on a shoulder gave a small penalty); the high-level results were consistent across such modifications.
4 BASIS SELECTION AND EVALUATION
As mentioned earlier, ALP is very sensitive to the
choice of basis functions H and Q. Therefore, our
main goal is to design procedures for constructing pri-
mal (H) and dual (Q) basis sets that are compact, but
at the same time yield high-quality control policies.
The basic domain-independent idea behind our al-
gorithm is to use solutions to smaller MDPs as ba-
sis functions for larger problems. For our collision-
avoidance domains, we implemented this idea as fol-
lows. For every pair of objects, we constructed an MDP with the original topology but without any other objects, solved it exactly, and used the resulting optimal value functions as the primal basis H and the optimal occupation measures as the dual basis Q for the original MDP.
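The sketch below illustrates this construction under standard assumptions: each small (pairwise) MDP is solved exactly by value iteration, its optimal value function becomes a column of H, and the discounted occupation measure of the corresponding greedy policy becomes a column of Q. Lifting these pairwise functions to the full state space (each depends only on the two cars' positions) is omitted, and all names are placeholders rather than the paper's notation.

import numpy as np

def solve_small_mdp(P, R, gamma=0.95, alpha=None, iters=1000):
    # Solve a small (e.g., pairwise) MDP exactly; return its optimal value
    # function and the occupation measure of the greedy policy.
    # P: (A, S, S) transitions, R: (S, A) rewards, alpha: (S,) initial distribution.
    A, S, _ = P.shape
    alpha = np.full(S, 1.0 / S) if alpha is None else alpha
    V = np.zeros(S)
    for _ in range(iters):                       # value iteration
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V = Q.max(axis=1)
    pi = Q.argmax(axis=1)                        # greedy policy
    P_pi = P[pi, np.arange(S), :]                # transitions under pi
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, alpha)   # state occupancy
    mu = np.zeros((S, A))
    mu[np.arange(S), pi] = d                     # state-action occupation measure
    return V, mu

def build_composite_bases(pairwise_mdps, gamma=0.95):
    # Columns of the primal basis H are pairwise value functions; columns of
    # the dual basis Q are the corresponding occupation measures.
    H_cols, Q_cols = [], []
    for P, R in pairwise_mdps:
        V, mu = solve_small_mdp(P, R, gamma)
        H_cols.append(V)
        Q_cols.append(mu.reshape(-1))
    return np.column_stack(H_cols), np.column_stack(Q_cols)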
We empirically evaluated this method on the car
domain from Section 3.[2] In our experiments, we var-
ied the geometry of the grid and the number of cars,
and for each configuration, we solved the correspond-
ing factored MDP using the ALP method described
above, and evaluated the resulting policies using a
Monte Carlo simulation (an exact evaluation is infea-
sible, due to the curse of dimensionality).
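A minimal sketch of this kind of Monte Carlo policy evaluation follows; the episode count, horizon, and the policy/step interfaces are placeholders, not the paper's experimental settings.

def monte_carlo_value(policy, sample_initial_state, step, gamma=0.95,
                      episodes=10000, horizon=200):
    # Estimate the discounted value of a policy by simulation.
    # policy(s) -> action; step(s, a) -> (next_state, reward); both callables
    # are supplied by the environment model.
    total = 0.0
    for _ in range(episodes):
        s, discount, ret = sample_initial_state(), 1.0, 0.0
        for _ in range(horizon):
            s, r = step(s, policy(s))
            ret += discount * r
            discount *= gamma
        total += ret
    return total / episodes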
Figure 1a shows the value of the approximate poli-
cies computed in this manner, as a function of how
highly constrained the problem is (the ratio of the grid
area to the number of cars), with the average values
of random policies shown for comparison. The im-
portant question is, of course, how close our solution
is to the optimum. Unfortunately, for all but the most
trivial domains, computing the optimal solution is in-
feasible, so we cannot directly answer that question.
However, for our collision-avoidance domains, where
only negative rewards are obtained in collision states,
we can upper-bound the value of any policy by zero.
Using this upper bound on the quality of the optimal
solution, we can compute a lower bound on the rela-
tive quality of our approximation, which is shown in
Figure 1b. Notice that, for highly constrained prob-
lems (where optimal solutions have large negative
values), this lower bound can greatly underestimate
the quality of our solution, which explains low num-
bers in the left part of the graph. However, even given
this pessimistic view, our ALP method produced poli-
cies that were, on the average, no worse than 92% of
the optimum (relative to the optimal-random gap).
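For concreteness, one way to write this bound (with $V_{\mathrm{ALP}}$, $V_{\mathrm{rand}}$, and $V^{*}$ denoting the estimated value of our policy, the value of a random policy, and the unknown optimal value, respectively) uses the fact that $V^{*} \le 0$:
\[
\frac{V_{\mathrm{ALP}} - V_{\mathrm{rand}}}{V^{*} - V_{\mathrm{rand}}}
\;\ge\;
\frac{V_{\mathrm{ALP}} - V_{\mathrm{rand}}}{0 - V_{\mathrm{rand}}},
\qquad \text{provided } V_{\mathrm{ALP}} \ge V_{\mathrm{rand}}.
\]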
We also evaluated our approximate solution by its
relative gain in efficiency. In our experiments, the
sizes of the primal and dual basis sets grow quadrat-
ically with the number of cars, while the size of the
exact LP (eq. 1) grows exponentially. Table 1 illus-
trates the complexity reduction achieved by using the
composite ALP approach. In fact, the difference in
[2] Other collision-avoidance domains had similar results.