1997), which seeks to minimize the structural variance in a hypothesis. Positioning the splits as far as possible from the dense regions of observations minimizes the risk that, in the next run of the problem, new observations belonging to one dense region will spill into an adjacent state and mislead the learning there.
This process is linear in the number of bins used to measure the frequency distribution, and in practice it had only a negligible impact on the running time of each generation of the algorithm, so it is a feasible abstraction algorithm in terms of time complexity.
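To make the procedure concrete, the sketch below places split points in the sparsest bins of a frequency histogram over a one-dimensional network output in [0, 1]. The function names, bin count, and tie-breaking behaviour are our illustrative assumptions, not the exact implementation.

```python
import numpy as np

def place_splits(outputs, n_states, n_bins=100):
    """Place n_states - 1 splits in the least dense regions of the
    observed outputs (illustrative; exact tie-breaking may differ)."""
    counts, edges = np.histogram(outputs, bins=n_bins, range=(0.0, 1.0))
    centres = (edges[:-1] + edges[1:]) / 2.0
    # Order bin centres from least to most frequently visited and keep
    # the sparsest ones as split points, sorted left to right.
    sparsest = centres[np.argsort(counts, kind="stable")]
    return np.sort(sparsest[:n_states - 1])

def abstract_state(output, splits):
    # The abstract state is the index of the interval the output falls in.
    return int(np.searchsorted(splits, output))
```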
MDS effectively overcomes the limitations of the fixed tiling: abstract state partitions can be placed anywhere in the space and are actively adjusted to fit the outputs of the neural networks. The boundaries therefore adapt to the outputs dynamically instead of depending on expensive evolution to fit the neural networks to the abstract states. The MDS method does introduce some limitations of its own, however. It could be the case that an area of dense observations is not actually homogeneous in terms of preferred action but was coincidentally grouped together by the ANN. In this case, the abstract states might still become successful if the ANN adapts and separates these observations into two different clusters in a later evolutionary stage. Another concern is the additional computational burden if a more complicated clustering or density estimation method were used; however, our results show that the simple frequency-based approach taken here can be effective.
4 EXPERIMENTAL SETUP
Here we evaluate the benefit MDS gives the RL-SANE algorithm in terms of convergence speed and the number of abstract states in the solution. In addition to comparing the automatic MDS method against the fixed tiling of RL-SANE, we include a third method in the study: a mutation method that allows RL-SANE to mutate the number of abstract states during the evolution of the network. The experiments are carried out on two benchmark RL problems, mountain car and double pole balance.
The mutation method simply introduces another mutation operator into the neural network’s evolutionary process, one which changes the number of tiles used in the abstraction in the next generation. The number of tiles can increase or decrease by up to five per mutation, so that each subsequent abstraction remains somewhat similar to the preceding one. The mutation method is not completely automatic; the initial number of tiles has an impact on how the algorithm performs. We also present a brief study of this phenomenon in the results.
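A minimal sketch of such an operator follows, assuming the tile count is stored as an integer on the genome; the lower bound of two tiles is our assumption.

```python
import random

def mutate_tile_count(n_tiles, max_delta=5, min_tiles=2):
    """Shift the number of abstract states by up to +/- max_delta so
    each abstraction stays similar to the preceding one. The floor of
    two tiles is an assumption, not taken from the paper."""
    delta = random.randint(-max_delta, max_delta)
    return max(min_tiles, n_tiles + delta)
```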
The mountain car problem (Boyan and Moore,
1995) consists of a car trying to escape a valley as
illustrated in Figure 2a. The car starts at a random po-
sition in the valley and its goal is to drive over the hill
to the right of the starting position. Unfortunately, the
car’s engine is too weak to drive up the hill directly.
Instead, the driver must build momentum by driving
forward and backward in order to escape the valley.
Only two perceptions define this problem: the position of the car within the valley, X, and the velocity of the car, V. Time is discretized into
small intervals and the learner can choose one of two
actions in each time step: drive forward or backward.
The only reward assigned is -1 for each action taken before the car reaches the goal
of escaping the valley. Since RL algorithms seek to
maximize the reward, the optimal policy is the one
that enables the car to escape the valley as quickly as
possible.
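For reference, here is a minimal sketch of the mountain car dynamics and reward, using the constants of the widely cited Sutton and Barto formulation; the Boyan and Moore (1995) variant used here may differ in detail.

```python
import math

# Bounds from the standard formulation (an assumption; the cited
# variant may use different constants).
X_MIN, X_MAX = -1.2, 0.6
V_MIN, V_MAX = -0.07, 0.07

def step(x, v, action):
    """One time step; action is -1 (backward) or +1 (forward).
    Returns (x, v, reward, done); the reward is -1 until escape."""
    v = min(max(v + 0.001 * action - 0.0025 * math.cos(3 * x), V_MIN), V_MAX)
    x = min(max(x + v, X_MIN), X_MAX)
    if x <= X_MIN:
        v = 0.0  # the car stops against the left wall
    done = x >= X_MAX  # the car has escaped over the right hill
    return x, v, -1.0, done
```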
The double inverted pole balancing prob-
lem (Gomez and Miikkulainen, 1999) depicted in
Figure 2b is a very difficult RL benchmark problem.
In this problem, the learner must balance two poles
of different length and mass which are attached to a
moving cart. The problem is further complicated by the requirement that the cart stay within a certain small stretch of track. If the learner is able to keep the poles from falling over for a specified amount of time, the problem is considered solved.
This is a higher dimensional problem than the mountain car problem, with six perceptions being given to the learner: the position of the cart X, the velocity of the cart X′, the angle each pole makes with the cart, θ₁ and θ₂, and the angular velocities of the poles, θ′₁ and θ′₂. Once again, time is discretized into small intervals, and during any such interval the learner can choose to push the cart to the left or right or to leave it alone. In our experiment, the learner only receives a -1 reward for dropping either pole or exceeding the bounds of the track. If the learner is able to balance both poles and not exceed the bounds of the track for 10⁶ time steps, the problem is taken to be solved.
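The state vector and failure test can be summarised as below; the track and angle limits are the values commonly used with this benchmark and are assumptions here, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class CartState:
    """The six perceptions given to the learner (symbols as in the text)."""
    x: float           # cart position X
    x_dot: float       # cart velocity X'
    theta1: float      # angle of the first pole
    theta2: float      # angle of the second pole
    theta1_dot: float  # angular velocity of the first pole
    theta2_dot: float  # angular velocity of the second pole

TRACK_LIMIT = 2.4    # metres from centre; commonly used value (assumption)
ANGLE_LIMIT = 0.628  # about 36 degrees in radians (assumption)
MAX_STEPS = 10 ** 6  # steps to balance for the problem to count as solved

def failed(s: CartState) -> bool:
    # The episode ends with a -1 reward if either pole drops or the
    # cart leaves the track.
    return (abs(s.x) > TRACK_LIMIT
            or abs(s.theta1) > ANGLE_LIMIT
            or abs(s.theta2) > ANGLE_LIMIT)
```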
On each problem, the three methods were evaluated over 25 runs using different random seeds (the same seed values were used for all three methods). For each run, both the mountain car and double pole balance environments used a problem set of 100 random initial start states. We report the average values across the 25 runs in our results. It should be noted that the mutation and fixed tiling approaches depend significantly on the initial number of abstract states, while MDS does not.
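A sketch of this evaluation protocol, with run_method as a hypothetical stand-in for any of the three methods:

```python
import random

def evaluate(run_method, n_runs=25, problem_set_size=100):
    """Average a method's score over 25 seeded runs; reusing the seeds
    (0..24 here, an assumption) gives all methods identical problem sets."""
    scores = []
    for seed in range(n_runs):
        rng = random.Random(seed)
        # Placeholder: each start state is a single random number here;
        # the real environments sample full initial states instead.
        starts = [rng.random() for _ in range(problem_set_size)]
        scores.append(run_method(starts, seed))
    return sum(scores) / len(scores)
```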