However, we should not exclude the possibility that the DQN's problems are related to the chosen hyperparameters, especially those of the convolutional networks; choosing larger filters, for instance, might mitigate this problem. Moreover, if the DQN already reached an excellent solution on its own in a different implementation, adding a wide component might improve neither the training speed nor the results.
In conclusion, we believe that integrating a well-chosen wide component into the WDQN model can substantially speed up learning. Conversely, adding a deep component to a linear agent can overcome its linearity limitation by turning it into a non-linear model. Care is needed, however, when choosing which features to integrate into the combined agent. Lastly, a favorable wide component can compensate for the deep component's difficulty in learning from too few examples.
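To make this combination concrete, the following is a minimal PyTorch sketch of one plausible wide-and-deep Q-value head, in which a linear (wide) term over per-action hand-crafted features phi(s, a) is added to the output of a small convolutional Q-network. The class name, layer sizes, feature shapes, and the additive combination are our illustrative assumptions, not the exact WDQN implementation evaluated in this paper.

```python
import torch
import torch.nn as nn


class WideDeepQNetwork(nn.Module):
    """Illustrative wide & deep Q-head: Q(s, a) = Q_deep(s)[a] + w . phi(s, a)."""

    def __init__(self, n_actions, n_wide_features, frame_channels=4):
        super().__init__()
        # Deep component: a small convolutional Q-network over raw frames.
        self.deep = nn.Sequential(
            nn.Conv2d(frame_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )
        # Wide component: a single linear weight vector over hand-crafted
        # features phi(s, a), one feature vector per action.
        self.wide = nn.Linear(n_wide_features, 1, bias=False)

    def forward(self, frames, wide_features):
        # frames: (batch, channels, H, W); wide_features: (batch, n_actions, n_wide_features)
        q_deep = self.deep(frames)                     # (batch, n_actions)
        q_wide = self.wide(wide_features).squeeze(-1)  # (batch, n_actions)
        return q_deep + q_wide                         # combined Q-values


# Example shapes (batch of 2, four stacked 84x84 frames, 5 actions, 7 wide features):
net = WideDeepQNetwork(n_actions=5, n_wide_features=7)
q_values = net(torch.zeros(2, 4, 84, 84), torch.zeros(2, 5, 7))  # -> shape (2, 5)
```

Because the two heads are summed, the wide weights are trained with the same temporal-difference target as the deep network, which is what allows the linear part to act as a fast, interpretable complement to the slower deep part.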
6 CONCLUSION
Our research shows that WDQN agents can outperform both linear and DQN agents in score, winning rate, and learning speed. The chosen features play an important role in achieving these results, although the selected feature(s) can also impose learning limitations. The research demonstrates that combining a neural network with a linear agent improves results: the model can learn non-linear relationships, incorporate information about the interactions between specific features, and remain adaptable to uncertainty. Furthermore, the wide component can compensate for the weaknesses of a non-linear agent by helping it learn faster and concentrate on less obvious but important features.
Our method is straightforward and applicable to various deep reinforcement learning contexts. For real-world applications such as robotics, the combination of linear and non-linear functions in our Wide and Deep Reinforcement Learning provides an interesting tool for integrating new devices, such as sensors, into DRL agents in the form of features, or for incorporating expert knowledge through human-chosen features. Future work could extend WDQN to newer DQN-related algorithms and develop methods that make WDQNs easier to implement, for example by automatically deriving the learning rate of the wide component from that of the deep component (sketched below) to reduce the number of hyperparameters. In addition, one could research how to ensure that the deep component can override the influence of detrimental features in the wide component.
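As one illustration of the learning-rate idea, the sketch below couples the wide component's learning rate to the deep component's through a single scale factor using optimizer parameter groups. The helper name, the choice of RMSprop, and the scale value are illustrative assumptions rather than a recommendation, and the model is assumed to expose .deep and .wide submodules as in the sketch above.

```python
import torch


def make_wdqn_optimizer(model, lr_deep=2.5e-4, wide_scale=10.0):
    """Hypothetical helper: derive the wide learning rate from the deep one,
    so only lr_deep (and a fixed scale) remain as hyperparameters."""
    return torch.optim.RMSprop([
        {"params": model.deep.parameters(), "lr": lr_deep},
        {"params": model.wide.parameters(), "lr": lr_deep * wide_scale},
    ])
```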
ACKNOWLEDGEMENTS
We thank our colleagues from the Chair for Bioinformatics and Information Mining of the University of Konstanz, who provided insight and corrections that greatly assisted the research. We are especially grateful to Christoph Doell and Benjamin Koger for their continuous assistance with the paper's structure and the engineering of the deep neural networks.