Towards Adaptive Deep Reinforcement Game Balancing
Ashey Noblega, Aline Paes and Esteban Clua
Department of Computer Science, Institute of Computing, Niterói, RJ, Brazil
*The authors would like to thank NVidia and the Brazilian research agencies CAPES and CNPq for funding this research.
Keywords:
Game Balancing, Reinforcement Learning, Deep Learning, Dynamic Balancing, Playability.
Abstract:
The experience of a player regarding the difficulty of a video game is one of the main reasons for him or her to decide whether to keep playing the game or to abandon it. Effectively, player retention is one of the primary concerns of the game development process. However, the experience of a player with a game is unique, making it impractical to anticipate how they will face the gameplay. This work leverages the recent advances in Reinforcement
Learning (RL) and Deep Learning (DL) to create intelligent agents that are able to adapt to the abilities of
distinct players. We focus on balancing the difficulty of the game based on the information that the agent
observes from the 3D environment as well as the current state of the game. In order to design an agent that
learns how to act while still maintaining the balancing, we propose a reward function based on a balancing
constant. We require that the agent remains inside a range around this constant during the training. Our
experimental results show that by using such a reward function and combining information from different
types of players it is possible to have adaptable agents that fit the player.
1 INTRODUCTION
In video games, the gameplay plays an important role in the success of a game. Arguably, even if a game has the most realistic audiovisual features or the most fanciful story, challenges unsuited to the player's ability may harm his experience and make him abandon the game.
Issues related to the gameplay refer to how the game interacts with the player, such as the game's challenges, objectives, rules, and so on (Southey et al., 2005). Many efforts have been made to yield adjustable games such that the player does not get frustrated by a very difficult game, or bored when the game is too easy. Commonly, this problem is tackled by creating a finite number of difficulty levels (Olesen et al., 2008) that the players must select according to their beliefs on which one fits them better. Nevertheless, this is possibly an ineffective solution, as the player may have intermediate skills between these classes, may be constantly learning (or forgetting) skills, or may have a wrong vision of his own abilities (Missura and Gärtner, 2009).
Thus, terms such as difficulty adjustment (Hunicke, 2005; Missura and Gärtner, 2009), adaptive games (Spronck et al., 2006), and dynamic game balancing (Andrade et al., 2006) are receiving great attention in the games industry. Particularly, the topic of dynamic game balancing refers to the process of automatically changing behaviors, parameters, and/or elements of the scenarios in a video game in real time, to match the game to the ability of a specific player (Andrade et al., 2005). For a long time, most of the effort to yield balanced games was based on creating specific heuristics and/or probabilistic methods that would hardly adapt to different games or to players with diverse abilities (Silva et al., 2015; Hunicke, 2005).
Based on this motivation, we propose a strategy
where the non-player characters, which can certainly
influence the gameplay, have the ability to learn how
to act and leverage the game balancing. In this direction, we expect that they can match the human player's performance, whatever the game is, and consequently make the player have more fun and stick to the game.
To achieve similar game-based goals, previous
works have taken advantage of Machine Learning
methods, focusing on providing artificial agents with
the ability of learning from experience without be-
ing explicitly programmed (Mitchell, 1997; Michal-
ski et al., 2013). Other approaches follow a semi-
automatic analysis of the playability of a game
(Southey et al., 2005), recognizing the level of knowl-
edge the player has to configure opponents (Bakkes
et al., 2009) and new environments according to the
player's progress (Lopes et al., 2018).
Other researchers have attacked the same problem
by recognizing the type of player, followed by a com-
parison to groups of similar players and, finally, con-
figuring the game according to the mapped character-
istics (Missura and Gärtner, 2009). However, it may
be difficult to fit the player to a group of other players,
as human behavior presents quite singular and unique
traits (Charles et al., 2005).
There are still other approaches that focus on creating adaptable agents that directly interact with the environment, taking advantage of Reinforcement Learning techniques (Sutton et al., 1998). In this case, the agent receives the current state of the environment and a reward value, aiming at learning the policy it should follow. Previous work (Andrade et al., 2005) has followed this approach by using the classical Q-learning algorithm (Watkins and Dayan, 1992). Nevertheless, value-based methods such as Q-learning may face difficulties in dealing with continuous spaces and may converge slowly, as they do not directly optimize the policy function.
In this work, we propose an NPC agent that does not need a representative model of the player to adapt to him, and whose policy is directly induced from how the agent observes the game. To achieve that, we designed a reward function based on a game-balancing constant and introduced it into the Proximal Policy Optimization (PPO) (Schulman et al., 2017) algorithm, a reinforcement learning method that directly optimizes the policy using
gradient-based learning. In order to tackle the com-
plexity of the environment, the PPO implementation
we selected follows a Deep Reinforcement Learning
approach (Mnih et al., 2015). In this way, we can also
benefit from the remarkable results that other game-
based problems have recently achieved (Mnih et al.,
2013; Silver et al., 2016; Lample and Chaplot, 2017).
We take advantage of the Unity ML-Agents Toolkit
(Juliani et al., 2018) to implement the graphical envi-
ronments and to run the PPO algorithm. Experimen-
tal results show that we are able to devise adaptable
agents, at least when facing other non-human players.
2 RELATED WORK
The challenge of balancing the game refers to the abil-
ity of a game to modify or adjust its level of difficulty
according to the level of the user so that he can be con-
tent with the game. This includes preventing the player from getting stressed or bored when playing the game due to very difficult or easy situations (Csikszentmihalyi and Csikszentmihalyi, 1992). In fact, (Andrade et al., 2006) showed that satisfaction and balance are quite related, by asking players to answer a series of questions after experiencing a fighting game.
To achieve game balancing, previous work has focused on adding new content (environments, elements, obstacles, etc.) according to the abilities of the player, while other works have sought to create intelligent agents capable of facing the player without hindering his possible success (Bakkes et al., 2012). Examples of the first group include (Bakkes et al., 2014) and (Hunicke, 2005), which try to modify the environment or add new elements, but knowing beforehand how to represent the behavior of the player within the game.
Regarding the second group, in (Missura and Gärtner, 2009), the authors tried to create an auto-
matic adjustment by first identifying the level of the
player and then fitting him into a specific group (Easy,
Medium or Difficult). The goal is to include a new
player inside one of these groups so that their oppo-
nents are easier or more difficult to confront. That
work sees the identification of the type of player as
a fundamental issue to have balanced games. Our
work aims to avoid creating abstractions and repre-
sentations of the player, as this is sometimes difficult
to observe beforehand. Instead, we focus on repre-
senting the current state of the game to determine the
decision-making policy of the agent against his oppo-
nent.
Meanwhile, (Silva et al., 2015) uses a heuristic
function to determine the performance of the player
when facing his enemies during the game and, accord-
ing to it, the difficulty of the game is increased or de-
creased. Regarding the use of Reinforcement Learning (RL), which is the Machine Learning technique most used for game-based issues, (Andrade et al., 2006) adopted RL to teach a virtual agent to imitate the player. After that, the agent is further trained to balance the difficulty of the game. The work presented in (Andrade et al., 2005), unlike the previous one, only used RL to teach the agent to fight; the balancing is achieved by a heuristic function tuned according to the abilities of the player. We do not intend
to teach the agent to imitate a player, but, instead, we
aim at devising an agent that learns altogether how to
play the game and how to adapt to the player.
3 REINFORCEMENT LEARNING
In this work we rely on Reinforcement Learning (Sut-
ton et al., 1998) to make the agent learn how to act to
achieve game balancing. We benefit from the Unity
ML-agents (Juliani et al., 2018) to implement the
game environment coupled with the RL algorithms’
implementation. In this section we overview both
these components of our work.
Reinforcement Learning aims at teaching the
agent to make decisions when facing a certain situa-
tion through trial and error. When learning, the agent
receives a stimulus in the form of a reward so that it
can evaluate if the chosen action was good for the de-
cision making or not.
A RL task can be formally described as a Markov
Decision Process (MDP) (White III and White, 1989).
Thus, at each time step t of the decision-making process, the environment is represented by a state s ∈ S (the set of states), and the agent may choose an action a ∈ A (the set of actions) that is available in the state s. A (possibly non-deterministic) transition function T indicates the state s′ that the agent will land on at instant t + 1, given that it has taken action a at instant t when it was in state s. After going from the state s to the state s′ because of the action a, the agent receives an immediate reward R_a(s, s′). A discount factor γ indicates how much future rewards matter compared to the current one.
The overall goal is to find a policy $\pi^* = \arg\max_{\pi} E[R \mid \pi]$ that maps states to actions in order to maximize the accumulated reward R (obtained from the environment in every rollout of a policy) over all states (Sutton et al., 1998).
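To make the role of the discount factor γ and of the accumulated reward R concrete, the following minimal Python sketch (ours, not part of the toolkit used later) computes the discounted return of a single rollout of rewards:

def discounted_return(rewards, gamma=0.99):
    # Accumulated reward of one rollout, discounted by gamma:
    # later rewards matter less than the current one.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards 1, 0, 1 with gamma = 0.9 give 1 + 0.9*0 + 0.81*1 = 1.81.
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))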
Proximal Policy Optimization. As the RL method,
in this work we follow the Proximal Policy Opti-
mization algorithm due to its recent success in handling a number of game problems (Schulman et al., 2017). It is a policy-search method largely inspired by the Advantage Actor-Critic (A2C) framework (Mnih et al., 2016). The Advantage Actor-Critic method relies on a gradient estimator coupled with an advantage function

$\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T)$,

such that the policy-gradient estimator is $\hat{g} = \hat{E}_t[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t]$, i.e., the gradient of the objective $L^{PG}(\theta) = \hat{E}_t[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t]$. The estimator of the advantage function runs the policy for T timesteps, instead of waiting until the end of the episode to collect the reward, and uses those collected samples to update the policy.
This is where the Critic comes in: to estimate the
advantage function value. Within Deep Reinforce-
ment Learning, we have a neural network to estimate
π(s, a, θ) (the actor) and another one to estimate the
advantage function (the critic), where they possibly
have shared parameters.
Algorithm 1 reproduces the PPO algorithm
from (Schulman et al., 2017). Note that L represents
either a clipped version or a KL-penalized version of
the objective function differentiated through the esti-
mator $\hat{g}$.
Algorithm 1: PPO top-level algorithm, Actor-Critic style, as presented in (Schulman et al., 2017).

for iteration = 1, 2, ... do
    for actor = 1 to N do
        Run policy $\pi_{\theta_{old}}$ in environment for T timesteps;
        Compute advantage estimates $\hat{A}_1, \dots, \hat{A}_T$;
    end
    Optimize surrogate L w.r.t. $\theta$, with K epochs and minibatch size $M \leq NT$;
    $\theta_{old} \leftarrow \theta$
end
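As a concrete illustration of the clipped variant of the surrogate objective L, the following minimal NumPy sketch (written by us; not the toolkit's implementation) evaluates the PPO-Clip objective for a batch of already collected log-probabilities and advantage estimates, with epsilon denoting the clipping coefficient from (Schulman et al., 2017):

import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    # Clipped surrogate objective of PPO (Schulman et al., 2017).
    # logp_new:   log pi_theta(a_t | s_t) under the current policy
    # logp_old:   log pi_theta_old(a_t | s_t) under the data-collecting policy
    # advantages: advantage estimates A_hat_t for the same timesteps
    ratio = np.exp(logp_new - logp_old)                     # r_t(theta)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)  # clip(r_t, 1-eps, 1+eps)
    # Average over the batch of the minimum between unclipped and clipped terms.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))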
4 BALANCING GAMES WITH
DEEP RL AND A BALANCING
CONSTANT
This work aims at creating a game agent with two implicit goals that should be targeted at the same time: learning how to play a game and learning how to keep up with the player to maintain the game balanced. To that end, we build a reinforcement learning task and use the PPO algorithm to solve it for one main reason: our task contemplates a continuous space, and policy search is more appropriate to tackle this kind of environment. Moreover,
PPO has succeeded in a number of other game-based
tasks (Schulman et al., 2017). In this section, we present the details of our contributions.
Reward Function. A key component of an RL task
is how we reward or penalize the learner according to
his actions in the environment. To tackle the balanc-
ing game problem, the reward function must envision
the behavior of the agent in the game environment
(for example, how to move, or how to fight) along
with the balancing itself (how not to be much worse
or much better than the player). Thus, we propose a
game-based reward function that includes a balancing
constant aiming at pointing out how the agent can be
a fair opponent to the player. In this way, he will have
some implicit knowledge on how far (or near) he is
from the desired balanced state and, if he is within it,
how he should behave in such a state.
Balancing Constant (BC). The balancing state
refers to the moments of the game in which the skill difference ΔSkill remains in a certain range (Equation 1). The intuition
behind this is that in such a state, the agent is not
that easy to confront or that difficult to overcome by
the player. Consequently, the balancing constant is a
value that helps our function to achieve the desired
behaviour. In other words, this constant represents the
maximum skill difference between the agent and the
player.
$0 \leq \Delta Skill \leq BC$   (1)
BC-Based Reward Function (BCR). Moreover, by using the BC in our reward function it is possible to distinguish two additional situations when the
agent is not in a balanced state, i.e., the BCR can iden-
tify in which situation the agent is and, according to
it, to provide a reward value to stimulate the desired
learning (equation 2). The three reward situations are
as follows:
Subjugated Punishment (SP), defined as ΔSkill < 0: In this case, the reward value is directly proportional to the inability of the agent to represent an obstacle to the player, i.e., the more docile the agent is, the more it will be punished.
Conservation Reward (CR), defined as 0 ≤ ΔSkill ≤ BC: In this case, the reward value represents a compensation for maintaining the degree of competition within the desired range of the BC, which in turn encourages the agent to reach the limit of equilibrium.
Rebellious Punishment (RP), defined as BC < ΔSkill: This is an inverse retaliation to the Subjugated Punishment case, obtained by overcoming the BC limit that the agent can have with respect to his opponent.
$$BCR(\Delta Skill) = \begin{cases} \dfrac{\Delta Skill}{Skill_{max}}, & \text{if } \Delta Skill < 0 \\ \dfrac{\Delta Skill}{BC}, & \text{if } 0 \leq \Delta Skill \leq BC \\ -\dfrac{\Delta Skill - BC}{Skill_{max} - BC}, & \text{if } BC < \Delta Skill \end{cases} \quad (2)$$
We can see in Figure 1 that BCR(ΔSkill) is a piecewise function that reflects our idea of the three possible major cases that the agent may attain during training. Thus, if the agent is very distant from the range of balancing, it will receive
an even greater punishment; on the other hand, if it
gets to be within the balancing state, it is stimulated
to reach the limit of the range.
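For concreteness, a minimal Python sketch of the reward of Equation 2 follows, assuming the piecewise form above, with bc denoting the balancing constant and skill_max denoting Skill_max:

def bcr(delta_skill, bc=30.0, skill_max=100.0):
    # BC-based reward (Equation 2): punish subjugated and rebellious states,
    # reward keeping the skill difference inside [0, BC].
    if delta_skill < 0:                              # Subjugated Punishment (SP)
        return delta_skill / skill_max
    elif delta_skill <= bc:                          # Conservation Reward (CR)
        return delta_skill / bc
    else:                                            # Rebellious Punishment (RP)
        return -(delta_skill - bc) / (skill_max - bc)

With bc = 30 and skill_max = 100, for example, bcr(-20) yields -0.2 (SP), bcr(15) yields 0.5 (CR), and bcr(65) yields -0.5 (RP).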
BC-Based Balancing Metric (BCM). The BCM (Equation 3) is a measure that evaluates the percentage of time that an agent spends in the balanced state, relative to the total time spent in the three situations mentioned above: Unbalanced by Subjugate, Balanced, or Unbalanced by Rebellious.
$$BCM = \frac{T_B}{T_{US} + T_B + T_{UR}} \quad (3)$$
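A short sketch of how this metric could be computed from a per-timestep log of skill differences follows; the helper and variable names are ours, introduced only for illustration:

def bcm(delta_skill_log, bc=30.0):
    # Fraction of timesteps spent in the balanced state (Equation 3).
    t_us = sum(1 for d in delta_skill_log if d < 0)         # Unbalanced by Subjugate
    t_b = sum(1 for d in delta_skill_log if 0 <= d <= bc)   # Balanced
    t_ur = sum(1 for d in delta_skill_log if d > bc)        # Unbalanced by Rebellious
    return t_b / (t_us + t_b + t_ur)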
Figure 1: The BC-Based Reward Function (reward as a function of ΔSkill, with the SP, CR, and RP regions) with BC = 30 and Skill_max = 100.
Our hypothesis is that, starting from a balanced state, the player can perform actions that cause the agent to receive a punishment for not keeping the game within the BC interval, so the agent must perform an action to control this situation. Conversely, an unbalanced state will make the agent execute an action that moves the game back to a balanced state, as shown in Figure 2.
Figure 2: Minimal Agent Balancing Interaction: player imbalance actions move the game from the Balanced state to Unbalanced by Subjugate or Unbalanced by Rebellious, and agent balance actions move it back to Balanced.
5 EVALUATION
In this section we present our test scenario and show
the results we have obtained with the game balancing
method we proposed here.
5.1 Unity ML-Agents Toolkit
To produce the game environment scenarios and run
the RL algorithm, we rely on The Unity ML-Agents
Toolkit (Juliani et al., 2018). It is an open-source toolkit that allows the development of virtual environments to serve as training scenarios for intelligent agents, using machine learning algorithms based on TensorFlow (Abadi et al., 2016). The toolkit contains three high-level key components: (1) the Learning Environment, containing all the elements of the game that are part of the Unity scene; (2) the Python API, composed of the training algorithms; and (3) the External Communicator, which connects the Python API to Unity.
The toolkit has a component called Brain which
contains three elements of the RL task: (1) the Ob-
servation, which is the information provided by the
agent about the environment, either in a numerical
format (floating-point number vector) or visual (im-
ages taken by the cameras linked to the agent); (2)
the Action, represented as a vector of continuous or
discrete actions; and (3) the Reward.
Furthermore, a set of extensible examples using
different RL algorithms are available in the toolkit.
The Proximal Policy Optimization (PPO) (Schulman
et al., 2017) implemented in the toolkit can be augmented with the Intrinsic Curiosity Module (ICM) or with a Long Short-Term Memory (LSTM) model. As we previ-
ously mentioned, we focus on creating an intelligent
agent based on the PPO algorithm.
5.2 Test Game Scenario
We evaluate our game balancing method in an en-
vironment similar to the commercial game Capcom
Street Fighter, inspired by (Andrade et al., 2006). The
game consists of the confrontation of two entities (the
agent and the player), whose objective is to defeat the
opponent through physical attacks. The game ends
when one of them leaves the other lifeless or when 100 seconds have passed. In the latter case, the winner is the one with more life points at the end of the game. This game is based on a 3D environment, like most current video games.
5.3 Player Simulation
To simulate the behavior of players with distinctive
nature, we created two types of enemies with different
abilities that are going to confront our learner agent, as follows (an illustrative sketch of both behaviors is given after the list):
Defensive Simulation: This programmed enemy
emulates the behavior of a player whose goal is
to win by a minimum difference. Thus, he adopts
a defensive posture to avoid taking any damage
while still maintaining his advantage.
Aggressive Simulation: The behavior of this
other type of emulated player differs from the pre-
vious one because this one tries to win with the
maximum possible difference, i.e., it will try to at-
tack all the possible times to generate the greatest
damage in our learner agent.
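The exact scripted logic of these emulations is not specified by the toolkit; the following is only a hypothetical Python sketch of how the two behaviors could be expressed, where the OpponentView fields and the returned action names are illustrative assumptions built from the game's action set (move, duck, kick, punch):

from dataclasses import dataclass

@dataclass
class OpponentView:
    in_attack_range: bool   # close enough to land a kick or punch
    incoming_attack: bool   # the other fighter is currently attacking
    has_advantage: bool     # currently ahead on life points

def defensive_policy(view):
    # Emulated defensive player: win by a minimum difference,
    # avoiding damage while keeping a small advantage.
    if view.incoming_attack:
        return "DUCK"                # prioritize not taking damage
    if view.has_advantage and view.in_attack_range:
        return "WALK_BACKWARD"       # preserve the current lead
    if view.in_attack_range:
        return "PUNCH"               # attack only to gain a minimal lead
    return "WALK_FORWARD"

def aggressive_policy(view):
    # Emulated aggressive player: maximize the final difference
    # by attacking whenever possible.
    if view.in_attack_range:
        return "KICK"
    return "WALK_FORWARD"            # close the distance and keep attacking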
5.4 Skill Study
Naturally, as we work with a fighting game, we consider it adequate to measure the balance of the game by computing the difference between the health scores of the agent and the player (Equation 4). Therefore, the balancing state is assumed as long as the agent maintains a difference within the BC interval (Equation 5).
$$\Delta H = H_{Agent} - H_{Player} \quad (4)$$
$$0 \leq \Delta H \leq BC \quad (5)$$
In this way, our objective is to advise the agent to maintain a life difference no bigger than the BC throughout the game time, as shown in Figure 3.
Figure 3: Persistence of the BC throughout the game with BC = 30: (a) balancing state at time t_i; (b) balancing state at time t_{i+1}.
5.5 Observation Parameters (State) and
Actions
The reinforcement learning task requires that we ab-
stract the environment into the state representation so
that the agent can make a decision regarding its actions. Figure 4 illustrates the learning loop of our
study game. Thus, as we tackle a fighting game, we
represent the observation of the current state of the
game using the following information:
Information related to the physical space: distance
between the learner agent and the player.
Information related to the balancing constant: the value of the difference between the health of the agent and the health of the player, and the type of reward the agent receives (subjugated or rebellious punishment, or conservation reward).
Information related to the actions: the last action
of the agent and the player, as well as a discrete
value that tells us if the player has changed action
since the last observation.
All of these values are normalized between -1 and
1, following a recommendation of the Unity plugin for better
performance.
Concerning the actions, both agents can Move
(Walk forward, Walk Backward), Duck and Attack
(Kick, Punch).
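As an illustration only (the toolkit collects observations through its own Agent API; the helper, constants, and field ordering here are ours), the observation vector described above could be assembled as follows, with every component normalized to the range [-1, 1]:

import numpy as np

# Hypothetical normalization constants (not from the toolkit).
MAX_DISTANCE = 10.0   # largest distance between the two fighters, in scene units
MAX_HEALTH = 100.0    # Skill_max: maximum health of each fighter
NUM_ACTIONS = 5       # walk forward, walk backward, duck, kick, punch

def build_observation(distance, agent_health, player_health,
                      last_agent_action, last_player_action,
                      player_changed_action, bc=30.0):
    # Assemble the state observation described in this section.
    delta_h = agent_health - player_health                   # Equation 4
    if delta_h < 0:
        reward_type = -1.0                                    # subjugated punishment
    elif delta_h <= bc:
        reward_type = 0.0                                     # conservation reward
    else:
        reward_type = 1.0                                     # rebellious punishment
    return np.array([
        np.clip(distance / MAX_DISTANCE, -1.0, 1.0),          # physical space
        np.clip(delta_h / MAX_HEALTH, -1.0, 1.0),             # balancing information
        reward_type,
        last_agent_action / (NUM_ACTIONS - 1),                # last actions (indices 0..4)
        last_player_action / (NUM_ACTIONS - 1),
        1.0 if player_changed_action else 0.0,                # action changed since last step
    ], dtype=np.float32)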
Figure 4: RL process focused on games: the agent receives state S_t and reward R_t from the environment, chooses action A_t, and the environment returns S_{t+1} and R_{t+1}.
5.6 Experimentation
The experimentation process was carried out using three values of BC (20, 30, and 40), which the agent uses as a limit of its ability. As we described earlier, we created two different types of emulated players to face our agent and allow him to train, aiming at obtaining:
Two different types of agents, namely an aggres-
sive and a defensive one, which fight and behave
within the boundary of the BC when confronted
with the same type of emulated player.
A third type of agent (Mixed Agent) that is trained
with the two different types of emulated players.
Thus, we would like to demonstrate that given the
experience of facing different types of players, the
agent can adapt to them when it is necessary.
We changed some default parameters of the PPO implementation to improve the training. The first one is the β parameter, responsible for defining how the agent explores its actions. The greater this value, the more randomly the agent chooses its actions and, therefore, the more it explores the action space during its training. This variable is directly related to the strength of the entropy regularization. We set this variable to 1e-5 so that our agent experiences new possibilities and applies the learned model to reach the goal. During the experiments, smaller values made the agent learn only a small number of actions that, while they helped him reach part of his goals, were very repetitive. The learning rate parameter was also adjusted with the same intent as the β parameter. We set this variable to 5e-4 so that the agent maintains an equilibrium between learning too fast and delaying too much by exploring new actions. The other parameters were kept at the default values of the plugin.
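For reference, the hyperparameter choices described above could be summarized as follows (a Python dictionary mirroring the PPO settings of the toolkit's trainer configuration; exact key names and default values depend on the ML-Agents version, and the entries other than beta and learning_rate are shown only as examples of parameters left at their defaults):

# Illustrative PPO settings; only beta and learning_rate were changed
# from the plugin defaults, as described in the text.
ppo_config = {
    "beta": 1e-5,           # entropy regularization strength (exploration)
    "learning_rate": 5e-4,  # gradient step size
    "epsilon": 0.2,         # PPO clipping coefficient (assumed default)
    "gamma": 0.99,          # discount factor (assumed default)
    "num_epoch": 3,         # optimization epochs per update (assumed default)
}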
We created 9 independent copies of the scene to accelerate the training, with all of them working together to find a better policy. All of them have the emulation of one type of player behavior, except for the mixed-agent training, which had 4 passive and 5 aggressive emulations. Figure 5 shows how each of them is independent from the others: some of them are in a balanced state (green platforms) while others have not yet reached the best policy.

Figure 5: Initial stage of the training process with 9 scenes in parallel.

Figure 6: Intermediate stage of the training process with 9 scenes at the same time.
5.7 Results
Based on our experiments, we start by indicating that after 500k steps the agent tends to continuously improve its policy, generating a greater reward, as shown in Figure 7, and many of the 9 scenarios manage to balance the game (Figure 6). In the case of the agents that learn from only one type of player, we observed that the agent who faces the aggressive emulation did not take long to learn the (defensive) policy and then maintained it until the end of training. Because the emulation is so aggressive, the agent learned that it is better to defend itself and maintain a minimum difference than to try to reach ΔSkill = BC or higher (i.e., to be more aggressive). The agent that faces the defensive emulation, unlike the other, tries to be more aggressive, and its training goes more slowly, since the agent must also learn how to attack (to reach the balancing state) and how to defend itself (to stay within the balancing state).
Figure 7 shows how the agent that learns with the
mixed simulated types of players has an intermedi-
ate reward gain between the other two cases. This is
likely due to the fact that he must learn how to behave when facing two different strategies of his opponents; consequently, he has more to learn. Regarding the balancing constant values, the results show that
when the BC is higher, the agent is able to maintain his skill difference with respect to the player within the balancing interval. He is also able to remain in the balanced state even when he suffers some damage from the opponent, because there is still some life difference between them.
Figure 7: Accumulated Reward throughout the training.
Figure 8: Evaluation on BC-Metric.
Conversely, it is quite hard to preserve the equilib-
rium when the BC is small because any wrong action
by the agent could take it to any of the unbalanced
states. For example, we can observe in Figure 8 that, when the agent has BC = 20 and needs to fight against an aggressive player, it spends a little less than 50% of the time in our target balanced state.
Finally, our BC-Metric showed three interesting points. First, the agent remains at least 50% of the time in the balancing state in all cases, except when BC = 20, as we described before. Second, we see some evidence confirming the hypothesis about a small value of BC, showing that in this case it is difficult to stay in the balanced state within a narrow range. Third, the results show that by offering new experiences through the mixed agent, the agent is able to make better decisions, because it learns a new type of behavior that one style alone cannot offer. In this case, the results exceed 50% on the BC-Metric, even with BC = 20, as exhibited in Figure 8.
6 CONCLUSIONS
Game balancing faces the complex challenge of
adapting the virtual characters to the same (or a
higher) level as the player. In this work, we con-
tributed with a deep reinforcement learning strategy
that aims at reaching such a balancing through the re-
ward function. To that, we designed a function that
makes use of a balancing constant to limit and mea-
sure the behavior of the agent. By training the agent
with such a function, we were able not only to limit the agent to behave like the player (when the BC is zero), but also to give him the possibility of stimulating himself to learn the game's actions, so that it remains a good adversary to the player. With the experimental results, we could observe that the agent stays as long as possible in a balanced state (at least 50% of the game). Furthermore, we could see that it can
learn new ways of acting when confronting different
styles of players.
As future work, we intend to deeply explore the
adaptability of the agent, by composing rewards that
point to more granular aspects such as resistance
and stamina, and gathering the results when it con-
fronts real players.
REFERENCES
Abadi, M. et al. (2016). Tensorflow: A system for large-
scale machine learning. In 12th USENIX Sympo-
sium on Operating Systems Design and Implementa-
tion (OSDI 16), pages 265–283.
Andrade, G., Ramalho, G., Gomes, A. S., and Corruble, V.
(2006). Dynamic game balancing: An evaluation of
user satisfaction. AIIDE, 6:3–8.
Andrade, G., Ramalho, G., Santana, H., and Corruble, V.
(2005). Extending reinforcement learning to provide
dynamic game balancing. In Proc. of the Workshop
on Reasoning, Representation, and Learning in Com-
puter Games, 19th Int. Joint Conference on Artificial
Intelligence (IJCAI), pages 7–12.
Bakkes, S., Tan, C. T., and Pisan, Y. (2012). Personalised
gaming: a motivation and overview of literature. In
Proc. of The 8th Australasian Conference on Interac-
tive Entertainment: Playing the System, page 4. ACM.
Bakkes, S., Whiteson, S., Li, G., Viniuc, G. V., Charitos, E.,
Heijne, N., and Swellengrebel, A. (2014). Challenge
balancing for personalised game spaces. In 2014 IEEE
Games Media Entertainment, pages 1–8.
Bakkes, S. C., Spronck, P. H., and Van Den Herik, H. J.
(2009). Opponent modelling for case-based adaptive
game AI. Entertainment Computing, 1(1):27–37.
Charles, D., Kerr, A., McNeill, M., McAlister, M., Black,
M., Kcklich, J., Moore, A., and Stringer, K. (2005).
Player-centred game design: Player modelling and
adaptive digital games. In Proc. of the digital games
research conference, volume 285, page 00100.
Csikszentmihalyi, M. and Csikszentmihalyi, I. S. (1992).
Optimal experience: Psychological studies of flow in
consciousness. Cambridge university press.
Hunicke, R. (2005). The case for dynamic difficulty adjust-
ment in games. In Proc. of the 2005 ACM SIGCHI In-
ternational Conference on Advances in computer en-
tertainment technology, pages 429–433. ACM.
Juliani, A., Berges, V.-P., Vckay, E., Gao, Y., Henry, H.,
Mattar, M., and Lange, D. (2018). Unity: A General
Platform for Intelligent Agents. ArXiv e-prints.
Lample, G. and Chaplot, D. S. (2017). Playing fps games
with deep reinforcement learning. In AAAI, pages
2140–2146.
Lopes, R., Eisemann, E., and Bidarra, R. (2018). Authoring
adaptive game world generation. IEEE Transactions
on Games, 10(1):42–55.
Michalski, R. S., Carbonell, J. G., and Mitchell, T. M.
(2013). Machine learning: An artificial intelligence
approach. Springer.
Missura, O. and Gärtner, T. (2009). Player modeling for
intelligent difficulty adjustment. In Int. Conference
on Discovery Science, pages 197–211. Springer.
Mitchell, T. (1997). Machine learning. McGraw-Hill, Boston, MA.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T.,
Harley, T., Silver, D., and Kavukcuoglu, K. (2016).
Asynchronous methods for deep reinforcement learn-
ing. In Int. Conference on Machine Learning, pages
1928–1937.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M.
(2013). Playing atari with deep reinforcement learn-
ing. arXiv preprint arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness,
J., Bellemare, M. G., Graves, A., Riedmiller, M., Fid-
jeland, A. K., Ostrovski, G., et al. (2015). Human-
level control through deep reinforcement learning.
Nature, 518(7540):529.
Olesen, J. K., Yannakakis, G. N., and Hallam, J. (2008).
Real-time challenge balance in an rts game using rt-
neat. 2008 IEEE Symposium On Computational Intel-
ligence and Games, pages 87–94.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal policy optimization al-
gorithms. arXiv preprint arXiv:1707.06347.
Silva, M. P., do Nascimento Silva, V., and Chaimowicz,
L. (2015). Dynamic difficulty adjustment through an
adaptive AI. In Computer Games and Digital Enter-
tainment (SBGames), 2015 14th Brazilian Symposium
on, pages 173–182. IEEE.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L.,
Van Den Driessche, G., Schrittwieser, J., Antonoglou,
I., Panneershelvam, V., Lanctot, M., et al. (2016).
Mastering the game of go with deep neural networks
and tree search. Nature, 529(7587):484.
Southey, F., Xiao, G., Holte, R. C., Trommelen, M., and
Buchanan, J. W. (2005). Semi-automated gameplay
analysis by machine learning. In AIIDE, pages 123–
128.
Spronck, P., Ponsen, M., Sprinkhuizen-Kuyper, I., and
Postma, E. (2006). Adaptive game AI with dynamic
scripting. Machine Learning, 63(3):217–248.
Sutton, R. S., Barto, A. G., Bach, F., et al. (1998). Rein-
forcement learning: An introduction. MIT press.
Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine
learning, 8(3-4):279–292.
White III, C. C. and White, D. J. (1989). Markov deci-
sion processes. European Journal of Operational Re-
search, 39(1):1–16.