Table 1: An example game in which the attacker moves to CPU2 and subsequently to the DATA node. The defender performs
both of its defense actions on CPU1, which the attacker does not target, so in this example game the attacker wins after two moves.
Gamestep | Attacker Action     | Defender Action      | Node  | Attack Values                  | Defense Values
0        | -                   | -                    | START | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
         |                     |                      | CPU1  | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [0, 2, 2, 2, 2, 2, 2, 2, 2, 2]
         |                     |                      | CPU2  | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [2, 0, 2, 2, 2, 2, 2, 2, 2, 2]
         |                     |                      | DATA  | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [2, 2, 0, 2, 2, 2, 2, 2, 2, 2]
1        | CPU2, Attack Type 2 | CPU1, Defense Type 1 | START | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
         |                     |                      | CPU1  | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [1, 2, 2, 2, 2, 2, 2, 2, 2, 2]
         |                     |                      | CPU2  | [0, 1, 0, 0, 0, 0, 0, 0, 0, 0] | [2, 0, 2, 2, 2, 2, 2, 2, 2, 2]
         |                     |                      | DATA  | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [2, 2, 0, 2, 2, 2, 2, 2, 2, 2]
2        | DATA, Attack Type 3 | CPU1, Defense Type 1 | START | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
         |                     |                      | CPU1  | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
         |                     |                      | CPU2  | [0, 1, 0, 0, 0, 0, 0, 0, 0, 0] | [2, 0, 2, 2, 2, 2, 2, 2, 2, 2]
         |                     |                      | DATA  | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] | [2, 2, 0, 2, 2, 2, 2, 2, 2, 2]
an action out of a set of possible actions. Agents
base their decision on the current state of the network.
States are defined differently for the two agents, because
the complete state of the network is only partially observable
to each agent, and they have access to different
kinds of information about the network. The agents
also have different actions to choose from.
The Attacker. States are defined as s(n), where n
is the node the agent is currently in. An action is defined
as A(n′, a), where n′ is one of the neighbouring
nodes of node n that is chosen to be attacked and a
an attack type as before. In each state, the number of
possible actions is 10 * the number of neighbours of n,
minus the actions that increment any attack value that
is already maximum (has the value 10).
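To make the size of this action set concrete, the following is a minimal sketch of how the attacker's legal actions could be enumerated. It is not the authors' implementation; the names neighbours and attack_values, and the representation of the network as plain Python dictionaries, are assumptions made for illustration.

    # Hypothetical enumeration of the attacker's action set A(n', a).
    # Assumes: neighbours maps each node to a list of neighbouring nodes,
    # and attack_values[node] is that node's list of 10 attack values.

    MAX_VALUE = 10          # an attack value cannot be raised beyond 10
    NUM_ATTACK_TYPES = 10   # attack types indexed 0..9

    def attacker_actions(current_node, neighbours, attack_values):
        """Return all legal (neighbour, attack_type) pairs from current_node."""
        actions = []
        for n_prime in neighbours[current_node]:
            for a in range(NUM_ATTACK_TYPES):
                # skip attack types whose value on this neighbour is already maximal
                if attack_values[n_prime][a] < MAX_VALUE:
                    actions.append((n_prime, a))
        return actions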
The Defender. For a defender agent, the state is
defined as s(n_1, n_2, ..., n_i), where each n_i is the i-th
node in the network. An action for a defender is defined
as A(n, a), where n is the node in the network
that is chosen to invest in, and a ∈ {d_1, d_2, ..., d_10, det},
i.e. one of the defense values or the detection value. In each state,
the number of possible actions is 11 * the number of
nodes, minus the actions that increment the defense
value of an attack type or a detection value that is
already maximum (has the value 10).
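Analogously, a minimal sketch of how the defender's legal actions could be enumerated is given below. It is illustrative only; the names nodes, defense_values and detection_values are assumptions rather than the authors' code.

    # Hypothetical enumeration of the defender's action set A(n, a).
    # Assumes: defense_values[node] is that node's list of 10 defense values
    # and detection_values[node] is its single detection value.

    MAX_VALUE = 10            # defense and detection values are capped at 10
    NUM_DEFENSE_TYPES = 10    # defense types d_1..d_10; index 10 stands for det

    def defender_actions(nodes, defense_values, detection_values):
        """Return all legal (node, investment) pairs; investments 0..9 are the
        defense types d_1..d_10 and investment 10 is the detection value det."""
        actions = []
        for node in nodes:
            for d in range(NUM_DEFENSE_TYPES):
                if defense_values[node][d] < MAX_VALUE:
                    actions.append((node, d))
            if detection_values[node] < MAX_VALUE:
                actions.append((node, NUM_DEFENSE_TYPES))  # invest in detection
        return actions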
The Monte-Carlo and Q-learning agents do not
have an internal state representation, but they base
their actions on previous success, regardless of the en-
vironmental state. The neural and linear networks use
the entire observable environmental state as an input.
3.1 Monte Carlo Learning
In the first reinforcement learning technique, agents
learn using Monte Carlo learning (Sutton and Barto,
1998). The agents have a table with possible state-
action pairs, along with estimated reward values.
After each game the agents update the estimated
reward values of the state-action pairs that were se-
lected during the game. Monte Carlo learning up-
dates each state the agent visited with the same reward
value, using:
Q_{t+1}(s, a) = Q_t(s, a) + α ∗ (R − Q_t(s, a))
where α is the learning rate, a parameter ranging
from 0 to 1 that represents how much the agent
should learn from a new observation, and R is the reward
obtained at the end of each game. The Q-values
are the estimated reward values: they represent how
much reward the agent expects to obtain after performing
an action. The s is the current state of the world and a
is a possible action in that state (see also the
start of this section). For the attacker, the state is the
node it is currently in. For the defender, the state s used
in the learning algorithm always has the value 0, because
with a tabular approach it is infeasible to store
all possible defender states; although the state value is
always 0, the environmental state still determines which
actions can be selected.
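As an illustration of this tabular update, the sketch below applies the same end-of-game reward R to every state-action pair visited during one game. It is a simplified reading of the description above, not the authors' code; the dictionary-based Q-table and the example episode are assumptions.

    # Hypothetical tabular Monte Carlo update. Q maps (state, action) pairs to
    # estimated reward values; for the defender the state is always 0.

    def monte_carlo_update(Q, visited_pairs, reward, alpha=0.1):
        """Update every (s, a) pair visited in one game with the same reward R,
        using Q(s, a) <- Q(s, a) + alpha * (R - Q(s, a))."""
        for s, a in visited_pairs:
            old = Q.get((s, a), 0.0)
            Q[(s, a)] = old + alpha * (reward - old)

    # Example with an arbitrary reward of 1 for a game the attacker won,
    # following the episode of Table 1:
    Q = {}
    attacker_episode = [('START', ('CPU2', 2)), ('CPU2', ('DATA', 3))]
    monte_carlo_update(Q, attacker_episode, reward=1)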
A reinforcement learning agent faces a dilemma
between choosing the action that is currently considered best
(exploitation) and choosing some other action to
see whether that action is better (exploration). For Monte
Carlo learning, four different exploration algorithms
that deal with this trade-off are implemented
in the cyber security game. The four algorithms
are ε-greedy, Softmax, Upper Confidence Bound 1
(Auer et al., 2002), and Discounted Upper Confidence
Bound (Garivier and Moulines, 2008), which we
now briefly describe.
ε-greedy Strategy. The first method is the ε-greedy
exploration strategy. It selects the
best action with probability 1 − ε, and otherwise
selects a random action from the set of possible
actions. Here, ε is a value between 0 and 1 that determines
the amount of exploration.
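A minimal sketch of this selection rule, assuming the dictionary-based Q-table from the Monte Carlo sketch above, could look as follows.

    import random

    # Hypothetical epsilon-greedy selection over the current set of legal actions.
    def epsilon_greedy(Q, state, possible_actions, epsilon=0.1):
        """With probability epsilon select a random action (exploration);
        otherwise select the action with the highest Q-value (exploitation)."""
        if random.random() < epsilon:
            return random.choice(possible_actions)
        return max(possible_actions, key=lambda a: Q.get((state, a), 0.0))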
Softmax. The second exploration strategy is Soft-
max. This strategy gives every action in the set of