2.1 Game Theory
As humans, we are always making decisions about what to do next to achieve our desired purposes or goals. In social environments, every decision is affected by the decisions of other people. Game theory (Okada, 2011) mathematically analyzes the relationships among such decisions.
A game in game theory consists of the following
four elements (Okada, 2011):
1. Rules that govern the game;
2. Players who decide what to do;
3. Action strategies of the players; and
4. Payoffs given to the players as a result of their
decisions.
Game theory analyzes how players behave in an
environment in which their actions mutually influence
one another. We focus on two-person simultaneous
games in this study.
In a two-person simultaneous game, two players simultaneously choose actions based on their given strategies. After both players choose their respective actions, each player receives a payoff determined by the joint action of both players. Since each player's payoff is determined not only by his or her own action but also by the other player's action, a player must take the other player's action into account to maximize his or her payoff.
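As a minimal sketch of this setup (our own illustration, not taken from the cited work), a two-person simultaneous game can be represented as a table that maps each joint action to a pair of payoffs; the names and types below are hypothetical.

from typing import Dict, Tuple

# Joint action (Player 1's action, Player 2's action) ->
# (Player 1's payoff, Player 2's payoff).
PayoffTable = Dict[Tuple[str, str], Tuple[float, float]]

def play_round(action1: str, action2: str, payoffs: PayoffTable) -> Tuple[float, float]:
    """Both players act simultaneously; each player's payoff depends on the joint action."""
    return payoffs[(action1, action2)]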
Note that all games used in this research are non-cooperative games, in which each player chooses an individual strategy based on his or her own payoffs and the actions of all players. More specifically, after a player observes his or her payoffs and the other player's action, he or she can then choose his or her own strategy.
A Nash equilibrium is defined as a combination of actions in which no player is motivated to change his or her strategy. Let us consider the "prisoner's dilemma" game summarized in Table 1. Here, the rows and columns correspond to the actions of Players 1 and 2, respectively; for each joint action in the matrix, they receive the left and right payoffs, respectively.
According to the payoff matrix, a player should choose Defection regardless of the other player's action, because it always yields a higher payoff than Cooperation. Since the other player reasons in the same way, both players choose Defection, and the combination of actions (Defection, Defection) thus becomes a Nash equilibrium.
Conversely, if both players select Cooperation, both payoffs rise from 0.2 to 0.6; however, it is very difficult for both players to choose Cooperation, because the combination (Cooperation, Cooperation) is not an equilibrium and each player is motivated to switch to Defection. Moreover, even if a player overcomes this motive for some reason, he or she will receive a payoff of zero if the partner chooses Defection. The prisoner's dilemma game shows that individual rationality can differ from social rationality in a social situation.
Table 1: An example of payoffs in the prisoner's dilemma game. Rows are Player 1's actions and columns are Player 2's actions; each cell lists Player 1's and Player 2's payoffs, respectively.

                Cooperation    Defection
Cooperation     0.6, 0.6       0.0, 1.0
Defection       1.0, 0.0       0.2, 0.2
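The dominant-strategy argument above can be checked mechanically. The following sketch (our own illustration, using only the payoffs in Table 1) tests each joint action for whether either player could gain by unilaterally deviating, which is the defining condition of a Nash equilibrium; it confirms that (Defection, Defection) is the only equilibrium.

ACTIONS = ["Cooperation", "Defection"]

# Payoffs from Table 1: (Player 1's payoff, Player 2's payoff).
PAYOFFS = {
    ("Cooperation", "Cooperation"): (0.6, 0.6),
    ("Cooperation", "Defection"):   (0.0, 1.0),
    ("Defection",   "Cooperation"): (1.0, 0.0),
    ("Defection",   "Defection"):   (0.2, 0.2),
}

def is_nash_equilibrium(a1: str, a2: str) -> bool:
    """True if neither player gains by unilaterally changing his or her action."""
    p1, p2 = PAYOFFS[(a1, a2)]
    best_deviation_1 = max(PAYOFFS[(alt, a2)][0] for alt in ACTIONS)
    best_deviation_2 = max(PAYOFFS[(a1, alt)][1] for alt in ACTIONS)
    return p1 >= best_deviation_1 and p2 >= best_deviation_2

for a1 in ACTIONS:
    for a2 in ACTIONS:
        if is_nash_equilibrium(a1, a2):
            print(a1, a2)  # prints only: Defection Defection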
2.2 Reinforcement Learning
Reinforcement learning (Sutton and Barto, 1998) is a method for learning strategies by interacting with a given environment. An agent is defined as a decision-making entity, while the environment is everything external to the agent with which the agent interacts. The agent interacts with the environment at discrete time steps, i.e., t = 0, 1, 2, 3, ....
At each time step t, the agent recognizes the current state s_t ∈ S of the environment, where S is the set of possible states, and decides on an action a_t ∈ A(s_t) based on the current state, where A(s_t) is the set of actions selectable in state s_t. At the next step, the agent receives a reward r_{t+1} ∈ ℜ as a result of the action and transitions to a new state s_{t+1}. The probability that the agent chooses possible action a in state s is given by the strategy π_t(s, a). Reinforcement learning algorithms update the strategy π_t, or the action values (rather than the strategy directly), at each time step, choosing actions based on the strategy.
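As one concrete, purely illustrative instance of this interaction loop, the sketch below maintains a table of action values and updates it from the reward at each step; the epsilon-greedy action choice and the Q-learning-style update rule are our own assumptions and are not prescribed by the cited work.

import random
from collections import defaultdict

# Illustrative sketch of the agent-environment loop described above.
# Hyperparameters and the update rule are assumptions made for brevity.
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

Q = defaultdict(float)  # action value Q(s, a), initialized to 0

def choose_action(state, actions):
    """Pick an action from A(s_t): mostly greedy, occasionally exploratory."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, next_actions):
    """Move Q(s_t, a_t) toward the reward plus the discounted value of the next state."""
    best_next = max(Q[(next_state, a)] for a in next_actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])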
2.3 Three Foundational Learning Algorithms
Here we introduce three learning algorithms that form
the basis for our proposal.
2.3.1 M-Qubed
M-Qubed (Crandall and Goodrich, 2011) is an excellent state-of-the-art reinforcement learning algorithm consisting of three strategies; it can learn to cooperate with associates (i.e., other players) and avoid being exploited unilaterally in various games. M-Qubed uses Sarsa (Rummery and Niranjan, 1994) to learn the action value function Q(s,a) (called the Q-value), which represents the value of action a in state s. Here, a state is defined as the latest joint action of the agent and its associates. Further, Q(s,a) is updated by the following