For two players, Alice and Bob, in a Prisoner's Dilemma, the best possible outcome for Alice is to defect while Bob cooperates, in which case she receives the temptation payoff T. When both she and Bob cooperate, they both receive the reward R, which is only the second-best outcome for Alice. If both defect, both receive the punishment P, which is worse than the reward. The worst possible outcome for Alice occurs when she cooperates while Bob defects; in this case she receives the sucker's payoff S. From this we can see that a PD game requires T > R > P > S. An additional restriction, 2R > T + S, is often imposed when the PD game is played repeatedly; it ensures that mutual cooperation yields the highest total payoff for the two players. In this paper we use the same payoff parameters as in Axelrod's famous PD computer tournament (Antony, 1992): S = 0, R = 3, P = 1, T = 5. The main conflict addressed by a Prisoner's Dilemma game is that the best strategy for a selfish player is the worst strategy for the society that benefits from the total payoff of all players. Nowak et al. summarized five rules for the emergence of cooperation (Nowak, 2006): kin selection, direct reciprocity (Lindgren, 1994), indirect reciprocity, network reciprocity (Nowak, 1993) and group selection. In this paper it is direct reciprocity that influences the players. The same players play the PD game repeatedly and are given memory, which means they can remember a certain number of past PD games and their results. Each player also possesses a set of responses to every possible outcome of the remembered games; we call this set of responses a strategy. As a result, when a player defects, although he may gain a greater payoff in that round, his opponent may defect more in future rounds. This can result in a lower payoff for him, which discourages defection. In this paper the players are additionally given a certain patience, which determines how frequently a player changes his strategy as well as how much past payoff he considers at the time of the change. We also examine the effect of noise on the degree of cooperation of a cheater (who cannot adopt a strategy that maintains mutual cooperation).
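As a concrete illustration (a minimal Python sketch using our own naming, not code from the paper), the payoff values quoted above can be written down together with the two conditions that make the game a Prisoner's Dilemma:

# Illustrative sketch only: the Axelrod payoff values quoted above,
# together with the two Prisoner's Dilemma conditions.

T, R, P, S = 5, 3, 1, 0  # temptation, reward, punishment, sucker's payoff

# Payoff to a player for (own move, opponent's move).
PAYOFF = {
    ("C", "C"): R,
    ("C", "D"): S,
    ("D", "C"): T,
    ("D", "D"): P,
}

assert T > R > P > S     # ordering that defines a Prisoner's Dilemma
assert 2 * R > T + S     # mutual cooperation beats alternating exploitation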
2 METHOD
2.1 Partial Imitation of Players
A two-player PD game yields one of the four possible outcomes because each of the two independent players has two possible moves, cooperate (C) or defect (D). To an agent i, the outcome of playing a PD game with his opponent, agent j, can be represented by an ordered pair of responses S_i S_j. Here S_i can be either C for cooperate or D for defect. Thus, there are four possible histories for any one game between them: S_i S_j takes on one of these four outcomes (CC, CD, DC, DD). In general, for n games, there will be a total of 4^n possible scenarios. A particular pattern of these n games will be one of these 4^n scenarios, and can be described by an ordered sequence of the form S_{i1}S_{j1} ··· S_{in}S_{jn}. This particular ordered sequence of outcomes for these n games is called a history of games between these two players.
In a PD game with a fixed memory length m, the players have access to the outcomes of the past m games and use them to decide their next move. We use only players with one-step memory, mainly for the sake of simplicity. This does not contradict the concept of patience, which is measured by the number of games played before the player changes his strategy: the cumulative effect of the unsatisfactory performance of a strategy over n games can be accumulated from the payoff of each game, one game at a time, which requires only a simple registry for a player with one-step memory (this will be discussed in more detail in Section 2.3). Extending the model to players with longer memory would make the present problem much more complex and will be investigated in future work.
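To make this bookkeeping concrete, the sketch below (an illustration under our own naming, not code from the paper) shows that a one-step memory player only needs the outcome of the previous game plus a running payoff register; the full 4^n history never has to be stored.

from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class OneStepAgent:
    # Response to each possible outcome of the last game,
    # keyed by (own last move, opponent's last move).
    responses: Dict[Tuple[str, str], str]
    first_move: str = "C"
    last_outcome: Optional[Tuple[str, str]] = None
    accumulated_payoff: int = 0  # the simple payoff registry used for patience

    def next_move(self) -> str:
        if self.last_outcome is None:   # no history yet: play the first move
            return self.first_move
        return self.responses[self.last_outcome]

    def record(self, own: str, opp: str, payoff: int) -> None:
        # Update memory and the running payoff one game at a time.
        self.last_outcome = (own, opp)
        self.accumulated_payoff += payoff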
2.2 Player Types
The number of moves in a strategy, given that the agent can memorize the outcomes of the last m games, is ∑_{k=0}^{m} 4^k (Baek, 2008; Antony, 2013). In this paper we consider only one-step memory players (m = 1) for their simplicity; the one-step memory-encoding scheme is now described in detail. We allow our agents to play moves based on their own last move and the last move of their opponent. Thus we need four responses S_P, S_T, S_S and S_R for the DD, DC, CD and CC histories of the last game. The agents also need to know how to start playing if there is no history, so we add an additional first move S_0. This adds up to a total of 5 moves for a one-step memory strategy. A one-step memory strategy is then denoted as S_0|S_P S_T S_S S_R, where S_0 is the first move. There are 2^5 = 32 possible strategies. In this paper, we consider a random player, whose every move in a PD game is completely random, and players with one-step memory. A one-step memory player can use any of the 32 one-step strategies, which can be classified into types as shown in Table 1.
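As an illustration of this encoding (a sketch under our own naming; the classification of Table 1 is not reproduced here), the 32 one-step strategies can be enumerated directly, and familiar strategies such as Tit-for-Tat appear among them:

from itertools import product

# Enumerate all strategies S0|SP ST SS SR, where the four responses correspond
# to the DD, DC, CD and CC outcomes of the last game, in that order.
strategies = [f"{s0}|{sp}{st}{ss}{sr}"
              for s0, sp, st, ss, sr in product("CD", repeat=5)]
assert len(strategies) == 32

# Tit-for-Tat: start with C, then repeat the opponent's last move
# (D after DD, C after DC, D after CD, C after CC).
assert "C|DCDC" in strategies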
We define "nice" players as those who start the game with C and who, when both players cooperated in the last game, also play C. "Cheaters" are players who start with D and who, when both players cooperated in the last game, play D. Note