prediction module, and then a category $n'_J$ whose choice intensity is the largest is determined. Next, we use the weighting vector $W'^{m}_{J}$ between $n'_J$ and MF as the prediction vector $P$ for the input to the prediction field of FALCON AP. Then the category $n_J$ whose choice intensity is the largest in CF of FALCON AP is chosen as the winning category. An action is chosen according to the weighting vector $W^{m}_{J}$; action $k$ whose weighting value is the largest, i.e. $\arg\max_k w^{m}_{k,J}$ in $W^{m}_{J}$, is usually chosen.
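As an illustration, the following Python sketch summarizes this two-stage selection. The fuzzy-ART choice function, the way the PF and SF intensities are combined, and all function and variable names are assumptions made for illustration; only the final argmax over $W^{m}_{J}$ follows directly from the description above.

import numpy as np

def choice_intensity(x, w, alpha=0.01):
    # Fuzzy-ART style choice function T_j = |x AND w| / (alpha + |w|) (assumed form).
    return np.sum(np.minimum(x, w)) / (alpha + np.sum(w))

def select_action(percept_S, pred_weights_SF, pred_weights_MF,
                  ap_weights_PF, ap_weights_SF, ap_weights_MF):
    # 1) Action prediction module: pick category n'_J with the largest
    #    choice intensity for the percept vector S.
    t = [choice_intensity(percept_S, w) for w in pred_weights_SF]
    j_pred = int(np.argmax(t))
    # 2) Its MF weighting vector W'^m_J becomes the prediction vector P.
    P = pred_weights_MF[j_pred]
    # 3) FALCON AP: pick the winning category n_J in CF using both P (PF)
    #    and S (SF); summing the two intensities here is an assumption.
    t = [choice_intensity(P, wp) + choice_intensity(percept_S, ws)
         for wp, ws in zip(ap_weights_PF, ap_weights_SF)]
    j = int(np.argmax(t))
    # 4) Choose action k = argmax_k w^m_{k,J} in W^m_J.
    return int(np.argmax(ap_weights_MF[j]))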
In the action learning phase, either reinforcement or reset of the relations among predictions, percepts, actions, and rewards is performed, depending on the reward the learning agent obtained. When the learning agent receives a positive reward, the prediction vector $P = (p_1, \ldots, p_O)$ obtained from the action prediction module is input to PF, the percept vector $S = (s_1, \ldots, s_M)$ obtained from the sensors is input to SF, the action vector $A$ that indicates the action the agent chose is input to MF, and the reward vector $R = (1, 0)$ is input to FF. Then the weighting vectors between the winning category $n_J$ and each of the vectors in PF, SF, MF, and FF are updated. In the action learning phase of the action prediction module shown in Figure 2, the percept vector is input to SF, the action vector $A$ that indicates the action the agent chose is input to MF, and the weighting vectors between $n'_J$ and each of the vectors in SF and MF are updated. When the learning agent receives a negative reward, the weighting vectors are updated to weaken the relations among the input vectors.
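A minimal sketch of this update is given below, assuming the standard fuzzy-ART template learning rule with learning rate beta; the exact update rule and the form of the weakening step are assumptions, since the text only states that relations are reinforced for a positive reward and weakened for a negative one.

import numpy as np

def update_weights(weights_J, inputs, reward_positive, beta=0.5):
    # weights_J: dict field -> weighting vector of the winning category n_J
    # inputs:    dict field -> input vector (P for PF, S for SF, A for MF, R for FF)
    for field, x in inputs.items():
        w = weights_J[field]
        if reward_positive:
            # Reinforce: move the template toward the fuzzy AND of input and template.
            weights_J[field] = beta * np.minimum(x, w) + (1.0 - beta) * w
        else:
            # Weaken the relation between the category and this input (assumed decay).
            weights_J[field] = (1.0 - beta) * w
    return weights_J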
2.2 FALCON ER
We propose another extended version of FALCON, which we call FALCON ER (FALCON considering the Expected Reward). FALCON ER predicts the other agents' behavior for each action the learning agent can take using its action prediction module, and determines the learning agent's action according to the expected reward calculated from that prediction. For example, assume that the actions the learning agent can take are $a_1$ and $a_2$. Also assume that there are two other agents, and that agent 1 and agent 2 choose and carry out their actions according to the action the learning agent carried out. The expected reward is then calculated as follows. FALCON ER first predicts the action $p_{1,1}$ of agent 1 when the learning agent chooses action $a_1$, and then predicts the action $p_{2,1}$ of agent 2 after actions $a_1$ and $p_{1,1}$ are taken. Next, it calculates the expected reward $r_1$ that the learning agent receives after actions $a_1$, $p_{1,1}$, and $p_{2,1}$ are taken. The expected reward $r_2$ for action $a_2$ is calculated in the same manner, by predicting the actions of agents 1 and 2.
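The following sketch illustrates how the action with the largest expected reward would be selected in this two-opponent example. The helpers predict_action and expected_reward are hypothetical stand-ins for the action prediction module and the reward estimate.

def select_action_er(state, candidate_actions, predict_action, expected_reward):
    best_action, best_r = None, float("-inf")
    for a in candidate_actions:                  # e.g. a_1, a_2
        p1 = predict_action(state, [a])          # predicted action of agent 1
        p2 = predict_action(state, [a, p1])      # predicted action of agent 2
        r = expected_reward(state, [a, p1, p2])  # expected reward r_i
        if r > best_r:
            best_action, best_r = a, r
    return best_action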
FALCON ER predicts the other agents' behavior from the moment the learning agent chooses an action to the moment it receives a reward, and determines the learning agent's action based on the expected reward. In the experiments, we use the card game Hearts for performance evaluation. When we apply FALCON ER to Hearts, it predicts the cards the other agents play until one trick ends and then calculates the expected penalty points obtained in that trick.
3 EXPERIMENTS
We employ the card game Hearts for performance evaluation. In the experiments, our learning agents play the game against rule-based agents. We compare the performance of FALCON, FALCON AP, and FALCON ER. Based on feature extraction by heuristics (Fujita, 2004), we determine the percept vector $S$ for FALCON, FALCON AP, and FALCON ER. For the experiments, we implement a rule-based agent and use it as the players opposing the learning agent. The rule-based agent determines its actions with rules extracted from gnome-hearts (Hearts, 2012).
3.1 Hearts
The number of players in Hearts is normally four. Hearts uses a standard deck of 52 playing cards. The highest card of the suit led wins the trick; the strength of the cards, in descending order, is A, K, Q, ..., 4, 3, and 2. There is no superiority or inferiority among suits. Each player is dealt 13 cards and must play a card from his hand at his turn. Playing a card in clockwise order, starting from one player, until all four players have played is called a trick. One game is completed after 13 successive tricks. In each trick, the card played by the first player is called the leading card, and that player is called the dealer. The objective of Hearts is to obtain the fewest penalty points by the completion of the game. The penalty points of the cards are as follows: Q♠ = 13 points, and every card of suit ♥ = 1 point.
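For illustration, this scoring rule can be sketched as follows; representing cards as (rank, suit) tuples is an assumption made only for this example.

def trick_penalty(cards_in_trick):
    # Q of spades = 13 penalty points, each heart = 1 penalty point.
    points = 0
    for rank, suit in cards_in_trick:
        if suit == "hearts":
            points += 1
        elif suit == "spades" and rank == "Q":
            points += 13
    return points

# Example: a trick containing Q of spades and two hearts yields 15 penalty points.
assert trick_penalty([("Q", "spades"), ("5", "hearts"),
                      ("A", "hearts"), ("3", "clubs")]) == 15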
3.2 Experimental Results
In this subsection, we show experimental results for the game Hearts. The maximum number of categories of FALCON, FALCON AP, and FALCON ER is limited to 1000. Their parameter values are chosen by preliminary experiments. In the following figures, we use the average penalty ratio obtained through 1000