
without the representation function (No Emb), TiD, CTD, TFT, and Ada in Table 1. As listed in Table 1, the RL agents earned higher scores than the heuristic agents (TiD, CTD, TFT, Ada) regardless of whether the representation function was used. This indicates that, under the formulation proposed in this study, RL agents can learn effective strategies with or without the representation function.
As shown in Table 1, the agent trained for 30000 simulations without the representation function earned a higher score than the agent trained for 30000 simulations with it. We consider the reasons as follows. Because the representation function is trained with discriminative representation as one of its objectives, it outputs a distinct vector for each opponent agent, and this vector allows the RL agent to distinguish between the strategies of its opponents during learning. However, because the input to the representation function is the state of the negotiation with each opponent, the information carried by its output can be regarded as latent information already contained in the state. We therefore consider that the agent trained for 30000 simulations without the representation function achieved a high final score because it learned to distinguish the opponents' strategies directly from the state. Conversely, appending the embeddings output by the representation function to the state enlarges the state space, which made it harder to obtain a very good strategy with the representation function.
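To make the state-space argument above concrete, the following minimal sketch (Python with NumPy) illustrates how concatenating an opponent embedding to the negotiation state enlarges the policy input. The dimensions, the names representation_fn and build_policy_input, and the random placeholder weights are assumptions for illustration only, not the implementation used in the experiments.

```python
import numpy as np

STATE_DIM = 16   # assumed dimensionality of the raw negotiation state
EMB_DIM = 8      # assumed dimensionality of the opponent embedding

def representation_fn(negotiation_state: np.ndarray) -> np.ndarray:
    """Stand-in for the learned representation function: maps the state of
    the negotiation with one opponent to a discriminative embedding vector."""
    rng = np.random.default_rng(0)                 # fixed placeholder weights
    W = rng.standard_normal((EMB_DIM, STATE_DIM))
    return np.tanh(W @ negotiation_state)

def build_policy_input(negotiation_state: np.ndarray,
                       use_embedding: bool) -> np.ndarray:
    """With the representation function ("Emb"), the embedding is concatenated
    to the state, enlarging the policy input from STATE_DIM to
    STATE_DIM + EMB_DIM dimensions; without it ("No Emb"), the raw state is used."""
    if use_embedding:
        return np.concatenate([negotiation_state,
                               representation_fn(negotiation_state)])
    return negotiation_state

state = np.zeros(STATE_DIM)
print(build_policy_input(state, use_embedding=False).shape)  # -> (16,)
print(build_policy_input(state, use_embedding=True).shape)   # -> (24,)
```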
Evaluation by Agreement Rate
We show the mean values of the agreement rates of
RL agents and heuristic agents for each opponent
agent in Table 2.
As listed in Table 2, the agent trained without the representation function had a lower agreement rate with CTD than the agent trained with the representation function, TiD, CTD, and Ada. In contrast, the agent trained with the representation function had a higher agreement rate with CTD than the other agents, regardless of the number of simulations. One reason the agent trained without the representation function had a lower agreement rate with CTD is that it acquired an overly aggressive strategy. It learns to earn a high utility in negotiations with all of the agents, including TiD, against which the maximum utility obtainable upon agreement is higher than against CTD; as a result, it frequently proposes bids whose utility for itself is higher than CTD can accept. Conversely, the agent trained with the representation function had a high agreement rate with CTD because the embedding vector output by the representation function enabled it to learn to distinguish CTD from the other agents, including TiD.
As shown in Table 2, the agent trained for 30000 simulations with the representation function had a lower agreement rate with TFT than the agent trained for 30000 simulations without it. One reason for this is that the states reached in negotiations with TFT differ significantly from those reached in negotiations with the other agents. If the RL agent can distinguish TFT from the other agents from the state alone, the embedding vector output by the representation function is redundant and hinders learning.
Evaluation by Agreement Utility
We show the mean values of the agreement utilities
of RL agents and heuristic agents for each opponent
agent in Table 3.
As listed in Table 3, regardless of whether the representation function was used, the agent trained for 30000 simulations earned a higher agreement utility than the agent trained for 3000 simulations. This suggests that an agent first learns to increase the agreement rate in the early stage of learning and then learns to increase the agreement utility.
As shown in Table 3, the agent trained with the representation function earned a higher agreement utility in negotiations with CTD and Ada, whereas the agent trained without the representation function earned a higher agreement utility in negotiations with TiD and TFT. The state transitions in negotiations with CTD and Ada are similar to those with TiD; however, CTD differs from TiD in the utility value it accepts at the end of a negotiation, and Ada differs from TiD in that it becomes more aggressive when its opponent concedes. Therefore, by introducing the representation function and learning in a more discriminative manner, the agent obtained a higher agreement utility against CTD and Ada. Conversely, TiD is a simple strategy, so effective strategies against it are relatively easy to learn, and the state transitions in negotiations with TFT differ significantly from those with the other agents. Consequently, in negotiations with TiD and TFT, the agent trained without the representation function obtained a higher agreement utility because the embedding vector was redundant.
Evaluation of Robustness of Embeddings
We show in Table 4 the mean values of the IICR of the embedding vectors output by the representation function trained for 30000 simulations, for the cases in which more than one type of agent is included among the opponents. The IICR is smaller if the opponents include TFT and larger