
agent learns how to score goals such as passing and
shooting through interaction with the environment.
The pitch spans −1 to +1 on the x axis and −0.42 to +0.42 on the y axis, and the goal spans −0.044 to +0.044 on the y axis. In this study, to simplify the experiment, we trained only the offense, with a scenario of four offensive players and eight defensive players. For the defense, we used a bot that follows the ball and moves back toward its own side, using a rule-based approach provided
by GRF. The initial positions of the agents in each
episode were set to the positions of all the players
when they received an assist pass in the shooting se-
quence of the real data. The actions taken by the
agents were limited to 12 types: moving in 8 direc-
tions in 45-degree increments, doing nothing, high
pass, short pass, and shoot. The states observed by
the agents were the x and y coordinates of each player (44 dimensions), the position coordinates of the ball, a 3-dimensional one-hot vector indicating which team holds the ball (left team, right team, or no team in possession), and an 11-dimensional one-hot vector for the player IDs (11 players per team), for a total of 61 dimensions. The episode ended when the
conditions for switching between offense and defense
were met, such as when a goal was scored, a goal was
conceded, or the ball was lost, or when 100 steps had
elapsed. For the weight of the shoot reward, we divided half of the pitch into five sections along the x axis and eight sections along the y axis, and used as the weight for each grid cell the percentage of shots taken from that cell in the goal sequences of the real data.
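To make this grid-based weighting concrete, the sketch below shows one way such weights could be computed from the shot positions in the goal sequences; the 5 × 8 grid follows the text above, while the coordinate clipping, function name, and data layout are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Pitch coordinates follow the GRF convention described above:
# x in [-1, 1], y in [-0.42, 0.42]; we grid the attacking half (x >= 0),
# which is an assumption of this sketch.
N_X, N_Y = 5, 8  # 5 sections along x, 8 sections along y

def shot_weight_grid(shot_positions):
    """Estimate per-cell shoot-reward weights from real shot positions.

    shot_positions: array-like of shape (n_shots, 2) with the (x, y) of each
    shot taken in the goal sequences. Returns an (N_X, N_Y) array whose
    entries are the fraction of shots taken from each cell.
    """
    pos = np.asarray(shot_positions, dtype=float)
    x = np.clip(pos[:, 0], 0.0, 1.0)
    y = np.clip(pos[:, 1], -0.42, 0.42)
    xi = np.minimum((x * N_X).astype(int), N_X - 1)            # x-axis cell index
    yi = np.minimum(((y + 0.42) / 0.84 * N_Y).astype(int), N_Y - 1)  # y-axis cell index
    counts = np.zeros((N_X, N_Y))
    np.add.at(counts, (xi, yi), 1)                             # count shots per cell
    return counts / counts.sum()                               # percentage per cell
```

The shoot reward for an agent can then be scaled by the weight of the cell in which the shot is taken.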
4.1.2 Dataset
In this study, we used event data and tracking data
from 95 J1 League games in 2021 and 2022 provided
by the 2023 Sports Data Science Competition. From
this dataset, we extracted 1,520 sequences from the
time an assist pass was made to the time a shot was
taken, and divided them into 141 goal sequences and
1,379 shot sequences. As a preprocessing step, we ex-
tracted only sequences in which there were 22 players
and a ball on the pitch, and downsampled them from
25 Hz to 8.33 Hz. We used these sequences for in-
verse reinforcement learning (proposed method), pre-
training (comparison method (Fujii et al., 2023)), and
reinforcement learning. For inverse reinforcement
learning (proposed method) and pre-training (com-
parison method), the data was divided into 102 train-
ing sequences and 39 validation sequences for the
goal sequences. For reinforcement learning, the initial position data of the shot sequences, i.e., the positions of all players at the moment the assist pass is received, was divided into 1,179 training initial positions and 100 test initial positions. In addition, 100 shot se-
quences were used as the teacher data to be stored in
the buffer during reinforcement learning in the com-
parative method. The four offensive players selected as the experimental subjects were the player who made the assist pass, the player who took the shot, and, of the remaining players, the two closest to the goal. The eight defensive players were the seven players closest to the goal and the goalkeeper.
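As a concrete illustration of this selection rule, the sketch below picks the controlled offensive and defensive players from the tracked positions at the moment of the assist pass; the array layout, goal coordinates, and function name are assumptions for illustration, not the competition data format.

```python
import numpy as np

GOAL = np.array([1.0, 0.0])  # centre of the attacked goal in GRF coordinates (assumed)

def select_players(off_xy, def_xy, assister_idx, shooter_idx, gk_idx):
    """Pick 4 offensive and 8 defensive players for the scenario (sketch).

    off_xy, def_xy: (11, 2) arrays of player positions at the assist pass.
    Returns (offense indices, defense indices).
    """
    # Offense: assister, shooter, and the two remaining players closest to the goal.
    remaining = [i for i in range(len(off_xy)) if i not in (assister_idx, shooter_idx)]
    dists = np.linalg.norm(off_xy[remaining] - GOAL, axis=1)
    closest_two = [remaining[i] for i in np.argsort(dists)[:2]]
    offense = [assister_idx, shooter_idx] + closest_two

    # Defense: the goalkeeper plus the seven outfield players closest to the goal.
    outfield = [i for i in range(len(def_xy)) if i != gk_idx]
    d_dists = np.linalg.norm(def_xy[outfield] - GOAL, axis=1)
    closest_seven = [outfield[i] for i in np.argsort(d_dists)[:7]]
    defense = [gk_idx] + closest_seven
    return offense, defense
```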
4.1.3 Learning Parameters
We implemented the proposed method using DDQN
for the reinforcement learning component and DQN
for the inverse reinforcement learning component.
This method, referred to as DDQN IRL, was then com-
pared with both DDQN and DQAAS (Fujii et al.,
2023). Table 2 shows the basic learning parameters
for DQN and DDQN, such as batch size and dis-
count rate. We set common values for these param-
eters in both methods. Additionally, the number of
update steps for the Q-Network, a parameter specific
to DDQN, was set to 10,000. Moreover, we used Pri-
oritized Experience Replay as a sampling method for
experience data from the buffer, as in Fujii et al. (Fu-
jii et al., 2023). In the loss function of DQAAS, $\lambda_1$, the weight for the supervised learning term, was set to 0.04, while $\lambda_2$, the weight for the regularization
term, was set to 1.00. The convergence threshold for
inverse reinforcement learning in DDQN IRL, ε, was
set to 0.1. In addition, for the parameters of the state
feature, we used four different values for r, the ra-
dius used to calculate the proportion of players: 0.1,
0.2, 0.3, and 0.4. For $d^{\alpha}_n$ in Equation (1), which rep-
resents the distance to the n-th closest offense or de-
fense player, or the distance to the keeper, we adopted
the following values: in ascending order of relative
distance, 0.16, 0.25, and 0.36 for offense; 0.08, 0.13,
0.16, 0.21, 0.24, 0.28, 0.33, and 0.44 for defense; and
0.35 for the keeper. Note that α denotes either “of-
fense,” “defense,” or “keeper.”
4.1.4 Evaluation Metrics
In this study, we used the Kullback-Leibler (KL) divergence (Yeh et al., 2019), shown in Equation (8), as an evaluation metric to measure the distance between the position and action distributions of the learned football agents and those of the real players.
$$\mathrm{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \qquad (8)$$
Here, P(x) represents the distribution of players in
the real data, and Q(x) represents the distribution of the learned agents.
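As a minimal illustration of how this metric can be computed from binned position or action counts, the sketch below is an assumption about the implementation, not the authors' code; the smoothing constant is also an assumption to avoid division by zero.

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """KL(P || Q) between two empirical distributions, as in Equation (8).

    p_counts, q_counts: histograms over the same bins, e.g. counts of player
    positions on a pitch grid or counts of selected actions.
    """
    p = np.asarray(p_counts, dtype=float) + eps
    q = np.asarray(q_counts, dtype=float) + eps
    p /= p.sum()   # normalize to a probability distribution P
    q /= q.sum()   # normalize to a probability distribution Q
    return float(np.sum(p * np.log(p / q)))

# Example usage with hypothetical action histograms:
# real_players = [120, 40, 30, 10]; learned_agents = [100, 50, 35, 15]
# kl_divergence(real_players, learned_agents)  # small when distributions are close
```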