
agent learns how to score goals such as passing and
shooting through interaction with the environment.
The pitch spans −1 to +1 on the x axis and −0.42 to +0.42 on the y axis, and the goal spans −0.044 to +0.044 on the y axis. In this study, to simplify the experiment, we trained only the offense, with a scenario of four offensive players and eight defensive players. For the defense, we used a bot that follows the ball and moves back toward its own side, using a rule-based approach provided
by GRF. The initial positions of the agents in each
episode were set to the positions of all the players
when they received an assist pass in the shooting se-
quence of the real data. The actions taken by the
agents were limited to 12 types: moving in 8 direc-
tions in 45-degree increments, doing nothing, high
pass, short pass, and shoot. The states observed by
the agents were the x and y coordinates of each player (44 dimensions), the position coordinates of the ball, a 3-dimensional one-hot vector indicating which team holds the ball (left team, right team, or no team in possession), and an 11-dimensional one-hot vector for the player IDs (11 players per team), for a total of 61 dimensions. The episode ended when the
conditions for switching between offense and defense
were met, such as when a goal was scored, a goal was
conceded, or the ball was lost, or when 100 steps had
elapsed. For the weight of the shoot reward, we divided half of the pitch into five sections along the x axis and eight sections along the y axis, and used as the weight for each grid cell the percentage of shots taken from that cell in the goal sequences of the real data.
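To make this grid-based weighting concrete, the sketch below shows one way such weights could be computed from the shot positions in the goal sequences; the 5 × 8 grid follows the text above, while the coordinate clipping, function name, and data layout are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Pitch coordinates follow the GRF convention described above:
# x in [-1, 1], y in [-0.42, 0.42]; we grid the attacking half (x >= 0),
# which is an assumption of this sketch.
N_X, N_Y = 5, 8  # 5 sections along x, 8 sections along y

def shot_weight_grid(shot_positions):
    """Estimate per-cell shoot-reward weights from real shot positions.

    shot_positions: array-like of shape (n_shots, 2) with the (x, y) of each
    shot taken in the goal sequences. Returns an (N_X, N_Y) array whose
    entries are the fraction of shots taken from each cell.
    """
    pos = np.asarray(shot_positions, dtype=float)
    x = np.clip(pos[:, 0], 0.0, 1.0)
    y = np.clip(pos[:, 1], -0.42, 0.42)
    xi = np.minimum((x * N_X).astype(int), N_X - 1)            # x-axis cell index
    yi = np.minimum(((y + 0.42) / 0.84 * N_Y).astype(int), N_Y - 1)  # y-axis cell index
    counts = np.zeros((N_X, N_Y))
    np.add.at(counts, (xi, yi), 1)                             # count shots per cell
    return counts / counts.sum()                               # percentage per cell
```

The shoot reward for an agent can then be scaled by the weight of the cell in which the shot is taken.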
4.1.2 Dataset
In this study, we used event data and tracking data
from 95 J1 League games in 2021 and 2022 provided
by the 2023 Sports Data Science Competition. From
this dataset, we extracted 1,520 sequences from the
time an assist pass was made to the time a shot was
taken, and divided them into 141 goal sequences and
1,379 shot sequences. As a preprocessing step, we ex-
tracted only sequences in which there were 22 players
and a ball on the pitch, and downsampled them from
25 Hz to 8.33 Hz. We used these sequences for in-
verse reinforcement learning (proposed method), pre-
training (comparison method (Fujii et al., 2023)), and
reinforcement learning. For inverse reinforcement
learning (proposed method) and pre-training (com-
parison method), the data was divided into 102 train-
ing sequences and 39 validation sequences for the
goal sequences. For reinforcement learning, the initial position data of the shot sequences, i.e., the positions of all players at the moment the assist pass is received, was divided into 1,179 training initial positions and 100 test initial positions. In addition, 100 shot se-
quences were used as the teacher data to be stored in
the buffer during reinforcement learning in the com-
parative method. The four offensive players selected as the experimental subjects were the player who made the assist pass, the player who took the shot, and, of the remaining players, the two closest to the goal. The eight defensive players were the seven players closest to the goal and the goalkeeper.
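As a concrete illustration of this selection rule, the sketch below picks the controlled offensive and defensive players from the tracked positions at the moment of the assist pass; the array layout, goal coordinates, and function name are assumptions for illustration, not the competition data format.

```python
import numpy as np

GOAL = np.array([1.0, 0.0])  # centre of the attacked goal in GRF coordinates (assumed)

def select_players(off_xy, def_xy, assister_idx, shooter_idx, gk_idx):
    """Pick 4 offensive and 8 defensive players for the scenario (sketch).

    off_xy, def_xy: (11, 2) arrays of player positions at the assist pass.
    Returns (offense indices, defense indices).
    """
    # Offense: assister, shooter, and the two remaining players closest to the goal.
    remaining = [i for i in range(len(off_xy)) if i not in (assister_idx, shooter_idx)]
    dists = np.linalg.norm(off_xy[remaining] - GOAL, axis=1)
    closest_two = [remaining[i] for i in np.argsort(dists)[:2]]
    offense = [assister_idx, shooter_idx] + closest_two

    # Defense: the goalkeeper plus the seven outfield players closest to the goal.
    outfield = [i for i in range(len(def_xy)) if i != gk_idx]
    d_dists = np.linalg.norm(def_xy[outfield] - GOAL, axis=1)
    closest_seven = [outfield[i] for i in np.argsort(d_dists)[:7]]
    defense = [gk_idx] + closest_seven
    return offense, defense
```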
4.1.3 Learning Parameters
We implemented the proposed method using DDQN
for the reinforcement learning component and DQN
for the inverse reinforcement learning component.
This method, referred to as DDQN IRL, was then com-
pared with both DDQN and DQAAS (Fujii et al.,
2023). Table 2 shows the basic learning parameters
for DQN and DDQN, such as batch size and dis-
count rate. We set common values for these param-
eters in both methods. Additionally, the number of
update steps for the Q-Network, a parameter specific
to DDQN, was set to 10,000. Moreover, we used Pri-
oritized Experience Replay as a sampling method for
experience data from the buffer, as in Fujii et al. (Fu-
jii et al., 2023). In the loss function of DQAAS, $\lambda_1$, the weight for the supervised learning term, was set to 0.04, while $\lambda_2$, the weight for the regularization
term, was set to 1.00. The convergence threshold for
inverse reinforcement learning in DDQN IRL, ε, was
set to 0.1. In addition, for the parameters of the state
feature, we used four different values for r, the ra-
dius used to calculate the proportion of players: 0.1,
0.2, 0.3, and 0.4. For $d^{\alpha}_n$ in Equation (1), which rep-
resents the distance to the n-th closest offense or de-
fense player, or the distance to the keeper, we adopted
the following values: in ascending order of relative
distance, 0.16, 0.25, and 0.36 for offense; 0.08, 0.13,
0.16, 0.21, 0.24, 0.28, 0.33, and 0.44 for defense; and
0.35 for the keeper. Note that α denotes either “of-
fense,” “defense,” or “keeper.”
4.1.4 Evaluation Metrics
In this study, we used the Kullback-Leibler (KL) divergence (Yeh et al., 2019), shown in Equation (8), as an evaluation metric to measure the distance between the position and action distributions of the learned football agents and those of the real players.
$$\mathrm{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \qquad (8)$$
Here, P(x) represents the distribution of players in
the real data, and Q(x) represents the distribution of the learned agents.
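As a minimal illustration of how this metric can be computed from binned position or action counts, the sketch below is an assumption about the implementation, not the authors' code; the smoothing constant is also an assumption to avoid division by zero.

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """KL(P || Q) between two empirical distributions, as in Equation (8).

    p_counts, q_counts: histograms over the same bins, e.g. counts of player
    positions on a pitch grid or counts of selected actions.
    """
    p = np.asarray(p_counts, dtype=float) + eps
    q = np.asarray(q_counts, dtype=float) + eps
    p /= p.sum()   # normalize to a probability distribution P
    q /= q.sum()   # normalize to a probability distribution Q
    return float(np.sum(p * np.log(p / q)))

# Example usage with hypothetical action histograms:
# real_players = [120, 40, 30, 10]; learned_agents = [100, 50, 35, 15]
# kl_divergence(real_players, learned_agents)  # small when distributions are close
```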