Coordination for Complex Road Conditions at Unsignalized
Intersections: A MADDPG Method with Enhanced Data Processing
Ruo Chen¹, Yang Zhu¹,², and Hongye Su¹,²
¹Ningbo Innovation Center, Zhejiang University, Ningbo, 315100, China
²College of Control Science and Engineering, Zhejiang University, Hangzhou, 310027, China
{22360415, zhuyang88}@zju.edu.cn, hysu@iipc.zju.edu.cn
Keywords:
Deep Reinforcement Learning, MADDPG, Data Processing, Sliding Control.
Abstract:
In this paper, we use deep reinforcement learning to enable connected and automated vehicles (CAVs) to drive through an intersection shared with human-driven vehicles. The multi-agent deep deterministic policy gradient (MADDPG) algorithm is improved with more efficient data processing, so that it can overcome learning bottlenecks in complex environments, and sliding mode control is used to execute the control strategies. Finally, the feasibility of the method is verified in the CARLA simulation environment.
1 INTRODUCTION
Intersections are often considered to be the main
source of traffic congestion and energy waste (Shi-
razi and Morris, 2017). Based on the existing traffic
system and human driving environment, many meth-
ods have been proposed to increase the operation effi-
ciency of traffic lights, like (Feng and Head, 2015).
Besides, the traffic flow fundamental diagram (FD) is viewed as the basis of traffic flow theory and has been used in many cases, e.g. (Zhou and Zhu, 2020). In addition, deep learning methods (Zhang and Ge, 2024), (Zhang and Li, 2023) and tree-search-based algorithms (Li and Wang, 2006), (Xu and Zhang, 2020) have gradually been applied in the transportation field. In
recent years, the rapid development of autonomous
vehicles is expected to solve the problem of intersec-
tion congestion. The intersection management system
based on global information can better deal with the
data of automatic driving vehicles and the environ-
ment, so as to replace the traditional traffic light sys-
tem to improve traffic efficiency. Autonomous inter-
section management (AIM) is tailored for CAVs, aim-
ing at replacing the conventional traffic control strate-
gies (Wu and Chen, 2019). Graph-based modeling is
often used to solve traffic congestion (Chen and Xu,
2022). Meanwhile, some methods focus on reducing
energy consumption (Malikopoulos and Cassandras,
2018).
At present, most methods can only deal with the control strategy problem in a single type of environment, for example one in which all vehicles are driven by humans and controlled by traffic lights (Shobana and Shakunthala, 2023), or one in which all vehicles are autonomous and there are no traffic lights (Wang and Gong, 2024). However, real-world environments are often more complex and diverse; for instance, pedestrians and human-driven vehicles at intersections can only be detected, not controlled, by AI systems. Secondly, a method that simply generates an overall policy for passing through an intersection has difficulty coping with sudden disturbances, and errors in the control process can greatly affect the robustness of such a policy. Due to the high dynamics and randomness of unsignalized intersections, how to design efficient and safe multi-vehicle cooperative motion planning methods is still a challenging problem.
As a deep reinforcement learning method (Lowe et al., 2020), MADDPG can adapt to input changes and respond to the changing environment. Moreover, the action of each agent is given by a neural network, which keeps the action computation real-time. Compared to traditional exhaustive methods, MADDPG has lower computational complexity, although this comes at some cost in performance (Xu and Liu, 2024).
2 SYSTEM FORMULATION
2.1 Intersections Crossing Mode
The environment is set as a 20 m x 20 m, two-lane intersection. The traffic rules follow the road traffic law of the People's Republic of China (drive on the right). Human-driven vehicles follow right-hand driving and the other traffic rules by default; they are not controlled by the intelligent system, but they can be observed by the intelligent agent vehicles. The intelligent agent vehicles share information among themselves, including position, speed, and the actions taken (throttle and steering wheel angle). There is no error in the shared information.
2.2 Vehicle Model
The dynamic model of the car is defined following (Li, 2024). In the simulation, this dynamic model is used to limit the speed, angular speed, and acceleration of the car. The dynamic model of the car can be expressed by Figure 1 and Equation 1:
Figure 1: Vehicle model.
\dot{v} = \frac{\sin\beta}{m}\left(F_{sf}\cos\delta_f + F_{lf}\sin\delta_f + F_{sr}\right) + \frac{\cos\beta}{m}\left(F_{lf}\cos\delta_f - F_{sf}\sin\delta_f + F_{lr} - \frac{C_{air}A_L\rho}{2}v^2\right)

\dot{\beta} = \frac{\cos\beta}{mv}\left(F_{sf}\cos\delta_f + F_{lf}\sin\delta_f + F_{sr}\right) - \frac{\sin\beta}{mv}\left(F_{lf}\cos\delta_f - F_{sf}\sin\delta_f + F_{lr} - \frac{C_{air}A_L\rho}{2}v^2\right)

\dot{\omega} = \frac{1}{I_Z}\left(F_{lf}l_f\sin\delta_f + F_{sf}l_f\cos\delta_f - F_{sr}l_r\right)    (1)
F_{lf}, F_{lr} are the longitudinal forces of the front and rear wheels, respectively. F_{sf}, F_{sr} are the lateral forces of the front and rear wheels, respectively. v, v_x, v_y are the vehicle speed, the longitudinal speed, and the lateral speed. \beta, \psi, \omega are the body side-slip angle, the yaw angle, and the yaw rate. \delta_f is the steering input angle of the front wheel. l_f is the distance from the center of gravity to the front axle, and l_r is the distance from the center of gravity to the rear axle. m and I_Z are the total mass of the vehicle and the yaw inertia, respectively. C_{air} is the air drag coefficient, A_L is the windward area, and \rho is the density of air.
The vehicle model is used later in the simulator validation section, which confirms that the planned path given by the algorithm can actually be executed.
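To make the model concrete, a minimal numerical sketch of Equation 1 is given below, using a simple forward-Euler integration step; the parameter values and the function interface are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

# Placeholder parameters (illustrative values, not those of the simulated vehicle).
m, I_Z = 1500.0, 2500.0              # mass [kg], yaw inertia [kg m^2]
l_f, l_r = 1.2, 1.4                  # CoG-to-axle distances [m]
C_air, A_L, rho = 0.3, 2.2, 1.225    # drag coefficient, windward area [m^2], air density [kg/m^3]

def euler_step(v, beta, omega, F_lf, F_lr, F_sf, F_sr, delta_f, dt=0.01):
    """One forward-Euler step of the single-track model in Equation (1)."""
    drag  = 0.5 * C_air * A_L * rho * v ** 2
    front = F_sf * np.cos(delta_f) + F_lf * np.sin(delta_f) + F_sr
    body  = F_lf * np.cos(delta_f) - F_sf * np.sin(delta_f) + F_lr - drag
    v_dot     = (np.sin(beta) * front + np.cos(beta) * body) / m
    beta_dot  = (np.cos(beta) * front - np.sin(beta) * body) / (m * max(v, 1e-3))
    omega_dot = (F_lf * l_f * np.sin(delta_f) + F_sf * l_f * np.cos(delta_f) - F_sr * l_r) / I_Z
    return v + dt * v_dot, beta + dt * beta_dot, omega + dt * omega_dot
```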
3 FILTER-FORGET MADDPG
3.1 Algorithm Flowchart and
Introduction
In this section, the FF-MADDPG procedure is given in Algorithm 1. θ^µ and θ^Q denote the policy (actor) network and critic network parameters, respectively, and τ is the soft update parameter.
The MADDPG algorithm is a reinforcement learning algorithm for multi-agent systems that extends the DDPG algorithm from single-agent environments. MADDPG allows multiple agents to learn optimal strategies in cooperative or competitive environments.
The algorithm consists of Critic networks with their target networks and Actor networks with their target networks. The Critic network provides a Q function for each agent, which evaluates the global state together with the actions of all agents (and, for the target value, the next state). The Actor network provides a policy for each agent, which maps the agent's own observation directly to its action.
The critic network is updated by the loss function,
as shown in Equation 2:
Loss = \frac{1}{M}\sum_{j}\left(q_i - Q_i^{\mu}(x, a_1, a_2, a_3, \ldots, a_N)\right)^2    (2)

q_i represents the target Q value of agent i, and its calculation formula is defined as Equation 3:

q_i = r_i + \gamma\, Q_i^{\mu'}(x', a_1', a_2', a_3', \ldots, a_N')    (3)
The actor network is subsequently updated by the
following policy gradient as shown in Equation 4:
\nabla_{\theta_i} J \approx \frac{1}{M}\sum_{j}\nabla_{\theta_i}\mu_i(s_i)\,\nabla_{a_i} Q_i^{\mu}(x, a_1, a_2, \ldots, a_N)    (4)
MADDPG also includes an experience replay
buffer (run policies in the environment, collect the
state, action, reward, and next state for each agent).
Algorithm 1: FF-MADDPG in each training iteration.

for agent i = 1 to N do
    Initialize the target networks: θ^{µ'} ← θ^µ, θ^{Q'} ← θ^Q
end
while training is not finished do
    for t = 1 to max episode do
        for agent i = 1 to N do
            Obtain the current environment state s_t. Select action a_t according to the policy network, a_t = µ(s_t) + N_noise.
        end
        Execute the actions a_t, calculate the reward r_t by formula (9) according to the concrete situation, and acquire the new state s_{t+1}. Pass the transition through the data filter and store s_t into the experience replay memory.
    end
    if the forget operation is triggered then
        Delete previously stored data according to the forgetting ratio.
    end
    for agent i = 1 to N do
        Sample a random mini-batch of samples s_t from the experience replay memory.
        Calculate the target Q value by (3).
        Update the critic network by minimizing the loss (2).
        Update the actor network using the sampled policy gradient (4).
    end
    for agent i = 1 to N do
        Update the target network parameters:
            θ^{µ'} ← τθ^µ + (1 − τ)θ^{µ'}
            θ^{Q'} ← τθ^Q + (1 − τ)θ^{Q'}
    end
end
Similar to DDPG, MADDPG uses an experience re-
play buffer to break the correlation of training data,
thus enhancing training stability.
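For concreteness, one centralized-critic update of the form of Equations 2-4 can be sketched as below, here written with PyTorch; the agent and batch interfaces (attribute names, tensor shapes) are placeholder assumptions for illustration, not the exact implementation used in this work.

```python
import torch
import torch.nn.functional as F

def maddpg_update(i, agents, batch, gamma=0.95, tau=0.01):
    """One critic/actor update for agent i following Equations (2)-(4).

    Assumed interface: `agents` is a list of objects with .actor, .critic,
    .target_actor, .target_critic and their optimizers; `batch` holds joint
    tensors x (B, N, obs_dim), actions (B, N, act_dim), rewards (B, N),
    x_next (B, N, obs_dim).
    """
    x, actions, rewards, x_next = batch
    ag = agents[i]

    # Target value q_i = r_i + gamma * Q'_i(x', a'_1..a'_N)   (Eq. 3)
    with torch.no_grad():
        next_actions = torch.cat(
            [a.target_actor(s) for a, s in zip(agents, x_next.unbind(1))], dim=-1)
        q_target = rewards[:, i:i + 1] + gamma * ag.target_critic(x_next.flatten(1), next_actions)

    # Critic loss (Eq. 2)
    q_value = ag.critic(x.flatten(1), actions.flatten(1))
    critic_loss = F.mse_loss(q_value, q_target)
    ag.critic_opt.zero_grad()
    critic_loss.backward()
    ag.critic_opt.step()

    # Actor update with the sampled policy gradient (Eq. 4)
    new_actions = actions.clone()
    new_actions[:, i] = ag.actor(x[:, i])
    actor_loss = -ag.critic(x.flatten(1), new_actions.flatten(1)).mean()
    ag.actor_opt.zero_grad()
    actor_loss.backward()
    ag.actor_opt.step()

    # Soft update of the target networks with parameter tau
    for tgt, src in ((ag.target_actor, ag.actor), (ag.target_critic, ag.critic)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```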
However, due to the complexity of the training environment, the traditional method is difficult to apply directly. Therefore, this paper improves upon the traditional MADDPG approach to better adapt it to the experimental environment.
There are two main changes:
Firstly, a data filter is introduced. The learning bottlenecks observed in the experiments were caused by a large amount of collision data and extreme-action data, which reduced the efficiency of data utilization. To solve this problem, this paper designs a data filter: it adjusts the proportions between the various types of data (such as collision data, extreme-action data, and normal driving data) by setting a probability parameter that determines whether each sample is recorded.
Secondly, phased forgetting is adopted for the data pool. A phased forgetting operation is introduced for the buffer, deleting part of the stored data when a certain amount is reached or when a learning bottleneck is detected.
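A minimal sketch of these two changes is given below; the class name, acceptance probabilities, capacity, and trigger condition are illustrative assumptions rather than the values used in the experiments.

```python
import random
from collections import deque

class FilteredForgettingBuffer:
    """Replay buffer with a probabilistic data filter and phased forgetting (sketch)."""

    def __init__(self, capacity=100_000, forget_ratio=0.3, p_accept=None):
        # Acceptance probability per transition type (illustrative values).
        self.p_accept = p_accept or {"collision": 0.3, "extreme_action": 0.3, "normal": 1.0}
        self.capacity = capacity
        self.forget_ratio = forget_ratio
        self.data = deque()

    def add(self, transition, kind="normal"):
        """Store the transition only with the probability assigned to its type."""
        if random.random() < self.p_accept.get(kind, 1.0):
            self.data.append(transition)
        if len(self.data) >= self.capacity:      # phased forgetting trigger
            self._forget()

    def _forget(self):
        """Delete the oldest fraction of the stored data (the forgetting ratio)."""
        n_drop = int(len(self.data) * self.forget_ratio)
        for _ in range(n_drop):
            self.data.popleft()

    def sample(self, batch_size):
        """Random mini-batch sampling, as in the standard replay buffer."""
        return random.sample(list(self.data), min(batch_size, len(self.data)))
```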
3.2 Contrast with Traditional Methods
The traditional method lacks the data processing part, i.e. the filtered storage of data: it simply stores all the data in the buffer and randomly samples from it during training. As the amount of data keeps increasing, the information entropy of the sampled data decreases significantly; in other words, the similarity of the data increases. This makes it difficult for the neural network to learn the more important samples, resulting in a learning bottleneck.
Figure 2: Block diagram of the filter forgetting part.
Previously, importance sampling methods have been proposed, which adjust the sampling probability of relatively important data and increase efficiency by intensifying the learning of key information. The method proposed in this paper is shown in Figure 2.
As Figure 2 shows, data enters the buffer through a filter: an input state passes through with a different probability depending on the conditions set. In order to show the benefit of this method more intuitively, the 3-AI, 1-human scenario is used as the experimental object and its information entropy is recorded. The histogram method is used to calculate the information entropy.
Figure 3: Information entropy comparison.
Figure 4: Flowchart of algorithm.
The formula for information entropy is given by
Equation 5.
H(X) = -\sum_{j}^{M} s_{ij}\log_2 s_{ij}, \quad \mathrm{bins} = 10    (5)

s_{ij} denotes the j-th bit of the state array of the i-th vehicle.
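As a sketch, the histogram-based entropy of Equation 5 over the buffered states can be computed as follows; the bin count of 10 follows the text, while the per-dimension normalization is an assumption made only for illustration.

```python
import numpy as np

def buffer_entropy(states, bins=10):
    """Histogram-method information entropy of buffered states, per Equation (5).

    `states` is an (n_samples, state_dim) array; each state dimension is binned
    separately and the per-dimension entropies are summed (assumed convention).
    """
    states = np.asarray(states, dtype=float)
    total = 0.0
    for column in states.T:
        counts, _ = np.histogram(column, bins=bins)
        p = counts / counts.sum()
        p = p[p > 0]                          # ignore empty bins, 0*log(0) := 0
        total += -(p * np.log2(p)).sum()
    return total
```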
As is clear in Figure 3(a), with the traditional approach the unfiltered data keep accumulating, and random sampling then increases the probability of drawing repeated data, so the information entropy is significantly reduced. Figure 3(b) shows that, with the proposed data processing, the diversity of the data within the buffer is preserved.
The flowchart of this paper’s algorithm is as
shown in Figure 4:
Q_i^{\mu} denotes the critic value under policy \mu, and Q_i^{\mu'} denotes the target critic network. The superscript ' denotes the next time step. a denotes the actions of the agents, N is the number of agents, M is the number of mini-batch experiences extracted during training, and x is the joint state of all agents (s_1, s_2, s_3, \ldots, s_N).
3.3 Constraints and Completion
Conditions
The vehicle motion constraints are specified in Table 1:

Table 1: Vehicle constraints.

Variable      | Minimum value | Maximum value
Path offset   | 0 m           | 1 m
Speed         | 0 m/s         | 6 m/s
Acceleration  | 0 m/s²        | 2 m/s²
X_offset      | 0 m           | 2 m
Y_offset      | 0 m           | 2 m
D_safe        | 5 m           | +∞
A schematic of the vehicle trajectory is shown in Figure 5.
Figure 5: Path diagram.
The vehicle's motion planning is mainly based on a predetermined trajectory. At the same time, in order to let the vehicle deal with complex road conditions more flexibly, the vehicle is allowed to deviate from the trajectory to some degree. X_offset represents the offset on the X-axis relative to the planned route, and Y_offset represents the offset on the Y-axis relative to the planned route. For the collision judgment, a collision is registered whenever another vehicle comes within 5 m of the vehicle's center point. Reaching the planned destination of the route through the intersection is regarded as task completion. D_safe denotes the safe distance.
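The constraints of Table 1 and the completion condition can be checked directly; the sketch below uses the 5 m collision radius and 2 m offsets from the text, while the helper names and the arrival tolerance are assumptions for illustration.

```python
def violates_constraints(x_offset, y_offset, speed, accel, dist_to_others):
    """True if any bound of Table 1 is broken or another vehicle is within D_safe = 5 m."""
    return (abs(x_offset) > 2.0 or abs(y_offset) > 2.0 or
            not (0.0 <= speed <= 6.0) or not (0.0 <= accel <= 2.0) or
            min(dist_to_others) < 5.0)

def task_completed(position, goal, tol=1.0):
    """Task completion: the vehicle has reached its planned destination (tolerance assumed)."""
    return ((position[0] - goal[0]) ** 2 + (position[1] - goal[1]) ** 2) ** 0.5 < tol
```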
3.4 Parameter Setting
The state is designed as shown in Equation 6:

s_{i,t} = \{P_{self}, A_{self}, P_{other}, A_{other}, D_{other}, R_{other}\}    (6)

s_{i,t} denotes the state space of the i-th vehicle at time t. P_{self} stands for the position of car i, A_{self} for the action of car i, P_{other} for the positions of the other cars, A_{other} for the actions of the other cars, D_{other} for the distance between car i and the other cars, and R_{other} for the relationship between car i and the other cars. The relationship is defined as shown in Equation 7:

R_{other} = \begin{cases} -1, & D_{other}\ \text{is smaller} \\ 0, & D_{other}\ \text{is the same} \\ 1, & D_{other}\ \text{is bigger} \end{cases}    (7)

Smaller, the same, and bigger in the judgment condition are all determined by comparison with D_{other} at the previous time step.
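A direct transcription of the relationship flag in Equation 7, comparing the current inter-vehicle distance with that of the previous time step, is given below; the tolerance used for "the same" is an assumption.

```python
def relation_flag(d_now, d_prev, eps=1e-3):
    """R_other per Equation (7): -1 if closer, 0 if unchanged, 1 if farther than before."""
    if d_now < d_prev - eps:
        return -1
    if d_now > d_prev + eps:
        return 1
    return 0
```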
The action is designed as shown in Equation 8:

a_{i,t} = \{X_{offset}, Y_{offset}, A_{self}\}    (8)

a_{i,t} denotes the action space of the i-th vehicle at time t.
The reward function is designed as shown in Equation 9:

r = \begin{cases} K_c\,(D_{other} - D_{safe}) + K_a\,(A_{self}), & D_{other} \le D_{safe} \\ K_a\,(A_{self}), & D_{other} > D_{safe} \end{cases}    (9)

K_c and K_a denote gain coefficients.
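The piecewise reward of Equation 9 can be written directly as below; the gain values are placeholders, and the action term is treated as a scalar measure of A_self, which is an interpretation made only for this sketch.

```python
def reward(d_other, a_self, d_safe=5.0, k_c=1.0, k_a=0.1):
    """Reward of Equation (9): distance penalty inside D_safe plus an action term.

    d_other: distance to the nearest other vehicle; a_self: scalar action measure
    (assumed interpretation of K_a(A_self)); k_c, k_a: gain coefficients.
    """
    if d_other <= d_safe:
        return k_c * (d_other - d_safe) + k_a * a_self
    return k_a * a_self
```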
Forgetting ratio: the proportion of stored data that is deleted when a certain data capacity is reached; see the flowchart.
4 SIMULATION EXECUTION
4.1 Simulation Environment
In this paper, CARLA is used as the simulation platform, the vehicle model in the experiments is "Audi", and the control method is sliding mode control.
4.2 Design of Sliding-Mode Surface
Sliding mode control (SMC) is a nonlinear control method with strong robustness and fast response. The sliding mode control method used here is derived from (Feng and Han, 2014). Define S_a as the actual direction of the vehicle, S_t as the target direction, x_1 as the direction error, V_a as the actual speed of the vehicle, V_t as the target speed, and x_2 as the speed error. T is the throttle force of the vehicle and S is the steering angle of the steering wheel. t is the control time. k_1, k_2, k_3, k_4 are proportional coefficients.
The sliding-mode surface is designed as shown in Equation 10:

s = k_1 T + k_2\,\mathrm{sign}(x_2)\,|x_2|^{9/16} + k_3\,x_1^{9/23}    (10)
Aiming at the problem of chattering in sliding mode control, the control signal of the throttle is designed as shown in Equation 11:

u_{eq} = x_2^{3} - k_2\,\mathrm{sign}(x_2)\,|x_2|^{9/16} - k_3\,x_1^{9/23}
u_n = k_4\,e^{-t} + 10\,\mathrm{sign}(s)
T = u_{eq} + u_n    (11)
The control signal of the steering wheel angle is derived from x_1, where k_5 is a proportional coefficient, and the formula is shown in Equation 12:

S = k_5\,x_1    (12)
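Under the reconstruction of Equations 10-12 above, the throttle and steering laws can be sketched as follows; the gain values, the use of signed fractional powers for negative errors, and the decaying exponential term are assumptions made only for illustration.

```python
import math

def smc_control(x1, x2, t, T_prev, k=(1.0, 0.5, 0.5, 0.2, 1.0)):
    """Sliding-mode throttle and steering per Equations (10)-(12) (sketch, placeholder gains).

    x1: direction error, x2: speed error, t: control time, T_prev: previously applied throttle.
    """
    k1, k2, k3, k4, k5 = k
    sgn = lambda z: (z > 0) - (z < 0)
    frac = lambda z, p: sgn(z) * abs(z) ** p                 # signed fractional power (assumed)

    s = k1 * T_prev + k2 * frac(x2, 9 / 16) + k3 * frac(x1, 9 / 23)    # sliding surface (Eq. 10)
    u_eq = x2 ** 3 - k2 * frac(x2, 9 / 16) - k3 * frac(x1, 9 / 23)     # equivalent control (Eq. 11)
    u_n = k4 * math.exp(-t) + 10 * sgn(s)                              # chattering-reduction term (Eq. 11)
    throttle = u_eq + u_n                                              # T = u_eq + u_n
    steering = k5 * x1                                                 # Eq. 12
    return throttle, steering
```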
5 RESULTS AND DISCUSSION
5.1 2AI and 1Human Situation
This scenario involves 2 AI vehicles and 1 human-driven vehicle.
Figure 6 shows the spatio-temporal diagram of the three vehicles' positions, where the Z-axis represents time and the X and Y axes are the location coordinates of the intersection. The red line (car 3) represents the human-driven vehicle, and the two sub-figures show the performance of the AI vehicles when the human-driven vehicle follows different driving strategies. The results satisfy all the previous constraints and completion conditions.
Figure 6: Trajectories in space–time of 2AI 1human.
5.2 3AI and 1Human Situation
This case involves 3 AI vehicles and 1 human-driven vehicle.
Figure 7 shows the spatio-temporal diagram of the four vehicles' positions. The red line (car 4) represents the human-driven vehicle, and the two sub-figures show the performance of the AI vehicles when the human-driven vehicle follows different driving strategies. The results satisfy all the previous constraints and completion conditions.
Figure 7: Trajectories in space–time of 3AI 1human.
5.3 5AI Situation
This case involves 5 AI vehicles and no human-driven vehicle.
Figure 8 shows the spatio-temporal diagram of the five vehicles' positions. The results satisfy all the previous constraints and completion conditions.
Figure 8: Trajectories in space–time of 5AI.
The CARLA results can be seen in Figure 9. At t = 0 s, the cars depart from the prescribed initial points. Then the control strategy is executed according to the speed and position plan given by the algorithm.
Figure 9: Simulation in CARLA.
At t = 2 s, t = 5 s, t = 7 s and t = 9 s, it can be clearly seen that the autonomous vehicles respond to approaching vehicles and potential collisions. At t = 13 s, it can be seen that the vehicles have completed their assigned routes and reached their respective destinations.
6 CONCLUSIONS
In this paper, an improved MADDPG algorithm is proposed and applied to solve the intersection planning problem of autonomous vehicles mixed with human-driven vehicles. The effectiveness of the algorithm is verified using sliding mode control in CARLA.
The main contributions of this paper are as fol-
lows:
(1) A MADDPG algorithm with more effective data processing is proposed to address the issue that deep reinforcement learning frequently encounters bottlenecks in complex environments.
(2) A reinforcement learning framework is built to handle intersection scenes involving human-driven vehicles, so that the autonomous vehicles can handle not only a fully AI-controlled environment but also a complex environment with human-driven vehicles.
ACKNOWLEDGEMENT
This work is supported by Zhejiang Provincial
Natural Science Foundation of China (Grant No.
LQ23F030014), National Natural Science Foundation
of China (Grant No. 62303410), and Open Research
Project of State Key Laboratory of Industrial Control
Technology, Zhejiang University, China (Grant No.
ICT2024B46).
REFERENCES
Chen, C. and Xu, Q. (2022). Conflict-free coopera-
tion method for connected and automated vehicles
at unsignalized intersections: Graph-based modeling
and optimality analysis. IEEE Transactions on Intel-
ligent Transportation Systems, 23(11):21897–21914.
Feng, Y. and Han, F. (2014). Chattering free full-order
sliding-mode control. Automatica, 50(4):1310–1314.
Feng, Y. and Head, K. L. (2015). A real-time adaptive signal
control in a connected vehicle environment. Trans-
portation Research Part C: Emerging Technologies,
55:460–473.
Li, A. (2024). Intelligent Vehicle Perception, Trajectory
Planning and Control. Chemical Industry Press, Bei-
jing, 1 edition.
Li, L. and Wang, F. (2006). Cooperative driving at blind
crossings using intervehicle communication. IEEE
Transactions on Vehicular Technology, 55(6):1712–
1724.
Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mor-
datch, I. (2020). Multi-agent actor-critic for mixed
cooperative-competitive environments.
Malikopoulos, A. A. and Cassandras, C. G. (2018). A
decentralized energy-optimal control framework for
connected automated vehicles at signal-free intersec-
tions. Automatica, 93:244–256.
Shirazi, M. S. and Morris, B. T. (2017). Looking at intersec-
tions: A survey of intersection monitoring, behavior
and safety analysis of recent studies. IEEE Transac-
tions on Intelligent Transportation Systems, 18(1):4–
24.
Shobana, S. and Shakunthala, M. (2023). Iot based on smart
traffic lights and streetlight system. In 2023 2nd Inter-
national Conference on Edge Computing and Appli-
cations (ICECAA), pages 1311–1316.
Wang, B. and Gong, X. (2024). Coordination for con-
nected and autonomous vehicles at unsignalized in-
tersections: An iterative learning-based collision-free
motion planning method. IEEE Internet of Things
Journal, 11(3):5439–5454.
Wu, Y. and Chen, H. (2019). Dcl-aim: Decentralized coor-
dination learning of autonomous intersection manage-
ment for connected and automated vehicles. Trans-
portation Research Part C: Emerging Technologies,
103:246–260.
Xu, H. and Zhang, Y. (2020). Cooperative driving at
unsignalized intersections using tree search. IEEE
Transactions on Intelligent Transportation Systems,
21(11):4563–4571.
Xu, J. and Liu, C. (2024). Resource allocation in multi-uav
communication networks using maddpg framework
with double reward and task decomposition. pages
371–376.
Zhang, J. and Ge, J. (2024). A bi-level network-wide coop-
erative driving approach including deep reinforcement
learning-based routing. IEEE Transactions on Intelli-
gent Vehicles, 9(1):1243–1259.
Zhang, J. and Li, S. (2023). Coordinating cav swarms
at intersections with a deep learning model. IEEE
Transactions on Intelligent Transportation Systems,
24(6):6280–6291.
Zhou, J. and Zhu, F. (2020). Modeling the fundamental
diagram of mixed human-driven and connected au-
tomated vehicles. Transportation Research Part C:
Emerging Technologies, 115:102614.