Hybrid Optimal Traffic Control: Combining Model-Based and Data-Driven Approaches
Urs Baumgart and Michael Burger
Fraunhofer Institute for Industrial Mathematics ITWM, Fraunhofer-Platz 1, D-67663 Kaiserslautern, Germany
Keywords:
Traffic Control, Model Predictive Control, Optimal Control, Imitation Learning.
Abstract:
We study different approaches that use real-time communication between vehicles in order to improve and optimize traffic flow in the future. A leading example in this contribution is a virtual version of the prominent ring road experiment, in which realistic, human-like driving generates stop-and-go waves.
To simulate human driving behavior, we consider microscopic traffic models in which single cars and their
longitudinal dynamics are modeled via coupled systems of ordinary differential equations. Whereas most
cars are set up to behave like human drivers, we assume that one car has an additional intelligent controller
that obtains real-time information from other vehicles. Based on this example, we analyze different control
methods including a nonlinear model predictive control (MPC) approach with the overall goal to improve
traffic flow for all vehicles in the considered system.
We show that this nonlinear controller may outperform other control approaches for the ring road scenario, but its intensive computational effort may prevent it from being real-time capable. We therefore propose an imitation
learning approach to substitute the MPC controller. Numerical results show that, with this approach, we
maintain the high performance of the nonlinear MPC controller, even in set-ups that differ from the original
training scenarios, and also drastically reduce the computing time for online application.
1 INTRODUCTION
Increasing traffic volumes, accompanied by traffic
jams and negative environmental aspects, have led to
a high demand for intelligent mobility solutions that
should guarantee efficiency, safety, and sustainability.
Intelligent vehicles that communicate with other road users may contribute to these solutions if they not only pursue their own goals (e.g., minimizing travel time) but also aim to increase and stabilize traffic flow for all vehicles in certain traffic situations, for instance, in stop-and-go waves and congestion that emerge because of inefficient human driving behavior. With intelligent vehicle control, these inefficiencies may be counterbalanced if we can directly affect other vehicles' driving behavior.
A crucial point is the design of such controllers.
While models that describe traffic dynamics and hu-
man driving behavior have been studied since the last
century (Lighthill and Whitham, 1955; Gazis et al.,
1961), nowadays, an increasing amount of data is be-
ing collected by vehicles and infrastructure systems,
such that a huge amount of real-time traffic data is
available. This means that, besides approaches that require at least partial understanding of the vehicle dynamics, such as optimal control theory and model predictive control (MPC), purely data-driven approaches may also be applied in traffic control.
1.1 Related Work
For a long time, traffic flow has been controlled by infrastructure measures like speed limits (Hegyi et al., 2008) or switching through traffic light phases (McNeil, 1968; De Schutter and De Moor, 1998). But since driver assistance systems as well as (semi-)autonomous vehicles have been developed, new ways to control traffic flows have emerged. By communicating with other vehicles, such intelligent vehicle controllers may increase traffic flow, e.g., with cruise control systems (Orosz et al., 2010; Orosz, 2016).
One specific scenario, in which these controllers
may be applied, tested, and optimized is the artificial
ring road scenario based on the experiment described
in (Sugiyama et al., 2008). Here, human-like driv-
ing behavior leads to stop-and-go waves without the
presence of bottleneck situations. Further, it has been
shown in another real-world experiment that with one
intelligently controlled vehicle the traffic flow can be
stabilized (Stern et al., 2018). Different modeling set-
ups and control strategies have been proposed to an-
alyze the stability and controllability of the ring road
network (Cui et al., 2017; Wang et al., 2020; Zheng
et al., 2020).
While in these works the underlying dynamical system that describes the dynamics of all vehicles is linearized, we present an MPC approach that preserves the nonlinear vehicle dynamics, and we show that a linearization may decrease the controller's performance.
Further, we apply an imitation learning approach that mimics the nonlinear MPC controller with the goal of reducing the computing time for online applications while keeping the high performance of the nonlinear MPC controller. We compare our approach
with several other controllers and show that it dras-
tically reduces the required training time in compari-
son to other ML-based approaches that have been ap-
plied to this problem, like reinforcement learning (Wu
et al., 2017; Baumgart and Burger, 2021).
Accordingly, the remaining part of this contribu-
tion is organized as follows. In Section 2, we intro-
duce microscopic traffic models and set up the spe-
cific traffic control problem for the ring road. Then, in
Section 3, we present solution approaches for the con-
trol problem and compare their results in Section 4.
We also analyze the robustness of the imitation learn-
ing approach in different scenarios and close the con-
tribution in Section 5 with a short summary and po-
tential open problems.
2 TRAFFIC MODELING AND CONTROL
To set up the specific traffic control problem, we
model human-like driving behavior with microscopic
traffic models - and, more precisely, car-following (or
follow-the-leader) models (Helbing, 1997; Kessels,
2019). With these models, we can describe traf-
fic flows of individual driver-vehicle units on single
lanes. To reflect human decision making, the acceleration behavior of each vehicle i depends on the headway h_i towards its leading vehicle i+1 and on both its own speed v_i and the leader's speed v_{i+1}. We define the headway between the two vehicles in terms of the bumper-to-bumper distance as follows:

h_i = s_{i+1} - s_i - l_{i+1},

with s_i and s_{i+1} being the (front bumper) lane positions of both vehicles and l_{i+1} the length of the leading vehicle. Then, the dynamics of each vehicle i can be described by a system of ordinary differential equations (ODEs):

\dot{h}_i(t) = v_{i+1}(t) - v_i(t),
\dot{v}_i(t) = f(h_i(t), v_i(t), v_{i+1}(t)).    (1)
The right-hand side f, which determines the acceleration, depends on the specific car-following model. One choice is the Intelligent Driver Model (IDM) (Treiber and Kesting, 2013), in which the right-hand side is given by

f(h_i, v_i, v_{i+1}) = a_{max} \left[ 1 - \left( \frac{v_i}{v_{des}} \right)^{\delta} - \left( \frac{h_i^{*}(v_i, \Delta v_i)}{h_i} \right)^{2} \right],

h_i^{*}(v_i, \Delta v_i) = h_{min} + \max\left( 0, \; v_i T + \frac{v_i \, \Delta v_i}{2 \sqrt{a_{max} b}} \right),    (2)

with Δv_i = v_i − v_{i+1}. The model depends on a set of parameters

\beta_{IDM} = [v_{des}, T, h_{min}, \delta, a_{max}, b]    (3)

which may be fitted to real-world traffic data or be used to model different driver types.
Remark 1. In our set-up, other car-following models may be used to describe the acceleration behavior of the human-driven vehicles as well. We choose the IDM because it realistically describes human driving behavior, including its inefficiencies, which makes it an adequate choice for the comparison of different control approaches. Further, it avoids crashes, and we can introduce heterogeneous driving behavior by assigning each vehicle i an individual set of parameters β_IDM^i (cf. Section 4). Typical parameter values as well as other traffic models can be found in (Treiber and Kesting, 2013).
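To make this concrete, the following Python sketch implements the IDM right-hand side (2); the default parameter values follow the nominal set β_IDM^nom = [16, 1, 2, 4, 1, 1.5] used later in Section 4, and the function is an illustrative transcription, not the paper's implementation.

```python
import numpy as np

def idm_acceleration(h_i, v_i, v_lead,
                     v_des=16.0, T=1.0, h_min=2.0,
                     delta=4.0, a_max=1.0, b=1.5):
    """IDM right-hand side f(h_i, v_i, v_{i+1}) from Eq. (2)."""
    dv = v_i - v_lead  # approach rate Delta v_i
    # desired (safety) gap h_i^*(v_i, Delta v_i)
    h_star = h_min + max(0.0, v_i * T + v_i * dv / (2.0 * np.sqrt(a_max * b)))
    # free-flow term minus interaction term
    return a_max * (1.0 - (v_i / v_des) ** delta - (h_star / h_i) ** 2)
```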
2.1 Ring Road Control
In the real-world ring road experiments (Sugiyama et al., 2008; Stern et al., 2018), it has been shown that human driving behavior can be inefficient and that connected autonomous vehicles (AVs) may be able to improve traffic flow for all vehicles in the system. As the ring road is a closed scenario, the number of vehicles is fixed, which facilitates the mathematical description of the system. At the same time, despite being an artificial scenario, similar inefficiencies (congestion and stop-and-go waves) occur as in real-world scenarios. Therefore, the ring road experiment has proven to be an effective benchmark scenario for developing and testing new traffic control strategies.
Figure 1: Visualization of the Ring Road in Matlab (The MathWorks, Inc., 2021) with 21 HDs and one AV.

In this work, we focus on scenarios with a total number of N vehicles, including one AV and N-1 human drivers (HDs). We assume that the dynamics of
the HDs (i = 1, ..., N-1) are given by a car-following model (cf. (1)) and cannot be controlled directly. But for the AV (i = N), we can set the acceleration behavior with controls u such that the dynamics are given by

\dot{h}_N(t) = v_1(t) - v_N(t),
\dot{v}_N(t) = u(t).
Then, we define the system's state as

x = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}, \quad x_i = \begin{bmatrix} h_i \\ v_i \end{bmatrix},

and the dynamics can be described as an ODE initial value problem (IVP):

\dot{x}(t) = F(t, x(t), u(t)) = \begin{bmatrix} v_2(t) - v_1(t) \\ f(h_1(t), v_1(t), v_2(t)) \\ \vdots \\ v_1(t) - v_N(t) \\ u(t) \end{bmatrix}, \quad x(t_0) = x_0, \quad t \in [t_0, t_f],    (4)

with initial value x_0. To set up the control problem,
first, we define two constraints for the AV such that
certain safety criteria are satisfied.
The first is a box constraint to include physical limitations for acceleration and deceleration of the AV,

u(t) \in [a_{min}^{N}, a_{max}^{N}] =: U,    (5)

and the second is a safety distance, induced by Gipps' model (Gipps, 1981; Treiber and Kesting, 2013), to keep the AV from crashing into its leading vehicle:

h_N(t) \geq h_{min} + v_N(t) \, t_N + \frac{v_N(t)^2}{2 a_{min}^{N}} - \frac{v_1(t)^2}{2 a_{min}^{N}},    (6)

with AV reaction time t_N and deceleration bound a_min^N. For the other vehicles, the IDM (cf. (2)) is known to prevent crashes for all HDs (Treiber and Kesting, 2013).
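As a pointwise check of (6), a small helper might look as follows; here a_min_N is treated as the magnitude of the AV's deceleration bound, which is an assumption of this sketch rather than a convention stated in the paper.

```python
def gipps_safe(h_N, v_N, v_lead, h_min=2.0, t_N=1.0, a_min_N=1.5):
    """Check the Gipps-type safety-distance constraint (6).

    a_min_N is interpreted here as the magnitude of the AV's maximal
    deceleration (an assumption of this sketch)."""
    safe_gap = (h_min + v_N * t_N
                + v_N**2 / (2.0 * a_min_N) - v_lead**2 / (2.0 * a_min_N))
    return h_N >= safe_gap
```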
While keeping these constraints satisfied, we aim to find controls for the AV such that the traffic flow for all vehicles in the system is maximized. One approach to measure the traffic flow in the system is the average speed over all vehicles. We therefore define the running cost l as

l(x, u) = r \cdot u^2 + q \cdot \left( v^{*} - \frac{1}{N} \sum_{i=1}^{N} v_i \right)^{2}    (7)

with r, q ≥ 0 to penalize high actuations u of the AV and to steer the average speed of the system to an equilibrium speed v*. We can summarize Eqs. (4)-(7) to state the following nonlinear ODE optimal control problem (OCP):

\min_{x,u} \; J(x,u) = \int_{t_0}^{t_f} l(x(t), u(t)) \, dt
\text{s.t.} \quad \dot{x}(t) = F(t, x(t), u(t)), \quad x(t_0) = x_0,
\quad u(t) \in U, \quad x(t) \in X, \quad t \in [t_0, t_f],    (8)

in which X ⊂ R^{2N} denotes the set of states x such that constraint (6) is satisfied.
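For intuition, the coupled dynamics (4) can be simulated directly; the sketch below lets all 22 vehicles (cf. Figure 1) follow the IDM, reusing idm_acceleration from above, and the initial headways and speeds are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np
from scipy.integrate import solve_ivp

def ring_rhs(t, x, params):
    """Right-hand side of (4) with every vehicle driven by the IDM.

    x = [h_1, v_1, ..., h_N, v_N]; on the ring, vehicle i's leader is
    vehicle i+1, and vehicle N's leader is vehicle 1."""
    N = len(params)
    h, v = x[0::2], x[1::2]
    dx = np.empty_like(x)
    for i in range(N):
        lead = (i + 1) % N                  # ring topology
        dx[2 * i] = v[lead] - v[i]          # headway dynamics
        dx[2 * i + 1] = idm_acceleration(h[i], v[i], v[lead], *params[i])
    return dx

# 22 vehicles with the nominal parameter set from Section 4
params = [(16.0, 1.0, 2.0, 4.0, 1.0, 1.5)] * 22
x0 = np.tile([10.0, 5.0], 22)               # equal headways and speeds (placeholders)
sol = solve_ivp(ring_rhs, (0.0, 600.0), x0, args=(params,), max_step=0.1)
```

With heterogeneous parameter sets (cf. Section 4), such a simulation reproduces the emerging stop-and-go wave that the controllers below are meant to dissipate.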
3 CONTROL METHODS
In this section, we present different methods to find optimal controls u for the AV. First, we describe a model predictive control approach that iteratively computes optimal controls for certain time periods. As a second model-based controller, we also set up a linear quadratic regulator.
Then, we present an approach to substitute the MPC controller by an imitation learning controller that mimics the MPC controller's behavior but requires much less computing time. Further, we compare this approach with another ML-based technique, reinforcement learning.
3.1 Model Predictive Control
To set up the MPC scheme, first, we define a time horizon T, a time shift τ such that τ < T < t_f, and a time grid

\{t_k\} = \{t_0, ..., t_f\}, \quad t_{k+1} - t_k = \tau.
Figure 2: Visualization of the MPC scheme.
The MPC idea is visualized in Figure 2 and can be summarized as follows: at each sampling step k, we observe the current state x_k and optimize the system over the time interval [t_k, t_k + T]. The resulting optimal control sequence u^{*}|_{[t_k, t_k + \tau]} up to time shift τ is then used as a feedback control for the next sampling interval (Grüne and Pannek, 2011). That means, at each step, we have to solve an updated version of the OCP stated in (8):
\min_{x,u} \; J(x,u) = \int_{t_k}^{t_k + T} l(x(t), u(t)) \, dt
\text{s.t.} \quad \dot{x}(t) = F(t, x(t), u(t)), \quad x(t_k) = x_k,
\quad u(t) \in U, \quad x(t) \in X, \quad t \in [t_k, t_k + T].    (9)
Typically, the resulting OCP (9) has to be solved numerically at each time instance, which may be time-consuming depending on the time horizon T. In this work, we apply a reduced discretization approach, cf. (Gerdts, 2011).
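A minimal receding-horizon step in the spirit of (9) is sketched below. It uses a coarse piecewise-constant control parametrization and a generic NLP solver instead of the reduced discretization of (Gerdts, 2011), omits the safety constraint (6) for brevity, and the equilibrium speed and acceleration bounds are placeholders; the weights and time grid (T = 30 s, τ = 2 s, q = 1, r = 5) match Section 4.1.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

def controlled_rhs(t, x, u, params):
    """F from (4): HDs follow the IDM, the AV (i = N) applies acceleration u."""
    N = len(params) + 1
    h, v = x[0::2], x[1::2]
    dx = np.empty_like(x)
    for i in range(N):
        lead = (i + 1) % N
        dx[2 * i] = v[lead] - v[i]
        if i < N - 1:  # human drivers
            dx[2 * i + 1] = idm_acceleration(h[i], v[i], v[lead], *params[i])
        else:          # the AV
            dx[2 * i + 1] = u
    return dx

def mpc_step(x_k, params, dt=2.0, T=30.0, r=5.0, q=1.0, v_star=7.0,
             u_bounds=(-1.5, 1.5)):
    """Solve a coarsely discretized version of OCP (9); return the first control."""
    m = int(T / dt)  # number of piecewise-constant control intervals

    def cost(u_seq):
        x, J = x_k.copy(), 0.0
        for u in u_seq:  # forward simulation of the horizon
            sol = solve_ivp(controlled_rhs, (0.0, dt), x,
                            args=(u, params), max_step=0.1)
            x = sol.y[:, -1]
            J += (r * u**2 + q * (v_star - x[1::2].mean())**2) * dt
        return J

    res = minimize(cost, np.zeros(m), bounds=[u_bounds] * m, method="L-BFGS-B")
    return res.x[0]  # feedback control applied on [t_k, t_k + tau], tau = dt
```

Repeating mpc_step every τ seconds with the newly observed state yields the receding-horizon loop of Figure 2.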
3.2 Linear Quadratic Regulator
Instead of solving OCP (9) numerically, another approach is to transform the problem into a linear quadratic problem. For these problems, we can solve the resulting differential Riccati equation analytically (Sontag, 1998; Locatelli, 2001). Here, the running cost

l(x, u) = u^{\top} R u + x^{\top} Q x

with R ∈ R and Q ∈ R^{2N×2N} is quadratic, and the right-hand side of the ODE is linear,

\dot{x}(t) = A x(t) + B u(t),

with A ∈ R^{2N×2N} and B ∈ R^{2N}.
In the following, we set up a linear quadratic regulator (LQR) similar to (Cui et al., 2017; Wang et al., 2020). First, we define an equilibrium state in which all vehicles travel with constant speed v*. Depending on the set of parameters β_IDM^i (cf. (3)), the constant equilibrium headways h_i* may differ for each HD. As all vehicles i travel with constant speed at the equilibrium state, the accelerations equal zero:

\dot{v}_i = f(h_i, v_i, v_{i+1}) = f(h_i^{*}, v^{*}, v^{*}) = 0, \quad i = 1, ..., N.
Then, we define a state that indicates the deviation from (h_i*, v*),

x_i = \begin{bmatrix} \tilde{h}_i \\ \tilde{v}_i \end{bmatrix} = \begin{bmatrix} h_i - h_i^{*} \\ v_i - v^{*} \end{bmatrix},

and apply a first-order Taylor expansion around f(h_i*, v*, v*):

f(h_i, v_i, v_{i+1}) \approx \frac{\partial f}{\partial h_i}(h_i^{*}) \, \tilde{h}_i + \frac{\partial f}{\partial v_i}(v^{*}) \, \tilde{v}_i + \frac{\partial f}{\partial v_{i+1}}(v^{*}) \, \tilde{v}_{i+1} = \alpha_{i1} \tilde{h}_i + \alpha_{i2} \tilde{v}_i + \alpha_{i3} \tilde{v}_{i+1}.
Here, α_i1, α_i2, α_i3 denote the partial derivatives of the IDM function (cf. (2)) at the equilibrium state for each vehicle and, similar to (1), we can describe the dynamics of the HDs as follows:

\dot{\tilde{h}}_i = \tilde{v}_{i+1} - \tilde{v}_i,
\dot{\tilde{v}}_i = \alpha_{i1} \tilde{h}_i + \alpha_{i2} \tilde{v}_i + \alpha_{i3} \tilde{v}_{i+1},
for i = 1, ..., N-1. Again, the acceleration behavior of the AV (i = N) is determined by the control input u:

\dot{\tilde{h}}_N = \tilde{v}_1 - \tilde{v}_N,
\dot{\tilde{v}}_N = u.
Finally, the linear dynamics for all vehicles are given by

\dot{x}(t) = A x + B u,

with matrices A and B defined as

A = \begin{bmatrix} C_1 & D_1 & 0 & \cdots & 0 \\ 0 & C_2 & D_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & C_{N-1} & D_{N-1} \\ D_N & 0 & \cdots & 0 & C_N \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ B_N \end{bmatrix},

and where the submatrices C_i, D_i, and B_N are given by

C_i = \begin{bmatrix} 0 & -1 \\ \alpha_{i1} & \alpha_{i2} \end{bmatrix}, \quad D_i = \begin{bmatrix} 0 & 1 \\ 0 & \alpha_{i3} \end{bmatrix}, \quad i = 1, ..., N-1,

C_N = \begin{bmatrix} 0 & -1 \\ 0 & 0 \end{bmatrix}, \quad D_N = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad B_N = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.
Still, we aim to keep control values low and the average speed high. By setting

R = r \in \mathbb{R}, \quad Q = \mathrm{diag}[0, q, 0, q, ..., 0, q] \in \mathbb{R}^{2N \times 2N},

we obtain the same running cost as in the previous section (cf. (7)). The LQR is then defined as follows:

\min_{x,u} \; J(x,u) = \int_{t_k}^{t_k + T} u(t)^{\top} R u(t) + x(t)^{\top} Q x(t) \, dt
\text{s.t.} \quad \dot{x}(t) = A x(t) + B u(t), \quad x(t_k) = x_k, \quad t \in [t_k, t_k + T].
Remark 2. For the LQR, the optimal control u can be computed analytically by solving the differential Riccati equation for the tuple (A, B, Q, R), which, in general, is much faster than solving the nonlinear OCP (9) numerically. However, due to the linearization, the description of the vehicle dynamics lacks accuracy, which may lead to an unsatisfying performance of the computed controls.

Further, in the LQR set-up, constraints such as (5) and (6) have to be neglected, so that infeasible solutions may be computed. There are, however, approaches for constrained LQRs based on constrained quadratic programming (Nocedal and Wright, 2006). But, in general, this requires additional constraint handling and typically leads to much higher computational loads as well. In this work, we combine the LQR approach with a so-called safe speed controller based on Gipps' safe speed model (Gipps, 1981; Treiber and Kesting, 2013), such that at each time step constraint (6) is satisfied and crashes are avoided.
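As an illustration, the block structure of A and B can be assembled directly; the sketch below then solves the algebraic Riccati equation as a steady-state stand-in for the finite-horizon differential Riccati equation used above. The α values are placeholders, not derivatives of a calibrated IDM, and solvability of the Riccati equation is assumed.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def build_lqr(alphas, r=5.0, q=1.0):
    """Assemble A, B from the blocks C_i, D_i and compute a feedback gain K.

    alphas: list of (alpha_i1, alpha_i2, alpha_i3) for the N-1 HDs."""
    N = len(alphas) + 1  # HDs plus one AV
    A = np.zeros((2 * N, 2 * N))
    for i, (a1, a2, a3) in enumerate(alphas):
        A[2*i:2*i+2, 2*i:2*i+2] = [[0.0, -1.0], [a1, a2]]    # C_i
        A[2*i:2*i+2, 2*i+2:2*i+4] = [[0.0, 1.0], [0.0, a3]]  # D_i
    A[2*N-2:2*N, 2*N-2:2*N] = [[0.0, -1.0], [0.0, 0.0]]      # C_N
    A[2*N-2:2*N, 0:2] = [[0.0, 1.0], [0.0, 0.0]]             # D_N closes the ring
    B = np.zeros((2 * N, 1)); B[-1, 0] = 1.0                 # B_N = [0, 1]^T
    Q = np.diag([0.0, q] * N)
    R = np.array([[r]])
    P = solve_continuous_are(A, B, Q, R)   # steady-state Riccati solution
    K = np.linalg.solve(R, B.T @ P)        # feedback law u = -K x
    return A, B, K

# placeholder IDM derivatives (alpha_1 > 0, alpha_2 < 0, alpha_3 > 0) for 21 HDs
A, B, K = build_lqr([(0.1, -0.5, 0.4)] * 21)
```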
3.3 Imitation Learning
In this section, we introduce a controller based on imitation learning (IL) with the goal of keeping the high performance of the nonlinear MPC approach while decreasing the required computing time. The general idea of IL is to imitate (or mimic) another controller at a certain task. This controller, for the ring road scenario, could be either an intelligent human driver or, as in our applications, a controller that has been realized using MPC.
In particular, by solving the control problem over a certain time period [t_0, t_f] numerically, we obtain for each time step t of a time grid {t_0, ..., t_f} a state x_t ∈ X of the state space X and a corresponding optimal control u_t ∈ U of the control space U. Now, we aim to find a function

\pi_{\theta} : X \to U

with parameters θ that maps optimally from state space to control space. Optimally, in this context, means as similarly as possible to the other controller. This can, e.g., be achieved by optimizing θ such that

\hat{\theta} = \arg\min_{\theta} \sum_{t=t_0}^{t_f} \| \pi_{\theta}(x_t) - u_t \|^2.    (10)

That is, for all states x_t, the resulting function output π_θ(x_t) should be as close as possible to the MPC-based controls u_t.
3.3.1 Neural Nets
A typical choice for π
θ
are multi-layer feed-forward
neural nets (NNs) because they are well known for
their capability of approximating nonlinear func-
tions (Cybenko, 1989; Hornik et al., 1989; Bishop,
2006). In general, they can be described by a com-
position of K layers, k = 1,... K, that all consist
of M
k
neurons. Additionally, an activation function
σ
k
: R R for each layer has to be specified. All
of these quantities, the number of layers K, the num-
ber of neurons for each layer M
k
and the activation
functions σ
k
are hyperparameters that are chosen pre-
liminary. Further, each layer has a weight matrix W
k
and bias vector b
k
that have to be optimized by a so-
called training procedure. That means, if we choose
a NN for π
θ
in the optimization problem (10), then θ
consists of W
1
,b
1
,..., W
K
,b
K
.
While the optimization (or training) may require
significant computing time, the NN outputs of a
trained net, i.e., the controls, can be computed very
fast in comparison to solving nonlinear control prob-
lems (cf. Section 4).
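A compact sketch of the training step (10) with a small tanh network could look as follows (here in PyTorch, whereas the paper's experiments use Matlab); the hidden width of 10 matches the extIL set-up in Section 4.1, and the states X and controls U are assumed to be pre-collected MPC data stacked into float tensors.

```python
import torch
import torch.nn as nn

def train_policy(X, U, hidden=10, epochs=500, lr=1e-3):
    """Fit pi_theta by minimizing the squared-error sum of (10)/(11).

    X: states of shape (num_samples, 2N); U: controls of shape (num_samples, 1)."""
    policy = nn.Sequential(  # K = 2 hidden layers with tanh activations
        nn.Linear(X.shape[1], hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, 1),
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss(reduction="sum")  # matches the sum over time steps
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(X), U)
        loss.backward()
        opt.step()
    return policy
```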
3.4 Extended Imitation Learning
So far, we have introduced an IL controller that shall mimic the MPC controller for one specific scenario. However, we aim to find a controller that performs well in different situations. Consequently, in the following, we assume that we have solved L different problems, leading to a data set of states and controls D = {(x_t^l, u_t^l)} over time steps t ∈ {t_0, ..., t_f} and trajectories l = 1, ..., L. These problems may differ in their initial values x_0 and in the set-ups of the dynamical system (4) induced by the car-following parameters β_IDM, cf. (3). Then, similar to (10), we aim to optimize the parameters of a function π_θ such that

\hat{\theta} = \arg\min_{\theta} \sum_{l=1}^{L} \sum_{t=t_0}^{t_f} \| \pi_{\theta}(x_t^l) - u_t^l \|^2.    (11)
Again, the function outputs π_θ(x_t^l) should be as similar as possible to the controls u_t^l. But, by summing not only over time steps t but also over trajectories l, different set-ups of the dynamical system can be taken into account. The goal of our approach is to be able to compute controls u even for states x that have not been observed exactly during training. To achieve this, it is crucial for the controller's performance to define a diverse and substantial data set D induced by the L different scenarios. A data-assembly sketch is given below.
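Reusing train_policy from Section 3.3.1, extending IL to L trajectories only changes the data assembly; this sketch assumes each trajectory is stored as a (states, controls) pair of arrays.

```python
import torch

def stack_trajectories(trajectories):
    """Build the data set D = {(x_t^l, u_t^l)} of (11) from L trajectories."""
    X = torch.cat([torch.as_tensor(x, dtype=torch.float32) for x, _ in trajectories])
    U = torch.cat([torch.as_tensor(u, dtype=torch.float32) for _, u in trajectories])
    return X, U

# e.g., with L = 10 MPC trajectories as in Section 4.1:
# X, U = stack_trajectories(mpc_data)
# policy = train_policy(X, U)
```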
3.5 Reinforcement Learning
As an additional alternative to the approaches presented here, we have applied a model-free reinforcement learning (RL) approach to the ring road control problem. In RL, we optimize a control mapping (often called a policy in the RL context) π_θ, which could be, e.g., a neural net. The main difference to the other control approaches is that, in general, the dynamical system, induced by the ODE (1), is unknown. Thus, several simulations of the dynamical system are needed such that the controller can observe different states and apply different controls. By interacting with the dynamical system, i.e., by observing the system's response, the goal is to find policy parameters θ such that a certain objective function, called the reward, is maximized.
Remark 3. In our implementation, we have optimized the RL agent with the Matlab Reinforcement Learning Toolbox and an actor-critic algorithm (The MathWorks, Inc., 2021). The policy π_θ is represented by a neural net with two hidden layers, 32 neurons per layer, and the tanh activation function.
For a more detailed description of RL in general,
some common algorithms, and its application for traf-
fic control, we refer to, e.g., (Sutton and Barto, 2018),
(Duan et al., 2016), and (Wu et al., 2017; Baumgart
and Burger, 2021), respectively.
4 RESULTS
In this section, we compare the different control approaches of Section 3 in terms of accuracy, robustness, performance, and computational efficiency. We show that the MPC controller performs well even if the dynamical system is not completely known and, for the extended IL controller, we study how many vehicles have to send data to achieve desirable results.
General Set-Up
For all scenarios, we let the vehicles start with equal headways and slightly vary each HD's driving behavior such that a stop-and-go wave occurs over time. To obtain different driving behavior, first, we introduce a nominal IDM parameter set β_IDM^nom = [16, 1, 2, 4, 1, 1.5] (cf. (3)). Then, before each simulation, we draw samples from a Gaussian distribution with the entries of β_IDM^nom as mean values and standard deviation σ. Thus, the current characteristics of each HD i can be described by the corresponding IDM parameter set β_IDM^i defined as

\beta_{IDM}^{i} \sim \mathcal{N}(\beta_{IDM}^{nom}, \sigma).    (12)
For each simulation l, we draw new parameters, which may be summarized by

B^{l} = \left[ \beta_{IDM}^{1,l}, \beta_{IDM}^{2,l}, ..., \beta_{IDM}^{22,l} \right].    (13)
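The sampling of (12)-(13) can be sketched as follows; the same scalar standard deviation is applied to every entry of the parameter vector, as the notation in (12) suggests, which is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# nominal set [v_des, T, h_min, delta, a_max, b] from above
beta_nom = np.array([16.0, 1.0, 2.0, 4.0, 1.0, 1.5])

def draw_parameter_set(sigma, n_vehicles=22):
    """Draw B^l = [beta^{1,l}, ..., beta^{22,l}] according to (12)-(13)."""
    return rng.normal(loc=beta_nom, scale=sigma, size=(n_vehicles, beta_nom.size))

B_l = draw_parameter_set(sigma=0.1)  # one heterogeneous parameter set
```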
Figure 3: Comparison of the MPC solution approaches in terms of the average speed over all vehicles and the AV's speed.
At the beginning, all vehicles are controlled by HDs with their corresponding parameters β_IDM^i to observe the emerging stop-and-go wave caused by heterogeneous driving behavior. Then, at t = 300 s, the AV control for vehicle i = N is switched on with the goal of stabilizing the traffic flow.
4.1 Comparison of Controllers
At first, we compare the model-based solution approaches of Sections 3.1 and 3.2. For both the nonlinear MPC (NLMPC) and the LQR approach, the resulting OCP is solved for time horizon T = 30 s, and the resulting controls are then applied with time shift τ = 2 s. In Figure 3, results are shown in terms of the average speeds over all vehicles and the AV's speeds for the following controllers for one parameter set B^l:
HD Baseline: First, all vehicles are controlled by HDs with homogeneous parameters β_IDM^nom such that no stop-and-go wave occurs.

HD: All vehicles are controlled by HDs but with heterogeneous parameters β_IDM^i. Without AV control, the stop-and-go wave evolves and does not vanish over time.

NLMPC and LQR: For both controllers, the same stop-and-go wave evolves until the AV control start. Then, the controllers are switched on to stabilize the traffic flow for the rest of the time period. For the running cost of (7), we set q = 1 and r = 5.
The results show that both controllers, after the AV control start, slow down at the beginning but then increase their speed, which stabilizes the traffic flow and leads to an average speed similar to the HD baseline scenario. However, the controllers differ in their performance (cf. Figure 3 and Table 1): while the NLMPC approach reaches a high average speed after the control start and stabilizes the traffic flow earlier, the LQR approach has much faster computing times. Further, for the NLMPC controller, the computing time clearly exceeds the time shift of τ = 2 s. That means, at least with our implementation of the controller, the NLMPC approach is not capable of being applied as a real-time feedback controller.
Extended Imitation Learning Controller
To achieve a similarly high performance as the NLMPC approach while reducing the computing time for each control, in Section 3.4, we have presented an approach to imitate the MPC controller. Thus, we generate a data set of L = 10 trajectories in which the control problem is solved with the NLMPC approach. For all trajectories l, we draw parameter sets B^train = {B_1^train, ..., B_10^train} according to (13) such that the HD driving behavior varies in each scenario and the controller is confronted with different stop-and-go situations at the control start. Then, we optimize the parameters θ of a NN according to (11).
The resulting NN of the extended IL controller is then applied to five test scenarios with parameter sets B^test = {B_1^test, ..., B_5^test} (which all differ from the parameter sets used to generate the training data). In addition to the controllers of Figure 3, in Table 1, we compare the following controllers for the test scenarios B^test:
extIL: Extended IL controller of Section 3.4. Here, we train a NN that consists of K = 2 layers, M_1 = M_2 = 10 neurons for each layer, and the tanh activation function (cf. Section 3.3.1).

RL: RL controller of Section 3.5 with hyperparameters and training algorithm as described in Remark 3. Experiments have shown that at least 1000 episodes (system simulations) are required to achieve desirable results.

PI Sat: PI with saturation controller of (Stern et al., 2018).
We compare the average speed and the standard deviation (SD) of the speed after the control start as well as the average times for stabilization and for computing the next control over all test scenarios B^test. For the NLMPC approach, the next control is computed online, whereas the ML-based techniques consist of an offline optimization (training) and an online computation of the controls.
We stress that our extended IL approach, which has been trained without direct knowledge of the dynamical system and with training data that differs from the test
Table 1: Comparison of different controllers at the ring road scenario in terms of average values over the test scenarios B^test. The average time to stabilize the system is defined as the time until all speeds v_1, ..., v_N are within ±0.3 m/s of the equilibrium speed.

Controller | Av. speed [m/s] | SD speed [m/s] | Av. time for stab. [s] | Av. comp. time [s]
HD         | 4.60            | 3.98            | -                      | -
NLMPC      | 6.61            | 1.35            | 72.1                   | 8.1
LQR        | 5.82            | 2.02            | 246.5                  | 0.053
extIL      | 6.57            | 1.40            | 131.2                  | 0.0041
RL         | 5.65            | 2.10            | 191.1                  | 0.0026
PI Sat     | 6.41            | 1.60            | 181.5                  | 0.0007
data, performs almost as well as the NLMPC approach, at least in terms of the average speed. However, in contrast to the NLMPC approach, it is real-time capable, because we only have to feed the current state forward through the NN to obtain the next control instead of solving an OCP numerically. Further, it can be observed that satisfying results can already be achieved with 10 training trajectories; thus, far fewer forward simulations are required than, e.g., for RL. The latter is due to the fact that, with the used MPC trajectory data, expert system knowledge is induced into the training procedure.
In Figure 4, we show the resulting trajectories of the two ML-based techniques, extIL and RL, as well as of the NLMPC approach for the same parameter set B^l as in Figure 3, which is part of the test data set B^test.
Figure 4: Comparison of the NLMPC approach with extIL and RL in terms of the average speed over all vehicles and the AV's speed.
4.2 Analysis of Robustness
In this section, we analyze the robustness of the extended IL approach in terms of the training data and of the observed data of other vehicles that is required for an online application.
4.2.1 Incorrect Description of the Underlying Dynamical System
For the NLMPC approach, we need to know the underlying dynamics of OCP (9). However, in real-world situations, it is a challenging task to find car-following parameters β_IDM^i that exactly describe the driving behavior of HDs. If we have enough observation data, we can calibrate the parameters of the car-following model (Kesting and Treiber, 2008). But even then, the driving behavior may change over time.
To simulate this situation, we let all HDs travel with one of the parameter sets B_l^test as in Section 4.1 such that the same stop-and-go wave occurs in each simulation. But now, we draw new parameter sets B^σ = {B_1^σ, ..., B_10^σ} for different standard deviations σ according to (12) and (13). That means, at each instance, the controller optimizes a system with parameters B_l^σ that differs from the simulated system induced by the parameters B_l^test. By increasing the standard deviation σ ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5}, we analyze how much the optimized system is allowed to differ from the simulated system. Table 2 shows average values over all vehicles and parameter sets B^σ.
Desirable results up to σ = 0.3 indicate a certain robustness of the NLMPC approach against an incorrect description of the underlying dynamics. This may be explained by the fact that the MPC scheme allows observing the real state at each sampling step and taking this real state as the initial value for the next optimization.
Table 2: Influence of wrong parameters B^σ on the performance of the NLMPC approach.

SD σ for β̂_IDM^i | Av. speed [m/s] | SD speed [m/s] | Av. time for stab. [s]
σ = 0   | 6.52 | 1.72 | 97.5
σ = 0.1 | 6.40 | 1.69 | 90.1
σ = 0.2 | 6.23 | 1.65 | 96.8
σ = 0.3 | 6.24 | 1.69 | 102.5
σ = 0.4 | 5.76 | 1.83 | 140.9
σ = 0.5 | 4.92 | 1.91 | 210.6
4.2.2 Data-Availability
In a real-world application, we may not obtain real-time data from all other vehicles, for instance, because they are unwilling or physically unable to send data. But even if (in the near future) we overcome these problems, it is not guaranteed that all vehicles in a certain traffic situation can be observed. We therefore analyze the impact of the data-availability of preceding vehicles for the extended IL controller.
In the following, we use the same training data set obtained with parameters B^train and the same hyperparameters for the NNs as in Section 4.1, but now the controller only obtains data from a limited number of leading vehicles during the training. That means, in addition to the NN for which data from all leaders is available, further NNs with input data from only a limited number of leading vehicles are trained. Then, all of them are applied to the test scenarios induced by B^test. In Table 3, we compare average values over all scenarios for different numbers of leading vehicles.
The results show that, indeed, the number of vehicles sending data may affect the controller's performance. Up to four leaders, the controller's performance in terms of the average speed increases, and the highest performance is reached when all data is available.

For the time to stabilize the system, the interpretation is not completely clear, because low values are not always achieved for the NNs with high average speed. One explanation is the fact that in some scenarios the controller dampens the stop-and-go wave relatively fast, which directly increases the average speed but may still require significant time to reach a completely stable traffic flow.

Consequently, as the results are already satisfying for a small number of leaders, it seems that not all data is necessary for an application of the extended IL controller.
Table 3: Influence of the number of leading vehicles sending data in terms of average values over the test scenarios B^test.

Num. of cons. leaders | Av. speed [m/s] | SD speed [m/s] | Av. time for stab. [s]
1   | 6.25 | 1.52 | 155.8
2   | 6.42 | 1.40 | 156.3
3   | 6.43 | 1.44 | 178.3
4   | 6.55 | 1.48 | 124.6
6   | 6.44 | 1.43 | 122.6
8   | 6.41 | 1.32 | 99.1
10  | 6.38 | 1.36 | 115.1
16  | 6.30 | 1.54 | 177.2
all | 6.57 | 1.40 | 131.2
Remark 4. To obtain the results of Table 3, we choose the same hyperparameters for each NN with K = 2 layers, M_1 = M_2 = 10 neurons for each layer, and the tanh activation function. But feedforward networks are typically initialized with random weights and biases, and, depending on the parameter initialization, typical training algorithms may lead to locally rather than globally optimal solutions of the objective function (11). We therefore use a Monte-Carlo-type approach with 50 parameter initializations for each of the NNs to make them more comparable.
5 SUMMARY AND CONCLUSION
We have presented different model-based and data-driven approaches to solve the ring road control problem. While the highest performance is
reached for the NLMPC approach and it performs
well even if the dynamical system differs from the
real system (cf. Section 4.2.1), its computing times
may prevent it from being real-time applicable. Thus,
we have replaced the MPC controller by an imitation
learning approach which can be extended to include
several different training scenarios. With this exten-
sion, the controller can perform well even in scenarios
that have not been observed exactly during the train-
ing and may not even require data from all vehicles in
the system (cf. Section 4.2.2). By imitating another
controller, expert knowledge about the dynamical sys-
tem is induced which leads to much faster training
than for other ML techniques like RL.
For the training, we require an at least basic understanding of the HD dynamics induced by the car-following model. But to run the controller in real-world applications, we need real-time data from other vehicles.
We expect that current developments in communica-
tion technologies will enable the realization of such
controllers in real-world traffic situations in the near
future.
Although its performance depends on the observed training data, there are advantages to applying data-driven controllers such as the extIL approach in real-world situations. That is, while the presented NLMPC controller is only valid for the closed ring road, in which we are able to describe the dynamics, the extIL controller may be applied in all set-ups in which data of one or several leading vehicles is available. As the stop-and-go waves and traffic jams occurring in, for instance, city traffic are similar to the ones at the ring road, we expect our extIL controller to perform well in these situations, too.
As in other applications in which ML techniques show promising results, there are still open questions regarding the use of NN-based controllers in safety-critical situations like steering a vehicle. While we have shown a certain robustness of our approach, guaranteed accuracy and stability are still hard to prove and remain, at least partially, open tasks. Further, the tuning of hyperparameters and the reproducibility of training results of neural nets are critical as well and may be addressed by using other structures like radial basis function (RBF) networks (Bishop, 2006).
In this work, we have focused on scenarios with only one intelligently controlled vehicle to emphasize that already one such vehicle suffices to counterbalance the emerging stop-and-go wave. However, taking into account several AVs may further improve the results and may lead to a more efficient counterbalancing effect, cf. (Chou et al., 2022).
In future work, we aim to further analyze the robustness and stability of our approach and of other AI-based techniques, especially to guarantee certain safety criteria such that the controller can be applied not only in artificial but also in realistic real-world traffic situations.
REFERENCES
Baumgart, U. and Burger, M. (2021). A reinforcement learning approach for traffic control. In Proceedings of the 7th International Conference on Vehicle Technology and Intelligent Transport Systems - VEHITS, pages 133–141. INSTICC, SciTePress.
Bishop, C. M. (2006). Pattern Recognition and Machine
Learning. Springer-Verlag New York.
Chou, F.-C., Bagabaldo, A. R., and Bayen, A. M. (2022).
The Lord of the Ring Road: A Review and Evaluation
of Autonomous Control Policies for Traffic in a Ring
Road. ACM Transactions on Cyber-Physical Systems
(TCPS), 6(1):1–25.
Cui, S., Seibold, B., Stern, R., and Work, D. B. (2017). Sta-
bilizing traffic flow via a single autonomous vehicle:
Possibilities and limitations. In 2017 IEEE Intelligent
Vehicles Symposium (IV), pages 1336–1341.
Cybenko, G. (1989). Approximation by superpositions of a
sigmoidal function. Mathematics of Control, Signals
and Systems, 2(4):303–314.
De Schutter, B. and De Moor, B. (1998). Optimal Traf-
fic Light Control for a Single Intersection. European
Journal of Control, 4(3):260 – 276.
Duan, Y., Chen, X., Houthooft, R., Schulman, J., and
Abbeel, P. (2016). Benchmarking Deep Reinforce-
ment Learning for Continuous Control. In Proceed-
ings of The 33rd International Conference on Ma-
chine Learning, volume 48 of Proceedings of Machine
Learning Research, pages 1329–1338.
Gazis, D. C., Herman, R., and Rothery, R. W. (1961). Non-
linear Follow-the-Leader Models of Traffic Flow. Op-
erations Research, 9(4):545–567.
Gerdts, M. (2011). Optimal Control of ODEs and DAEs.
De Gruyter.
Gipps, P. (1981). A behavioural car-following model for
computer simulation. Transportation Research Part
B: Methodological, 15(2):105–111.
Grüne, L. and Pannek, J. (2011). Nonlinear Model Predictive Control. Springer-Verlag, London.
Hegyi, A., Hoogendoorn, S., Schreuder, M., Stoelhorst, H.,
and Viti, F. (2008). SPECIALIST: A dynamic speed
limit control algorithm based on shock wave theory.
In 2008 11th International IEEE Conference on Intel-
ligent Transportation Systems, pages 827–832.
Helbing, D. (1997). Verkehrsdynamik. Springer-Verlag
Berlin Heidelberg.
Hornik, K., Stinchcombe, M., and White, H. (1989). Multi-
layer feedforward networks are universal approxima-
tors. Neural Networks, 2(5):359–366.
Kessels, F. (2019). Traffic Flow Modelling. Springer Inter-
national Publishing.
Kesting, A. and Treiber, M. (2008). Calibrating car-
following models by using trajectory data: Method-
ological study. Transportation Research Record,
2088(1):148–156.
Lighthill, M. J. and Whitham, G. B. (1955). On kinematic
waves. II. A theory of traffic flow on long crowded
roads. Proc. R. Soc. Lond. A, 229:317–345.
Locatelli, A. (2001). Optimal Control: An Introduction. Birkhäuser Verlag AG, Basel.
McNeil, D. R. (1968). A Solution to the Fixed-Cycle Traffic
Light Problem for Compound Poisson Arrivals. Jour-
nal of Applied Probability, 5(3):624–635.
Nocedal, J. and Wright, S. (2006). Numerical Optimization.
Springer Series in Operations Research and Financial
Engineering. Springer New York.
Orosz, G. (2016). Connected cruise control: modelling, de-
lay effects, and nonlinear behaviour. Vehicle System
Dynamics, 54(8):1147–1176.
Orosz, G., Wilson, R. E., and Stépán, G. (2010). Traffic jams: dynamics and control. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 368(1928):4455–4479.
Sontag, E. D. (1998). Mathematical Control Theory: Deter-
ministic Finite-Dimensional Systems. Springer-Verlag
New York.
Stern, R. E., Cui, S., Delle Monache, M. L., Bhadani, R.,
Bunting, M., Churchill, M., Hamilton, N., Haulcy,
R., Pohlmann, H., Wu, F., Piccoli, B., Seibold, B.,
Sprinkle, J., and Work, D. B. (2018). Dissipation
of stop-and-go waves via control of autonomous ve-
hicles: Field experiments. Transportation Research
Part C: Emerging Technologies, 89:205 – 221.
Sugiyama, Y., Fukui, M., Kikuchi, M., Hasebe, K.,
Nakayama, A., Nishinari, K., Tadaki, S.-i., and
Yukawa, S. (2008). Traffic jams without bottle-
necks—experimental evidence for the physical mech-
anism of the formation of a jam. New Journal of
Physics, 10(3):033001.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learn-
ing: An Introduction. Adaptive Computation and Ma-
chine Learning. MIT Press, 2nd edition.
The MathWorks, Inc. (2021). Matlab R2021b. Natick, Massachusetts, United States.
Treiber, M. and Kesting, A. (2013). Traffic Flow Dynamics.
Springer-Verlag Berlin Heidelberg.
Wang, J., Zheng, Y., Xu, Q., Wang, J., and Li, K.
(2020). Controllability Analysis and Optimal Control
of Mixed Traffic Flow With Human-Driven and Au-
tonomous Vehicles. IEEE Transactions on Intelligent
Transportation Systems, pages 1–15.
Wu, C., Kreidieh, A., Parvate, K., Vinitsky, E., and Bayen,
A. M. (2017). Flow: Architecture and Benchmarking
for Reinforcement Learning in Traffic Control. arXiv:
1710.05465.
Zheng, Y., Wang, J., and Li, K. (2020). Smoothing Traf-
fic Flow via Control of Autonomous Vehicles. IEEE
Internet of Things Journal, 7(5):3882–3896.