Model-Free versus Model-Based Reinforcement Learning for
Fixed-Wing UAV Attitude Control Under Varying Wind Conditions
David Olivares 1,2 a, Pierre Fournier 1 b, Pavan Vasishta 2 c and Julien Marzat 1 d
1 DTIS, ONERA, Université Paris-Saclay, 91123 Palaiseau, France
2 AKKODIS Research, 78280 Guyancourt, France
{david.olivares, pierre.fournier, julien.marzat}@onera.fr, pavan.vasishta@akkodis.com
a https://orcid.org/0009-0001-5021-3042
b https://orcid.org/0000-0002-0209-2901
c https://orcid.org/0000-0002-2976-8457
d https://orcid.org/0000-0002-5041-272X
Keywords:
Reinforcement Learning, Unmanned Aerial Vehicle, Fixed-Wing Unmanned Aerial Vehicle, Attitude Control,
Wind Disturbances.
Abstract:
This paper evaluates and compares the performance of model-free and model-based reinforcement learning for
the attitude control of fixed-wing unmanned aerial vehicles using PID as a reference point. The comparison
focuses on their ability to handle varying flight dynamics and wind disturbances in a simulated environment.
Our results show that the Temporal Difference Model Predictive Control agent outperforms both the PID con-
troller and other model-free reinforcement learning methods in terms of tracking accuracy and robustness over
different reference difficulties, particularly in nonlinear flight regimes. Furthermore, we introduce actuation
fluctuation as a key metric to assess energy efficiency and actuator wear, and we test two different approaches
from the literature: action variation penalty and conditioning for action policy smoothness. We also evaluate
all control methods when subject to stochastic turbulence and gusts separately, so as to measure their effects on
tracking performance, observe their limitations and outline their implications for the Markov decision process
formalism.
1 INTRODUCTION
Robotic controllers must act according to their state
and environment. In the context of fixed-wing un-
manned aerial vehicles (FWUAV) flight, modeling the
aircraft and its interactions with the environment is a
challenging task for two reasons.
First, the aerodynamic forces exhibit different dynamic regimes that depend on the attitude of the FWUAV. On the one hand, low attitude angles near the equilibrium state yield relatively simple dynamics, often approximated by assuming linearity and decoupling of the controlled axes. On the other hand, high attitude angles lead the FWUAV into complex aerodynamics where lift and drag behave nonlinearly and exhibit significant cross-couplings between axes.
A second source of complexity lies in FWUAVs potentially operating in disturbed environments. These disturbances are the result of unpredictable meteorological phenomena such as turbulence and wind gusts. As a result, effectively modeling FWUAV aerodynamics is often a tedious task required to design Control Theory (CT)-based controllers.
Reinforcement Learning (RL) offers a possible solution by alleviating the burden of system modeling, using data to approximate these models either implicitly (model-free RL) or explicitly (model-based RL).
Model-free RL (MF-RL) control for attitude con-
trol of a quadrotor UAV is presented as outperform-
ing the widely used PID in (Koch et al., 2019). Sim-
ilar conclusions are drawn by (Song et al., 2023b)
for acrobatic high-speed UAV racing. For FWUAVs,
MF-RL was introduced as a robust control approach
to deal with turbulence (Bøhn et al., 2019). As
for model-based RL (MB-RL), it showed recent
breakthroughs by outperforming MF-RL algorithms
in highly complex dynamical environments such as
games, locomotion and manipulation tasks (Hafner
et al., 2023; Hansen et al., 2024). We identify a gap
in literature for MB-RL applied to UAV flight: most
recent works focus on MB-RL's sample efficiency for
learning flying policies (Becker-Ehmck et al., 2020;
Lambert et al., 2019) and do not evaluate the perfor-
mance of the latest MB-RL algorithms. While not be-
ing the focus of our study, MB-RL often incorporates
concepts from Control Theory (CT) which can miti-
gate the blackbox aspect of MF-RL (by adding a plan-
ning step) and could be an additional benefit of this
class of methods. Our work aims to evaluate the ben-
efits of RL methods, in particular MB-RL versus MF-
RL, for attitude control of a FWUAV under varying
specific conditions (wind disturbances).
We focus on the attitude control problem, where
the controller has to track reference attitude angles for
roll and pitch. First, we choose state-of-the-art MF-
RL methods such as Proximal Policy Optimization
(PPO) and Soft Actor-Critic (SAC). Second, we select
Temporal Difference Model Predictive Control (TD-
MPC), a MB-RL method that mixes concepts from
RL such as temporal difference learning and concepts
from CT for predictive planning. Finally, we include
a PID controller as a reference point, because of its
very wide use in UAV attitude control.
Our contributions are as follows:
• We propose the first application of a recent MB-RL method (TD-MPC) to FWUAV attitude control under varying wind conditions.
• Our comparison provides insights for FWUAV practitioners on current state-of-the-art RL algorithms with respect to the industry-standard PID.
• We exhibit TD-MPC's superiority for pure reference tracking in the case of nominal wind conditions (no turbulence and no gusts), most notably on hard attitude angle references.
• We improve this analysis with two additional studies:
  – A secondary metric for actuation fluctuation shows that RL, be it model-free or model-based, struggles to produce smooth action outputs. We evaluate two methods from the literature to deal with this issue.
  – In the presence of wind perturbations and when evaluated specifically on this aspect, the gains of RL methods appear more limited than suggested by previous work (Bøhn et al., 2019).
• For replicability and to facilitate comparison among contributors, we release our code for the RL-compatible simulation framework 1 based on the open-source flight simulator JSBSim (Berndt, 2004) and for the control algorithms 2 presented in this work.
1 https://github.com/Akkodis/FW-JSBGym
2 https://github.com/Akkodis/FW-FlightControl
2 RELATED WORKS
Learning-based methods appear as a promising tool
to achieve nonlinear and perturbation resilient con-
trollers for UAVs. An example of learning with CT
is the field of adaptive control, which now includes
learning-based approaches where a neural network
learns a model of the problematic part of the dy-
namics (disturbances, modeling errors) and uses it as
a feed-forward component inside a classical controller (Shi et al., 2019; Doukhi and Lee, 2019;
Shi et al., 2021). These studies displayed encourag-
ing results for jointly using a learned model of the un-
known dynamics together with a mathematical model
of the known nominal dynamics.
2.1 Reinforcement Learning Control of
UAVs
However, when the nominal dynamics are not given,
RL has proven to be a powerful choice for learn-
ing controllers without any prior knowledge of the
system. It has shown impressive results for deci-
sion making in games (Mnih et al., 2013; Schrit-
twieser et al., 2020) and in robotics (Hansen et al.,
2024; Hafner et al., 2023) such as dexterous robotic
hand manipulation (Andrychowicz et al., 2020) or
quadruped locomotion (Lee et al., 2020; Peng et al.,
2020) and can be classified into model-free and
model-based approaches. Since rotary-wing and
fixed-wing UAVs share similarities, methods applica-
ble to one may transfer effectively to the other.
2.1.1 Model-Free RL
In model-free RL (MF-RL), the agent learns to take
actions through trial and error without learning or us-
ing an explicit model of the environment’s dynamics.
For rotary-wing UAVs, MF-RL has been applied as
a flight controller for waypoint navigation (Hwangbo
et al., 2017) and the recent work of (Koch et al., 2019)
has applied various RL algorithms in order to learn
attitude control policies outperforming their base-
line PID controller counterpart. A proof-of-concept
Deep Deterministic Policy Gradient (DDPG) agent
has demonstrated encouraging attitude control perfor-
mance despite exhibiting steady state errors (Tsour-
dos et al., 2019). Recent work has highlighted the
benefits of RL for trajectory tracking over classical
two-stage controllers (a high-level stage for trajectory
planning and a lower-level stage for attitude control)
(Song et al., 2023b). The authors concluded that the
cascading of two control loops splits the overall way-
point tracking objective into two sub-objectives lead-
ing to suboptimal solutions, whereas RL can optimize
the higher-level waypoint objective directly.
For FWUAVs, experiments have been conducted
in the context of high performance control in a com-
bat environment of an F-16 fighter jet (De Marco
et al., 2023). RL has also been studied for addressing
specific problems on civil aircraft (Liu et al., 2021),
where the authors used vision-based observations to
achieve takeoff under crosswind conditions. A study
(Bøhn et al., 2019) compared RL and PID attitude
controllers under wind disturbances and both meth-
ods achieved similar performance, paving the way
for the deployment of a RL control law in a real
FWUAV (Bøhn et al., 2023). The former (Bøhn et al.,
2019) demonstrated PPO’s resilience to stochastic
turbulence in simulation and the latter demonstrated
SAC as a data-efficient algorithm for real-world de-
ployment of the RL attitude controller. We use those
two works as starting points and extend the analysis
with a MB-RL method (TD-MPC) and under an addi-
tional disturbance source: wind gusts.
2.1.2 Model-Based RL
Model-based RL (MB-RL) is a branch of RL where
the agent, through trial and error too, separately learns
a model of the environment and an action planning
strategy from this model. MB-RL is often more sam-
ple efficient, requiring fewer training episodes than
MF-RL algorithms to converge. This observation
was exploited in (Becker-Ehmck et al., 2020) where
the authors trained a controller for waypoint naviga-
tion of a rotary-wing UAV, directly in the real world.
(Lambert et al., 2019) present a sample efficient MB-
RL method for hovering control using a “random-
shooter” MPC for simulating candidate trajectories
obtained with a learned model of the dynamics. A
higher-level trajectory tracking off-line MB-RL con-
troller is presented by (Liang et al., 2018), lever-
aging data trajectories obtained on multiple UAVs.
Current MB-RL literature for UAVs emphasizes MB-RL's sample efficiency for learning flying policies, yet
does not assess the performance of the latest MB-RL
algorithms. Moreover, we found no studies regarding
MB-RL applied to FWUAV flight under varying wind
conditions. With the present paper, we intend to ex-
tend the existing literature by evaluating the benefits
of MF-RL versus MB-RL for this specific setting.
3 UAV MODEL
3.1 Kinematics
The studied FWUAV is the Skywalker x8 as in (Bøhn
et al., 2019). The FWUAV is modeled as a rigid-body
mass m with inertia matrix I. According to (Beard and
McLain, 2012), the following kinematic equations ap-
ply to the position p = [x, y, z]^T and the orientation quaternion q:

\dot{p} = R_b^n(q) v    (1)

\dot{q} = \frac{1}{2} \begin{bmatrix} 0 & -ω^T \\ ω & -S(ω) \end{bmatrix} q    (2)
where R_b^n is the rotation matrix from the body frame {b}, attached to the FWUAV center of gravity, to the inertial North-East-Down (NED) world frame {n}, while v = [u, v, w]^T and ω = [p, q, r]^T are the linear and angular body velocities respectively, and S(·) is the skew-symmetric matrix operator.
The expressions of the linear and angular velocities v and ω are derived from the Newton-Euler equations of motion:

m(\dot{v} + ω × v) = R_b^n(q)^T m g^n + F_prop + F_aero    (3)

I\dot{ω} + ω × Iω = M_prop + M_aero    (4)

where g^n is the gravity vector expressed in the inertial frame. The other terms, F_aero and M_aero, are the aerodynamic forces and moments, while F_prop and M_prop are the propulsion forces and moments.
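For concreteness, the following NumPy sketch shows how Eqs. (1)-(4) can be turned into code. It is illustrative only: JSBSim integrates the full flight dynamics internally, and the function names and integration style shown here are our own assumptions.

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix S(w) such that S(w) @ v == np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def quat_dot(q, omega):
    """Quaternion kinematics of Eq. (2), with q = [qw, qx, qy, qz]."""
    Omega = np.zeros((4, 4))
    Omega[0, 1:] = -omega
    Omega[1:, 0] = omega
    Omega[1:, 1:] = -skew(omega)
    return 0.5 * Omega @ q

def body_accelerations(v, omega, R_nb, m, I, g_n, F_prop, F_aero, M_prop, M_aero):
    """Translational and rotational accelerations from Eqs. (3)-(4).

    R_nb is the rotation matrix from the body to the inertial frame, so its
    transpose maps the inertial gravity vector g_n into the body frame.
    """
    v_dot = (R_nb.T @ (m * g_n) + F_prop + F_aero) / m - np.cross(omega, v)
    omega_dot = np.linalg.solve(I, M_prop + M_aero - np.cross(omega, I @ omega))
    return v_dot, omega_dot
```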
3.2 Aerodynamic Forces and Moments
3.2.1 Wind Disturbances
v_r^b is the velocity of the FWUAV with respect to the relative wind:

v_r^b = v_g^b - v_w^b    (5)

with v_g^b the velocity vector of the aircraft relative to the ground and v_w^b the wind velocity in the body frame. The wind velocity can be decomposed into two parts: a steady ambient wind component v_{w_s}^n expressed in the inertial frame and a stochastic process v_{w_g}^b representing wind turbulence, expressed in the body frame:

v_w^b = R_n^b(q) v_{w_s}^n + v_{w_g}^b,   so that   v_r^b = [u_r, v_r, w_r]^T    (6)

Similarly, the angular rate relative to the air comprises two terms: the nominal angular rate of the FWUAV ω^b and the additional turbulent angular rate ω_w^b:

ω_r^b = ω^b - ω_w^b = [p_r, q_r, r_r]^T    (7)
The stochastic processes modeling the additional turbulence correspond to white noise passed through filters given by the Dryden spectrum model (Beard and McLain, 2012; mil, 1980). The airspeed V_a, angle of attack α and sideslip angle β are defined as:

V_a = \sqrt{u_r^2 + v_r^2 + w_r^2}    (8)

α = tan^{-1}(w_r / u_r),   β = sin^{-1}(v_r / V_a)    (9)
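The air-data relations of Eqs. (5)-(9) can be computed as in the following sketch; it is illustrative, the variable names are assumptions, and arctan2 is used in place of a plain arctangent for numerical robustness.

```python
import numpy as np

def air_data(v_g_b, v_ws_n, v_wg_b, R_bn):
    """Relative velocity, airspeed, angle of attack and sideslip, Eqs. (5)-(9).

    v_g_b : ground velocity in the body frame
    v_ws_n: steady ambient wind in the inertial (NED) frame
    v_wg_b: turbulent wind component in the body frame
    R_bn  : rotation matrix from the inertial to the body frame
    """
    v_w_b = R_bn @ v_ws_n + v_wg_b            # total wind in the body frame, Eq. (6)
    u_r, v_r, w_r = v_g_b - v_w_b             # relative velocity components, Eq. (5)
    V_a = np.sqrt(u_r**2 + v_r**2 + w_r**2)   # airspeed, Eq. (8)
    alpha = np.arctan2(w_r, u_r)              # angle of attack, Eq. (9)
    beta = np.arcsin(v_r / V_a)               # sideslip angle, Eq. (9)
    return V_a, alpha, beta
```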
3.2.2 Aerodynamic Model
The flying-wing Skywalker x8 is not equipped with a rudder. The available control inputs consist of the aileron and elevator deflection angles, δ_a and δ_e, and a throttle command δ_t. The translational aerodynamic forces F_aero are the drag D, the lateral force Y and the lift L. They are expressed in the wind frame and can be mapped to the body frame as:

F_aero = R_w^b(α, β) [-D, Y, -L]^T    (10)

[D, Y, L]^T = \frac{1}{2} ρ V_a^2 S [C_D(α, β, q_r, δ_e), C_Y(β, p_r, r_r, δ_a), C_L(α, q_r, δ_e)]^T    (11)
These nonlinear forces are typically described by C_D, C_Y and C_L, a set of coefficients determined by computational fluid dynamics simulations and/or experimental wind tunnel data. The expressions of the aerodynamic moments M_aero follow the same logic:
M_aero = \frac{1}{2} ρ V_a^2 S [b C_l(β, p_r, r_r, δ_a), c C_m(α, q_r, δ_e), b C_n(β, p_r, r_r, δ_a)]^T    (12)

where ρ is the air density, V_a the airspeed, S the wing area, b the wingspan and c the aerodynamic chord. The aerodynamic coefficients and geometric parameters of the x8 FWUAV are taken from (Gryte et al., 2018).
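As an illustration, the aerodynamic forces and moments of Eqs. (10)-(12) can be evaluated as below, assuming the coefficient functions of (Gryte et al., 2018) have already been evaluated at the current flight condition; the interface is hypothetical and the drag/lift sign convention follows the standard wind-frame definition.

```python
import numpy as np

def aero_forces_moments(rho, V_a, S, b, c, coeffs, R_bw):
    """Aerodynamic forces and moments of Eqs. (10)-(12).

    coeffs : dict with the evaluated coefficients C_D, C_Y, C_L, C_l, C_m, C_n
    R_bw   : rotation matrix from the wind frame to the body frame
    """
    q_bar_S = 0.5 * rho * V_a**2 * S                     # dynamic pressure times wing area
    D, Y, L = q_bar_S * np.array([coeffs["C_D"], coeffs["C_Y"], coeffs["C_L"]])
    F_aero = R_bw @ np.array([-D, Y, -L])                # drag and lift act along -x_w and -z_w
    M_aero = q_bar_S * np.array([b * coeffs["C_l"],
                                 c * coeffs["C_m"],
                                 b * coeffs["C_n"]])
    return F_aero, M_aero
```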
3.2.3 Propulsion Forces and Moments
The propeller of the x8 FWUAV is located at the back of the airframe and aligned with the body frame's x-axis such that F_prop = [T_p, 0, 0]^T. In the current simulation, a simplified model of the engine and the propeller is used, which provides a linear relation T_p = C δ_t between the thrust force T_p and the throttle command δ_t given in the [0, 1] interval, with C = 5.9.
3.3 Actuator Dynamics and Constraints
Input commands in the simulator are normalized in the range [-1, 1] for the control surface deflections (δ_a, δ_e) and in [0, 1] for the throttle δ_t. Following (Bøhn et al., 2019), the aileron and elevator dynamics are described by rate-limited and saturated second-order transfer functions with natural frequency ω_0 = 100 and damping ratio ζ = 1/2. The control surface deflection angles and rates are limited to ±30° and ±200°/s. The throttle dynamics are modeled by a first-order transfer function, with gain K = 1 and time constant T = 0.2.
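A minimal sketch of these actuator models is given below. The class and function names are ours, ω_0 is assumed to be expressed in rad/s, and a semi-implicit Euler discretization at the 0.01 s simulation step is used for illustration.

```python
import numpy as np

class ControlSurfaceActuator:
    """Second-order actuator (w0 = 100, zeta = 0.5) with position and rate
    saturation, as described in Section 3.3 (illustrative discretization)."""

    def __init__(self, dt=0.01, w0=100.0, zeta=0.5,
                 max_pos=np.deg2rad(30.0), max_rate=np.deg2rad(200.0)):
        self.dt, self.w0, self.zeta = dt, w0, zeta
        self.max_pos, self.max_rate = max_pos, max_rate
        self.pos, self.rate = 0.0, 0.0

    def step(self, command):
        """Advance one step towards the commanded deflection (rad)."""
        acc = self.w0**2 * (command - self.pos) - 2.0 * self.zeta * self.w0 * self.rate
        self.rate = np.clip(self.rate + acc * self.dt, -self.max_rate, self.max_rate)
        self.pos = np.clip(self.pos + self.rate * self.dt, -self.max_pos, self.max_pos)
        return self.pos

def throttle_step(state, command, dt=0.01, K=1.0, T=0.2):
    """First-order throttle dynamics with gain K and time constant T."""
    return state + dt * (K * command - state) / T
```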
4 CONTROL METHODS
In this study, the controller's task is to track desired references φ_d and θ_d for the roll angle φ and pitch angle θ of the aforedescribed FWUAV when subject to various wind disturbances. Following (Bøhn et al., 2023), the airspeed remains controlled by a PI controller driving the throttle command δ_t to maintain a nominal airspeed, with the gains tuned as in (Bøhn et al., 2019): k_{p_V} = 0.5 and k_{i_V} = 0.1. This airspeed PI controller is common to all the following controllers. This choice was motivated by the fact that some generated references can make the desired pitch contradict the desired airspeed, e.g. a low desired airspeed combined with a nose-down desired pitch. If one were to control a desired altitude instead, the outer loop would provide coherent desired states to the inner attitude control loops, but this is beyond the scope of the present work.
4.1 PID Control
PID control is widely used for UAV attitude control because of its ease of use and is therefore considered as a reference point in the RL literature for UAV control (Bøhn et al., 2019; Bøhn et al., 2023; Koch et al., 2019; Lambert et al., 2019). We fine-tuned the PID gains given by (Bøhn et al., 2019) to better suit our simulated loops in nominal conditions, i.e. with no wind disturbances, with the gains presented in Table 1.

Table 1: PID controller parameters.
Parameter   Value     Parameter   Value
k_{pφ}      1.5       k_{pθ}      2
k_{iφ}      0.1       k_{iθ}      0.3
k_{dφ}      0.1       k_{dθ}      0.1
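For reference, a per-axis PID of this form can be sketched as follows with the gains of Table 1. The exact discrete implementation and anti-windup scheme of our controller are not detailed in the text, so this is only an illustration.

```python
class AttitudePID:
    """Per-axis PID acting on the attitude error (sketch, gains from Table 1)."""

    def __init__(self, kp, ki, kd, dt=0.01):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, 0.0

    def __call__(self, reference, measurement):
        error = reference - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

roll_pid = AttitudePID(kp=1.5, ki=0.1, kd=0.1)    # drives the aileron command
pitch_pid = AttitudePID(kp=2.0, ki=0.3, kd=0.1)   # drives the elevator command
```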
4.2 Reinforcement Learning (RL)
Control
In RL, the control problem is formulated as that of finding an optimal action strategy in an environment modeled as a Markov Decision Process (MDP) (Sutton and Barto, 2018) comprised of:
• A set of actions a ∈ A, here the same as in the roll and pitch PID loops, with the superscript c denoting a commanded position of the aileron and elevator: a = (δ_a^c, δ_e^c).
• A set of states s ∈ S, based on (Bøhn et al., 2023) and assembled as sketched after this list:
  – the attitude angles: roll φ and pitch θ;
  – the airspeed V_a;
  – the angular rates ω_r^b = [p_r, q_r, r_r]^T;
  – the angle of attack α and the sideslip angle β;
  – the roll and pitch errors:
    e_φ = φ_d - φ,   e_θ = θ_d - θ    (13)
  – the roll and pitch integral errors:
    I_{e_*} = I_{e_*} + e_* · dt,   * ∈ {φ, θ},   dt = 0.01    (14)
    where dt is the simulation period, also equal to the control period. The integral error is reset at the beginning of each episode or at each target change.
  – the last action taken by the agent: δ_{a|t-1}^c and δ_{e|t-1}^c.
• A transition function T, defining a distribution over the next states given a current state and action: P(s_{t+1}|s_t, a_t) = T(s_t, a_t).
• A scalar reward function R_t, outputting a score that rewards the agent for good actions and penalizes bad ones, thereby shaping the agent's task.
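A sketch of how such an observation vector can be assembled is given below; the uav interface, the dictionary containers and the ordering of the components are illustrative assumptions rather than the released code's API.

```python
import numpy as np

def build_observation(uav, ref, integ_err, last_action, dt=0.01):
    """Assemble the RL state vector listed above (ordering illustrative).

    uav         : object exposing the measured quantities (hypothetical interface)
    ref         : dict with the desired roll and pitch angles
    integ_err   : dict accumulating the integral errors, updated in place (Eq. 14)
    last_action : previous commanded aileron and elevator positions
    """
    e_phi = ref["phi"] - uav.roll                 # Eq. (13)
    e_theta = ref["theta"] - uav.pitch
    integ_err["phi"] += e_phi * dt                # Eq. (14)
    integ_err["theta"] += e_theta * dt
    return np.array([uav.roll, uav.pitch, uav.airspeed,
                     uav.p_r, uav.q_r, uav.r_r,
                     uav.alpha, uav.beta,
                     e_phi, e_theta,
                     integ_err["phi"], integ_err["theta"],
                     last_action[0], last_action[1]])
```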
The reward function has been adapted from (Bøhn et al., 2019) as follows:

R_φ = clip(|φ - φ_d| / ζ_φ, 0, c_φ),   R_θ = clip(|θ - θ_d| / ζ_θ, 0, c_θ)
R_t = -(R_φ + R_θ)
ζ_φ = 3.3,  ζ_θ = 2.25,  c_φ = c_θ = 0.5    (15)

where the ζ coefficients scale the roll and pitch errors to account for their different ranges. Each reward component is clipped to 0.5, so the minimal total reward equals -1.
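A minimal sketch of Eq. (15):

```python
import numpy as np

def attitude_reward(phi, theta, phi_d, theta_d,
                    zeta_phi=3.3, zeta_theta=2.25, c_phi=0.5, c_theta=0.5):
    """Tracking reward of Eq. (15); each clipped component is at most 0.5,
    so the total reward lies in [-1, 0]."""
    r_phi = np.clip(abs(phi - phi_d) / zeta_phi, 0.0, c_phi)
    r_theta = np.clip(abs(theta - theta_d) / zeta_theta, 0.0, c_theta)
    return -(r_phi + r_theta)
```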
RL algorithms aim to discover the best mapping π from a state to an action (a control law), referred to in the RL literature as the optimal policy π*, defined as the policy leading to trajectories τ = (s_0, a_0, s_1, a_1, ...) that maximize the expected discounted sum of accumulated rewards:

π* = argmax_π E_{τ∼π} [ Σ_{t=0}^{∞} γ^t R_t(s_t, a_t) ]    (16)

with γ being the discount factor. This policy, often represented by an artificial neural network (ANN) with learnable parameters Θ, serves as a nonlinear controller.
4.2.1 Model-Free RL: PPO & SAC
We choose Proximal Policy Optimization (PPO) (Schulman et al., 2017) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018) to represent the MF-RL end of the control spectrum because of their state-of-the-art performance in robotic applications with continuous action spaces. SAC was additionally selected because its policy update is similar to that of the mixed method used in this work (TD-MPC).
4.2.2 Model-Based Method: TD-MPC
Temporal Difference Model Predictive Control (TD-MPC) (Hansen et al., 2022) is an MB-RL, mixed control method combining an explicit dynamics model and a terminal value function learned with temporal difference learning in an RL setting. We use its latest implementation from (Hansen et al., 2024).
Trainable Models. TD-MPC learns a Task-Oriented Latent Dynamics (TOLD) model comprised of several MLPs (Multi-Layer Perceptrons) with trainable parameters Θ. They model, most notably, the agent's dynamics, the reward function R_Θ, the Q-value function Q_Θ and the policy. During training, a trajectory of prediction horizon H is sampled from a replay buffer in an off-policy RL fashion. Using this sampled trajectory, the TOLD model is trained by optimizing a loss function including the reward prediction error, the dynamics model error and a Bellman error component for Q_Θ. The policy is learned by maximizing an objective similar to that of the SAC algorithm (Haarnoja et al., 2018).
Planning. TD-MPC uses a control-theoretic basis for planning, namely Model Predictive Path Integral control (MPPI) (Williams et al., 2015). MPPI is a sampling-based model-predictive control (MPC) algorithm that iteratively learns the parameters of a multivariate Gaussian distribution over action sequences using importance sampling. First, TD-MPC simulates candidate trajectories Γ of horizon H by sampling actions from both the Gaussian distribution and the learned policy π_Θ, and by recursively predicting the next latent states with the learned dynamics model. One of TD-MPC's key ideas is to use the learned Q-value function Q_Θ to estimate the terminal value when computing a trajectory's total return φ_Γ. Here, Q_Θ estimates the value of the trajectory beyond the horizon limit, emulating infinite-horizon MPC:

φ_Γ = E_Γ [ γ^H Q_Θ(z_H, a_H) + Σ_{t=0}^{H-1} γ^t R_Θ(z_t, a_t) ]    (17)

where the first term is the terminal value estimated by Q_Θ and the second term is the return of the candidate trajectory over the time horizon H.
As illustrated in Figure 1, after selecting the trajectories with the best returns, MPPI updates the parameters of the action Gaussian using importance sampling, see (Hansen et al., 2022). The action Gaussian is then sampled and the first action is sent to the actuators, as in receding-horizon MPC.
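The planning step can be summarized by the schematic sketch below. It omits several details of the actual TD-MPC implementation (policy-prior trajectories, momentum on the distribution parameters, latent encoding) and treats the learned TOLD components as given callables.

```python
import numpy as np

def mppi_plan(z0, dynamics, reward, q_value, horizon=3, iterations=6,
              num_samples=512, num_elites=64, gamma=0.99, temperature=0.5,
              action_dim=2):
    """Schematic MPPI planning step with a learned terminal value (Eq. 17).

    dynamics, reward and q_value stand in for the learned TOLD components.
    """
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        noise = np.random.randn(num_samples, horizon, action_dim)
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        returns = np.zeros(num_samples)
        for i in range(num_samples):
            z = z0
            for t in range(horizon):
                returns[i] += gamma**t * reward(z, actions[i, t])
                z = dynamics(z, actions[i, t])
            # Terminal value estimated by Q; the last planned action is used
            # here as a stand-in for the policy action at z_H.
            returns[i] += gamma**horizon * q_value(z, actions[i, -1])
        elite_idx = np.argsort(returns)[-num_elites:]
        weights = np.exp(temperature * (returns[elite_idx] - returns[elite_idx].max()))
        weights /= weights.sum()
        mean = np.average(actions[elite_idx], axis=0, weights=weights)
        std = np.sqrt(np.average((actions[elite_idx] - mean)**2, axis=0, weights=weights)) + 1e-6
    # Receding horizon: sample around the updated mean and apply only the first action.
    return np.clip(mean[0] + std[0] * np.random.randn(action_dim), -1.0, 1.0)
```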
Figure 1: TD-MPC control block diagram. 1) The trained TOLD model is used for simulating candidate trajectories and estimating their returns. 2) Importance sampling is used to update the action Gaussian distribution. 3) An action is sampled from this distribution and applied to the environment, which returns the next state and reward used for TOLD training.
5 EXPERIMENTAL SETUP
5.1 Simulation
We conducted our experiments in the JSBSim flight
simulator (Berndt, 2004) because of its ability to han-
dle fixed-wing aircraft flight dynamics and to model
all the studied wind disturbances: constant wind,
wind gusts and stochastic turbulences. JSBSim was
preferred over the PyFly simulator presented in (Bøhn
et al., 2019) because of its wide use in the aerospace
open-source and research communities and its built-
in wind gust generation (Moallemi and Towhidnejad,
2016; Mathisen et al., 2021; De Marco et al., 2023).
We are interested in separating the performance of the studied control methods between linear and nonlinear regimes. Therefore, we introduce a classification of attitude angle references in Table 2, with the nominal and hard levels representing linear and nonlinear regions of the state space respectively.

Table 2: Reference classification.
Difficulty   Roll (°)                    Pitch (°)
Nominal      [-45, 45]                   [-25, 25]
Hard         [-60, -45] ∪ [45, 60]       [-30, -25] ∪ [25, 30]
5.2 Implementation Details
For the optimization algorithms, we used the
CleanRL PPO & SAC implementations (Huang et al.,
2022) with the parameters presented in Tables 3 & 4.
We used the most recent implementation of TD-MPC,
TD-MPC2 (Hansen et al., 2024) with the parameters
specified in Table 5.
We trained the RL agents for 375 episodes of 2000 timesteps each, i.e. 20 seconds of flight time at 100 Hz. At the start of every episode, we initialize the FWUAV at 600 m above sea level, with a nominal speed of 17 m/s and roll, pitch and yaw angles {φ, θ, ψ} = 0. We compute the mean and standard error of the mean over 5 runs for each metric. Training was conducted on an RTX A6000 GPU with 48 GB of memory.
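The actual training uses the CleanRL and TD-MPC2 implementations; the loop below only illustrates the episode structure described above, with a hypothetical Gymnasium-style environment and agent interface rather than the exact API of the released code.

```python
NUM_EPISODES, EPISODE_STEPS = 375, 2000   # 20 s of flight per episode at 100 Hz

def train(env, agent):
    for episode in range(NUM_EPISODES):
        obs, _ = env.reset()              # FWUAV at 600 m, 17 m/s, level attitude
        for step in range(EPISODE_STEPS):
            action = agent.act(obs)                       # (aileron, elevator) in [-1, 1]
            next_obs, reward, terminated, truncated, _ = env.step(action)
            agent.observe(obs, action, reward, next_obs)  # update replay buffer / rollout
            obs = next_obs
            if terminated or truncated:
                break
```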
Table 3: PPO Parameters.
Parameter            Value    Parameter          Value
LR                   3e-4     Entropy coef       1e-2
Parallel envs        6        Value fn coef      0.5
Rollout steps        2048     Total timesteps    750k
Discount factor γ    0.99     Minibatches        32
GAE λ                0.95     Batch size         12288
PPO clip             0.2
Table 4: SAC Parameters.
Parameter              Value    Parameter              Value
Buffer size            1e6      Q-net LR               1e-3
Discount factor γ      0.99     π update freq          2
Q_targ smoothing τ     5e-3     Q_targ update freq     1
Batch size             256      Total timesteps        750k
Seed steps             5000     Noise clip             0.5
Policy LR              3e-4     Entropy reg            0.2
Table 5: TD-MPC Parameters.
Parameter              Value     Parameter          Value
Buffer size            1e6       Total timesteps    750k
Discount factor γ      0.99      Temporal coef      0.5
Target smoothing τ     0.01      MPPI Iterations    6
Batch size             256       MPPI Samples       512
Seed steps             10 000    MPPI Elites        64
LR                     3e-4      MPPI π Trajs       24
Reward coef            0.1       Horizon            3
Value coef             0.1       Temperature        0.5
Consistency coef       20
6 RESULTS AND DISCUSSIONS
Our first experiment compares the RL agents in nominal, non-disturbed conditions with respect to reference tracking only (Section 6.1.1). We also show that RL agents tend to converge towards policies that produce oscillating actions, and we evaluate two strategies to mitigate this issue (Section 6.1.2). We then study the impact of turbulence and wind gusts separately on each algorithm's performance (Section 6.2) and provide a deeper discussion of the results (Section 6.3).
6.1 Nominal Wind Conditions
This section focuses on evaluating all 4 controllers
under nominal wind conditions i.e. no wind, no turbu-
lence and no wind gusts. Agents are trained and tested
with these wind settings and across the two reference
difficulty levels from Table 2.
6.1.1 Base RL Algorithms
The results from Table 6 confirm that PID is a strong
baseline in nominal conditions, with only TD-MPC
from RL algorithms performing slightly better. On
the contrary, under more nonlinear dynamics forced
by hard references, all RL methods clearly outper-
form PID, with a significant advantage of TD-MPC
over PPO and SAC. Figure 4 illustrates TD-MPC’s
better hard reference tracking compared to PPO.
Overall TD-MPC displays moderate to strong gains
over all methods across both dynamics regimes.
6.1.2 RL Policies and High Action Oscillation
The RL literature usually focuses on performance metrics only (Koch et al., 2019; Hansen et al., 2022), but for real robotic applications, practitioners must also minimize wear and tear as well as energy consumption. This can only be done by monitoring actuator solicitation, and we therefore incorporate in our evaluation an actuation fluctuation (Act Fluct) metric, following (Song et al., 2023a):

ξ(π) = \frac{1}{2} Σ_{j∈{a,e}} \frac{1}{E} Σ_{n=1}^{E} [ \frac{1}{T} Σ_{t=1}^{T} |δ_{j|t} - δ_{j|t-1}| ]    (18)
where E is the number of evaluation episodes and T the episode length. The metric incentivizes minimizing the distance between two consecutive actuator positions (δ_a or δ_e).
Table 7 shows that the high performance of RL methods comes at the cost of strongly oscillating actions. In fact, RL can lead to bang-bang control, where the learned optimal policy consists of abruptly alternating between actions far apart in the action space (Seyde et al., 2021). To mitigate this issue, we evaluate two oscillation regulation methods from the literature.
Action Variation Penalty (AVP). Adding an AVP to the reward function is a common way of addressing the bang-bang policy problem. Hence, we modify the base reward of Equation 15 by adding a simple additional reward component R_δ, similar to (Bøhn et al., 2019), penalizing abrupt changes between consecutive actions, forming the reward function denoted R_AVP:

R_δ = clip( \frac{1}{2} Σ_{j∈{a,e}} |δ^c_{j|t} - δ^c_{j|t-1}| / ζ_δ, 0, c_δ )
R_AVP = -(R_φ + R_θ + R_δ)
ζ_φ = 3.3,  ζ_θ = 2.25,  ζ_δ = 2    (19)
We separately tuned the c clipping coefficients of the AVP reward function for each RL method, giving c_φ = c_θ = 0.25 and c_δ = 0.5 for TD-MPC and SAC. However, this tuning led to PPO not converging to a successful flying policy; we therefore found an appropriate tuning for this method as c_φ = c_θ = 0.45 and c_δ = 0.1.
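A sketch of the R_AVP reward of Eq. (19), shown with the clipping coefficients retained for TD-MPC and SAC, is given below (illustrative only):

```python
import numpy as np

def avp_reward(phi, theta, phi_d, theta_d, action, last_action,
               zeta_phi=3.3, zeta_theta=2.25, zeta_delta=2.0,
               c_phi=0.25, c_theta=0.25, c_delta=0.5):
    """AVP reward of Eq. (19). PPO required c_phi = c_theta = 0.45, c_delta = 0.1."""
    r_phi = np.clip(abs(phi - phi_d) / zeta_phi, 0.0, c_phi)
    r_theta = np.clip(abs(theta - theta_d) / zeta_theta, 0.0, c_theta)
    action_var = 0.5 * np.abs(np.asarray(action) - np.asarray(last_action)).sum()
    r_delta = np.clip(action_var / zeta_delta, 0.0, c_delta)
    return -(r_phi + r_theta + r_delta)
```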
Conditioning for Action Policy Smoothness (CAPS). Another candidate to mitigate this behavior is the Conditioning for Action Policy Smoothness (CAPS) loss (Mysore et al., 2021). CAPS is added to the policy loss and prevents the emergence of non-smooth action policies by applying temporal and spatial regularization:

L_CAPS(π_Θ) = λ_TS · L_TS(π_Θ) + λ_SS · L_SS(π_Θ)
L_TS(π_Θ) = ||π_Θ(s_t) - π_Θ(s_{t+1})||_2
L_SS(π_Θ) = ||π_Θ(s_t) - π_Θ(ŝ_t)||_2,   ŝ_t ∼ N(s_t, 0.01)    (20)
The time-only version of the CAPS loss is used to enforce temporal smoothness (with λ_TS = 0.05) while leaving room for high reactivity in the state space.
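A PyTorch-style sketch of the CAPS term added to the actor loss is shown below. The policy is assumed to map state batches to action batches, and interpreting the second argument of N(s_t, 0.01) as a noise scale is our assumption.

```python
import torch

def caps_loss(policy, states, next_states, lambda_ts=0.05, lambda_ss=0.0, sigma=0.01):
    """CAPS regularization of Eq. (20); the time-only variant used in the
    paper corresponds to lambda_ss = 0."""
    a_t = policy(states)
    temporal = torch.norm(a_t - policy(next_states), dim=-1).mean()
    noisy_states = states + sigma * torch.randn_like(states)   # s_hat ~ N(s_t, sigma)
    spatial = torch.norm(a_t - policy(noisy_states), dim=-1).mean()
    return lambda_ts * temporal + lambda_ss * spatial
```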
Results. Table 7 shows a consistent improvement of the actuation fluctuation metric for all MF-RL agents with CAPS. For RMSE tracking performance, Table 7 confirms that TD-MPC remains better than PID and the MF-RL agents.
We also observe that action regulation methods remain algorithm dependent. Table 7 shows that the AVP reward function has a strong effect on actuation fluctuation, reducing it by at least an order of magnitude for TD-MPC and SAC. Conversely, AVP for PPO shows a drastic increase in actuation fluctuation, which highlights how difficult it is to tune the reward function towards a flying policy with lower actuator fluctuation. In fact, one tuning of AVP can work for certain
Table 6: Nominal Atmosphere. Average of roll and pitch tracking RMSE over all reference difficulty levels. Best scores are highlighted in blue.
Agent      Nominal Refs (×10^-2)    Hard Refs (×10^-2)
PID        4.20                     20.10
TD-MPC     3.81 ± 0.10              9.10 ± 0.69
PPO        5.29 ± 0.12              15.78 ± 2.48
SAC        6.64 ± 0.83              11.15 ± 0.65
Table 7: Nominal Atmosphere + Actuation Regulation (through AVP or CAPS). Average of roll - pitch tracking errors and aileron - elevator action fluctuation, over all reference difficulty levels. Bold values are the best score among one agent type. Blue: best RMSE score across all agents. Red: best Actuator Fluctuation score across all agents.
             Nominal Refs (×10^-2)            Hard Refs (×10^-2)
Agent        RMSE           Act Fluct         RMSE            Act Fluct
PID          4.20           0.07              20.10           0.17
TD-MPC       3.81 ± 0.10    4.37 ± 0.48       9.10 ± 0.69     6.02 ± 0.59
TD-MPC AVP   3.58 ± 0.01    0.54 ± 0.02       8.09 ± 0.03     0.60 ± 0.02
PPO          5.29 ± 1.20    1.98 ± 0.58       15.78 ± 2.48    5.54 ± 0.93
PPO CAPS     5.05 ± 1.50    1.47 ± 0.27       19.17 ± 4.68    4.56 ± 0.60
PPO AVP      5.78 ± 2.10    13.78 ± 3.16      13.80 ± 0.82    13.48 ± 2.50
SAC          6.64 ± 0.83    10.07 ± 1.12      11.15 ± 0.65    12.00 ± 0.71
SAC CAPS     6.79 ± 0.74    0.53 ± 0.27       12.28 ± 1.17    1.44 ± 0.55
SAC AVP      6.54 ± 0.75    0.50 ± 0.18       15.14 ± 3.39    1.20 ± 0.32
algorithms, such as SAC and TD-MPC, and yield an underperforming PPO agent requiring a specific tuning. CAPS appears to be a better action regulation method for MF-RL methods because of its consistency with a single tuned parameter λ_TS = 0.05. As an example, Figure 5 outlines CAPS's action smoothing for SAC through a trajectory plot. As a result, the CAPS method has been retained for the following ablations. Despite showing great results for MF-RL methods, CAPS is not directly applicable to TD-MPC because it would only regularize its learned policy prior, while TD-MPC never directly outputs an action sampled from this prior. Indeed, we found in additional experiments that applying CAPS led to learning a smooth action policy prior but did not guarantee smooth actions after the planning step.
6.2 Wind Disturbances
Aerial robotics presents challenging environmental conditions which add a layer of complexity for control systems. This section studies the impact of such conditions on our four controllers of interest. Wind disturbances are divided into two sub-categories: stochastic turbulence, modeled by probabilistic dynamics, and wind gusts, consisting of unpredictable and sudden changes in dynamics. Agents are trained and tested under stochastic turbulence in Section 6.2.1 and under wind gusts in Section 6.2.2, separately. The parameters for turbulence and gusts are drawn uniformly among the values reported in Table 8. The classification of constant winds is only used for evaluation: during training, constant wind magnitudes are uniformly sampled from the continuous interval [0, 23] m/s. Constant wind and wind gust directions with respect to the NED inertial frame are also sampled uniformly in the [-1, 1] interval.
To better isolate the effects of each disturbance from the effects of hard attitude angle reference tracking, the agents were trained and tested only on the nominal attitude reference angle range from Table 2.
Table 8: Wind Disturbance Severity Levels (m/s). The classification for constant wind and gust magnitudes is taken from (Bøhn et al., 2019). "Turbulence W20" is a predefined parameter in the JSBSim Dryden model turbulence classification (Berndt, 2004).
Level      Constant Wind    Turbulence W20    Gust
OFF        0                0                 0
Light      7                7.6               7
Moderate   15               15.25             15
Severe     23               22.86             23
6.2.1 Stochastic Turbulence
As explained in Section 3.2.1, turbulence is simulated as a stochastic process applying linear and angular forces to the FWUAV. On the one hand, Figure 2 shows that SAC underperforms across all turbulence levels. On the other hand, TD-MPC performs well, reaching the same level of performance as PID and PPO. We observe that the severity of the wind disturbance dominates over the choice of algorithm in predicting the final performance. In contrast to Section 6.1.1, where TD-MPC showed a clear advantage on deterministic nonlinear dynamics when tracking high attitude reference angles, here its learned dynamics model seems to have difficulty capturing the stochastic turbulent dynamics of this scenario, and can only resort to learning an "on-average" robust policy. To the best of our knowledge, TD-MPC has not been applied to environments with stochastic dynamics, and adapting TD-MPC to stochastic environments remains an open problem.
Figure 2: Stochastic Turbulences + Action Regulation. Average tracking RMSE on nominal reference difficulty, plotted against turbulence level (OFF, Light, Moderate, Severe) for PID, TD-MPC AVP, PPO CAPS and SAC CAPS.
6.2.2 Wind Gusts
In our simulated environment, wind gusts are triggered twice per episode, at timesteps 500 and 1500. Gusts are parameterized with a startup duration (time to reach maximum wind speed) of 0.25 s, a steady duration (time span where the wind stays at maximal magnitude) of 0.5 s, and an end duration (time for the gust to disappear) of 0.25 s. The startup and end transients are modeled as a smooth cosine function. The magnitude of the gust is determined by its maximum wind speed and follows Table 8.
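The resulting gust magnitude profile can be sketched as follows; this is an approximation of the "1-cosine"-style gust shape provided by JSBSim, and the exact built-in parameterization may differ.

```python
import numpy as np

def gust_magnitude(t, t_start, v_max, t_rise=0.25, t_steady=0.5, t_fall=0.25):
    """Gust magnitude at time t: cosine ramp-up, constant plateau, cosine ramp-down."""
    tau = t - t_start
    if tau < 0.0:
        return 0.0
    if tau < t_rise:                                   # startup transient
        return 0.5 * v_max * (1.0 - np.cos(np.pi * tau / t_rise))
    if tau < t_rise + t_steady:                        # steady maximum magnitude
        return v_max
    if tau < t_rise + t_steady + t_fall:               # end transient
        return 0.5 * v_max * (1.0 + np.cos(np.pi * (tau - t_rise - t_steady) / t_fall))
    return 0.0
```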
The results presented in Figure 3 indicate that SAC significantly underperforms relative to the other agents and that TD-MPC slightly underperforms PPO and PID. To explain these observations, we hypothesize that such a disturbance alters the MDP formalism into a Partially Observable Markov Decision Process (PO-MDP). Indeed, an MDP is comprised of a transition function P(s_{t+1}|s_t, a_t) = T(s_t, a_t), which does not exist at every point in time in the case of unpredictable gusts. From a formal point of view, while hard references simply lead to a more complex transition function, wind gusts actually make it impossible to rigorously define such a transition function. Wind information is hidden from the agent, which cannot anticipate wind variations, thus leading to state aliasing (McCallum, 1996). In fact, unpredictable gusts lead to the following situation: across two separate experiments, taking the same action in a given common state could lead to two different next states. Recurrent Neural Networks have been presented as a good baseline for solving PO-MDPs (Ni et al., 2022; Asri and Trischler, 2019) and could be applied to all the RL methods presented in this article.
RL methods learn the transition function either explicitly, in the case of model-based RL (TD-MPC), or implicitly, in the case of model-free RL (PPO and SAC). Due to state aliasing, contradictory updates of the function approximators occur at the onset of a wind gust. A second hypothesis to explain PPO's better performance compared to its RL counterparts (SAC and TD-MPC) could be PPO's clipping of the policy loss (Schulman et al., 2017). Motivated by limiting overly large policy updates that could lead to a bad policy from which it is hard to recover, this conservative policy update could protect the PPO agent from contradictory updates when subject to wind gusts.
Figure 3: Wind Gusts + Action Regulation. Average tracking RMSE on nominal reference difficulty, plotted against gust level (OFF, Light, Moderate, Severe) for PID, TD-MPC AVP, PPO CAPS and SAC CAPS.
6.3 Discussions
TD-MPC's Superiority for Nonlinear Regimes. The superior results of TD-MPC under nonlinear dynamics are consistent with the reported superior performance of TD-MPC over MF-RL methods on complex nonlinear manipulation and locomotion tasks (Hansen et al., 2022; Hansen et al., 2024). This supports the hypothesis that mixed control, combining the learning of an explicit model of the system dynamics with its use for planning with terminal value estimation, can yield improved tracking performance over a wide range of attitude angles. It also suggests that learning only an implicit representation of the dynamics limits the ability of MF-RL agents to deal with different dynamic regimes when trained simultaneously on nominal and hard regimes.
Wind Disturbances. We also evaluated the control
methods under wind disturbances of different nature.
(a) PPO.
(b) TD-MPC.
Figure 4: TD-MPC's Superiority for Hard References: PPO vs TD-MPC. Red dashed lines are the references: Roll = 55°, Pitch = 28°. The red area around the reference line corresponds to ±5° error bounds.
Both studies revealed that, with the exception of the underperforming SAC, the severity of the disturbance dominates over the choice of algorithm in predicting final performance. As previously stated, hard references only amount to a harder-to-model nonlinear MDP transition function. We hypothesized that turbulence and gusts transform the MDP formalism into a non-stationary PO-MDP, and we observe that not all perturbations affect with equal magnitude how well the MDP formalization fits the environment conditions. The turbulence study presented in Section 6.2.1 suggests that turbulence can be dealt with more easily than gusts, because there exists a time-averaged MDP, close to the true dynamics, that RL algorithms fall back on by default. Gusts, however, present a multi-modal problem (nominal mode and gust mode) that makes it impossible to strictly define a transition function, which can be detrimental during the learning phase as presented in Section 6.2.2. Therefore, we conclude that training agents under various perturbations is not enough to ensure robustness and generalization, and that different types of perturbations may require different types of approaches. The multi-model MDP framework could be investigated for this purpose (Steimle et al., 2021).
(a) Base SAC.
(b) SAC + CAPS.
Figure 5: Mitigating a Highly Oscillating Action Policy: SAC vs SAC + CAPS. Red dashed lines are the references: Roll = 15°, Pitch = 10°. The red area around the reference line corresponds to ±5° error bounds.
7 CONCLUSIONS AND
PERSPECTIVES
We proposed an in-depth study of different RL controllers with the aim of comparing model-free and model-based RL approaches. TD-MPC, an MB-RL method with algorithmic elements from CT and RL, yielded the best performance in nominal wind conditions. Its superiority was especially evident for hard references, demonstrating its ability to perform well under deterministic nonlinear dynamics across the entire state space. We attribute such results to TD-MPC's learning of an explicit dynamics model jointly used with predictive planning. We evaluated these control methods under various wind perturbations and found that, aside from SAC underperforming, turbulence severity impacts final performance significantly more than the choice of algorithm.
We identified high actuation fluctuation as an important drawback of RL methods. Since this metric is key in robotics, we applied and tested countermeasures, namely an action variation reward penalty (AVP) for all methods and regularization of the actor network through CAPS for the MF-RL methods. As a result, we retained CAPS as the action regulation method due to its consistency and ease of tuning across all MF-RL agents. Another path towards smooth actions for TD-MPC (for which CAPS is not applicable) is MoDem-V2 (Lancaster et al., 2023), which aims at learning safe policies by biasing the initial data distribution towards a desired behavior, in our case a controller with smooth actions.
We identify several future research directions:
one could focus on experimenting with probabilistic
models for TD-MPC in order to better capture the
stochasticity of turbulence dynamics, thus formulat-
ing a more rigorous transition function that outputs
a probability distribution. For both gusts and turbu-
lence, RL algorithms with recurrent networks appear
to be a good starting point (Ni et al., 2022; Asri and
Trischler, 2019) to solve PO-MDPs. One could also
use ideas from robust RL to achieve disturbance resilience of RL controllers (Hsu et al., 2024). Other directions include learning-based adaptive control as a field of potential mixed control methods (Shi et al., 2019; Doukhi and Lee, 2019), where a closed form of the nominal dynamics is used together with a feed-forward component for the unknown disturbance dynamics predicted by a learned model.
REFERENCES
(1980). Flying qualities of piloted airplanes, military spec-
ification. Technical report, MIL-F-8785C.
Andrychowicz, O. M., Baker, B., Chociej, M., Jozefowicz,
R., McGrew, B., Pachocki, J., Petron, A., Plappert,
M., Powell, G., Ray, A., et al. (2020). Learning dex-
terous in-hand manipulation. The International Jour-
nal of Robotics Research, 39(1):3–20.
Asri, L. E. and Trischler, A. (2019). A study of state aliasing
in structured prediction with RNNs. arXiv preprint
arXiv:1906.09310.
Beard, R. W. and McLain, T. W. (2012). Small Unmanned
Aircraft: Theory and Practice. Princeton University
Press.
Becker-Ehmck, P., Karl, M., Peters, J., and van der Smagt,
P. (2020). Learning to fly via deep model-based rein-
forcement learning. arXiv preprint arXiv:2003.08876.
Berndt, J. (2004). JSBSim: An open source flight dynam-
ics model in C++. In AIAA Modeling and Simulation
Technologies Conference and Exhibit, page 4923.
Bøhn, E., Coates, E. M., Moe, S., and Johansen, T. A.
(2019). Deep reinforcement learning attitude control
of fixed-wing UAVs using proximal policy optimiza-
tion. In International Conference on Unmanned Air-
craft Systems (ICUAS), pages 523–533.
Bøhn, E., Coates, E. M., Reinhardt, D., and Johansen, T. A.
(2023). Data-efficient deep reinforcement learning
for attitude control of fixed-wing UAVs: Field exper-
iments. IEEE Transactions on Neural Networks and
Learning Systems.
De Marco, A., D’Onza, P. M., and Manfredi, S. (2023).
A deep reinforcement learning control approach for
high-performance aircraft. Nonlinear Dynamics,
111(18):17037–17077.
Doukhi, O. and Lee, D. J. (2019). Neural network-
based robust adaptive certainty equivalent controller
for quadrotor UAV with unknown disturbances. Inter-
national Journal of Control, Automation and Systems,
17(9):2365–2374.
Gryte, K., Hann, R., Alam, M., Roháč, J., Johansen, T. A.,
and Fossen, T. I. (2018). Aerodynamic modeling of
the Skywalker x8 fixed-wing unmanned aerial vehi-
cle. In International Conference on Unmanned Air-
craft Systems (ICUAS), pages 826–835.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018).
Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor. In In-
ternational Conference on Machine Learning, pages
1861–1870. PMLR.
Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. (2023).
Mastering diverse domains through world models.
arXiv preprint arXiv:2301.04104.
Hansen, N., Su, H., and Wang, X. (2024). TD-MPC2:
Scalable, robust world models for continuous control.
In International Conference on Learning Representa-
tions (ICLR).
Hansen, N., Wang, X., and Su, H. (2022). Temporal differ-
ence learning for model predictive control. In Inter-
national Conference on Machine Learning.
Hsu, H.-L., Meng, H., Luo, S., Dong, J., Tarokh, V., and
Pajic, M. (2024). REFORMA: Robust reinforcement
learning via adaptive adversary for drones flying un-
der disturbances. In 2024 IEEE International Confer-
ence on Robotics and Automation (ICRA).
Huang, S., Dossa, R. F. J., Ye, C., Braga, J., Chakraborty,
D., Mehta, K., and Araújo, J. G. (2022). CleanRL:
High-quality single-file implementations of deep re-
inforcement learning algorithms. Journal of Machine
Learning Research, 23(274):1–18.
Hwangbo, J., Sa, I., Siegwart, R., and Hutter, M. (2017).
Control of a quadrotor with reinforcement learning.
IEEE Robotics and Automation Letters, 2(4):2096–
2103.
Koch, W., Mancuso, R., West, R., and Bestavros, A.
(2019). Reinforcement learning for UAV attitude con-
trol. ACM Transactions on Cyber-Physical Systems,
3(2):1–21.
Lambert, N. O., Drew, D. S., Yaconelli, J., Levine, S., Ca-
landra, R., and Pister, K. S. (2019). Low-level con-
trol of a quadrotor with deep model-based reinforce-
ment learning. IEEE Robotics and Automation Let-
ters, 4(4):4224–4230.
Lancaster, P., Hansen, N., Rajeswaran, A., and Kumar,
V. (2023). MoDem-V2: Visuo-motor world mod-
els for real-world robot manipulation. arXiv preprint
arXiv:2309.14236.
Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., and Hut-
ter, M. (2020). Learning quadrupedal locomotion over
challenging terrain. Science Robotics, 5(47).
Liang, X., Zheng, M., and Zhang, F. (2018). A scalable
model-based learning algorithm with application to
UAVs. IEEE Control Systems Letters, 2(4):839–844.
Liu, F., Dai, S., and Zhao, Y. (2021). Learning to have a
civil aircraft take off under crosswind conditions by
reinforcement learning with multimodal data and pre-
processing data. Sensors, 21(4):1386.
Mathisen, S., Gryte, K., Gros, S., and Johansen, T. A.
(2021). Precision deep-stall landing of fixed-wing
UAVs using nonlinear model predictive control. Jour-
nal of Intelligent & Robotic Systems, 101:1–15.
McCallum, A. K. (1996). Reinforcement learning with se-
lective perception and hidden state. University of
Rochester.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M.
(2013). Playing Atari with deep reinforcement learn-
ing. arXiv preprint arXiv:1312.5602.
Moallemi, M. and Towhidnejad, M. (2016). B-737 autopilot
design and implementation for simulated flight man-
agement system. In Proceedings of the 49th Annual
Simulation Symposium, pages 1–7.
Mysore, S., Mabsout, B., Mancuso, R., and Saenko, K.
(2021). Regularizing action policies for smooth
control with reinforcement learning. In IEEE In-
ternational Conference on Robotics and Automation
(ICRA), pages 1810–1816.
Ni, T., Eysenbach, B., and Salakhutdinov, R. (2022). Re-
current model-free RL can be a strong baseline for
many POMDPs. In International Conference on Ma-
chine Learning, pages 16691–16723. PMLR.
Peng, X. B., Coumans, E., Zhang, T., Lee, T.-W., Tan, J.,
and Levine, S. (2020). Learning agile robotic loco-
motion skills by imitating animals. arXiv preprint
arXiv:2004.00784.
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan,
K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E.,
Hassabis, D., Graepel, T., et al. (2020). Mastering
Atari, Go, Chess and Shogi by planning with a learned
model. Nature, 588(7839):604–609.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal policy optimization al-
gorithms. arXiv preprint arXiv:1707.06347.
Seyde, T., Gilitschenski, I., Schwarting, W., Stellato, B.,
Riedmiller, M., Wulfmeier, M., and Rus, D. (2021). Is
bang-bang control all you need? solving continuous
control with bernoulli policies. Advances in Neural
Information Processing Systems, 34:27209–27221.
Shi, G., Azizzadenesheli, K., O’Connell, M., Chung, S.-J.,
and Yue, Y. (2021). Meta-adaptive nonlinear control:
Theory and algorithms. Advances in Neural Informa-
tion Processing Systems, 34:10013–10025.
Shi, G., Shi, X., O’Connell, M., Yu, R., Azizzadenesheli,
K., Anandkumar, A., Yue, Y., and Chung, S.-J. (2019).
Neural lander: Stable drone landing control using
learned dynamics. In International Conference on
Robotics and Automation, pages 9784–9790.
Song, X., Duan, J., Wang, W., Li, S. E., Chen, C., Cheng,
B., Zhang, B., Wei, J., and Wang, X. S. (2023a). Lip-
sNet: a smooth and robust neural network with adap-
tive Lipschitz constant for high accuracy optimal con-
trol. In International Conference on Machine Learn-
ing, pages 32253–32272. PMLR.
Song, Y., Romero, A., Müller, M., Koltun, V., and Scara-
muzza, D. (2023b). Reaching the limit in autonomous
racing: Optimal control versus reinforcement learn-
ing. Science Robotics, 8(82).
Steimle, L. N., Kaufman, D. L., and Denton, B. T. (2021).
Multi-model Markov decision processes. IISE Trans-
actions, 53(10):1124–1139.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn-
ing: An introduction. MIT press.
Tsourdos, A., Permana, I. A. D., Budiarti, D. H., Shin, H.-
S., and Lee, C.-H. (2019). Developing flight con-
trol policy using deep deterministic policy gradient.
In IEEE International Conference on Aerospace Elec-
tronics and Remote Sensing Technology (ICARES).
Williams, G., Aldrich, A., and Theodorou, E. (2015).
Model predictive path integral control using covari-
ance variable importance sampling. arXiv preprint
arXiv:1509.01149.