
The TD3 model and its training process are encapsulated in a single class, as shown in Algorithm 1. This class orchestrates the model's interactions with its environment, facilitating learning and adaptation.
Data: state_dim, action_dim, max_action
Result: TD3 Agent

Function TD3(state_dim, action_dim, max_action):
    Initialize the actor, actor_target, critic, and critic_target networks;
    Load the actor and critic weights into actor_target and critic_target;
    Initialize the actor and critic optimizers;
    Set the maximum action value;

Function select_action(state):
    return actor(state);

Function train(replay_buffer, iterations, batch_size, discount, tau, policy_noise, noise_clip, policy_freq):
    for it ← 1 to iterations do
        Sample transitions (s, s', a, r, d);
        From the next state s', compute the next action a' using actor_target;
        Add Gaussian noise to the next action and clamp it;
        Compute target Q-values using the two critic targets;
        Keep the minimum of these two Q-values: min(Qt1, Qt2);
        Compute target Q-values with the discount factor;
        Compute current Q-values using the two critic networks;
        Compute the critic loss;
        Backpropagate the critic loss and update the parameters of the two critic models using the SGD optimizer;
        if it % policy_freq == 0 then
            Compute the actor loss;
            Update the actor parameters using gradient ascent;
            Update the actor_target and critic_target weights using Polyak averaging every two iterations;
        end
    end

Function save(filename, directory):
    Save the actor and critic weights to a file;

Function load(filename, directory):
    Load the actor and critic weights from a file;

Algorithm 1: TD3 Class.
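To make the steps of Algorithm 1 concrete, the following is a minimal PyTorch sketch of a single train() iteration, not the exact implementation used in this work. It assumes that the actor, critic, and target networks are passed in as modules, that the critic returns both Q-estimates as a pair, and that replay_buffer.sample() returns batched tensors; all identifiers are illustrative.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of one TD3 training iteration (assumed module/optimizer names).
def train_iteration(it, actor, actor_target, critic, critic_target,
                    actor_opt, critic_opt, replay_buffer, max_action,
                    batch_size=100, discount=0.99, tau=0.005,
                    policy_noise=0.2, noise_clip=0.5, policy_freq=2):
    # Sample a batch of transitions (s, s', a, r, d) from the replay buffer.
    s, s_next, a, r, d = replay_buffer.sample(batch_size)

    with torch.no_grad():
        # Next action from the target actor, perturbed by clipped Gaussian noise.
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a_next = (actor_target(s_next) + noise).clamp(-max_action, max_action)

        # Clipped double-Q target: keep the minimum of the two target critics.
        q1_t, q2_t = critic_target(s_next, a_next)
        q_target = r + (1.0 - d) * discount * torch.min(q1_t, q2_t)

    # Critic update: regress both current Q-estimates toward the target.
    q1, q2 = critic(s, a)
    critic_loss = F.mse_loss(q1, q_target) + F.mse_loss(q2, q_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed actor update and Polyak averaging of the target networks.
    if it % policy_freq == 0:
        actor_loss = -critic(s, actor(s))[0].mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        for p, p_t in zip(critic.parameters(), critic_target.parameters()):
            p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
        for p, p_t in zip(actor.parameters(), actor_target.parameters()):
            p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
```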
2.3.4 Model Training
The training process unfolds over millions of timesteps, driven by a robust training loop. The key steps in the training loop encompass episode monitoring, policy evaluation, reward calculation, and storage of evaluation results. These iterative steps ensure that the agent continuously refines its policy through interactions, training, and evaluations within the environment.
At each timestep, the procedure checks whether the episode has ended or the maximum number of steps per episode has been reached. If so, training begins. Using experiences from the replay buffer, the policy is trained with the train() function, but only if there are enough experiences and it is not the first timestep. After training, the current policy is evaluated with the evaluate_policy() function at intervals set by the eval_freq argument. The evaluation results are stored, and the policy is saved.
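This control flow can be illustrated with the following sketch of the episode-boundary handling; identifiers such as start_timesteps, eval_freq, max_episode_steps, and the argument list of policy.train() are assumed names rather than the exact ones used in our code.

```python
# Illustrative episode-boundary handling inside the main training loop (assumed names).
if done or episode_timesteps >= max_episode_steps:
    if total_timesteps != 0:
        # Train on experiences from the replay buffer (assumes it already
        # holds enough transitions to sample from).
        policy.train(replay_buffer, episode_timesteps, batch_size, discount,
                     tau, policy_noise, noise_clip, policy_freq)

    if timesteps_since_eval >= eval_freq:
        # Periodically evaluate and checkpoint the current policy.
        timesteps_since_eval %= eval_freq
        evaluations.append(evaluate_policy(policy, env))
        policy.save(file_name, directory="./pytorch_models")

    # Reset the environment and the per-episode counters.
    obs = env.reset()
    done = False
    episode_reward = 0
    episode_timesteps = 0
    episode_num += 1
```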
The evaluate_policy() function takes a given model and tests it on "n" episodes, each consisting of "m" steps, and then returns the average reward over the test episodes and the average number of steps per episode before termination. In our training, the models were tested on 10 episodes of 10e4 steps each. Again, these values were obtained through multiple iterations of tuning, aimed at ensuring a steady reward output from the same model.
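A sketch of such an evaluation routine, assuming a Gym-style environment with reset() and step() and using the episode and step counts quoted above as illustrative defaults, could look as follows.

```python
# Illustrative evaluation routine (assumed Gym-style env with reset()/step()).
def evaluate_policy(policy, env, eval_episodes=10, max_episode_steps=int(10e4)):
    total_reward, total_steps = 0.0, 0
    for _ in range(eval_episodes):
        obs = env.reset()
        done, steps = False, 0
        while not done and steps < max_episode_steps:
            action = policy.select_action(obs)
            obs, reward, done, _ = env.step(action)
            total_reward += reward
            steps += 1
        total_steps += steps
    # Average reward and average episode length over the evaluation episodes.
    return total_reward / eval_episodes, total_steps / eval_episodes
```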
Episode-related variables are then updated, the timestep counters increase, and the environment state is reset. The agent decides whether to explore (before start_timesteps) or to choose an action based on the learned policy. If exploration noise is enabled, noise is added to the action within the action-space boundaries. The chosen action is executed in the environment, and the code retrieves the next observation, the reward, and the episode status. The episode and overall rewards are updated, and the transition is stored in the replay buffer.
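The per-timestep action selection and storage just described can be sketched as follows; expl_noise, start_timesteps, and the replay_buffer.add() signature are assumed names.

```python
import numpy as np

# Illustrative per-timestep action selection and transition storage (assumed names).
if total_timesteps < start_timesteps:
    # Pure exploration before learning starts.
    action = env.action_space.sample()
else:
    action = policy.select_action(np.array(obs))
    if expl_noise != 0:
        # Gaussian exploration noise, clipped to the action-space boundaries.
        action = action + np.random.normal(0, expl_noise, size=env.action_space.shape[0])
        action = action.clip(env.action_space.low, env.action_space.high)

# Execute the chosen action and observe the outcome.
new_obs, reward, done, _ = env.step(action)
episode_reward += reward

# Store the transition (s, s', a, r, d) in the replay buffer.
replay_buffer.add((obs, new_obs, action, reward, float(done)))
obs = new_obs
episode_timesteps += 1
total_timesteps += 1
timesteps_since_eval += 1
```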
Once the main loop ends, the final policy evaluation is added to the evaluations list, and if the save_models flag is set, the policy is stored. The final evaluation result is saved as well. The environment is then closed and restarted. The average reward across episodes is computed by dividing the total reward by the number of episodes (episode_num). This loop iteratively improves the agent's policy through interactions, training, and periodic evaluations within the environment.
2.4 Adaptive Module
The final piece of our framework is the adaptive module, a hybrid approach that combines batch learn-