
processing NNs around the VQC – a technique also known as dressed VQCs – to solve Gym-like environments, which makes the contribution of the quantum submodules difficult to assess (Acuto et al., 2022; Lan, 2021). Other contributions benchmark QRL for CAS on entirely quantum tasks, such as the quantum state generation problem and the eigenvalue problem, where the actions and states of the environment are already quantum operations (Wu et al., 2023). There is also progress towards using VQCs without any additional NNs to solve Gym environments such as the normal and inverted pendulum and the lunar lander (Kruse et al., 2024). These results motivate this work to apply CAS QRL to a more intricate robot navigation task.
2 RELATED WORK
Based on the degree to which quantum principles and
technology are integrated into the method, there are
four main QRL categories: quantum-inspired RL al-
gorithms, VQC-based RL function modules, RL al-
gorithms with quantum subroutines, as well as fully-
quantum RL (Meyer et al., 2024).
This section presents the QRL branch to which our work belongs, where the NN function approximators of classical RL algorithms are replaced with VQCs. One can employ VQCs as the Q-value computational block (Hohenfeld et al., 2024), as well as the policy and/or value function of actor-critic algorithms, such as SAC (Acuto et al., 2022), PPO (Drăgan et al., 2022), or asynchronous advantage actor-critic (Chen, 2023). In these works, either one or both of the actor and the critic are replaced with hybrid quantum-classical (HQC) VQCs and trained with classical optimizers.
A hybrid DDPG is presented in (Wu et al., 2023).
All four main and target Q-value and policy approx-
imators are HQC VQCs. However, it addresses the quantum state generation and eigenvalue problems, which are already encoded as quantum operations. In
the works of (Acuto et al., 2022; Lan, 2021), the actor
and critic networks are replaced with dressed data re-
uploading VQCs. They benchmark their approaches
on continuous Gym(-derived) environments, namely
a robotic arm and the pendulum. The use of pre- and post-processing NNs, without comparison to pure VQC approaches or to state-of-the-art classical counterparts, makes it difficult to distinguish the contribution of the quantum sub-modules.
In (Kruse et al., 2024) an HQC PPO algo-
rithm tackles continuous actions without pre- or post-
processing NNs on top of the data re-uploading VQC
actor and critic. The authors analyze different choices of VQC architectural blocks and of measurement post-processing, and show that normalization and trainable scaling parameters lead to better results. While this work is a first step towards CAS QRL, it is limited to Gym environments and leaves a deeper investigation of VQC architectures as future work.
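To make the architecture discussed above concrete, the following is a minimal PennyLane sketch of a data re-uploading VQC with trainable input-scaling parameters; the qubit count, layer count, gate choices, and initialization are illustrative assumptions and do not reproduce the exact circuits used in the cited works.

```python
import pennylane as qml
import numpy as np

n_qubits, n_layers = 4, 3
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def reuploading_vqc(inputs, weights, input_scaling):
    """Data re-uploading ansatz: the input features are re-encoded in every
    layer as rotation angles, each multiplied by a trainable scaling weight,
    and interleaved with trainable rotations and an entangling CNOT ring."""
    for layer in range(n_layers):
        # feature encoding, scaled by trainable parameters (one per qubit and layer)
        for q in range(n_qubits):
            qml.RY(input_scaling[layer, q] * inputs[q], wires=q)
        # trainable variational rotations
        for q in range(n_qubits):
            qml.RZ(weights[layer, q, 0], wires=q)
            qml.RY(weights[layer, q, 1], wires=q)
        # circular entanglement
        for q in range(n_qubits):
            qml.CNOT(wires=[q, (q + 1) % n_qubits])
    # Pauli-Z expectation values in [-1, 1], to be post-processed classically
    return [qml.expval(qml.PauliZ(q)) for q in range(n_qubits)]

# quick functional check with random parameters
inputs = np.random.uniform(-1.0, 1.0, n_qubits)
weights = np.random.uniform(-np.pi, np.pi, (n_layers, n_qubits, 2))
input_scaling = np.ones((n_layers, n_qubits))
print(reuploading_vqc(inputs, weights, input_scaling))
```

The measured expectation values lie in [-1, 1]; as discussed above, they are typically rescaled by further trainable output parameters or normalization before being interpreted as Q-values, action means, or value estimates.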
The environment in this paper is a modified ver-
sion of the robot navigation task presented in (Ho-
henfeld et al., 2024). In their work, an HQC double
deep Q-network (DDQN) algorithm is used to navi-
gate through a maze of continuous states and three
discrete actions: forward, turn left, and turn right.
Data re-uploading VQCs are used, where features are
embedded as parameters of rotational gates, scaled by
trainable weights. They propose four benchmarking
scenarios: a 3 × 3, a 4 × 4, a 5 × 5 and a 12 × 12
maze. For the first three map configurations, the three
continuous input features are global x, y, z coordinates,
whereas in the last case, the feature space contains 12
values: 10 local features generated by LiDAR sen-
sors, as well as the global distance and orientation
to the goal. While the authors treat a discrete ac-
tion space, in the simulation model the robot moves
by adjusting the continuous speeds of its two wheels.
This enables us to advance the task to a CAS, with the
added complexity of only six state features, three of
which are LiDAR readings.
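As an illustration, the observation and action spaces of this CAS variant could be declared as in the following Gymnasium-style sketch; the bounds and the normalization of the wheel speeds are assumptions made for illustration, not the exact specification used here or in (Hohenfeld et al., 2024).

```python
import numpy as np
from gymnasium import spaces

# Hypothetical, Gymnasium-style space declarations for the CAS maze variant.
# Six continuous state features, three of which are LiDAR range readings;
# the ranges are left unbounded here since the exact limits are task-specific.
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(6,), dtype=np.float32)

# Continuous actions: target speeds of the robot's two wheels, assumed to be
# normalized to [-1, 1] and rescaled to physical velocities by the simulator.
action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

print(observation_space.sample().shape, action_space.sample())
```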
3 MAZE DEFINITION
Environments solved by RL agents are defined as
Markov Decision Processes (MDP), characterized by
the tuple MDP = (S,A,P,r). The state space S is the
ensemble of all possible environmental states, the ac-
tion space A is the set of all actions an agent can take,
and P(s_t, a_t, s_{t+1}) : S × A × S → [0,1] is the transition probability function that gives the likelihood of the agent, after taking action a_t ∈ A in state s_t ∈ S at time step t ∈ {1, 2, . . . , T}, ending up in state s_{t+1} ∈ S at the next time step. The reward function r(s_t, a_t, s_{t+1}), with r : S × A × S → R, dictates the feedback given by the environment to the agent after taking an action. This
iterative loop between the agent taking actions and the
environment providing feedback constitutes the gen-
eral interaction scheme of the RL agent. The action a_t is taken according to the agent's internal policy
π : S → A, which is continuously adjusted in order to
maximize the total reward accumulated by the agent
during one interaction sequence, an episode.
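For concreteness, this interaction scheme corresponds to the following generic episode loop, sketched here with a Gymnasium-style interface; the environment and policy objects are placeholders, not the implementation used in this work.

```python
import gymnasium as gym

def run_episode(env: gym.Env, policy, max_steps: int = 500) -> float:
    """Generic agent-environment interaction loop: at each step t the agent
    selects a_t = policy(s_t), the environment returns s_{t+1} and the reward
    r(s_t, a_t, s_{t+1}), and the episode return is accumulated until the
    episode terminates or is truncated."""
    state, _ = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                                # a_t ~ pi(s_t)
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```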
The robot navigation use case is based on the
Turtlebot 2 robot, which navigates a warehouse from the upper-left start to the lower-right end goal while avoiding obstacles (Hohenfeld et al., 2024). We chose
three static benchmarking maps of dimensions 3 × 3,