
This means that although the simulation itself stops at
the moment of collision, the agent keeps receiving a
constant negative reward for the collision until the end
of the episode. Note that this is consistent with the zero-
valued reward for successful parking, which cancels
the tail of discounted rewards in a similar manner.
In the context of double Q-learning computations and
formula (9) for the target regression values:
$$y^*_i := r_i + \gamma\, \widetilde{Q}\!\left(s'_i,\ \operatorname*{argmax}_{a'} \widehat{Q}(s'_i, a'; \widehat{w});\ \widetilde{w}\right),$$
the proper handling of the future returns for terminal
states can be achieved simply by replacing the
$\widetilde{Q}(\cdot)$ response as follows:
$$y^*_i := r_i + \gamma \cdot 0, \quad (26)$$
$$y^*_i := r_i + \gamma \cdot r_c, \quad (27)$$
for the successfully parked car and the collided car,
respectively.
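To make the target computation concrete, the sketch below shows one way the regression targets could be assembled for a replay batch, combining the double Q-learning rule (9) with the terminal-state overrides of Eqs. (26) and (27). It is a minimal illustration under assumed names: the batch layout and the identifiers q_online, q_target, and r_collision are ours, not the interface of the actual implementation.

```python
import numpy as np

def double_q_targets(batch, q_online, q_target, gamma, r_collision):
    """Illustrative computation of the targets y*_i of Eq. (9), with the
    terminal-state overrides of Eqs. (26)-(27).

    Assumed (hypothetical) batch layout: arrays 'rewards', 'next_states',
    and boolean flags 'parked' / 'collided', one entry per transition.
    q_online(s) and q_target(s) stand for the online and target networks,
    each returning one row of action values per state.
    """
    r = np.asarray(batch["rewards"], dtype=float)
    s_next = batch["next_states"]
    parked = np.asarray(batch["parked"], dtype=bool)
    collided = np.asarray(batch["collided"], dtype=bool)

    # Double Q-learning decoupling: the online network selects the action,
    # the target network evaluates it.
    a_star = np.argmax(q_online(s_next), axis=1)
    q_next = q_target(s_next)[np.arange(len(r)), a_star]

    y = r + gamma * q_next                               # Eq. (9), non-terminal case
    y = np.where(parked, r + gamma * 0.0, y)             # Eq. (26): zero future return
    y = np.where(collided, r + gamma * r_collision, y)   # Eq. (27): collision penalty tail
    return y
```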
We now move on to the details of this additional experiment. The park place was located at $\vec{x}_p = (0.0, 0.0)$ and directed along $\vec{d}_p = (-1.0, 0.0)$.
Two obstacles were adjacent to it sideways, each 1 m away from the park place border. The random distributions for the initial car position and angle were: $\vec{x} \sim U([5.0, 15.0] \times [-5.0, 5.0])$, $\psi \sim U(\tfrac{1}{2}\pi, \tfrac{3}{2}\pi)$, i.e. a range of $180^{\circ}$. In experiments involving 8 sensors, their layout was (3, 3, 1): 3 in front, 3 in the back, 1 at each side. In experiments involving 12 sensors, their layout was (3, 3, 3).
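For concreteness, the short sketch below restates this scene configuration and the initial-state sampling in code; the identifiers (PARK_POS, SENSOR_LAYOUTS, sample_initial_pose) are illustrative assumptions, not names from our implementation.

```python
import numpy as np

rng = np.random.default_rng()

# Park place at x_p = (0, 0), directed along d_p = (-1, 0); two obstacles
# flank it sideways, each 1 m away from the park place border.
PARK_POS = np.array([0.0, 0.0])
PARK_DIR = np.array([-1.0, 0.0])

# Sensor layouts used in the experiments, given as (front, back, per side):
# 8 sensors -> (3, 3, 1), 12 sensors -> (3, 3, 3).
SENSOR_LAYOUTS = {8: (3, 3, 1), 12: (3, 3, 3)}

def sample_initial_pose():
    """Draw the initial car position and heading:
    x ~ U([5.0, 15.0] x [-5.0, 5.0]), psi ~ U(pi/2, 3*pi/2),
    i.e. a 180-degree range of initial headings."""
    x = rng.uniform(low=[5.0, -5.0], high=[15.0, 5.0])
    psi = rng.uniform(0.5 * np.pi, 1.5 * np.pi)
    return x, psi
```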
Table 4: Results of preliminary experiments on parking with obstacles. Reward function coefficients: $(\lambda_d, \lambda_\varphi, \lambda_g) = (1, 32, 8)$; state representation: dv_flfrblbr2s_dag_invariant_sensors; batch size for experience replay: 262k.
no.  experiment    sensors    final frequency    final EMA of        frequency of
     hash code                of 'parked'        'parked' event      'parked'
                              event at           frequency at        event at
                              learning stage     learning stage      test stage
--------------------------------------------------------------------------------
episodes: 10k, NN: 9 × (256, 128, 64, 32)
 1   0809626551    (3, 3, 1)  32.29%             60.54%              42.3%
 2   0501599417    (3, 3, 3)  39.60%             68.66%              57.8%
episodes: 20k, NN: 9 × (256, 128, 64, 32)
 3   2914586007    (3, 3, 1)  63.72%             73.59%              80.6%
 4   2606558873    (3, 3, 3)  35.33%             40.71%              69.4%
episodes: 10k, NN: 9 × (512, 256, 128, 64)
 5   0726961302    (3, 3, 1)  21.70%             48.25%              56.4%
 6   4063022036    (3, 3, 3)  16.18%             35.23%              41.0%
episodes: 20k, NN: 9 × (512, 256, 128, 64)
 7   2831920758    (3, 3, 1)  38.41%             56.85%              62.0%
 8   1873014196    (3, 3, 3)  41.19%             59.48%              70.4%
Table 4 summarizes the results obtained in this
preliminary experiment (example trajectories are shown
in Fig. 12). Overall, the results are not fully satisfactory,
but neither are they discouraging. Six out of the 8 models
managed to perform more than 50% successful parking
maneuvers at the test stage in the presence of obstacles.
The best observed model (2914586007) achieved a
success rate of 80.6%. The troubling aspect is that
no clear tendencies can be seen in the results, which
makes them difficult to interpret. None of the tested
settings (smaller / larger NNs, fewer / more sensors,
fewer / more training episodes) seems to have a
clear impact on the final rates. Therefore, as mentioned
before, the general problem setting of parking with
obstacles is planned as our future research direction.
9 CONCLUSIONS AND FUTURE
RESEARCH
Within the framework of reinforcement learning, we
have studied a simplified variant of the parking prob-
lem (no obstacles present, but time regime imposed).
Learning agents were trained to park by means of
the double Q-learning algorithm and neural networks
serving as function approximators. In this context,
our main points of attention were parameterized reward
functions and state representations relevant for this
problem.
We have demonstrated that suitable proportions of
penalty terms in the reward function, coupled with
informative state representations, can translate into
accurate neural approximations of long-term action
values, and thereby into an efficient double Q-learning
procedure for a car parking agent. Using barely 10k
training episodes, we managed to obtain high success
rates at the testing stage. In the main set of experiments
(Section 7), that rate exceeded 95% for several models,
reaching 99% and 99.8% in two cases.
In the additional set of experiments (Section 8), we
showed that, using neural models of larger capacity,
the agent was able to learn to perform well in more
general scenes involving arbitrary initial positions and
rotations of both the park place and the car. In
particular, the agent learned to perform complicated and
interesting maneuvers, such as hairpin turns, rosette-
shaped turns, or zigzag patterns, without supervision,
i.e. without being explicitly instructed about the
trajectories of such maneuvers.
Our future research shall pertain to a general parking
problem with obstacles present in the scenes and
sensor information included in the state representation
(with the time regime preserved). Preliminary results for
such a problem setting (Section 8) indicate the need
for more experiments and analysis. Also, it seems
appropriate to conduct in our future work a compar-