
Table 3: Evaluation of a) CIL++, b) CILRL, c) CILRL++, and d) CILv3D on the CARLA Leaderboard.

Score                          CIL++     CILRL      CILRL++   CILv3D
Avg. Driving Score (%)          2.59       0.09      10.01     11.73
Avg. Route Completion (%)      10.93       0.29      14.44     15.12
Avg. Infraction Penalty         0.44       0.47       0.64      0.59
Collisions with Vehicles        0.0        0.0        0.0       0.0
Collisions with Pedestrians   256.21    2759.67      52.82    128.43
Collisions with Layout        411.01   14496.83     255.22    285.66
Red Light Infractions           9.34       0.0        7.98      8.83
Stop Sign Infractions           5.77     846.62       0.0       2.25
Off-Road Infractions          253.67    1457.57     186.47    177.53
Route Deviations                0.0      144.4       76.74     55.81
Route Timeouts                  0.0      330.96       0.0       0.0
Agent Blocked                 399      10985.68     222.65    198.69
CILv3D achieves the highest average driving score and route completion with less training time.
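For context, the driving score reported in Table 3 is the product of route completion and a multiplicative infraction penalty. The following minimal sketch illustrates this scoring scheme; the penalty coefficients shown are the commonly cited CARLA Leaderboard values and are assumptions here, not values taken from this paper.

    # Sketch of the CARLA Leaderboard scoring scheme. The penalty
    # coefficients below are assumed (standard leaderboard values);
    # the official evaluator defines the exact rules and edge cases.
    PENALTIES = {
        "collision_pedestrian": 0.50,
        "collision_vehicle": 0.60,
        "collision_layout": 0.65,
        "red_light": 0.70,
        "stop_sign": 0.80,
    }

    def driving_score(route_completion, infractions):
        """Per-route score = route completion (%) x infraction penalty."""
        penalty = 1.0
        for kind, count in infractions.items():
            penalty *= PENALTIES.get(kind, 1.0) ** count
        return route_completion * penalty

    # Example: 80% completion with one vehicle collision and one
    # red-light infraction yields 80 * 0.60 * 0.70 = 33.6.
    print(driving_score(80.0, {"collision_vehicle": 1, "red_light": 1}))

The leaderboard then averages this per-route score over all evaluation routes.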
6 CONCLUSION & FUTURE WORK
In this paper, we present CILv3D, which addresses several key limitations of CIL++ by refining the dataset collection process, incorporating multi-frame, sequential inputs, and utilizing a more advanced backbone model. Furthermore, we address generalization issues by exploring effective data augmentation techniques and injecting noise during training.
Our experimental results on several scenarios included in the CARLA Leaderboard indicate that CILv3D outperforms CIL++ and CILRL++ in terms of both overall driving score and route completion. Notably, CILv3D achieves these improvements without extensive fine-tuning techniques, such as DRL methods, resulting in reduced training time and computational requirements. Although CILv3D demonstrates promising results, several directions for future research could further enhance its navigation performance, such as a 360° view of the environment, which could potentially improve the controller's decisions in lane-change tasks.
REFERENCES
Baker, B., Akkaya, I., Zhokhov, P., Huizinga, J., Tang, J., Ecoffet, A., Houghton, B., Sampedro, R., and Clune, J. (2022). Video PreTraining (VPT): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov,
A., and Zagoruyko, S. (2020). End-to-end object de-
tection with transformers. In European conference on
computer vision, pages 213–229. Springer.
Codevilla, F., Santana, E., López, A. M., and Gaidon, A. (2019). Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9329–9338.
Dosovitskiy, A., Ros, G., Codevilla, F., López, A., and Koltun, V. (2017). CARLA: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., and Qiao, Y. (2023). UniFormer: Unifying convolution and self-attention for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10):12581–12600.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2020). Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128:336–359.
Song, Q., Liu, Y., Lu, M., Zhang, J., Qi, H., Wang, Z., and
Liu, Z. (2023). Autonomous driving decision control
based on improved proximal policy optimization al-
gorithm. Applied Sciences, 13(11):6400.
Tampuu, A., Matiisen, T., Semikin, M., Fishman, D., and
Muhammad, N. (2020). A survey of end-to-end driv-
ing: Architectures and training methods. IEEE Trans-
actions on Neural Networks and Learning Systems,
33(4):1364–1384.
Kochliaridis, V., Kostinoudis, E., and Vlahavas, I. (2024). Optimizing pretrained transformers for autonomous driving. In 13th EETN Conference on Artificial Intelligence (SETN). ACM.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Xiao, Y., Codevilla, F., Porres, D., and López, A. M. (2023). Scaling vision-based end-to-end autonomous driving with multi-view attention learning. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1586–1593. IEEE.
Zhang, Z., Liniger, A., Dai, D., Yu, F., and Van Gool, L.
(2021). End-to-end urban driving by imitating a re-
inforcement learning coach. In Proceedings of the
IEEE/CVF international conference on computer vi-
sion, pages 15222–15232.