
understanding of the interplay between DL algorithms and human processing styles, and potentially drive the development of future DL algorithms that better mirror human visual processing.
5 CONCLUSIONS
Our study focused on the extraction and comparison of machine-generated and human-generated heatmaps in the context of urban obstacle detection. The experiments utilized a diverse set of DL models fine-tuned on images depicting various pavement obstacles that affect pedestrian safety. We employed the Grad-CAM algorithm to extract machine-generated heatmaps, visualizing the features the models learned during obstacle detection. These heatmaps were systematically compared with human-generated heatmaps obtained through eye-tracking experiments involving 35 participants. The resulting visual dissimilarity values quantified how closely the machine-generated heatmaps align with human visual attention, and ViT-B/16 demonstrated the closest resemblance to the human heatmaps. Its superior performance prompts further investigation into the specific architectural elements that contribute to this alignment with human perception.
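To make this pipeline concrete, the sketch below shows one way such heatmaps can be extracted and compared: a manual Grad-CAM implementation on a ResNet-style convolutional model (chosen for simplicity; applying Grad-CAM to ViT-B/16 additionally requires a transformer-specific target layer and token reshaping), followed by a simple dissimilarity score. This is a minimal illustration rather than our exact experimental configuration: the random-weight model, target layer, input size, and the 1 − Pearson-correlation metric are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Illustrative setup: a random-weight ResNet-50 stands in for a fine-tuned model.
model = models.resnet50(weights=None).eval()

# Cache the target layer's activations and gradients via hooks.
feats, grads = {}, {}
target_layer = model.layer4  # last convolutional block, a common Grad-CAM choice
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

def grad_cam(x, class_idx=None):
    """Return a Grad-CAM heatmap in [0, 1], upsampled to the input size."""
    logits = model(x)
    idx = int(logits.argmax(1)) if class_idx is None else class_idx
    model.zero_grad()
    logits[0, idx].backward()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # global-average-pooled grads
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    cam = cam - cam.min()
    return (cam / cam.max().clamp(min=1e-8)).squeeze().detach()

def dissimilarity(machine_map, human_map):
    """1 - Pearson correlation between flattened heatmaps; one common way to
    compare saliency maps (the paper's exact measure may differ)."""
    m = machine_map.flatten() - machine_map.mean()
    h = human_map.flatten() - human_map.mean()
    return float(1.0 - (m @ h) / (m.norm() * h.norm()).clamp(min=1e-8))

# Stand-ins for a preprocessed pavement image and an aggregated fixation map.
image = torch.randn(1, 3, 224, 224)
human_map = torch.rand(224, 224)
print(dissimilarity(grad_cam(image), human_map))
```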
By revealing how these models attribute significance within images, we can better understand and trust their outputs. If machine learning
models are designed to more closely resemble human
perception, their decision-making processes may be-
come inherently more understandable, sharing com-
mon ground with recognized human cognitive patterns.
Such an approach could not only improve the inter-
pretability of individual models, but also contribute to
a broader understanding of how to design models that
are both accurate and explainable, which is a signif-
icant goal in the field of artificial intelligence. Similarly, for image interpretation tasks where humans are more accurate, network architectures that resemble human perception could yield more accurate results, whereas for tasks where human performance is inferior, such architectures should be avoided. The
findings pave the way for the development of more
explainable and more accurate models.
This paper presents the preliminary results of our work, which lay the foundations for further investigation. Our future research plans include extracting and comparing the heatmaps of additional DL architectures, as well as investigating how the extracted results can be used to improve the accuracy and explainability of the generated models.
ACKNOWLEDGEMENTS
This project has received funding from the European
Union’s Horizon 2020 research and innovation pro-
gramme under grant agreement No 739578 comple-
mented by the Government of the Republic of Cyprus
through the Directorate General for European Pro-
grammes, Coordination and Development.
REFERENCES
Blascheck, T., Kurzhals, K., Raschke, M., Burch, M.,
Weiskopf, D., and Ertl, T. (2014). State-of-the-Art
of Visualization for Eye Tracking Data. In EuroVis -
STARs, page 29.
Chinu and Bansal, U. (2023). Explainable AI: To Reveal the
Logic of Black-Box Models. New Generation Comput-
ing.
Doherty, A. R., Hodges, S. E., King, A. C., Smeaton, A. F.,
Berry, E., Moulin, C. J. A., Lindley, S., Kelly, P., and
Foster, C. (2013). Wearable Cameras in Health: The
State of the Art and Future Possibilities. American
Journal of Preventive Medicine, 44(3):320–323.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D.,
Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M.,
Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N.
(2021). An Image is Worth 16x16 Words: Transform-
ers for Image Recognition at Scale. In International
Conference on Learning Representations (ICLR 2021).
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Resid-
ual Learning for Image Recognition.
International Transport Forum (2012). Pedestrian Safety, Ur-
ban Space and Health. ITF Research Reports. OECD.
Kaufmann, V., Bergman, M. M., and Joye, D. (2004). Motil-
ity: Mobility as Capital. International Journal of Ur-
ban and Regional Research, 28(4):745–756.
Lee, K., Sato, D., Asakawa, S., Kacorri, H., and Asakawa, C.
(2020). Pedestrian Detection with Wearable Cameras
for the Blind: A Two-way Perspective. In Proceedings
of the 2020 CHI Conference on Human Factors in
Computing Systems, CHI ’20, pages 1–12. Association
for Computing Machinery.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S.,
and Guo, B. (2021). Swin Transformer: Hierarchical
Vision Transformer using Shifted Windows.
Maeda, H., Sekimoto, Y., Seto, T., Kashiyama, T., and
Omata, H. (2018). Road Damage Detection and Classi-
fication Using Deep Neural Networks with Smartphone
Images. Computer-Aided Civil and Infrastructure En-
gineering, 33(12):1127–1141.
Prabu, A., Shen, D., Tian, R., Chien, S., Li, L., Chen, Y.,
and Sherony, R. (2022). A Wearable Data Collection
System for Studying Micro-Level E-Scooter Behavior
in Naturalistic Road Environment.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks.