Compared to early two-stage detectors, which had intermediate steps, e.g. for selecting region proposals that were then fed into a second neural network, the advantage is a massive speed increase (and only a minor loss in accuracy), since these intermediate steps take time as they run on the CPU.
In this article, we demonstrated and visualized the inner mechanics of the YOLO architecture. Our key message is that YOLO is not really “looking once”, but many times. Thanks to a clever exploitation of artificial neural network structures, which makes it possible to share most of the computation between regions and to easily parallelize the computations on a GPU, this can still be very fast and efficient.
Our findings might be used for future developments in architecture design or for evaluating trained models. Interesting future work includes improving these visualization techniques. Furthermore, the proposed detection dream approach might be used to determine how much information about a training image is actually stored “within the weights of the network”, by setting the target output to the actual prediction output of the model and optimizing from a gray image. In addition, other constraints can be added to the optimization loss: by including the distance to a color histogram in the loss function, the reconstructed images might be improved to have a more realistic color distribution.
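As a rough illustration of this detection dream variant, the sketch below is an assumption of how it could be set up, not the authors' implementation: model stands for any differentiable detector returning a raw output tensor, and all hyperparameters are illustrative. An initially gray image is optimized so that the frozen detector reproduces a previously recorded prediction; the commented-out line marks where a color-statistics term could be added to the loss.

```python
# Hedged sketch of the "detection dream" idea (assumed implementation).
import torch

def detection_dream(model, target_output, image_shape=(1, 3, 416, 416),
                    steps=200, lr=0.05):
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)               # keep the detector frozen

    x = torch.full(image_shape, 0.5, requires_grad=True)  # start from gray
    optimizer = torch.optim.Adam([x], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        output = model(x)
        loss = torch.nn.functional.mse_loss(output, target_output)
        # Optional extra constraint (assumption): pull the reconstruction's
        # color statistics towards a reference for more realistic colors, e.g.
        # loss = loss + 0.1 * (x.mean(dim=(2, 3)) - ref_color).pow(2).sum()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)                # keep a valid pixel range
    return x.detach()

# Illustrative usage: record the model's own prediction for a training image,
# then reconstruct an input that reproduces it from scratch.
# target = model(training_image).detach()
# reconstruction = detection_dream(model, target)
```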
ACKNOWLEDGEMENTS
This work was supported by a fellowship within the IFI program of the German Academic Exchange Service (DAAD).