Table 7: Comparing per-class Average Precision (AP) and mAP of VRUNet, SSD, and Faster R-CNN.

Model           Motorbike  Bicycle  Dog     Mobility Aids  Stroller  Wheelchair  Mobility Scooter  mAP
VRUNet (ours)   0.747      0.877    0.753   0.855          0.895     0.832       0.179             0.734
SSD             0.5528     0.794    0.6814  0.8272         0.8121    0.781       0.0549            0.6433
Faster R-CNN    0.002      0.8154   0.742   0.897          0.7613    1.000       0.000             0.6025
Table 8: Comparing Processing Time per Image of VRUNet vs. Faster R-CNN and SSD.

Method          Processing Time (s)
VRUNet (ours)   0.0275
Faster R-CNN    0.125
SSD             0.0153
types of vulnerable users (i.e., motorbike, bicycle, dog, stroller, and mobility scooter). Our model's motorbike AP is 19.42 percentage points higher than SSD's and 74.5 points higher than Faster R-CNN's. We also notice that the AP of the mobility scooter class is much lower than that of the other classes for all three models (see Fig. 6). The scarcity of mobility scooter samples is the main reason: our dataset currently contains only 56 mobility scooter samples. Nevertheless, our model still produces the highest AP for this class, demonstrating that it can partially mitigate the effect of scarce training samples.
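To make the relationship between the per-class APs and the reported mAP explicit, the following minimal Python sketch recomputes the mAP column of Table 7 as the unweighted mean of the seven per-class APs. The values are copied from the table, and the dictionary layout is our own illustration, not the authors' code.

# Minimal sketch: mAP as the unweighted mean of the per-class APs in Table 7.
# Class order: motorbike, bicycle, dog, mobility aids, stroller, wheelchair,
# mobility scooter. Values are copied from the table, not recomputed.
aps = {
    "VRUNet":       [0.747, 0.877, 0.753, 0.855, 0.895, 0.832, 0.179],
    "SSD":          [0.5528, 0.794, 0.6814, 0.8272, 0.8121, 0.781, 0.0549],
    "Faster R-CNN": [0.002, 0.8154, 0.742, 0.897, 0.7613, 1.0, 0.0],
}

for model, class_aps in aps.items():
    map_score = sum(class_aps) / len(class_aps)
    print(f"{model}: mAP = {map_score:.4f}")
# Prints 0.7340, 0.6433, and 0.6025, matching the mAP column of Table 7.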
In assessing an object detection method's performance, processing time is an important metric that indicates suitability for different requirements (e.g., near real-time, real-time, or offline applications). On average, the proposed VRUNet takes 0.0275 s per image to detect objects of interest, whereas the Faster R-CNN and SSD methods take 0.125 s and 0.0153 s, respectively. VRUNet is thus around 4.55x faster than Faster R-CNN, but slightly (1.8x) slower than SSD. One-stage models are known to process images faster than two-stage models; nevertheless, our two-stage model achieves a processing time close to that of the one-stage SSD. Effectively, the proposed VRUNet runs at a speed of at least 36 frames per second on the computing platform used in this work.
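As a quick sanity check on these figures (an illustrative computation of our own, not from the paper), the frame rate and the speedup factors follow directly from the per-image times in Table 8:

# Illustrative check of the reported speed figures (times from Table 8, in seconds).
t_vrunet, t_frcnn, t_ssd = 0.0275, 0.125, 0.0153

print(f"VRUNet throughput: {1 / t_vrunet:.1f} fps")             # ~36.4 fps
print(f"Speedup over Faster R-CNN: {t_frcnn / t_vrunet:.2f}x")  # ~4.55x
print(f"Slowdown relative to SSD: {t_vrunet / t_ssd:.2f}x")     # ~1.80x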
5 CONCLUSION
In this paper, we proposed a two-stage Convolutional Neural Network (CNN)-based VRU detection and recognition framework called VRU-Net. We considered seven types of VRUs (mobility scooters, wheelchairs, strollers, mobility aids, motorbikes, bicycles, and dogs) to detect at road intersections. In the first stage of VRU-Net, we predict only the grid cells that most likely contain a VRU of interest; the second-stage CNN then classifies the predicted grid-cell regions by type.
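For illustration, the sketch below outlines the two-stage inference flow just described, under our own assumptions: grid_proposal_net and classifier_net stand in for the two stages, and the grid size, score threshold, and crop_cell helper are hypothetical choices, not the authors' implementation.

# Hypothetical sketch of the two-stage inference flow; module names, grid
# size, and threshold are illustrative assumptions, not the authors' code.
import numpy as np

VRU_CLASSES = ["mobility scooter", "wheelchair", "stroller", "mobility aids",
               "motorbike", "bicycle", "dog"]

def crop_cell(image, row, col, rows=8, cols=8):
    # Hypothetical helper: extract the (row, col) cell of a rows x cols grid.
    h, w = image.shape[:2]
    ch, cw = h // rows, w // cols
    return image[row * ch:(row + 1) * ch, col * cw:(col + 1) * cw]

def detect_vrus(image, grid_proposal_net, classifier_net, threshold=0.5):
    # Stage 1: score every grid cell for the presence of a VRU.
    cell_scores = grid_proposal_net(image)  # assumed shape: (rows, cols)
    detections = []
    # Stage 2: classify only the cells whose score exceeds the threshold.
    for row, col in zip(*np.where(cell_scores > threshold)):
        region = crop_cell(image, row, col)
        class_probs = classifier_net(region)  # one probability per VRU class
        label = VRU_CLASSES[int(np.argmax(class_probs))]
        detections.append(((int(row), int(col)), label))
    return detections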
We compared VRU-Net to two state-of-the-art models, SSD and Faster R-CNN. The proposed model achieves a 4.55x speedup over Faster R-CNN and runs at a speed of at least 36 frames per second on the computing platform used in this project. Also, VRU-Net's mAP is 13.2 percentage points higher than Faster R-CNN's and 9 points higher than SSD's. As future work, we plan to improve our model by considering special classes of VRUs and different weather and illumination conditions, which present unique challenges for detection and localization methods.