Table 7: Comparing per-class Average Precision (AP) and mAP of VRUNet, SSD, and Faster R-CNN.
Method         Motorbike  Bicycle  Dog     Mobility Aids  Stroller  Wheelchair  Mobility Scooter  mAP
VRUNet (ours)  0.747      0.877    0.753   0.855          0.895     0.832       0.179             0.734
SSD            0.5528     0.794    0.6814  0.8272         0.8121    0.781       0.0549            0.6433
Faster R-CNN   0.002      0.8154   0.742   0.897          0.7613    1           0                 0.6025

Table 8: Comparing Processing Time of VRUNet vs Faster R-CNN and SSD.
Method         Processing Time (s)
VRUNet (ours)  0.0275
Faster R-CNN   0.125
SSD            0.0153

types of vulnerable users (i.e., motorbike, bicycle,
dog, stroller, and mobility scooter). The motorbike AP
of our model is 19.42 percentage points higher than that
of SSD and 74.5 percentage points higher than that of
Faster R-CNN. We also notice that the APs of the mobility
scooter class are much lower than those of the other
classes for all three models (see Fig. 6). The main reason
is the lack of mobility scooter samples: our dataset
currently contains only 56 of them. Nevertheless, our
model still produces the highest AP for this class,
suggesting that it can slightly reduce the effect of
scarce training samples.
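
The comparisons above can be checked directly against Table 7. The following Python sketch is only an illustration. It assumes that the mAP column is the unweighted mean of the seven per-class APs (which agrees with the tabulated values) and that the reported gaps are absolute differences in percentage points. The names `AP`, `mean_ap`, `ap_gap_points`, and `best_model` are hypothetical helpers introduced here, not part of the VRUNet implementation.

```python
# Per-class APs copied from Table 7, in the table's column order.
CLASSES = [
    "Motorbike", "Bicycle", "Dog", "Mobility Aids",
    "Stroller", "Wheelchair", "Mobility Scooter",
]
AP = {
    "VRUNet": [0.747, 0.877, 0.753, 0.855, 0.895, 0.832, 0.179],
    "SSD": [0.5528, 0.794, 0.6814, 0.8272, 0.8121, 0.781, 0.0549],
    "Faster R-CNN": [0.002, 0.8154, 0.742, 0.897, 0.7613, 1.0, 0.0],
}


def mean_ap(per_class_aps):
    """Unweighted mean of the per-class APs (assumed definition of mAP)."""
    return sum(per_class_aps) / len(per_class_aps)


def ap_gap_points(model_a, model_b, cls):
    """AP of model_a minus AP of model_b for one class, in percentage points."""
    i = CLASSES.index(cls)
    return 100.0 * (AP[model_a][i] - AP[model_b][i])


def best_model(cls):
    """Name of the model with the highest AP for the given class."""
    i = CLASSES.index(cls)
    return max(AP, key=lambda model: AP[model][i])


if __name__ == "__main__":
    for model, aps in AP.items():
        print(f"{model}: mAP = {mean_ap(aps):.4f}")
    print("Motorbike gap vs SSD:", round(ap_gap_points("VRUNet", "SSD", "Motorbike"), 2))
    print("Classes where VRUNet is best:",
          [c for c in CLASSES if best_model(c) == "VRUNet"])
```

Under these assumptions, the classes for which `best_model` returns VRUNet are the five listed in the text, and the motorbike gaps match the reported 19.42 and 74.5 points.
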
In assessing an object detection method's performance,
processing time is an important metric that indicates
suitability for different requirements (e.g., near
real-time, real-time, or offline applications). On
average, the proposed VRUNet takes 0.0275 s per image
to detect objects of interest, whereas Faster R-CNN and
SSD take 0.125 s and 0.0153 s, respectively. VRUNet is
therefore around 4.55× faster than Faster R-CNN, but
slightly (1.8×) slower than SSD. One-stage models are
generally known to run faster than two-stage models.
Nevertheless, our two-stage model achieves a processing
time close to that of the one-stage SSD model.
Effectively, the proposed VRUNet performs at speeds of
at least 36 frames per second on the computing platform
used in this work.
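
The speedup factors and the frame rate quoted above follow from the per-image times in Table 8. The short Python sketch below shows the arithmetic. It assumes that speedup is the ratio of mean per-image processing times and that the frame rate is the reciprocal of the per-image time. `TIME_S`, `speedup`, and `frames_per_second` are hypothetical names used only for this illustration.

```python
# Mean per-image processing times in seconds, copied from Table 8.
TIME_S = {"VRUNet": 0.0275, "Faster R-CNN": 0.125, "SSD": 0.0153}


def speedup(baseline, method):
    """How many times faster `method` is than `baseline` (ratio of times)."""
    return TIME_S[baseline] / TIME_S[method]


def frames_per_second(method):
    """Images processed per second, assuming one image per forward pass."""
    return 1.0 / TIME_S[method]


if __name__ == "__main__":
    print(f"VRUNet vs Faster R-CNN: {speedup('Faster R-CNN', 'VRUNet'):.2f}x faster")
    print(f"SSD vs VRUNet: {speedup('VRUNet', 'SSD'):.1f}x faster")
    print(f"VRUNet frame rate: {frames_per_second('VRUNet'):.1f} fps")
```

The ratio 0.125 / 0.0275 rounds to 4.55, 0.0275 / 0.0153 rounds to 1.8, and 1 / 0.0275 is slightly above 36, consistent with the figures in the text.
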
In this paper, we proposed a two-stage Convolutional
Neural Network (CNN)-based framework for VRU detection
and recognition called VRU-Net. We considered seven
types of VRUs (mobility scooters, wheelchairs, strollers,
mobility aids, motorbikes, bicycles, and dogs) to detect
at road intersections. In the first stage of VRU-Net, we
predict only the grid cells that most likely contain a
VRU of interest. The second stage of the CNN then
classifies the predicted grid-cell regions by type. We
compared VRU-Net to two state-of-the-art models, SSD and
Faster R-CNN. The proposed model achieves a speedup of
4.55× over Faster R-CNN and performs at speeds of at
least 36 frames per second on the computing platform
used in this project. VRU-Net also achieves an mAP that
is 13.2 percentage points higher than that of Faster
R-CNN and 9 percentage points higher than that of SSD.
As future work, we plan to improve our model by
considering special classes of VRUs and different
weather and illumination conditions, which present
unique challenges for detection and localization
methods.
