4.3 Speed Comparison on Different Hardware Devices
To measure the speed of all the network systems, we ran them on three different hardware devices equipped with Nvidia GPUs (Fig. 7); a detailed running speed comparison is shown in Table 2, where MSG means multi-scale grouping and SSG denotes single-scale grouping (Qi et al., 2017b).

Table 2: Running speed comparison in frames per second (fps) on different Nvidia GPUs.

Methods       1080Ti    1050Ti    Jetson TX2
F-PNv1        70-100    20-30     5-15
F-PNv2 SSG    50-55     --        --
F-PNv2 MSG    15-20     --        --
OurFPN        80-105    25-30     10-15

From Table 2, we can see that our RGB-D modified Frustum PointNet is much faster than Frustum PointNet v2 and runs at almost the same speed as Frustum PointNet v1 across the different GPU devices, while achieving better accuracy than Frustum PointNet v1. Frustum PointNet v2 is slow in both its MSG and SSG variants even on our desktop PC equipped with a single Nvidia GTX 1080Ti GPU; we therefore did not port it to the mobile robot or the Jetson TX2 developer kit, which have less powerful GPUs.
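Frame rates of this kind are typically obtained by averaging per-frame inference latency over many iterations after a short warm-up, synchronizing with the GPU before and after timing. The following is a minimal measurement sketch, assuming a PyTorch model; measure_fps, model, and sample_batches are our own illustrative names, not code from the systems compared.

import time
import torch

def measure_fps(model, sample_batches, warmup=10, runs=100):
    # model and sample_batches are hypothetical placeholders for the
    # detection network and its preprocessed input point clouds.
    model.eval()
    with torch.no_grad():
        # Warm-up so GPU clocks, caches, and autotuning settle.
        for i in range(warmup):
            model(sample_batches[i % len(sample_batches)])
        torch.cuda.synchronize()  # drain queued GPU work before timing
        start = time.perf_counter()
        for i in range(runs):
            model(sample_batches[i % len(sample_batches)])
        torch.cuda.synchronize()  # make sure all inference has finished
        elapsed = time.perf_counter() - start
    return runs / elapsed  # frames per second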
5 CONCLUSIONS
In this paper, we designed a new network system that combines useful 2D and 3D features for challenging object detection tasks. The original Frustum PointNet was trained and tested only on the KITTI benchmark and directly used 2D object detection results; its 3D PointNet sub-system did not reuse 2D information. Our simplified network system (OurFPN) achieves higher accuracy and faster speed than most current state-of-the-art 3D object detection networks on three different devices, as well as on the KITTI benchmark.
In the near future, we will optimize OurFPN to achieve higher precision and faster speed. We also intend to add a Lidar sensor as a complement, since the effective range of RGB-D cameras is limited and insufficient for outdoor street scene understanding in autonomous driving. We note that a Velodyne Lidar can capture points quite far away (e.g., 30 m), but its returns are sparse and unreliable at close range (e.g., 3 m); in contrast, RGB-D cameras work well for close objects but cannot detect far-away points. We believe that combining these two complementary types of sensor information will be an effective way to obtain better object detection results.
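As a rough illustration of this complementarity, the sketch below merges the two point clouds by range gating. It is a minimal sketch under our own assumptions: the function name, the 5 m and 8 m thresholds, and the requirement that both clouds are already registered in a common sensor frame are hypothetical choices, not part of any existing system.

import numpy as np

def fuse_point_clouds(rgbd_pts, lidar_pts, near_max=8.0, far_min=5.0):
    # rgbd_pts, lidar_pts: (N, 3) arrays in a common reference frame.
    # Keep RGB-D points where the depth camera is reliable (close range)
    # and Lidar points where its sparse returns are still useful (far
    # range); the overlap band [far_min, near_max] keeps both sources.
    # All thresholds are illustrative assumptions.
    rgbd_range = np.linalg.norm(rgbd_pts, axis=1)
    lidar_range = np.linalg.norm(lidar_pts, axis=1)
    near = rgbd_pts[rgbd_range <= near_max]
    far = lidar_pts[lidar_range >= far_min]
    return np.concatenate([near, far], axis=0)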
In addition, we will use the 3D bounding boxes output by our detector to help improve SLAM systems for UAVs, whereas existing object-based SLAM systems use only 2D image information.
REFERENCES
Chen, X., Ma, H., Wan, J., Li, B., and Xia, T. (2017). Multi-
view 3d object detection network for autonomous
driving. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR),
pages 1907–1915. IEEE.
Chen, Y., Liu, S., Shen, X., and Jia, J. (2019). Fast point r-
cnn. In Proceedings of the IEEE International Confer-
ence on Computer Vision (ICCV), pages 9775–9784.
IEEE.
Engelcke, M., Rao, D., Wang, D. Z., Tong, C. H., and Pos-
ner, I. (2017). Vote3deep: fast object detection in 3d
point clouds using efficient convolutional neural net-
works. In IEEE International Conference on Robotics
and Automation (ICRA), pages 1355–1361. IEEE.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready
for autonomous driving? the kitti vision benchmark
suite. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
3354–3361. IEEE.
Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE
International Conference on Computer Vision (ICCV),
pages 1440–1448. IEEE.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2961–2969. IEEE.
Johns, E., Leutenegger, S., and Davison, A. J. (2016). Pair-
wise decomposition of image sequences for active
multi-view recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 3813–3822. IEEE.
Kanezaki, A., Matsushita, Y., and Nishida, Y. (2018). Ro-
tationnet: joint object categorization and pose estima-
tion using multiviews from unsupervised viewpoints.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 5010–
5019. IEEE.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2125. IEEE.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu,
C.-Y., and Berg, A. C. (2016). Ssd: single shot multi-
box detector. In European Conference on Computer
Vision (ECCV), pages 21–37. Springer.
Qi, C. R., Liu, W., Wu, C., Su, H., and Guibas, L. J. (2018). Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 918–927. IEEE.