sparse, such that point clouds are converted to voxels
and then subsampled with certain chosen thresholds.
Next, random samples are selected and converted
to point-wise inputs for feature learning. Figure 4
shows the layers and the final output of the VoxelNet
architecture.
Figure 4: VoxelNet architecture (Zhou and Tuzel, 2017).
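The voxel grouping and random sampling described above can be sketched as follows; the voxel size and per-voxel point budget used here are illustrative assumptions, not VoxelNet's exact hyper-parameters.

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4), max_points_per_voxel=35):
    """Group a point cloud into sparse voxels, then randomly subsample
    each occupied voxel down to a fixed point budget (illustrative values)."""
    rng = np.random.default_rng(0)
    # Quantize each point's (x, y, z) coordinates to an integer voxel index.
    idx = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(np.int64)
    voxels = {}
    for i, key in enumerate(map(tuple, idx)):
        voxels.setdefault(key, []).append(i)
    # Cap each occupied voxel at max_points_per_voxel by random sampling.
    sampled = {}
    for key, members in voxels.items():
        members = np.asarray(members)
        if len(members) > max_points_per_voxel:
            members = rng.choice(members, max_points_per_voxel, replace=False)
        sampled[key] = points[members]
    return sampled

# Example: 1000 random points spread over a 10 m cube.
pts = np.random.default_rng(1).uniform(0.0, 10.0, size=(1000, 3))
vox = voxelize(pts)
```

Only occupied voxels are stored, which is what makes the representation sparse; the per-voxel point sets are then passed to the point-wise feature-learning layers.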
2.5 3D Object Detection Networks
Ever since the introduction of depth data from sensors
such as Microsoft Kinect, researchers have been trying
to incorporate depth data into computer vision problems.
Lin et al. (Lin et al., 2013) proposed a method for 3D
object detection using 2D segmentation and 3D geometry.
Song et al. (Song et al., 2015) proposed deep sliding
models for amodal 3D object detection. Schwarz et
al. (Schwarz et al., 2018) used multi-view RGB-D
cameras to perform heuristics on depth data for their
object-picking robot. Gupta et al. (Gupta et al., 2014)
introduced HHA encoding to train a Faster R-CNN
network on depth data. Gupta and Hoffman proposed
several transfer-learning techniques to transfer weights
from a 2D network to depth data (Hoffman et al.,
2016; Gupta et al., 2016).
2.6 Shortcomings of Prior Art
Most of the previously described methods use only
conventional 2D image datasets. NYUD provides
RGB-D data, but unfortunately the dataset exhibits
a very high density of cluttered objects, many of
which exhibit no depth features (curtains, windows,
etc.). With such a cluttered RGB-D dataset, when
training uses both RGB data and depth data, the depth
data contributes little to learning. For example, when
we previously tried to use the depth data of NYUD
for training, there were few features to learn from
depth, and preventing overfitting was challenging
because of the small number of images.
Regarding VoxelNet, the network is trained on a fixed
set of object classes (Car, Bike, etc.), so retraining
it for a different set of objects is impractical. For
both NYUD and VoxelNet, a large amount of 3D depth
data is required for conventional neural-network
training. Unfortunately, no 3D depth dataset currently
exists that is large enough to rival conventional
2D image datasets (ImageNet, etc.).
Ever since the publication of RGB-D datasets such as
NYUD and SUN RGB-D (Song et al., 2015), researchers
have been trying to incorporate depth data into the
learning and detection process. Methods range from
heuristics to encoding techniques such as HHA.
Researchers have also tried transfer-learning
techniques that train depth weights using ImageNet
datasets. All these techniques are computationally
intensive and difficult to move closer to hardware
such as an SoC, and they also make the resulting
systems more complex.
3 PROPOSED WORK
In this paper, we propose a DNN Object Detection
system that dissociates the depth-data from the RGB
data. In this way, our proposed system does not
need to require a large training dataset for the depth
data but still simultaneously extracts meaningful in-
formation from the depth sensor. Furthermore, our
system combines this depth data with conventional
2D-trained image data to generate a practical, low-
complexity object detection and classification system.
This paper is organized as follows. First, we give
an overview of our system, followed by test results
and a comparison with current state-of-the-art
systems, and we conclude with our future research
directions.
3.1 System Overview
Figure 5 shows an overview of the system architec-
ture. We designed our system with the understanding
that there currently exists no large dataset of depth
data for training. First, our system generates bound-
ing boxes from the depth sensor output by perform-
ing clustering on the 3D point cloud. After denois-
ing and cleaning of the clustered depth objects, we
use the clustered objects to split the 2D image into
sub-images, which are fed to a 2D-image deep neural
classification network. For Threshold Filtering, a
threshold on the classification top-1 score is set to
judge whether a classification result is acceptable:
if the top-1 score is greater than the threshold, the
output is accepted. The detailed operation of each
sub-system is explained below.
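As a minimal sketch of this flow, the following puts the stages in order; the function names, the 2D box format, and the threshold value are our illustrative assumptions, not the exact implementation.

```python
def threshold_filter(results, threshold=0.6):
    """Keep (label, score, box) results whose top-1 score clears the bar."""
    return [r for r in results if r[1] >= threshold]

def detect(image, point_cloud, cluster_fn, classify_fn, threshold=0.6):
    """Depth clustering proposes boxes; a 2D classifier labels each crop."""
    boxes = cluster_fn(point_cloud)                  # 3D clustering -> 2D boxes
    results = []
    for (x0, y0, x1, y1) in boxes:
        crop = [row[x0:x1] for row in image[y0:y1]]  # sub-image of the RGB frame
        label, score = classify_fn(crop)             # 2D DNN top-1 result
        results.append((label, score, (x0, y0, x1, y1)))
    return threshold_filter(results, threshold)
```

Here `cluster_fn` would wrap the point-cloud clustering, denoising, and cleaning steps, and `classify_fn` the 2D classification network; only crops whose top-1 score passes the threshold reach the output.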
FotonNet: A Hardware-efficient Object Detection System using 3D-depth Segmentation and 2D-deep Neural Network Classifier