for 3D object detection called PointPillars (Lang et al., 2019) and a state-of-the-art 2D object recognition algorithm, in terms of speed and accuracy on images, called YOLO (Redmon et al., 2016). We use a branch network based on PointNet (Qi et al., 2017), which is trained iteratively while in operation to improve the classification of LiDAR-based 3D-detected objects. The original PointNet network was designed to classify indoor 3D objects; in this context, however, we use it for outdoor object classification in AD/ADAS scenarios. Here, PointPillars, YOLO, and PointNet are interchangeable components that can be swapped in a plug-and-play manner with alternative state-of-the-art algorithms to improve the classification of 3D objects.
This research paper is organized as follows: Section 2 presents a detailed review of semi-supervised and weakly supervised algorithms involving LiDAR and camera for AD and ADAS applications. Section 3 describes the contribution of the current work. Section 4 presents the methodology of the proposed algorithm. Section 6 discusses the network architecture and implementation details. Data preparation details are presented in Section 7. Experimentation and results on the nuScenes dataset are discussed in Section 8, and applications are proposed in Section 9.1. The paper concludes with future scope in Section 10.
2 RELATED WORK
LiDAR-camera fusion has been a preferred approach for object detection and classification in active ADAS and AD systems. As mentioned in the earlier sections, LiDAR offers better localization while the camera offers better classification relative to each other. Hence, fusion not only increases the accuracy of detection and classification but also adds redundancy to the sensor setup. Several deep learning-based methods for LiDAR-camera fusion have been proposed in recent years. They can be classified into early and late fusion. Early fusion-based methods combine the sensor data at the initial stages (Qi et al., 2018; Chen et al., 2017; Ku et al., 2018).
In contrast, late fusion-based methods (Song and Xiao, 2016; Hoffman et al., 2016) process the sensor data separately to arrive at individual predictions. These predictions are then combined using various models to produce the final detection and classification.
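The distinction can be summarized by the short sketch below, where combine_raw, fused_model, lidar_model, camera_model, and combine_preds are generic placeholders assumed for illustration rather than components of any specific cited method.

    def early_fusion(lidar_points, image, fused_model, combine_raw):
        """Early fusion: merge sensor data first, then run a single joint model."""
        fused_input = combine_raw(lidar_points, image)   # e.g. decorate LiDAR points with image features
        return fused_model(fused_input)

    def late_fusion(lidar_points, image, lidar_model, camera_model, combine_preds):
        """Late fusion: run independent per-sensor models, then merge their predictions."""
        lidar_preds = lidar_model(lidar_points)          # independent 3D detections
        camera_preds = camera_model(image)               # independent 2D detections
        return combine_preds(lidar_preds, camera_preds)  # e.g. geometric association or voting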
Of particular interest in the context of the current work is the late fusion category of methods. Since LiDAR and camera data are processed separately, there exist two separate detection/classification models that are completely independent of each other. In general, both models are trained separately, each on its own independently annotated set of ground truth. In this work, however, we explore the possibility of exploiting the predictions of one sensor modality to generate classification labels, thereby eliminating the pre-training of the other sensor's model. Furthermore, such a method also helps with domain adaptation in unknown environments.
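One way such label transfer could look, shown here only as an assumed illustration of the idea and not as the paper's exact procedure, is to project each LiDAR-detected object into the image and adopt the class of the camera detection that contains it. The calibration inputs K (camera intrinsics) and T_cam_lidar (LiDAR-to-camera extrinsics) and the helper names are assumptions.

    import numpy as np

    def project_to_image(center_3d, K, T_cam_lidar):
        """Project a 3D box centre from the LiDAR frame to pixel coordinates (pinhole model)."""
        p_cam = T_cam_lidar @ np.append(center_3d, 1.0)  # LiDAR frame -> camera frame (homogeneous)
        uv = K @ p_cam[:3]                               # apply camera intrinsics
        return uv[:2] / uv[2]                            # perspective division -> (u, v)

    def pseudo_label(center_3d, boxes_2d, labels_2d, K, T_cam_lidar):
        """Adopt the class of the 2D detection whose box contains the projected 3D centre."""
        u, v = project_to_image(center_3d, K, T_cam_lidar)
        for (x1, y1, x2, y2), label in zip(boxes_2d, labels_2d):
            if x1 <= u <= x2 and y1 <= v <= y2:
                return label   # camera class becomes the weak label for the LiDAR object
        return None            # no overlapping camera detection: skip this object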
There have been very few works in this direction that exploit the redundancy across sensor domains to train the sensor models for detection and classification.
In (Kuznietsov et al., 2017), the authors predict depth from a single image using sparse LiDAR depth as ground truth together with unsupervised depth measurements from a stereo pair; here, LiDAR depth acts as ground truth for image-based depth estimation. Similarly, in (Caltagirone et al., 2019), the authors propose two classifiers acting on different views of the data that cooperatively and iteratively improve each other's performance using unlabelled examples. This method is among the top performers while using only a small amount of labelled data.
In (Buhler et al., 2020), the authors propose two architectures to learn common representations of LiDAR and camera data in the form of a 2D image, which is useful for feature matching algorithms. In (Yan et al., 2018), a human classifier is learnt directly from the deployment environment, removing the dependence on labelled data. This method tracks people by detecting legs in 2D LiDAR data and fusing these detections with faces or upper bodies detected by a camera, using a sequential implementation of the Unscented Kalman Filter (UKF). Depth estimation from a single mono camera is proposed in (Kumar et al., 2018); this method trains with sparse LiDAR data as ground truth for depth estimation with a fisheye camera. In (Teichman and Thrun, 2012), a classifier is trained with limited training data, and the predicted labels are propagated across frames and used again for training.
As can be seen from this discussion, weakly supervised online training of detection and classification models has been used, to a certain extent, for dense depth estimation and object detection.
3 CONTRIBUTION
As can be seen from the previous section, there exists a series of methods that exploit redundancy across sensor domains to propose weakly supervised detection/classification algorithms for various applications.