and 3D box annotations. Such 3D annotations are
available in datasets for autonomous driving such
as KITTI (Geiger et al., 2012), but the in-car cam-
era viewpoints are different from typical road-side
surveillance viewpoints. We propose a novel anno-
tation processing chain that converts existing 2D box
labels to 3D boxes using labeled scene information.
This labeled scene information is derived from 3D
geometry aspects of the scene, such as the vanishing
points per region and the estimated vanishing points
per vehicle depending on the vehicle orientation. The
use of vanishing points combined with camera cali-
bration parameters enables deriving 3D boxes captur-
ing the car geometry in a realistic setting. For validat-
ing the concept, the existing KM3D (Li, 2020) CNN
object detector is evaluated on our automatically an-
notated traffic surveillance dataset to investigate the
effect of different processing configurations, in order
to optimize the system into a final pipeline.
The remainder of this paper is structured as fol-
lows. First, a literature overview is given in Section 2.
Second, the 3D box annotation processing chain is
elaborated in Section 3. The results of the experi-
ments are presented in Section 4, where the optimal
pipeline is also determined. Section 5 discusses the
conclusions.
2 RELATED WORK
There are two categories of 3D object detectors for
monocular camera images: (A) methods employing geometric
constraints and labeled scene information, and (B) methods that
directly estimate the 3D box from a single 2D image.
A. Geometric Methods. In (Dubská et al., 2014),
the authors propose an automatic camera calibration
from vehicles in video. This calibration and vanishing
points are then used to convert vehicle masks obtained
by background modeling to 3D boxes. They assume
a single set of vanishing points per scene, as the road
surfaces in the scenes are straight and the camera posi-
tion is stationary. The authors of (Sochor et al., 2018)
improve vehicle classification using information from
3D boxes. Because their method works on static 2D
images, where motion information for estimating the
object orientation is absent, they propose a CNN to
estimate the vehicle orientation. With the vehicle
orientation and camera calibration, the 3D box is esti-
mated based on the work of (Dubská et al., 2014). In our
case, we adopt the 3D box generation from Dubská et
al., but instead of using the single road direction (per
scene) for each vehicle, we calculate the orientation
for each vehicle independently. This enables extending
the single-road case to scenes with multiple vehi-
cle orientations, such as road crossings, roundabouts
and curved roads.
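As a minimal sketch of this per-vehicle orientation idea (our own illustration with assumed calibration values, not code from the cited works), the vanishing point of a vehicle's driving direction follows directly from the calibration: for a pinhole camera with intrinsics K and world-to-camera rotation R, a ground-plane heading with yaw angle theta projects to the image point K R d, with d = (cos theta, sin theta, 0).

    import numpy as np

    def vehicle_vanishing_point(K, R, yaw):
        # Vanishing point (in pixels) of the driving direction of a vehicle
        # whose heading on the flat ground plane is given by the yaw angle.
        d_world = np.array([np.cos(yaw), np.sin(yaw), 0.0])  # heading direction
        v = K @ (R @ d_world)                                 # homogeneous image point
        return v[:2] / v[2]

    # Hypothetical calibration of a road-side camera tilted down by 20 degrees.
    K = np.array([[1000.0, 0.0, 960.0],
                  [0.0, 1000.0, 540.0],
                  [0.0, 0.0, 1.0]])
    pitch = np.deg2rad(20.0)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(pitch), -np.sin(pitch)],
                  [0.0, np.sin(pitch), np.cos(pitch)]])
    print(vehicle_vanishing_point(K, R, yaw=np.deg2rad(30.0)))

In the single-road setting of (Dubská et al., 2014), one yaw value per scene suffices, whereas computing the yaw per vehicle yields a separate vanishing point per vehicle, which is what allows handling crossings, roundabouts and curved roads.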
B. Direct Estimation by Object Detection. Early
work uses a generic 3D vehicle model that is pro-
jected to 2D given the vehicle orientation and cam-
era calibration and then matched to the image. The
authors of (Sullivan et al., 1997) use a 3D wire-frame
model (mesh) and match it to detected edges in the
image, and Nilsson and Ardö (Nilsson and Ardö,
2014) use foreground/background segmentation. Ve-
hicle matching is sensitive to the estimated vehicle
position, the scale/size of the model, and the vehicle
type (station wagon vs. hatchback). Histogram of Ori-
ented Gradients (HOG) detection (Dalal and Triggs, 2005) gen-
eralizes the viewpoint-specific wire-frame model to a
single detection model. Wijnhoven et al. (Wijnhoven
and de With, 2011) divide this single detection model
into separate viewpoint-dependent models.
State-of-the-art CNN detectors generalize these models
into a multi-layer detection system. Here, the vehicle and
its 3D pose are estimated directly from the 2D im-
age. Most methods use two separate CNNs or ad-
ditional layers. Depth information is used as an ad-
ditional input to segment the vehicles in the 2D im-
age in (Brazil and Liu, 2019; Ding et al., 2020; Chen
et al., 2016; Cai et al., 2020). Although the depth
is estimated from the same 2D images, it requires an
additional depth-generation algorithm. Mousavian et
al. (Mousavian et al., 2017) use an existing 2D CNN
detector and add a second CNN to estimate the ob-
ject orientation and dimensions. The 3D box is then
estimated as the best fit in the 2D box, given the ori-
entation and dimensions. RTM3D (Li et al., 2020)
and KM3D (Li, 2020) estimate 3D boxes from the 2D
image directly with a single CNN. They utilize Cen-
terNet with a stacked hourglass architecture to find
8 keypoints and the object center, which define the 3D
box. Whereas RTM3D utilizes the 2D/3D geometric
relationship to recover the dimensions, location, and
orientation in 3D space, KM3D estimates these values
directly, which is faster and can be jointly optimized.
We adopt the KM3D model as the 3D box detector
because it is fast and can be trained end-to-end on 2D
images only.
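To make the keypoint parameterization concrete, the sketch below (our own illustration with assumed values, not the KM3D implementation) shows the forward relation that such detectors invert: given the dimensions, location and yaw of a box in camera coordinates, its 8 corners are projected into the image with the intrinsics K, and the detector regresses these projected keypoints together with the 3D parameters from a single image.

    import numpy as np

    def project_box_corners(K, dims, location, yaw):
        # Project the 8 corners of a 3D box into the image (KITTI-style camera
        # coordinates: x right, y down, z forward; box origin at its bottom center).
        h, w, l = dims
        x = np.array([ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2])
        y = np.array([ 0.0,  0.0,  0.0,  0.0,   -h,   -h,   -h,   -h])
        z = np.array([ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2])
        R = np.array([[ np.cos(yaw), 0.0, np.sin(yaw)],
                      [ 0.0,         1.0, 0.0        ],
                      [-np.sin(yaw), 0.0, np.cos(yaw)]])
        corners = R @ np.vstack([x, y, z]) + np.asarray(location).reshape(3, 1)
        uvw = K @ corners                      # homogeneous image points
        return (uvw[:2] / uvw[2]).T            # 8 x 2 pixel coordinates (keypoints)

    # Illustrative KITTI-like intrinsics and a car-sized box 20 m in front of the camera.
    K = np.array([[721.5, 0.0, 609.6],
                  [0.0, 721.5, 172.9],
                  [0.0, 0.0, 1.0]])
    print(project_box_corners(K, dims=(1.5, 1.6, 3.9), location=(2.0, 1.6, 20.0), yaw=0.3))

RTM3D recovers the 3D parameters from such detected keypoints by solving this inverse geometric problem, whereas KM3D regresses them directly, avoiding the post-processing step.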
3 SEMI-AUTOMATIC 3D
DATASET GENERATION
This section presents several techniques to semi-
automatically estimate 3D boxes from scene knowl-
edge and existing 2D annotated datasets for traffic
surveillance. Estimating a 3D box in an image is car-
ried out in the following four steps (see Figure 2).