tally, the pictures are rectified.
2. A similarity measure between each pixel in one
image and each pixel on the same horizontal line
in the other image is calculated.
3. The similarity measure is used to match pixels of
both images. Due to ambiguities, wrong matches
can easily occur if only the similarity measure is
taken into account. This makes an additional cost
function and complex optimization necessary.
Different similarity measures have been utilized. A
very effective one is a convolutional neural network,
which outperforms traditional measures such as the sum of
absolute differences, the census transform, and normalized
cross-correlation (Zbontar and LeCun, 2016).
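The second step above can be sketched with a classical matching cost such as SAD; the function names are illustrative, and the pixelwise (unaggregated) cost is a simplification, assuming rectified grayscale inputs:

```python
import numpy as np

def sad_cost_volume(left, right, max_disp):
    """Pixelwise sum-of-absolute-differences cost for a rectified pair.

    left, right: (H, W) grayscale arrays. After rectification the
    epipolar lines are the image rows, so candidate matches for
    left[y, x] are right[y, x - d] for d = 0 .. max_disp - 1.
    Returns a cost volume of shape (max_disp, H, W); pixels without a
    candidate keep an infinite cost.
    """
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, :w - d])
    return cost

def disparity_wta(cost):
    """Winner-takes-all matching (step 3 without regularization)."""
    return np.argmin(cost, axis=0)
```

In practice the absolute differences would be aggregated over a patch, and the winner-takes-all step replaced by the additional cost function and optimization mentioned above.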
This gave rise to the idea of integrating the second
and third steps into a single neural network, of which
different architectures exist. In (Dosovitskiy et al.,
2015) a network for estimating optical flow is proposed;
that is, a neighborhood around each pixel must be searched
across consecutive frames of a video sequence.
If the left and right pictures of a stereo camera are used
as input, the flow corresponds to disparity. The idea
was therefore adapted to stereo images in (Mayer et al.,
2016), where a correlation layer is used to account
for the epipolar geometry. In (Kendall et al., 2017)
the idea of a cost volume is introduced. First, feature
maps for the left and right images are calculated in
the network. Then feature maps of one image at different
disparity levels are stacked on top of the feature
maps of the other image. This approach embeds the
epipolar geometry. In (Smolyanskiy et al., 2018) a
similar network with a semi-supervised training procedure
is implemented, which can run in near real time.
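The cost-volume construction of (Kendall et al., 2017) can be sketched in NumPy; the concatenation layout below is one common variant, and the function name is an illustrative assumption:

```python
import numpy as np

def build_cost_volume(feat_l, feat_r, max_disp):
    """Stack right-image features at each disparity level onto the
    left-image features.

    feat_l, feat_r: (C, H, W) feature maps from a shared encoder.
    Returns a 4D volume of shape (2C, max_disp, H, W): at disparity d
    the right features are shifted by d pixels along the row axis, so
    the epipolar geometry is embedded in the volume itself.
    """
    c, h, w = feat_l.shape
    vol = np.zeros((2 * c, max_disp, h, w), dtype=feat_l.dtype)
    for d in range(max_disp):
        vol[:c, d] = feat_l
        vol[c:, d, :, d:] = feat_r[:, :, :w - d]
    return vol
```

In the network such a volume is then processed further (e.g. by 3D convolutions); only the geometric stacking is shown here.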
Apart from FlowNet (Dosovitskiy et al., 2015), all
these methods search through disparity space, which
corresponds to a search along the cameras' viewing
direction. This creates two problems in the targeted
application:
1. Since the cameras are mounted behind the windshield,
the angle between the cameras' viewing direction
and the road is small. For depth estimation
of the road surface, many small steps through
disparity space are therefore necessary to achieve
a high depth resolution of the surface.
2. The other problem is the rectification of the images.
To achieve a high depth resolution, the baseline
between the cameras has to be large. This in turn
requires the cameras to be tilted inwards in order
to obtain an overlap of the images in the region of
interest. In this case, rectification can degrade
image quality due to the required stretching and
interpolation.
Both problems are solved by the plane sweep approach,
which was first introduced in (Collins, 1996):
A virtual plane is placed at arbitrary positions in 3D
space. Features of both images are projected onto the
plane and match if the plane's position is correct. In
(Yang et al., 2003) this approach is used to warp entire
images for dense scene reconstruction.
In this work, a neural network is extended by a
plane sweep approach, in which feature maps are warped
by a plane homography inside the network. By
projecting the feature map of one camera onto the
plane and into the other camera, rectification becomes
unnecessary, and, more importantly, by sweeping the
virtual plane from below to above the road surface,
the search space is reduced. The network is trained
on a dataset that was created by a plane sweep
approach in conjunction with semi-global matching.
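The sweep idea can be sketched by composing the per-camera plane homographies of Eq. (1) in Section 3.1; the composition H_R H_L^{-1}, the function names, and the test geometry are illustrative assumptions (the paper applies the resulting warps to feature maps inside the network, not to raw pixels):

```python
import numpy as np

def plane_h(K, R, t, z):
    """Plane-induced homography H = K [r_1  r_2  z r_3 + t]."""
    return K @ np.column_stack((R[:, 0], R[:, 1], z * R[:, 2] + t))

def sweep_warps(K_l, R_l, t_l, K_r, R_r, t_r, heights):
    """One left-to-right pixel mapping per swept plane position.

    For each plane distance z, left image coordinates are
    back-projected onto the plane (inverse homography) and
    re-projected into the right camera; pixels that actually lie on
    that plane land on their correspondences, so no rectification
    of the image pair is needed.
    """
    return [plane_h(K_r, R_r, t_r, z) @ np.linalg.inv(plane_h(K_l, R_l, t_l, z))
            for z in heights]
```

Sweeping `heights` over a narrow band below and above the assumed road surface restricts the search space as described above.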
3 METHOD
The method described in this paper consists of an
existing convolutional neural network for disparity
estimation, which is modified to estimate the change in
surface elevation by a plane sweep approach. The plane
sweep direction is perpendicular to the road surface,
and therefore a plane must be found that represents
the mean road surface. This plane is initially guessed
and refined later on.
In this section, first the idea of plane sweep and
its usage is described. Next, the convolutional neural
network on which this work is based is briefly
recapitulated. Subsequently, the embedding of plane
sweep into this network is described. Finally, it is
shown how the mean surface can be found.
3.1 Plane Sweep
The left camera image of a plane that is parallel
to the x-y-plane is calculated by the 2D homography
(Collins, 1996)

$$H_{L,i} = K_L \begin{pmatrix} r_{L,1} & r_{L,2} & z_i\, r_{L,3} + t_L \end{pmatrix}. \quad (1)$$
$K_L$ is the camera matrix. The camera pose is given
by the columns $r_{L,1}, r_{L,2}, r_{L,3}$ of the rotation matrix and the
translation vector $t_L$. $z_i$ is the distance between the
x-y-plane and the parallel plane.
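Eq. (1) can be written down directly in NumPy; this is a minimal sketch, assuming $K_L$, the rotation matrix, and $t_L$ are given as arrays, and the function name is illustrative:

```python
import numpy as np

def plane_homography(K, R, t, z_i):
    """Homography of Eq. (1): maps plane coordinates (x, y, 1) on the
    plane z = z_i (parallel to the x-y-plane) into the camera image.

    K: 3x3 camera matrix; R: 3x3 rotation matrix with columns
    r_1, r_2, r_3; t: translation vector; z_i: distance of the plane
    from the x-y-plane.
    """
    r1, r2, r3 = R[:, 0], R[:, 1], R[:, 2]
    # H = K [r_1  r_2  z_i r_3 + t], a 3x3 matrix.
    return K @ np.column_stack((r1, r2, z_i * r3 + t))
```

As a sanity check, applying H to (x, y, 1) gives, up to scale, the same pixel as projecting the 3D point (x, y, z_i) via K(Rp + t).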
To find the plane that parts of a pair of images were
taken of, both images can be back-projected onto virtual
planes $i$ by the inverse homographies $H_{L,i}^{-1}$ and $H_{R,i}^{-1}$,
where $H_{R,i}$ is the homography for the right camera. If
the plane is at the correct position for these parts of
the images, they will match on the virtual plane. This