reception of the IR signal. ToF sensors run in real time and provide good results even on textureless surfaces. Yet, these sensors suffer from limited resolution as well as various error sources such as noise, multi-path interference and "flying pixels," and they are susceptible to background illumination (Foix et al., 2011). Due to size and power constraints, it is difficult to deploy ToF sensors on smartphones.
SL sensors work by projecting a light pattern onto the scene and capturing it with a camera. The distortion of the pattern in the camera image is a function of the 3D shape of the scene and is therefore used to infer its geometry (Scharstein and Szeliski, 2003a). The light pattern acts as artificial texture, so textureless scenes can also be sensed. The Microsoft Kinect performs SL sensing on a dedicated chip, achieving real-time 3D imaging. Since the projected pattern consists of IR light, sensing fails in the presence of sunlight, whose strong IR component overpowers the pattern. The projected pattern is relatively weak due to power limitations, which restricts the sensing range to less than 10 m. Since SL sensing essentially performs stereo vision, the sensing accuracy is a function of the IR camera resolution and of the depth of the scene. In this paper we show how to properly account for this decreasing accuracy with increasing depth in the fusion algorithm (Section 4.1.1).
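For a triangulation-based sensor this error behavior can be made explicit: with focal length f (in pixels), baseline b and measured disparity d, the depth is Z = f b / d, and propagating a constant disparity uncertainty σ_d yields σ_Z ≈ (Z² / (f b)) · σ_d, i.e., the depth error grows quadratically with the distance to the scene.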
2.3 Enhanced Depth Estimation
through Fusion
In (Wei-Chen Chiu and Fritz, 2011), a promising cross-modal stereo approach is proposed that registers the IR and RGB images of the Kinect in order to find correspondences between them. By combining the RGB channels with appropriate weights, the response of the IR sensor is approximated, which enables depth estimation for reflective and transparent objects via stereo reconstruction. Fusing the stereo reconstruction results with the structured light measurements extends the capabilities of the Kinect without the need for additional hardware. The stereo reconstruction approach proposed in our work does not require the optimization proposed by Chiu and Fritz: by utilizing the same camera for both stereo images, we avoid the degradation of the stereo matching that results from using two different cameras.
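The channel combination itself is simple to sketch. The following minimal example uses a hypothetical weight triple w; Chiu and Fritz instead determine suitable weights for the Kinect's sensors:

import numpy as np

def pseudo_ir(rgb, w=(0.25, 0.35, 0.40)):
    """Approximate the IR camera response as a weighted combination of
    the RGB channels so that cross-modal stereo matching against the
    true IR image becomes feasible.  The weights are placeholders."""
    rgb = rgb.astype(np.float32)
    return w[0] * rgb[..., 0] + w[1] * rgb[..., 1] + w[2] * rgb[..., 2]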
(Li et al., 2011), (Scharstein and Szeliski, 2003b)
and (Choi et al., 2012) achieve a highly accurate fu-
sion of structured light scans and stereo reconstruc-
tion by recording a projected pattern with a set of
stereo RGB cameras. The structured light sensor used
in our work provides reliable depth measurements out
of the box. Moreover, it records the RGB images and the SL depth images with the same sensor chip and therefore also achieves a highly precise alignment.
(Gandhi et al., 2012) generate a high-resolution
depth map by using ToF measurements as a low-
resolution prior that they project into the high-
resolution stereo image pair as an initial set of corre-
spondences. A Bayesian model then propagates this depth prior to generate high-resolution depth images.
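The seeding step can be illustrated with a small sketch. The function below is hypothetical and assumes the low-resolution depth map is already registered to the left stereo camera, with f_hi the stereo focal length in pixels, b the baseline and scale the resolution ratio:

import numpy as np

def seed_disparities(depth_lowres, f_hi, b, scale):
    """Turn a low-resolution depth prior into sparse initial
    disparities in the high-resolution stereo pair."""
    h, w = depth_lowres.shape
    seeds = {}
    for y in range(h):
        for x in range(w):
            z = depth_lowres[y, x]
            if z <= 0:          # invalid measurement
                continue
            d = f_hi * b / z    # triangulation: disparity from depth
            seeds[(y * scale, x * scale)] = d
    return seeds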
In (Somanath et al., 2013), high-resolution stereo
images are fused with the depth measurements from
the Microsoft Kinect. The authors use a graph-cuts-based stereo approach for the fusion, in which the influence of the individual sensors is controlled by a confidence map determined from both the stereo images and the Kinect measurements. For the fusion, Somanath et al. project the SL measurements into the high-resolution stereo images, which reduces the confidence of the Kinect data. Our setup, in contrast, captures the RGB images as well as the SL depth images with a single camera and thus avoids such alignment errors in the fusion. Therefore, our confidence estimate is not affected by alignment and projection issues and is based solely on the error characteristics of the smartphone's SL sensor.
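A minimal sketch of such a sensor-driven confidence, assuming the quadratic triangulation error model discussed above and treating confidence as inverse variance; the constant k is a placeholder for a calibrated noise level, not the exact weighting used later:

import numpy as np

def sl_confidence(depth, k=1.0):
    """Per-pixel confidence of an SL depth measurement.  Assuming the
    triangulation error grows as sigma_Z = k * Z**2, the confidence is
    taken as the inverse variance and hence decays with Z**4."""
    sigma = k * depth.astype(np.float32) ** 2
    conf = np.zeros_like(sigma)
    valid = depth > 0
    conf[valid] = 1.0 / sigma[valid] ** 2
    return conf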
The aforementioned stereo-range superresolution approaches (Li et al., 2011), (Gandhi et al., 2012) and (Somanath et al., 2013) fuse passive stereo vision and active depth measurements by projecting a low-resolution prior into the stereo images and performing the fusion at high resolution. In contrast, we propose an iterative fusion approach that is initialized at the low resolution of the SL depth images. This results in a tremendous acceleration of the correspondence computation, since both the number of pixels that have to be assigned a disparity and the considered label space are much smaller at low resolution. Iteratively relaunching the algorithm with the disparities found in the stereo-SL depth fusion allows us to retrieve a superresolution depth image in a much shorter time than the previously mentioned approaches.
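A minimal NumPy/OpenCV sketch of this coarse-to-fine scheme follows; the simple SAD block matcher, the window size, the number of levels and the search radius r are illustrative stand-ins for the actual matcher used in our fusion:

import numpy as np
import cv2

def match_restricted(left, right, lo, hi, win=5):
    """Winner-takes-all SAD matching restricted per pixel to the
    disparity interval [lo, hi] given by the coarser estimate."""
    disp = np.zeros(left.shape, np.float32)
    best = np.full(left.shape, np.inf, np.float32)
    for d in range(max(int(lo.min()), 0), int(hi.max()) + 1):
        shifted = np.roll(right, d, axis=1)          # align right image
        cost = cv2.boxFilter(np.abs(left - shifted), -1, (win, win))
        mask = (d >= lo) & (d <= hi) & (cost < best)
        best[mask], disp[mask] = cost[mask], d
    return disp

def iterative_fusion(left, right, disp_sl, levels=2, r=4.0):
    """Coarse-to-fine fusion: start from the low-resolution SL
    disparities, then repeatedly upsample and re-match within a
    narrow band around the previous estimate; the shrunken label
    space keeps the correspondence search fast."""
    disp = disp_sl.astype(np.float32)
    for _ in range(levels):
        disp = 2.0 * cv2.resize(disp, None, fx=2, fy=2)  # disparities scale with width
        h, w = disp.shape
        l = cv2.resize(left, (w, h)).astype(np.float32)
        rr = cv2.resize(right, (w, h)).astype(np.float32)
        disp = match_restricted(l, rr, disp - r, disp + r)
    return disp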
2.4 Google Project Tango
Figure 1 depicts the Google Project Tango device that
we use in our experiments to perform a fusion of
SL and stereo depth maps. The Project Tango de-
vice uses essentially the same sensing technique as
the Kinect. In fact, it is equipped with a Primesense chip for hardware-based disparity computation (Goldberg et al., 2014), just like the Kinect. Contrary to the Kinect, however, the Project Tango device uses the same camera (identified as “4MP” in Figure 1)