COMPLEXITY REDUCTION OF REAL-TIME DEPTH SCANNING

ON GRAPHICS HARDWARE

Sammy Rogmans

1,2

, Maarten Dumont

, Tom Cuypers

, Gauthier Lafruit

and Philippe Bekaert

Hasselt University – tUL – IBBT, Expertise centre for Digital Media, Wetenschapspark 2, 3590 Diepenbeek, Belgium

Multimedia Group, IMEC, Kapeldreef 75, 3001 Leuven, Belgium

Keywords:

Complexity reduction, Real-time, Depth scan, Graphics hardware.

Abstract:

This paper presents an intelligent control loop add-on to reduce the total amount of hardware operations – and

therefore the resulting execution speed – of a real-time depth scanning algorithm. The analysis module of the

control loop predicts redundant brute-force operations, and dynamically adjusts the input parameters of the

algorithm, to avoid scanning in a space that lacks the presence of objects. Therefore, this approach reduces

the algorithmic complexity in proportion with the amount of void within the scanned volume, while remaining

fully compliant with stream-centric paradigms such as CUDA and Brook+.

1 INTRODUCTION

For many years, depth map estimation has been an

active research topic to progressively drive a vast

amount of applications such as robot navigation, im-

mersive teleconferencing and 3DTV with autostereo-

scopic displays. Although depth information can be

derived in a variety of ways, when real-time perfor-

mance is considered, many researchers still tend to

prefer local algorithms instead of global optimization

schemes. The reason is that local algorithms are data-

parallel and can therefore be highly accelerated by

many-core platforms such as the Graphics Process-

ing Unit (GPU). However, these algorithms often re-

sort to brute-force approaches, resulting in a large

amount of redundant operations. Although real-time

performance is still achieved, the redundant brute-

force operations signiﬁcantly suppress the high-speed

and low-power potential of these local algorithms.

This paper presents an intelligent control loop

add-on to reduce the intrinsic complexity of real-time

depth scanning algorithms, overcoming their ‘unin-

telligent’ brute-force approach. The algorithms their-

selves remain completely unaltered by predicting the

redundant operations in advance, while dynamically

changing the input parameters similar to a regulator

in control systems. The proposed approach is there-

fore compliant with stream-centric paradigms such as

CUDA and Brook+, but also with ﬁxed-functionality

hardware (e.g. ASICs).

The layout of the paper is as follows: Section 2 ex-

plains the principles of depth scanning, and presents

the reader with a complexity analysis of one of the

currently most promising state-of-the-art local algo-

rithms. Consequently, Section 3 describes the control

loop add-on to reduce the total algorithmic complex-

ity, while Section 4 exposes optimization schemes

that can be applied speciﬁcally on the GPU. The re-

sults are discussed in Section 5, and the paper is con-

cluded in Section 6.

2 DEPTH SCANNING

In the most general case, local algorithms determine

depth information by sweeping a plane through the

3D volume (Yang et al., 2002), checking each voxel

(i.e. pixel-sized 3D elementary volume) for objects in

a brute-force manner. This general approach can be

applied to multi-camera setups, but a rectiﬁed two-

fold camera setup is often used to simplify the 3D

scan, which is known as stereo matching. In a stereo

camera setup, the depth of objects is determined by

estimating the amount of pixels they have shifted

from the left to the right image. As depicted in Fig. 1,

this shift is consistently known as the motion paral-

547

Rogmans S., Dumont M., Cuypers T., Lafruit G. and Bekaert P. (2009).

COMPLEXITY REDUCTION OF REAL-TIME DEPTH SCANNING ON GRAPHICS HARDWARE.

In Proceedings of the Fourth International Conference on Computer Vision Theory and Applications, pages 547-550

DOI: 10.5220/0001806505470550

 SciTePress

Right View

Depth

Scan Range

Max.

Parallax

Camera Baseline

Left View

Motion Parallax

Figure 1: Basic principles of stereo matching, that can be

seen as a simpliﬁed 3D depth scan.

lax or disparity, and directly correlates to the objects’

depth. Objects close to the camera exhibit a large

parallax and vice versa, with the maximum parallax

determined by the baseline distance between the two

cameras. Looking for matches in the stereo pair –

using a systematically decreasing disparity, starting

from the maximum value – can therefore be seen as

a simpliﬁed front-to-back depth scan.

Consistently surveyedby (Scharstein and Szeliski,

2002), stereo matching algorithms can be divided into

a cost computation, cost aggregation, disparity se-

lection and disparity reﬁnement processing module.

Local algorithms tend to put their focus on the ag-

gregation, and generally discard the reﬁnement. As

recently indicated by a performance study of (Gong

et al., 2007), one of the most promising local algo-

rithms is the adaptive weights approach. Therefore,

this approach is used as a case-study. For the inter-

ested reader, the complete algorithmic description can

be found in the paper of (Yoon and Kweon, 2006).

However, our paper only presents a high-level de-

scription of the adaptive weights algorithm, to obtain

a basic analysis of its current complexity.

The GPU implementation of the adaptive weights

is composed out of three different types of ﬂoating

point-based operations. These are the texture (i.e.

memory read), render (i.e. memory write), and com-

putational operations. The matching cost C

com-

putation inputs an RGB-colored left/right image pair

∗

(x,y) and I

∗

(x,y), disparity hypothesis d, and inte-

grates the absolute differences of each color channel

according to the dot product

(x,y,d) = g· |I

∗

(x,y) − I

∗

(x− d,y)| , (1)

with a grey-scaling vector g = h0.299, 0.587, 0.114i.

The matching cost is computed for each hypothesis d

from the search range S, regardless of the presence of

objects at the examined depth. Consequently, the cost

is truncated to a given maximum τ and normalized

(toward 255 when considering 24-bit color images),

following

TAD

(x,y,d) =

255

· min(C

(x,y,d) , τ) , (2)

to reduce the impact of noise inside the source im-

ages. The cost is then aggregated by a 33 × 1 hori-

zontal and 1 × 33 vertical separated 1D convolution

kernel w

(x,y,u), respectively w

(x,y,v). The com-

position of these kernels are based on the Gestalt prin-

ciples of similarity and proximity, and are therefore

computed in a preprocessing step according to

(x,y,u) = e

−

|∆c

+ e

−

|u|

, (3)

with |∆c

| being the Euclidean color distance be-

tween I

∗

(x,y) and I

∗

(x + u,y), and γ

= 17.6, γ

40.0 being a fall-off rate for the similarity, respec-

tively proximity term. The vertical kernel w

is com-

puted in a similar manner. Consequently, the aggre-

gated cost A

is deﬁned as

′

(x,y,d) =

(x,y,u) ∗

TAD

(x+ u, y,d)

, (4)

(x,y,d) =

(x,y,v) ∗

′

(x,y+ v,d)

, (5)

where µ

, µ

are normalization constants, and A

′

stores the intermediate result of the horizontal aggre-

gation. Finally, the best matching cost is selected

out of all disparity hypotheses conceived in S, on a

winner-takes-all (WTA) basis. The resulting depth

map D(x,y) is therefore composed according to

D(x,y) = arg min

d∈S

(x,y,d) . (6)

Table 1 reﬂects our analysis of the type and

amount of operations that are involved to compute

a depth map with the adaptive weights approach us-

ing a 33× 33 convolution kernel, where p deﬁnes the

amount of pixels in the image, and h the amount of

disparity hypotheses, i.e. the cardinality of S. Hence

the total amount of operations O to generate a depth

map can be described as

O = (1027+ 23· h)· p , (7)

resulting in a little over 17.6 Gops for the 1800 ×

1500 Teddy scene (Middlebury, 2003).

3 COMPLEXITY REDUCTION

The proposed method can reduce the total complex-

ity of the adaptive weights algorithm, or any other

local stereo matching algorithm for that matter. To

achieve this, the algorithm is executed twice, but with

different inputs. In the ﬁrst pass, the complete volume

VISAPP 2009 - International Conference on Computer Vision Theory and Applications

548

Table 1: Amount of operations needed to generate a depth map with the adaptive weights algorithm using a 33× 33 convolu-

tion kernel, in function of the amount of image pixels p and disparity hypotheses h.

Preprocessing Matching Cost Cost Aggregation Disparity Selection

Texture Ops 198 · p 6· p· h 132· p · h 1· p· h

Computational Ops 764· p 13· p · h 256· p · h 1· p· h

Render Ops 64 · p 1· p· h 1· p· h 1· p

TOTAL OPS 1026· p 20· p· h 1· p · h (2· h+ 1)· p

Adaptive Scan Range

Full Scan Range

Histogram

Noise Threshold

ANALYSIS FEEDBACK

Figure 2: The principle of our proposed complexity reduction. The full 3D volume is scanned with a low-complexity approach,

and analyzed through a histogram. Consequently, an accurate high-complexity scan is performed on an adaptive range.

is scanned over the entire search range S, but with

a smaller input image resolution. Consequently, an

analysis on the depth map D is performed by comput-

ing the histogram H (d). As shown in Fig. 2, the his-

togram indicates the depths where objects are actually

located. In (Gallup et al., 2007), the histogram has

already been used to ﬁnd the best orientation of the

depth sweep, by performing the sweep for multiple

orientations and selecting the orientation that leads

to the minimum data-entropy of the histogram. Fur-

thermore, (Geys et al., 2004) also used histogram in-

formation, but similar to (Gallup et al., 2007), this

is mainly to lever the resulting quality of the algo-

rithm. In our proposed approach, the histogram infor-

mation acquired from a coarse-grained full range scan

leads to an accurate ﬁne-grained scan over an adaptive

range. As the complexity of the histogram computa-

tion is fairly low itself, the potential to accelerate a

local stereo matching algorithm becomes very high.

In general, the less pronounced foreground fatten-

ing (Scharstein and Szeliski, 2002) an algorithm ex-

hibits, the more the input image resolution can be re-

duced in the ﬁrst pass, while still maintaining a rep-

resentative histogram. In case of the adaptive weights

approach, the input image dimensions can be reduced

up to four times, resulting in a 16-fold reduction of

the original complexity.

To determine which subset S

sub

⊆ S to take, the

values of the histogram are compared with a given

threshold δ. Instead of ﬁxing the treshold δ, it can

be made dynamic by setting the level proportional

to the data-entropy of the histogram. Therefore, the

threshold will be set lowfor complex scenes with ﬁne-

grained geometry, and high when little geometry is

present. Hence, this efﬁciently ﬁlters out noise due to

mismatches inherent to the local algorithm.

4 OPTIMIZATIONS

The aforementioned algorithmic complexity is only

valid when considering the use of conventional Gen-

eral Purpose GPU (GPGPU) computing (Owens et al.,

2007). However, several optimization schemes can

be applied to further reduce the complexity of lo-

cal stereo algorithms. For the cost aggregation mod-

ule, the amount of texture operations can be reduced

by half, when bilinear sampling is used to aggre-

gate two values with a single memory lookup. Fur-

thermore, in the next-generation GPGPU paradigm

CUDA, 16 memory operations can be coalesced into

a single memory read operation. For the disparity se-

lection, the depth test of the GPU can be exploited to

avoid all explicit operations (Lu et al., 2007). How-

ever, the depth test can only be exploited through

the traditional GPGPU paradigm using Direct3D or

OpenGL. Nevertheless, thanks to the recent intro-

duction of CUDA v2.0 and the complete support of

interoperability between CUDA and Direct3D, both

paradigms can be combined to form a highly opti-

mized implementation.

COMPLEXITY REDUCTION OF REAL-TIME DEPTH SCANNING ON GRAPHICS HARDWARE

549

(b)(a)

(d) (e)

(c)

Figure 3: (a) The Teddy scene, (b) low-complexity depth

map, (c) joint histogram, (d) the resulting high-quality depth

map with reduced operations, and (e) the original result.

5 EXPERIMENTAL RESULTS

The proposed method was tested on the 1800× 1500

Teddy scene (see Fig. 3a) of the Middlebury dataset

(Middlebury, 2003), with disparity search range S =

{0,... , 240}. The image resolution was lowered

16 times to 450 × 375, which yields the dispar-

ity map depicted in Fig. 3b. After the histogram

(see Fig. 3c) analysis, the detected subset S

sub

{60, . . . ,96,108,...,180} is only 45% of the orginal

search range S. As the ﬁrst scan is only 1/16

(or

6.25%) of the original complexity, a total complex-

ity reduction of about 48% is harnessed. The dispar-

ity map in Fig. 3d is therefore generated in about 8.5

Gops, instead of the original result in Fig. 3e, which

takes 17.6 Gops. Our control-loop scheme therefore

yields a two-fold complexity reduction, and is able to

double the execution speed of the algorithm.

Considering that the complexity of the ﬁrst pass is

almost negligible, the proposed control loop add-on

will allow for a speedup, proportional to the amount

of void within the scanned volume. This is particu-

larly useful in eye-gaze corrected video chat (Dumont

et al., 2008), as only the chat participant needs to be

scanned in a rather large ofﬁce space.

6 CONCLUSIONS

We have proposed a control loop add-on to reduce the

complexity of local real-time stereo matching algo-

rithms. The histogram of a coarse-grained depth scan

is analyzed to adjust the input parameters of a con-

sequent ﬁne-grained accurate scan, causing the algo-

rithm to avoid scanning in spaces that lack the pres-

ence of objects. Hence, the orignal brute-force al-

gorithmic complexity can be reduced in proportion

with the amount of void within the volume, while

still remaining fully compliant with stream-centric

paradigms such as CUDA or Brook+.

ACKNOWLEDGEMENTS

Sammy Rogmans would like to thank the IWT for

the ﬁnancial support under grant number SB071150.

Tom Cuypers and Philippe Bekaert acknowledge the

ﬁnancial support by the European Commission (FP7

2020 3D media

). Furthermore, all authors recog-

nize the ﬁnancial support from the IBBT.

REFERENCES

Dumont, M., Maesen, S., Rogmans, S., and Bekaert, P.

(2008). A prototype for practical eye-gaze corrected

video chat on graphics hardware. In Proc.of SIGMAP.

Gallup, D., Frahmand, J., Mordohai, P., Yang, Q., and Polle-

feys, M. (2007). Real-time plane-sweeping stereo

with multiple sweeping directions. In Proc. of CVPR.

Geys, I., Koninckx, T. P., and Gool, L. V. (2004). Fast in-

terpolated cameras by combining a GPU based plane

sweep with a max-ﬂow regularisation algorithm. In

Proc. of 3DPVT.

Gong, M., Yang, R., Wang, L., and Gong, M. (2007). A

performance study on different cost aggregation ap-

proaches used in real-time stereo matching. IJCV,

75(2):283–296.

Lu, J., Rogmans, S., Lafruit, G., and Catthoor, F. (2007).

High-speed dense stereo via directional center-biased

support windows on programmable graphics hard-

ware. In Proc. of 3DTV-CON.

Middlebury (2003). http://vision.middlebury.edu/stereo.

Owens, J., Luebke, D., Govindaraju, N., Harris, M., Kruger,

J., Lefohn, A., and Purcell, T. (2007). A survey of

general-purpose computation on graphics hardware.

CG Forum, 26(1):80–113.

Scharstein, D. and Szeliski, R. (2002). A taxonomy and

evaluation of dense two-frame stereo correspondence

algorithms. IJCV, 47(1-3):7–42.

Yang, R., Welch, G., and Bishop, G. (2002). Real-time

consensus-based scene reconstruction using commod-

ity graphics hardware. In Proc. PG.

Yoon, K.-J. and Kweon, I.-S. (2006). Adaptive support-

weight approach for correspondence search. IEEE

PAMI, 28:650–656.

VISAPP 2009 - International Conference on Computer Vision Theory and Applications

550