whose vertices can be divided into two disjoint sets A and B such that every edge connects a vertex in A to a vertex in B. We notice that, in the case of a bipartite graph, updating the messages sent by the pixels of A only requires the messages sent by the pixels of B. As a result, computing the messages sent by a pixel of A at iteration k + 1 needs only half of the messages of iteration k, unlike the classical algorithm. In addition to the gain in memory space, this technique halves the number of operations, since only half of the messages are updated at each iteration.
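The bipartite split can be realised with a simple parity test on the pixel coordinates. The kernel below is a minimal sketch of such a checkerboard update, assuming a single message value per pixel stored in one buffer; the actual BP update, which combines the data cost with the neighbours' message vectors, is omitted for brevity.

// Sketch of the bipartite (checkerboard) update; not the paper's exact code.
__global__ void update_half_of_messages(float *messages, int width, int height,
                                        int iteration)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1)
        return;

    // Only pixels whose parity matches the iteration parity (set A or set B)
    // are updated; their four neighbours all belong to the other set, so the
    // untouched half of the messages from iteration k is sufficient.
    if (((x + y) & 1) != (iteration & 1))
        return;

    int idx = y * width + x;
    // Placeholder update combining the four opposite-parity neighbours.
    messages[idx] = 0.25f * (messages[idx - 1] + messages[idx + 1] +
                             messages[idx - width] + messages[idx + width]);
}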
3.4.3 Reducing the Number of Iterations: A Hierarchical Approach
In our implementation, we use a multi-scale grid as described in (Felzenszwalb and Huttenlocher, 2006) to reduce the number of iterations needed for convergence. The basic idea is to perform BP at different scales, from the coarsest to the finest. This technique allows long-range interactions between pixels with a minimum number of iterations. The messages of a coarser scale are used to initialise the messages of the next finer scale, which dramatically reduces the number of iterations required at that scale.
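A minimal sketch of this coarse-to-fine initialisation is given below, assuming one message vector per pixel and a factor-of-two downsampling between levels; buffer names and layout are illustrative assumptions, not the paper's code.

// Each fine pixel inherits the messages of its parent at the coarser level,
// in the spirit of (Felzenszwalb and Huttenlocher, 2006).
__global__ void init_from_coarser_level(const float *coarse_msgs, float *fine_msgs,
                                        int fine_width, int fine_height,
                                        int coarse_width, int num_labels)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= fine_width || y >= fine_height)
        return;

    // Copy the parent's messages so only a few BP iterations are needed
    // to refine them at this scale.
    int cx = x / 2, cy = y / 2;
    const float *src = coarse_msgs + (cy * coarse_width + cx) * num_labels;
    float *dst = fine_msgs + (y * fine_width + x) * num_labels;
    for (int l = 0; l < num_labels; ++l)
        dst[l] = src[l];
}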
4 PARALLEL COMPUTING ON
GPUS WITH CUDA API
Graphics processing units (GPUs) have evolved to performance levels surpassing 500 GFLOPS. CUDA is a technology for GPU computing from NVIDIA. It exposes the hardware as a set of SIMD multiprocessors, each of which consists of a number of processors. The multiprocessors have arbitrary read/write access to a global memory region called the device memory, and the processors within a multiprocessor also share an on-chip memory region known as shared memory. Access times to the device memory
are long, but it is large enough (up to 1.5 GB on some
cards) to store entire data sets. The shared memory,
on the other hand, is small (16 KB) but has much
shorter access times. It acts as an explicitly-managed
cache. Traditional textures can also be stored in the
device memory and accessed in a read-only fashion
through the multiprocessor’s texture cache. In addi-
tion, a limited amount of read-only memory is avail-
able for constants used by all the processors. This is
accessed through the constant cache.
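The snippet below illustrates how these memory spaces appear in CUDA code of that generation, using the legacy texture-reference API: constant memory served by the constant cache, a texture bound to read-only image data, and per-block shared memory used as an explicitly-managed cache. Names and sizes are assumptions for illustration only.

__constant__ float c_params[64];                    // read-only, served by the constant cache

texture<float, 2, cudaReadModeElementType> tex_img; // read-only access via the texture cache

__global__ void memory_spaces_example(float *device_mem, int width, int height)
{
    __shared__ float tile[16][16];                  // explicitly-managed on-chip cache
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    tile[threadIdx.y][threadIdx.x] = tex2D(tex_img, x, y);   // cached texture fetch
    device_mem[y * width + x] =                               // slow but large device memory
        tile[threadIdx.y][threadIdx.x] + c_params[0];
}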
A schematic overview of the hardware model is shown in Figure 3.
Figure 3: CUDA Execution Model.
In a CUDA program, the developer sets up a large number of threads that are grouped into thread blocks. A CUDA thread has a set of registers and a program counter associated with it. Each thread block is executed on a single multi-
processor. It is possible to synchronize the threads
within a block, allowing the threads to share data
through the shared memory. Communication between
all threads, however, requires the use of global mem-
ory. Given that a thread block can consist of more
threads than the number of processors in a multipro-
cessor, the hardware is responsible for scheduling the
threads. This allows it to hide the latency of fetches
from device memory by letting some threads perform
computations while others wait for data to arrive.
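The host-side sketch below shows how such a grid of thread blocks might be configured and launched for a per-pixel kernel; the kernel body, block size and image dimensions are placeholder assumptions, not the paper's code.

#include <cuda_runtime.h>

// Each block runs on one multiprocessor; the hardware schedules more threads
// than processors to hide device-memory latency.
__global__ void per_pixel_kernel(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = 0.0f;       // placeholder per-pixel work
}

int main()
{
    const int width = 640, height = 480;
    float *d_out;
    cudaMalloc(&d_out, width * height * sizeof(float));

    dim3 block(16, 16);                  // one thread block = 256 threads
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    per_pixel_kernel<<<grid, block>>>(d_out, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}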
5 CUDA IMPLEMENTATION
STRATEGY
The implementation strategy has a great impact on the overall performance. It deals with the allocation of threads to the problem and with the usage of the different types of on-board memory. The chosen approach uses one thread to process a single pixel. The reference image is then divided into blocks of identical dimensions. Threads within the same thread block can communicate through the low-latency shared memory. The next step is to determine which types of memory to use. Texturing from a CUDA array is used for both the reference and target images; texturing provides several advantages, mainly cached access. Global memory is used to store the data costs and the messages. Although the latency of global memory (400 to 600 clock cycles) limits bandwidth, using shared memory is inadequate here since the processing of one block of pixels depends on its neighbouring blocks. Figure 4 shows the instruction flow (single arrows) and the data flow (double arrows) of our implementation.
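As an illustration of this strategy, the hedged sketch below shows a thread-per-pixel data-cost kernel in the spirit of the description above: reference and target pixels are fetched through textures, and the per-label costs are written to global memory. The kernel name, the truncated absolute-difference cost and the disparity range are assumptions, not the paper's exact code.

texture<float, 2, cudaReadModeElementType> tex_reference;
texture<float, 2, cudaReadModeElementType> tex_target;

__global__ void compute_data_costs(float *data_costs, int width, int height,
                                   int num_disparities)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per pixel
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    float ref = tex2D(tex_reference, x, y);
    for (int d = 0; d < num_disparities; ++d) {
        // Reads outside the image are clamped by the texture addressing mode.
        float tgt = tex2D(tex_target, x - d, y);
        // Truncated absolute difference (truncation value is an assumption).
        float cost = fminf(fabsf(ref - tgt), 20.0f);
        // Data costs and messages live in global memory, laid out per pixel.
        data_costs[(y * width + x) * num_disparities + d] = cost;
    }
}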