whose vertices can be divided into two disjoint sets A and B such that every edge connects a vertex in A to a vertex in B. We notice that, in the case of a bipartite graph, updating the messages sent by the pixels of A only requires the messages sent by the pixels of B. As a result, computing the messages sent by a pixel of A at iteration k + 1 needs only half of the messages of iteration k, unlike the classical algorithm. In addition to the gain in memory space, this technique halves the number of operations, since only half of the messages are updated at each iteration.
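The bipartite split can be realised with a simple parity test on the pixel coordinates. The kernel below is a minimal sketch of such a checkerboard update, assuming a single message value per pixel stored in one buffer; the actual BP update, which combines the data cost with the neighbours' message vectors, is omitted for brevity.

// Sketch of the bipartite (checkerboard) update; not the paper's exact code.
__global__ void update_half_of_messages(float *messages, int width, int height,
                                        int iteration)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1)
        return;

    // Only pixels whose parity matches the iteration parity (set A or set B)
    // are updated; their four neighbours all belong to the other set, so the
    // untouched half of the messages from iteration k is sufficient.
    if (((x + y) & 1) != (iteration & 1))
        return;

    int idx = y * width + x;
    // Placeholder update combining the four opposite-parity neighbours.
    messages[idx] = 0.25f * (messages[idx - 1] + messages[idx + 1] +
                             messages[idx - width] + messages[idx + width]);
}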
3.4.3 Reducing the Number of Iterations: A Hierarchical Approach
In our implementation, we use a multi-scale grid as described in (Felzenszwalb and Huttenlocher, 2006) to reduce the number of iterations needed for convergence. The basic idea is to perform BP at different scales, from the coarsest to the finest. This technique allows long-range interactions between pixels with a minimum number of iterations. The messages of a coarser scale are used to initialise the messages of the next finer scale, which dramatically reduces the number of iterations required at that scale.
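A minimal sketch of this coarse-to-fine initialisation is given below, assuming one message vector per pixel and a factor-of-two downsampling between levels; buffer names and layout are illustrative assumptions, not the paper's code.

// Each fine pixel inherits the messages of its parent at the coarser level,
// in the spirit of (Felzenszwalb and Huttenlocher, 2006).
__global__ void init_from_coarser_level(const float *coarse_msgs, float *fine_msgs,
                                        int fine_width, int fine_height,
                                        int coarse_width, int num_labels)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= fine_width || y >= fine_height)
        return;

    // Copy the parent's messages so only a few BP iterations are needed
    // to refine them at this scale.
    int cx = x / 2, cy = y / 2;
    const float *src = coarse_msgs + (cy * coarse_width + cx) * num_labels;
    float *dst = fine_msgs + (y * fine_width + x) * num_labels;
    for (int l = 0; l < num_labels; ++l)
        dst[l] = src[l];
}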
4 PARALLEL COMPUTING ON
GPUS WITH CUDA API
Graphics processing units (GPUs) have evolved to performance levels surpassing 500 GFLOPS. CUDA is a technology for GPU computing from NVIDIA. It exposes the hardware as a set of SIMD multiprocessors, each of which consists of a number of processors. The multiprocessors have arbitrary read/write access to a global memory region called the device memory, and the processors within a multiprocessor also share an on-chip memory region known as shared memory. Access times to the device memory
are long, but it is large enough (up to 1.5 GB on some
cards) to store entire data sets. The shared memory,
on the other hand, is small (16 KB) but has much
shorter access times. It acts as an explicitly-managed
cache. Traditional textures can also be stored in the
device memory and accessed in a read-only fashion
through the multiprocessor’s texture cache. In addi-
tion, a limited amount of read-only memory is avail-
able for constants used by all the processors. This is
accessed through the constant cache.
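The snippet below illustrates how these memory spaces appear in CUDA code of that generation, using the legacy texture-reference API: constant memory served by the constant cache, a texture bound to read-only image data, and per-block shared memory used as an explicitly-managed cache. Names and sizes are assumptions for illustration only.

__constant__ float c_params[64];                    // read-only, served by the constant cache

texture<float, 2, cudaReadModeElementType> tex_img; // read-only access via the texture cache

__global__ void memory_spaces_example(float *device_mem, int width, int height)
{
    __shared__ float tile[16][16];                  // explicitly-managed on-chip cache
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    tile[threadIdx.y][threadIdx.x] = tex2D(tex_img, x, y);   // cached texture fetch
    device_mem[y * width + x] =                               // slow but large device memory
        tile[threadIdx.y][threadIdx.x] + c_params[0];
}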
A schematic overview of the hardware model is shown in Figure 3.
Figure 3: CUDA Execution Model.
In a CUDA program, the developer sets up a large number of threads that are grouped into thread blocks. A CUDA thread has a set of registers and a program counter associated with it. Each thread block is executed on a single multi-
processor. It is possible to synchronize the threads
within a block, allowing the threads to share data
through the shared memory. Communication between
all threads, however, requires the use of global mem-
ory. Given that a thread block can consist of more
threads than the number of processors in a multipro-
cessor, the hardware is responsible for scheduling the
threads. This allows it to hide the latency of fetches
from device memory by letting some threads perform
computations while others wait for data to arrive.
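The host-side sketch below shows how such a grid of thread blocks might be configured and launched for a per-pixel kernel; the kernel body, block size and image dimensions are placeholder assumptions, not the paper's code.

#include <cuda_runtime.h>

// Each block runs on one multiprocessor; the hardware schedules more threads
// than processors to hide device-memory latency.
__global__ void per_pixel_kernel(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = 0.0f;       // placeholder per-pixel work
}

int main()
{
    const int width = 640, height = 480;
    float *d_out;
    cudaMalloc(&d_out, width * height * sizeof(float));

    dim3 block(16, 16);                  // one thread block = 256 threads
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    per_pixel_kernel<<<grid, block>>>(d_out, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}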
5 CUDA IMPLEMENTATION
STRATEGY
The implementation strategy has a great impact on the overall performance. It deals with the allocation of threads to the problem and with the usage of the different types of on-board memory. The chosen approach uses one thread to process a single pixel. The reference image is then divided into blocks of identical dimensions. Threads within the same thread block can communicate through the low-latency shared memory. The next step is to determine which types of memory to use. Texturing from a CUDA array is used for both the reference and target images; texturing provides several advantages, mainly cached access. Global memory is used to store the data costs and the messages. Although the latency of global memory (400 to 600 clock cycles) limits bandwidth, using shared memory is inadequate here since the processing of one block of pixels depends on its neighbouring blocks. Figure 4 shows the instruction flow (single arrows) and the data flow (double arrows) of our implementation.
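As an illustration of this strategy, the hedged sketch below shows a thread-per-pixel data-cost kernel in the spirit of the description above: reference and target pixels are fetched through textures, and the per-label costs are written to global memory. The kernel name, the truncated absolute-difference cost and the disparity range are assumptions, not the paper's exact code.

texture<float, 2, cudaReadModeElementType> tex_reference;
texture<float, 2, cudaReadModeElementType> tex_target;

__global__ void compute_data_costs(float *data_costs, int width, int height,
                                   int num_disparities)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per pixel
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    float ref = tex2D(tex_reference, x, y);
    for (int d = 0; d < num_disparities; ++d) {
        // Reads outside the image are clamped by the texture addressing mode.
        float tgt = tex2D(tex_target, x - d, y);
        // Truncated absolute difference (truncation value is an assumption).
        float cost = fminf(fabsf(ref - tgt), 20.0f);
        // Data costs and messages live in global memory, laid out per pixel.
        data_costs[(y * width + x) * num_disparities + d] = cost;
    }
}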