available registers increase to 256. At most half of
these should be used per block to allow running mul-
tiple thread blocks on a multiprocessor at the same
time. This means that a complex kernel must be run
with a smaller block size than a simple one. A similar
restriction applies to the use of shared memory.
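As a rough, back-of-the-envelope illustration of this trade-off (the register file size, registers per thread, and block size below are placeholder values, not figures measured for our implementation):

#include <stdio.h>

/* Thread blocks that fit concurrently on one multiprocessor when
   limited only by register usage. */
static int residentBlocks(int regsPerMultiprocessor,
                          int regsPerThread, int threadsPerBlock)
{
    return regsPerMultiprocessor / (regsPerThread * threadsPerBlock);
}

int main(void)
{
    /* hypothetical register file of 16384 registers: a kernel using
       32 registers per thread with 128 threads per block leaves room
       for 16384 / (32 * 128) = 4 concurrent blocks, while 64 registers
       per thread would allow only 2 */
    printf("%d blocks\n", residentBlocks(16384, 32, 128));
    return 0;
}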
3.2 Accelerating Raycasting
While easy to implement, the basic raycasting al-
gorithm leaves room for optimization. Many tech-
niques have been proposed for DVR, from skipping
over known empty voxels (Levoy, 1990) to adaptively
changing the sampling rate (Röttger et al., 2003).
Most of these techniques are also applicable to a
CUDA implementation. In this paper, we instead focus on techniques that can exploit the additional capabilities of CUDA to gain a performance advantage over a shader implementation.
Many volume visualization techniques take a voxel’s neighborhood into account when calculating its visual characteristics, ranging from linear interpolation over gradient calculations of differing complexity to techniques for ambient occlusion. Since the neighborhoods of the voxels sampled by adjacent rays overlap, many voxels are fetched multiple times, wasting memory bandwidth. Moving entire parts of the volume into a fast cache memory could eliminate much of this superfluous memory traffic.
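A minimal sketch of this idea, assuming a brick of 16-bit intensity values that lies completely inside the volume (the 16³ brick size, i.e. 8 kB of shared memory, corresponds to the configuration discussed below; all identifiers are hypothetical):

#define BRICK 16  /* brick edge length: 16^3 voxels * 2 B = 8 kB */

__global__ void brickCachedKernel(const unsigned short *volume,
                                  int dimX, int dimY,
                                  int brickX, int brickY, int brickZ)
{
    __shared__ unsigned short brick[BRICK * BRICK * BRICK];

    /* linear thread index and thread count within the block */
    int tid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x
              + threadIdx.x;
    int numThreads = blockDim.x * blockDim.y * blockDim.z;

    /* cooperatively stage the brick from global into shared memory */
    for (int i = tid; i < BRICK * BRICK * BRICK; i += numThreads) {
        int x = brickX +  i % BRICK;
        int y = brickY + (i / BRICK) % BRICK;
        int z = brickZ +  i / (BRICK * BRICK);
        brick[i] = volume[(z * dimY + y) * dimX + x];
    }
    __syncthreads();

    /* ... samples along rays that fall inside this brick could now be
       read from brick[] instead of global memory ... */
}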
As noted in Section 3.1, each multiprocessor has 16 kB of shared memory available, but each thread block should use less than half of this to achieve optimal performance. Using this memory for caching volume data would allow for a subvolume of 16³ voxels with 16-bit intensity values. While access-
ing volume data cached in shared memory is faster
than transferring it from global memory, this approach has
some disadvantages compared to using the texturing
hardware. First, the texturing hardware directly sup-
ports trilinear filtering, which would have to be per-
formed manually with multiple shared memory ac-
cesses. Second, the texturing hardware automatically
handles out-of-range texture coordinates by clamping
or wrapping, and removes the need for costly address-
ing and range checking. Finally, the texture hardware’s own caching can achieve performance similar to shared memory, as long as the access pattern exhibits sufficient locality.
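For comparison, a sketch of how these features are enabled when using the texturing hardware, written against the texture-reference API of the CUDA toolkits of that generation (identifiers are illustrative, not those of our implementation):

#include <cuda_runtime.h>

texture<unsigned short, 3, cudaReadModeNormalizedFloat> volumeTex;

void bindVolumeTexture(cudaArray *volumeArray)
{
    volumeTex.normalized     = true;                  /* coordinates in [0,1]  */
    volumeTex.filterMode     = cudaFilterModeLinear;  /* trilinear filtering   */
    volumeTex.addressMode[0] = cudaAddressModeClamp;  /* out-of-range handling */
    volumeTex.addressMode[1] = cudaAddressModeClamp;
    volumeTex.addressMode[2] = cudaAddressModeClamp;
    cudaBindTextureToArray(volumeTex, volumeArray);
}

__device__ float sampleVolume(float3 pos)
{
    /* one hardware-filtered fetch; the shared-memory path would need
       eight loads, manual weighting, and explicit range checks */
    return tex3D(volumeTex, pos.x, pos.y, pos.z);
}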
When a volume is divided into subvolumes that
are moved into cache memory, accessing neighbor-
ing voxels becomes a problem. Many per-voxel op-
erations like filtering or gradient calculation require
access to neighboring voxels. For voxels on the border of a subvolume, much of this neighborhood is no longer directly accessible, since the surrounding voxels are not included in the cache. These neighbors can either be accessed directly through global memory or included in the subvolume as border voxels, thus reducing the usable size of the subvolume cache. Moving border voxels into the cache reduces the usable subvolume size to 14³, with 33% of the cache occupied by border data. Hence this would substantially reduce the efficiency of the subvolume cache.
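These numbers follow directly from the brick dimensions: a 16³ brick holds 4096 voxels, of which only 14³ = 2744 remain usable payload, so (4096 − 2744)/4096 = 1352/4096 ≈ 33% of the cache is spent on border voxels.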
Bricking implementations for shader-based vol-
ume raycasting often split the proxy geometry into
many smaller bricks corresponding to the subvolumes
and render them in front-to-back order. This requires
special border handling inside the subvolumes and
can introduce overhead due to the multitude of shader
calls. A CUDA kernel would have to use a less flexi-
ble analytical approach for ray setup, instead of utiliz-
ing the rasterization hardware as proposed by (Krüger
and Westermann, 2003), or implement its own ras-
terization method (Kainz et al., 2009). As described
above, due to the scarcity of shared memory,
the total number of bricks would also be quite high,
increasing the overhead for management of bricks
and compositing of intermediate results. The bricking
technique described in (Law and Yagel, 1996) is spe-
cially designed for orthographic projection, for which
the depth-sorting of the bricks can be simplified sig-
nificantly, compared to the case of perspective pro-
jection. Their technique also relies on per-brick lists, to which rays are added when they first hit a brick and from which they are removed after leaving it. This list handling can be implemented efficiently on the CPU, but such data structures do not map well to the GPU hardware. (Kim, 2008) works around this problem by handling the data structures on the CPU. As his aim is streaming data sets that do not fit into GPU memory, this additional overhead is of no concern there, in contrast to a general approach for volume rendering.
To summarize, a direct bricking implementation
in CUDA is problematic because only a small amount
of shared memory is available and the ray setup
for individual bricks is difficult. Therefore, in Section 5 we introduce an acceleration technique that is better adapted to the features and limitations of the CUDA architecture.
4 BASIC RAYCASTING
As illustrated in Figure 1, the basic raycasting algo-
rithm can be divided into three parts: initialization
and ray setup, ray traversal, and writing the results. A
fragment shader implementation uses texture fetches
for retrieving ray parameters, applying transfer func-
tions, and for volume sampling, utilizing the texturing