times from the kernels (Sanders and Kandrot, 2010).
Also, as a new improvement incorporated in the
GF100 architecture, the static parameters of the
kernel are automatically stored in constant memory.
In our 3D thinning algorithm we store in constant
memory the offset values indicating the positions of
the neighbours of each voxel. These values depend
on the dimensions of the model and do not change
during thread execution, so they are ideal candidates
for constant memory. They are accessed when
checking whether a voxel belongs to the 3D model
boundary and when operating on a border point to
obtain its neighbourhood. Thus, global memory bandwidth is
freed. Our tests indicate that the algorithm is 11% to
18% faster, depending on the model size and the
GPU used, when using constant memory (see
“Constant Memory Usage” speedup in Figure 5).
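As a minimal sketch of this technique (illustrative names, not the authors' exact code; a padded volume is assumed so that neighbour reads at the model border stay in bounds), the offsets can be placed in constant memory and uploaded once from the host:

    #include <cuda_runtime.h>

    __constant__ int c_neighOffsets[26];   // one linear offset per neighbour

    // Host side: compute the 26 offsets from the model dimensions and copy
    // them to constant memory once, before the first kernel launch.
    void uploadNeighbourOffsets(int dimX, int dimY)
    {
        int h_offsets[26];
        int n = 0;
        for (int dz = -1; dz <= 1; ++dz)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (dx != 0 || dy != 0 || dz != 0)   // skip the voxel itself
                        h_offsets[n++] = (dz * dimY + dy) * dimX + dx;
        cudaMemcpyToSymbol(c_neighOffsets, h_offsets, sizeof(h_offsets));
    }

    // Device side: every thread reads the same offsets, so the constant
    // cache broadcasts them to the whole warp and global memory bandwidth
    // is saved for the voxel data itself.
    __global__ void countNeighboursKernel(const unsigned char *model,
                                          int *counts, int numVoxels)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= numVoxels) return;
        int filled = 0;
        for (int i = 0; i < 26; ++i)
            filled += model[idx + c_neighOffsets[i]];
        counts[idx] = filled;
    }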
4.4 Shared Memory Usage
Avoiding memory transfers between devices and
hiding memory access latency are important factors
in improving CUDA algorithms, whatever the GPU
architecture, as outlined in sections 4.1 and 4.2.
Focusing only on the GT200 architecture, however,
the fundamental optimization strategy is, in our
experience, to use the CUDA memory hierarchy
properly. This is mainly achieved by using shared
memory instead of global memory where possible
(NVIDIA A, 2011; Kirk and Hwu, 2010; Feinbure et
al., 2011). However, the use of shared memory on
the newest GF100 GPUs may not be as important, as
will be seen later.
Shared memory is a fast on-chip memory widely
used to share data between threads within the same
thread-block (Ryoo et al., 2008). It can also be used
as a manual cache for global memory data by storing
values that are repeatedly used by thread-blocks. We
will see that this latter use is less important when
working on GF100-based GPUs, since the GF100
architecture provides a true cache hierarchy.
Shared memory is a very limited resource (see
Table 1). It can be dynamically allocated at kernel
launch, but not during kernel execution. As a
consequence, a thread within a thread-block cannot
allocate exactly the amount of memory it needs: the
same amount must be allocated for every thread in
the thread-block, even though not all of them will
use it, so some memory positions are wasted.
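A short sketch of this constraint (illustrative names): the per-block amount is fixed by the third execution-configuration parameter of the launch, and each thread can only take its own fixed-size slice of it:

    #define BYTES_PER_THREAD 26          // illustrative per-thread requirement

    // The size of s_buf is fixed by the launch configuration; a thread
    // cannot request a different amount during execution, so every thread
    // in the block receives the same slice, whether it uses it or not.
    __global__ void sharedSliceKernel(unsigned char *out, int n)
    {
        extern __shared__ unsigned char s_buf[];
        unsigned char *mySlice = s_buf + threadIdx.x * BYTES_PER_THREAD;
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= n) return;
        for (int i = 0; i < BYTES_PER_THREAD; ++i)
            mySlice[i] = (unsigned char)i;           // placeholder work
        out[idx] = mySlice[BYTES_PER_THREAD - 1];
    }

    // Launch: the third parameter is the dynamic shared memory per block.
    //   sharedSliceKernel<<<blocks, threads, threads * BYTES_PER_THREAD>>>(d_out, n);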
Based on our experience, we recommend the
following steps for an optimal use of shared
memory: A) identify the data that are reused (data-
sharing case) or accessed repeatedly (cache-system
case) by some threads, B) calculate how much
shared memory is required, globally (data-sharing
case) or by each individual thread (cache-system
case), and C) determine the number of threads per
block that maximizes the MP occupancy (more in
section 4.5).
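As an illustration of steps B and C, anticipating the case study below: each thread caches 26 bytes (1 byte per neighbour), so a block of 192 threads requires 26 × 192 = 4992 bytes of shared memory; a GT200 multiprocessor, with 16 KB of shared memory, can then host at most ⌊16384 / 4992⌋ = 3 such blocks, a limit that step C must weigh against the MP occupancy.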
Focusing now on our case study, the
neighbourhood of each voxel is accessed repeatedly,
so it can be stored in shared memory to achieve fast
access to it. This way, each thread needs to allocate
1 byte per neighbour. This per-thread amount of
shared memory, multiplied by the selected number
of threads per thread-block, determines the total
amount of shared memory that each thread-block
allocates, as sketched below.
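A sketch of this scheme (illustrative names, not the authors' exact code; c_neighOffsets is the constant-memory offset array shown earlier): each thread copies its 26-neighbourhood into its slice of shared memory and performs all further tests on that copy.

    __global__ void thinningStepKernel(const unsigned char *model,
                                       unsigned char *flags, int numVoxels)
    {
        extern __shared__ unsigned char s_neigh[];    // 26 bytes per thread
        unsigned char *myNeigh = s_neigh + threadIdx.x * 26;

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= numVoxels) return;

        // One global read per neighbour; every later access hits shared memory.
        for (int i = 0; i < 26; ++i)
            myNeigh[i] = model[idx + c_neighOffsets[i]];

        // The repeated neighbourhood tests of the thinning step would read
        // myNeigh[] here; as a placeholder we only count filled neighbours.
        int filled = 0;
        for (int i = 0; i < 26; ++i)
            filled += myNeigh[i];
        flags[idx] = (unsigned char)(filled > 0);
    }

    // Launch with 26 bytes of dynamic shared memory per thread:
    //   thinningStepKernel<<<blocks, threads, threads * 26>>>(d_model, d_flags, n);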
We have tested our 3D thinning algorithm by
first storing each 26-neighbourhood in global memory
and then storing it in shared memory. Testing on the
GTX295 with our five test models, a relative
speedup of more than 2x is achieved when using
shared memory. In contrast, on the GTX580 the
relative speedup is minimal, only around 10% with
the shared memory preference mode. This indicates
that Fermi's automatic cache system works well in
our case. In other algorithms, e.g. those in which not
many repeated and consecutive memory accesses
are performed, a better speedup could be obtained
by implementing a hand-managed cache instead of
relying on the hardware-managed one. If the L1
cache preference mode is selected instead, using
shared memory decreases performance by about
25%, because less shared memory space is
available. Despite this, the use of shared memory is
still interesting because it releases global memory
space: the neighbourhood can be transformed
directly in shared memory without modifying the
original model, which permits us to apply further
improvements later.
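For reference, the two preference modes mentioned above are selected per kernel through the standard runtime API, which on GF100 chooses how the 64 KB of on-chip memory per MP is split between L1 cache and shared memory (the kernel name below is from our sketch):

    // "Shared memory preference mode": 48 KB shared / 16 KB L1.
    cudaFuncSetCacheConfig(thinningStepKernel, cudaFuncCachePreferShared);
    // "L1 cache preference mode": 16 KB shared / 48 KB L1.
    cudaFuncSetCacheConfig(thinningStepKernel, cudaFuncCachePreferL1);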
The overall improvement of the algorithm, up to
13.94x on GTX295 and 36.17x on GTX580, is
shown in Figure 5 (see “Shared Memory Usage”
speedup).
4.5 MP Occupancy
Once the required amount of shared memory is
defined, the number of threads per block that
yields the best performance has to be set. The
term MP occupancy refers to the degree of
utilization of each device multiprocessor. It is
limited by several factors. Occupancy is usually
obtained as:

    Occupancy = number of active warps per MP / maximum number of resident warps per MP
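For instance, on a GT200 MP, which supports at most 32 resident warps, a configuration that keeps 24 warps active reaches an occupancy of 24/32 = 75%.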