Table 1: Results of joint histogram calculations.

Implementation                       Time
-----------------------------------  --------
Original implementation              350 msec
Optimized implementation (atomic)    196 msec
Optimized implementation (tagged)    190 msec
int bin = ...;
unsigned int tagged;
do
{
    // Remove the previous tag, keeping the
    // 27-bit count
    unsigned int val =
        localHistogram[bin] & 0x07FFFFFF;
    // Tag the incremented count with the
    // thread id in the upper 5 bits
    tagged = (threadid << 27) | (val + 1);
    // Write to the histogram; of simultaneous
    // writers, only one write survives
    localHistogram[bin] = tagged;
} while (localHistogram[bin] != tagged);
This method was originally developed before atomic operations on shared memory were available. Here, we investigate the original method, but augmented with these atomic operations and other optimization techniques.
We have performed joint histogram calculations of images of size 4096×4096 on an NVIDIA GTX 280 with 30 multiprocessors, a clock speed of 1.3 GHz and 1 GiB of off-chip memory. As shown in Table 1, the version of (Shams and Kennedy, 2007) becomes more efficient when optimization techniques are applied. More specifically, we have reduced the size of the data to speed up memory reads, enabled coalesced loads and decreased bank conflicts. These optimizations can be accomplished by rearranging the threads, and thus controlling the memory operations per thread. Additionally, we have modified the calculation instructions to reduce register usage and thus reduce register spilling to global memory. With these optimizations, we reduced the execution time by roughly 150 msec, which demonstrates the importance of careful attention to optimization.
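One of these techniques can be illustrated with a short sketch (our own example, not the benchmarked kernel): by letting each thread read four packed 8-bit pixels as a single 32-bit uchar4, the sixteen threads of a half-warp fetch 64 consecutive bytes in one coalesced transaction instead of issuing scattered byte reads.

__global__ void sumQuads(const uchar4 *img, unsigned int *sums,
                         int numQuads)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numQuads)
    {
        uchar4 p = img[i];               // one coalesced 32-bit load
        sums[i] = p.x + p.y + p.z + p.w; // four pixels per thread
    }
}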
Using atomic add operations on shared memory to replace the tagging technique does not result in faster execution (196 msec for the atomic version versus 190 msec for the tagged version in Table 1). This demonstrates that care should be taken when using atomic operations, even in small kernels, and that alternative techniques should be implemented and benchmarked where possible.
3.2 Convolution
Convolution is an image processing method in which a new value for every pixel is calculated as a function of the values of the surrounding pixels. Convolution is often referred to as finite impulse response filtering. The effects of block sizes and implementation strategies are extensively investigated in (Goorts et al., 2009). Here, we will discuss the importance of the position of threads to enable coalescing and reduce bank conflicts. We only consider direct convolution algorithms; techniques using fast Fourier transforms and singular value decompositions are outside the scope of this paper.
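In its standard discrete form (a textbook formulation, added here for reference), convolution of an image f with a (2r+1) × (2r+1) filter kernel k computes each output pixel as

g(x, y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} k(i, j) \, f(x - i, y - j).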
One thread is assigned per pixel. Every thread then needs the values of the surrounding pixels, requiring multiple pixels to be loaded from off-chip memory. Because threads in the same block have a shared memory, the data read by other threads can be reused, reducing the number of off-chip memory reads. However, threads at the border of the block do not have sufficient data available, and therefore a border of extra threads is created to read the missing data. This border is called the apron. Because the apron threads do not calculate new output, the pixels located in the apron must also be read by neighboring blocks. These redundant reads must be reduced to a minimum.
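A minimal sketch of this tiled scheme follows (the tile size, the clamp-to-edge border policy and all names are our own assumptions, not the benchmarked implementation): every thread, apron threads included, stages one pixel into shared memory, after which only the interior threads compute an output pixel from the staged data.

#define RADIUS 2                 // filter radius, i.e. a 5x5 kernel
#define TILE   16                // output pixels per block side

__global__ void convolve(const float *in, float *out,
                         const float *filter, int w, int h)
{
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    // Global coordinates; apron threads map outside the tile.
    int gx = blockIdx.x * TILE + threadIdx.x - RADIUS;
    int gy = blockIdx.y * TILE + threadIdx.y - RADIUS;

    // Every thread, apron included, loads one pixel,
    // clamped at the image border.
    int cx = min(max(gx, 0), w - 1);
    int cy = min(max(gy, 0), h - 1);
    tile[threadIdx.y][threadIdx.x] = in[cy * w + cx];
    __syncthreads();

    // Only interior threads produce an output pixel.
    if (threadIdx.x >= RADIUS && threadIdx.x < TILE + RADIUS &&
        threadIdx.y >= RADIUS && threadIdx.y < TILE + RADIUS &&
        gx < w && gy < h)
    {
        float sum = 0.0f;
        for (int fy = -RADIUS; fy <= RADIUS; ++fy)
            for (int fx = -RADIUS; fx <= RADIUS; ++fx)
                sum += filter[(fy + RADIUS) * (2 * RADIUS + 1)
                              + (fx + RADIUS)]
                     * tile[threadIdx.y + fy][threadIdx.x + fx];
        out[gy * w + gx] = sum;
    }
}

Note that with an apron width of only RADIUS, the rows read by a half-warp no longer start on 16-word boundaries, which is exactly the coalescing problem discussed next.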
Because of the apron, it is not sufficient to use blocks whose width is the multiple of 16 required for coalescing. The width of the apron must also be a multiple of 16 (see figure 2) to enable coalescing in every block. Doing so creates many unnecessary threads, but performance nevertheless increases thanks to coalesced memory operations and the elimination of bank conflicts. This result is visible in figure 3. The results depend on the filter size: the larger the filter, the larger the apron. Because the non-coalesced algorithm must read every word in the apron separately, the number of memory instructions increases with the apron size. In the coalesced case, the number of read instructions stays constant and optimal. There will be more threads and more blocks, but the performance is still higher, which proves the importance of coalescing in this application. Because the thread positions are chosen correctly, bank conflicts are eliminated in the coalesced case.
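As a sketch of this padding rule (a hypothetical host-side helper; the exact layout of figure 2 is not reproduced here), the apron width can be rounded up to the next multiple of 16 so that every row read by a half-warp starts on a 16-word boundary:

#include <cuda_runtime.h>

// Hypothetical helper: widen the apron from the filter radius to
// the next multiple of 16, keeping row reads coalesced for any
// filter size at the cost of extra (idle) threads.
dim3 convolutionBlockDim(int radius, int tileWidth, int tileHeight)
{
    int paddedApron = ((radius + 15) / 16) * 16; // 1..16 -> 16, 17..32 -> 32
    return dim3(tileWidth + 2 * paddedApron, tileHeight, 1);
}

For a radius of 2 this yields 48-wide rows of which only 16 columns produce output, illustrating the trade-off of extra threads for aligned, coalesced reads.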
4 CONCLUSIONS
We have applied optimization principles to increase the performance of algorithms executed on SIMT architectures. By coalescing off-chip memory loads, reducing bandwidth, increasing occupancy, reducing bank conflicts, eliminating local memory usage and optimizing instructions, one can maximize the utilization of the resources of the parallel device and reduce execution time. Reducing bank conflicts is forgotten