This paradigm maps directly onto the GPU hardware. The GPU is a collection of multiprocessors, where each multiprocessor contains one instruction decoder but multiple streaming processors. Each streaming processor in a multiprocessor therefore executes the same instruction, but on different data. Every multiprocessor contains a fast on-chip local memory, accessible by all of its streaming processors, and is furthermore connected to an off-chip, GPU-wide memory through a relatively slow memory bus.
Every block of the execution model maps onto one multiprocessor and every thread of a block can be executed by one streaming processor. Because a block can contain more threads than there are streaming processors available, a thread scheduler can suspend and resume threads. The group of threads that is executed simultaneously is called a warp; its size is determined by the number of streaming processors, which enables performance optimizations. It is furthermore possible to have multiple blocks reside on a single multiprocessor. These blocks are completely separated from each other and cannot communicate with each other any faster than blocks on different multiprocessors can.
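For illustration, the execution configuration below shows how an image can be divided into blocks of threads; each block is scheduled onto one multiprocessor and its threads run on the streaming processors. This is only a minimal sketch: the kernel name filterKernel and the variables d_in, d_out, width and height are placeholders, not part of the original implementation.

// Minimal sketch of a 2D execution configuration (placeholder names).
dim3 threadsPerBlock(16, 16);   // 256 threads; one block runs on one multiprocessor
dim3 numBlocks((width  + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (height + threadsPerBlock.y - 1) / threadsPerBlock.y);
filterKernel<<<numBlocks, threadsPerBlock>>>(d_in, d_out, width, height);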
Because this distributed shared-memory architecture is well known, the algorithm can be optimized for maximum performance. Firstly, global memory accesses of different threads can be coalesced into a single memory transaction, which reduces the load on the memory bus and increases the overall speed. Secondly, the number of blocks and the number of threads per block can be tuned to ensure that every processor is kept busy and no valuable processor cycles are lost. The effective fraction of streaming processors in use is called the occupancy. Enough blocks and threads should be defined to utilize every streaming processor.
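As an illustration of coalescing, the kernel sketch below lets consecutive threads in a warp read and write consecutive global memory addresses, so the hardware can combine them into a small number of memory transactions. The kernel name scaleKernel and its gain parameter are illustrative assumptions, not part of the original method.

// Sketch: coalesced access pattern; thread i touches element i, so a warp
// reads and writes one contiguous segment of global memory.
__global__ void scaleKernel(const float *in, float *out, int n, float gain)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = gain * in[i];   // coalesced read and coalesced write
}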
To increase occupancy, memory latency should be hidden. Threads become idle while waiting for the result of a global memory fetch, which can take hundreds of clock cycles. Idle threads can therefore be swapped out so that other threads continue their execution. Because threads can also use data from other threads in the same block, it often happens that the majority of the threads are waiting on a few threads, which in turn are waiting for the result of their global memory fetch. To counter this idle processor time, the multiprocessor can switch to another block of threads. This is only possible when the blocks are small enough to allow multiple blocks to reside on the multiprocessor at the same time.
Figure 3: Implementation strategies for FIR filtering on parallel SIMT architectures. (a) Naive strategy: every thread fetches all required data, resulting in multiple slow memory accesses per thread. (b) Optimized strategy: every thread fetches only one data element and stores it in shared memory. Because the other threads fetch the other data elements, data can be reused, resulting in fewer slow memory accesses.

The number of blocks per multiprocessor and the number of threads per block are limited by the compute capability of the hardware. More recent compute capabilities allow more flexibility and more simultaneous threads, so the optimal configuration depends on the compute capability of the target device.
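These compute-capability limits can be queried at run time with the CUDA runtime API; the short sketch below simply prints the properties of device 0 that constrain the block and thread configuration.

#include <cuda_runtime.h>
#include <cstdio>

// Sketch: query the limits imposed by the compute capability of the device.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("compute capability      : %d.%d\n", prop.major, prop.minor);
    printf("max threads per block   : %d\n", prop.maxThreadsPerBlock);
    printf("shared memory per block : %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}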
4 IMPLEMENTATION
We implemented the method of (Malvar et al., 2004)
using CUDA. Because this method is a specialization
of generic FIR filtering, we will discuss this first.
4.1 Generic FIR Filtering
We can use this parallel GPU computing architecture to implement linear FIR filtering with the aid of a user-managed cache, i.e. the shared memory (Goorts et al., 2009). When implementing, for example, a 3 × 3 filter without optimizations, we can allocate one thread for every pixel of the image. Each thread accesses 9 pixels of the image in global memory to calculate the final result for its allocated pixel (see Figure 3 (a)). However, it is possible to reduce the number of memory accesses by reusing the information of nearby threads: each thread loads only its allocated pixel into shared memory and can then use the values of nearby pixels, which are loaded into shared memory by the other threads. This reduces the number of global memory accesses per thread to one, as shown in Figure 3 (b).
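Before turning to the shared-memory variant, the kernel below is a minimal sketch of the naive strategy of Figure 3 (a), in which every thread issues nine global memory reads. The kernel and parameter names are illustrative, and the image border is simply skipped for brevity.

// Naive 3x3 FIR filter: every thread reads all 9 input pixels it needs
// directly from global memory.
__global__ void fir3x3Naive(const float *in, float *out,
                            const float *coeff,   // 9 filter coefficients
                            int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1)
        return;                                    // skip the image border

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)               // 9 slow global reads
        for (int dx = -1; dx <= 1; ++dx)
            sum += coeff[(dy + 1) * 3 + (dx + 1)]
                 * in[(y + dy) * width + (x + dx)];
    out[y * width + x] = sum;
}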
Since the size of the blocks is limited, threads at the borders of a block do not have all required data available in shared memory to calculate their filtered value. We must therefore create extra threads at the block borders that only read pixel information and do not calculate a new value. This way, these border threads do not themselves require values from neighboring threads. The set of these read-only threads is called the apron (see Figure 4).
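The kernel below is a minimal sketch of this strategy, assuming a 16 × 16 tile of output pixels surrounded by a one-pixel apron, so that each block launches 18 × 18 threads and every thread performs exactly one global memory read. The names, the tile size and the clamping at the image border are illustrative assumptions, not the exact implementation used.

#define TILE  16                   // threads per block dimension that produce output
#define APRON 1                    // extra read-only border threads
#define BDIM  (TILE + 2 * APRON)   // launch BDIM x BDIM threads per block

// Optimized 3x3 FIR filter (Figure 3 (b)) with an apron (Figure 4): every
// thread loads exactly one pixel into shared memory; the outer ring of
// threads only loads data and does not compute an output pixel.
__global__ void fir3x3Shared(const float *in, float *out,
                             const float *coeff, int width, int height)
{
    __shared__ float tile[BDIM][BDIM];

    // Global coordinates of the pixel this thread loads, clamped to the image.
    int gx = blockIdx.x * TILE + threadIdx.x - APRON;
    int gy = blockIdx.y * TILE + threadIdx.y - APRON;
    gx = min(max(gx, 0), width - 1);
    gy = min(max(gy, 0), height - 1);

    // Exactly one global memory read per thread.
    tile[threadIdx.y][threadIdx.x] = in[gy * width + gx];
    __syncthreads();

    // Apron threads stop here; only the inner TILE x TILE threads compute.
    if (threadIdx.x < APRON || threadIdx.x >= BDIM - APRON ||
        threadIdx.y < APRON || threadIdx.y >= BDIM - APRON)
        return;

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += coeff[(dy + 1) * 3 + (dx + 1)]
                 * tile[threadIdx.y + dy][threadIdx.x + dx];
    out[gy * width + gx] = sum;
}

The corresponding grid would then contain ceil(width / TILE) × ceil(height / TILE) blocks of BDIM × BDIM threads, so that every output pixel is covered by exactly one computing thread.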