of further research which will use GPU-based CUDA
programming to explore large-scale SNNs, using dif-
ferent bio-inspired neuron models.
2 TESLA GPUS AS A PLATFORM
FOR PARALLEL PROCESSING
Graphics Processing Units (GPUs) are hardware
units traditionally used by PCs to render graphics
information to the user. The information presented
by displays has evolved as the power of GPUs has
increased, from early PC systems which provided an
almost typewriter-class monochrome display to the
immersive 3D displays found in advanced worksta-
tion PCs today. This evolution has led to the design,
by Nvidia, of modern graphics cards which are in
essence massively multi-threaded processors with
high memory bandwidth (Kirk and Hwu, 2010). The
power of GPUs has traditionally been used to improve
the appearance of the GUI in computer systems and
to improve the visual quality of games. However,
as interest within research communities has grown
in harnessing the power of parallel systems, GPUs
have been seen as a way to bring huge amounts of
processing power to an individual desktop. Nvidia
has responded by developing the Compute Unified
Device Architecture (CUDA), which allows program-
mers to better leverage the parallel processing capa-
bility of GPUs. In turn this has allowed GPUs to
claim a place within the high performance computing
family of systems, and has also lowered the cost of
entry from thousands of pounds to hundreds of pounds.
As a result of the evolution of high performance
graphics systems, Nvidia GPUs are now available
whose characteristics allow the parallel execution
of many thousands of lightweight threads. Re-
searchers are currently working on ways to harness
the power of these lightweight threads, so that they
may be used to simulate many thousands of neurons
on a GPU (Nageswaran et al., 2009).
Nvidia GPUs which can be programmed using
the CUDA framework typically contain arrays of
Streaming Multiprocessors (SMs), with up to 30 SMs
available on the largest Nvidia Tesla devices. This
capability is paired with up to 4GB of memory, which
results in supercomputer-class performance on desk-
top PCs. Each of the Streaming Multiprocessors
available in a Tesla GPU typically contains eight
floating point Scalar Processors (SPs), a Special
Function Unit (SFU), a multi-threaded instruction
unit, 16KB of shared memory which can be managed
by the user, and 16KB of cache memory. GPUs also
contain a hardware scheduling unit which selects
which group of threads (in Nvidia terminology, a
'warp' of threads) is to be run on the SM. If a single
thread within the warp requires access to data which
is held in external memory, then the hardware schedul-
ing unit can mark another warp of threads to run on
the SM while the data for the thread in the first group
is retrieved from external memory, thus helping to
mask the memory access time and improve overall
performance.
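To illustrate this programming model, the following minimal CUDA sketch (our own illustration, not code from the application discussed here; the kernel and variable names are assumptions) launches roughly one million lightweight threads, one per array element, grouped into blocks of 256 threads (eight warps). While one warp waits on its global-memory load, the hardware scheduler can run another warp on the same SM, masking the memory access time as described above:

```cuda
#include <cuda_runtime.h>

// Each thread scales one element. The global-memory read issued by
// one warp can be overlapped with computation from other warps by
// the SM's hardware scheduler.
__global__ void scaleKernel(const float *in, float *out, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the final partial block
        out[i] = k * in[i];
}

int main()
{
    const int n = 1 << 20;                 // ~1M elements, ~1M threads
    size_t bytes = n * sizeof(float);
    float *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);

    int threadsPerBlock = 256;             // 8 warps of 32 threads per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(dIn, dOut, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

The two-level grid/block launch configuration shown here is what allows the same kernel to scale across devices with different numbers of SMs: blocks are distributed over whatever SMs are available.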
3 SNN ARCHITECTURE FOR
EDGE DETECTION
In order to evaluate the potential simulation per-
formance improvements which may be obtainable
when using GPUs as an acceleration platform, an ap-
plication emulating neural networks was required.
The application selected for comparison was the
SNN edge detection application previously reported
by (Wu et al., 2007). This application has subse-
quently been implemented on a field programmable
gate array (FPGA) system, which is a powerful par-
allel hardware environment. When combined with a
powerful novel reconfigurable architecture (Glackin
et al., 2009a), this implementation showed impres-
sive speed increases: a speed-up of over an order of
magnitude was recorded over a similar CPU-based
implementation of the SNN edge detection applica-
tion running on an Intel Xeon class processor. The
SNN architecture of this application will now be
briefly described in the rest of this section.
The principle upon which the SNN application
was designed is to use receptive fields tuned to up,
down, left and right orientations to detect the edges
contained within the SNN input image.
As can be seen in figure 1, the edge detection ap-
plication contains a number of 'layers'. In the 'Recep-
tor layer', each node represents the current value ob-
tained when converting the pixel value at the corre-
sponding location in the input image. Each value in
this layer is then forwarded on to the intermediate
layer 'N' via 5x5 receptive field (RF) weight distri-
butions, such that one excitatory and one inhibitory
field is formed for each orientation direction and are
labelled as '∆' and 'X' respectively. Thus, eight RF
orientations were used: RF_up_exc, RF_up_inh,
RF_down_exc, RF_down_inh, RF_left_inh,
RF_left_exc, RF_right_inh and RF_right_exc.
Figure 2 shows the weight distribution matrices for
each of the orientation selective receptive fields.
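To make the receptive-field computation concrete, the following CUDA sketch (our own illustration; the kernel name and memory layout are assumptions, and the weight matrix passed in would be one of the orientation-selective distributions of figure 2) shows how a single 5x5 RF weight distribution could be applied in parallel, with one thread computing the weighted input current for one intermediate-layer neuron:

```cuda
#include <cuda_runtime.h>

#define RF 5                    // 5x5 receptive field
#define HALF (RF / 2)

// Apply one 5x5 receptive-field weight matrix to the receptor layer.
// One thread computes the weighted sum feeding the intermediate-layer
// neuron at (x, y); pixels outside the image are treated as zero.
__global__ void applyReceptiveField(const float *image, float *current,
                                    const float *weights,
                                    int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    float sum = 0.0f;
    for (int dy = -HALF; dy <= HALF; ++dy) {
        for (int dx = -HALF; dx <= HALF; ++dx) {
            int ix = x + dx;
            int iy = y + dy;
            if (ix >= 0 && ix < width && iy >= 0 && iy < height)
                sum += weights[(dy + HALF) * RF + (dx + HALF)]
                     * image[iy * width + ix];
        }
    }
    current[y * width + x] = sum;   // input current for neuron (x, y)
}
```

A kernel of this form would be launched once per orientation with the corresponding excitatory or inhibitory weight matrix, yielding the eight input-current maps that drive the intermediate-layer neurons.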
As indicated in figure 3, there are 50 RF connec-
tions from each input pixel to the 'Intermediate' (fig-
ure 1) layer. Each value in the input layer (x,y) is con-
PERFORMANCE COMPARISON OF A BIOLOGICALLY INSPIRED EDGE DETECTION ALGORITHM ON CPU,
GPU AND FPGA
421