MFF networks with the BP and MBP algorithms (the latest version of the Multiple Back-Propagation software can be freely obtained at http://dit.ipg.pt/MBP).
The CPU version was benchmarked on an Intel Core 2 6600 CPU running at 2.4 GHz, whilst the GPU version was benchmarked on two different NVIDIA devices: a GeForce 8600 GT with 4 SMs (32 cores) and a GTX 280 with 30 SMs (240 cores).
The neural network models used in this study consisted of MFF networks comprising: (i) a main network containing an input layer with N_i neurons, a hidden layer with N_h1 neurons with selective activation, an optional second hidden layer with N_h2 neurons (without selective activation) and an output layer with N_o neurons; and (ii) a space network with N_i inputs and N_h1 outputs.
Three benchmarks (“two-spirals”, “sonar” and “Friedman”) were chosen for testing and comparing the online (stochastic) and the batch parallel implementations of the MBP algorithm. The “two-spirals” benchmark, considered extremely hard for algorithms of the BP family to solve (Fahlman and Lebiere, 1990), consists of discriminating between the points of two distinct spirals that coil three times around one another and around the origin of the x-y plane.
The “sonar” benchmark, available at the Carnegie Mellon University collection of neural network benchmarks (http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/neural/bench/cmu/), consists of discriminating the sonar signals bounced off a metal cylinder from those bounced off a roughly cylindrical rock.
The “Friedman” benchmark consists of approximating the function f(x) = 10 sin(π x_1 x_2) + 20 (x_3 − 1/2)^2 + 10 x_4 + 5 x_5 (Friedman, 1991); the training data is available at the Bilkent University function approximation repository (http://funapp.cs.bilkent.edu.tr/DataSets/).
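For reference, the host-side sketch below shows how such benchmark data can be produced: a possible parametrization of the two-spirals points and the Friedman #1 target function. The actual datasets come from the repositories cited above; the function names and the exact spiral parametrization are our own illustrative choices.

```cuda
#include <cmath>

const float kPi = 3.14159265f;

// One point of the two-spirals benchmark under a simple parametrization:
// spiral s (0 or 1), position t in [0, 1); the radius shrinks towards the
// origin while the angle completes three full turns.
void TwoSpiralsPoint(int s, float t, float *x, float *y) {
    float r = 6.5f * (1.0f - t);
    float a = t * 3.0f * 2.0f * kPi;
    *x = (s == 0) ? r * cosf(a) : -r * cosf(a);
    *y = (s == 0) ? r * sinf(a) : -r * sinf(a);
}

// Friedman #1 target function,
// f(x) = 10 sin(pi x1 x2) + 20 (x3 - 1/2)^2 + 10 x4 + 5 x5, x in [0,1]^5.
float Friedman(const float x[5]) {
    return 10.0f * sinf(kPi * x[0] * x[1])
         + 20.0f * (x[2] - 0.5f) * (x[2] - 0.5f)
         + 10.0f * x[3] + 5.0f * x[4];
}
```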
5.2 Results and Discussion
For a fair comparison of the implementations (in both GPU and CPU versions), the initial weights were identical (independently of the implementation and hardware). Table 2 shows the number of epochs trained per minute, according to the hardware used. The number of epochs trained per minute using the batch mode is far higher on the GPU than on the CPU. However, when using the online (stochastic) mode the GPU can only achieve better results than the CPU when the trained NN contains a sufficiently large number of connections. This is better emphasized in Table 3, which shows the speedups attained by the GPU over the CPU for both the batch and online implementations of
the MBP algorithm. A speedup value S greater than one means the GPU implementation is S times faster than the corresponding CPU implementation, whilst a speedup value smaller than one means the GPU implementation is slower than the corresponding CPU implementation. The online training mode can only take advantage of the parallelism inherent to the NN layers. On the other hand, the batch training mode can also benefit from the fact that each pattern can be processed independently. Thus, in the batch mode the patterns can be processed in parallel, leading to greater speedups regardless of the number of connections.
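The CUDA sketch below illustrates this pattern-level parallelism for the hidden-to-output gradients: in batch mode each thread can process a different training pattern, whereas in online mode the loop over patterns is inherently sequential. The kernel is illustrative only; in particular, float atomicAdd needs compute capability 2.0 or later (the GTX 280 is 1.3), so an implementation on that hardware would accumulate the per-pattern contributions with a parallel reduction instead.

```cuda
// Illustrative only: pattern-level parallelism in batch mode. Each thread
// handles one training pattern and adds its contribution to the gradient of
// every hidden-to-output connection; the epoch then applies a single weight
// correction using the accumulated gradients.
__global__ void AccumulateOutputGradients(
        const float *h,      // hidden outputs [patterns][Nh]
        const float *delta,  // output deltas  [patterns][No]
        float *grad,         // gradient accumulators [No][Nh]
        int patterns, int Nh, int No)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;  // pattern index
    if (p >= patterns) return;

    for (int o = 0; o < No; o++)
        for (int j = 0; j < Nh; j++)
            atomicAdd(&grad[o * Nh + j], delta[p * No + o] * h[p * Nh + j]);
}
```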
Execution pipelines on CPUs support a limited number of concurrent threads: a server with four quad-core processors can only run 16 threads in parallel (32 if the CPUs support Hyper-Threading). By contrast, the smallest executable unit of parallelism on a CUDA device, called a warp, comprises 32 threads, and all NVIDIA GPUs can support at least 768 active threads per multiprocessor. On devices that have 30 multiprocessors (such as the GTX 280), more than 30,000 threads can run simultaneously (NVIDIA, 2009a). Thus, to take full advantage of the GPU parallel processing capabilities, a large number of threads is required, which cannot be accomplished using the online training mode for the vast majority of problems.
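The thread-count arithmetic can be reproduced with the CUDA runtime API, as in the host-side sketch below (the maxThreadsPerMultiProcessor field assumes a reasonably recent CUDA toolkit; for a GTX 280-class device it would give 30 × 1024 = 30,720 resident threads).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Host-side sketch of the thread-count arithmetic: resident threads =
// multiprocessors x active threads per multiprocessor.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d SMs x %d threads/SM = %d resident threads\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor,
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return 0;
}
```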
Table 4 shows how many times the NN weights are corrected per minute on a GTX 280. The ratio of the number of weight corrections in the batch mode relative to the online mode is far greater on the GPU than on the CPU, where it represents a small fraction (the corresponding CPU values are not presented, but they can easily be calculated from the data contained in Table 2).
This is better emphasized by Table 5, which shows the ratios between the batch and the online training modes for both (i) the number of epochs per minute and (ii) the number of network weight corrections per minute. Thus, on the GPU, the online training mode is not guaranteed to converge faster than the batch training mode, even in situations where that holds true on the CPU. In fact, in our experimental tests, we found the batch training mode to converge faster than all the other modes for the “two-spirals” and “sonar” benchmarks. However, in the “Friedman” benchmark, the mini-batch mode outperformed the batch mode regardless of the number of patterns selected, and the batch mode in turn outperformed the stochastic version. Figure 2 shows the RMS error versus the training mode. Although we ran the mini-batch mode with 32, 64, 128, 256 and 512 patterns (multiples of the warp size, 32, for performance reasons), the graph only shows two results (namely, 64 and 512 patterns) for clarity purposes.
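To make the relation between training modes and weight corrections explicit, the sketch below shows a generic epoch loop with hypothetical kernel-launch helpers (not the MBP software's API): with P patterns and mini-batches of B patterns, an epoch performs P weight corrections in online mode, P/B in mini-batch mode and a single one in batch mode.

```cuda
#include <algorithm>

// Hypothetical stand-ins for the GPU passes (not the MBP software's API).
void ForwardBackwardKernels(int firstPattern, int count) { /* launch kernels */ }
void UpdateWeightsKernel() { /* launch kernel */ }

// One epoch over P patterns with mini-batches of B patterns: B == 1 gives
// the online (stochastic) mode, B == P gives the batch mode; in between,
// each epoch performs P/B weight corrections instead of P or 1.
void TrainEpoch(int P, int B /* a multiple of the warp size, 32 */) {
    for (int start = 0; start < P; start += B) {
        ForwardBackwardKernels(start, std::min(B, P - start));
        UpdateWeightsKernel();  // one weight correction per mini-batch
    }
}
```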