Figure 3: Combination of Amdahl's law with single-core performance; performance versus log(number of cores).
problem size per thread, the implementation will run
faster for larger problems. On a GPU, the problem
size per thread cannot be chosen freely: it is determined
by the number of cores and the overall problem size,
both of which are given. But on a CPU with, e.g., 4 cores,
the number of threads can be varied almost continuously,
changing the problem size per thread at will.
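To make the contrast concrete, the following C sketch (using POSIX threads) computes a squared 2-norm, in the spirit of the snorm2 benchmark used below, with the thread count T as a free parameter: the problem size per thread is simply n/T. The type and function names are ours, not those of the actual benchmark code.

#include <pthread.h>
#include <stdlib.h>

/* One contiguous slice of the input, plus a per-thread partial sum. */
typedef struct { const float *x; size_t begin, end; float partial; } slice_t;

static void *sum_squares(void *arg) {
    slice_t *s = arg;
    float acc = 0.0f;
    for (size_t i = s->begin; i < s->end; ++i)
        acc += s->x[i] * s->x[i];
    s->partial = acc;
    return NULL;
}

/* Squared 2-norm of x[0..n-1], computed with T threads; the problem
 * size per thread (chunk) follows directly from the choice of T. */
float snorm2_threads(const float *x, size_t n, unsigned T) {
    pthread_t *tid = malloc(T * sizeof *tid);
    slice_t *sl = malloc(T * sizeof *sl);
    size_t chunk = (n + T - 1) / T;
    for (unsigned t = 0; t < T; ++t) {
        size_t b = t * chunk;
        sl[t].x = x;
        sl[t].begin = b < n ? b : n;
        sl[t].end = b + chunk < n ? b + chunk : n;
        pthread_create(&tid[t], NULL, sum_squares, &sl[t]);
    }
    float total = 0.0f;
    for (unsigned t = 0; t < T; ++t) {
        pthread_join(tid[t], NULL);
        total += sl[t].partial;
    }
    free(tid);
    free(sl);
    return total;
}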
Figure 4 shows the results of this experiment on
our test system. We ran the same problem from Fig-
ure 1 using 1 to 480 threads on a quad-core Intel i5.
Since every core runs at about 3 GHz, a maximal per-
formance of 12 GFlops can be expected, assuming
1 flop per core per clock cycle (4 cores times 3 GHz).
As expected, the performance doubles when going
from 1 thread to 2 and quadruples for 4 threads. But
all four of these implementations level off around the
same point, at approximately 1 GFlops. In other words,
for problems larger than 2 megabytes, the four cores combined perform as
poorly as only one.

Figure 4: Running snorm2 with 1 to 480 threads, on 4 cores, for increasing problem size (logarithmic scale).

Now, if we run the same problem with 48 to 480 threads on the 4 CPU cores, then
the region of maximal performance clearly shifts to
the right. This 'optimal' region also tends to be much
broader; recall that the x-axis is logarithmic. Very
little performance is lost to the extra overhead of
launching the threads.
Of course, this level of optimization could also be
achieved classically, by probing the speed of the
internal busses and the sizes of the caches. But with
massive multithreading, the number of threads becomes
an optimization parameter that is largely independent
of the hardware. This makes porting and reusing
software easier, as the sketch below illustrates.
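A minimal sketch of such hardware-independent tuning, reusing snorm2_threads from above: sweep the thread count, time each run, and keep the fastest. Nothing about busses, caches, or even the core count needs to be queried. The sweep range and the name tune_threads are our own choices, not taken from the paper's code.

#include <time.h>

float snorm2_threads(const float *x, size_t n, unsigned T);

/* Return the thread count with the shortest measured run time,
 * probing a coarse geometric range (1, 2, 4, ..., 256). */
unsigned tune_threads(const float *x, size_t n) {
    unsigned best_T = 1;
    double best_s = 1e30;
    for (unsigned T = 1; T <= 480; T *= 2) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        (void)snorm2_threads(x, n, T);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        if (s < best_s) { best_s = s; best_T = T; }
    }
    return best_T;
}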
5 CONCLUSIONS
With the advent of massively multicore GPUs came
the new programming paradigm of massively paral-
lel computing. In this paper, we have argued that the
usefulness of this new paradigm need not be limited
to the GPU: there are sound theoretical reasons to ex-
pect that the same programming style may also be
beneficial for programs run on CPUs. In particular,
it may help to shift the performance curve towards
the end of the spectrum where performance is most
needed, i.e., the large instances. At the negligible cost
of solving small instances slightly less quickly, the
number of instances that can actually be solved can
be drastically increased, as demonstrated by the pre-
liminary experiment discussed in this paper.
A second advantage is of course that, by using the
same programming paradigm on both CPU and GPU,
the choice between those different types of hardware
can be postponed. A suitably designed algorithm
may be deployed on either one with only minimal
changes. Moreover, if a platform-independent paral-
lel programming language such as OpenCL is used,
it may not be necessary to change the implementation
at all.
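As a simplified illustration of this point: in the OpenCL host API, the choice of hardware reduces to a single device-type flag passed to clGetDeviceIDs; the kernels and the rest of the host code stay the same. The helper below is our own sketch, not code from the implementation discussed here.

#include <CL/cl.h>
#include <stdio.h>

/* Pick the first device of the requested type (CL_DEVICE_TYPE_CPU or
 * CL_DEVICE_TYPE_GPU) on the first platform. Everything downstream
 * (context, program build, kernel launch) is identical for both. */
cl_device_id pick_device(cl_device_type type) {
    cl_platform_id platform;
    cl_device_id device = NULL;
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS ||
        clGetDeviceIDs(platform, type, 1, &device, NULL) != CL_SUCCESS)
        fprintf(stderr, "no device of the requested type found\n");
    return device;
}

Deferring this one flag to run time, for instance as a command-line option, is enough to postpone the CPU/GPU decision until deployment.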
It is not necessary to know at design time how
many cores the hardware will have, since the num-
ber of threads (not cores) is a possible optimization
parameter. While executing these threads, CPUs can
address a larger amount of memory than GPUs and
still have more efficient pipelines. On the other hand,
CPU cores are far fewer in number.
Finally, debugging software on a CPU platform is
much better supported than on a GPU. So if software
is designed so that it can be transferred between CPU
and GPU as if they were highly similar, then the
debugging problem of GPUs is somewhat alleviated.