THE FUTURE OF PARALLEL COMPUTING: GPU VS CELL

General Purpose Planning against Fast Graphical Computation Architectures,

which is the Best Solution for General Purposes Computation?

Luca Bianchi, Riccardo Gatti and Luca Lombardi

Department of Informatics Engineering, University of Pavia, Via Ferrata 1, Pavia, Italy

Keywords: Real-time Visual Simulations, Parallel Computing, HPC, CELL, GPU, GPGPU, General Purposes, Fast

Computing.

Abstract: Complex models require high performance computing (HPC), which in practice means parallel computing. The question we address in this paper is: "Which solution is best suited to HPC contexts such as rendering, and can it also be used for general purpose computation?" Starting from this question we analyze two different approaches, the IBM CELL processor and the well-known GPGPU, showing how changing our minds and breaking some assumptions can lead to unexpected results and open a whole set of new possibilities. We start from rendering but quickly move towards general purpose computation, because many algorithms used in visual simulations are not limited to rendering issues and address a wider range of problems.

1 HIGH PERFORMANCE COMPUTING

Over the last couple of years the demand for fast computation has grown dramatically, thanks to the wide diffusion of multimedia applications and the availability of complex models for visual simulations.

Many attempts have been made to push CPUs beyond their limits while keeping up with the performance growth rate described by Moore's law. Until recently, processor performance gains were obtained by increasing clock frequency. However, this led to unavoidable problems that slowed down the CPU growth rate, chiefly heat dissipation problems caused by too high an integration of circuits on a single chip.

A couple of years ago some solutions started to be implemented to overcome this limit, which would otherwise have bounded computing performance and, indirectly, technology development. One interesting solution comes from 3D graphics and involves the adoption of parallel architectures for High Performance Computing. It gained wide success in the performance-critical context of rendering for 3D graphics. The solution proved so effective that researchers soon started to investigate it outside the rendering context in which it was initially developed. In this way a new paradigm emerged and became established: General Purpose computation on GPU, or GPGPU (Luebke & Harris, 2004).

During the same years Sony, IBM and Toshiba started to develop a new processor architecture designed for parallel computing, called CELL.

1.1 The Heart of the Problem

Many publications show that both solutions outperform traditional CPUs. No one, though, has yet taken a position on what parallel computing will look like in the coming years. We started from the observation that real-time visual simulations often have to address problems that are not concerned with rendering itself (e.g. physics, collision detection, artificial intelligence) and that require fast computation even for general purpose algorithms. This is the reason why hardware producers such as nVidia propose their solutions not only for rendering or graphics-related contexts.

Bianchi L., Gatti R. and Lombardi L. (2008).

THE FUTURE OF PARALLEL COMPUTING: GPU VS CELL - General Purpose Planning against Fast Graphical Computation Architectures, which is

the Best Solution for General Purposes Computation?.

In Proceedings of the Third International Conference on Computer Graphics Theory and Applications, pages 419-425

DOI: 10.5220/0001099904190425

 SciTePress

Figure 1: Standard Rendering Pipeline (The Computer

Language Co. Inc., 2004).

With this fact in mind we asked ourselves i) which of GPGPU and CELL is the better approach for HPC, ii) how important flexibility is and iii) how much it costs in terms of performance loss.

Our goal is to show that the CELL processor provides higher flexibility than the GPGPU approach, as well as performance that is comparable if not better in many general purpose contexts. Our intention is not to chase clock frequencies, but to discuss how architectural choices make CELL a better solution for HPC. For this reason, we present only a small amount of data and then discuss it.

The next section introduces both the GPGPU and CELL approaches to parallel computing, assuming that both provide better performance than traditional CPUs (nVidia, 2006) (IBM, 2007). Section 3 counters some common assumptions suggesting that GPGPU is the only solution for parallel computing and explains how CELL can overcome them. Section 4 shows some of the benefits provided by this processor that are not available with GPGPU. Finally, Section 5 presents the work we propose in order to unleash the power of the CELL processor.

2 PARALLEL COMPUTING

2.1 General Purpose on GPU

This approach developed in the specific context of real-time 3D graphics, out of the strong need for high accuracy with complex shading models and real-time rendering performance.

Much has been done since the late 90s, when the first programmable GPU unit was built. Since then, the graphics pipeline has been thought of not as a fixed sequence of steps built into the GPU, but as a solution with two programmable stages (figure 1). This idea was supported by the adoption of a stream processing SIMD architecture, which can perform a single instruction on multiple data in a single execution step (Akenine-Moller & Haines, 2002).

The performance advantages of a programmable rendering pipeline are well known today, and not only at the graphics level. The idea behind so-called GPGPU is to play a trick by using the graphics unit for general computation. For this purpose, generally only the fragment shader is used: a single quad is sent down the pipeline, while data are stored in a texture as pixel values. At the rasterization stage the quad is mapped onto the screen dimensions and a grid is built. Thanks to this grid we can reference each datum in our texture and use it for computation. Results are stored in a framebuffer, from where they can easily be transferred back to the CPU or sent to the screen. The program in the fragment shader could be any kind of algorithm, thus realizing general purpose computing (Luebke & Harris, 2004). What should be kept in mind is that GPGPU execution is still constrained by the concept of a pipeline and that computing is done using graphics hardware as a black box. One of the greatest limits of this approach is that each stream processor has to work alone, without sharing data with other processors.
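The fragment-shader style of computation described above can be mimicked on the CPU. The following Python sketch is only an illustration of the programming model, not real GPU code: the names run_fragment_pass and scale_offset_kernel are hypothetical, the nested list stands in for the texture and the returned grid for the framebuffer.

```python
# CPU emulation of the fragment-shader GPGPU model (illustrative only).

def run_fragment_pass(texture, kernel):
    """Apply `kernel` once per texel, as the rasterized quad would."""
    height = len(texture)
    width = len(texture[0])
    framebuffer = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Each invocation may read the input texture, but it writes only
            # its own output cell and never sees another invocation's result.
            framebuffer[y][x] = kernel(texture, x, y)
    return framebuffer


def scale_offset_kernel(texture, x, y):
    """Hypothetical fragment program: scale and offset a single texel."""
    return 2.0 * texture[y][x] + 1.0


data = [[0.0, 1.0], [2.0, 3.0]]
result = run_fragment_pass(data, scale_offset_kernel)
```

The kernel receives only the input texture and its own coordinates, which reflects the limit noted above: no invocation can use a value computed by another one during the same pass.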

2.1.1 Compute Unified Device Architecture and Tesla

In late 2006 nVidia Corporation sold the first

GeForce 8800 card which used a new kind of

pipeline (nVidia Technical Brief, 2006). This

architecture, according to Shader Model 4 specs, can provide more flexibility than the previous one. The separation between vertex and fragment units was left aside in moving to a new unified unit capable of load balancing (figure 2).

GRAPP 2008 - International Conference on Computer Graphics Theory and Applications

Data transfers between stream processors were also made more flexible by introducing a parallel data cache, which can be used to make computed data available to other units. This is a great step forward, but the GPU architecture still remains constrained by the pipeline paradigm, which has proved to be a good solution for fast rendering.

Recently, nVidia started proposing a modified version of the GeForce 8800 named Tesla, with more memory and no DVI connectors (nVidia, 2007). This solution is designed for GPGPU. Unfortunately, we could not retrieve any benchmark data for either Tesla or the 8800 series. According to nVidia, CUDA graphics cards are 30% faster than other cards; on this basis a rough comparison can be drawn even without direct data.

Figure 2: CUDA Architecture.

2.2 CELL Processor

CELL was designed by a partnership of Sony, Toshiba and IBM to be the heart of Sony’s PlayStation 3 gaming console. However, results from Lawrence Berkeley National Laboratory show that the CELL architecture has tremendous potential for scientific computations in terms of both raw performance and power efficiency (Williams, Shalf, Oliker, Kamil, Husbands, & Yelick, 2006).

CELL combines considerable floating point resources, as required by demanding numerical algorithms, with a power-efficient software-controlled memory hierarchy. Instead of slowly evolving towards a streaming SIMD multi-core architecture, the CELL processor was designed with these concepts in mind from the start.

The CELL architecture is made of nine processors operating on a shared, coherent memory. The processors are of two kinds: one PPE and eight SPEs. The PPE (PowerPC Processor Element) is a high performance 64-bit PowerPC core with a 32KB L1 cache and a 512KB L2 cache (figure 3).

Each SPE (Synergistic Processor Element) has a

Synergistic Processor Unit (SPU), a 256KB local

memory and a memory flow controller. SPEs are

independent processors that are optimized for

running compute-intensive applications.

The PPE provides support for the operating system and manages the work of all the SPEs. The PPE uses 2-way simultaneous multithreading, which is comparable to Intel Hyper-Threading. The SPEs, on the other hand, provide CELL with its application performance. Each SPE includes four single precision (SP) datapaths and one double precision (DP) datapath, so SIMD double-precision operations must be serialized. An SPE cannot access the main shared memory directly: it must transfer data via DMA to its own local store using the Memory Flow Controller (MFC). The MFC operates asynchronously with respect to the SPU, so it is possible to overlap DMA transfers with other concurrent operations.
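A common way to exploit this asynchrony is double buffering: the next block is fetched while the current one is being processed. The Python sketch below only illustrates the idea and does not use any CELL API; dma_get and process_stream are hypothetical names, a one-thread executor plays the role of the MFC and a plain list stands in for main memory.

```python
from concurrent.futures import ThreadPoolExecutor


def dma_get(main_memory, start, size):
    """Stand-in for an MFC 'get': copy one block into a local buffer."""
    return list(main_memory[start:start + size])


def process_stream(main_memory, block_size, compute):
    """Double buffering: fetch block i+1 while computing on block i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as mfc:  # plays the MFC
        pending = mfc.submit(dma_get, main_memory, 0, block_size)
        for start in range(0, len(main_memory), block_size):
            current = pending.result()  # wait until the transfer is done
            next_start = start + block_size
            if next_start < len(main_memory):
                # Start the next transfer before computing on this block.
                pending = mfc.submit(dma_get, main_memory, next_start, block_size)
            results.extend(compute(value) for value in current)
    return results


squares = process_stream(list(range(10)), 4, lambda v: v * v)
```

The block size of 4 is arbitrary here; on real hardware it would be chosen so that two buffers plus code fit in the 256KB local store.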

All CELL elements are connected by four data rings known as the EIB (Element Interconnect Bus). The bus allows 8 bytes per cycle to be read and several transfers to be carried out simultaneously. Access to external memory is provided by a 25.6 GB/s XDR memory controller.

Figure 3: CELL Architecture Diagram (Gschwind, 2005).

3 COMMON BELIEFS

Thanks to the game industry, the GPU parallel computing model is more widespread than CELL. This has led to a series of assumptions that are often GPU-biased. In

this section we present the most common claims and try to restore the balance.

3.1 GPUs do not Cost as Much as CELL

This first claim arises from one of the few available papers that draw a direct comparison between the two architectures (Baker, Gokhale, & Tripp, 2007). That work shows that the GPU has a lower price and a higher Speedup/$K rate. This may appear obvious when reading the paper, but some things need to be pointed out.

First of all, looking at the raw data, the difference in performance and cost is not so large: although the CELL price is three times the nVidia 7900 GTX price, CELL is also three times faster. In fact the Speedup/$K rate is almost the same: the difference is 0.34. The important point is that this benchmark compares a single graphics card with a CELL blade system that mounts two processors. In order to make results comparable, only one of the blade’s processors is used. In this way we pay the cost of a whole blade but get half the power it could provide. If we instead consider the single CELL of a PS3, we find that a single video card has the same price as a whole PlayStation, which includes much more than the CELL processor. This small difference in price is even more evident if we compare the nVidia Deskside Tesla (sold at $7500) and the CELL QS21 Blade (almost $8000). What Baker, Gokhale & Tripp (2007) actually show is how important it would be to be able to buy a single CELL solution without the whole PS3 environment.

The last thing to point out is that the code used for the benchmarks in that paper is not optimized. This affects CELL performance more than GPU performance, as we discuss further on.

3.2 GPUs have a Faster Learning Curve

GPUs do have a faster learning curve if your aim is just to write a “Hello world” program. If your goal is to use the GPU for small algorithms with no high performance needs, you will be able to do that after a short while. If, on the contrary, your goal is to develop an optimized solution for a problem where performance really matters, then you will have to learn graphics programming and OpenGL (or DirectX), which makes the learning curve much less fast. Some good news comes from nVidia, with the announcement that a C compiler will be available for CUDA. Learning graphics will then no longer be necessary, but you will still need to know how your code is executed on the GPU. This is the very same problem that makes the CELL learning curve so slow.

In considering learning curves, the only difference worth pointing out is that the GPU makes parallel programming transparent to users (nVidia CUDA, 2007). However, it has not yet been demonstrated that this is an advantage in specific contexts where optimizations matter.

3.3 GPUs are Specific for Graphics and Provide Better Performance

This claim is often made when presenting benchmarks between GPUs and CPUs, and is obviously true. The closer an algorithm is to the graphics context, the greater the speedup obtained by porting it to the GPU. The common example is image filtering, where an impressive speedup can be obtained with respect to CPU implementations. What is never said but often thought is that GPUs are the best performing tool available in parallel computing, at both the graphics and general purpose levels. This is not true.

One of the most significant results obtained by CELL over the GPGPU architecture concerns matrix multiplication, a problem that has long been used to demonstrate the GPU’s abilities. The nVidia Quadro 4600 performs single precision matrix multiplication with a throughput of 90 GFLOPS (GPU-Tech, 2007). The same operation performed on a CELL processor with 8 SPUs runs at 140 GFLOPS (Barcelona Supercomputing Center, 2007). This result is highly significant, as matrix multiplication has always been GPU computing’s greatest achievement. We do not claim that CELL should be used for graphics rendering. Our purpose is just to show that, if this processor is competitive even in a context where the GPU has always been the top solution, its flexibility probably makes it a better choice for general purpose parallel computing.

It might be argued that, on paper, the nVidia G80 offers a higher GFLOPS rate than CELL (500 against 208). This is true if you only compare raw computation rates, assuming full utilization of both technologies, which is just an ideal case. In real applications, code optimization is extremely difficult for the GPU, and even more so if we consider the C compiler layer introduced by the CUDA architecture. In short, in real applications such as real-time ray tracing, CELL benefits from code optimization more than the GPU and provides higher performance even with the single six-core CELL

processor of the PlayStation 3, as shown in (Minor, 2007). That article compares both architectures, first on raw GFLOPS and then on an interactive ray tracing algorithm applied to the Stanford Bunny. The results are striking: a single CELL processor is four to five times faster than the G80. If we match the “on paper” computing power by using a QS20 blade (which has a comparable GFLOPS figure), CELL is eight to eleven times faster.
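As a rough illustration of the gap between peak and measured rates, the figures quoted in this section can be turned into utilization ratios. The sketch below is not taken from the cited sources. Pairing the 90 GFLOPS Quadro 4600 measurement with the 500 GFLOPS G80 peak is an assumption made only for illustration, since the measured card and the quoted peak may not refer to the same configuration; the function name utilization is hypothetical.

```python
def utilization(achieved_gflops, peak_gflops):
    """Fraction of the on-paper peak rate actually reached by a benchmark."""
    return achieved_gflops / peak_gflops


# Matrix multiplication figures quoted in the text (GFLOPS).
gpu_utilization = utilization(90.0, 500.0)    # assumed pairing, see lead-in
cell_utilization = utilization(140.0, 208.0)

print(f"GPU:  {gpu_utilization:.0%} of peak")
print(f"CELL: {cell_utilization:.0%} of peak")
```

Under these assumptions the GPU reaches roughly a fifth of its peak rate and CELL roughly two thirds, which is the sense in which peak GFLOPS alone is a poor predictor of delivered performance.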

4 WHY A GOOD SOLUTION IS A GOOD SOLUTION

The CELL architecture provides features that make it an excellent platform for developing any kind of application. We identify two main benefits in the CELL structure: the possibility of using and organizing the work of the different cores in a totally separate and independent way, and the fast communication system that links all the chip components. In addition, a developer framework for CELL offers useful tools such as profilers, simulators and compilers that help programmers take advantage of all the key CELL features.

4.1 Flexibility

The CELL Broadband Engine Architecture has been

designed to support a variety of different

applications.

Although the CELL processor was initially

conceived for application in game consoles or high-

definition televisions, its architecture was designed

to allow fundamental advances in processor

performance and programming flexibility.

The GPU architecture, on the other hand, was initially designed as a dedicated rendering device and is highly efficient at the algorithms and computations bound to graphics needs. Using such a dedicated architecture for general purpose applications forces programmers to deal with a large number of problems and limitations in their algorithms. In fact, the GPGPU programming concept is based on tricking the GPU into using the graphics pipeline for computations unrelated to graphics applications.

Programs that run on CELL typically split

computational cost among all the available processor

elements. In order to determine workload and data

distribution, the programmer should take the

following considerations into account:

• Processing-load distribution

• Program structure

• Data access patterns

• Code movement and data movement among processors

• Cost of bus loading and bus attachments

CELL programming offers several application partitioning models. The two main ones are the PPE-centric model and the SPE-centric model.

In the PPE-centric model the main application runs on the PPE, while the SPEs are used to off-load individual tasks. The PPE’s duty is to wait for and coordinate the results coming from the SPEs. Applications that have serial data and parallel computations fit this model well. The SPEs can be used in three different ways:

• The multistage pipeline model

• The parallel stages model

• The services model

If an application requires multiple sequential stages, the programmer can use the multistage pipeline model. Each stage of the application is loaded onto a single SPE and the results are sent through the shared bus from SPE to SPE. The data stream is initially sent to the first SPE and the results are taken from the last SPE, which contains the last stage of the application. The problems with multistage pipelining are load balancing between stages and large data movements between the SPEs.
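The structure of the multistage pipeline model can be sketched in Python, with threads standing in for SPEs and queues for the bus transfers between them. This is only a structural illustration under those assumptions; stage_worker and run_pipeline are hypothetical names and nothing here is CELL-specific code.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the data stream


def stage_worker(func, inbox, outbox):
    """One 'SPE': apply its stage to each item and forward the result."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate end-of-stream to the next stage
            return
        outbox.put(func(item))


def run_pipeline(stages, stream):
    """Chain one worker per stage; feed the first, read from the last."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=stage_worker, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for worker in workers:
        worker.start()
    for item in stream:
        queues[0].put(item)
    queues[0].put(_DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for worker in workers:
        worker.join()
    return results


pipeline_out = run_pipeline(
    [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3], range(5)
)
```

The throughput of such a chain is set by its slowest stage, which is the load balancing problem mentioned above.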

In the parallel stages model each SPE runs the same task; the input data of the application are split equally among all the SPEs and processed at the same time. This programming concept is similar to GPGPU, where the input data stream is processed at the same time by different shaders running the same kernel.
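A minimal Python sketch of the parallel stages model follows. It assumes a thread pool as a stand-in for the SPEs and uses the hypothetical names split_evenly and run_parallel_stages; it shows the data partitioning only and makes no claim about real speedups, since Python threads do not run compute-bound code in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(data, parts):
    """Split `data` into `parts` contiguous chunks of near-equal size."""
    base, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def run_parallel_stages(kernel, data, num_spes=8):
    """Every worker runs the same kernel on its own share of the input."""
    chunks = split_evenly(data, num_spes)
    with ThreadPoolExecutor(max_workers=num_spes) as pool:
        partials = pool.map(lambda chunk: [kernel(v) for v in chunk], chunks)
        # pool.map yields results in chunk order, so the output keeps
        # the order of the input.
        return [value for part in partials for value in part]


parallel_out = run_parallel_stages(lambda v: v * v, list(range(20)))
```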

The PPE-centric services model is used when the different tasks that make up a large application need to run in no predetermined order. A different program is loaded on each SPE, and the PPE calls the appropriate SPE when a particular service is needed.
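The services model can be pictured as a dispatch table from service names to dedicated workers. The Python sketch below is an illustration under that assumption; the ServiceSPE class and the three example services are hypothetical and chosen only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


class ServiceSPE:
    """One 'SPE' preloaded with a single program and its own worker thread."""

    def __init__(self, program):
        self._program = program
        self._executor = ThreadPoolExecutor(max_workers=1)

    def call(self, *args):
        """Request the service; returns a future the caller can wait on."""
        return self._executor.submit(self._program, *args)

    def shutdown(self):
        self._executor.shutdown()


# The 'PPE' side: one preloaded service per SPE, called in arbitrary order.
services = {
    "sum": ServiceSPE(sum),
    "max": ServiceSPE(max),
    "sort": ServiceSPE(sorted),
}
requests = [("max", [3, 1, 2]), ("sort", [3, 1, 2]), ("sum", [3, 1, 2])]
futures = [services[name].call(argument) for name, argument in requests]
service_out = [future.result() for future in futures]
for service in services.values():
    service.shutdown()
```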

In the SPE-centric model the application code is split among all the SPEs (or some of them). Each SPE fetches its next work item from either main storage or its local memory. The PPE, on the other hand, acts as a resource manager for the SPEs.
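In the SPE-centric model the control flow is inverted: workers pull work instead of being told what to do. A minimal Python sketch, assuming a shared queue as the stand-in for main storage and the hypothetical names spe_main and run_spe_centric:

```python
import queue
import threading


def spe_main(work_queue, results):
    """SPE loop: keep fetching the next work item until none is left."""
    while True:
        try:
            index, task = work_queue.get_nowait()
        except queue.Empty:
            return
        results[index] = task()  # each index is written by exactly one worker


def run_spe_centric(tasks, num_spes=4):
    """PPE role: set up the shared work, start the SPEs, then just wait."""
    work_queue = queue.Queue()
    for item in enumerate(tasks):
        work_queue.put(item)
    results = [None] * len(tasks)
    spes = [
        threading.Thread(target=spe_main, args=(work_queue, results))
        for _ in range(num_spes)
    ]
    for spe in spes:
        spe.start()
    for spe in spes:
        spe.join()
    return results


spe_out = run_spe_centric([lambda n=n: n * n for n in range(6)])
```

Because each worker takes a new item as soon as it is free, uneven task durations balance out without any scheduling decision by the coordinating side.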

All this flexibility in using the different CELL cores makes it a platform suited to any kind of application. Programmers just have to devise the best way to organize the steps of their algorithms in order to exploit all the possibilities and the power of the CELL architecture.

Many papers already show how the CELL architecture boosts the performance of many kinds of applications, ranging from rendering to general purpose ones. A work from the University of Utah shows how well ray tracing performs on the CELL processor: it describes how to map the ray tracing algorithm efficiently to CELL, with the result that a single SPE attains the same performance as a fast x86 system (Benthin, Wald, Scherbaum, & Friedrich, 2007).

Another work shows how a parallelized form of the H.264 encoding algorithm (Park & Soonhoi, 2007) achieves optimal performance. Its authors also claim that SPE-specific optimization is needed to obtain a meaningful speed-up: in their application, better performance is achieved by using the Vector/SIMD instructions and by reducing data transfers between SPE and PPE.

Some effort was also made in porting a digital media indexing application (MARVEL) to the CELL processor. This kind of application requires image analysis for feature extraction; the overall performance of these algorithms was excellent on the CELL platform (Lurng-Kuo, Qiang, Apostol, Kenneth, Smith, & Varbanescu, 2007).

Again, all these examples show that the CELL architecture is suitable for improving the performance of a wide range of applications.

4.2 Shared Memory

The CELL processor can be programmed as a shared-memory multiprocessor in which SPE and PPE units interoperate in a cache-coherent shared-memory programming model. However, the PPE and the SPEs differ significantly in the way they access memory. The PPE accesses main storage with load and store instructions that move data between a private register file and main storage, whereas an SPE accesses main storage with direct memory access (DMA) commands that move data and instructions between main storage and a private local memory. This 3-level organization (register file, local store and main storage) explicitly parallelizes computation and the transfer of data and instructions. The main reason for this organization is that application performance is, in most cases, limited by memory latency rather than by peak compute capability or peak bandwidth. The DMA model allows each SPE to have many concurrent memory accesses in flight. Another benefit is that very few cycles are needed to set up a DMA transfer, compared with the long wait (in terms of cycles) that occurs when a load instruction misses in the caches of a conventional architecture.

A valid approach to memory access is to create a list of DMA transfers in the SPE’s local store, so that the SPE’s DMA controller can process this list asynchronously while the SPE operates on previously transferred data.

The on-chip communication network of the CELL has been the subject of accurate benchmarks and tests. The overall results of these experiments demonstrate that the CELL processor’s communication subsystem is well matched to the processor’s computational capacity. The communication network provides all the speed and bandwidth that applications need in order to exploit the processor’s computational power (Kistler, Perrone, & Petrini, 2006).

4.3 Simulator, Compiler and Profiler

One of the main problems when programming GPGPU kernels is code portability. Many differences between the architectures of different manufacturers prevent code from being freely used on any GPU (e.g. texture format, texture size, supported pixel formats …). The CELL architecture, on the other hand, provides the programmer with a single, complete environment.

A Full-System Simulator is offered as an alternative to conventional process and thread programming. Here the programmer has access to many features such as a thread scheduler, debugging tools, performance visualization, tracing and logging capabilities.

The PPE implements an extended version of the PowerPC instruction set. This extension consists of a Vector/SIMD Multimedia extension plus some changes to PowerPC instructions. The SPE instruction set is similar to the PPE’s but needs a different compiler. All these extensions are supported by C-language intrinsics, which stand in for assembly instructions as C-language commands. Most instructions process 128-bit operands, divided into four 32-bit words.
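The layout of such an operand can be made concrete with a small Python sketch. It does not use the actual CELL intrinsics; pack_vector and vector_add are hypothetical helpers that emulate one lane-wise SIMD addition on four single precision values packed into 16 bytes, with big-endian byte order chosen as an assumption to match PowerPC conventions.

```python
import struct

LANES = ">4f"  # four big-endian 32-bit floats = one 128-bit operand


def pack_vector(a, b, c, d):
    """Pack four 32-bit floats into one 16-byte (128-bit) operand."""
    return struct.pack(LANES, a, b, c, d)


def vector_add(lhs, rhs):
    """Lane-wise add of two 128-bit operands, as one SIMD instruction would."""
    sums = (x + y for x, y in zip(struct.unpack(LANES, lhs),
                                  struct.unpack(LANES, rhs)))
    return struct.pack(LANES, *sums)


v = vector_add(pack_vector(1.0, 2.0, 3.0, 4.0),
               pack_vector(10.0, 20.0, 30.0, 40.0))
lanes = struct.unpack(LANES, v)
```

On real hardware the four additions are carried out by a single instruction, whereas this emulation performs them one lane at a time.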

5 CONCLUSIONS

Both the GPGPU and CELL approaches are excellent solutions for HPC applications. Without doubt they will mark the state of the art for the next years. Many upcoming changes will be released, from CELL v2.0 to the next CUDA generation, especially designed for physics.

We hope to have shown that CELL has the best chance of becoming the standard for general purpose computing: its flexibility can provide high performance without too many constraints. An additional advantage of CELL is its reduced size, which makes it suitable for embedded devices.

Our interest in the topic is focused on creating a good knowledge base for CELL programming, and on using it to reduce computational costs in general purpose contexts such as Medical Image Elaboration or Virtual Reality. The current lack of knowledge and of implementations on CELL is today one of the biggest obstacles to its adoption, because it strengthens the idea of a solution that does not repay the investment and has too steep a learning curve.

ACKNOWLEDGEMENTS

Special thanks to Elisa Ghia for all the support given

to this work.

This work has been partially supported by PRIN

2006 - Ambient Intelligence: event analysis, sensor,

reconfiguration and multimodal interfaces.

REFERENCES

Akenine-Moller, T., & Haines, E. (2002). Real-Time Rendering. A. K. Peters.

Baker, Z. K., Gokhale, M. B., & Tripp, J. (2007). Matched

Filter Computation on FPGA, Cell and GPU. 15th

Annual IEEE Symposium on Field-Programmable

Custom Computing Machines., (pp. 207-218).

Barcelona Supercomputing Center. (2007). Matrix

Multiplication Example. Retrieved 11 25, 2007, from

Computer Sciences:

http://www.bsc.es/plantillaH.php?cat_id=420

Benthin, C., Wald, I., Scherbaum, M., & Friedrich, H.

(2007). Ray Tracing on the Cell Processor. IEEE

Symposium on Interactive Ray Tracing (pp. 15-23).

IEEE.

GPU-Tech. (2007). GPU-Tech GPU Computing. Retrieved 11 26, 2007, from GPU-Tech: http://www.gpucomputing.eu/index3.php?lang=en&page=_demo1.php&id=2

Gschwind, M. (2005, 08 17). The Cell project at IBM

Research.

IBM. (2007). Cell Broadband Engine - An Introduction.

Kistler, M., Perrone, M., & Petrini, F. (2006). Cell

Multiprocessor Communication Network: Built for

Speed. IEEE Micro , 26 (3), 10-23.

Luebke, D., & Harris, M. (2004). GPGPU: General Purpose Computation On Graphics Hardware. SIGGRAPH Course Notes.

Lurng-Kuo, L., Qiang, L., Apostol, N., Kenneth, R. A.,

Smith, J. R., & Varbanescu, L. A. (2007). Digital

Media Indexing on the Cell Processor. IEEE

International Conference on Multimedia and Expo.

IEEE International.

Minor, B. (2007, 09 05). Cell vs G80. Retrieved 11 23, 2007, from Game Tomorrow: http://gametomorrow.com/blog/index.php/2007/09/05/cell-vs-g80/

nVidia CUDA. (2007). GeForce 8800 & NVIDIA CUDA

A New Architecture for Computing on the GPU.

nVidia. (2006). CUDA Programming GUIDE v0.8.

nVidia.

nVidia. (2007, 10 18). nVidia Tesla Tech Specifications.

Retrieved 11 18, 2007, from nVidia Web site:

http://www.nvidia.com/object/tesla_tech_specs.html

nVidia Technical Brief. (2006). NVIDIA GeForce 8800 GPU Architecture Overview. nVidia Corporation.

Park, J., & Soonhoi, H. (2007). Performance Analysis of

Parallel Execution of H.264 Encoder on the Cell

Processor. IEEE/ACM/IFIP Workshop on Embedded

Systems for Real-Time Multimedia, 2007 (pp. 27-32).

ESTIMedia.

The Computer Language Co. Inc. (2004). Graphics Pipeline. Retrieved from Answers.com: http://www.answers.com/topic/graphics-pipeline?cat=technology

Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P.,

& Yelick, K. (2006). The Potential of the Cell.

Berkeley : Computational Research Division

Lawrence Berkeley National Laboratory.

Wright, R. J. (2004). OpenGL SuperBible. Addison

Wesley.
