3.4 Discover Parallelization Possibilities
at Compile Time
As shown in section 3.1, simple ILP can enhance single-processor performance only by a factor of 3-4, even if an order of magnitude more cores are available. This situation is experienced in modern many-core processors: only 3-5 cores (out of, say, 64) can be used simultaneously for executing a single thread. The reason is that those processors discover the ILP possibilities at execution time, so, due to the need for real-time operation, the size of the basic blocks cannot exceed about a dozen instructions. At compile time, however, we have (nearly) unlimited time and (nearly) unlimited hardware resources to discover those dependencies, so the conditions assumed by (Nicolau and Fisher, 1984) are (nearly) fulfilled.
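The compile-time discovery described above can be sketched as building a data-dependence graph over the instruction stream. The following is a minimal illustration, not the paper's actual toolchain; the three-address instruction encoding and all names are assumptions made for the example.

```python
# Illustrative sketch: discovering data dependencies at compile time.
# Instructions are modeled as (dest, sources) tuples; names are hypothetical.
def dependence_graph(instructions):
    """Return edges (i, j) meaning: instruction j must wait for instruction i."""
    edges = set()
    for j, (dest_j, srcs_j) in enumerate(instructions):
        for i in range(j):
            dest_i, srcs_i = instructions[i]
            if dest_i in srcs_j:       # true (read-after-write) dependence
                edges.add((i, j))
            elif dest_j in srcs_i:     # anti (write-after-read) dependence
                edges.add((i, j))
            elif dest_j == dest_i:     # output (write-after-write) dependence
                edges.add((i, j))
    return edges

prog = [("a", ("x", "y")),   # 0: a = x + y
        ("b", ("x", "z")),   # 1: b = x * z   (independent of 0)
        ("c", ("a", "b"))]   # 2: c = a - b   (needs both 0 and 1)
print(sorted(dependence_graph(prog)))  # → [(0, 2), (1, 2)]
```

Instructions 0 and 1 carry no edge between them, so a many-core processor informed of this graph could issue them in parallel.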
Since the possible data dependencies must be scrutinized, and the time to discover those dependencies grows factorially with the number of instructions considered, this task cannot be solved with good performance at execution time. In practice, the high-level programming units provide hints for selecting natural chunks for parallelization. Experience with profilers can also help a lot. At a lower level, the known methods of ILP can be used, both at source-code level and at machine-code level.
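The growth rate claimed above can be made concrete by counting candidate instruction orderings: the number of permutations of a block grows factorially with block size, while a simple pairwise dependence check grows only quadratically. The block sizes below are hypothetical, chosen only to illustrate the gap.

```python
# Why exhaustive reordering is infeasible at execution time:
# candidate orderings grow factorially, pairwise checks only quadratically.
import math

for n in (8, 12, 64):                     # hypothetical basic-block sizes
    orderings = math.factorial(n)         # possible instruction orders
    pair_checks = n * (n - 1) // 2        # pairwise dependence tests
    print(n, orderings, pair_checks)
```

Already at a dozen instructions the ordering count reaches the hundreds of millions, which is why the paper argues this search belongs in the (time-rich) compiler rather than in the processor's real-time path.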
In this way, the toolchain can generate object code that provides the possibility (and the meta-information) for a smart processor to parallelize the code to a high degree. The processor will "see" the instructions in the order in which they appear in memory (which is mostly the same as in the object file), so the instructions that can be executed independently (in parallel) should come first. The primary consideration should be to order the instructions so that they can be executed maximally independently. A secondary consideration can be to put mini-threads (actually: fragments, pieces of strongly sequential code) in consecutive locations, thus exploiting the pipeline structure of the cores. The expected maximum number of cores can be a parameter of the compilation process, in which case the optimization can be tailored to the actual configuration.
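The ordering strategy above, with the core count as a compile parameter, resembles classic greedy list scheduling. A minimal sketch, under the assumption of unit instruction latency (the function and parameter names are illustrative, not from the paper):

```python
# Illustrative greedy list scheduler: order instructions so that independent
# ones come first, issuing at most n_cores per step (the compile parameter).
def schedule(n_instr, edges, n_cores):
    """edges: set of (i, j) pairs, instruction j depends on instruction i.
    Returns a list of steps; each step holds up to n_cores instructions."""
    preds = {j: {i for (i, k) in edges if k == j} for j in range(n_instr)}
    done, steps = set(), []
    while len(done) < n_instr:
        # "ready" = instructions whose producers have all completed
        ready = [j for j in range(n_instr)
                 if j not in done and preds[j] <= done]
        step = ready[:n_cores]          # issue width limited by core count
        steps.append(step)
        done |= set(step)
    return steps

# 0 and 1 are independent; 2 needs both. With 2 cores: two steps.
print(schedule(3, {(0, 2), (1, 2)}, n_cores=2))  # → [[0, 1], [2]]
```

Concatenating the steps gives the memory order the text calls for: independently executable instructions appear first, and a different n_cores yields a layout optimized for that configuration.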
3.5 Smarter Communication Between
Compile Toolchain and Processor
A lot of parallelization information can thus be collected at compile time. However, only that part of the information collected by the compile tools which is contained in the ILP-optimized machine instructions can be made known to the processor: namely, the object code, which the processor reads from memory and executes instruction by instruction. The processor therefore needs to re-discover the possibilities for parallelization, with rather poor efficiency. If the full information were transferred in the form of meta-data, the many-core processors could do a much better job. The natural way to do so would be to extend the object code with this meta-data.
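One conceivable encoding for such an extension is an extra object-file section that serializes the dependence edges next to the instructions. The sketch below is purely hypothetical: the section layout, field widths, and names are assumptions for illustration, not part of any real object format.

```python
# Hypothetical object-code extension: a meta-data section packing the
# (producer, consumer) dependence pairs the compiler discovered.
import struct

def encode_dep_section(edges):
    """Serialize edges as a count header plus 16-bit index pairs."""
    blob = struct.pack("<I", len(edges))            # little-endian entry count
    for producer, consumer in sorted(edges):
        blob += struct.pack("<HH", producer, consumer)
    return blob

def decode_dep_section(blob):
    """What a multi-core-mode processor (or loader) would read back."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return {struct.unpack_from("<HH", blob, 4 + 4 * k) for k in range(count)}

edges = {(0, 2), (1, 2)}
assert decode_dep_section(encode_dep_section(edges)) == edges
```

The round trip shows the point of the section: the processor recovers the full dependence graph by reading a few bytes instead of re-analyzing the instruction stream.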
A compile-time switch in the toolchain could decide whether the traditional or this multi-core format of object code should be generated, and likewise the processor could have a mode deciding whether to operate in single-core or multi-core mode. In the object code seen by the computer, the instructions are ordered (as before) in the sequence in which they are expected to be executed. This order, however, can be largely independent of the order of appearance in the source code, since the compiler can rearrange the instructions in order to reach a high level of instruction-level parallelism. Note that this kind of smarter communication would be highly desirable in many other respects as well, for example for operating cache memories with enhanced performance.
3.6 Assigning Computing Resource
Dynamically to the Machine
Instructions
In the classic model, the only computing resource, the lone processor, is assigned statically to the process, and the assignment happens at the beginning of the computation. The individual instructions simply inherit the computing resource assigned to the process they belong to. Since only one computing unit exists, a default assignment does the job.
In the multi-core model, the individual cores are considered computing resources, much as multiple copies of the arithmetic units are present in some modern processors. The control unit fetches the instructions to execute one at a time, as usual in the vN model. However, in order to execute an instruction, one has to assign a computing resource to it, since there is no default assigned resource. This assignment of one of the available allocated computing resources to the instruction occurs dynamically, at the beginning of instruction execution. The allocation of the resources from the pool happens at the beginning of the process (although it might be dynamically modified during the flow of the process).
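The two-stage scheme described above (per-process allocation of a pool, then per-instruction assignment from it) can be illustrated with a toy simulation; the class and names below are invented for the example and do not describe real hardware.

```python
# Toy model of dynamic resource assignment: the pool is allocated to the
# process up front, and each instruction grabs a core when it starts.
class CorePool:
    def __init__(self, core_ids):
        self.free = list(core_ids)     # allocated at the start of the process

    def assign(self):
        """Dynamic per-instruction assignment, at instruction start."""
        if not self.free:
            raise RuntimeError("no free core: instruction must wait")
        return self.free.pop()

    def release(self, core):
        """Instruction retires: the core returns to the pool."""
        self.free.append(core)

pool = CorePool(["core0", "core1"])    # pool allocated to the process
c = pool.assign()                      # an instruction begins execution
print(c)                               # → core1
pool.release(c)                        # ...and retires
```

Note that, exactly as in the classic model, every executing instruction holds some computing resource; the difference is only that the resource is chosen dynamically and may be a different entity each time.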
In this way, every single instruction has a computing resource, exactly as in the classic model, although here the computing resources, unlike in the classical model, can be different entities. Also,
provided that the control hardware forces consider-
ing the possible constraints, there will be no differ-
ICSOFT-EA 2014 - 9th International Conference on Software Engineering and Applications