3.4 Discover Parallelization Possibilities
at Compile Time
As shown in section 3.1, simple ILP can enhance single-processor performance only by a factor of 3-4, even if an order of magnitude more cores are available. This situation is experienced in modern many-core processors: only 3-5 cores (out of, say, 64) can be used simultaneously for executing a single thread. The reason is that those processors discover the ILP possibilities at execution time, so, due to the need for real-time operation, the size of the basic blocks cannot exceed about a dozen instructions. At compile time, however, we have (nearly) unlimited time and (nearly) unlimited hardware resources to discover those dependencies, so the conditions assumed by (Nicolau and Fisher, 1984) are (nearly) fulfilled.
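The compile-time discovery described above can be sketched as building a data-dependence graph over the instruction stream. The following is a minimal illustration, not the paper's actual toolchain; the three-address instruction encoding and all names are assumptions made for the example.

```python
# Illustrative sketch: discovering data dependencies at compile time.
# Instructions are modeled as (dest, sources) tuples; names are hypothetical.
def dependence_graph(instructions):
    """Return edges (i, j) meaning: instruction j must wait for instruction i."""
    edges = set()
    for j, (dest_j, srcs_j) in enumerate(instructions):
        for i in range(j):
            dest_i, srcs_i = instructions[i]
            if dest_i in srcs_j:       # true (read-after-write) dependence
                edges.add((i, j))
            elif dest_j in srcs_i:     # anti (write-after-read) dependence
                edges.add((i, j))
            elif dest_j == dest_i:     # output (write-after-write) dependence
                edges.add((i, j))
    return edges

prog = [("a", ("x", "y")),   # 0: a = x + y
        ("b", ("x", "z")),   # 1: b = x * z   (independent of 0)
        ("c", ("a", "b"))]   # 2: c = a - b   (needs both 0 and 1)
print(sorted(dependence_graph(prog)))  # → [(0, 2), (1, 2)]
```

Instructions 0 and 1 carry no edge between them, so a many-core processor informed of this graph could issue them in parallel.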
Since the possible data dependencies must be scrutinized, and the time to discover those dependencies grows factorially with the number of instructions considered, this task cannot be solved with good performance at execution time. In practice, the high-level programming units provide hints for selecting natural chunks for parallelization. Experience with profilers can also help a lot. At a lower level, the known methods of ILP can be used, both at source-code level and at machine-code level.
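The growth rate claimed above can be made concrete by counting candidate instruction orderings: the number of permutations of a block grows factorially with block size, while a simple pairwise dependence check grows only quadratically. The block sizes below are hypothetical, chosen only to illustrate the gap.

```python
# Why exhaustive reordering is infeasible at execution time:
# candidate orderings grow factorially, pairwise checks only quadratically.
import math

for n in (8, 12, 64):                     # hypothetical basic-block sizes
    orderings = math.factorial(n)         # possible instruction orders
    pair_checks = n * (n - 1) // 2        # pairwise dependence tests
    print(n, orderings, pair_checks)
```

Already at a dozen instructions the ordering count reaches the hundreds of millions, which is why the paper argues this search belongs in the (time-rich) compiler rather than in the processor's real-time path.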
In this way, the toolchain can generate object code that provides the possibility (and the meta-information) for a smart processor to parallelize the code to a high degree. The processor will "see" the instructions in the order in which they appear in memory (which is mostly the same as in the object file), so the instructions that can be executed independently (in parallel) should come first. The primary consideration should be to order the instructions so that they can be executed maximally independently. A secondary consideration can be to put mini-threads (actually: fragments, pieces of strongly sequential code) in consecutive locations, thus exploiting the pipeline structure of the cores. The expected maximum number of cores can be a parameter of the compilation process, in which case the optimization can be tailored to the actual configuration.
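The ordering strategy above, with the core count as a compile parameter, resembles classic greedy list scheduling. A minimal sketch, under the assumption of unit instruction latency (the function and parameter names are illustrative, not from the paper):

```python
# Illustrative greedy list scheduler: order instructions so that independent
# ones come first, issuing at most n_cores per step (the compile parameter).
def schedule(n_instr, edges, n_cores):
    """edges: set of (i, j) pairs, instruction j depends on instruction i.
    Returns a list of steps; each step holds up to n_cores instructions."""
    preds = {j: {i for (i, k) in edges if k == j} for j in range(n_instr)}
    done, steps = set(), []
    while len(done) < n_instr:
        # "ready" = instructions whose producers have all completed
        ready = [j for j in range(n_instr)
                 if j not in done and preds[j] <= done]
        step = ready[:n_cores]          # issue width limited by core count
        steps.append(step)
        done |= set(step)
    return steps

# 0 and 1 are independent; 2 needs both. With 2 cores: two steps.
print(schedule(3, {(0, 2), (1, 2)}, n_cores=2))  # → [[0, 1], [2]]
```

Concatenating the steps gives the memory order the text calls for: independently executable instructions appear first, and a different n_cores yields a layout optimized for that configuration.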
3.5 Smarter Communication Between
Compile Toolchain and Processor
A lot of parallelization information can thus be collected at compile time. However, only that part of the information collected by the compile tools which is contained in the ILP-optimized machine instructions can be made known to the processor: namely, the object code, which the processor reads from memory and executes instruction by instruction. The processor therefore needs to re-discover the possibilities for parallelization, with rather poor efficiency. If the full information were transferred in the form of meta-data, the many-core processors could do a much better job. The natural way to do so would be to extend the object code with this meta-data.
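One conceivable encoding for such an extension is an extra object-file section that serializes the dependence edges next to the instructions. The sketch below is purely hypothetical: the section layout, field widths, and names are assumptions for illustration, not part of any real object format.

```python
# Hypothetical object-code extension: a meta-data section packing the
# (producer, consumer) dependence pairs the compiler discovered.
import struct

def encode_dep_section(edges):
    """Serialize edges as a count header plus 16-bit index pairs."""
    blob = struct.pack("<I", len(edges))            # little-endian entry count
    for producer, consumer in sorted(edges):
        blob += struct.pack("<HH", producer, consumer)
    return blob

def decode_dep_section(blob):
    """What a multi-core-mode processor (or loader) would read back."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return {struct.unpack_from("<HH", blob, 4 + 4 * k) for k in range(count)}

edges = {(0, 2), (1, 2)}
assert decode_dep_section(encode_dep_section(edges)) == edges
```

The round trip shows the point of the section: the processor recovers the full dependence graph by reading a few bytes instead of re-analyzing the instruction stream.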
A compile-time switch in the toolchain could decide whether the traditional or this multi-core format of object code should be generated, and likewise the processor could have a mode deciding whether to operate in single-core or multi-core mode. In the object code seen by the computer, the instructions are ordered (as before) in the sequence in which they are expected to be executed. This order, however, can be largely independent of the order of appearance in the source code, since the compiler can rearrange the instructions in order to reach a high level of instruction-level parallelism. Note that this kind of smarter communication would be highly desirable in many other respects as well, for example for operating cache memories with enhanced performance.
3.6 Assigning Computing Resource
Dynamically to the Machine
Instructions
In the classic model, the only computing resource, the lone processor, is assigned statically to the process, and the assignment happens at the beginning of the computation. The individual instructions simply inherit the computing resource assigned to the process they belong to. Since only one computing unit exists, a default assignment does the job.
In the multi-core model, the individual cores are considered computing resources, much as multiple copies of the arithmetic units are present in some modern processors. The control unit fetches the instructions to execute one at a time, as usual in the vN model. However, in order to execute an instruction, one has to assign a computing resource to it, since there is no default assigned resource. This assignment of one of the available allocated computing resources to the instruction occurs dynamically, at the beginning of instruction execution. The allocation of the resources from the pool happens at the beginning of the process (although it might be dynamically modified during the flow of the process).
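The two-stage scheme described above (per-process allocation of a pool, then per-instruction assignment from it) can be illustrated with a toy simulation; the class and names below are invented for the example and do not describe real hardware.

```python
# Toy model of dynamic resource assignment: the pool is allocated to the
# process up front, and each instruction grabs a core when it starts.
class CorePool:
    def __init__(self, core_ids):
        self.free = list(core_ids)     # allocated at the start of the process

    def assign(self):
        """Dynamic per-instruction assignment, at instruction start."""
        if not self.free:
            raise RuntimeError("no free core: instruction must wait")
        return self.free.pop()

    def release(self, core):
        """Instruction retires: the core returns to the pool."""
        self.free.append(core)

pool = CorePool(["core0", "core1"])    # pool allocated to the process
c = pool.assign()                      # an instruction begins execution
print(c)                               # → core1
pool.release(c)                        # ...and retires
```

Note that, exactly as in the classic model, every executing instruction holds some computing resource; the difference is only that the resource is chosen dynamically and may be a different entity each time.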
In this way, every single instruction has a computing resource, exactly as in the classic model, although here the computing resources, unlike in the classical model, can be different entities. Also,
provided that the control hardware forces consider-
ing the possible constraints, there will be no differ-
ICSOFT-EA 2014 - 9th International Conference on Software Engineering and Applications