3 NUMA-AWARENESS
To reduce the time for the solution of a forward problem, SHEMAT-Suite is parallelized using OpenMP, the de facto standard for parallel programming of shared-memory systems. In addition, there is an OpenMP parallelization of an MC method. Throughout this position paper, we do not focus on computing the forward model in parallel, but on the parallelization of the MC method. The current implementation involves a nested OpenMP parallelization, i.e., two levels of parallelism on top of each other. The outer parallelization consists of computing different realizations in parallel, whereas the inner level is concerned with parallelizing the forward problem. For instance, the inner level includes the parallel solution of a system of linear equations as well as the parallel assembly of the corresponding coefficient matrix.
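As a rough sketch of this two-level structure, consider the following Fortran fragment; the routine name forward_model, the thread counts, and the loop body are made up for illustration and are not taken from SHEMAT-Suite:

! Sketch of nested OpenMP parallelism: outer loop over realizations,
! inner parallelism inside the forward model (made-up names and sizes).
program nested_parallelism_sketch
  use omp_lib
  implicit none
  integer, parameter :: r = 5                 ! realizations (outer level)
  integer :: i

  call omp_set_nested(.true.)                 ! enable the inner level

  !$omp parallel do num_threads(3)
  do i = 1, r                                 ! outer level: independent realizations
     call forward_model(i)
  end do
  !$omp end parallel do

contains

  subroutine forward_model(i)                 ! stands in for the forward solver
    integer, intent(in) :: i
    integer :: j
    real :: s
    s = 0.0
    ! inner level: e.g. parallel matrix assembly or linear solve
    !$omp parallel do num_threads(2) reduction(+:s)
    do j = 1, 1000
       s = s + 1.0 / real(i + j)
    end do
    !$omp end parallel do
  end subroutine forward_model

end program nested_parallelism_sketch

Here the outer parallel loop distributes the independent realizations over a small team of threads, while each call to the forward model opens its own inner parallel region.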
In the MC method, the program computes the solution of the forward problem several times. Here, these realizations are independent of each other. However, the input and output need to be handled carefully. The realizations are computed in parallel by a team of OpenMP threads. To illustrate the strategy behind the OpenMP implementation, consider a five-dimensional array declared as, say,

    x(n_x, n_y, n_z, n_k, n_j)                                   (1)

in the serial code. Think of some physical quantity x discretized on a three-dimensional spatial grid of size n_x × n_y × n_z, with two additional dimensions representing n_k and n_j discrete values for two further characteristics. In the OpenMP implementation, all major arrays are extended by another dimension that represents the realizations. That is, in the OpenMP-parallelized code, the array (1) is transformed into the data structure

    x(r, n_x, n_y, n_z, n_k, n_j)                                (2)

whose first dimension allocates additional storage for the computation of r realizations. This parallelization strategy increases the storage requirement by a factor of r as compared to the serial software. However, it is shown in (Wolf, 2011) that this strategy is advantageous for bringing together the parallelization of an MC method for the solution of an inverse problem with automatic differentiation of the underlying forward model. The latter is important in the context of SHEMAT-Suite (Rath et al., 2006), but is not discussed further in this position paper.
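To make the declarations (1) and (2) concrete, a minimal Fortran sketch with made-up array sizes (not the actual SHEMAT-Suite declarations) could read:

! Sketch of the extended data structure; names and sizes are illustrative.
program array_extension_sketch
  implicit none
  integer, parameter :: nx = 16, ny = 16, nz = 8, nk = 3, nj = 2
  integer, parameter :: r = 5                  ! number of realizations
  real, allocatable :: x_serial(:,:,:,:,:)     ! layout (1): serial code
  real, allocatable :: x(:,:,:,:,:,:)          ! layout (2): one copy per realization

  allocate(x_serial(nx, ny, nz, nk, nj))
  allocate(x(r, nx, ny, nz, nk, nj))           ! storage grows by a factor of r
  ! realization i works exclusively on the slice x(i,:,:,:,:,:)

  deallocate(x_serial, x)
end program array_extension_sketch

Each realization owns its slice x(i,:,:,:,:,:), so the realizations write to disjoint parts of the extended array.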
Typically, the number of realizations, r, is much larger than the number of available OpenMP threads on the outer level, q. Therefore, each thread will work on more than one realization. Recall that the realizations represent independent tasks. The assignment of the threads to the tasks is carried out using dynamic scheduling to balance the computational load among the threads. That is, each thread starts to work on the next available task as soon as it finishes its current task.
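In OpenMP, this behavior corresponds to a schedule(dynamic) clause on the loop over the realizations. A minimal hedged sketch, in which the loop body is illustrative only:

! Dynamic assignment of realizations to outer-level threads (illustrative).
program dynamic_scheduling_sketch
  use omp_lib
  implicit none
  integer, parameter :: r = 5                 ! realizations
  integer, parameter :: q = 3                 ! outer-level threads
  integer :: i

  ! schedule(dynamic,1): an idle thread grabs the next waiting realization
  !$omp parallel do num_threads(q) schedule(dynamic,1)
  do i = 1, r
     print *, 'realization', i, 'computed by thread', omp_get_thread_num()
  end do
  !$omp end parallel do
end program dynamic_scheduling_sketch

With q = 3 threads and r = 5 realizations, this corresponds to the situation discussed next and shown in Figure 1.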
A simple example is illustrated in Figure 1, showing q = 3 OpenMP threads on the outer level which are handling r = 5 realizations. Initially, these three threads start to compute the first three realizations. The realizations may finish in any order. Suppose that thread T_2 finishes the computation of the realization R_2. It then immediately starts to work on the realization R_4, which is the first realization waiting for execution. Afterwards, the thread T_3 terminates R_3 and then executes realization R_5. In that example, the thread T_1 is still computing the realization R_1; but the remaining threads, T_2 and T_3, are dynamically assigned to the realizations that are waiting for execution.
Figure 1: Dynamic assignment of the realization R_i to the OpenMP threads T_i.
Though this scheduling strategy is simple and adequate, its memory access pattern is irregular. This makes it difficult to achieve high performance on today's shared-memory systems. From a conceptual point of view, there are different classes of shared-memory systems, schematically depicted in Figure 2. In the first class, processors access the shared memory via a common bus in a uniform way. This way, all processors spend the same time to access different parts of memory; see the left part of this figure.
A more realistic class is illustrated in the right part of this figure. Here, the concept of a shared memory is implemented by putting together multiple local memories via an interconnection network. Processors access their local memories faster than the memories of other processors. Thus, these shared-memory systems are referred to as non-uniform memory access (NUMA) systems.