odic boundary conditions for the simulation space).
A node, before sending or receiving a message, calcu-
lates the rank of the real source of the next incoming
and outgoing message in order to place it in or retrieve
it from the correct spot in the message buffer.
Because MPI_Ssend completes only after the matching
receive has been posted, we need to ensure that no
deadlocks occur (e.g. two nodes posting a receive call
for each other). To avoid that, we use an even-odd
pattern for the communication, where the nodes with an
even rank are depicted in blue and those with an odd
rank are depicted in orange (for a visualization see
Fig. 5). In the first step, each blue node sends its
message while the orange nodes receive; in the second
step, the roles swap and the orange nodes become
senders while the blue ones receive. Since we used 64
nodes, we did not need to consider an extra phase for
the case of two blue nodes meeting at the edges of the
row.
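
The two-step pattern can be illustrated with a small simulation that does not use MPI. This is a minimal sketch under assumptions: the function names are hypothetical, ranks within a row are taken to be consecutive, and the row is periodic. It only checks that every sender's partner is receiving in the same step; it does not model MPI message matching.

```python
def even_odd_phases(n_ranks, direction=1):
    """Return the two steps of a one-hop shift on a periodic row.

    Each step is a list of (sender, receiver) pairs. In step 0 the
    even ranks send, in step 1 the odd ranks send.
    """
    phases = []
    for sender_parity in (0, 1):
        pairs = [(rank, (rank + direction) % n_ranks)
                 for rank in range(n_ranks)
                 if rank % 2 == sender_parity]
        phases.append(pairs)
    return phases


def all_sends_matched(phases):
    """True if, in every step, no rank is both a sender and a receiver."""
    for pairs in phases:
        senders = {sender for sender, _ in pairs}
        receivers = {receiver for _, receiver in pairs}
        if senders & receivers:
            return False
    return True


matched_even_row = all_sends_matched(even_odd_phases(64))
matched_odd_row = all_sends_matched(even_odd_phases(7))
```

With an even row length every send meets a receiving partner in the same step. With an odd row length the first and last rank of the row are both even and neighbors across the periodic boundary, which is the case that would need an extra phase.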
Further, we fix the order of the communication:
first we daisy-chain all right messages k times, then
we daisy-chain all left messages k times. An
alternative would be to interleave the right and left
messages.
After all of the 2k messages have been exchanged,
the measurement ends and the message buffers are
checked for correctness.
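
The following sketch simulates the fixed right-then-left order and the final buffer check without MPI. It rests on assumptions that are not taken from the implementation: the message buffer is modelled as a dictionary keyed by the rank of the real source, each hop forwards the message received in the previous hop, and all ranks of a hop are processed at once instead of in even-odd steps. The names `real_source` and `daisy_chain_shift` are hypothetical.

```python
def real_source(rank, hop, direction, n_ranks):
    """Rank whose data reaches `rank` in the given hop of a daisy chain.

    direction=+1 means messages travel to the right, so the data
    originates `hop` positions to the left (periodic boundaries).
    """
    return (rank - direction * hop) % n_ranks


def daisy_chain_shift(n_ranks, k):
    """Simulate k right hops followed by k left hops on a periodic row."""
    # buffers[r] maps the real source rank to the payload held by rank r
    buffers = [{rank: f"data-{rank}"} for rank in range(n_ranks)]
    for direction in (+1, -1):  # all right messages first, then all left
        in_flight = [f"data-{rank}" for rank in range(n_ranks)]
        for hop in range(1, k + 1):
            received = [None] * n_ranks
            for sender in range(n_ranks):
                receiver = (sender + direction) % n_ranks
                received[receiver] = in_flight[sender]
            for rank in range(n_ranks):
                source = real_source(rank, hop, direction, n_ranks)
                buffers[rank][source] = received[rank]
            in_flight = received  # forward what was just received
    return buffers


buffers = daisy_chain_shift(n_ranks=8, k=3)
buffers_ok = all(payload == f"data-{source}"
                 for buffer in buffers
                 for source, payload in buffer.items())
```

After the 2k messages, each rank in this sketch holds its own data plus the data of the k neighbors on either side, and `buffers_ok` mirrors the correctness check at the end of the measurement.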
4.3 Implementation Measurements
We used 64 nodes and ran the algorithm 100 times
for each cut-off k ∈ [1, 10], which yields 6400 data
elements per k-value. All measured values presented in
this work are therefore averaged over 6336
measurements: the first measurement of each node (64
in total) is excluded because it is 100 times slower
than the rest due to the path routing process.
4.4 Hockney Parameters
As Lastovetsky et al. (2009) describe, the Hockney
parameters α and β are typically estimated by
statistically evaluating one of two series of
point-to-point communications:
• Roundtrips between two nodes with an empty
message to determine α and roundtrips between
two nodes for each relevant message load to de-
termine β with respect to m.
• A series of roundtrips with a growing message
load, so that a linear regression can be fitted over
the resulting curve.
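
As an illustration of the second option, the sketch below fits the Hockney relation latency = α + β · m by ordinary least squares. The function name is hypothetical and the input values are synthetic and noise-free, chosen only to show the mechanics; they are not measurements from this work.

```python
def fit_hockney(loads, latencies):
    """Least-squares fit of latency = alpha + beta * load."""
    n = len(loads)
    mean_load = sum(loads) / n
    mean_latency = sum(latencies) / n
    sxx = sum((m - mean_load) ** 2 for m in loads)
    sxy = sum((m - mean_load) * (t - mean_latency)
              for m, t in zip(loads, latencies))
    beta = sxy / sxx
    alpha = mean_latency - beta * mean_load
    return alpha, beta


# Synthetic example values (assumed, not measured)
true_alpha_ns = 2000.0
true_beta_ns_per_byte = 0.1
loads = [50_000, 100_000, 200_000, 400_000]
latencies = [true_alpha_ns + true_beta_ns_per_byte * m for m in loads]

alpha, beta = fit_hockney(loads, latencies)
```

The fit is only meaningful in the range where latency grows linearly with the message load.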
Some of our messages are too small to show a linear
relation (messages < 50 kB), so we used the first
option for our measurements. To make sure that the
latencies do not differ between node pairs due to the
varying physical distance of nodes, we measured the
latency between nodes from different racks. After
confirming sufficient location invariance (physical
distance has no measurable influence on the latency),
we gathered data over several days, with a morning and
an afternoon measurement on a set of four random nodes
each time (to account for different network load
patterns of other users that might influence our model
negatively).

Table 1: Message latency in relation to message load with
derived Hockney parameter β.
load [byte]   latency [ns]   σ [ns]   β [ns/byte]
          0          2,122      176             –
         10          2,234      183        11.246
        100          2,686      302         5.645
      1,000          2,881      305         0.760
     10,000          4,808      307         0.269
    100,000         15,055      749         0.129

We used a one-to-all ping pong from the node with
MPI rank zero to the three other nodes of the set. For
each pair we returned for every value m₁ the average
over 10,000 roundtrips. Overall, we performed ten
measurements, getting thirty sets of data per m₁-value
per measurement, so 300 data elements for each
message load value m₁.
We determined the Hockney parameters by using
the average over all 300 data elements per m₁-value.
Table 1 shows the average latency by load m₁ and
the derived β (standard deviation σ over 300 data
elements); the latency value for zero bytes is used for α.
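
The β column of Table 1 can be approximately reproduced from the tabulated latencies. The sketch below assumes that β is obtained as (latency − α) / load, which is the usual reading of the Hockney model but is not stated explicitly in the text; the function name is hypothetical. Because the table lists rounded latencies, the results agree with the β column only up to rounding.

```python
ALPHA_NS = 2122  # latency of the zero-byte message, used as alpha (Table 1)

# (load in bytes, average latency in ns), copied from Table 1
TABLE_1 = [
    (10, 2234),
    (100, 2686),
    (1_000, 2881),
    (10_000, 4808),
    (100_000, 15055),
]


def hockney_beta(load_bytes, latency_ns, alpha_ns=ALPHA_NS):
    """Per-byte cost under the assumption latency = alpha + beta * load."""
    return (latency_ns - alpha_ns) / load_bytes


betas = {load: hockney_beta(load, latency) for load, latency in TABLE_1}
for load, beta in betas.items():
    print(f"{load:>7} byte: beta = {beta:.3f} ns/byte")
```

The derived β shrinks as the load grows, which matches the observation that the small messages do not yet follow a linear relation.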
5 SHIFT MODEL ANALYSIS
We explained in Section 4.2 that we implemented
only one dimension/phase of the Shift, so we use
Equation 1 multiplied by two to predict the timing be-
havior for the Shift in the row dimension.
Table 2 shows one representative example of
the measurements and predictions for the one-
dimensional Shift algorithm with the respective stan-
dard deviation σ for the measurements; tables for the
other message loads can be found in the Appendix.
The data in Tables 2 to 6 shows that the model
works very well for message loads m₁ of 10 to
1,000 bytes but becomes less accurate for larger m₁.
However, our predictions are within the stan-
dard deviation of the measurements for all message
loads (10 to 100,000 bytes), hence our model can be
considered correct.
Analytical Model of Communication Algorithm for Simulations with Range-Limited Interactions