Exploring Instance Heterogeneity in Public Cloud Providers for HPC Applications
Eduardo Roloff¹, Matthias Diener², Luciano P. Gaspary¹ and Philippe O. A. Navaux¹
¹ Informatics Institute, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
² University of Illinois at Urbana-Champaign, U.S.A.
Keywords: Cloud Computing, Cost Efficiency, Heterogeneity, Microsoft, Amazon, Google.
Abstract: Public cloud providers offer a wide range of instance types with different speeds, configurations, and prices, which allows users to choose the most appropriate configuration for their applications. When executing parallel applications that require multiple instances, such as large scientific applications, most users pick an instance type that fits their overall needs best and then create a cluster of interconnected instances of the same type. However, the tasks of a parallel application often have different demands in terms of performance and memory usage. This difference in demands can be exploited by selecting multiple instance types that are adapted to the demands of the application. This way, the combination of public cloud heterogeneity and application heterogeneity can be exploited to reduce the execution cost without significant performance loss. In this paper, we evaluate three major public cloud providers, Microsoft, Amazon, and Google, comparing their suitability for heterogeneous execution. Results show that Azure is the most suitable of the three providers, with cost efficiency gains of up to 50% compared to homogeneous execution, while maintaining the same performance.
1 INTRODUCTION
Executing large parallel applications (such as those
from the High Performance Computing (HPC) do-
main) in the cloud has become an important aspect of
cloud computing in recent years (Foster et al., 2008;
Buyya et al., 2009; Armbrust et al., 2009). By running
these applications in the cloud, users can benefit from
lower up-front costs, higher flexibility, and speedier
hardware upgrades compared to traditional clusters.
However, raw performance as well as cost efficiency
for long-term usage can be a disadvantage (Roloff
et al., 2012). In recent years, several approaches were
introduced to use the cloud efficiently when execut-
ing HPC applications (Abdennadher and Belgacem,
2015; d. R. Righi et al., 2016; Zhang et al., 2015).
Many public cloud providers offer different types
of instances to users, which contain different CPU
types and memory sizes, in contrast to traditional
clusters, which in most cases consist of homogeneous
machines. Furthermore, many HPC applications con-
sist of tasks that have different computational de-
mands, due to load imbalance or uneven work distri-
bution. By matching the demands of the application
to the characteristics of the instance types offered by
a cloud provider, creating a cluster of heterogeneous
instances, and mapping each task to its most appropri-
ate instance, we can create a more cost-efficient exe-
cution of the application in the cloud (Roloff et al.,
2018).
In this paper, we use a parallel application,
ImbBench, to analyze three different cloud providers
(Microsoft Azure, Amazon AWS, and Google Cloud)
in terms of their suitability for such a heterogeneous
execution of large parallel applications. ImbBench
is an MPI-based application that can create differ-
ent imbalance patterns in CPU and memory usage
that mimics the behavior of real-world HPC ap-
plications (Roloff et al., 2018). We evaluate the
cost efficiency as well as performance when run-
ning ImbBench in homogeneous and heterogeneous
instances with different imbalance patterns and com-
putational demands.
Our results show that the suitability for heteroge-
neous execution is very different between the three
providers. Azure benefited the most from hetero-
geneous execution, improving cost efficiency by up
to 50% for the memory-bound 8Level pattern, while
maintaining the same performance. On average,
Azure obtains gains of 30% in cost-efficiency by us-
ing heterogeneous configurations. AWS benefits less
from heterogeneous execution, as in our tests the
cheapest instance was also the fastest. Google Cloud
only offers instances that differ in memory size, but
not in CPU or memory speed, and therefore does not
show potential for heterogeneous execution currently.
These results show that heterogeneous execution can
be beneficial in situations where the cloud provider
offers a sufficient variety of instance types. It also
motivates the future analysis of other aspects, such as
I/O and network performance.
2 BACKGROUND: CLOUD
INSTANCE TYPES
There are many providers in the public cloud, with a large set of instance types offered to users. The goal of this paper is to evaluate whether a cloud user can benefit from this level of heterogeneity to create adequate execution environments for their applications.
In this work, we evaluate the suitability for heterogeneity of three major IaaS providers: Amazon, Microsoft, and Google. These three providers represent the state of the art among public cloud providers. Each provider has its own classification for the available services and instance configurations.
Amazon divides its instances into five groups: General purpose, Compute Optimized, Memory Optimized, Accelerated Computing, and Storage Optimized. General purpose instances include machines that are balanced in terms of memory and CPU; this group also includes machines with ARM processors. Compute, Memory, and Storage Optimized instances each focus on one of these three components. The Accelerated Computing group contains the instances equipped with GPUs and FPGAs. In terms of the number of CPUs, Amazon ranges from instances with a shared core up to instances with 128 CPUs. The available memory ranges from 500 MB up to almost 4 TB.
Microsoft classifies instances into six groups: General purpose, Compute optimized, Memory optimized, Storage optimized, GPU, and High performance compute. The General purpose group aims to provide a balanced memory-to-CPU ratio and was designed for small workloads. Similar to Amazon, the Compute, Memory, and Storage optimized instances each focus on one of the three instance components. The GPU group provides instances with single or multiple NVIDIA GPUs. The High performance compute group provides instances with substantial CPU power and several memory configurations; a low-latency InfiniBand network is available as well. Microsoft's instances range from 1 to 72 CPUs, with memory from 1 GB up to nearly 4 TB.
Google has five groups of pre-configured instances: Standard, High-memory, High-CPU, Shared-core, and Memory-optimized. The Standard group has a fixed ratio of 3.75 GB of memory per CPU and targets balanced workloads. The High-memory and High-CPU instances have, respectively, 6.50 GB and 0.90 GB of memory per CPU. The Shared-core group provides instances that use less than a single core, sharing the same core with other instances. The Memory-optimized group provides instances with large amounts of memory, suitable for workloads that perform in-memory computation.
Google can attach GPUs to the available instances, with some exceptions; however, it does not provide a group of pre-configured GPU instances. The pre-defined instances range from 0.2 up to 96 CPUs and from 600 MB up to 4 TB of memory. Moreover, Google has the unique capability to customize instances according to the user's needs. In this case, the user can configure instances from 1 up to 96 cores, with memory from 1 GB up to 1.4 TB. This feature is interesting because the user can configure instances with the amount of memory that the application needs.
Prices vary significantly as well. The price of the same instance type varies within a single provider, depending on the datacenter location where the instance is allocated. For example, a c5.2xlarge Amazon instance costs US$ 0.34 per hour in the Oregon datacenter and US$ 0.524 per hour in the Sao Paulo datacenter, a difference of 54%. Because of this variation within the same provider, we do not present the overall instance price range of the three providers.
Table 1 shows the instance prices as well as the performance results of the High-Performance Linpack (HPL) (Dongarra et al., 2003) benchmark, which is widely used to measure computing performance. Table 2 shows the memory performance measured with the STREAM (McCalpin et al., 1995) benchmark. Due to this heterogeneity in the public cloud, we explore the variety of instances in order to help the user reduce cost without losing performance.
3 RESULTS
This section presents the results of our evaluation of homogeneous and heterogeneous cloud clusters with ImbBench.
Table 1: Characteristics of the cloud instances. Linpack results are given in GFlop/s.
Instance type Price/hour Memory (GB) Linpack
Azure
A8 US$ 0.975 56 149.81
A8v2 US$ 0.40 16 74.25
D4 US$ 0.559 28 261.53
D13 US$ 0.741 56 232.91
E8 US$ 0.593 64 127.62
F8 US$ 0.498 16 218.50
G3 US$ 2.44 112 291.76
H8 US$ 0.971 56 332.76
H8m US$ 1.301 112 326.25
L8 US$ 0.688 64 259.52
Google
n1-standard-8 US$ 0.38 30 136.96
n1-highmem-8 US$ 0.4736 52 137.41
n1-highcpu-8 US$ 0.2836 7.2 135.03
Amazon
c4.2xlarge US$ 0.398 15 182.31
c5.2xlarge US$ 0.34 16 194.25
m4.2xlarge US$ 0.40 32 154.16
m5.2xlarge US$ 0.384 32 159.52
r4.2xlarge US$ 0.532 61 154.17
r5.2xlarge US$ 0.504 64 161.80
t2.2xlarge US$ 0.3712 32 298.18
t3.2xlarge US$ 0.3328 32 159.50
Table 2: Results of the STREAM benchmark for the Cloud
Instances (in GB/sec).
Instance type Copy Scale Add Triad
Azure
A8 11.85 13.11 13.79 13.62
A8v2 5.86 6.39 6.68 6.69
D4 11.03 11.00 12.42 12.26
D13 10.09 9.85 11.23 11.15
E8 8.99 9.46 10.07 10.15
F8 10.50 10.49 12.00 11.70
G3 11.53 11.85 13.39 13.16
H8 13.06 12.78 14.70 14.39
H8m 13.08 12.88 14.71 14.48
L8 9.96 10.10 11.33 11.20
Google
n1-standard-8 12.18 9.37 9.98 9.88
n1-highmem-8 10.01 9.76 10.57 10.50
n1-highcpu-8 12.18 9.49 10.25 10.20
Amazon
c4.2xlarge 19.72 11.98 13.39 13.22
c5.2xlarge 10.56 12.50 13.25 13.26
m4.2xlarge 16.75 9.03 9.88 9.85
m5.2xlarge 10.25 11.99 12.71 12.68
r4.2xlarge 18.40 9.82 10.79 10.77
r5.2xlarge 10.45 12.12 13.10 13.05
t2.2xlarge 18.94 11.06 12.29 12.35
t3.2xlarge 9.95 10.83 11.59 11.51
3.1 Methodology
3.1.1 Hardware and Software Configuration
For all experiments, we configure a cluster of 32 cores, consisting of four instances with eight cores each. All instances run Ubuntu 18.04 LTS. To provide a fair comparison between the providers and their instances, we chose datacenter locations as close as possible to each other: the "Oregon" region for Amazon, the "West US" region for Microsoft, and the "Oregon" region for Google.
For each experiment, we run 30 executions and show average values as well as the standard deviation. For each instance type presented in Section 2, we run a homogeneous cluster, but discard the types that did not perform well or that cost too much. The following instance types were used in this section:
Azure: H8, F8, A8v2
AWS: c5.2xlarge, t2.2xlarge
Google: n1-highcpu-8, n1-highmem-8, n1-standard-8
3.1.2 ImbBench
We used ImbBench as the workload to test the cloud configurations. ImbBench is a proto-benchmark that implements several imbalance patterns for MPI processes (Roloff et al., 2017). In this paper, we use the following four patterns: 8Level, Amdahl, Linear, and Balanced. In the 8Level pattern, there are 8 distinct load levels for the MPI ranks, with an equal number of ranks at each level. In the Amdahl pattern, the first rank has more work than all other ranks. In the Linear pattern, the load increases linearly across the ranks. The Balanced pattern has the same load on all ranks; it is used to verify the behavior of a homogeneous application.
ImbBench can perform CPU performance tests, in which the workload is a random number generator that was profiled with the perf tool (de Melo, 2010) as almost 100% CPU-bound computation. Memory performance can be measured as well; in this case, the workload is a BST manipulation, which the perf profiling characterizes as 50% memory-bound.
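To make the four patterns concrete, the sketch below illustrates how a per-rank load could be assigned; it is a minimal illustration and not the actual ImbBench source. The kernel busy_work() is a hypothetical stand-in for the random-number-generator (CPU) and BST-manipulation (memory) workloads, and the base load constant is arbitrary.

/* Sketch of per-rank load assignment for the four patterns (illustrative only). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the CPU (random numbers) or memory (BST) kernel. */
static void busy_work(long units) {
    volatile double x = 0.0;
    for (long i = 0; i < units; i++)
        x += (double)(i % 97) * 0.5;
}

/* Number of workload units for a given rank, pattern, and base load. */
static long load_for_rank(int rank, const char *pattern, long base) {
    if (strcmp(pattern, "Balanced") == 0)
        return base;                        /* identical load on every rank      */
    if (strcmp(pattern, "Amdahl") == 0)
        return rank == 0 ? 4 * base : base; /* rank 0 carries the extra work     */
    if (strcmp(pattern, "Linear") == 0)
        return base * (rank + 1);           /* load grows linearly with the rank */
    return base * (1 + rank % 8);           /* 8Level: eight levels repeating every
                                               8 ranks, so ranks 7, 15, 23, and 31
                                               are the heaviest (as in Figure 1)  */
}

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    busy_work(load_for_rank(rank, "8Level", 10000000L));
    MPI_Barrier(MPI_COMM_WORLD);            /* the slowest rank sets the total time */
    if (rank == 0)
        printf("total time: %.2f s\n", MPI_Wtime() - start);

    MPI_Finalize();
    return 0;
}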
3.1.3 Cost Efficiency Metric
To measure the cost of a configuration, we use the Cost Efficiency methodology (Roloff et al., 2012; Roloff et al., 2017). In short, this methodology helps the user determine which
provider and configuration offers the highest performance for the price paid. We define cost efficiency as the number of executions of a given application that can be made in an hour, divided by the price per hour of the cloud. Equation 1 formalizes this definition, where execution time [hours] represents the execution time of the application in hours, and price per hour represents the price per hour of the cloud where the application was executed.
\[
\mathit{CostEfficiency} = \frac{\;1\ \text{hour}\ /\ \text{execution time [hours]}\;}{\text{price per hour}} \tag{1}
\]
The resulting values are only meaningful for comparing environments for a given application, where higher values indicate higher efficiency. The value itself has no further meaning.
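As a concrete illustration of Equation 1 (a sketch, not part of the original tooling), the helper below computes the metric for a cluster, assuming that the cluster price per hour is the sum of its instances' hourly prices; the example numbers are taken from Tables 1 and 3.

/* Cost efficiency = (executions per hour) / (cluster price per hour). */
#include <stdio.h>

static double cost_efficiency(double exec_time_seconds, double price_per_hour) {
    double executions_per_hour = 3600.0 / exec_time_seconds;
    return executions_per_hour / price_per_hour;
}

int main(void) {
    /* Homogeneous Azure H8 cluster: four instances at US$ 0.971/hour (Table 1)
       and an 8Level CPU execution time of 5.27 s (Table 3). */
    double price = 4 * 0.971;   /* US$ 3.884 per hour for the whole cluster */
    printf("cost efficiency: %.0f\n", cost_efficiency(5.27, price));
    return 0;
}

This prints approximately 176, which matches the homogeneous H8 values reported in Table 3 up to rounding of the measured execution time.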
3.2 Microsoft Azure
Microsoft Azure has the largest number of instances with 8 cores among the three providers in this study. We selected 10 instance types for our evaluation, covering the whole spectrum of available configurations. In terms of processing power, we observed HPL results ranging from 74 GigaFlops up to 332 GigaFlops, as shown in Table 1. In terms of memory performance, the results show less variation than the processing power, but it is still possible to observe different performance groups, as seen in Table 2. These results lead us to state that Microsoft Azure presents an interesting heterogeneity among its available instances. We used the homogeneous cluster that presented the best performance, the H8 cluster, as the baseline for all tests, and created heterogeneous clusters to improve the cost efficiency of the execution.
3.2.1 CPU
To evaluate the CPU performance of Microsoft Azure, we performed tests with different instance types. In order to identify whether a combination of different instance types could be used to reduce the execution cost, we built several combinations of instances.
Figure 1: Results for the 8Level pattern and CPU performance for the homogeneous H8 cluster.
Figure 2: Results for the 8Level pattern and CPU performance for the Azure heterogeneous cluster.
Figure 3: Results for the Amdahl pattern and CPU performance for the homogeneous H8 cluster.
Figure 4: Results for the Amdahl pattern and CPU performance for the Azure heterogeneous cluster.
We show and discuss three different patterns here: 8Level, Amdahl, and Linear.
The 8Level pattern simulates an application whose processes present eight different load levels. It is quite common for applications to present different levels of imbalance, and this pattern of ImbBench simulates this kind of application. Figure 1 shows the execution of the 8Level pattern using the homogeneous H8 cluster. We can see the load distribution levels in the eight groups. MPI ranks 7, 15, 23, and 31 took the most time to execute, around 5 seconds. This means that all the other processes have a certain amount of idle time during the application execution.
In order to reduce the execution cost without losses in execution time, we performed experiments with combinations of different instance types. Figure 2 shows the combination that presented the best balance between price and performance. This configuration is a combination of one H8 instance, two F8 instances, and one A8v2 instance. The H8 instance was used to execute the processes with the highest load, the F8 instances were used to execute the processes with medium load, and the A8v2 instance the processes with the lowest load. As seen in the figure, we were able to maintain the same total execution time. However, due to the combination with cheaper instances, the cost of the execution was reduced.
In terms of cost efficiency, the homogeneous cluster has a factor of 175 and the heterogeneous cluster a factor of 288. This means that the heterogeneous configuration is more efficient. In terms of cost per hour, the heterogeneous configuration presents a reduction of US$ 1.51, or 40%, without performance loss.
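For reference, this reduction can be reproduced from the hourly prices in Table 1; small differences from the quoted US$ 1.51 and 40% are due to rounding:

\[
\begin{aligned}
\text{homogeneous } (4 \times \text{H8})\text{:} \quad & 4 \times 0.971 = \text{US\$ } 3.884 \text{ per hour}\\
\text{heterogeneous } (\text{H8} + 2 \times \text{F8} + \text{A8v2})\text{:} \quad & 0.971 + 2 \times 0.498 + 0.40 = \text{US\$ } 2.367 \text{ per hour}\\
\text{reduction:} \quad & 3.884 - 2.367 = \text{US\$ } 1.52 \text{ per hour} \approx 39\%
\end{aligned}
\]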
The Amdahl pattern represents an application that has one process that performs much more computation than all the others. This situation normally occurs in applications that centralize the workflow in a master process, with the other processes sending data to and receiving data from it. In Figure 3 we can observe the Amdahl execution time on a homogeneous cluster built of H8 Azure instances.
Figure 5: Results for the Linear pattern and CPU performance for the homogeneous H8 cluster.
Figure 6: Results for the Linear pattern and CPU performance for the Azure heterogeneous cluster.
We can observe that process 0 presents an execution time of around 5 seconds while the other 31 processes execute in less than 2 seconds. This means that there is a large amount of idle time during the application execution. Moreover, in a cloud computing environment the user is charged for this idle time as well.
Building a cluster with different instance types can help the user reduce the execution cost without significant losses in execution time. Figure 4 shows the execution of the same application using a heterogeneous cluster, configured with one H8 instance and three A8v2 instances. As we can note in the figure, process 0 is still on the critical path of the application, taking around 5 seconds. Processes 8 to 31 took around 4 seconds, because they were executed on the instances with lower performance. However, the total execution time remains the same as on the homogeneous cluster.
In terms of cost efficiency, the homogeneous cluster for the Amdahl pattern results in a factor of 175, while the heterogeneous cluster reaches 315. This means, again, that the heterogeneous configuration is more efficient. In terms of cost per hour, the heterogeneous configuration presents a reduction of US$ 1.71, or 45%, without performance loss.
The Linear pattern presents a situation where each process of the application executes a different load. Process 0 has the lowest load and process n, in our case process 31, has the highest load of the application. This pattern was designed to explore the scalability of a given configuration. Figure 5 shows the results of the Linear pattern on the H8 cluster. We can note that the load increases as the rank number increases. The process with the highest rank, in our case 31, executes the largest load, meaning that it is the one determining the total execution time. All the other processes execute in less time, so we again have a situation with idle time.
In order to reduce the overall idle time, we propose executing the application using a combination of different instances, taking advantage of instances with lower performance and price. Figure 6 shows the execution of the application in a heterogeneous cluster. In this case, we configured a cluster with one A8v2 instance, two F8 instances, and one H8 instance.
Figure 7: Results for the Linear pattern and Memory performance for the homogeneous H8 cluster.
Figure 8: Results for the Linear pattern and Memory performance for the Azure heterogeneous cluster.
The execution time remains the same as on the homogeneous cluster. However, the idle time was reduced by using the cheaper instances.
The cost efficiency repeats the results of the 8Level pattern: 175 for the homogeneous cluster and 288 for the heterogeneous cluster. The cost per hour reduction was in line with the 8Level pattern as well, with a reduction of US$ 1.51, or 40%.
3.2.2 Memory
In terms of memory, there is less heterogeneity among the Microsoft Azure instances. Only the A8v2 instance presented a result that is far from the other instances. The remaining instances are close, but we can still divide them into two groups: the instances that achieved 13-14 GB/s and the instances that achieved 10-12 GB/s. The first group is composed of the A8, G3, H8, and H8m instances; the second contains the D4, D13, E8, F8, and L8 instances. These groups have close results but, after some experimentation, we found some heterogeneity that could be exploited.
Figure 7 shows the results of the ImbBench Linear pattern for testing the memory in a homogeneous H8 cluster. As observed, the execution time grows linearly with the rank number. As in the CPU tests, the process with the highest rank, in our experiments process 31, is the critical one for the execution time.
After some experiments with different instance combinations, we found a configuration that does not increase the execution time while reducing the cost. Figure 8 shows the results of a cluster composed of two A8v2 instances, one F8 instance, and one H8 instance. As shown, the total execution time was not affected, because the processes with higher load (24-31) were still executed on the most powerful instance, while the processes with less load (0-23) were executed on less powerful instances. This configuration presented the best cost efficiency without increasing the execution time.
The cost efficiency for the memory Linear pattern
was 73 for the homogeneous cluster and for the het-
erogeneous cluster it was 126. The cost per hour was reduced as well, by US$ 1.61, or 42%.
3.3 Google Cloud Platform
The Google instances presented similar results in terms of processing power, as observed in Table 1, where the three instances achieved around 135 GigaFlops in the HPL benchmark. In terms of memory performance, the instances presented almost the same results, with low variation. For a practical comparison, we used the Triad test, which represents a mix of all the STREAM tests. The n1-standard-8, n1-highmem-8, and n1-highcpu-8 memory results are shown in Table 2.
We executed the whole ImbBench suite on the three Google instance types. We observed that the instances presented about the same performance for both CPU and memory.
3.3.1 CPU
In terms of processing power measured with ImbBench, the Google instances presented a homogeneous behavior, which was expected given the HPL results. We present the ImbBench results only for the Balanced and Linear patterns, because these two patterns represent two situations that help to identify the homogeneity of the instances.
First, the Balanced pattern fully loads all the available cores of the instance for the entire execution time. Figure 9 presents the Balanced pattern results. We observe low variation, with all 32 processes executing their loads in around 10 seconds on the three Google instance types. We also do not observe variation between the four instances of the cluster. The standard deviation was also low, meaning that the execution times were consistent across the 30 executions of our experiments.
Second, the Linear pattern presents a variable load within each instance's processes and between instances as well. Each of the 32 processes executed a different load. Figure 10 presents the Linear pattern, and the homogeneity can be observed here as well: the behavior is homogeneous across all 32 load levels of ImbBench. This means, as in the Balanced pattern, that the three Google instance types have a homogeneous behavior. As in the Balanced pattern, the standard deviation was low, showing the consistency of the execution times.
3.3.2 Memory
In terms of the memory performance measured with ImbBench, the Google instances presented a mostly homogeneous behavior. This was expected given the STREAM results, where the three instances presented similar results. As in the CPU evaluation, we only present the ImbBench results for the Balanced and Linear patterns.
In the Balanced pattern, each process executes a full load of memory operations during the application execution. Figure 11 shows the Balanced pattern results for the memory evaluation. As we can observe, the behavior is slightly different from the CPU results: the performance differs across the four instances of the cluster. In the first group of processes, 0 to 7, the standard instance performed worse than the highmem and highcpu instances. The second group of processes presented slightly better results for the standard instance, but still worse than the other ones, and the highcpu instance performed worse as well. The third group, 16 to 23, presented similar results for all three instance types. Finally, in the fourth group, 24 to 31, the standard machine presented better performance than the highmem and highcpu instances. Nevertheless, the standard deviations were low, meaning that the executions presented a consistent behavior. These differences could be explained by the concurrency of the instances over the same hardware, which means that the instances are not 100% isolated.
The Linear pattern results are shown in Figure 12. Different from the Balanced pattern, we observed a homogeneous behavior in the first three groups of processes, from process 0 up to 23. These results mean that when the memory pressure is lower, all three instances respond with good performance. However, in the fourth group of processes, 24 to 31, there were different results among the three instances: the highmem and highcpu instances performed worse than the standard instance type. This situation could also be explained by the concurrency level in the underlying hardware. However, this is only a conjecture based on our experience with virtualization, because we do not have access to this level of information from the provider.
A discussion of cost efficiency is not necessary for the Google instances. The performance of n1-standard-8, n1-highmem-8, and n1-highcpu-8 is practically the same, both in terms of memory and processing power. The price difference among them reflects the amount of resources, where the machines with fewer resources are cheaper than the ones with more resources. The major difference is in terms of memory amount, and not memory performance.
Figure 9: Results for the Balanced pattern and CPU performance for the homogeneous Google instances cluster.
Figure 10: Results for the Linear pattern and CPU performance for the homogeneous Google instances cluster.
The ImbBench memory test measures memory performance and does not use the total size of the memory.
3.4 Amazon Web Services
We selected eight different instance types from the Amazon Web Services provider. The AWS instances do not present a high level of heterogeneity in terms of processing power. In Table 1 we can observe that only the t2.2xlarge instance presented an HPL result that was far from the other instances; the other seven instances presented similar performance results. In terms of memory, all the instances presented similar results, with the exception of the m4.2xlarge instance, which performed worse than the others.
3.4.1 CPU
For the CPU evaluation, we performed several tests with the AWS instances. However, only the c5.2xlarge and t2.2xlarge instances were interesting for a detailed evaluation: due to their price-performance relation, these two instances outperformed all the others, both in terms of performance and cost efficiency.
Figure 14 shows the results of the Balanced pattern of ImbBench for the two instances. We can note that the performance of the t2.2xlarge was better than that of the c5.2xlarge, with the first presenting an execution time of around 6.5 seconds and the second around 8 seconds. However, these results are not as good as expected, because in the HPL profile the t2.2xlarge presented results 50% better than the c5.2xlarge. These results could be explained by the underlying hardware and its configuration. During the HPL tests, we noted that the eight cores of the c5.2xlarge instance were 4 cores with hyperthreading, while the t2.2xlarge delivered eight cores. Moreover, the c5.2xlarge uses an Intel Xeon Platinum 8000 series processor and was designed for computing performance, while for the t2.2xlarge the processor model is not specified, only that it is an Intel Xeon. These were our findings to explain the results of the Balanced pattern.
Figure 11: Results for the Balanced pattern and Memory performance for the homogeneous Google instances cluster.
Figure 12: Results for the Linear pattern and Memory performance for the homogeneous Google instances cluster.
When we used another pattern of ImbBench, the 8Level pattern, a different situation occurred. Figure 13 shows the results of this execution. The performance of all eight levels was similar on both instances. However, with a closer look, we can observe that the first four processes of the application, the ones with the lowest load, presented worse performance on the c5.2xlarge instance, while the last four processes, the ones with the highest load, performed better on the c5.2xlarge than on the t2.2xlarge.
3.4.2 Memory
Regarding memory performance, the AWS instances behaved similarly to the CPU tests. Again, the c5.2xlarge and t2.2xlarge instances were chosen for the evaluation. Figure 15 shows the results of the memory performance test using the Linear pattern of ImbBench. As we can observe, the memory performance of both instance types is identical, meaning that there is no heterogeneity to be exploited.
The AWS instances are well balanced, like the Google instances, and there are no opportunities to exploit heterogeneity. The c5.2xlarge instance is one of the cheapest of the provider and presents the best performance as well. According to our tests, this instance performed better in all patterns, with the exception of the Balanced pattern, where the t2.2xlarge instance performed better. Because of this, the cost efficiency on AWS is best achieved with homogeneous clusters made of c5.2xlarge instances.
3.5 Discussion
Table 3 shows the overall performance and cost efficiency results for all experiments. In terms of performance across all providers, Microsoft Azure presented the instance with the best performance, the H8, both in terms of processing power and memory performance. The AWS c5.2xlarge instance presented slightly lower performance than the H8. Google came in third place with its three instances. However, in terms of price, Google's n1-highcpu-8 instance has the best price,
Figure 13: Results for the 8Level pattern and CPU performance for the homogeneous Amazon instances cluster.
Figure 14: Results for the Balanced pattern and CPU performance for the homogeneous Amazon instances cluster.
Figure 15: Results for the Linear pattern and Memory performance for the homogeneous Amazon instances cluster.
followed by AWS, while the Azure instance has the highest price.
Several interesting conclusions can be drawn from the results presented in this section. First, the Azure cloud presents a beneficial case for heterogeneous clouds, since cost efficiency improves dramatically over a homogeneous allocation with an imbalanced application. On the other hand, Google's cloud instances consist of machines with the same CPU and memory speed, varying only in memory size, and can therefore benefit less from application heterogeneity. Amazon, in contrast to the other two providers, shows the interesting situation where the cheapest machine is also the fastest, which limits its suitability for heterogeneous execution.
Table 3: Cost efficiency characteristics of the cloud in-
stances. Performance is shown in seconds. Higher values
for cost efficiency are better.
Configuration CPU cost eff. CPU perf. MEM cost eff. MEM perf.
8Level
Azure H8 175 5.27 75 12.22
Azure Het. 288 5.26 129 12.22
Google n1-standard-8 290 8.15 121 19.53
Google n1-highmem-8 230 8.25 114 16.62
Google n1-highcpu-8 386 8.21 193 16.43
Amazon c5.2xlarge 432 6.12 194 13.58
Amazon t2.2xlarge 365 6.63 158 15.32
Amdahl
Azure H8 175 5.28 75 12.62
Azure Het. 315 5.25 138 12.14
Google n1-standard-8 301 7.85 139 16.97
Google n1-highmem-8 245 7.73 128 13.68
Google n1-highcpu-8 408 7.76 228 13.84
Amazon c5.2xlarge 455 5.81 236 11.17
Amazon t2.2xlarge 369 6.55 159 15.22
Linear
Azure H8 176 5.26 73 12.14
Azure Het. 288 5.26 126 12.51
Google n1-standard-8 238 9.93 128 18.47
Google n1-highmem-8 194 9.79 93 20.29
Google n1-highcpu-8 317 9.98 163 19.40
Amazon c5.2xlarge 343 7.71 170 15.50
Amazon t2.2xlarge 368 6.58 160 15.14
However, both Google and Amazon might benefit more from heterogeneity in the future, when the instance types they offer change. Furthermore, the suitability of heterogeneous execution is similar between the CPU-bound and memory-bound versions of ImbBench. The results shown in this paper can therefore also be seen as a motivation for providers to offer instance types with more diverse characteristics, as these can be used to improve the cost efficiency of executing applications with uneven computational demands in the cloud.
4 RELATED WORK
Prabhakaran and Lakshmi (Prabhakaran and Lakshmi, 2018) evaluate the cost-benefit of using Amazon, Google, and Microsoft cloud instances instead of a supercomputer. They found that the cost-benefit of the supercomputer was better than that of the cloud. However, they assume 24x7 usage and do not consider idle time, which affects the cost calculations. Kotas et al. (Kotas et al., 2018) compared AWS and Azure as platforms for HPC. The authors executed the HPCC and HPCG benchmark suites. However, they evaluated only one instance type per provider and built cluster configurations from 1 to 32 nodes.
Aljamal et al. (Aljamal et al., 2018) compare Azure, AWS, Google, and Oracle Cloud as platforms for HPC. They made a comparative analysis of the providers' configurations and offerings, but did not perform any performance or cost efficiency experiments. (Li et al., 2018a) analyzed how to benefit from cloud heterogeneity to perform video transcoding, in order to maximize the efficiency of video streaming services. (Li et al., 2018b) created a cost- and energy-aware scheduler to reduce the cost and energy consumption of executing scientific workflows in the cloud. They evaluated their proposal using the CloudSim simulator.
Costa et al. (G. S. Costa et al., 2018) performed a performance and cost analysis comparing on-demand versus preemptive instances of the AWS and Google cloud providers. However, they only profiled the instances and did not evaluate benchmarks or applications. In our previous work (Roloff et al., 2017; Roloff et al., 2018), we exploited the heterogeneity of public clouds and application demands. This work improves on that previous work by performing a deeper analysis of the instances, with memory and CPU characterizations, and a comprehensive analysis of different public clouds.
5 CONCLUSIONS AND FUTURE
WORK
Executing HPC applications in cloud environments
has become an important research topic in recent
years. In contrast to traditional clusters, which consist of mostly homogeneous machines, executing HPC applications in the cloud allows the use of heterogeneous resources offered by the cloud provider. This way, the cloud instances can be adapted to the imbalanced behavior of the HPC application, reducing the price of executing it in the cloud.
In this paper, we performed a comprehensive
set of experiments evaluating the potential of us-
ing heterogeneous instance types on three public
cloud providers, Microsoft Azure, Amazon AWS, and
Google Cloud. Using ImbBench, a benchmark that
can create different imbalance patterns of CPU and
memory usage, we compared performance and cost
efficiency on homogeneous and heterogeneous cloud
instances. Our results show that heterogeneous execu-
tion is most beneficial on Azure, improving cost effi-
ciency by up to 50% compared to execution on homo-
geneous instances, while maintaining the same per-
formance. The other two providers are less suitable,
since the cheapest instance type is also the fastest
(AWS), or the provider offers only instances that vary
in memory size, but not in performance (Google).
As future work, we intend to extend our
ImbBench benchmark with support for I/O operations
and network communication to evaluate these aspects
in terms of heterogeneity. We also intend to add sup-
port for combining different types of operations to be
more representative of real-world applications. Fur-
thermore, we will provide an automatic way to sug-
gest a mix of cloud instances for a particular applica-
tion behavior.
ACKNOWLEDGEMENTS
This work has been partially supported by the
project “GREEN-CLOUD: Computação em Cloud com Computação Sustentável” (#16/2551-0000 488-
9), from FAPERGS and CNPq Brazil. This research
received partial funding from CYTED for the RICAP
Project.
REFERENCES
Abdennadher, N. and Belgacem, M. B. (2015). A high level
framework to develop and run e-science applications
on cloud infrastructures. In High Performance Com-
puting and Communications (HPCC).
Aljamal, R., El-Mousa, A., and Jubair, F. (2018). A com-
parative review of high-performance computing major
cloud service providers. In 2018 9th International Conference on Information and Communication Systems (ICICS), pages 181–186.
Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz,
R. H., Konwinski, A., Lee, G., Patterson, D. A.,
Rabkin, A., and Zaharia, M. (2009). Above the
clouds: A berkeley view of cloud computing. Tech-
nical report.
Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J., and
Brandic, I. (2009). Cloud computing and emerging
it platforms: Vision, hype, and reality for delivering
computing as the 5th utility. Future Generation Com-
puter Systems, 25(6):599 – 616.
d. R. Righi, R., Rodrigues, V. F., da Costa, C. A., Galante,
G., de Bona, L. C. E., and Ferreto, T. (2016). Autoe-
lastic: Automatic resource elasticity for high perfor-
mance applications in the cloud. IEEE Transactions
on Cloud Computing, 4(1):6–19.
de Melo, A. C. (2010). The New Linux ’perf’ Tools. In
Linux Kongress.
Dongarra, J. J., Luszczek, P., and Petitet, A. (2003). The
linpack benchmark: past, present and future. Con-
currency and Computation: practice and experience,
15(9):803–820.
Foster, I., Zhao, Y., Raicu, I., and Lu, S. (2008). Cloud com-
puting and grid computing 360-degree compared. In
2008 Grid Computing Environments Workshop, pages
1–10.
G. S. Costa, B., Reis, M. A. S., P. F. Araújo, A., and
Solis, P. (2018). Performance and cost analysis be-
tween on-demand and preemptive virtual machines.
In Proceedings of the 8th International Conference on
Cloud Computing and Services Science - Volume 1: CLOSER, pages 169–178. INSTICC, SciTePress.
Kotas, C., Naughton, T., and Imam, N. (2018). A compar-
ison of Amazon Web Services and Microsoft Azure
cloud platforms for high performance computing.
In 2018 IEEE International Conference on Consumer Electronics (ICCE), pages 1–4.
Li, X., Amini Salehi, M., Joshi, Y., Darwich, M., Bayoumi,
M., and Landreneau, B. (2018a). Performance analy-
sis and modeling of video transcoding using heteroge-
neous cloud services. IEEE Transactions on Parallel
and Distributed Systems, pages 1–1.
Li, Z., Ge, J., Hu, H., Song, W., Hu, H., and Luo, B.
(2018b). Cost and energy aware scheduling algorithm
for scientific workflows with deadline constraint in
clouds. IEEE Transactions on Services Computing,
11(4):713–726.
McCalpin, J. D. et al. (1995). Memory bandwidth and ma-
chine balance in current high performance computers.
IEEE computer society technical committee on com-
puter architecture (TCCA) newsletter, 1995:19–25.
Prabhakaran, A. and Lakshmi, J. (2018). Cost-benefit Anal-
ysis of Public Clouds for offloading in-house HPC
Jobs. 2018 IEEE 11th International Conference on
Cloud Computing (CLOUD), pages 57–64.
Roloff, E., Diener, M., Carissimi, A., and Navaux, P. O. A.
(2012). High performance computing in the cloud:
Deployment, performance and cost efficiency. In 4th
IEEE International Conference on Cloud Computing
Technology and Science Proceedings, pages 371–378.
Roloff, E., Diener, M., Carreño, E. D., Gaspary, L. P., and
Navaux, P. O. A. (2017). Leveraging cloud hetero-
geneity for cost-efficient execution of parallel appli-
cations. In Euro-Par 2017: Parallel Processing - 23rd
International Conference on Parallel and Distributed
Computing, Santiago de Compostela, Spain, August
28 - September 1, 2017, Proceedings, pages 399–411.
Roloff, E., Diener, M., Gaspary, L. P., and Navaux, P. O. A.
(2018). Exploiting load imbalance patterns for hetero-
geneous cloud computing platforms. In Proceedings
of the 8th International Conference on Cloud Comput-
ing and Services Science - Volume 1: CLOSER, pages
248–259. INSTICC, SciTePress.
Zhang, J., Lu, X., Arnold, M., and Panda, D. K. (2015).
Mvapich2 over openstack with sr-iov: An efficient ap-
proach to build hpc clouds. In Cluster, Cloud and Grid
Computing (CCGrid), 2015 15th IEEE/ACM Interna-
tional Symposium on, pages 71–80.