Towards Optimizing Cost and Performance for Parallel Workloads in Cloud Computing

William Maas¹, Fábio Diniz Rossi², Marcelo C. Luizelli³, Philippe O. A. Navaux¹ and Arthur F. Lorenzon¹

¹Institute of Informatics, Federal University of Rio Grande do Sul, Brazil
²Campus Alegrete, Federal Institute Farroupilha, Brazil
³Campus Alegrete, Federal University of Pampa, Brazil
fabio.rossi@iffarroupilha.edu.br, marceloluizelli@unipampa.edu.br, {wbmaas, navaux, aflorenzon}@inf.ufrgs.br
Keywords: Parallel Computing, Cloud Computing, Cost, Performance.
Abstract: The growing popularity of data-intensive applications in cloud computing necessitates a cost-effective approach to harnessing distributed processing capabilities. However, the wide variety of instance types and configurations available can lead to substantial costs if not selected based on the parallel workload requirements, such as CPU and memory usage and thread scalability. This situation underscores the need for scalable and economical infrastructure that effectively balances parallel workloads’ performance and expenses. To tackle this issue, this paper comprehensively analyzes performance, costs, and trade-offs across 18 parallel workloads utilizing 52 high-performance computing (HPC) optimized instances from three leading cloud providers. Our findings reveal that no single instance type can simultaneously offer the best performance and the lowest costs across all workloads. Instances that excel in performance do not always provide the best cost efficiency, while the most affordable options often struggle to deliver adequate performance. Moreover, we demonstrate that by customizing instance selection to meet the specific needs of each workload, users can achieve up to 81.2% higher performance and reduce costs by 95.5% compared to using a single instance type for every workload.
1 INTRODUCTION
Applications that handle massive data volumes, such
as machine learning, data mining, and health simula-
tions, are becoming increasingly prominent in cloud
computing (Navaux et al., 2023). These workloads
leverage the distributed processing capabilities of
cloud servers to execute high-performance computing
(HPC) tasks to reduce the execution time of the entire
workload. However, the extensive variety of available
instance types and configurations poses a challenge,
often leading to substantial operational costs. This is-
sue is exacerbated by the slowing efficiency gains in
newer hardware generations resulting from the end of
Dennard scaling (Baccarani et al., 1984). As a result,
cloud servers require a scalable and cost-effective in-
frastructure to efficiently run applications, providing
resources on demand via the Internet. Consequently,
choosing the most suitable instance type for a given
workload is critical for optimizing performance and
cost-efficiency.
However, optimizing the cost-performance bal-
ance for parallel workloads in the cloud is challenging
due to the wide variety of instance types, each with
distinct computational capabilities and pricing mod-
els. AWS, Google Cloud, and Azure offer general-
purpose and compute-optimized instances tailored for
specific workloads (AWS, 2024). However, selecting
the best instance becomes more complex when work-
loads with different characteristics must be matched
to the most suitable option across multiple cloud plat-
forms. Moreover, parallel execution improves perfor-
mance by utilizing available computational resources,
but gains are not always proportional to core count.
Off-chip bus saturation, shared memory contention,
and synchronization overhead (Suleman et al., 2008;
de Lima et al., 2024) can limit scalability, making
higher core counts ineffective for some workloads.
For instance, increasing from 32 to 64 vCPUs may not
accelerate machine learning tasks. Achieving an opti-
mal cost-performance trade-off requires selecting in-
stances based on workload-specific thread-level paral-
lelism (TLP). However, varying workload behaviors
and memory access patterns mean an instance effi-
cient for one task may be suboptimal for another.
Given the scenario discussed above, we present
Maas, W., Rossi, F. D., Luizelli, M. C., Navaux, P. O. A. and Lorenzon, A. F.
Towards Optimizing Cost and Performance for Parallel Workloads in Cloud Computing.
DOI: 10.5220/0013421600003950
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 15th International Conference on Cloud Computing and Services Science (CLOSER 2025), pages 231-238
ISBN: 978-989-758-747-4; ISSN: 2184-5042
Proceedings Copyright © 2025 by SCITEPRESS Science and Technology Publications, Lda.
231
a comprehensive cost-performance evaluation when
running eighteen parallel workloads with varying
memory and CPU usage and TLP degree require-
ments across 52 HPC- and Compute-optimized in-
stances of different types from three major cloud
providers: Google Cloud Platform (GCP), Amazon
Web Services (AWS), and Microsoft Azure. Through
this extensive set of experiments, we show that: (i)
Given the characteristics of each parallel workload re-
garding CPU and memory usage, no single instance
can provide the best performance, the lowest cost,
and the best trade-off between both across all eigh-
teen parallel workloads simultaneously. (ii) By selecting the most suitable instance for each workload based on its thread-scalability characteristics, users can achieve a 35.7% improvement in performance and a 6.5% reduction in costs compared to using the single instance type that delivers the best results for the entire set of workloads. (iii) Microsoft Copilot in Azure, when used to recommend an instance for each parallel workload, fails to identify the most suitable one because it does not consider intrinsic characteristics of parallel applications, such as thread scalability.
2 BACKGROUND
2.1 Cloud Computing
Cloud computing has become the standard for run-
ning applications, offering on-demand resources (Liu
et al., 2012). Initially, compute-intensive applica-
tions like Big Data faced challenges due to hyper-
visor overhead, despite efforts to ensure elasticity
and availability (Barham et al., 2003). To address
this, lightweight container technologies like Docker
emerged, offering near-native performance. Docker
has become the leading framework for developing
and running containerized applications, encapsulating
necessary components for efficient execution. Docker
is widely used for deploying parallel workloads, pro-
viding isolated environments with all dependencies.
Its lightweight nature minimizes virtualization over-
head, making it ideal for HPC applications. This
enables efficient scaling across multiple nodes, op-
timizing cloud-based HPC infrastructures. As a re-
sult, companies like NVIDIA, AMD, and Intel use
Docker to optimize workloads for cloud execution.
Cloud providers offer instance types tailored to spe-
cific workloads. Compute-optimized instances, such
as AWS’s C-series and Azure’s F-series, prioritize
CPU performance for data processing but come at a
higher cost. HPC-optimized instances, like AWS’s
H-series and Azure’s HBv3-series, feature advanced
architectures and high-speed interconnects, offering
maximum performance for scientific simulations and
large-scale data analysis.
2.2 Thread Scalability
When executing parallel workloads in cloud environ-
ments, developers often maximize resource use, in-
cluding vCPUs and cache memory. However, studies
indicate that this approach may not yield optimal per-
formance (Suleman et al., 2008; Subramanian et al.,
2013). Performance limitations arise from software
and hardware factors such as off-chip bus saturation,
concurrent shared memory accesses, and data syn-
chronization. Off-Chip Bus Saturation: Applica-
tions requiring frequent memory access may experi-
ence scalability issues when memory bandwidth bot-
tlenecks. As more threads are added, the connec-
tion’s bandwidth remains limited by I/O pins (Ham
et al., 2013), leading to increased power consump-
tion without performance gains. Concurrent Shared
Memory Accesses: Performance can degrade when
multiple threads frequently access shared memory lo-
cations. Since these accesses occur in distant mem-
ory areas like last-level cache, they introduce higher
latency than private caches, potentially creating bot-
tlenecks (Subramanian et al., 2013). Data Syn-
chronization: Threads accessing shared variables re-
quire synchronization to maintain correctness. How-
ever, only one thread can access a critical section at
a time, forcing others to wait. As the number of
threads increases, serialization effects become more
pronounced, limiting application scalability (Suleman
et al., 2008).
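To make the synchronization effect concrete, the following OpenMP sketch (illustrative only; not one of the evaluated benchmarks) updates a shared accumulator inside a critical section. Because the critical region admits only one thread at a time, raising the thread count mainly adds waiting time instead of speedup, which is the behavior observed later for workloads such as BFS.

#include <omp.h>
#include <stdio.h>

/* Illustrative only: a shared accumulator protected by a critical
 * section. The critical region serializes the update, so adding
 * threads increases contention rather than throughput. */
int main(void) {
    const long iterations = 10000000L;
    double sum = 0.0;

    double start = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < iterations; i++) {
        #pragma omp critical
        sum += 1.0;              /* only one thread at a time */
    }
    double elapsed = omp_get_wtime() - start;

    printf("threads=%d sum=%.0f time=%.3fs\n",
           omp_get_max_threads(), sum, elapsed);
    return 0;
}

Compiled with gcc -fopenmp, running this kernel with OMP_NUM_THREADS=8 and then 64 typically shows little or no improvement, mirroring the serialization effect described above.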
2.3 Related Work
Recent studies have explored parallel computing in
cloud environments and parallel workload optimiza-
tion. Ekanayake et al. (Ekanayake and Fox, 2010)
examined cloud-based parallel processing for scien-
tific applications, highlighting distributed comput-
ing’s ability to handle intensive workloads. Rak et al.
(Rak et al., 2013) developed a cost-prediction method
combining benchmarking and simulation, aiding in
cloud cost management. Rathnayake et al. (Rath-
nayake et al., 2017) introduced CELIA, an analytical
model optimizing cloud resource allocation by bal-
ancing execution time and costs. Wan et al. (Wan
et al., 2020) proposed a cost-performance model us-
ing queuing systems to enhance workload manage-
ment. Several works focus on parallel workload op-
timization.

Figure 1: Characteristics of each parallel workload regarding the average IPC and LLC Miss Ratio on the AWS c6a 32xlarge instance type.

Zhang et al. (Zhang et al., 2020) developed the EPRD algorithm to minimize schedul-
ing durations for precedence-constrained tasks. Li
et al. (Li et al., 2020) introduced AdaptFR, accel-
erating distributed machine learning training through
scalable resource allocation. Ohue et al. (Ohue et al.,
2020) demonstrated cloud computing’s adaptability
for protein-protein docking simulations. Bhardwaj et
al. (Bhardwaj et al., 2020) tackled task scheduling
in parallel cloud jobs, improving efficiency. Hauss-
mann et al. (Haussmann et al., 2019b) evaluated cost-
performance trade-offs using reserved and volatile
processors. Their later work (Haussmann et al.,
2019a) optimized processor counts for irregular par-
allel applications. They also introduced an elasticity
description language to enhance resource allocation
in cloud deployments (Haussmann et al., 2020).
3 METHODOLOGY
3.1 Parallel Workloads
Eighteen parallel workloads, written in C/C++ and parallelized with the OpenMP (Open Multi-Processing) API, were selected from assorted benchmark suites for evaluation. Seven benchmarks are
from the NAS Parallel Benchmark (Bailey et al.,
1995), used to evaluate the performance of paral-
lel hardware and software: BT (Block Tridiago-
nal), CG (Conjugate Gradient), FT (Fourier Trans-
form), LU (LU Decomposition), MG (Multigrid), SP
(Scalar Pentadiagonal), and UA (Unstructured Adap-
tive). Three applications from the Rodinia bench-
mark suite (Che et al., 2009), used to identify per-
formance bottlenecks and compare different comput-
ing platforms: HS (HotSpot), LUD (LU Decomposi-
tion with Data Redistribution), and SC (Streamclus-
ter). Two workloads from the Parboil suite (Strat-
ton et al., 2012), a set of applications used to study
the performance of throughput computing architec-
tures: BFS (Breadth-First Search) and LBM (Lat-
tice Boltzmann Method). Six from distinct domains:
FFT (Fast Fourier Transform) (Brigham and Mor-
row, 1967), HPCG (High-Performance Conjugate
Gradient)(Heroux et al., 2013), JA (Jacobi iteration),
LULESH (Livermore Unstructured Lagrangian Ex-
plicit Shock Hydrodynamics) (Karlin et al., 2013),
PO (Poisson Equation), and ST (Stream).
The applications were executed with the standard
input set defined on each benchmark suite. We com-
piled each application with gcc/g++ 12.0, using the
-O3 optimization flag. The chosen benchmarks cover
a wide range of parallel workloads when considering
the average instructions per cycle (IPC) and last-level
cache (LLC) memory behavior on the target architec-
tures, as shown in Figure 1 for the AWS c6a 32xlarge
instance type. We consider the IPC metric as it re-
flects how CPU- or memory-intensive each applica-
tion is. The number of LLC misses, in turn, reflects the number of DRAM accesses: the higher the LLC miss ratio, the more time the threads spend fetching data from main memory, which lowers the IPC.
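For reference, and assuming both metrics are derived from the usual hardware event counts (retired instructions, elapsed cycles, LLC references, and LLC misses; the exact counter names depend on the processor and the profiling tool), the two characterization metrics can be written as:

\mathrm{IPC} = \frac{\text{instructions retired}}{\text{CPU cycles}},
\qquad
\text{LLC miss ratio} = \frac{\text{LLC misses}}{\text{LLC references}}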
3.2 Target Instances
We consider a set of fifty-two cloud instances from
the compute- and HPC-optimized domains (Table 1).
The selected instances span different processors, including AMD, Intel, and ARM (AWS Graviton). In
addition to choosing various instance types, we se-
lected different configurations (i.e., variations in Ta-
ble 1) within the same instance type, each with vary-
ing numbers of available vCPUs. This setup allows us
to examine how performance and cost efficiency are
impacted by scaling computational resources within
the same instance family. For example, within the
AWS c7g series, we considered instances with 8, 16,
32, 48, and 64 vCPUs to assess the ability of paral-
lel applications to leverage additional cores and mem-
ory bandwidth efficiently. Additionally, all instances
were configured to use the same operating system
(Ubuntu 22.04) and kernel version. Our analysis used
the on-demand pricing model to maintain consistency
and comparability among various instance types and
cloud providers.
3.3 Execution Environment
The process for running each workload on the cloud
instance was as follows: first, the workload binary
was placed within a Docker container to ensure iso-
lated execution, regardless of the architecture. Then,
the container was deployed for execution on the cor-
responding instance. During execution, the number of
cores assigned to the container (and thus to the work-
Table 1: Characteristics of the Compute- and HPC-Optimized instances.

Instance Type | Processor | Variations | vCPUs | Cost per Hour (USD)
gcp-h3 | Intel Xeon Platinum 8481C | std-88 | 88 | 4.93
gcp-c2 | Intel Xeon Gold 6253CL | std-8; std-16; std-30; std-60 | 8; 16; 30; 60 | 0.34; 0.67; 1.25; 2.51
gcp-c2d | AMD EPYC 7B13 | std-8; std-16; std-32; std-56; std-112 | 8; 16; 32; 56; 112 | 0.36; 0.73; 1.45; 2.54; 5.09
aws-c6a | AMD EPYC 7R13 | 2xlarge; 4xlarge; 8xlarge; 16xlarge; 32xlarge; 48xlarge | 8; 16; 32; 64; 128; 192 | 0.31; 0.61; 1.22; 2.45; 4.90; 7.94
aws-c7i | Intel Xeon 8488C | 2xlarge; 4xlarge; 8xlarge; 12xlarge; 16xlarge; 32xlarge; 48xlarge | 8; 16; 32; 48; 64; 96; 192 | 0.36; 0.71; 1.43; 2.14; 2.86; 4.28; 8.57
aws-c7g | AWS Graviton3 | 2xlarge; 4xlarge; 8xlarge; 12xlarge; 16xlarge | 8; 16; 32; 48; 64 | 0.29; 0.58; 1.16; 1.73; 2.31
aws-hpc7g | AWS Graviton3E | 4xlarge; 8xlarge; 16xlarge | 16; 32; 64 | 1.69
aws-hpc6a | AMD EPYC 7R13 | 48xlarge | 96 | 2.88
aws-hpc6id | 3rd Gen Intel Xeon Scalable | 32xlarge | 64 | 8.64
aws-hpc7a | AMD EPYC 9R14 | 12xlarge; 24xlarge; 48xlarge; 96xlarge | 24; 48; 96; 192 | 7.20
azure-fsv2 | Intel Xeon Platinum 8370C | f8s v2; f16s v2; f32s v2; f64s v2; f72s v2 | 8; 16; 32; 64; 72 | 0.34; 0.68; 1.35; 2.71; 3.05
azure-hx176 | AMD EPYC 9V33X | 24rs; 48rs; 96rs; 144rs; 176 | 24; 48; 96; 144; 176 | 8.64
azure-hb176 | AMD EPYC 9V33X | 24rs v4; 48rs v4; 96rs v4; 144rs v4; 176rs v4 | 24; 48; 96; 144; 176 | 7.20
Figure 2: Total time (top), cost (middle), and the trade-off between both (bottom) to execute the batch of workloads on each cloud instance. Each set of bar colors represents a different instance type. The lower the value, the better.
load) was set to the number of system cores (vCPUs
in the instance), which is the standard practice for ex-
ecuting parallel applications. To measure the execu-
tion time of each run, we used the omp_get_wtime
function from OpenMP. The cost to execute each con-
tainer (and therefore the workload) was calculated by
multiplying the cost per hour (see Table 1) by the execution time, converted to hours. Moreover, to calculate
the trade-off between performance and cost, we con-
sidered the Euclidean distance from the pair of execu-
tion time and cost for each instance to the origin point,
representing the ideal scenario of zero cost and zero
execution time. This metric quantitatively measures
how close each instance is to the optimal balance be-
tween cost and performance, enabling a fair compar-
ison across all tested instances. The smaller the dis-
tance, the better the trade-off the instance achieves for
the given workload. The results discussed next are
the average of 30 executions of each container and
instance type, with a standard deviation of less than
0.5%.
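As a minimal sketch of this metric computation, assuming the execution time is measured with omp_get_wtime and the hourly price is taken from Table 1, the following C fragment derives the per-run cost and the Euclidean-distance trade-off. The distance is shown on the raw (time, cost) pair, as described above; since the two axes are expressed in different units, a normalization step may be applied in practice before combining them.

#include <math.h>
#include <omp.h>
#include <stdio.h>

/* Cost of one run: hourly price scaled by the measured execution time. */
static double run_cost_usd(double cost_per_hour, double exec_time_s) {
    return cost_per_hour * (exec_time_s / 3600.0);
}

/* Trade-off metric: Euclidean distance of (time, cost) from the origin,
 * i.e., from the ideal point of zero execution time and zero cost. */
static double tradeoff(double exec_time_s, double cost_usd) {
    return sqrt(exec_time_s * exec_time_s + cost_usd * cost_usd);
}

int main(void) {
    double start = omp_get_wtime();
    /* ... run the parallel workload here ... */
    double exec_time_s = omp_get_wtime() - start;

    double cost_per_hour = 0.29;  /* e.g., aws-c7g 2xlarge from Table 1 */
    double cost = run_cost_usd(cost_per_hour, exec_time_s);

    printf("time=%.2f s  cost=%.5f USD  tradeoff=%.5f\n",
           exec_time_s, cost, tradeoff(exec_time_s, cost));
    return 0;
}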
4 EXPERIMENTAL EVALUATION
4.1 Performance and Cost Evaluation
In this Section, we discuss the performance and cost
of running the workload set on each target instance.
Figure 2 shows the total execution time (top), cost
(middle), and the balance between them (bottom) for
running the eighteen workloads on every instance, re-
spectively. All the results are sorted by the lowest
execution time. Moreover, for all plots, lower values
indicate a better outcome.
When processing all workloads in batch, the AWS
c7g.12xlarge instance achieves the best performance,
surpassing the Azure hb176-96 by 3.3%. In con-
trast, the Azure f8s v2 is the worst-performing instance: moving from it to c7g.12xlarge reduces the total execution time by 81.1%, largely due to its lower vCPU count (8 vs. 48 in c7g.12xlarge). How-
ever, c7g.12xlarge was not the most cost-effective op-
tion, costing 2.33× more than c7g.2xlarge, while the
c7g.2xlarge was 95.6% cheaper than the c7i.48xlarge.
Although c7g.12xlarge delivered the best perfor-
mance and c7g.2xlarge the lowest cost, neither was
optimal for every workload. Table 2 shows that no
single instance consistently outperformed others. For
instance, BT ran best on hb176-144 with 144 vC-
PUs, FFT on hpc7g.4xlarge with 16 vCPUs, and BFS
on c6a-2xlarge with 8 vCPUs due to thread scalabil-
ity limitations. The BFS workload suffers from syn-
chronization overhead, where adding more threads
increases waiting time instead of improving perfor-
mance (Suleman et al., 2008). Similarly, workloads
like FFT and LL that require frequent memory access
face off-chip bus saturation beyond a certain thread
count. For FFT, optimal performance was achieved
with 16 vCPUs on m7g.4xlarge and hpc7g.4xlarge.
Increasing the number of vCPUs beyond this point led to a 5.6% execution time increase on m7g.16xlarge and 5.01% on hpc7g.16xlarge due to bandwidth limitations.

Figure 3: Performance and cost comparison between the best results found by the best instances for each metric and the strategies that optimize for each metric.
Furthermore, because threads communicate by ac-
cessing data in shared memory regions, the over-
head imposed by the shared memory accesses may
outweigh the gains achieved by the parallelism ex-
ploitation, as discussed in Section 2. This behavior
was observed for workloads with high communica-
tion demands, including SC and HPCG. In the case
of HPCG, the hx176-48rs with 48 vCPUs performed
better than the hx176 with 176 vCPUs. While we
identified thread communication as the primary rea-
son for this behavior, other factors such as thread
scheduling, data affinity, or the inherent characteris-
tics of the workload may also influence L3 cache per-
formance. For instance, a workload with a high rate
of private access to L1 and L2 caches may also result
in a rise in L3 accesses. There are also scenarios of
workloads that were able to scale by increasing the
number of hardware resources in the instances, such
as JA, LUD, and LBM (refer to Table 2).
4.2 Optimizing the Trade-off Between
Performance and Cost
In the previous section, we evaluated the performance
and cost of each instance when the entire workload
set is executed on the same instance. However, as shown in Table 2, no single instance can deliver the lowest cost and execution time for all workloads simultaneously, so determining the ideal instance for each workload based on its characteristics is essential to improve performance and reduce costs. Hence, in this section, we compare
the results of the best instances for the overall per-
formance (c7g.12xlarge), cost (c7g.2xlarge), and the
trade-off between cost and performance (c7g.4xlarge)
to the following strategies. Opt-Perf: runs each workload on the instance that delivers its best performance (as shown in Table 2) and considers the resulting total execution time for the batch.
Table 2: Ideal compute- and HPC-optimized instances found by an exhaustive search to execute each parallel workload. Each instance has associated with it the time (in seconds) and cost (in USD) to execute the workload according to the target metric.

Workload | Opt-Performance | Opt-Cost (USD) | Opt-Balance
FFT | hpc7g.4xlarge (19.9s) | c7g.2xlarge (0.00162) | c7g.8xlarge (20.0s - 0.00642)
HPCG | hx176-48rs (1.9s) | c7g.2xlarge (0.00051) | hb176-48 (2.0s - 0.00398)
JA | hb176 (0.4s) | c7g.2xlarge (0.00051) | hb176 (0.4s - 0.00078)
LL | c7i.8xlarge (19.0s) | c7g.2xlarge (0.00259) | c7i.8xlarge (19.0s - 0.00755)
PO | c7g.12xlarge (0.18s) | c7g.2xlarge (0.00004) | c7g.12xlarge (0.18s - 0.00009)
ST | hpc7g.16xlarge (0.06s) | c7g.8xlarge (0.00003) | hpc7g.16xlarge (0.06s - 0.00003)
BT | hb176-144 (2.0s) | c7g.2xlarge (0.00185) | hpc7g.16xlarge (3.9s - 0.00186)
CG | hpc7a.96xlarge (0.4s) | c7g.2xlarge (0.00029) | hpc6a (0.6s - 0.00046)
FT | hb176-144 (0.2s) | c7g.2xlarge (0.00028) | hpc7g.16xlarge (0.7s - 0.00033)
LU | hb176-144 (2.1s) | c7g.2xlarge (0.00094) | hpc7g.16xlarge (2.5s - 0.00117)
MG | hb176-144 (0.5s) | c7g.2xlarge (0.00016) | c7g.8xlarge (1.2s - 0.00037)
SP | hx176-144rs (1.5s) | c7g.2xlarge (0.00080) | hpc6a (2.7s - 0.00215)
UA | hx176-96rs (2.1s) | c7g.2xlarge (0.00132) | hpc7g.16xlarge (3.1s - 0.00145)
HS | c7g.16xlarge (3.1s) | c7g.2xlarge (0.00101) | hpc7g.16xlarge (3.1s - 0.00146)
LUD | hb176 (2.2s) | c7g.2xlarge (0.00164) | c7g.8xlarge (5.3s - 0.00170)
SC | hb176 (19.3s) | c7g.2xlarge (0.00614) | c7g.12xlarge (24.5s - 0.01179)
BFS | c6a.2xlarge (2.08s) | c6a.2xlarge (0.00018) | c6a.2xlarge (2.08s - 0.00018)
LBM | hx176-96rs (5.5s) | hpc6a (0.00561) | hpc6a (7.0s - 0.00561)
Opt-Cost: the same as the previous, but considering
the Cost as the optimization metric. Opt-Balance:
the same as the Opt-Perf, but considering the balance
between cost and performance. GCP-Copilot: employs Microsoft Copilot in Azure to find the best instance type for each workload. The prompt provided to Copilot was: Which VM size will best suit for the workload with IPC = XX, LLC Misses Ratio = XX, and TLP degree = XX?, where XX was replaced by the values shown in Figure 1. Table 3 lists the instances recommended for each workload and compares them to the instances that deliver the best performance, found by an exhaustive search (GCP-Opt).
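The three Opt-* strategies reduce to an exhaustive per-workload search over the measured instances. The sketch below illustrates that selection, assuming the measurements are available as (time, cost) pairs; the record layout and the numbers are hypothetical, and the per-axis max normalization used for Opt-Balance is an assumption, since the text only states that the Euclidean distance to the ideal zero-time, zero-cost point is used.

#include <float.h>
#include <math.h>
#include <stdio.h>

/* Illustrative record: one instance's measured result for one workload.
 * Field names and numbers are hypothetical, not taken from the paper. */
typedef struct {
    const char *instance;
    double time_s;    /* execution time in seconds */
    double cost_usd;  /* execution cost in USD     */
} Measurement;

enum { OPT_PERF, OPT_COST, OPT_BALANCE };

/* Exhaustive search: index of the best instance under the chosen metric.
 * For OPT_BALANCE, time and cost are scaled by their per-workload maxima
 * before taking the Euclidean distance to the origin. */
static int best_instance(const Measurement *m, int n, int metric) {
    double t_max = 0.0, c_max = 0.0;
    for (int i = 0; i < n; i++) {
        if (m[i].time_s   > t_max) t_max = m[i].time_s;
        if (m[i].cost_usd > c_max) c_max = m[i].cost_usd;
    }
    int best = 0;
    double best_val = DBL_MAX;
    for (int i = 0; i < n; i++) {
        double v;
        if (metric == OPT_PERF)      v = m[i].time_s;
        else if (metric == OPT_COST) v = m[i].cost_usd;
        else {
            double t = m[i].time_s / t_max, c = m[i].cost_usd / c_max;
            v = sqrt(t * t + c * c);
        }
        if (v < best_val) { best_val = v; best = i; }
    }
    return best;
}

int main(void) {
    Measurement sample[] = {              /* hypothetical numbers */
        {"instance-A", 10.0, 0.030},
        {"instance-B", 30.0, 0.006},
        {"instance-C", 12.0, 0.012},
    };
    int n = (int)(sizeof sample / sizeof sample[0]);
    printf("Opt-Perf:    %s\n", sample[best_instance(sample, n, OPT_PERF)].instance);
    printf("Opt-Cost:    %s\n", sample[best_instance(sample, n, OPT_COST)].instance);
    printf("Opt-Balance: %s\n", sample[best_instance(sample, n, OPT_BALANCE)].instance);
    return 0;
}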
It is important to note that we could not com-
pare our results with the recommendation tools pro-
vided by AWS and GCP due to the following reasons:
(i) The AWS Compute Optimizer generates recom-
mendations by analyzing resource specifications and
utilization metrics collected through Amazon Cloud-
Watch over a 14-day period, making it unsuitable
for scenarios lacking historical usage data ((AWS),
2024). (ii) Similarly, the machine-type recommen-
dations from GCP rely on system metrics gathered
by the Cloud Monitoring service during the previous
8 days, which also requires prior resource utilization
data (Cloud, 2024). These constraints made integrat-
ing these tools into our evaluation impractical, as they
depend on prior workload execution history, which
was not the case in our experimental setup.
Figure 3 depicts the results for executing all work-
loads with the discussed instances and strategies.
Each mark in the plot represents the execution time
(y-axis) and cost (x-axis) for a given instance or strat-
egy. The closer the point is to the origin (0, 0), the bet-
ter the balance between performance and cost. When
prioritizing performance, using the best instance to
execute each workload (Opt-Perf) yields a 35.5% performance improvement compared to the instance
that delivers the best overall performance for all work-
loads (c7g.12xlarge). When the cost is considered
as the target optimization metric, selecting the ideal
instance to execute each workload (Opt-Cost) leads
the user to save 4% compared to the best single in-
stance (c7g.2xlarge) while also reducing the total ex-
ecution time by 32.5%. When both performance and
cost are equally important for evaluation, choosing
the ideal instance to execute each workload brings
performance improvements and a lower cost to ex-
ecute the entire set of applications: Opt-Balance is 2.06 times faster while being only 1.45% more expensive than executing all workloads on the single instance that provides the best trade-off between cost and performance (c7g.4xlarge).
Regarding Microsoft Copilot in Azure Cloud, it
failed to select optimal instances for parallel work-
loads because it does not integrate critical metrics
like TLP Degree, average IPC, and LLC Miss Ra-
tio, which are essential for characterizing inter-thread
communication and memory access patterns. Copilot
relies mainly on historical data and general resource
usage, neglecting detailed microarchitectural features
Table 3: Instances recommended by the Microsoft Azure Copilot compared to the instance that delivered the best outcome, found by an exhaustive search.

Workload | Microsoft Azure Copilot | GCP-Opt
FFT | HB176-144 | hb176-24
HPCG | HB176-144 | hx176-48rs
JA | f32s v2 | HB176-24
LL | hx176-144rs | hb176-24
PO | hb176-24 | NB176-48
ST | hx176-48rs | HB176-96
BT | hx176rs-176 | HB176-144
CG | hx176-96rs | hx176rs-176
FT | f32s v2 | HB176-144
LU | HB176-144 | HB176-144
MG | hx176-96rs | HB176-144
SP | HB176-24 | hx176-144rs
UA | f64s v2 | hx176-96rs
HS | HB176-96 | HB176-96
LUD | HB176-144 | HB176-176
SC | hx176-144rs | HB176-176
BFS | NB176-48 | f8s v2
LBM | HB176-96 | hx176-96rs
and workload-specific requirements. As a result, it
failed to identify instances that optimize performance,
cost, or both. Comparing GCP-Copilot with selecting
the best GCP instance for each workload (GCP-Opt),
Copilot was 93% more expensive and 96% slower,
emphasizing the need to incorporate thread scalabil-
ity metrics in its recommendation model.
5 CONCLUSION
Optimizing performance and cost for parallel work-
loads in cloud computing is crucial for handling data-
intensive applications efficiently. Cloud providers
like AWS, GCP, and Azure offer various HPC-
optimized instances, each with unique cost and per-
formance characteristics. However, selecting the op-
timal instance type based on CPU utilization, mem-
ory needs, and thread scalability remains challeng-
ing. This study analyzed 18 parallel workloads on
52 HPC instances to evaluate performance-cost trade-
offs. Results show that no single instance type is
universally optimal. High-performance instances ex-
cel in compute-heavy tasks but are not always cost-
efficient, while lower-cost options often fail to meet
performance needs. By matching instances to work-
load requirements, users achieved up to 81.2% better
performance and 95.5% cost savings over a generic
approach. Future work will explore GPU-based
instances and other pricing models to refine cost-
performance optimization strategies in cloud environ-
ments.
ACKNOWLEDGEMENTS
This study was financed in part by the CAPES - Fi-
nance Code 001, FAPERGS - PqG 24/2551-0001388-
1, and CNPq.
REFERENCES
AWS (2024). Amazon EC2. https://aws.amazon.com/ec2/.
[Online; accessed 19-February-2024].
(AWS), A. W. S. (2024). What is aws compute optimizer?
Accessed: 2024-12-24.
Baccarani, G., Wordeman, M. R., and Dennard, R. H.
(1984). Generalized scaling theory and its application
to a 1/4 micrometer MOSFET design. IEEE Transac-
tions on Electron Devices, 31(4):452–462.
Bailey, D., Harris, T., Saphir, W., Van Der Wijngaart, R.,
Woo, A., and Yarrow, M. (1995). The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center.
Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T.,
Ho, A., Neugebauer, R., Pratt, I., and Warfield, A.
(2003). Xen and the art of virtualization. SIGOPS
Oper. Syst. Rev., 37(5):164–177.
Bhardwaj, A. K., Gajpal, Y., Surti, C., and Gill, S. S.
(2020). Heart: Unrelated parallel machines prob-
lem with precedence constraints for task schedul-
ing in cloud computing using heuristic and meta-
heuristic algorithms. Software: Practice and Expe-
rience, 50(12):2231–2251.
Brigham, E. O. and Morrow, R. E. (1967). The Fast Fourier Transform. IEEE Spectrum, 4(12):63–70.
Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J. W.,
Lee, S.-H., and Skadron, K. (2009). Rodinia: A
benchmark suite for heterogeneous computing. In
2009 IEEE International Symposium on Workload
Characterization (IISWC), pages 44–54.
Cloud, G. (2024). Apply machine type recommendations
for instances. Accessed: 2024-12-24.
de Lima, E. C., Rossi, F. D., Luizelli, M. C., Calheiros,
R. N., and Lorenzon, A. F. (2024). A neural network
framework for optimizing parallel computing in cloud
servers. Journal of Systems Architecture, 150:103131.
Ekanayake, J. and Fox, G. (2010). High performance par-
allel computing with clouds and cloud technologies.
In Avresky, D. R., Diaz, M., Bode, A., Ciciani, B.,
and Dekel, E., editors, Cloud Computing, pages 20–
38, Berlin, Heidelberg. Springer Berlin Heidelberg.
Ham, T. J., Chelepalli, B. K., Xue, N., and Lee, B. C.
(2013). Disintegrated control for energy-efficient and
heterogeneous memory systems. In IEEE HPCA,
pages 424–435.
Haussmann, J., Blochinger, W., and Kuechlin, W. (2019a).
Cost-efficient parallel processing of irregularly struc-
tured problems in cloud computing environments.
Cluster Computing, 22(3):887–909.
Haussmann, J., Blochinger, W., and Kuechlin, W. (2019b).
Cost-optimized parallel computations using volatile
cloud resources. In Djemame, K., Altmann, J.,
Bañares, J. Á., Agmon Ben-Yehuda, O., and Naldi,
M., editors, Economics of Grids, Clouds, Systems, and
Services, pages 45–53, Cham. Springer International
Publishing.
Haussmann, J., Blochinger, W., and Kuechlin, W. (2020).
An elasticity description language for task-parallel
cloud applications. In CLOSER, pages 473–481.
Heroux, M. A., Dongarra, J., and Luszczek, P. (2013). HPCG benchmark technical specification.
Karlin, I., Bhatele, A., Keasler, J., Chamberlain, B. L.,
Cohen, J., DeVito, Z., Haque, R., Laney, D., Luke,
E., Wang, F., Richards, D., Schulz, M., and Still,
C. (2013). Exploring traditional and emerging par-
allel programming models using a proxy application.
In 27th IEEE International Parallel & Distributed
Processing Symposium (IEEE IPDPS 2013), Boston,
USA.
Li, M., Zhang, J., Wan, J., Ren, Y., Zhou, L., Wu, B., Yang,
R., and Wang, J. (2020). Distributed machine learning
load balancing strategy in cloud computing services.
Wireless Networks, 26:5517–5533.
Liu, F., Tong, J., Mao, J., Bohn, R., Messina, J., Badger, L.,
and Leaf, D. (2012). NIST Cloud Computing Refer-
ence Architecture: Recommendations of the National
Institute of Standards and Technology. CreateSpace
Independent Publishing Platform, USA.
Navaux, P. O. A., Lorenzon, A. F., and da Silva Serpa, M.
(2023). Challenges in high-performance computing.
Journal of the Brazilian Computer Society, 29(1):51–
62.
Ohue, M., Aoyama, K., and Akiyama, Y. (2020). High-
performance cloud computing for exhaustive protein-
protein docking.
Rak, M., Cuomo, A., and Villano, U. (2013).
Cost/performance evaluation for cloud applications
using simulation. In 2013 Workshops on Enabling
Technologies: Infrastructure for Collaborative
Enterprises, pages 152–157.
Rathnayake, S., Loghin, D., and Teo, Y. M. (2017).
CELIA: Cost-time performance of elastic applications
on cloud. In 2017 46th International Conference on
Parallel Processing (ICPP), pages 342–351.
Stratton, J. A., Rodrigues, C., Sung, I.-J., Obeid, N., Chang,
L.-W., Anssari, N., Liu, G. D., and Hwu, W.-m. W.
(2012). Parboil: A revised benchmark suite for sci-
entific and commercial throughput computing. Cen-
ter for Reliable and High-Performance Computing,
127:27.
Subramanian, L., Seshadri, V., Kim, Y., Jaiyen, B., and
Mutlu, O. (2013). MISE: Providing performance
predictability and improving fairness in shared main
memory systems. In IEEE HPCA, pages 639–650.
Suleman, M. A., Qureshi, M. K., and Patt, Y. N.
(2008). Feedback-driven Threading: Power-efficient
and High-performance Execution of Multi-threaded
Workloads on CMPs. SIGARCH Computer Architec-
ture News, 36(1):277–286.
Wan, B., Dang, J., Li, Z., Gong, H., Zhang, F., and Oh, S.
(2020). Modeling analysis and cost-performance ratio
optimization of virtual machine scheduling in cloud
computing. IEEE Transactions on Parallel and Dis-
tributed Systems, 31(7):1518–1532.
Zhang, L., Zhou, L., and Salah, A. (2020). Efficient sci-
entific workflow scheduling for deadline-constrained
parallel tasks in cloud computing environments. In-
formation Sciences, 531:31–46.