Network Failures in Cloud Management Platforms: A Study on OpenStack
Hassan Mahmood Khan, Frederico Cerveira, Tiago Cruz and Henrique Madeira
University of Coimbra, CISUC, DEI, Portugal
Keywords:
Cloud Computing, Network Failure, OpenStack, Cloud Management Platform, Fault Injection, Dependability.
Abstract:
Cloud Management Platforms (CMPs) have a critical role in supporting private and public cloud computing as
a tool to manage, provision and track resources and their usage. These platforms, like cloud computing, tend
to be complex distributed systems spread across multiple nodes; thus, network faults are a threat that can lead to
failures in providing the expected service. This paper studies how network faults occurring in the links between
the nodes of the CMP can propagate and affect the applications that are hosted on the virtual machines (VMs).
We used fault injection to emulate various types of network faults in two links of the OpenStack CMP while a
common cloud computing workload was being executed. The results show that not all network links have the
same importance and that network faults can propagate and cause the performance of applications to degrade
by up to 50% and a small percentage of their operations to fail. Furthermore, in many campaigns some of the
responses returned by the applications did not match the expected values.
1 INTRODUCTION
With the development of computer technology, the
range of applications and their complexity has ex-
panded, along with a fast increase in the usage of
computing resources. To sustain this trend, there is
a need for a large and diversified pool of computing
resources that can satisfy the clients’ requirements.
Cloud computing offers shared, scalable, ubiquitous,
reconfigurable, and simple on-demand access to com-
puting resources (such as networks, storage, process-
ing units, applications, and services) via configurable
Internet services that can be quickly deployed and
released with reduced management overhead. Due
to some characteristics of cloud computing, such as
scalability, agility and cost-effectiveness, many busi-
nesses tend to place their application services on the
cloud.
With the ongoing increase of application require-
ments and considerable progress in cloud computing
system research, many researchers are focusing on
cloud management platforms and their dependability
and fault tolerance. Cloud management platforms,
such as OpenStack (OpenStack, 2022), are collections of software modules and tools that provide a framework to create and manage both public and private clouds.
Dependability is “the ability to deliver service that can justifiably be trusted” (Avižienis et al., 2004).
Cloud computing system dependability is crucial for
cloud service providers, brokers, carriers, and users
worldwide. Fault injection is an important method
that can accelerate the occurrence of faults in a con-
trolled manner in a system. It aids in understand-
ing how the system behaves when stressed in unusual
ways, hence helping make it more fault tolerant.
To establish cloud computing as a trustable plat-
form for cloud stakeholders, it must achieve levels
of dependability that are comparable to the depend-
ability offered by classical dedicated infrastructures.
In other words, clients will shy away from migrating their applications to the cloud, particularly business-critical ones, if moving significantly worsens their dependability.
This research paper addresses the impact that
moving an existing workload from a dedicated in-
frastructure to a CMP-based cloud infrastructure can
have on the dependability of the applications. We fo-
cus on network faults that can occur in the network
links between the nodes of the CMP and how a fault
in these links can propagate up to the hosted applica-
tion and affect the timeliness and correctness of the
service provided. To emulate network faults, fault in-
jection is used with the help of Sidekick, a network
traffic shaper that supports various fault models, such
as packet loss, bandwidth congestion or latency.
The results show that it is possible that hosted ap-
plications, running on the VMs that have been provi-
sioned using the CMP, are affected by network faults
in the links that connect the nodes of the CMP. The ex-
perienced failure modes include performance degra-
dation (i.e., the service takes longer to fulfil the re-
quests), failed operations (e.g., due to a timeout) and
even operations that return invalid results. However, not every network link has the same impact: of the two considered scenarios (i.e., network links where faults were injected), only one affected the service provided by the hosted applications.
The remaining parts of the paper are structured in
the following manner: The background of cloud man-
agement platforms, their dependability, and network
failure are covered in Section 2. The experimental
configuration is discussed in Section 3. The findings
collected from the experiment are discussed in Sec-
tion 4. Threats to the validity of this study are presented in Section 5, and conclusions and future work are described in Section 6.
2 BACKGROUND
Cloud computing can be regarded as a distributed system made up of an ensemble of networked, virtualized computers that are dynamically provisioned and presented as unified computing resources in accordance with service-level agreements between the service provider and the consumer (Kumari and Kaur, 2021). Cloud computing makes it pos-
sible to access, configure, and manipulate a shared
pool of reconfigurable computing resources, such as networks, servers, storage, applications, and
services, in a manner that is pervasive, convenient,
and on-demand.
The systematic monitoring, control, administra-
tion, and maintenance of cloud computing infras-
tructure, services, and resources is referred to as
cloud management. A cloud management platform,
or CMP, is a collection of software used to man-
age cloud environments (Cocozza et al., 2015). The
primary objective of the CMP is to make it possi-
ble to improve resource management and monitor-
ing of cloud resources (Lu et al., 2020). Some ex-
amples of CMPs currently available include Abiquo,
CloudStack, Eucalyptus, Nimbus, openQRM, Open-
Stack (OpenStack, 2022), Open Nebula, Apache Vir-
tual Computing Lab (VCL), and HP’s CloudSystem
Matrix. OpenStack (OpenStack, 2022) is one of the most prominent CMPs; it uses pooled virtual resources to build and manage clouds.
Several methodologies have been utilized to as-
sess the dependability of cloud computing systems,
including analytic, state-space, statistical, Petri net and simulation methods (Dantas et al., 2012), as well as fault in-
jection. Fault injection emulates faults in a target sys-
tem to produce failures similar to real-world failures
but at a faster pace (Natella et al., 2016).
Recent research has focused on dependability
of cloud computing systems. Ju et al. (Ju et al.,
2013) examined OpenStack’s resilience by introduc-
ing failures (by terminating VMs or service pro-
cesses), network partitions (by prohibiting connection
between two subnets), and network traffic delay and
packet losses (by disrupting REST service requests).
(Cerveira et al., 2015) introduced CPU and memory bit flips to evaluate hypervisor and VM isolation when affected by transient hardware faults. (Pham et al., 2016) employed fault injection against OpenStack and analyzed the resulting failures. A study
by (Cotroneo et al., 2019) employed fault injection
and failure analysis to explore the consequences of
failures in the widely used OpenStack CMP. Their
results show that software bugs can propagate inside
the components of the CMP, thus suggesting the us-
age of more advanced testing and fault tolerance tech-
niques (Cotroneo et al., 2022).
There are various benchmarks for evaluating
a cloud platform from numerous perspectives.
Among these, the Yahoo Cloud Serving Benchmark
(YCSB) (Cooper et al., 2010) is a well-known key-
value data storage benchmark, which supports the ma-
jority of key-value store databases. The purpose of YCSB is to provide a framework and a collection of common workloads for measuring the performance of various key-value and cloud serving stores.
Apache Cassandra (Cassandra, 2022) is an open-
source, distributed NoSQL database that uses wide-
column partitioning. Facebook created Apache Cas-
sandra by combining Google’s Bigtable data and
storage engine concepts with Amazon’s Dynamo distributed storage and replication techniques.
3 EXPERIMENTAL SETUP
To evaluate the impact of network failures on the
OpenStack CMP and the applications hosted on it,
fault injection of network faults was performed. This
section describes the experimental setup, including
how OpenStack was configured, the used workload,
the type of network faults emulated, the fault injec-
tion process and the flow of the experiments.
3.1 Environment Configuration
We have opted to conduct our experimental evalu-
ation using the most commonly used CMP, Open-
Stack (more specifically, we used OpenStack Xena).
Figure 1 shows the experimental design comprising
OpenStack, workload and benchmark applications on
the hosted VMs and network traffic shaper. Open-
Stack was configured over three nodes, which are:
Controller node: runs the identity service, image service, web dashboard, networking agents and the management portions of the compute and networking services. It also hosts supporting services such as the SQL database, message queue and Network Time Protocol.
Compute node: It runs the hypervisor that sup-
ports the instances (i.e. virtual machines). In our
setup, the used hypervisor was KVM 4.2.1. It also
runs a networking service agent that connects in-
stances to virtual networks and provides firewalling
services to instances via security groups.
Storage node: This node contains the disks that
are provisioned for the instances.
Figure 1: Experimental setup, CMP, Workload VMs, and
network traffic shaper.
Our OpenStack network configuration consists of
two networks. The first is the management network,
which provides external access (to the Internet or to
a private network) to all nodes for administrative pur-
poses (e.g., package installation or security updates).
The other network is the data network, which directly
connects the Controller, Compute and Storage nodes.
Due to their different roles, it is good practice to keep these two networks separate from each other. In our experiments, we focus on the data network, because it is the network that supports the operation of the CMP, whereas the management network has an auxiliary role.
For the experimental evaluation, we derived two
scenarios to evaluate how faults in different net-
work links may affect the system differently:
C1: fault injection on the data network connection
between compute node and storage node;
C2: fault injection on the data network connection
between the controller node and storage node.
3.2 Network Traffic Shaper
To perform network fault injection, a tool was needed that could emulate various kinds of network faults in the least intrusive and most reproducible manner possible. For that purpose, the network
traffic shaper tool named Sidekick was used to inject
communications-related disturbances in a controlled
way. Although in this paper Sidekick is used to eval-
uate the effect of network faults in a system, it can
also be used to evaluate the performance and robust-
ness of communications protocols, APIs or services.
The fundamental operation model for this tool is
presented in Figure 2.
Figure 2: Sidekick operational model.
A Network Emulator bridge based on a Linux vir-
tual appliance provides the means to transparently
constrain/disturb the traffic between communicating
peers on different network segments. This appliance
is configured with three network interfaces: two for
the transparent bridge and one for out-of-band exper-
iment control. Network traffic shaping capabilities
are provided by the native Linux TC and Netem sub-
systems to constrain bandwidth or inject disturbances
such as packet losses, jitter, or latency.
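The paper does not detail Sidekick's internals, but the kind of TC/Netem commands that such a bridge-based shaper drives can be illustrated with the following minimal Python sketch. The interface name, helper functions and parameter values are illustrative assumptions, not part of Sidekick.

```python
import subprocess

IFACE = "eth1"  # assumption: one of the bridge's two transparent interfaces

def clear(iface: str = IFACE) -> None:
    """Remove any existing queueing discipline, restoring nominal conditions."""
    subprocess.run(["tc", "qdisc", "del", "dev", iface, "root"], check=False)

def inject_latency(delay_ms: int, iface: str = IFACE) -> None:
    """Delay every packet leaving the interface by a fixed amount."""
    subprocess.run(["tc", "qdisc", "add", "dev", iface, "root",
                    "netem", "delay", f"{delay_ms}ms"], check=True)

def inject_packet_loss(loss_pct: int, iface: str = IFACE) -> None:
    """Randomly drop a percentage of packets leaving the interface."""
    subprocess.run(["tc", "qdisc", "add", "dev", iface, "root",
                    "netem", "loss", f"{loss_pct}%"], check=True)

def inject_congestion(rate_mbit: float, iface: str = IFACE) -> None:
    """Throttle the available bandwidth with a token bucket filter."""
    subprocess.run(["tc", "qdisc", "add", "dev", iface, "root", "tbf",
                    "rate", f"{rate_mbit}mbit",
                    "burst", "32kbit", "latency", "400ms"], check=True)

if __name__ == "__main__":
    clear()
    inject_packet_loss(25)  # e.g., the "Low" packet loss intensity of Table 1
```

Sidekick wraps primitives of this kind with scheduling and remote control, as described next.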
To integrate these capabilities within a fault injec-
tion framework, the Sidekick agent provides remote
control of the traffic shaper for integrated experiment
management. This agent provides the capabilities of
remote scheduling for time-triggered activation, con-
figuration through an easy-to-use JSON profile, oper-
ation in foreground or background mode, and trigger
precision around 0.02s, on average (for overall system
load <20%).
Sidekick allows for experiments to be scheduled,
by supplying an ID for a group of affected IP ad-
dresses (obtained from a configuration file), the start-
ing moment, encoded in UNIX timestamp/epoch for-
mat (UTC) and the test duration (in seconds). For
instance, to activate a scheduled test, an MQTT plain-
text message has to be sent to the platform MQTT
server, to the topic “<session name>/scheduler”,
with the format “groupID:start:duration”.
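As an illustration of this scheduling mechanism, the following sketch publishes such a message with the paho-mqtt library; the broker address, session name and group ID are hypothetical placeholders, not values from the paper.

```python
import time

from paho.mqtt import publish  # assumption: paho-mqtt is installed

BROKER = "192.0.2.10"          # hypothetical address of the platform MQTT server
SESSION = "closer-session"     # hypothetical <session name>

def schedule_test(group_id: str, start_epoch: int, duration_s: int) -> None:
    """Publish a 'groupID:start:duration' message to <session name>/scheduler."""
    payload = f"{group_id}:{start_epoch}:{duration_s}"
    publish.single(f"{SESSION}/scheduler", payload, hostname=BROKER)

# Example: ask group "G1" to start 60 s from now and run for 45 s.
schedule_test("G1", int(time.time()) + 60, 45)
```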
Experiments can be configured in two ways: dynamic configuration, in which new configuration modes are pushed via MQTT messages (the PUSH command allows new test profiles to be pushed to a specific group over the wire); and static configuration, which uses a JSON file that establishes the nominal and faulty conditions for groups of IP addresses. The Sidekick command set provides
further resources to activate test modes, force clock
synchronization and control logging and reporting,
among other operations.
3.3 Workload
The YCSB (Cooper et al., 2010) key-value data stor-
age benchmark is widely used and represents a com-
mon use case found in cloud computing (key-value
stores). For our experiments we paired YCSB with
the open source, distributed NoSQL database, Apache
Cassandra (Cassandra, 2022). Cassandra was in-
stalled in one VM and three other VMs were used to
run the YCSB clients that send operation requests to
Cassandra. The four VMs used for the workload aim
to replicate AWS t2.small (VCPUS 1, Memory 2GB,
Storage 10 GB) (AWS, 2022). We used the work-
loada of the benchmark, that consists on 50% propor-
tion of read-operations and 50% proportion of update-
operations. We configured record count as 1000 and
operation count as 5000, based on the performance
and hardware resources of our setup. Finally, we activated the YCSB option that checks the data integrity of read operations, thus verifying the correctness of the returned results.
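For reference, a YCSB run with these settings could be launched roughly as sketched below; the binding name, install path and Cassandra address are assumptions that may differ between YCSB versions and deployments (the load phase, which populates the database, is invoked analogously with "load" instead of "run").

```python
import subprocess

YCSB_HOME = "/opt/ycsb"        # hypothetical YCSB install location
CASSANDRA_HOST = "10.0.0.21"   # hypothetical IP of the Cassandra VM

cmd = [
    f"{YCSB_HOME}/bin/ycsb", "run", "cassandra-cql",  # binding name may vary
    "-P", f"{YCSB_HOME}/workloads/workloada",         # 50% reads / 50% updates
    "-p", "recordcount=1000",
    "-p", "operationcount=5000",
    "-p", "dataintegrity=true",                       # verify values returned by reads
    "-p", f"hosts={CASSANDRA_HOST}",
]
subprocess.run(cmd, check=True)
```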
The use of a workload at the VM level, as opposed to a workload that exercises the management operations of OpenStack, is justified because we want to focus on studying the impact on the applications that are hosted in the VMs, and because even a
workload running on the VMs will cause OpenStack
to be exercised (e.g., there will be network traffic be-
tween the Compute and Storage node due to the disk
reads and writes triggered by the workload).
3.4 Fault Model
In our experiments, three different types of network
faults are emulated: packet loss, latency and network
congestion. These three types of faults were chosen
because previous work has shown them to be repre-
sentative of network faults that occur in the wild (Qi
et al., 2021) (Cotroneo et al., 2022).
For the fault injection experiment campaign, we
have grouped the network faults according to their
intensity into three levels (Low, Medium and High).
Furthermore, we also vary the duration that the net-
work fault remains active by three levels. In other
words, we assume that a network fault will be fixed
and normal operation will resume after some time.
Table 1 shows the values used for fault intensity and
fault duration in our fault injection campaigns.
Table 1: Network fault types and configuration.

Fault Injection Intensity | Network Congestion | Packet Loss | Latency | FI Exec Duration
Low                       | 250 Mbps           | 25%         | 0.5 s   | 30 s
Medium                    | 100 Mbps           | 50%         | 1.0 s   | 45 s
High                      | 0.5 Mbps           | 75%         | 3.0 s   | 60 s
The fault injection campaigns were designed to
evaluate the impact of each network fault type across
different durations. In each experiment, a certain fault
type is picked and then its intensity and duration are
varied. In total, we executed 27 experiment configu-
rations (100 runs per each configuration) where fault
type and duration varied. Over the 27 configurations,
we amassed a total of 5400 experiment runs (27 con-
figurations x 2 scenarios x 100 runs).
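To make the size of the campaign concrete, the following sketch enumerates the configurations using the values of Table 1; the data-structure layout is purely illustrative.

```python
from itertools import product

# Intensity values per fault type, taken from Table 1.
FAULT_TYPES = {
    "network_congestion": {"Low": "250 Mbps", "Medium": "100 Mbps", "High": "0.5 Mbps"},
    "packet_loss":        {"Low": "25%",      "Medium": "50%",      "High": "75%"},
    "latency":            {"Low": "0.5 s",    "Medium": "1.0 s",    "High": "3.0 s"},
}
DURATIONS_S = {"Low": 30, "Medium": 45, "High": 60}
SCENARIOS = ["C1", "C2"]
RUNS_PER_CONFIG = 100

configs = [
    (fault_type, levels[intensity], DURATIONS_S[duration])
    for fault_type, levels in FAULT_TYPES.items()
    for intensity, duration in product(levels, DURATIONS_S)
]
assert len(configs) == 27                               # 3 types x 3 intensities x 3 durations
print(len(configs) * len(SCENARIOS) * RUNS_PER_CONFIG)  # 5400 experiment runs
```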
3.5 Experimentation Flow
Figure 3 presents the flow of an experiment execution
(or experiment run). Before the initialization of the
experiments, the hosted VMs (Cassandra and YCSB
instances) are already provisioned and running. The
flow of an experiment run comprises the initialization
of the workload on the hosted VMs. For each exper-
iment run, we load a fresh copy of the database state
to ensure a clean experiment environment. The work-
load is always executed for at least 10 sec (warm-up
time) before any fault is injected. After this warm-up
period, Sidekick is executed with one specific config-
uration (i.e., fault type, intensity, and duration). Dur-
ing the fault injection experiment, the workload may
become unstable or malfunction. After the fault in-
jection duration, the “Keep Time” interval enables the workload to return to a normal state.
Thus, the experiment flow can be summarized into
the following steps:
1. Load: load the Cassandra database used by the YCSB benchmark in the OpenStack experiment environment.
2. Run: at time T, launch the YCSB workload against Cassandra and wait until it exhibits normal behaviour (10 sec warm-up).
Figure 3: Flow of an experiment run.
3. Fault Injection: fault injection begins at T + 10 s plus a randomization delay of 1 to 20 s, injecting a specific fault type for a duration of 30 to 60 s in each iteration.
4. Keep Time (back to normal after a failure, if any): once the fault injection stops, the system is given time to return to normal behaviour (a keep time of roughly 30 to 80 seconds, depending on the randomization delay and fault injection duration).
Figure 3 shows that the warm-up time is constant at 10 seconds, while variations in the randomization time and fault injection duration change the keep time. For example, the keep time reaches its maximum of 80 seconds when the fault injection initiation randomization time is at its 1-second minimum and the fault injection duration is 30 seconds.
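The per-run flow described above can be summarized in the following hedged sketch; the helper functions are hypothetical placeholders for the actual testbed tooling (database snapshot restore, YCSB launcher, Sidekick scheduling and log collection).

```python
import random
import time

# Placeholders standing in for the real testbed tooling; stubbed so the timing
# of a single experiment run can be read top to bottom.
def load_database() -> None: ...
def start_workload() -> None: ...
def schedule_fault(fault_type: str, intensity: str, duration_s: int) -> None: ...
def collect_results() -> None: ...

WARM_UP_S = 10

def run_experiment(fault_type: str, intensity: str, duration_s: int) -> None:
    load_database()                     # fresh copy of the database state
    start_workload()                    # launch the YCSB clients at time T
    time.sleep(WARM_UP_S)               # constant 10 s warm-up
    time.sleep(random.randint(1, 20))   # randomized fault initiation delay
    schedule_fault(fault_type, intensity, duration_s)   # e.g., via Sidekick
    time.sleep(duration_s)              # fault stays active for 30-60 s
    time.sleep(60)                      # keep time: let the workload recover (30-80 s in the paper)
    collect_results()                   # gather throughput and failure logs

run_experiment("packet_loss", "Low", 30)
```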
4 RESULTS AND ANALYSIS
This section presents and analyses the results obtained
from the fault injection campaigns in both the C1
(network fault in the link between compute and stor-
age node) and C2 (network fault in the link between
controller and storage node) scenarios. The follow-
ing results depict the impact of fault injection from
three perspectives. Firstly, we wanted to understand how a network fault in the CMP would affect the performance of the applications running in the hosted VMs. Secondly, we studied whether network faults
in the network links of OpenStack can cause the ap-
plications in the hosted VMs to suffer service fail-
ures. Finally, we evaluated whether network faults in
the CMP could cause unsuccessful operations, unan-
swered requests or data corruption, thus affecting the
availability and correctness of the provided service.
4.1 Performance Impact
In scenario C1, the performance impact is evaluated for the experiments that target the network link between the compute node and the storage node within the CMP while the hosted VMs are executing the
workload. We can observe in Figure 4 that as the
duration of the network fault increases, and, most im-
portantly, as the intensity of the fault increases (in this
case, network congestion), the throughput is reduced.
In the low intensity and duration configuration (Low
NC, Low ET), the throughput is virtually identical to that observed when no network fault is injected, whereas in the highest intensity and duration configuration (High NC, High ET) the throughput is reduced to about half.
Figure 4: Throughput C1: Network Congestion (NC).
In Figure 5, which refers to the packet loss fault type, we see a similar trend of decreasing throughput as the intensity and the duration increase. However, whereas for network congestion faults the predominant factor was the fault intensity, here the fault duration appears to have a stronger influence (e.g., there is a noticeable drop in throughput when moving from a medium duration to a high duration while keeping the same intensity). The results for low packet loss show an un-
expected pattern where throughput increases as fault
duration increases. However, this behaviour is repro-
ducible and was experienced again in a second round
of confirmatory experiments.
Figure 5: Throughput C1: Packet Loss (PL).
Figure 6 depicts the results for the network fault
type that injects latency into the network. It shows
that as the fault intensity increases, the deviation of
the throughput greatly increases. At the same time,
the mean throughput also decreases, but at a slower
rate. The fault duration appears to have a smaller ef-
fect on the throughput than the intensity. This sug-
gests that the network link studied in the scenario C2
has very little to no importance regarding the impact
on the performance of the hosted VMs. Thus we can
conclude that this network link does not require spe-
cial redundancy or fault tolerance mechanisms. On
the other hand, the network link studied in the sce-
CLOSER 2023 - 13th International Conference on Cloud Computing and Services Science
232
nario C1 has a noticeable impact and may need to re-
ceive specific fault tolerance mechanisms.
Figure 6: Throughput C1: Latency (Lat).
For our second scenario, C2, we injected faults in the link between the controller node and the storage node. Figure 7 shows no impact of fault injection on this network link. This suggests that the network link studied in scenario C2 has very little to no importance regarding the impact on the performance of the hosted VMs, and thus does not require special redundancy or fault tolerance mechanisms. On the other hand, the network link studied in scenario C1 has a noticeable impact and may need specific fault tolerance mechanisms.
Figure 7: Throughput C2: Network Congestion (NC),
Packet Loss (PL), and Latency (Lat).
4.2 Workload Operations Failures
After studying the impact on the performance of the
applications, we focused on whether the provided ser-
vice is being correctly performed. It is possible for
the workload to show no performance effect, while
producing invalid responses or failing to perform op-
erations. In this subsection the focus is on whether
the read and update operations of the YCSB workload
were successfully completed or not.
For scenario C1, as shown in Figure 8, which
refers to the failed read-operations, network conges-
tion was the fault type with the least impact. Of
the nine combinations where network congestion was
used, the mean number of failed operations was 1.
Although in absolute terms a single failed read op-
eration may seem inconsequential, the important ob-
servation to retain is that there were failed operations.
Ideally we would expect to see that a network fault in
the CMP does not lead to any failed operation of the
applications hosted over it. Since this is not what the
results show, it means that migrating an application
(e.g., a key-value store) from dedicated infrastructure
to cloud computing will bring a decrease in its de-
pendability.
The figure also shows that packet loss was the fault type with the highest impact, followed by latency.
Figure 8: Read-operation Failed C1: All.
Figure 9 shows the number of update-operations
that failed to successfully execute, for scenario C1.
Figure 9: Update-operation Failed C1: All.
The results show that as the intensity and duration of the fault increase, the number of failed operations also increases noticeably. Some combinations
showed no impact in the number of failed update-
operations. For example, a low network conges-
tion and low duration never caused any failed oper-
ation. However, the other two types of network faults
(Latency and Packet Loss) caused failed operations
even when the fault intensity was low. The abso-
lute amount of failed operations varied. For example,
when we have Low PL and Low ET, there is a mean of
10 failed operations. The mean number of failed oper-
ations reached as high as 25 (e.g., High PL and High
ET). As a percentage of all operations, these failures represent a very small fraction. However, once again, the important observation is that these failures occur at all.
For scenario C2, we did not observe any failed read or update operations.
4.3 Operations Correctness Check
Other than verifying if the operations were completed
successfully, we also verified if the operations re-
turned the correct value. The verification of the cor-
rectness of operations is evaluated through the in-
tegrity check embedded in YCSB. This check verifies
whether a read operation returns the expected result.
For scenario C1, the results in Figure 10 show that
fault injection impacted the correct completion of the
operations. In a correct execution, on average about
2500 read operations would be performed. Thus a
value of Integrity Verify Operations lower than 2500
means that there were incorrect responses being re-
turned.
In some cases, 80% to 90% of read operations failed to pass the correctness test. For example, for Low NC, Low ET, the mean number of operations passing the integrity check is slightly below 400 (i.e., roughly 84% failed), although in some runs it exceeded 600 (roughly 76% failed).
Figure 10: Integrity Verify Operations C1: Network Con-
gestion.
Figure 11 depicts the number of operations that successfully passed the integrity verification when packet loss was injected. Here we can observe the highest amount of failed checks, which occurred with high packet loss and high duration (High PL, High ET) and led to only about 350 correct operations (roughly 86% failed).
Figure 11: Integrity Verify Operations C1: Packet Loss.
Figure 12 depicts the same results, but with respect to latency network faults. The most surprising result refers
to low latency with both low and medium duration,
which saw unusually high numbers of failed checks,
whereas the remaining combinations caused none to
only a few failed checks. This pattern was verified
by repeating the same experiments twice and a root
cause analysis will be performed as future work.
Figure 12: Integrity Verify Operations C1: Latency.
Once more, network faults injected in scenario C2 showed no impact. This demonstrates that the network link studied in C2 has no effect on the behaviour of the hosted applications, and shows that different network links of the CMP have different importance.
In summary, network faults in the CMP, especially those that cause congestion or packet loss in the network, can cause incorrect results to be returned, usually at a very high percentage. The likely explanation for this high percentage is that once one operation in the workload fails, the following operations may also be affected.
5 THREATS TO VALIDITY
The main threat to the validity of our results de-
rives from the choices taken when building the ex-
perimental setup and the representativeness of the
used fault models and their parameters. To mitigate
the threats with respect to the experimental setup,
we followed the OpenStack implementation guide-
lines (OpenStack, 2022) and fulfilled the minimum
hardware and software requirements. The experiment
setup is relatively simple. Real-world deployments
are likely to be more complex and require more com-
puting power.
We opted to use the YCSB benchmark paired with
Cassandra as our workload. This workload represents
a common cloud use case (key-value stores), however
workloads of different types and areas need also to be
evaluated, as the results may differ.
There is a need to balance experiment duration and accuracy. For that reason, the durations chosen for the faults can be considered relatively short; network failures in the real world can often last many hours, but it would have been impractical to emulate such long failures. Nevertheless, we consider
this not to be a significant problem because the im-
pact of longer failures is going to be more pronounced
than that of shorter failures.
6 CONCLUSIONS
Network faults are an unavoidable reality in any large-scale complex distributed system, as is often the case
for the infrastructure that supports cloud computing.
In this paper, an experimental evaluation of the impact that network faults can have on a cloud computing system was performed. We focused on the CMP, more specifically on OpenStack, due to its popularity
and due to being a complex distributed system where
the network plays an important role. Fault injection of three different types of common network faults was performed with the help of Sidekick.
The results show that different network links have
different importance in the impact experienced by the
applications hosted on the infrastructure. The results
show that network faults affecting the link between
the compute node and the storage node can cause
applications running on the infrastructure to fail to
provide correct service, even if the network faults
only lead to increased latency or reduced bandwidth.
These results serve as the basis for future work on the
development of fault tolerance mechanisms for CMPs that increase their tolerance of network faults while carrying minimal cost and overhead. Furthermore, as fu-
ture work, we will carry out more experiments featur-
ing more complex setups and setups where autoscal-
ing is present, in order to evaluate how network faults affect these setups.
ACKNOWLEDGEMENTS
This work is funded by the FCT - Foundation for
Science and Technology, I.P./MCTES through na-
tional funds (PIDDAC), within the scope of CISUC
R&D Unit - UIDB/00326/2020 or project code
UIDP/00326/2020. This work is also supported by
Project Reference ECSEL/0017/2019 and 876852-
ECSEL-RIA-VALU3S, financed by Fundação para a Ciência e a Tecnologia, I.P./MCTES through national
funds (PIDDAC) and funding from the ECSEL Joint
Undertaking (JU) under grant agreement No 876852.
The JU receives support from the European Union’s
Horizon 2020 research and innovation programme
and Sweden, Italy, Spain, Portugal, Czech Republic,
Germany, Austria, Ireland, France and Turkey.
REFERENCES
Avižienis, A., Laprie, J.-C., and Randell, B. (2004). De-
pendability and its threats: a taxonomy. In Building
the Information Society, pages 91–120. Springer.
AWS (2022). Amazon EC2 instance types, https://aws.amazon.com/ec2/instance-types/, access date: 2022-09-30.
Cassandra (2022). Apache Cassandra, https://cassandra.apache.org, access date: 2022-09-30.
Cerveira, F., Barbosa, R., Madeira, H., and Araujo, F.
(2015). Recovery for virtualized environments. In
2015 11th European Dependable Computing Confer-
ence (EDCC), pages 25–36. IEEE.
Cocozza, F., López, G., Marín, G., Villalón, R., and Arroyo,
F. (2015). Cloud management platform selection: A
case study in a university setting. Cloud Computing,
2015:92.
Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R.,
and Sears, R. (2010). Benchmarking cloud serving
systems with ycsb. In Proceedings of the 1st ACM
symposium on Cloud computing, pages 143–154.
Cotroneo, D., De Simone, L., Liguori, P., Natella, R., and
Bidokhti, N. (2019). Enhancing failure propagation
analysis in cloud computing systems. In 2019 IEEE
30th International Symposium on Software Reliability
Engineering (ISSRE), pages 139–150. IEEE.
Cotroneo, D., De Simone, L., and Natella, R. (2022).
Thorfi: a novel approach for network fault injection
as a service. Journal of Network and Computer Appli-
cations, 201:103334.
Dantas, J., Matos, R., Araujo, J., and Maciel, P. (2012).
Models for dependability analysis of cloud comput-
ing architectures for eucalyptus platform. Interna-
tional Transactions on Systems Science and Applica-
tions, 8(5):13–25.
Ju, X., Soares, L., Shin, K. G., Ryu, K. D., and Da Silva, D.
(2013). On fault resilience of openstack. In Proceed-
ings of the 4th annual Symposium on Cloud Comput-
ing, pages 1–16.
Kumari, P. and Kaur, P. (2021). A survey of fault tolerance
in cloud computing. Journal of King Saud University-
Computer and Information Sciences, 33(10):1159–
1176.
Lu, Y., Cheng, H., Ma, Y., and Wu, S. (2020). Research
on the technology of power unified cloud management
platform. In 2020 IEEE 9th Joint International Infor-
mation Technology and Artificial Intelligence Confer-
ence (ITAIC), volume 9, pages 770–773.
Natella, R., Cotroneo, D., and Madeira, H. S. (2016). As-
sessing dependability with software fault injection: A
survey. ACM Computing Surveys (CSUR), 48(3):1–55.
OpenStack (2022). OpenStack - open source cloud computing platform software.
Pham, C., Wang, L., Tak, B. C., Baset, S., Tang, C., Kalbar-
czyk, Z., and Iyer, R. K. (2016). Failure diagnosis
for distributed systems using targeted fault injection.
IEEE Transactions on Parallel and Distributed Sys-
tems, 28(2):503–516.
Qi, Y., Fang, C., Liu, H., Kang, D., Lyu, B., Cheng, P.,
and Chen, J. (2021). A survey of cloud network fault
diagnostic systems and tools. Frontiers of Information
Technology and Electronic Engineering, 22(8):1031–
1045.