Quality Metrics for Reinforcement Learning for Edge Cloud and
Internet-of-Things Systems
Claus Pahl and Hamid R. Barzegar
Faculty of Engineering, Free University of Bozen-Bolzano, Bolzano, Italy
Keywords:
Reinforcement Learning, Machine Learning, Quality Management, Metrics, Controller, Edge Computing,
Internet-of-Things.
Abstract:
Computation at the edge or within the Internet-of-Things (IoT) requires the use of controllers to make the
management of resources in this setting self-adaptive. Controllers are software that observe a system, analyse
its quality and recommend and enact decisions to maintain or improve quality. Today, often reinforcement
learning (RL) that operates on a notion of reward is used to construct these controllers. Here, we investigate
quality metrics and quality management processes for RL-constructed controllers for edge and IoT settings.
We introduce RL and control principles and define a quality-oriented controller reference architecture. This
forms the basis for the central contribution: a quality analysis metrics framework, embedded into a quality
management process.
1 INTRODUCTION
Software development can employ Machine Learning
if a sufficient amount of data is available to generate
the ML models that form the core functions of these
software systems (Wan et al., 2021; Pahl, 2023). This
is particularly true for a self-adaptive system that is
built around a controller. A controller generates ac-
tions to maintain such a system within expected quality ranges, based on monitored system data as the
input. Self-adaptive systems are widely used in envi-
ronments where a manual adjustment is neither feasi-
ble nor reliable. For instance, edge and IoT settings
are suited to be governed by a control-theoretic solu-
tion to continuously and automatically adjust the sys-
tem.
The quality concerns of reinforcement learning have been investigated widely for specific aspects (Al-Nima et al., 2021; Buşoniu et al., 2018; Xu et al., 2021). Performance and robustness are the most frequently addressed concerns. Our objective here is
to conduct a wider review of quality metrics beyond
these two, also including fairness, sustainability and
explainability, which are common concerns for ma-
chine learning (ML) in general, but need a specific
investigation for RL. We present a catalog of clas-
sified metrics as the main contribution. In order to
frame this metrics catalog, we introduce a reference
architecture (Pahl et al., 2022; Pahl et al., 2019) for
edge and IoT controllers with a quality management
framework (Pahl and Azimi, 2021). We also embed
this into a continuous, DevOps-style change process that allows continuous quality monitoring to remediate quality deficiencies.
2 ML-Based CONTROLLERS
ML in general is used to generate a range of appli-
cations (Mendonca et al., 2021), such as: predictors
where ML is used to predict or forecast events based
on historic data, classifiers where ML serves to cate-
gorise or classify input data based on some pattern, or
adaptors where ML is used to create controllers for
self-adaptive systems. Our concern here is the latter
category of adaptors. However, because software is constructed with ML and, more recently, also with other AI technologies such as large language models like GPT, expert software engineers no longer have direct, full control over it, and quality therefore needs to be controlled in a different way. This requires, for instance, explain-
ability of the ML models to understand quality impli-
cations.
Machine learning models are normally evaluated for their effectiveness, generally in terms of met-
rics such as accuracy, precision and recall. Two
requirements emerge for RL-constructed controllers
and their quality:
Firstly, in order to better judge the quality of controllers, other concerns such as explainability, but also fairness or sustainability, are important.
Secondly, the usual performance measures accuracy, precision and recall do not naturally apply to RL, requiring a different notion of performance and also the need to take uncertainty and disturbances in the environment into account.
What is needed is an engineering approach for ML-
constructed software, which is often called AI Engi-
neering (Lwakatare et al., 2019). Here, often explain-
ability is the focus that aims at better understanding
and transparency of how ML models work. Important
for understanding is a link between input and output
data and the quality of the controller.
Our concrete setting is controllers for edge cloud
and/or IoT resource management (Femminella and
Reali, 2019; Hong and Varghese, 2019; Javed et al.,
2020; Tokunaga et al., 2016; Zhao et al., 2019). Sys-
tem adaptation is required for the resource configura-
tion, including monitoring of resource utilization and
respective application performance as well as appli-
cation of ML-generated rules for resource manage-
ment to meet quality requirements. The rules adjust
the resource configuration (e.g., size) to improve per-
formance and other qualities. The chosen ML tech-
nique for this is RL, which employs a reward prin-
ciple applicable in the self-adaptation loop to reward
improvements and penalise deterioration. In a con-
crete use case, the problem is a resource controller for
edge adaptation that follows a formal/semantic model
(Fang et al., 2016), working with the following rule:
if Workload > 80% then Double(resource-size). The
problem is whether this rule is optimal and whether
the recommendation space considered by the con-
troller is complete. The solution could be an RL con-
troller that provides a recommendation for scaling.
The model could reward a good utilization rate (e.g.,
60–80%) and penalise costly resource consumption
(e.g., high costs for cloud resources).
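To illustrate the reward principle for this use case, the following Python sketch rewards a utilization rate within the 60–80% band and penalises resource cost. The function name, the weighting and the out-of-band handling are assumptions for illustration, not the concrete controller implementation.

def scaling_reward(utilization: float, resource_cost: float,
                   target_low: float = 0.60, target_high: float = 0.80,
                   cost_weight: float = 0.5) -> float:
    """Reward utilization within the target band, penalise cloud resource cost."""
    if target_low <= utilization <= target_high:
        reward = 1.0  # good utilization: positive reward
    else:
        # penalise proportionally to the distance from the target band
        distance = min(abs(utilization - target_low), abs(utilization - target_high))
        reward = -distance
    return reward - cost_weight * resource_cost  # costly resources reduce the reward

# Example: 85% utilization on a resource with normalised cost 0.3
print(scaling_reward(0.85, 0.3))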
ML-driven controller generation for automated
adaptation of a system requires proper quality mon-
itoring of a defined set of comprehensive quality cri-
teria. Furthermore, detected quality deficiencies need
to be analysed as part of a root cause analysis (Azimi
and Pahl, 2020). From this, suitable remedies need to
be recommended and enacted.
3 RELATED WORK
We discuss here the RL quality perspective covering
individual metrics but also general frameworks. Edge
and IoT quality concerns have already been covered
in the previous section.
A range of individual quality metrics have been in-
vestigated for reinforcement learning. Reinforcement
learning is a suitable approach to derive solutions for
control systems. The work in (Bu¸soniu et al., 2018)
covers the link between RL performance and the no-
tion of stability that stems from the control area. (Xu
et al., 2021) is a good example of an RL application
for a control problem that requires high degrees of
performance, specifically accuracy. Robustness and
performance are covered in (Al-Nima et al., 2021)
in order to cover recent deep reinforcement learning
trends. Robustness is also investigated in (Glossop
et al., 2022). The ability to deal with disturbances
is often seen as an important property of control sys-
tems that act in environments with a lot of uncertainty.
However, beyond classical performance metrics, other concerns such as explainability or sustainability have recently emerged in the wider ML and AI context. Attention has
been given to these from the perspective of the envi-
ronment and the users and/or subjects of a solution.
Another concrete direction is the fairness of the solu-
tion. (Jabbari et al., 2016) looks at this in the context
of Markov processes, which define the central proba-
bilistic behaviour of control systems. While explain-
ability has now been widely recognised for prediction
and classification approaches, RL has received less at-
tention. One example is (Krajna et al., 2022) that de-
fines explainability for RL. A survey of this aspect is
provided by (Milani et al., 2022). As a wider soci-
etal concern that also has a cost impact for users, sus-
tainability through for example energy and resource
consumption is also investigated for RL (Mou et al.,
2022).
If a set of metrics need to be implemented, i.e.,
need to be monitored, analysed and converted into
recommendations or remedial actions if quality con-
cerns are detected, then a systematic engineering ap-
proach is needed that explains the architecture of the
system in question and devises a process for quality
management. (He et al., 2021) provides an overview
of the AutoML domain, a notion that covers automated approaches to manage ML model cre-
ation and quality management. Another term used
in this context is AI engineering. For instance,
(Lwakatare et al., 2019) approach this from a soft-
ware engineering perspective, aiming to define prin-
ciples that define a systematic engineering approach.
Similarly, (Wan et al., 2021) investigates common en-
gineering practices and how they change in the pres-
ence of ML.
This review demonstrates two insights. Firstly,
relevant quality metrics are performance, robustness,
fairness, explainability and sustainability. Secondly,
a systematic engineering approach with an architecture and a process model is needed to embed these metrics into a coherent framework.
4 REFERENCE ARCHITECTURE
FOR CONTROLLER QUALITY
The quality management of ML models is a challenge
that remains. Ground truth, i.e., the accuracy against
the real world, could mean whether a predictor predicts accurately or whether an adaptor manages to improve the system performance. However, as the above discussion of the state-of-the-art shows, more than the traditional performance and cost improvement needs to be addressed. While discrimination is not a direct issue for edge and IoT, a notion of technical fairness and as-
pects of accountability and explainability need to be
dealt with.
Self-adaptive systems and decision models at the
core of respective controllers are suitable for RL-
based creation due to the availability of data for train-
ing. The objective of the RL model is to enact an
action, e.g., to adapt resources in the edge, divert traf-
fic in IoT settings or to instruct the machines that are
controlled. This implements a dynamic control loop,
governed by the quality goals.
4.1 Reference Architecture
Self-adaptive systems that are governed by a con-
troller implement a feedback loop. Our objective is
to provide a meta-learning layer that monitors a num-
ber of quality metrics, but also validates the metrics
and their respective success criteria in a second loop – see Figure 1. The lower layer is a controller for
self-adaptive edge and IoT systems based on RL. The
upper layer is an intelligent quality monitoring and
analysis framework aligned with the requirements of
an RL-generated controller for self-adaptive systems.
Figure 1 builds on the so-called MAPE-K loop (Mon-
itoring, Analysing, Planning, Executing based on
Knowledge) in two layers. The upper loop is the fo-
cus in this paper, but needs to take on board the lower
layer behaviour. The selected quality metrics are in-
dicated. Performance and fairness directly affect the
system quality. Robustness is a guard against external
uncertainties and influences. Sustainability has an ef-
fect on the environment. Explainability allows under-
standability, e.g., to explain the differences between
two alternative model variants.
Figure 1: System Architecture – a lower edge/IoT controller
loop and an upper RL quality management loop.
4.2 Controller and Quality
Management
An important question concerns the full automation
of the upper loop. While the full automation is not
the primary objective here, this ultimate objective cre-
ates some challenges regarding metrics and their mea-
surability and should thus guide the solution. These
challenges are as follows with respect to the indicated
MAPE-loop: automate testing (M in MAPE), auto-
mate the test evaluation (A in MAPE), recommend an
RL learning configuration adjustment (P in MAPE).
For this work, our objective is to develop a conceptual
quality framework that would allow a full automation
sometime in the future. We aim to demonstrate later
on that an automation is beneficial and feasible.
In order to develop a solution, we follow the
Engineering Research guidelines proposed by the
ACM SIGSOFT Empirical Standards committee as
the methodological framework, published in Version
0.1.0. For the controller design, we use the already
mentioned MAPE-K architecture pattern for system
adaptation. The evaluation of the controller can be
based on testing by checking possible variations and
their effect on the user experience regarding the se-
lected metrics. The upper meta-learning loop is de-
signed to follow the MAPE pattern:
Monitor: we need score functions for ML model
quality, i.e., adaptor quality based on metrics that
are linked to the application system and its aspects
– from data quality in the input (requiring robust-
ness) (De Hoog et al., 2019; Ehrlinger et al., 2019)
to sustainability (requiring to lower cost and envi-
ronment damage) as examples.
Analyse: a root cause analysis for ML model
quality problems is needed, feeding into ex-
plainability aspects through a dependency deter-
mination in order to identify which system factors
improve the targeted quality the most.
Plan/Execute: a recommendation and if possible
also the enactment of these is needed. For example, a rule update for the cloud adaptor could be recommended, with the RL model being recreated.
This could in very concrete terms be a readjust-
ment of training data size or ratio.
This upper loop would implement a meta-learning process: at the upper layer, the controller itself is adapted through a continuous testing and experimentation process. We call this the knowledge learning layer. As said before, the ultimate aim is to automate the model adjustment by "MAPE-ing" the RL model construction, i.e., to carry out a meta-level optimization through relabelling or test size/ratio adjustment as sample techniques.
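To make the intended upper loop more concrete, the following sketch outlines one monitor-analyse-plan-execute iteration in Python. The metric values, thresholds and adjustment rules are placeholders, not the actual implementation.

def monitor():
    # M: collect score values for the five quality metrics (placeholder values)
    return {"performance": 0.72, "robustness": 0.55, "sustainability": 0.80,
            "fairness": 0.95, "explainability": 0.60}

def analyse(metrics, thresholds):
    # A: flag metrics below their target as quality deficiencies
    return [m for m, t in thresholds.items() if metrics.get(m, 1.0) < t]

def plan(deficiencies):
    # P: map deficiencies to learning-configuration adjustments (illustrative rules)
    rules = {"robustness": "retrain with disturbance-injected episodes",
             "performance": "adjust training data size/ratio and learning rate"}
    return [rules[d] for d in deficiencies if d in rules]

def execute(recommendations):
    # E: only printed here; full automation would trigger RL model recreation
    for r in recommendations:
        print("recommended adjustment:", r)

execute(plan(analyse(monitor(), thresholds={"performance": 0.7, "robustness": 0.6})))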
5 RL AND CONTROL
In this section, we introduce the relevant reinforce-
ment learning (RL) background for our quality met-
rics framework for the specific adaptive-systems con-
text for edge and IoT.
Our focus is to apply RL to self-adaptive sys-
tems. Sample RL techniques that are typically used
here are SARSA or Q-learning. RL has the notion
of a value function at the core to assess a given state
and proposed action and enact a policy. This assess-
ment is expressed in terms of a reward. Q-learning
and SARSA are the currently most widely used ap-
proaches that combine policy and value function cal-
culation into a single quality function.
Reinforcement learning is often applied at the in-
tersection of AI and control theory, with the latter also
being relevant in our setting. We briefly point out dif-
ferences. The AI perspective focuses on performance and related qualities of the generated models, whereas control focuses on the stability of the system: the key property is that bounded inputs should result in bounded outputs, turning an initially unstable system into a stabilising one.
Performance is measured in terms of rewards. Re-
wards are assumed to be bounded, but unstable sys-
tems could be governed by arbitrarily negative (or un-
bounded) rewards or penalties.
The RL objective is to maximise the reward that
is calculated for each state of the system. An impor-
tant assessment factor in this process is the value of
being in a state s. For this, the expected future re-
ward of a policy is evaluated using a value function.
Positive and negative assessments can be used: Re-
ward, e.g., for achieving performance objective, and
Penalty, e.g., for high costs or consumption. The
quality of the approach is then measured typically by
the optimality of the model and time of convergence.
The policy is adjusted to improve performance. Pol-
icy optimisation is based on a mix of exploitation and
exploration, i.e., mixing the exploitation of previous
knowledge and also random exploration. In contexts
such as edge computing, in addition to classical per-
formance, also robustness against disturbances in the
environment is of importance. We have argued that
also fairness is important and can actually be seen as
contributing to the overall performance. We will pro-
vide a respective definition below that clarifies this.
As indicated, sustainability and explainability impact
more the context of the system in question, but can
of course also be rewarded or penalised if automated
observation and assessment is possible (as for energy
consumption as a sustainability criterion).
These different metrics distinguish our setting
from the typical control-theoretic focus on stability.
Two widely used RL algorithms are Q-Learning
and SARSA. Both learn an action-value function by
estimating a quality function Q(s, a) for a state s and
an action a. The Q-value or quality function is up-
dated after every time step. SARSA and Q-Learning
both use the epsilon-greedy policy, i.e., choosing be-
tween exploration and exploitation randomly.
Q-Learning is a so-called off-policy TD algorithm
to find an optimal policy by updating the state-action
value function
Q : S × A → ℝ
for every step using the Bellman optimality equation until the function converges to the optimal Q.
Algorithm 1: Definitions for Q-learning and SARSA.
States S = {1, . . . , n_s}
Actions A = {1, . . . , n_a}, A : S → A
Reward function R : S × A → ℝ
Probabilistic transition function T : S × A → S
Learning rate α ∈ [0, 1], typically α = 0.1
Discount factor γ ∈ [0, 1]
For Q-learning, the quality function Q : S × A → ℝ is
defined by the following algorithm:
SARSA is the second RL algorithm we consider.
SARSA stands for State, Action, Reward, (Next)State,
(Next)Action. SARSA is an on-policy TD algorithm
that aims to find the optimal policy by updating the
state-action value function Q at every state using the
Bellman equation. SARSA learns by experiencing the
environment and updating the state-action value at ev-
ery time step:
Q(s, a) ← (1 − α) · Q(s, a) + α · (r + γ · Q(s′, a′))
Thus, there is only one difference to Q-learning in the
calculation of Q, where the maximisation is not applied to Q(s′, a′).
Algorithm 2: Quality function Q for Q-learning.
procedure QLEARNING(S, A, R, T, α, γ)
    Initialize Q : S × A → ℝ arbitrarily
    while Q is not converged do
        Start in state s ∈ S
        while s is not terminal do
            Calculate π according to Q and exploration strategy
            a ← π(s)
            r ← R(s, a)        ▷ Receive the reward
            s′ ← T(s, a)       ▷ Receive the new state
            Q(s, a) ← (1 − α) · Q(s, a) + α · (r + γ · max_a′ Q(s′, a′))
            s ← s′
        end while
    end while
    return Q
end procedure
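A direct transcription of Algorithm 2 into Python could look as follows. The toy chain environment, the ε-greedy parameters (γ = 0.9, ε = 0.1) and the fixed episode budget are assumptions for illustration. For SARSA, the max over Q(s′, ·) would be replaced by Q(s′, a′), with a′ chosen by the same ε-greedy policy.

import random

def q_learning(n_states, n_actions, reward_fn, transition_fn,
               alpha=0.1, gamma=0.9, epsilon=0.1, episodes=500):
    """Tabular Q-learning with an epsilon-greedy exploration strategy."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0                                     # start state
        while s != n_states - 1:                  # last state is terminal
            if random.random() < epsilon:         # exploration
                a = random.randrange(n_actions)
            else:                                 # exploitation
                a = max(range(n_actions), key=lambda x: Q[s][x])
            r = reward_fn(s, a)                   # receive the reward
            s_next = transition_fn(s, a)          # receive the new state
            Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s_next]))
            s = s_next
    return Q

# Toy chain environment: action 1 moves right (reward 1 just before the goal), action 0 stays.
n_states, n_actions = 5, 2
reward_fn = lambda s, a: 1.0 if (a == 1 and s == n_states - 2) else 0.0
transition_fn = lambda s, a: min(s + 1, n_states - 1) if a == 1 else s
Q = q_learning(n_states, n_actions, reward_fn, transition_fn)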
6 ML QUALITY METRICS
We introduce our metrics framework, starting with a
conceptual frame, before defining each of the five se-
lected metrics in more detail.
6.1 Metrics – Conceptual Framework
We can classify the metrics based on whether they relate to the control task at hand or affect the resources that this task consumes. For the task, we indicate whether the control function is directly affected (core), whether it could be influenced by its direct context (environment), whether it could skew the results by favouring certain outcomes (bias), whether it relates to the responsibility for the task (governance), or whether it impacts resources (energy here as one selected concern).
task - core: performance
task - environment: robustness
task - bias: fairness
task - governance: explainability
resources - energy: sustainability
Both positive and negative measurements can be val-
ued, e.g., by rewarding or penalising them.
We measure concerns that directly influence the
task at hand, i.e., how well the solution can perform
its job. In a second category of quality targets, the en-
vironment is addressed. This includes resources and
their consumption, e.g., in terms of energy consump-
tion, but also the human or organisation in charge of
the system in a governance concern, e.g., in terms
of explainability. We also add an impact direction,
i.e., whether the concern is internal, influenced by ex-
ternal forces (inwards through disturbances) or influ-
ences external aspects (outwards on parts of the envi-
ronment).
6.2 Performance
Performance is here the overall accuracy of the model
towards an optimal reward. This is built into ap-
proaches like SARSA or Q-learning to optimise the
reward. The performance of an RL algorithm can
be determined by defining the cumulative reward as
a function of the number of learning steps. Higher rewards indicate better performance.
Different performances emerge depending on the
chosen α for the Q-function. Three parameters are
important for the performance evaluation:
Convergence: the asymptotic slope of a graph in-
dicates the quality of the policy after stabilisation
of the RL algorithm at the end.
Initial Loss: The lowest point of a graph indicates
how much reward is often sacrificed before the
performance begins to improve.
Return on Investment: The zero crossing after the
initial loss gives an indication of recovery time,
i.e., of how long it takes to recover from initial,
often unavoidable learning costs.
The second and third cases only apply if there are pos-
itive and negative rewards. Also note that the cumula-
tive reward is a measure of the total rewards, but algo-
rithms such as Q-learning or SARSA use discounted
rewards modelled using the discount factor γ. A flat-
tened graph would indicate that the learning process
has finished with a defined policy. Instead of accumu-
lated rewards, also the average reward could be mea-
sured. This would be a measure of the quality of the
learned policy.
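As a sketch of how these three indicators could be extracted from a recorded cumulative-reward curve (one value per learning step), the following function estimates them. The synthetic curve and the tail window size are assumptions for illustration only.

def performance_indicators(cum_reward, tail=50):
    """Estimate convergence slope, initial loss and recovery time from a cumulative reward curve."""
    # Convergence: asymptotic slope over the last `tail` steps
    slope = (cum_reward[-1] - cum_reward[-tail]) / tail
    # Initial loss: lowest point of the curve (reward sacrificed before improvement)
    initial_loss = min(cum_reward)
    # Return on investment: first zero crossing after the lowest point
    low_idx = cum_reward.index(initial_loss)
    recovery = next((i for i in range(low_idx, len(cum_reward)) if cum_reward[i] >= 0), None)
    return slope, initial_loss, recovery

# Synthetic example: early penalties, then steady improvement
curve = [-i for i in range(20)] + [-20 + 0.5 * i for i in range(200)]
print(performance_indicators(curve))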
6.3 Robustness
Robustness is the ability to accept, i.e., deal with a
wide range of input cases. This includes for instance
uncertainties, noise, non-deterministic behaviour and
other disturbances. These are typical for physical sys-
tems like the IoT or the edge, where sensors, con-
nection or computation can fail in different locations.
Robustness arises in non-deterministic behaviour sit-
uations and needs repeated experiments in the eval-
uation. We use the term disturbances to capture the
multitude of external factors. Disturbances can be
classified into three possible contexts: observations,
actions, and dynamics of the environment that the RL
agent interacts with.
Observation or state disturbances happen when
the observers (e.g., sensors) used cannot detect the
exact state of the system.
Action disturbances happen when the actuation
ultimately is not the same as the one specified by
the control output, thus causing a difference be-
tween actual and expected action.
External dynamics disturbances are applied di-
rectly to the system. These are environmental fac-
tors or external forces.
Figure 2: Disturbances: white noise, step, impulse, saw-
tooth, triangle waves [adopted from (Glossop et al., 2022)].
Disturbances can be classified into different be-
havioural patterns for the observed system quality
over time – see Figure 2. The patterns are important
as the evaluation of a controller’s quality is often
done using simulations based on disturbances being
injected into the system following these patterns. Fig-
ure 2 shows the five patterns in two categories – three
non-periodic and two periodic ones.
Non-periodic patterns are the following:
White Noise Disturbances: mimic natural
stochastic noise that agents encounter in the real
world. Noise is applied with standard deviations ranging from zero upwards.
Step Disturbances: allow us to frame a system’s
response to one sudden and sustained change. The
magnitude of the step can be varied.
Impulse Disturbances: allow us to see a sys-
tem’s response to a sudden, very short temporary
change. The impulse magnitude can be varied as
above.
Periodic patterns are the following:
Saw Wave Disturbances: these are cyclic waves
that increase linearly to a given magnitude and in-
stantaneously drop back to a starting point in a
repeated way. Thus, this combines characteristics
of the step and impulse disturbances, but is here
in contrast applied periodically.
Triangle Wave Disturbances: these are also cyclic
waves that as above repeatedly increase linearly to
a given magnitude and decrease at the same rate to
a starting point (and not suddenly as above). So,
this is very similar to the saw wave, but exhibits a
more sinusoidal behaviour.
Robustness is then evaluated in general as follows:
We can compare the performance metrics (as above
in the ’Performance’ section) between an ideal and a
disturbed setting:
Perf_Disturbed / Perf_Ideal. This can be done for all disturbance patterns.
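The following sketch generates the five disturbance patterns as per-step offsets for injection and computes the robustness ratio. The magnitudes, the period and the performance values are purely illustrative assumptions.

import random

def disturbances(steps=100, magnitude=1.0, period=20, seed=42):
    """Generate the five disturbance patterns as lists of per-step offsets."""
    rng = random.Random(seed)
    white = [rng.gauss(0.0, magnitude) for _ in range(steps)]                # white noise
    step = [0.0 if t < steps // 2 else magnitude for t in range(steps)]      # sudden, sustained change
    impulse = [magnitude if t == steps // 2 else 0.0 for t in range(steps)]  # sudden, short change
    saw = [magnitude * (t % period) / period for t in range(steps)]          # linear rise, instant drop
    triangle = [magnitude * (1 - abs(2.0 * (t % period) / period - 1))
                for t in range(steps)]                                       # linear rise and fall
    return {"white": white, "step": step, "impulse": impulse, "saw": saw, "triangle": triangle}

def robustness_ratio(perf_disturbed, perf_ideal):
    """Robustness as the ratio of disturbed to ideal performance (closer to 1 is better)."""
    return perf_disturbed / perf_ideal

signals = disturbances()
print(robustness_ratio(perf_disturbed=0.81, perf_ideal=0.95))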
6.4 Sustainability
General economic and ecological sustainability goals
are important societal concerns that also should find
their application in computing, here specifically in
terms of cost and energy-efficiency of the RL model
creation and model-based decision processes.
Sustainability is often used synonymously with
environmentally sustainable, e.g., in terms of lower
carbon emissions (Mou et al., 2022). While differ-
ent measures can be proposed, we choose energy consumption here as one example because it is often
easy to determine in computing environments. En-
ergy efficiency can be measured through
energy consumption in kilojoule per task (kJ/task), which can be determined using monitoring tools such as the NVIDIA System Management Interface (nvidia-smi), a CLI utility for the management and monitoring of NVIDIA GPU devices;
CPU/GPU usage in percent (%), which can also be determined using nvidia-smi or similar tools.
These metrics are often put into comparison with
performance metrics. Similar to the robustness case,
a ratio can indicate a possible trade-off between
performance and sustainability. We can relate the
performance to the cost or resource consumption it
causes:
Performance / Resource Consumption. This can be done for various resource or cost types.
Sustainability focusing on the consumption of re-
sources is often considered and measured through
penalties in the value or quality calculation.
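A minimal sketch of such an energy penalty, together with the performance/consumption ratio from above, could look as follows. The weight and the example values are placeholders, not measured figures.

def sustainable_reward(task_reward, energy_kj_per_task, energy_weight=0.01):
    """Penalise the task reward by the monitored energy consumption (kJ per task)."""
    return task_reward - energy_weight * energy_kj_per_task

def efficiency(performance, resource_consumption):
    """Performance per unit of consumed resource (higher is better)."""
    return performance / resource_consumption

print(sustainable_reward(task_reward=1.0, energy_kj_per_task=35.0))  # approximately 0.65
print(efficiency(performance=0.9, resource_consumption=35.0))        # approximately 0.026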
6.5 Fairness
Specifically where people are involved, the fairness of the decisions made is crucial, and any bias towards or against specific groups needs to be identified. This
concern can also be transferred to the technical do-
main, creating a notion of technical fairness that
avoids preferences that could be given to specific set-
tings without a reason.
Fairness can be defined in a precise way – we fol-
low the definition given by (Jabbari et al., 2016):
A policy is fair if, in a given state s, an RL algorithm does not choose a possible action a with probability higher than another action a′ unless its quality is better, i.e., Q(s, a) > Q(s, a′).
This ensures that the long-term reward of a chosen
action a is greater than that of a′ and there is no
bias that would lead to a selection not guided by
optimal performance.
The algorithms must result in a distribution of ac-
tions with a somewhat heavier weight put on better
performing actions judged in terms of (possibly dis-
counted) long-term reward. Actions cannot be sug-
gested without having a positive effect on the objec-
tive performance of the system as defined above.
The above definition is often referred to as exact
fairness, i.e., quality measured as the potential long-
term discounted reward. Possible alternatives shall
also be briefly discussed:
Approximate-choice fairness requires never choosing a worse action with a probability substantially higher than that of a better action.
Approximate-action fairness requires never favouring an action of substantially lower quality than that of a better action (a simple check for these notions is sketched below).
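As an illustration of how these fairness notions could be checked on the learned quantities, the following sketch compares a policy's action probabilities with the Q-values in a single state. The probabilities, Q-values and the threshold alpha are made-up examples, not taken from the use case.

def exact_fairness_violations(probs, q_values):
    """Pairs (a, b) where action a is chosen more often than b without a better Q-value."""
    actions = range(len(probs))
    return [(a, b) for a in actions for b in actions
            if probs[a] > probs[b] and not q_values[a] > q_values[b]]

def approx_action_fairness_violations(probs, q_values, alpha=0.1):
    """Pairs where a is favoured although its quality is substantially (by alpha) lower."""
    actions = range(len(probs))
    return [(a, b) for a in actions for b in actions
            if probs[a] > probs[b] and q_values[a] < q_values[b] - alpha]

# State with three actions: action 0 is favoured despite a lower Q-value than action 2
probs = [0.6, 0.1, 0.3]
q_vals = [0.40, 0.20, 0.55]
print(exact_fairness_violations(probs, q_vals))          # [(0, 2)]
print(approx_action_fairness_violations(probs, q_vals))  # [(0, 2)]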
A number of quality remedies that are known in
the ML domain include the improvement of data la-
belling through so-called protected attributes. An ex-
ample is the automation of critical situation assess-
ment. Here, for instance, a high risk of failure based on past experience might be considered; such an assessment could discriminate based on certain events that have occurred and could be biased against or towards these. Remedies can be applied through pre-processing (before ML training) or in-processing (while training).
The challenges are to find bias and remove this
bias through a control loop, e.g., using favourable data
labels as protected attributes to manage fairness. Examples in the edge controller setting are smaller or bigger device clusters that could be favoured wrongly, or preferences for specific types of recommended topologies or recommended configuration sizes (messages, storage, etc.).
6.6 Explainability
Explainability is important in general for AI in order
to improve the trustworthiness of the solutions. For
technical settings such as Edge and IoT, explainabil-
ity could aid a root cause analysis for quality deficien-
cies. The explainability of the controller actions is a
critical factor to map observed ML model deficien-
cies back to system-level properties via the monitored
input data to the ML model creation.
Explainability is a meta-quality aiding to improve
controller quality assessment. Since how and why
ML algorithms create their models is generally not
always obvious, a notion of explainability can help
to understand deficiencies in all of the above four criteria
and remedy them.
Explainability for RL is less mature than for other
ML approaches. A number of taxonomies have been
proposed in recent years. We focus here on (Milani
et al., 2022) to illustrate one example of a classifica-
tion of explainability into three types.
Feature importance (FI) explanations: identify
features that have an effect on an action a proposed by a controller for a given input state s. FI explanations thus provide an action-level perspective of the controller. For each action, the immediate situation that was critical for causing that action selection is considered (a minimal sketch of such an FI explanation follows after this list).
Learning process and MDP (LPM) explanations:
show past experiences or the components of the
Markov Decision Process (MDP) that have led to
the current controller behaviour.
LPM explanations provide information about the
effects of the training process or the MDP, e.g.,
how the controller handles the rewards.
Policy-level (PL) explanations: show long-term
controller behavior as caused by its policy.
This happens either through abstraction or repre-
sentative examples. They are used to evaluate the
overall competency of the controller to achieve its
objectives.
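As indicated in the FI item above, a minimal, model-agnostic sketch of a feature-importance explanation perturbs each feature of the input state and checks whether the controller's chosen action changes. The state features, the perturbation size and the illustrative policy are assumptions, not part of the taxonomy.

def feature_importance(policy, state, delta=0.1):
    """Per-feature indicator: 1 if perturbing the feature flips the chosen action, else 0."""
    base_action = policy(state)
    importance = {}
    for name, value in state.items():
        perturbed = dict(state, **{name: value + delta})
        importance[name] = int(policy(perturbed) != base_action)
    return importance

# Illustrative controller: scale up (action 1) when workload is high and cost is acceptable
policy = lambda s: 1 if s["workload"] > 0.8 and s["cost"] < 0.5 else 0
state = {"workload": 0.75, "cost": 0.2, "latency": 0.4}
print(feature_importance(policy, state))  # workload flips the action, the others do not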
Other taxonomies also exist. For instance, (Kra-
jna et al., 2022) distinguishes two types in terms of
their temporal scope:
Reactive explanations: these focus on the imme-
diate moment, i.e., only consider a short time hori-
zon and momentary information.
Proactive explanations: these focus on longer-
term consequences, thus considering information
about an anticipated future.
A reactive explanation provides an answer to the
question “what happened”. A proactive explanation
answers “why has something happened”.
These can then be further classified in terms of
how an explanation was created, e.g.,
Reactive explanations: policy simplification, re-
ward decomposition or feature contribution and
visual methods for the reactive group.
Proactive explanations: structural causal model,
explanation in terms of consequences, hierarchi-
cal policy and relational reinforcement learning
for the proactive group.
While explainability is a broad concern, we have
introduced here definitions and taxonomies that are
relevant for a technical setting and allow us to define
metrics based on observations that can be obtained in
the controller construction and deployment.
6.7 Metrics Summary
We provided a review of five types of quality metrics
for RL-constructed controllers. These have been in-
troduced at a conceptual level with the aim of motivating this specific catalog of metrics as being relevant for the chosen architectural settings. We have noted
that for those five metric types a number of more spe-
cific individual metrics exist.
We summarise the findings in Table 1 with a brief
definition and notes on metrics determination and re-
mediation, relating to the M and A parts and P and E
parts of the MAPE pattern.
7 DevOps PROCESS
Quality might already be a problem at the beginning.
However, often the quality of a system deteriorates
over time. This observation led us to devise a qual-
ity management process, aligned with the DevOps ap-
proach to software quality management.
7.1 Change – Drift and Anomalies
In general, we need to consider changes in the envi-
ronment as possible root causes of observed quality
problems. These changes could be caused by sensor
faults or communication faults in an edge network,
but might also reflect naturally occurring changes.
With drift we describe this particular phe-
nomenon that the quality of systems often deteriorates
over time, particularly if the environment changes
naturally (Hu et al., 2020; Lu et al., 2018). As an
example, the data creation process is not always sta-
ble because it is subject to changing external events
that affect data coming from input sources, such as the
seasonality of data or errors resulting from the sensing
and monitoring.
On the other hand, quality deteriorates when faults
occur in the systems, causing anomalies to be ob-
served (Samir and Pahl, 2019; Samir and Pahl, 2021).
ML models that have been trained over these data
inputs could become obsolete and might have diffi-
culties adapting to changing conditions. As a con-
sequence, a challenge is to relate observed quality
problems with the controller to change and potentially
quality problems with environmental factors, e.g., at
the input data level.
This process starts with drift and anomaly detec-
tion in the quality monitoring and should result in
a root cause determination at the data side, if pos-
sible, and also the enactment of a suitable remedy,
i.e., implementing a feedback loop for instance in the
MAPE-style. The research on drift as well as anomaly
detection is still a challenge, even without an embed-
ding in a closed feedback loop.
The discussion of metrics has indicated that quality problems can originate in the environment of the system.
Thus, there is a need to find root causes of the anoma-
lies that have been observed.
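A very simple drift indicator, given here only as a sketch, compares a recent window of a monitored metric against a reference window. The window size and threshold are arbitrary assumptions; actual drift detectors (Lu et al., 2018) are considerably more sophisticated.

def detect_drift(series, window=50, threshold=0.2):
    """Flag drift when the recent window mean deviates strongly from the reference window mean."""
    if len(series) < 2 * window:
        return False
    reference = sum(series[:window]) / window
    recent = sum(series[-window:]) / window
    return abs(recent - reference) > threshold * max(abs(reference), 1e-9)

# Example: a monitored utilisation metric that shifts upwards over time
series = [0.5] * 100 + [0.8] * 100
print(detect_drift(series))  # True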
Figure 3: ML-centric DevOps – DevOps adjusted to ML-based Software Construction and Operation.
7.2 RL DevOps Process
We apply our proposed architecture to resource management and orchestration. In edge clouds and IoT
architectures (Hong and Varghese, 2019), controllers
manage systems autonomously. Compute, storage or
network resources are configured dynamically (Toku-
naga et al., 2016; Femminella and Reali, 2019). An-
other strategy is the dynamic allocation and man-
agement of tasks in distributed environments (Zhao
et al., 2019). ML has been used in some architectures
(Wang et al., 2020).
In order to continuously manage quality, we pro-
pose here a process that accompanies the architecture in-
troduced earlier on. It aims to align the different in-
dividual quality concerns into an integrated DevOps
quality model, providing a closed RL feedback loop.
DevOps is an integrated feedback loop used for soft-
ware development and operation. We adapt this, tak-
ing into account the specific problems of the ML con-
troller construction – see Fig. 3.
7.3 Management of Anomalies
Input anomalies can be distinguished into two types: incompleteness, where sensors do not provide data or the connections between devices are down, and incorrectness, where sensors provide incorrect data (because of faultiness of the sensors themselves or transmission faults). The anomalies can be characterised along the following dimensions. Firstly, the extent or degree of incompleteness or incorrectness: incompleteness can occur to different degrees, and incorrect data can range from
Table 1: Summary of metrics with definition, Determination (M/A in MAPE), Remediation (P/E in MAPE).
Quality Metric | Definition | Determination | Remediation
Performance | reward optimisation | level 1 quality (e.g., execution time, workload) and level 2 quality (e.g., convergence, loss) | built into rewards and policy optimisation
Robustness | tolerance against disturbances | impact on performance metrics during disturbances | pattern recognition (learned or not)
Sustainability | resource consumption | environmental metrics – consumption and cost | reward or penalise consumption
Explainability | evaluation of controller actions | recording of actions and a-posteriori analysis of action impact | (manual) RL reconfiguration
Fairness | no bias | bias detection via performance comparison | mechanisms such as favourable labels
slightly out of normal ranges up to extreme and im-
possible values. Secondly, the variability of anoma-
lies: the two types could appear in a random way or
clustered.
The disturbance patterns introduced earlier are in-
correctness anomalies in the above sense. However,
incompleteness is also an issue. In a past evalua-
tion, we have demonstrated that incorrectness is more
significant than incompleteness. A possible reason
here is that in incompleteness the ML tool may ig-
nore missing data and not include these in the con-
struction. However, for incorrectness, a tool needs to
use all values, irrespective of their correctness. Thus,
it cannot control or minimize the negative impact on
performance.
These identified anomalies can be associated with root causes, e.g., (i) significant performance changes point to incorrectness caused most likely by sensor
faults, (ii) clustered incompleteness can be associated
with local network faults, or (iii) time-clustered in-
correctness can be associated with sensor faults, but
faulty individual sensors have less impact than com-
munication faults. Using this kind of rule system,
useful recommendations for remedial actions (also
beyond the automated adaptation) such as checking
or replacing faulty sensors, could be given.
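Rules (i)–(iii) can be expressed as a small mapping from an observed anomaly characterisation to a likely root cause and a remedial recommendation. The field names and remedy texts below are illustrative only, not the deployed rule system.

def root_cause(anomaly):
    """Map an observed anomaly characterisation to a likely root cause and a remedy."""
    kind, clustered = anomaly["kind"], anomaly["clustered"]
    if kind == "incorrectness" and anomaly.get("performance_change") == "significant":
        return "sensor fault", "check or replace the affected sensors"            # rule (i)
    if kind == "incompleteness" and clustered == "spatial":
        return "local network fault", "inspect the local edge network links"      # rule (ii)
    if kind == "incorrectness" and clustered == "temporal":
        return "sensor fault", "check sensors; impact lower than network faults"  # rule (iii)
    return "unknown", "escalate for manual root cause analysis"

print(root_cause({"kind": "incompleteness", "clustered": "spatial"}))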
8 CONCLUSIONS
This look into self-adaptive edge and IoT systems
shows that ML has been recognised as a highly suit-
able construction mechanism for controllers. Contin-
uous quality management is, however, still an open
research problem, where the term AutoML for automated machine learning is often used to refer to the need for continuous and automated management of neural networks (NN), reinforcement learning (RL)
or other mechanisms (He et al., 2021). The problem
space we investigated here is at the intersection of
different research fields: software engineering, ML,
automation and self-adaptive systems and also data
analysis here with an application focus on IoT and
the cloud edge continuum.
The core contribution is a metrics catalogue to
manage the quality of RL-constructed controllers.
This metrics catalogue was framed in an architecture
and process setting.
Several directions remain for future work. We al-
ready indicated that the metrics catalogue remains at a
conceptual level, as detailed definitions for result-
ing individual metrics across the five categories were
beyond the scope here. Furthermore, handling strate-
gies for the different metric types would also have to
be investigated further if the actual implementation of
a multi-objective controller for all metrics is envis-
aged.
REFERENCES
Al-Nima, R. R. O., Han, T., Al-Sumaidaee, S. A. M., Chen,
T., and Woo, W. L. (2021). Robustness and perfor-
mance of deep reinforcement learning. Applied Soft
Computing, 105:107295.
Azimi, S. and Pahl, C. (2020). Root cause analysis and
remediation for quality and value improvement in ma-
chine learning driven information models. In 22nd
International Conference on Enterprise Information
Systems.
Buşoniu, L., de Bruin, T., Tolić, D., Kober, J., and Palunko,
I. (2018). Reinforcement learning for control: Per-
formance, stability, and deep approximators. Annual
Reviews in Control, 46:8–28.
De Hoog, J., Mercelis, S., and Hellinckx, P. (2019). Improv-
ing machine learning-based decision-making through
inclusion of data quality. CEUR Workshop Proceed-
ings, 2491.
Ehrlinger, L., Haunschmid, V., Palazzini, D., and Lettner,
C. (2019). A DaQL to monitor data quality in machine
learning applications. In Database and Expert Sys-
tems Applications.
Fang, D., Liu, X., Romdhani, I., Jamshidi, P., and Pahl, C.
(2016). An agility-oriented and fuzziness-embedded
semantic model for collaborative cloud service search,
retrieval and recommendation. Future Gener. Comput.
Syst., 56:11–26.
Femminella, M. and Reali, G. (2019). Gossip-based
monitoring of virtualized resources in 5g networks.
In IEEE Conference on Computer Communications
Workshops, INFOCOM Workshops 2019, pages 378–
384. IEEE.
Glossop, C., Panerati, J., Krishnan, A., Yuan, Z., and
Schoellig, A. P. (2022). Characterising the robustness
of reinforcement learning for continuous control using
disturbance injection. In Progress and Challenges in
Building Trustworthy Embodied AI.
He, X., Zhao, K., and Chu, X. (2021). Automl: A sur-
vey of the state-of-the-art. Knowledge-Based Systems,
212:106622.
Hong, C. and Varghese, B. (2019). Resource management
in fog/edge computing: A survey on architectures,
infrastructure, and algorithms. ACM Comput. Surv.,
52(5):97:1–97:37.
Hu, H., Kantardzic, M., and Sethi, T. S. (2020). No free
lunch theorem for concept drift detection in stream-
ing data classification: A review. Wiley Interdisci-
plinary Reviews: Data Mining and Knowledge Dis-
covery, 10(2):e1327.
Jabbari, S., Joseph, M., Kearns, M. J., Morgenstern, J., and
Roth, A. (2016). Fair learning in markovian environ-
ments. CoRR, abs/1611.03071.
Javed, A., Malhi, A., and Främling, K. (2020). Edge
computing-based fault-tolerant framework: A case
study on vehicular networks. In Intl Wireless
Communications and Mobile Computing Conference,
IWCMC 2020. IEEE.
Krajna, A., Brcic, M., Lipic, T., and Doncevic, J. (2022).
Explainability in reinforcement learning: perspective
and position.
Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., and Zhang,
G. (2018). Learning under concept drift: A review.
IEEE Transactions on Knowledge and Data Engineer-
ing, 31(12):2346–2363.
Lwakatare, L. E., Raj, A., Bosch, J., Olsson, H. H., and
Crnkovic, I. (2019). A taxonomy of software engi-
neering challenges for machine learning systems: An
empirical investigation. In Agile Processes in Soft-
ware Engineering and Extreme Programming - 20th
International Conference.
Mendonca, N. C., Jamshidi, P., Garlan, D., and Pahl, C.
(2021). Developing self-adaptive microservice sys-
tems: Challenges and directions. IEEE Software,
38(2):70–79.
Milani, S., Topin, N., Veloso, M., and Fang, F. (2022). A
survey of explainable reinforcement learning.
Mou, Z., Huo, Y., Bai, R., Xie, M., Yu, C., Xu, J., and
Zheng, B. (2022). Sustainable online reinforcement
learning for auto-bidding.
Pahl, C. (2023). Research challenges for machine learning-
constructed software. Serv. Oriented Comput. Appl.,
17(1):1–4.
Pahl, C. and Azimi, S. (2021). Constructing dependable
data-driven software with machine learning. IEEE
Softw., 38(6):88–97.
Pahl, C., Azimi, S., Barzegar, H. R., and El Ioini, N. (2022).
A quality-driven machine learning governance archi-
tecture for self-adaptive edge clouds. In International
Conference on Cloud Computing and Services Science
CLOSER.
Pahl, C., Fronza, I., Ioini, N. E., and Barzegar, H. R. (2019).
A review of architectural principles and patterns for
distributed mobile information systems. In Interna-
tional Conference on Web Information Systems and
Technologies.
Samir, A. and Pahl, C. (2019). DLA: detecting and local-
izing anomalies in containerized microservice archi-
tectures using markov models. In 7th International
Conference on Future Internet of Things and Cloud,
FiCloud 2019, pages 205–213. IEEE.
Samir, A. and Pahl, C. (2021). Autoscaling recovery actions
for container-based clusters. Concurr. Comput. Pract.
Exp., 33(23).
Tokunaga, K., Kawamura, K., and Takaya, N. (2016). High-
speed uploading architecture using distributed edge
servers on multi-rat heterogeneous networks. In IEEE
International Symposium on Local and Metropolitan
Area Networks, LANMAN 2016, pages 1–2. IEEE.
Wan, Z., Xia, X., Lo, D., and Murphy, G. C. (2021). How
does machine learning change software development
practices? IEEE Trans. Software Eng., 47(9):1857–
1871.
Wang, F., Zhang, M., Wang, X., Ma, X., and Liu, J. (2020).
Deep learning for edge computing applications: A
state-of-the-art survey. IEEE Access, 8:58322–58336.
Xu, X., Chen, Y., and Bai, C. (2021). Deep reinforcement
learning-based accurate control of planetary soft land-
ing. Sensors, 21(23).
Zhao, H., Yi, D., Zhang, M., Wang, Q., Xinyue, S., and Zhu,
H. (2019). Multipath transmission workload balanc-
ing optimization scheme based on mobile edge com-
puting in vehicular heterogeneous network. IEEE Ac-
cess, 7:116047–116055.