Reinforcement Learning for Multi-Objective Task Placement on
Heterogeneous Architectures with Real-Time Constraints
Bakhta Haouari 1,2,3 a, Rania Mzid 1,4 b and Olfa Mosbahi 2 c
1 ISI, University Tunis-El Manar, 2 Rue Abourraihan Al Bayrouni, Ariana, Tunisia
2 LISI Lab INSAT, University of Carthage, Centre Urbain Nord B.P. 676, Tunis, Tunisia
3 Tunisia Polytechnic School, University of Carthage, B.P. 743, La Marsa, Tunisia
4 CES Lab ENIS, University of Sfax, B.P. w.3, Sfax, Tunisia
a https://orcid.org/0000-0002-5336-6300
b https://orcid.org/0000-0002-3086-370X
c https://orcid.org/0000-0002-0971-2368
Keywords:
Real-Time, Task Placement, Multi-Objective, Refactoring, Pareto Q-Learning.
Abstract:
This paper introduces a novel approach for multi-objective task placement on heterogeneous architectures
in real-time embedded systems. The primary objective of task placement is to identify optimal deployment
models that assign each task to a processor while considering multiple optimization criteria. Given the NP-
hard nature of the task placement problem, various techniques, including Mixed Integer Linear Programming
and genetic algorithms, have been traditionally employed for efficient resolution. In this paper, we explore
the use of reinforcement learning to solve the task placement problem. We first model this problem as a Markov Decision Process. Then, we leverage the Pareto Q-learning algorithm to approximate Pareto
front solutions, balancing system extensibility and energy efficiency. The application of the proposed method
to real-world case studies showcases its effectiveness in task placement problem resolution, enabling rapid
adaptation to designer adjustments compared to related works.
1 INTRODUCTION
Real-time embedded systems (RTES) are computing systems designed to perform specific functions under precise timing constraints. Nowadays, these systems are commonly found in many aspects of everyday life, ranging from consumer electronics to healthcare applications (Akesson et al., 2020). They typically
consist of both hardware and software components
tightly integrated to execute tasks efficiently. Indeed,
RTES are characterized by their ability to promptly
respond to external stimuli within strict timing con-
straints, offering reliability and performance crucial
for a wide array of applications. Developing such sys-
tems is inherently challenging, given that any failure
could have critical implications for human safety.
The software engineering community is facing
various challenges today during the development of
RTES, including the task placement problem (Las-
soued and Mzid, 2022). This problem involves as-
signing tasks to processors in a manner that satisfies
real-time constraints while optimizing system performance metrics. As real-time embedded systems become more complex, solving the task placement problem becomes much harder. In fact, task placement is an NP-hard problem, which means that finding optimal solutions within a reasonable amount of time becomes extremely difficult. In response, researchers have explored vari-
ous optimization techniques, including exact methods
such as Mixed Integer Linear Programming (MILP)
(Mehiaoui et al., 2019; Lakhdhar et al., 2020), Re-
inforcement Learning (RL) (Haouari et al., 2022;
Haouari et al., 2023b), and Genetic Algorithms (GA)
(Lassoued and Mzid, 2022), to tackle task placement
efficiently.
In this paper, we explore the use of reinforcement
learning to solve the multi-objective task placement
problem in RTES. We introduce a new method called
Pareto Q-learning Placement (PQP). This method re-
lies on the Pareto Q-learning algorithm (Van Moffaert and Nowé, 2014), an emerging artificial intelligence technique known for its effectiveness in such scenar-
ios. The PQP method is designed to generate a set of
Pareto-optimal solutions for the task placement prob-
lem in RTES, adhering to all specified constraints,
particularly those related to real-time requirements,
while simultaneously improving system extensibility
and reducing energy consumption. In addition to its
ability to produce Pareto-optimal solutions, the pro-
posed PQP method addresses the challenge of refactoring. Refactoring refers to the process of modifying system properties, such as updating task parameters (Haouari et al., 2022). The main contributions
of this paper may be summarized as follows: (i) We
model the task placement problem as a Markov Deci-
sion Process (MDP), (ii) We use the Pareto Q-learning
algorithm to propose a multi-objective decision maker called PQP, which offers the designer a set of deployment models, i.e., possible task-to-processor assignments, balancing system extensibility and energy efficiency. (iii) We apply and simulate the proposed PQP method on real-life case studies. The results obtained demonstrate the effectiveness of the proposed method in efficiently identifying Pareto solutions compared to existing approaches. Additionally, they showcase the capability of the proposed method to address the refactoring issue.
The rest of the paper is organized as follows: Section 2 discusses the relevant literature. Section 3 introduces key concepts related to multi-objective optimization. Section 4 presents the RTES formalization. The description of the task placement problem as an MDP is introduced in Section 5. Section 6 introduces the PQP method and provides detailed explanations of the RL algorithms. Section 7 applies the method to a case study, experimental results are detailed in Section 8, and the paper concludes with Section 9, which discusses future directions of our research.
2 RELATED WORK
Several approaches have been proposed in the litera-
ture to deal with the task allocation problem in RTES.
In (Zhu et al., 2013), the authors utilize MILP tech-
nique to optimize task and message allocation within
a distributed system, aiming to meet end-to-end dead-
lines and minimize latencies. In (Vidyarthi and Tri-
pathi, 2001), a GA-based method is proposed for
maximizing reliability in the task allocation problem.
In (Kashani et al., 2017), the authors deal with the
same problem and propose a method based on GA
to minimize the communication cost. In (Haouari
et al., 2023a), RL techniques are applied to intro-
duce a novel approach to real-time task placement and
scheduling. Unlike previous approaches, this paper
systematically explores all feasible placements be-
fore assigning a scheduling to each solution, defining
the optimal scheduling as the one that minimizes re-
sponse time. Additionally, in (Haouari et al., 2023b),
the authors delve into self-adaptive systems, propos-
ing a new method capable of effectively functioning
under both predictable and unpredictable online sys-
tem updates. Their proposed approach, represented
by a set of RL algorithms, aims to optimize system
extensibility while ensuring real-time feasibility. An-
other notable work by (Haouari et al., 2022) leverages
the Q-learning algorithm for task placement, with
the goal of reducing the number of utilized proces-
sors. This study also addresses the refactoring is-
sue, demonstrating the effectiveness of RL methods
in coping with updates to system parameters. De-
spite the significance of these works in addressing
the placement problem in RTES, they are limited to
single-objective optimization, which may not fully
capture the complexities of real-world scenarios.
In addressing multi-objective optimization in the
task allocation problem in RTES, the authors of
(Mehiaoui et al., 2019) propose an approach aimed
at addressing placement and scheduling challenges
within embedded distributed architectures, where
multiple objectives need to be optimized collectively.
They employ both MILP and GA (Mirjalili, 2019) to
handle the scalability issues associated with MILP.
To address the set of objectives comprehensively, the
authors suggest consolidating them into a single ob-
jective function, assigning weights to each objective.
This process yields an optimal solution that strikes
a balance among the objectives for the entire prob-
lem. However, as discussed in (Coello, 2007), even when this process is executed multiple times with varying weights, certain solutions that do not correspond to any combination of weights remain undetected. In (Lassoued and Mzid, 2022), a
multi-objective algorithm based on evolutionary algo-
rithms, specifically SPEA2 (Huseyinov and Bayrak-
dar, 2022), was proposed to optimize both the slack
capacity and the consumed energy of the system for
distributed processors. Nevertheless, the GA-based model has its own limitations: the authors of (Mehiaoui et al., 2019) conclude that evaluating and ensuring the quality of solutions generated by the GA is challenging, given its dependency on various factors such as encoding, crossover, and mutation operators. Despite their importance, these works do not address the refactoring issue, as the MILP and GA formulations must be re-executed from scratch after application property changes.
Some studies have achieved success utilizing RL
techniques to address problems akin to task placement
in domains beyond RTES, including the work refer-
enced as (Caviglione et al., 2021). In this paper, the
authors introduce a multi-objective approach utiliz-
ing a deep reinforcement learning (RL) technique for
addressing the online placement of virtual machines
(VMs) across a group of servers within cloud data
centers. Nevertheless, their methodology employs a weighted sum function, which restricts its applicability to linear problems. The study presented
in (Yang et al., 2020) introduces a multi-policy con-
vex hull reinforcement learning algorithm designed
for managing the operations of We-Energy in the con-
text of the Energy Internet. The algorithm lever-
ages Q-learning twice: initially, to construct a set of
Pareto multi-objective solutions within a continuous
state space, and subsequently, to select the optimal
solution that minimizes We-Energy production costs
while maximizing the security of each We-Energy
unit in the system. This proposed algorithm incor-
porates multi-objective reinforcement learning tech-
niques, and it incorporates human involvement in the
adjustment of the application to ensure system control
and enhance the confidence of intelligent systems. In-
spired by the aforementioned approaches, this work
aims to cope with the limitations of existing meth-
ods for task placement in RTES. In this paper, we aim to extend the work proposed in (Haouari et al., 2022) to deal with multi-objective task placement on heterogeneous architectures.
3 BACKGROUND
The resolution of optimization problems where a single or non-conflicting objectives are considered leads to a single solution, which may be the optimal or a near-optimal one. In multi-objective optimization, where objectives may be conflicting, the resolution step is not as simple. In fact, we are faced with a collection of solutions representing trade-offs between the objectives. The decision maker can navigate this collection and choose the solution that best fits their needs. In this section, some necessary definitions commonly used in multi-objective optimization problems are given:
Pareto dominance: In a multi-objective optimization approach, optimization problems are defined by conflicting objectives. This implies that enhancing a solution with respect to one criterion may potentially worsen another. In these scenarios, there is no single optimal solution; instead, there exists a set of non-dominated solutions, each representing a trade-off between the objectives. For instance, if we consider two solutions, π_1 and π_2, for the same problem, then for π_1 to dominate π_2 it is sufficient and necessary that π_1 be at least as good as π_2 on all the considered objectives and strictly better on at least one of them. In Figure 1, solutions depicted as red diamonds are dominated by solutions depicted as large blue circles.
Pareto optimality: π_1 is deemed a Pareto-optimal solution when there is no other solution π that dominates π_1. The set of Pareto-optimal solutions constitutes the Pareto set. In Figure 1, this collection is illustrated by the set of solutions shown as large blue circles.
The Pareto front: Represents a specific subset of
the Pareto set, encompassing only solutions along
the frontier. These solutions are characterized by
the trade-off phenomenon, wherein the enhance-
ment of one objective results in the degradation of
another. The Pareto front is the line that delineates
the portion of the plane containing the dominated
points. The solutions illustrated as large blue cir-
cles in Figure 1 comprise the Pareto front.
Figure 1: Pareto front curve.
Hypervolume: In a multi-objective scenario, we
encounter a collection of solutions that approxi-
mate the Pareto front. It has been demonstrated
in (Zitzler and Thiele, 1998; Fonseca et al., 2006)
that hypervolume serves as the most effective met-
ric for evaluating the quality of a given set of so-
lutions. Figure 2 illustrates the region (colored
in green) to which the hypervolume measurement
pertains. This region delineates the space occu-
pied by solutions dominated by the approximated
Pareto front, bounded by this front and a reference
point (i.e., a designated point dominated by all so-
lutions). Therefore, a higher hypervolume value
corresponds to superior solution quality.
Figure 2: Hypervolume.
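To make these definitions concrete, the following Python sketch (illustrative only; the function and variable names are ours, not from the paper) filters a set of bi-objective solutions down to its non-dominated subset and computes the 2D hypervolume of that subset with respect to a reference point, assuming both objectives are to be maximized (the energy objective is therefore negated).

```python
from typing import List, Tuple

Point = Tuple[float, float]  # e.g., (slack_capacity, -energy): two objectives to maximize

def dominates(p: Point, q: Point) -> bool:
    """p dominates q if p is at least as good on every objective and strictly better on one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def non_dominated(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates (the Pareto set)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def hypervolume_2d(front: List[Point], ref: Point) -> float:
    """Area dominated by the front and bounded by the reference point (maximization)."""
    area = 0.0
    prev_y = ref[1]
    for x, y in sorted(front, reverse=True):  # sweep from the best first objective downwards
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area

# Example: three candidate solutions, one of them dominated.
sols = [(0.34, -349.6), (0.14, -119.6), (0.20, -400.0)]
front = non_dominated(sols)          # (0.20, -400.0) is dominated by (0.34, -349.6)
print(front, hypervolume_2d(front, ref=(0.0, -1200.0)))
```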
4 SYSTEM FORMALIZATION
The task placement problem involves three different types of models: the task model, the platform model, and the deployment model. It is assumed in this work that the task model, which we denote by τ, is composed of n synchronous, periodic, and independent tasks (i.e., τ = {T_1, T_2, ..., T_n}). Each task T_i is characterized by static parameters T_i = (C_i, Pr_i), where C_i = (c_i1, ..., c_im) such that c_ij represents an estimation of the worst-case execution time of task T_i on processor P_j, with i ∈ {1...n} and j ∈ {1...m}, and Pr_i is the activation period of task T_i. The platform model, which we denote by P, represents the execution platform of the system. We assume that this model is composed of m heterogeneous processors (i.e., P = {P_1, P_2, ..., P_m}). Each processor has its own memory and runs a Real-Time Operating System (RTOS). The task placement step produces a deployment model, which we denote by D in this work. The deployment model consists of a set of tuples D = {(P_1, ξ_1), (P_2, ξ_2), ..., (P_k, ξ_k)}, where k represents the number of active (or used) processors such that k ≤ m, and ξ_j represents the subset of tasks allocated to processor P_j after the placement step. The primary objective of the approach developed in this paper is to minimize the energy consumption (denoted by E) of the processors responsible for executing the designated tasks. To quantify E for a specific deployment model, we rely on the energy model outlined in (Lakhdhar et al., 2018). In this model, each processor P_j is associated with a constant ζ_j representing its capacitance (assumed to be equal to 1 in this work), a frequency f_j denoting the processing speed, and a voltage v_j corresponding to the electrical supply. The energy required for the proper operation of processor P_j, denoted as E_j, is calculated using Equation 1:

E_j = Σ_{i ∈ K_j} E_ij = Σ_{i ∈ K_j} ζ_j · f_j · v_j² · c_ij,   ∀ j ∈ {1...m}   (1)

where K_j is the set of tasks allocated to P_j, E_ij is the energy required for processor P_j to perform task T_i, and the global energy E is computed as in Equation 2:

E = Σ_{j=1}^{m} E_j   (2)
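As an illustration of Equations 1 and 2, the following Python sketch (helper names and the dictionary representation of a deployment model are ours, not the paper's implementation) computes the per-processor and global energy for a given assignment, assuming capacitance ζ_j = 1 as stated above.

```python
# Illustrative sketch of the energy model (Eq. 1 and 2); names are ours, not from the paper.
# wcet[i][j] is c_ij, the WCET of task T_i on processor P_j; freq[j] and volt[j] are f_j and v_j.

def processor_energy(tasks, j, wcet, freq, volt, capacitance=1.0):
    """Energy E_j needed by processor P_j to run its assigned tasks (Eq. 1)."""
    return sum(capacitance * freq[j] * volt[j] ** 2 * wcet[i][j] for i in tasks)

def global_energy(deployment, wcet, freq, volt):
    """Global energy E of a deployment model {P_j: set of task indices} (Eq. 2)."""
    return sum(processor_energy(tasks, j, wcet, freq, volt)
               for j, tasks in deployment.items())

# Example with two processors and three tasks (hypothetical numbers).
wcet = [[1.0, 1.5], [2.0, 3.0], [3.2, 4.8]]
freq, volt = [2.0, 3.0], [2.0, 4.0]
deployment = {0: {0, 2}, 1: {1}}
print(global_energy(deployment, wcet, freq, volt))
```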
For real-time embedded systems, it is imperative that the deployment model is deemed feasible. Feasibility, in this context, entails that placing the real-time tasks on the various processors ensures that the timing specifications of the system are met. In this regard, Liu and Layland (Liu and Layland, 1973) defined a sufficient schedulability test for task models satisfying the Rate Monotonic (RM) optimality conditions. The feasibility test serves to ascertain whether a given set of tasks will consistently meet all deadlines under all release conditions. This test relies on the calculation of the processor utilization factor U_Pj and is defined as follows:

U_Pj = Σ_{i=1}^{n} c_ij / Pr_i ≤ 0.69   (3)
In tackling refactoring, our approach involves maximizing the system's adaptability, referred to as extensibility. Extensibility denotes the system's ability to accommodate changes while preserving feasibility. We measure this extensibility through the concept of "slack capacity", which signifies the minimum remaining processor capacity, after task placement, over all the considered processors. The second objective pursued by PQP is the maximization of the slack capacity, defined as:

SlackCapacity = min_{j ∈ {1...m}} (0.69 − U_Pj)   (4)

Essentially, the greater the available capacity on the processors, the more adaptable the system is to potential changes.
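A minimal sketch of the feasibility test (Eq. 3) and the slack-capacity objective (Eq. 4), reusing the dictionary-style deployment model from the previous snippet; the helper names are our illustrative choices, not the paper's code.

```python
RM_BOUND = 0.69  # utilization bound used by the feasibility test (Eq. 3)

def utilization(tasks, j, wcet, period):
    """Utilization factor U_Pj of processor P_j for its assigned tasks."""
    return sum(wcet[i][j] / period[i] for i in tasks)

def is_feasible(deployment, wcet, period):
    """A deployment model is feasible if every processor stays under the RM bound."""
    return all(utilization(tasks, j, wcet, period) <= RM_BOUND
               for j, tasks in deployment.items())

def slack_capacity(deployment, wcet, period):
    """Minimum remaining capacity over the processors (Eq. 4): the extensibility objective."""
    return min(RM_BOUND - utilization(tasks, j, wcet, period)
               for j, tasks in deployment.items())

# Example reusing the small hypothetical system from the energy snippet.
wcet = [[1.0, 1.5], [2.0, 3.0], [3.2, 4.8]]
period = [10.0, 10.0, 40.0]
deployment = {0: {0, 2}, 1: {1}}
print(is_feasible(deployment, wcet, period), slack_capacity(deployment, wcet, period))
```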
5 TASK PLACEMENT PROBLEM
AS AN MDP
In this study, the task placement challenge is modeled as a sequential decision problem. Specifically, at each time step t, the system is characterized by the states of the processors (i.e., the list of tasks placed on each processor, which is required to compute its available utilization) and the list of tasks that have not yet been placed. Consequently, the new task placement
is contingent solely upon the current state of the sys-
tem. This adherence to the Markov property (Bell-
man, 1957), defined as an essential condition for ap-
plying RL (Barto, 2021), justifies the utilization of
this technique in our context. The application of RL
techniques necessitates the establishment of key con-
cepts, such as:
State (S_t): denoted by S, it reflects the state of the system at time step t. It undergoes changes based on the actions performed by the agent at each step until reaching the final stage. In the context of the task placement problem, S_t consists of sets, each composed of a processor and its currently assigned tasks, along with the set of unplaced tasks. The final stage aligns with the completion of the task placement process, involving the assignment of all tasks.
Epoch of Decision : It aligns with the completion
of the placement of the entire set of tasks on the
designated processors. It concludes with the gen-
eration of a deployment model.
Agent: It serves as the decision-maker for the system, responsible for selecting an action to transition from state S_t to S_{t+1} with the goal of maximizing its reward in accordance with a policy π. The decision-maker undergoes a learning process from one epoch to another to fulfill its mission, and its performance is contingent upon both the action selection process and the received reward. In the task placement problem, the objective is to discover an optimal deployment model within a multi-objective context.
Action Space: Denotes the set of feasible actions available to the agent when it is in state S_t following a policy π. For the task placement problem, the action involves determining the optimal processor for a given task from the list of unplaced tasks, adhering to an ε-greedy policy (Barto, 2021), where ε is a small number that has to be initialized. Initially, a high value of ε is assigned to promote environmental exploration in the early stages. The ε-value is systematically reduced after each epoch by an ε-decrease factor. This decline in ε is justified by the agent's learning progression from one epoch to the next. As the learning process advances, new states are explored, leading to a preference for exploitation over exploration.
Reward R : The reward holds paramount signifi-
cance in system modeling, serving as the agent’s
motivator that influences its choices during each
action selection. In the context of the task place-
ment problem, the reward functions as positive re-
inforcement, resembling a gift, when the agent
makes an appropriate selection, or as a penalty
in alternative scenarios. Additionally, given the
multi-objective nature of our approach, it be-
comes essential to designate specific rewards for
each objective. In the ensuing discussion, we de-
fine two rewards: one for addressing slack capac-
ity and another for consumed energy.
Slack-Capacity Reward.

Rsc = SP_j + U_jt − U_ij   (5)

where
U_jt is the available utilization of processor P_j at time step t,
U_ij is the required utilization for a task T_i to run on P_j (i.e., U_ij = c_ij / Pr_i), and
SP_j reflects the processor state, such that

SP_j =
  −(m + δ),  when there is not enough space on P_j to support T_i (with δ > 0)
  −m,        when T_i is placed on P_j and it is not the processor with the most free capacity
  0,         otherwise

where m denotes the number of processors defined in the hardware model.
Energy Consumption Reward.

Re =
  1 / E_ij,  when E_ij is the minimum over all P_j
  −m,        otherwise                                   (6)

where
E_ij represents the energy needed for processor P_j to execute task T_i, and
m is the number of processors defined in the hardware model.
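The two reward signals can be sketched in Python as follows. This is an illustrative reading of Equations 5 and 6 only: the exact sign conventions of the slack-capacity reward are our reconstruction from the surrounding definitions, and all names are ours rather than the paper's.

```python
# Hedged sketch of the reward signals (Eq. 5 and 6); the formula shapes are our reconstruction
# of the paper's definitions, not a verbatim transcription.

def slack_capacity_reward(task, proc, available_util, required_util, best_proc, m, delta=0.1):
    """Rsc: penalty term SP_j plus the capacity left on P_j after placing T_i (assumed form)."""
    if required_util[task][proc] > available_util[proc]:
        sp = -(m + delta)            # not enough space on P_j for T_i
    elif proc != best_proc:
        sp = -m                      # P_j is not the processor with the most free capacity
    else:
        sp = 0.0                     # appropriate choice
    return sp + available_util[proc] - required_util[task][proc]

def energy_reward(task, proc, energy, m):
    """Re: 1/E_ij if P_j is the cheapest processor for T_i, otherwise a penalty of -m (assumed)."""
    e_ij = energy[task][proc]
    return 1.0 / e_ij if e_ij == min(energy[task]) else -float(m)
```

In the multi-objective setting these two scalars are kept as a reward pair rather than being combined into a weighted sum.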
6 PQP DESCRIPTION
The proposed Pareto Q-learning Placement (PQP)
method aims to address placement problems charac-
terized by a set of tasks that need to be assigned to
a set of heterogeneous processors (as discussed in
Section 4). This method comprises a series of algo-
rithms designed to generate the Pareto-optimal set.
Initially, the RL Initialization algorithm (Algorithm 1) is called to perform the essential initialization steps. The objective of this algorithm is to configure the data structures that hold the inputs of the problem, namely the task model and the platform model, as well as some initialization variables: γ, the discount factor reflecting the importance given to expected rewards, ε, and the ε-decrease factor (as defined in Section 5).
After the initialization step, Algorithm 3 uses an ε-greedy technique for the action selection step, which consists of randomly generating a number nb in [0, 1]. This algorithm compares nb with ε; if nb is equal to or less than ε, then the action (a (task, processor) pair) is randomly chosen; otherwise, it invokes the Task Selection algorithm, referred to as Algorithm 2, to determine the optimal task placement.
Following the execution of this procedure, the agent receives a reward pair (r.sc, r.e), where r.sc and r.e correspond to the slack-capacity and energy objectives, respectively. This pair of immediate rewards is employed to update the average reward R, which is in turn combined with the non-dominated set ND_t(S, a) to compute the Qset. Here, ND_t(S, a) signifies the reward pairs of non-dominated solutions at time step t for the state S and the action a. It is updated based on the non-dominated vectors of the next state S' over all possible actions a'. Similarly, Qset(S, a) represents a collection of vectors for the state-action pair (S, a). It stores the non-dominated set of Q-values obtained by adding the average immediate reward R to the non-dominated set ND_t expected in state S' and discounted by γ. Subsequently, the agent transitions to state S', removes the placed task from the list of unplaced tasks, and updates the deployment model D_t with the new placement; the agent then proceeds with another task placement until all tasks in the unplaced list are exhausted. At this point, the final D_t is added to D, and a new epoch searching for another non-dominated optimal model is started. The algorithm concludes by generating the set of non-dominated deployment models D. It must be emphasized that the epoch count is a critical parameter that needs to be carefully chosen to ensure the convergence of PQP. Convergence is reached when no new deployment model is generated that differs from those already present in D.
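The Qset update described above can be sketched as follows; this is a schematic reading of Pareto Q-learning's set-based update (Van Moffaert and Nowé, 2014), with our own data-structure choices (dictionaries keyed by state-action pairs) rather than the paper's exact implementation.

```python
# Schematic Pareto Q-learning update; data structures and names are ours.
GAMMA = 0.9

def nd(vectors):
    """ND operator: keep only the non-dominated reward vectors (maximization)."""
    return [v for v in vectors
            if not any(all(w[k] >= v[k] for k in range(len(v))) and w != v for w in vectors)]

def update_qset(qset, avg_reward, counts, state, action, reward, next_state, actions_of):
    """One learning step for the pair (S, a) after observing the reward vector and S'."""
    counts[(state, action)] = counts.get((state, action), 0) + 1
    n = counts[(state, action)]
    r_bar = avg_reward.get((state, action), (0.0, 0.0))
    r_bar = tuple(rb + (r - rb) / n for rb, r in zip(r_bar, reward))   # incremental average
    avg_reward[(state, action)] = r_bar
    # ND_t: non-dominated vectors over all Qsets reachable from the next state S'
    nd_t = nd([v for a2 in actions_of(next_state) for v in qset.get((next_state, a2), [])])
    # Qset(S, a) <- R(S, a) (+) gamma * ND_t  (vector translation of the ND set)
    translated = [tuple(rb + GAMMA * q for rb, q in zip(r_bar, v)) for v in nd_t]
    qset[(state, action)] = nd(translated) or [r_bar]   # fall back to R when ND_t is empty
```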
The strategy employed for action selection plays a
pivotal role as it significantly impacts both the speed
and quality of the agent’s learning process. In the
context of multi-objective optimization, this step be-
comes intricate, deviating from the straightforward
process in classical Q-learning, where the action eval-
uation considers a single optimal solution. Here, the
Algorithm 1: RL Initialization.
Data: τ: the task model; P: the platform model; ε, ε-decrease, γ: initialization parameters;
Result: Q: the Q-table; S: the initial state;
Q ← Create_Q_table(S, A);
Initialize Q[Qset] to empty sets;
Initialize S to the initial state;
return Q, S
Algorithm 2: Task Selection.
Data: S: current state; Q: Q-table;
Result: a: action (task, processor);
Qsg ← empty set;
foreach action a available in state S do
    foreach vector v in Qset(S, a) do
        Append (a, v) to Qsg;
    end
end
NDQS ← ND(Qsg);
a ← choose randomly one element from NDQS;
return a
evaluation must account for non-dominated solutions with respect to two conflicting objectives, namely the slack capacity and the consumed energy. Various strategies for the action selection process have been proposed in (Van Moffaert and Nowé, 2014). We specifically adopt the Pareto action selection method, identified as the most efficient in (Van Moffaert and Nowé, 2014). The implementation of Pareto action selection is detailed in Algorithm 2. Initially, the Q-values associated with each pair (action and its corresponding Q-set) for all available actions at state S are collected into a list called Qsg. Subsequently, the ND operator is applied to Qsg, eliminating dominated elements and retaining only the non-dominated ones in the NDQS list. From this list, the agent randomly selects an action. This random choice allows the agent to further explore the non-dominated set.
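A small Python sketch of this ε-greedy Pareto action selection, under our own representation of the Q-table as a dictionary mapping (state, action) pairs to lists of reward vectors; it is an illustration of the strategy, not the paper's exact code.

```python
import random

def dominates(p, q):
    """p dominates q when it is at least as good everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def pareto_action_selection(state, actions, qset, epsilon):
    """Pick a random feasible action with probability epsilon, otherwise a Pareto action (Alg. 2)."""
    if random.random() <= epsilon:
        return random.choice(actions)
    qsg = [(a, v) for a in actions for v in qset.get((state, a), [])]
    if not qsg:                       # nothing learned yet for this state: keep exploring
        return random.choice(actions)
    ndqs = [(a, v) for a, v in qsg
            if not any(dominates(w, v) for _, w in qsg)]
    return random.choice(ndqs)[0]
```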
7 CASE STUDY
This section illustrates the applicability of the pro-
posed PQP method based on the case study outlined
in (Haouari et al., 2022). The case study involves an
RTES implemented in a car to assist the driver in ve-
hicle control. The task model of the RTES consists of
five independent, periodic, and synchronous tasks:
Algorithm 3: Pareto Q-learning Placement (PQP).
Data: S: the initial state; τ: the task model; P: the platform model; ε, ε-decrease, γ: initialization parameters;
Result: D: the deployment models;
Notations: D_t: deployment model at time step t; a: the action (Task, Processor);
for t from 1 to number_of_epochs do
    (S, Q) ← RL_Initialization(τ, P);
    while τ ≠ ∅ do
        nb ← random number in [0, 1];
        if nb ≤ ε then
            Select a random possible action a;
        else
            a ← Task_Selection(S, Q);
        end
        Perform placement a, observe state S';
        r.sc ← Rsc;
        r.e ← Re;
        nb(S, a) ← nb(S, a) + 1;
        R(S, a) ← R(S, a) + (r − R(S, a)) / nb(S, a);
        ND_t ← ND(∪_{a'} Qset(S', a'));
        Qset(S, a) ← R(S, a) ⊕ γ · ND_t;
        S ← S';
        update τ /* remove the placed task from τ */;
        update D_t with the new placement a;
    end
    D ← D ∪ {D_t};
end
return D
three dedicated to sensor measurements such as the car's speed, the car's temperature, and the car's GPS location, and two for displaying the measured values such as the measurement summary and the map of the current car location. In this paper, we expand upon the platform model described in (Haouari et al., 2022) by incorporating three heterogeneous processors instead of homogeneous ones. Table 1 offers a detailed description of the case study, where T_i denotes the task names (i ∈ [1..5]), Pr_i corresponds to the task periods, and C_1, C_2, and C_3 match the worst-case execution times of T_i on P_1, P_2, and P_3, respectively.
In order to determine the appropriate deployment
model for the specified case study, the designer em-
ploys the PQP method, implemented in Python 3 with
the NumPy library. This analysis is conducted us-
ing the task model outlined in Table 1 on the plat-
Table 1: Task model description.

T_i   Pr_i   C_1   C_2   C_3
T_1   10     1     1.5   2
T_2   10     2     3     4
T_3   40     3.2   4.8   6.4
T_4   12     2     3     4
T_5   6      1     1.5   2

Table 2: Hardware model description.

P_j   Capacitance   Voltage (V)   Frequency (GHz)
P_1   1             2             2
P_2   1             4             3
P_3   1             6             4
form model described in Table 2. Certain parameters
need to be initialized to ensure the proper function-
ing of the PQP algorithms, including γ, which serves
as the discount factor quantifying the significance as-
signed to future rewards. In our approach, we deem
future task placements important for generating the
final deployment model, and therefore, we assign a
sufficiently high value to γ (= 0.9). For the action
selection, ε is assigned a value of 0.9, and we con-
sider an ε-decrease value of 0.001 (cf. Section 5).
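For reference, the case-study inputs from Tables 1 and 2 and the hyperparameters quoted above can be written down as plain Python data; the variable names are ours.

```python
# Case-study data (Tables 1 and 2) and PQP hyperparameters; variable names are ours.
periods = [10, 10, 40, 12, 6]                       # Pr_i for T_1..T_5
wcet = [[1, 1.5, 2],                                # c_ij: rows are tasks, columns are processors
        [2, 3, 4],
        [3.2, 4.8, 6.4],
        [2, 3, 4],
        [1, 1.5, 2]]
capacitance = [1, 1, 1]                             # zeta_j
voltage = [2, 4, 6]                                 # v_j (V)
frequency = [2, 3, 4]                               # f_j (GHz)

GAMMA = 0.9              # discount factor
EPSILON = 0.9            # initial exploration rate
EPSILON_DECREASE = 0.001
```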
Table 3 illustrates the non-dominated solutions (de-
ployment models) derived from the PQP algorithm’s
execution. There are four ways for the designer to
allocate tasks across various processors, offering a
trade-off between system extensibility (slack capac-
ity) and energy consumption. These four solutions
provide the designer with valuable insights on how to
implement the car system under optimal conditions.
Importantly, these insights are presented without any a priori imposition of designer preferences. In-
deed, the designer has the opportunity to analyze the
various solutions and select from Table 3 the one that
aligns best with his preferences and constraints re-
lated to the two objectives. For instance, the first
deployment model enables maximum system exten-
sibility, albeit at the expense of consuming the high-
est amount of energy. In contrast, the third model exhibits the least slack capacity in exchange for minimizing energy consumption. It prohibits any updates to the task model due to the heavy load on processor P_1.
Moreover, extending the properties of the tasks would
render the current deployment model infeasible. The
second solution is also noteworthy as it maintains an
acceptable extensibility level while having a very rea-
sonable energy consumption value.
In multi-policy RL techniques like Pareto Q-
learning, when multiple solutions are generated for
a given problem, algorithm convergence signifies the
conclusion of the agent’s learning process, indicat-
Table 3: Deployment models description (The Pareto front).

                     P_1                    P_2           P_3     Slack-capacity   Energy
DeploymentModel_1    {T_1, T_3, T_4}        {T_2}         {T_5}   0.34             349.6
DeploymentModel_2    {T_2, T_3, T_4}        {T_1, T_5}    {}      0.24             165.6
DeploymentModel_3    {T_1, T_2, T_3, T_4}   {T_5}         {}      0.14             119.6
DeploymentModel_4    {T_1, T_4, T_5}        {T_2, T_3}    {}      0.26             312.8
ing that the agent is no longer able to discover new
or superior solutions compared to the current ones,
regardless of the number of epochs used for training (Van Moffaert and Nowé, 2014). To investigate the convergence of the PQP algorithm, we conduct multiple PQP executions while varying the number of epochs (incrementing by 50 each time) and observe the generated solutions, as illustrated in Table 4. Nb_GS in this table represents the total number of solutions generated by PQP, whereas Nb_NDS refers to the number of non-dominated ones. It is
noteworthy that stability in solutions and convergence
are observed from 250 epochs onward. The spike
recorded at 150 epochs in the number of generated
solutions (7 solutions) can be attributed to the agent’s
immaturity at this stage, where it is still exploring its
environment through random task selection. Figure 3
depicts the learning progress of the agent. This learn-
ing progress is represented by the ratio of the number of non-dominated solutions (Nb_NDS) to the total number of generated solutions (Nb_GS) at each
algorithm iteration. We can see from this figure that,
after 250 epochs, PQP achieves convergence, yielding
a set of deployment models that align with the Pareto
front described in Table 3. From the data in Table 4
and the curve trend depicted in Figure 3, we observe
that as agents converge towards Pareto-optimal solu-
tions, the total number of generated placement mod-
els decreases. The decrease in overall count can be
attributed to the dominance of non-dominated solu-
tions during generation, leading to a reduction in the
total number of solutions. This trend is particularly
noticeable at epoch 150, where merely 7 solutions
exist, and only one among them is Pareto-optimal.
Conversely, at epoch 100, despite the agent’s less ad-
vanced learning stage, there are still 5 solutions in to-
tal. This occurrence can be explained by the fact that
the agent stumbles upon two non-dominated solutions
randomly, resulting in the effective elimination of the
majority of dominated ones.
Despite its significance, convergence alone is in-
sufficient to determine the efficiency of the algorithm.
Even if the algorithm converges and produces a set
of feasible task placements, these placements may be
dominated and far from the Pareto front solutions.
To address this issue and ensure a faithful evaluation
of PQP, we computed the Pareto front for this case
Table 4: PQP results with respect to the number of epochs.

Number of epochs   Nb_GS   Nb_NDS
50                 8       1
75                 6       1
100                5       2
125                5       2
150                7       1
175                5       2
200                4       3
225                4       3
250                4       4
275                4       4
300                4       4
Figure 3: Process of agent learning.
study. Initially, we generated the complete set of po-
tential task placements (finding 230 possible deploy-
ment models), then filtered out only the feasible ones
(132 deployment models are feasible). Subsequently,
we applied the non-dominated function to retain only
the task placements that constitute the Pareto front
(4 deployment models are non-dominated). Figure 4
compares the Pareto solutions produced by the PQP
algorithm, which correspond to the deployment mod-
els described in Table 3, with the true Pareto solu-
tions already computed for the considered case study.
As we can see from the figure, the obtained solutions
from the PQP execution (indicated by red diamonds)
perfectly match the actual deployment models (repre-
sented by black marks), proving the efficiency of our
method in producing accurate solutions. It’s worth
noting that determining the Pareto front is an NP-hard
Figure 4: Non-dominated solutions for the PQP approach vs. true Pareto-optimal solutions.
problem, and our success in computing it for this case
study is attributed to the small number of tasks and
processors considered.
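The exhaustive baseline described above is easy to reproduce for a problem of this size; a sketch (assuming the helpers is_feasible, slack_capacity, global_energy, and dominates defined in the earlier snippets) could look as follows.

```python
from itertools import product

def brute_force_pareto(n_tasks, n_procs, wcet, periods, freq, volt):
    """Enumerate every task-to-processor assignment, keep the feasible ones,
    and filter them down to the Pareto front (maximize slack, minimize energy)."""
    candidates = []
    for assignment in product(range(n_procs), repeat=n_tasks):   # all deployment models
        deployment = {j: {i for i, p in enumerate(assignment) if p == j}
                      for j in range(n_procs)}
        if not is_feasible(deployment, wcet, periods):
            continue
        slack = slack_capacity(deployment, wcet, periods)
        energy = global_energy(deployment, wcet, freq, volt)
        candidates.append((deployment, (slack, -energy)))         # negate energy to maximize
    return [d for d, obj in candidates
            if not any(dominates(other, obj) for _, other in candidates)]
```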
8 EXPERIMENTAL RESULTS
We experiment with the PQP algorithm on a more extensive general case (which we call GC for simplicity) that has been defined in (Lassoued and Mzid, 2022). In that study, a methodology employing genetic algorithms (which we call the GA-based algorithm) was presented to generate placement models that strike a balance between system extensibility and energy consumption.
Figure 5: Non-dominated solutions for the PQP approach vs. the GA-based approach (Lassoued and Mzid, 2022).
The eight black markers shown in Figure 5 illustrate the results ob-
tained from running their algorithm on GC. Similarly,
in the same Figure 5, the eleven red diamonds corre-
spond to the solutions derived from executing the PQP
algorithm under the same conditions. As a first in-
terpretation, we observe that the PQP algorithm pro-
duces a larger number of solutions compared to GA
based algorithm. Then the two algorithms share four
solutions (denoted by red diamond and a black mark):
(0.05, 112), (0.18, 176), (0.25, 304), and (0.34, 884).
However, the remaining four solutions from the other
algorithm are dominated by additional solutions from
PQP. Specifically, the solutions (0.26, 460), (0.28,
520), (0.31, 604), and (0.35, 1048) from the other al-
gorithm are respectively dominated by the PQP solu-
tions (0.29, 452), (0.29, 452), (0.31, 540), and (0.35,
1012). Additionally, PQP provides three solutions not present in the GA-based algorithm's results: (0.19, 240), (0.23, 276), and (0.27, 324).
Delving deeper into the analysis, it would be ad-
vantageous to compare the results against the true
Pareto front. However, due to the relatively com-
plex system involved, this task is not as straightfor-
ward as outlined in section 7. In such circumstances,
where the true Pareto front is unavailable, many stud-
ies (Zitzler and Thiele, 1998; Cao et al., 2015) have
advocated for the use of the hypervolume as an ef-
fective measure to evaluate multi-objective solutions.
The hypervolume, a scalar value, quantifies the area
bounded by a reference point (i.e., a solution that is
dominated by all generated solutions) and the non-
dominated solutions discovered by an algorithm (see
Section 3). What sets the hypervolume apart from
other multi-objective metrics is its ability to simul-
taneously quantify both the extent and diversity of
a set of solutions provided by an algorithm. Hence, the higher the hypervolume value, the larger the space of dominated solutions covered by the non-dominated solutions.
Figure 6: Hypervolume of the GA-based approach (Lassoued and Mzid, 2022).
Figures 6 and 7 exhibit the hypervolumes of the solution spaces for the GA-based algorithm and PQP, respectively. Figure 8 presents the juxtaposition
of the two hypervolume figures under consideration to
facilitate comparison. It is evident that PQP covers a
larger extent compared to (Lassoued and Mzid, 2022).
However, since visual interpretation alone may not
suffice, we compute the hypervolume values of the
Reinforcement Learning for Multi-Objective Task Placement on Heterogeneous Architectures with Real-Time Constraints
187
Figure 7: Hypervolume of the PQP algorithm.
Figure 8: Hypervolume comparison.
two algorithms following the methodology outlined in (Van Moffaert and Nowé, 2014; Cao et al., 2015). The computed values are 0.714 for the GA-based algo-
The computed values are 0.714 for GA based algo-
rithm and 0.734 for PQP, confirming the superior per-
formance of the latter.
To further demonstrate the capabilities of the PQP algorithm and the utility of maximizing the extensibility (slack capacity) of the system, we conduct a set of new experiments in which we tackle the refactoring issue in a multi-objective setting. For that, we refer to the two random systems used for the experiments in (Haouari et al., 2022) and run the refactoring process several times to compare the performance of PQP and the GA-based algorithm. To measure the refactoring time, we compute the T_generation factor (defined in (Haouari et al., 2022)), which refers to the time needed to take the system updates into account and to generate the new task placement models. Equation 7 specifies the computation of T_generation:

T_generation = T_initial + rf · T_refactoring   (7)
where T_initial represents the duration required to supply the deployment models for the initial system version, T_refactoring denotes the time necessary to generate new solutions that take system updates into account, and rf is the number of times the designer updates the system properties.
Figure 9: Evaluation of the generation time for randomly generated systems.
Figure 9 illustrates the T_generation evolution during the refactoring process for both the PQP and GA-based methods on the two random systems (i.e., system1 and system2). The curves depicted in Figure 9 illustrate that the GA-based algorithm requires
more time to accommodate adjustments in both sys-
tems. Furthermore, this necessary time becomes in-
creasingly substantial with the rise in refactoring fre-
quency, as well as with the increase in tasks and pro-
cessors within the system, as seen in system 2. These results can be explained by the fact that, in the GA-based approach, a new system is established for every refactoring request, requiring the algorithm to be executed anew each time. In the PQP approach, system updates are handled differently. Specifically, to define the external structure (i.e., the numbers of tasks and processors) of the system, the PQP algorithm utilizes the Q-table. As this structure of the system is preserved during the refactoring process, the construction of the Q-table is also maintained, thereby saving the time initially invested in this phase of the algorithm. Consequently, to address updates related to task properties, only the Q-values need to be recomputed, requiring a time of T_refactoring.
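As a purely hypothetical sketch of this reuse, keeping the Q-table skeleton and resetting only the learned values could look as follows; the run_epochs callback standing in for the PQP learning loop is our assumption, not the paper's implementation.

```python
# Hypothetical sketch of the refactoring shortcut: the Q-table skeleton is kept and only the
# learned values are recomputed, so handling an update costs T_refactoring rather than a full rebuild.

def refactor(qset, avg_reward, counts, new_wcet, new_periods, run_epochs):
    """Reuse the existing Q-table structure after a change of task properties."""
    for key in qset:                    # same (state, action) keys: task/processor counts unchanged
        qset[key] = []                  # drop Q-vectors learned with the old task properties
        avg_reward[key] = (0.0, 0.0)
        counts[key] = 0
    # Re-run only the PQP learning loop on the updated task model (this is T_refactoring).
    return run_epochs(qset, avg_reward, counts, new_wcet, new_periods)
```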
9 CONCLUSION
In this paper, we have presented a novel approach
leveraging reinforcement learning to address the
multi-objective optimization challenges inherent in
task placement problems. The optimization objec-
tives entail maximizing system extensibility while
minimizing energy consumption. To achieve this,
we introduce the PQP method, rooted in Pareto Q-
learning and tailored for the task placement prob-
lem in real-time embedded systems. The proposed
method approximates the Pareto front for mapping
tasks to heterogeneous processors. The Pareto front
represents the optimal deployment models that strike
a balance between the optimized objectives within
real-time constraints. Through empirical evaluation,
PQP demonstrates its efficacy compared to genetic al-
gorithms while also providing a solution to the refac-
toring problem, enabling designers to efficiently ex-
plore system configurations and adjustments.
As future work, we aim to extend PQP’s applica-
bility to more diverse case studies, incorporating addi-
tional objectives and refactoring scenarios. Addition-
ally, we plan to address the task scheduling process
from a multi-objective perspective, aiming to mini-
mize both worst-case response time and energy re-
quirements simultaneously.
REFERENCES
Akesson, B., Nasri, M., Nelissen, G., Altmeyer, S., and
Davis, R. I. (2020). An empirical survey-based study
into industry practice in real-time systems. In 2020
IEEE Real-Time Systems Symposium (RTSS), pages
3–11. IEEE.
Barto, A. G. (2021). Reinforcement learning: An introduction by Richard S. Sutton. SIAM Rev, 6(2):423.
Bellman, R. (1957). A markovian decision process. Journal
of mathematics and mechanics, pages 679–684.
Cao, Y., Smucker, B. J., and Robinson, T. J. (2015). On
using the hypervolume indicator to compare pareto
fronts: Applications to multi-criteria optimal exper-
imental design. Journal of Statistical Planning and
Inference, 160:60–74.
Caviglione, L., Gaggero, M., Paolucci, M., and Ronco,
R. (2021). Deep reinforcement learning for multi-
objective placement of virtual machines in cloud dat-
acenters. Soft Computing, 25(19):12569–12588.
Coello, C. A. C. (2007). Evolutionary algorithms for solv-
ing multi-objective problems. Springer.
Fonseca, C. M., Paquete, L., and López-Ibáñez, M. (2006). An improved dimension-sweep algorithm for the hy-
An improved dimension-sweep algorithm for the hy-
pervolume indicator. In 2006 IEEE international con-
ference on evolutionary computation, pages 1157–
1163. IEEE.
Haouari, B., Mzid, R., and Mosbahi, O. (2022). On the use
of reinforcement learning for real-time system design
and refactoring. In International Conference on Intel-
ligent Systems Design and Applications, pages 503–
512. Springer.
Haouari, B., Mzid, R., and Mosbahi, O. (2023a). Psrl: A
new method for real-time task placement and schedul-
ing using reinforcement learning. In Software Engi-
neering and Knowledge Engineering, pages 555–560.
ksi research.
Haouari, B., Mzid, R., and Mosbahi, O. (2023b). A re-
inforcement learning-based approach for online opti-
mal control of self-adaptive real-time systems. Neural
Computing and Applications, 35(27):20375–20401.
Huseyinov, I. and Bayrakdar, A. (2022). Novel nsga-ii and
spea2 algorithms for bi-objective inventory optimiza-
tion. Studies in Informatics and Control, 31(3):31–42.
Kashani, M. H., Zarrabi, H., and Javadzadeh, G. (2017). A
new metaheuristic approach to task assignment prob-
lem in distributed systems. In 2017 IEEE 4th Interna-
tional Conference on Knowledge-Based Engineering
and Innovation (KBEI), pages 0673–0677. IEEE.
Lakhdhar, W., Mzid, R., Khalgui, M., and Frey, G. (2018).
A new approach for optimal implementation of multi-
core reconfigurable real-time systems. In ENASE,
pages 89–98.
Lakhdhar, W., Mzid, R., Khalgui, M., Frey, G., Li, Z., and
Zhou, M. (2020). A guidance framework for synthesis
of multi-core reconfigurable real-time systems. Infor-
mation Sciences, 539:327–346.
Lassoued, R. and Mzid, R. (2022). A multi-objective evo-
lution strategy for real-time task placement on het-
erogeneous processors. In International Conference
on Intelligent Systems Design and Applications, pages
448–457. Springer.
Liu, C. L. and Layland, J. W. (1973). Scheduling algo-
rithms for multiprogramming in a hard-real-time en-
vironment. Journal of the ACM (JACM), 20(1):46–61.
Mehiaoui, A., Wozniak, E., Babau, J.-P., Tucci-
Piergiovanni, S., and Mraidha, C. (2019). Optimiz-
ing the deployment of tree-shaped functional graphs
of real-time system on distributed architectures. Auto-
mated Software Engineering, 26:1–57.
Mirjalili, S. (2019). Genetic algorithm. In Evolutionary
algorithms and neural networks, pages 43–55.
Van Moffaert, K. and Nowé, A. (2014). Multi-objective re-
inforcement learning using sets of pareto dominating
policies. The Journal of Machine Learning Research,
15(1):3483–3512.
Vidyarthi, D. P. and Tripathi, A. K. (2001). Maximizing
reliability of distributed computing system with task
allocation using simple genetic algorithm. Journal of
Systems Architecture, 47(6):549–554.
Yang, L., Sun, Q., Zhang, N., and Liu, Z. (2020). Opti-
mal energy operation strategy for we-energy of energy
internet based on hybrid reinforcement learning with
human-in-the-loop. IEEE Transactions on Systems,
Man, and Cybernetics: Systems, 52(1):32–42.
Zhu, Q., Zeng, H., Zheng, W., Natale, M. D., and
Sangiovanni-Vincentelli, A. (2013). Optimization of
task allocation and priority assignment in hard real-
time distributed systems. ACM Transactions on Em-
bedded Computing Systems (TECS), 11(4):1–30.
Zitzler, E. and Thiele, L. (1998). Multiobjective opti-
mization using evolutionary algorithms—a compara-
tive case study. In International conference on par-
allel problem solving from nature, pages 292–301.
Springer.