reward (i.e., the action’s immediate effect on the
environment) and long-term cumulative reward (i.e.,
the contribution to the learning agent’s overall ob-
jectives). The basic mathematical model of RL is the Markov decision process (MDP) (Bellman, 1957). Fundamentally, an MDP models a sequential decision-making (control) problem in a stochastic environment where the control actions can influence the evolution of the system's state. An MDP is defined as a five-tuple (S, A, P, R, γ) as follows: S is the state space, and A is the action space. P: S × A × S → [0, 1] is the state transition probability function, where P(s′ | s, a) specifies the probability of transitioning to state s′ by taking action a in state s. R: S × A → ℝ is the reward function dictating the reward an agent receives by taking action a ∈ A in state s ∈ S, and γ ∈ [0, 1] is the discount factor (Sutton and Barto, 2018).
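To make the definition concrete, the following minimal Python sketch encodes a toy two-state MDP as plain data structures, samples transitions from P, and accumulates the γ-discounted cumulative reward; all concrete state, action, transition, and reward values are illustrative assumptions, not taken from the cited works.

```python
import random

# Toy MDP (S, A, P, R, γ); all concrete values below are illustrative.
S = ["s0", "s1"]      # state space S
A = ["a0", "a1"]      # action space A
GAMMA = 0.9           # discount factor γ ∈ [0, 1]

# P[s][a] maps each successor state s2 to P(s2 | s, a).
P = {
    "s0": {"a0": {"s0": 0.8, "s1": 0.2}, "a1": {"s0": 0.1, "s1": 0.9}},
    "s1": {"a0": {"s0": 0.5, "s1": 0.5}, "a1": {"s0": 0.0, "s1": 1.0}},
}

# R[s][a]: reward for taking action a in state s (R: S × A → ℝ).
R = {"s0": {"a0": 0.0, "a1": 1.0}, "s1": {"a0": 2.0, "a1": 0.0}}

def step(s, a):
    """Sample a successor state from P(· | s, a); return it with R(s, a)."""
    succs, probs = zip(*P[s][a].items())
    return random.choices(succs, probs)[0], R[s][a]

def discounted_return(s0, horizon=50):
    """Accumulate the long-term cumulative reward Σ γ^t r_t of a policy."""
    g, s = 0.0, s0
    for t in range(horizon):
        a = random.choice(A)        # placeholder policy: uniform random
        s, r = step(s, a)
        g += (GAMMA ** t) * r       # immediate reward, discounted by γ^t
    return g

print(discounted_return("s0"))
```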
It is important to note that the environment of our RL agent is the reconfigurable system (the controlled system), in contrast to classical RL frameworks, where the environment is represented by the uncontrolled system.
An RL agent relies on two main learning strategies: exploration and exploitation. Ensuring that the agent explores the environment sufficiently is a common challenge for RL algorithms, known as the exploration–exploitation dilemma. The ε-greedy policy is a well-known method to address this trade-off while training the RL agent: with probability ε the agent selects a random action (exploration), and otherwise it selects the action with the highest estimated value (exploitation), so that neither strategy is ever ruled out. Our exploration strategy uses constraints defined in UML models to give structure to the reconfiguration design space and thereby leverage additional information to guide exploration, as sketched below. Each configuration is considered a valid combination of the constraints defined on the reconfigurable active parts (elements) of the control system.
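The following minimal sketch illustrates this constraint-guided ε-greedy selection; the action set, the ε value, the q_table structure, and the constraints.is_satisfied interface standing in for the UML-defined constraints are all hypothetical names introduced for illustration.

```python
import random

ACTIONS = ["reconfig_a", "reconfig_b", "reconfig_c"]  # illustrative action set
EPSILON = 0.1                                         # exploration rate ε

def valid_actions(state, constraints):
    """Hypothetical filter: keep only the actions whose target configuration
    satisfies the constraints defined on the reconfigurable elements in the
    UML models (constraints.is_satisfied is an assumed interface). At least
    one valid configuration is assumed to exist in every state."""
    return [a for a in ACTIONS if constraints.is_satisfied(state, a)]

def epsilon_greedy(q_table, state, constraints):
    """ε-greedy selection restricted to constraint-valid configurations."""
    candidates = valid_actions(state, constraints)
    if random.random() < EPSILON:
        return random.choice(candidates)  # explore, but only valid actions
    return max(candidates, key=lambda a: q_table[(state, a)])  # exploit
```

Compared with purely random exploration, filtering the candidates through the UML constraints shrinks the design space the agent has to sample, which is the additional guidance the exploration strategy above relies on.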
3 RELATED WORK
Several solutions and research efforts already address RCS design using different approaches. In particular, for classical manufacturing control systems, several works (Thramboulidis and Frey, 2011) (Ben Hadj Ali et al., 2012) (Fay et al., 2015) (Ouselati et al., 2016) adapt UML and its extensions (such as SysML and MARTE) for designing and modeling the control logic (Vyatkin, 2013). These works often aim to reduce control software complexity by raising the abstraction level while ensuring automatic generation of PLC (Programmable Logic Controller) code compliant with the IEC 61131 and IEC 61499 standards (Vyatkin, 2013). In addition, more recent research works, such as (Thramboulidis and Christoulakis, 2016) (Schneider et al., 2019) (Müller et al., 2023) (Bazydło, 2023) (Parant, 2023), introduce UML-based solutions to model and design the control part of I4.0-compliant manufacturing systems, which are considered CPSs in which multiple concurrent software behaviors govern industrial components running on embedded controllers.
As a semi-formal language, UML is highly relevant for bridging the semantic gap between system design and the actual features of the control application. However, UML-based design approaches suffer from a lack of precise semantics. For this reason, several researchers propose combining UML diagrams with formal languages for the model-based design of RCS. Control model elements are formalized using formal languages (such as Petri nets, timed automata, etc.) to describe specific reconfiguration requirements and thus guarantee the consistency and correctness of the specification and code generation through verification techniques such as model checking (Vyatkin, 2013) (Mohamed et al., 2021). These approaches make it possible to verify that the system behaves correctly for all possible input scenarios by providing a precise description of the possible system behavior. However, most of them are based on an automated transformation from a system description with informally defined semantics, and they lack learning capabilities. In addition, the reviewed works have in common the exploitation of UML-based metamodels and models to deal with reconfiguration and the modeling of reconfigurable systems (Mohamed et al., 2021), and therefore allow the automation of several design steps, such as validation/verification and code generation. However, they are often static, since reconfiguration knowledge that is not anticipated at design time is handled by revising (modifying) existing models offline (Ben Hadj Ali and Ben Ahmed, 2023).
Furthermore, several works have proposed RL agents to model efficient reconfiguration controllers that can learn optimized reconfiguration policies (plans). The optimization goal is then formulated using the reward (objective) function of the RL agent (Wuest et al., 2016) (Kuhnle et al., 2020) (Shengluo and Zhigang, 2022) (Saputri and Lee, 2020). Despite these learning capabilities, the dynamicity of the reconfiguration space is only partially implemented within these approaches because they mainly focus on exploitation with random exploration. Therefore, an effective conceptual frame-