
lows: Section 2 provides an overview of the back-
ground relevant to the study, followed by a discussion
of related work in Section 3. In Section 4, we present
our proposed approach for optimizing the link con-
figuration. The experimental setup is detailed in Sec-
tion 5, and the results of the experiments are discussed
in Section 6. Finally, Section 7 concludes the paper
and outlines potential directions for future research.
2 BACKGROUND
2.1 Reinforcement Learning
Machine learning, a subfield of artificial intelligence, can be divided into three categories: supervised learning, unsupervised learning, and reinforcement learning (RL). RL is particularly well-suited for addressing sequential decision-making problems, such as those encountered in chess or in the game of Go, where programs such as AlphaGo (Deliu, 2023) have achieved significant successes in recent years (Silver et al., 2018). In RL, an agent learns a policy π by in-
et al., 2018). In RL, an agent learns a policy π by in-
teracting with an environment through trial and error,
with the goal of making optimal decisions. The two
main components in RL are the environment and the
agent. The environment is often modeled as a sim-
ulation of the real world, as an agent interacting di-
rectly with the real environment may be infeasible,
too risky or too expensive (Prudencio, Maximo, and
Colombini, 2023).
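To make this interaction loop concrete, the following minimal sketch shows a generic agent-environment cycle. It is purely illustrative: the gymnasium library, the CartPole environment, and the random policy are assumptions made for the example and are not components of this work.

import gymnasium as gym

# Illustrative only: environment and random policy are assumed for the example.
env = gym.make("CartPole-v1")

observation, info = env.reset(seed=0)
total_reward = 0.0

for t in range(200):
    # A real agent would select the action from its policy pi(a | s);
    # here a random action is sampled for illustration.
    action = env.action_space.sample()

    # The environment returns the new state and a reward as feedback.
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward

    if terminated or truncated:
        observation, info = env.reset()

env.close()
print(f"Accumulated reward: {total_reward}")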
The foundation of RL lies in the Markov Deci-
sion Process (MDP). In an MDP, the various states of
the environment are represented by states in the pro-
cess. At any given time t, the agent occupies a state
$S_t$. From this state, the agent can select from various actions to transition to other states. Thus, the system consists of state-action pairs, or tuples $(A_t, S_t)$. When the agent selects an action $A_t$ and executes it in the environment, the agent receives feedback. This feedback includes an evaluation of the performed action in the form of a reward and the new state $S_{t+1}$ of the environment. The reward may be positive or negative, representing either a benefit or a penalty. The agent's objective is to maximize the accumulated reward, as shown in Equation 1.
$$
\pi^{*} = \arg\max_{\pi} \, \mathbb{E}\left[\sum_{t=0}^{H} \gamma^{t} \cdot R(s_t, a_t)\right] \tag{1}
$$

Here, $\gamma$ denotes the discount factor, $H$ the time horizon, and $R(s_t, a_t)$ the reward received for executing action $a_t$ in state $s_t$.
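As a small illustration of Equation 1, the following sketch computes the discounted return of one recorded trajectory; the reward values and the discount factor are hypothetical.

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * R(s_t, a_t) over a finite horizon, as in Equation 1."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Hypothetical rewards collected over a short episode.
episode_rewards = [1.0, 0.0, -0.5, 2.0]
print(discounted_return(episode_rewards))  # 0.99**0*1.0 + 0.99**1*0.0 + ...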
A major challenge in finding the optimal policy
$\pi^{*}$ is the balance between exploration and exploita-
tion. Exploration refers to the degree to which the
agent explores the environment for unknown states,
while exploitation refers to the degree to which the
agent applies its learned knowledge to achieve the
highest possible reward. At the start of training, the
agent should focus more on exploring the environ-
ment, even if this means not always selecting the ac-
tion that yields the highest immediate reward (i.e.,
avoiding a greedy strategy). This approach allows the
agent to gain a more comprehensive understanding
of the environment and, ultimately, develop a better
policy. As the learning process progresses, the agent
should shift towards exploiting its knowledge more
and exploring less.
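A common way to realize this gradual shift from exploration to exploitation is an ε-greedy strategy with a decaying ε. The sketch below is a generic illustration under assumed parameter values; it is not the specific mechanism used in this work.

import random

def epsilon_greedy_action(q_values, epsilon):
    """With probability epsilon explore (random action); otherwise exploit
    (select the action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit

# Assumed schedule: start almost fully exploratory, decay towards exploitation.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995

q_values = [0.1, 0.4, 0.2]  # hypothetical action-value estimates for one state
for episode in range(1000):
    action = epsilon_greedy_action(q_values, epsilon)
    # ... interact with the environment and update q_values here ...
    epsilon = max(epsilon_min, epsilon * decay)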
In RL, there is a distinction between model-based and model-free approaches. In model-free RL, the agent learns solely from the feedback it receives while interacting directly with the environment, which may be the real world or a simulation. In contrast, in model-based RL, the agent also interacts with the environment, but it can additionally update its policy using a model of the environment. In addition to standard RL, there
exists an advanced method called deep reinforcement
learning (deep RL). In deep RL, neural networks
are used to approximate the value function $\hat{v}(s;\theta)$ or $\hat{q}(s, a;\theta)$, the policy $\pi(a \mid s;\theta)$, or the model. The
parameter θ represents the weights of the neural net-
work. The use of neural networks enables the learn-
ing of complex tasks. During training, the weights of the connections between neurons (also called nodes) are adjusted to improve the quality of the function approximation.
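As an illustration of such a function approximator, the sketch below defines a small feed-forward network for $\hat{q}(s, a;\theta)$. PyTorch, the layer sizes, and the state and action dimensions are assumptions made for the example, not choices taken from this paper.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates q_hat(s, a; theta): maps a state s to one value estimate
    per discrete action; theta are the weights of the layers."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.layers(state)

# Hypothetical dimensions: an 8-dimensional state and 4 discrete actions.
q_net = QNetwork(state_dim=8, num_actions=4)
q_values = q_net(torch.zeros(1, 8))   # value estimate for each action
action = int(q_values.argmax(dim=1))  # greedy action under the current theta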
3 RELATED WORK
Metaheuristics are commonly applied to optimization
problems and frequently yield efficient near-optimal
solutions in complex environments. A promising alternative to these traditional approaches is the use of RL algorithms, whose potential in the context of optimization problems has already been explored scientifically (T. Zhang, Banitalebi-Dehkordi, and Y. Zhang, 2022; Ardon, 2022; Li et al., 2021). Studies such as (Klar, Glatt, and Aurich, 2023) and (Mazyavkina et al., 2021) demonstrate that RL algorithms can achieve results similar to or even superior to those ob-
tained by standard metaheuristics. This applies to a
variety of optimization problems, including the Trav-
eling Salesman Problem (TSP) (Bello et al., 2016),
the Maximum Cut (Max-Cut) problem, the Minimum
Vertex Cover (MVC) problem, and the Bin Packing
Problem (BPP) (Mazyavkina et al., 2021).
While the work by (Klar, Glatt, and Aurich, 2023) does not address the arrangement of links on a transponder, it focuses on the planning of factory