A Study on Multi-Objective Optimization of
Epistatic Binary Problems Using Q-learning
Yudai Tagawa (https://orcid.org/0009-0005-4370-7633), Hernán Aguirre (https://orcid.org/0000-0003-4480-1339) and Kiyoshi Tanaka
Department of Electrical and Computer Engineering, Shinshu University, Wakasato, Nagano, Japan
Keywords:
Multi-Objective Optimization, Reinforcement Learning, Q-learning, MNK-landscapes.
Abstract:
In this paper, we study distributed and centralized approaches of Q-learning for multi-objective optimization
of binary problems and investigate their characteristics and performance on complex epistatic problems using
MNK-landscapes. In the distributed approach an agent receives its reward optimizing one of the objective
functions and collaborates with others to generate Pareto non-dominated solutions. In the centralized approach
the agent receives its reward based on Pareto dominance optimizing simultaneously all objective functions.
We encode a solution as part of a state and investigate two types of actions as one-bit mutation operators, two
methods to generate an episode’s initial state and the number of steps an agent is allowed to explore without
improving. We also compare with some evolutionary multi-objective optimizers showing that Q-learning
based approaches scale up better as we increase the number of objectives on problems with large epistasis.
1 INTRODUCTION
Multi-Objective Evolutionary Algorithms (MOEAs) (Deb, 2001) have been widely applied to solve real-world multi-objective optimization problems, and various types of algorithms have been proposed. MOEAs require further improvements in order to perform an efficient optimization at limited computational cost and cope with problems of increased difficulty, such as large-scale search spaces, many objective functions, and various shapes of the Pareto optimal front set.
In this work we focus on epistatic problems, where the performance of multi-objective optimizers using conventional mutation and recombination operators drops considerably as we increase the number of interacting variables. There is the expectation that in these problems operators guided by learning could lead to improvements. From this standpoint, we study multi-objective optimization using Q-learning (Drugan, 2019) (Watkins and Dayan, 1992), a type of reinforcement learning (RL) (Sutton and Barto, 1998). We want to understand whether Q-learning based search methods perform an effective exploration of large spaces in the presence of epistasis, aiming to develop robust and scalable multi-objective optimization algorithms.
Related works fall broadly in two categories, namely, multi-objective reinforcement learning (MORL) and multi-objective optimization combined with reinforcement learning (MOO-RL). The emphasis of MORL is the multi-objective sequential decision making of the agents to learn to perform a task when the reward space is multi-dimensional. Several MORL algorithms have been proposed. Most of them use linear scalarization functions to map the reward vector into a scalar (Lizotte et al., 2010) (Gábor et al., 1998) (Barrett and Narayanan, 2008) (Hayes et al., 2022) (Moffaert et al., 2013b) (Moffaert et al., 2013a).
On the other hand, MOO-RL emphasizes multi-objective solution search supported by RL, i.e. blending multi-objective optimizers with RL. MOO-RL can be subdivided in two major categories. In the first, the solution search is carried out by the optimizer applying its operators of variation and selection, whereas RL is applied to select strategies or configurations for the optimizer. There are a few works in this direction; for example, Q-learning has been used in dynamic multi-objective optimization to select global and local search strategies to be applied by a memetic algorithm (Shen et al., 2018) and to select strategies to initialize the population of the multi-objective optimizer (Zou et al., 2021) every time a critical dynamic event occurs.
The other major category of MOO-RL is where RL is used as a multi-objective optimizer. That is, a state includes the codification of a solution to the optimization problem and actions act as operators of variation to search in the solution space. There are very few previous works on RL applied as a multi-objective optimizer. For example, in (Mariano and Morales, 2000) a distributed approach was used to optimize 2 and 3 objective functions with two continuous variables. In (Jalalimanesh et al., 2017), a distributed Q-learning algorithm similar to (Mariano and Morales, 2000) is applied for multi-objective optimization of radiotherapy aiming to find Pareto-optimal solutions representing radiotherapy treatment plans.
We focus on the latter category of MOO-RL and study distributed and centralized approaches of Q-learning for multi-objective optimization of binary problems. In the distributed approach an agent receives its reward optimizing one of the objective functions and collaborates with others to generate Pareto non-dominated solutions. In the centralized approach the agent receives its reward based on Pareto dominance optimizing simultaneously all objective functions.
In order to understand the characteristics of the RL approaches, we conduct experiments solving MNK-landscapes (Aguirre and Tanaka, 2007), varying the number of binary variables N, the number of objectives M and the number of interacting variables K (epistatic interactions). We compare results with other MOEAs using 100-bit landscapes. We chose for the comparison the multi-objective random bit climber moRBC (Aguirre and Tanaka, 2005), NSGA-II (Deb et al., 2002) and the decomposition-based MOEA/D (Zhang and Li, 2008), whose performance is known on MNK-landscapes and thus allows us to better understand the effectiveness of the actions and reward approaches of the RL optimizers in terms of well-known selection methods and operators of variation as we scale up the objective space and the epistatic interactions between variables. We show that Q-learning based approaches can perform significantly better than the other algorithms on increasingly non-linear problems for a broad range of K. We also show that the comparison with the other algorithms provides valuable insights on how to further improve Q-learning approaches for epistatic problems.
Figure 1: Q-learning.
2 METHOD
2.1 Q-learning
Reinforcement learning (RL) is a method in which an agent learns what to do in given situations so as to maximize a numerical reward signal. In RL an agent is not told which actions to take, but instead must discover which actions yield the most reward by trying them. Q-learning is a type of RL that uses an off-policy temporal difference control algorithm to learn an action-value function Q, which approximates the optimal action-value function independently of the policy being followed (Watkins and Dayan, 1992). Fig. 1 illustrates the main components of Q-learning. When an agent takes action a in state s, a reward r and the next state s′ are passed from the environment. The value of Q is updated by the following equation,

Q(s,a) ← Q(s,a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s,a) ]      (1)
where α is the learning rate and γ is the discount rate, a constant between 0 and 1. The above updating equation means that when an action causes a transition from the current state s to the next state s′, its Q-value is brought closer to the value of the action a′ with the highest Q-value in the next state s′. This means that if a state has a high reward, that reward will propagate to the states that can reach that state with each update. This results in optimal learning of state transitions. The interaction between the agent and the environment is repeated until a terminal state has been reached. Each time an interaction takes place is called a step, and an episode denotes the multiple steps of interaction taken from the initial state to the terminal state. Distributed Q-learning (Mariano and Morales, 2000) is a method where multiple agents interact with the environment while usually referring to the same Q-table.
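As a minimal illustration of the update in Eq. (1), the short Python sketch below keeps a tabular Q function in a dictionary and applies one temporal-difference update; the function and variable names are ours, not part of the paper, and the default α and γ simply reuse the values listed later in Table 2.

    from collections import defaultdict

    Q = defaultdict(float)   # tabular Q(s,a); missing entries read as 0.0

    def q_update(s, a, r, s_next, actions_in_s_next, alpha=0.1, gamma=0.6):
        # Eq. (1): Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = max((Q[(s_next, a2)] for a2 in actions_in_s_next), default=0.0)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])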
Figure 2: State.
2.2 Q-learning for Multi-Objective
Optimization
2.2.1 Main Components
To apply Q-learning, the environment, states, actions and rewards have to be properly defined. In this work we focus on the optimization of multi-objective binary problems and the Q-learning components are defined to reflect that. Let us denote the vector x of binary variables as the environment where an agent can move and allow an agent at a given time to be positioned in one of the variables of the vector. Hence, a state s is represented by joining the binary solution instantiated in x = (x_0, ..., x_{n−1}) and the position p of the agent, where x_i ∈ {0,1}, n denotes the number of variables and p ∈ {0, ..., n−1}. The total number of states is n × 2^n with this representation. Fig. 2 illustrates the representation of a state s for n = 4 variables. An action causes the agent to move to another variable and flip its value. In other words, an action serves as a variation operator that mutates one bit of a solution to create a new one. We investigate two kinds of actions to transition from one state to another, which are detailed later in this section together with the way the reward is assigned.
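To make the representation concrete, the following sketch (our own illustration with hypothetical helper names) encodes a state as the pair of the binary vector x and the agent position p, which gives the n × 2^n states mentioned above, and shows the one-bit mutation performed when the agent moves to a position.

    def make_state(x, p):
        # A state joins the binary solution x (a tuple of 0/1 values) and the agent position p.
        return (tuple(x), p)

    def move_and_flip(state, new_p):
        # Action effect: the agent moves to position new_p and the bit there is flipped.
        x, _ = state
        x = list(x)
        x[new_p] ^= 1                        # one-bit mutation
        return (tuple(x), new_p)

    # Example for n = 4 variables (n * 2**n = 64 possible states):
    s = make_state([1, 0, 1, 1], p=1)
    s_next = move_and_flip(s, new_p=2)       # the agent moves to position 2 and flips that bit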
We study a distributed and a centralized approach for solving the task of multi-objective optimization. In the distributed approach an agent focuses on a particular objective function of the problem and its actions are rewarded for their relative quality in that objective function. Thus, multiple agents are required to cooperate, at least one agent per objective function, to solve the task in the distributed approach. In the centralized approach, an agent focuses on all objective functions and its actions are rewarded for their relative quality in the multi-objective space. In the following, for short, we refer to the agents used in the distributed approach as single-objective agents and to the agents used in the centralized approach as multi-objective agents. In both cases the objective is to find a set of Pareto solutions.
2.2.2 Multi-Agent Algorithm Framework
We implement a tunable multi-agent algorithm framework to investigate multi-objective optimization using Q-learning either in a distributed or centralized approach. Algorithm 1 illustrates the framework. In the following we explain relevant details of the algorithm.
Algorithm 1: Multi-objective optimization framework using Q-learning.
Data: init_type, agent_type, act_type, E, M, τ
Result: P, the set of non-dominated solutions found by the algorithm
1   Q ← InitializeQ()
2   P ← {}
3   for 1 to E do                               /* episodes */
4       S ← {}
5       for i ← 1 to M do                       /* agents */
6           s ← InitializeState(init_type, P)
7           x ← GetSolution(s)
8           P_i ← {x}
9           c ← 0
10          while c ≤ τ do                      /* i-th agent steps */
11              a ← SelectAction(act_type, s, Q)
12              s′ ← PerformAction(s, a)
13              x ← GetSolution(s′)
14              r ← ObserveReward(agent_type, x, P_i)
15              S ← S + (s, a, s′, r)
16              P_i ← P_i ∪ {x}
17              s ← s′
18              c ← UpdateCounter(agent_type, x, P_i)
19          end
20      end
21      foreach (s, a, s′, r) in S do
22          Q(s,a) ← Q(s,a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s,a) ]
23      end
24      P ← NonDominatedSolutions(P ∪ ⋃_{i=1,...,M} P_i)
25  end
26  return P
First, the quality table Q is initialized to zero for each combination of state-action, i.e. ∀s, ∀a: Q(s,a) = 0.0, and the bounded population P of non-dominated solutions is set to empty (lines 1-2). Next, the algorithm iterates for E episodes for each of the M specified agents (lines 3-24) and returns the set of non-dominated solutions found (line 26). In this work, if P exceeds its specified size it is truncated using crowding distance (Deb, 2001) (line 24).
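The truncation of P by crowding distance is not spelled out in Algorithm 1; the sketch below shows one standard way to implement it, following the crowding-distance definition of (Deb, 2001), under the assumption that the archive is passed as a list of objective vectors and ties are broken arbitrarily.

    def crowding_distance(front):
        # Crowding distance of each objective vector in a non-dominated front.
        n, m = len(front), len(front[0])
        dist = [0.0] * n
        for j in range(m):
            order = sorted(range(n), key=lambda i: front[i][j])
            fmin, fmax = front[order[0]][j], front[order[-1]][j]
            dist[order[0]] = dist[order[-1]] = float('inf')   # boundary solutions are always kept
            if fmax == fmin:
                continue
            for k in range(1, n - 1):
                dist[order[k]] += (front[order[k + 1]][j] - front[order[k - 1]][j]) / (fmax - fmin)
        return dist

    def truncate_by_crowding(front, size):
        # Keep the `size` most spread-out members of a non-dominated front.
        if len(front) <= size:
            return front
        d = crowding_distance(front)
        keep = sorted(range(len(front)), key=lambda i: d[i], reverse=True)[:size]
        return [front[i] for i in keep]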
The information associated with a step taken by an agent is given by the tuple (a, s, s′, r), where a is the action, s is the current state, s′ is the next state and r is the reward. When an episode starts, the list S that will contain the information of all the steps taken by all agents during an episode is initialized to empty (line 4). Before the first step of an episode for the i-th agent is taken, an initial state is defined, the population P_i of solutions visited by the agent during an episode is initialized with the solution x contained in the initial state s, and the counter c used to verify the termination of an episode is set to 0 (lines 6-9).
In each step of an episode, an action is selected and executed so that the i-th agent transitions from the current state s to a new state s′. The solution x contained in s′ is compared with the population of solutions P_i collected so far by the agent to compute the reward of the action, the tuple (a, s, s′, r) is added to S, the solution x is added to P_i, and the new state s′ becomes the current state s (lines 11-18).
Once all M agents have completed an episode, the quality table Q is updated with the information collected in S of all the steps taken by all agents during the episode (lines 21-23), and the set of non-dominated solutions is updated with the solutions visited by the agents contained in their respective populations P_i (line 24).
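For readers who prefer code to pseudocode, a condensed Python rendering of Algorithm 1 is given below. Every helper used here (initialize_state, get_solution, select_action, perform_action, observe_reward, update_counter, candidate_moves, non_dominated, truncate_by_crowding) is a hypothetical name standing in for the operations described in Sections 2.2.3 to 2.2.6 and in the listing; this is a sketch of the control flow, not the authors' implementation.

    def run_framework(n, init_type, agent_type, act_type, E, M, tau,
                      alpha=0.1, gamma=0.6, archive_size=100):
        Q = {}                                      # line 1: Q(s,a); missing entries read as 0.0
        P = []                                      # line 2: bounded archive of non-dominated solutions
        for _ in range(E):                          # line 3: episodes
            S, visited = [], []                     # line 4: steps (and solutions) of this episode
            for i in range(M):                      # line 5: one episode per agent
                s = initialize_state(init_type, P, n)            # line 6
                P_i = [get_solution(s)]                          # lines 7-8
                c = 0                                            # line 9
                while c <= tau:                                  # line 10: i-th agent steps
                    a = select_action(act_type, s, Q)            # line 11: epsilon-greedy choice
                    s_next = perform_action(s, a)                # line 12: move and flip one bit
                    x = get_solution(s_next)                     # line 13
                    r = observe_reward(agent_type, x, P_i, i)    # line 14
                    S.append((s, a, s_next, r))                  # line 15
                    P_i.append(x)                                # line 16
                    s = s_next                                   # line 17
                    c = update_counter(c, agent_type, x, P_i, i) # line 18
                visited.extend(P_i)
            for (s, a, s_next, r) in S:             # lines 21-23: batch Q update with Eq. (1)
                best = max((Q.get((s_next, a2), 0.0)
                            for a2 in candidate_moves(act_type, s_next)), default=0.0)
                Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best - Q.get((s, a), 0.0))
            P = truncate_by_crowding(non_dominated(P + visited), archive_size)   # line 24
        return P                                    # line 26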
2.2.3 Initial State
Two methods for generating an initial state (line 6) at the start of an agent's episode are studied. One of the methods generates randomly the solution x associated to the initial state, and the other one chooses a solution x from the set P of non-dominated solutions collected so far. In both methods, the position p of the agent is randomly determined. We select between these methods by setting init_type either to randomly or continuously, respectively.
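A possible realization of this initialization, consistent with line 6 of Algorithm 1, is sketched below; how a solution is picked from P under the continuously strategy is not specified in the paper, so a uniform random choice is assumed here.

    import random

    def initialize_state(init_type, P, n):
        # 'randomly': draw a fresh random solution; 'continuously': reuse an archived solution.
        if init_type == "continuously" and P:
            x = list(random.choice(P))               # assumed: uniform choice from the archive
        else:
            x = [random.randint(0, 1) for _ in range(n)]
        p = random.randrange(n)                      # the agent position is always random
        return (tuple(x), p)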
2.2.4 Types of Actions and Solution Generation
Two types of actions, called right-left (rl) and anywhere, are studied. The rl action moves the agent from its current position p to either the right (p + 1) or left (p − 1) neighboring position. On the other hand, the anywhere action moves the agent from its current position p to any of the n positions in the vector x, including p again. In both kinds of actions, the bit in the position where the agent moves to is flipped.
Fig. 3a shows an example of moving to the right from the current position p = 1 when the type of action is rl. The position after the move is p′ = 2, and the next state s′ is formed by joining x′ and p′. We consider the vector x as a circular array. That is, the position to the right of p = n − 1 is p′ = 0. Similarly, the position to the left of p = 0 is p′ = n − 1. When the type of action is rl, the number of actions an agent can choose from is 2, either right or left, independently of the dimension n of the vector x.
(a) rl. (b) anywhere.
Figure 3: Types of actions.
Fig. 3b shows an example of moving from the current position p = 1 to p′ = 3 when the type of action is anywhere. Since this example is a 4-bit problem, the number of actions an agent can choose from is 4. In general, when the type of action is anywhere, the number of actions an agent can choose from is n, the dimension of the vector x. In the framework, we select between these two types of actions by setting act_type to either rl or anywhere. In this work, the agents select the action in the current state probabilistically using an ε-greedy strategy. That is, with probability 1 − ε the action with the highest Q-value in the current state s is chosen, and with probability ε the action is chosen randomly.
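The two action types and the ε-greedy choice can be sketched as follows; an action is represented here simply by the target position, and the Q-table is keyed by (state, position) pairs, which is our own encoding rather than one prescribed by the paper.

    import random

    EPSILON = 0.1    # exploration probability (Table 2)

    def candidate_moves(act_type, state):
        # rl: the two circular neighbors of p; anywhere: all n positions, including p itself.
        x, p = state
        n = len(x)
        if act_type == "rl":
            return [(p + 1) % n, (p - 1) % n]        # circular array: right and left neighbors
        return list(range(n))                        # 'anywhere'

    def select_action(act_type, state, Q):
        # Epsilon-greedy: random move with probability epsilon, otherwise the highest-Q move.
        moves = candidate_moves(act_type, state)
        if random.random() < EPSILON:
            return random.choice(moves)
        return max(moves, key=lambda m: Q.get((state, m), 0.0))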
2.2.5 Reward Assignment
Rewards are given in different ways depending on the type of agent. In the case of distributed agents, if the generated solution x improves the fitness value of the best solution in P_i, in the fitness function the agent is in charge of, the agent receives a positive reward equal to the size of P_i. Otherwise, the reward is negative and equal to the number of solutions in P_i that are better than x. In the case of centralized agents, the generated solution x is compared using the Pareto dominance relationship with the non-dominated solutions in P_i. If x is dominant, the agent receives a positive reward equal to the number of solutions that x dominates. If x is dominated, the reward is negative and equal to the number of solutions that dominate x. Otherwise, if x is non-dominated with respect to P_i, the reward is 1.
(a) distributed agents. (b) centralized agents.
Figure 4: Distributed and centralized agents behavior.
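The two reward schemes can be written down compactly as below. The sketch assumes maximization (consistent with the hypervolume reference point at the origin used later), that the objective vectors of the solutions collected in P_i before the new solution is added are available, and that in the centralized case the comparison is made against the non-dominated subset of P_i, as stated above; the two functions correspond to ObserveReward of Algorithm 1 for the two agent types.

    def dominates(u, v):
        # Pareto dominance for maximization: u dominates v.
        return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

    def reward_distributed(fx, P_i_F, i):
        # Single-objective agent i: compare objective i of the new solution with P_i.
        if fx[i] > max(f[i] for f in P_i_F):
            return len(P_i_F)                             # positive reward = |P_i|
        return -sum(1 for f in P_i_F if f[i] > fx[i])     # negative: number of better solutions

    def reward_centralized(fx, nd_front):
        # Multi-objective agent: Pareto-compare the new solution with the non-dominated set of P_i.
        dominated_by_x = sum(1 for f in nd_front if dominates(fx, f))
        dominating_x = sum(1 for f in nd_front if dominates(f, fx))
        if dominated_by_x > 0:
            return dominated_by_x          # x dominates some archived solutions
        if dominating_x > 0:
            return -dominating_x           # x is dominated
        return 1                           # x is non-dominated with respect to P_i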
2.2.6 Agent’s Episode Termination Condition
To determine whether an agent has reached a terminal state (line 10), we keep a counter c of the number of consecutive times an agent i fails to improve the best solutions in its corresponding population P_i. Once this counter goes above a threshold τ, i.e. c > τ, the episode for that agent ends. In the case of distributed single-objective agents, the counter c increases if the fitness value of the solution x extracted from the new state (line 13), in the corresponding fitness function the i-th agent is assigned to, does not improve the fitness value of the best solution in P_i. In the case of centralized multi-objective agents, the counter c increases if the solution x extracted from the new state (line 13) is Pareto-dominated by at least one solution in P_i. Fig. 4a and 4b illustrate the single-objective and multi-objective agents' search and how the counter c is updated when they optimize a two-objective problem.
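The improvement test behind UpdateCounter (line 18 of Algorithm 1) can be sketched as below; for brevity it works directly on objective vectors, previous_F denotes the objective vectors of the solutions the agent collected before the current step, and resetting the counter to 0 on improvement is an assumption implied by the word "consecutive" above.

    def update_counter(c, agent_type, fx, previous_F, i):
        # Returns the new value of the non-improvement counter c (the episode ends when c > tau).
        if agent_type == "single":
            improved = fx[i] > max(f[i] for f in previous_F)            # better in objective i
        else:  # "multi"
            improved = not any(dominates(f, fx) for f in previous_F)    # not Pareto-dominated
        return 0 if improved else c + 1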
Table 1: Parameters: MNK-landscapes.
    Objectives M              2, 3, 4
    Variables N               100
    Interacting Variables K   1, 2, 3, 5, 7, 10, 15, 20
    Variables Interaction     random

Table 2: Parameters: Q-learning.
    Episodes                  2 × 10^6 evaluations
    Agent Type                single, multi
    Action Type               rl, anywhere
    Initial State             continuously
    τ                         0
    ε, α, γ                   0.1, 0.1, 0.6
    Population size           100
3 EXPERIMENTS
We compare the performance of Q-learning based multi-objective optimization with NSGA-II (Deb et al., 2002), the multi-objective random bit climber moRBC (Aguirre and Tanaka, 2005) and MOEA/D (Zhang and Li, 2008) using large MNK-landscapes with M = 2, 3 and 4 objectives, N = 100 bits, varying the number of epistatic bits K from 1 to 20. In these experiments all algorithms run until 200,000 fitness evaluations have been completed. Parameters of the MNK-landscapes used in our study are summarized in Table 1. Parameters used for Q-learning are summarized in Table 2 and parameters for the other MOEAs in Table 3.
In all experiments, results are reported for 10 trials of the algorithms in the same MNK-landscape with different random seeds. We use the Hypervolume (HV) (Zitzler, 1999) as the evaluation metric, setting the reference point to (0, ..., 0).
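For the two-objective case, and with maximization and the reference point at the origin, the hypervolume of a non-dominated set reduces to a simple sweep, sketched below as a sanity reference; for M = 3 and 4 a general HV routine is needed and is not shown.

    def hypervolume_2d(front, ref=(0.0, 0.0)):
        # HV of a mutually non-dominated 2-objective front under maximization.
        pts = sorted(front)                  # ascending in f1, hence descending in f2
        hv, prev_f1 = 0.0, ref[0]
        for f1, f2 in pts:
            hv += (f1 - prev_f1) * (f2 - ref[1])
            prev_f1 = f1
        return hv

    # Example with three mutually non-dominated points:
    print(hypervolume_2d([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]))   # prints 6.0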
4 RESULTS AND DISCUSSION
In this section we observe the performance of the centralized and distributed approaches using the two different types of action, varying the number of objectives M from 2 to 4 and the number of interacting bits K from 1 to 20. This allows us to understand better the Q-learning based approaches when we scale up the dimension of the objective space and the complexity of the landscape.
Table 3: Parameters: Other MOEAs.
    parameter             NSGA-II     moRBC      MOEA/D
    Generations           2000        2000       2000
    Population size       100         100        100
    Crossover             two-point   -          two-point
    Mutation              bit flip    bit flip   bit flip
    Neighborhood size     -           -          20
    Scalarized function   -           -          Tchebycheff
In the following experiments we fix τ to 0, the threshold for the counter of the number of consecutive times an agent fails to improve the best solutions in its corresponding population. This threshold has shown the best results in our experiments. Also, we use a solution selected from the population P of non-dominated solutions to generate the initial state of an episode, i.e. the continuously strategy.
Fig. 5 plots HV over K for all four possible combinations of agent type and action type. Results show the HV of the final population P of non-dominated solutions after 200,000 fitness evaluations. Note that for 2 objectives, the multi-objective agent performs better when K is low, and the single-objective agent with anywhere action performs better when K is high. For 3 and 4 objectives, the single-objective agents with anywhere action achieve the highest HV. Note that the centralized approach with a multi-objective agent and anywhere action can perform better than the distributed approaches only for M = 2 objectives and 2 ≤ K ≤ 5. In all other cases, M = 2 for K ≥ 7 and M = 3, 4 for all values of K, the distributed approach with single-objective agents and anywhere action overall performed better. As the dimension of the objective space increases it becomes clear that the centralized approach using a reward given by Pareto dominance does not scale up well, as seen in Fig. 5c for M = 4. A centralized approach offers the possibility to reduce the number of agents required for the multi-objective search. However, results in this work clearly suggest that a reward based on Pareto dominance could only be effective in a very limited subset of problems. It could be worth exploring in the future other forms to assign rewards for a centralized agent. The rl action overall does not appear superior to anywhere in terms of performance. However, the combined state-action space induced by rl is significantly smaller than that of anywhere. Actions rl and anywhere can be seen as extreme cases in terms of the size of the neighborhood of the position codified in the state where a bit can be mutated. It could be useful to explore actions where the size of the current position's neighborhood is between 2 (rl) and n (anywhere).
(a) M2N100. (b) M3N100. (c) M4N100.
Figure 5: Agent Types and Action Types (100-bits).
Next, we compare the Q-learning distributed approach using action anywhere for multi-objective optimization with NSGA-II, moRBC and MOEA/D, running for the same number of fitness evaluations as the Q-learning based approaches (200,000) and setting their population size to 100. Fig. 6 shows HV over K, similar to Fig. 5.
(a) M2N100. (b) M3N100. (c) M4N100.
Figure 6: Comparison to NSGA-II, moRBC and MOEA/D.
Before we discuss this figure in detail, it is worth remembering some properties of MNK-landscapes. By enumeration it has been shown that, increasing K > 0, the landscape becomes rugged and the height of the peaks increases until medium values of K. Thereafter the peaks remain of similar height for medium to large values of K. The hypervolume of the true Pareto front follows a trend similar to the height of the peaks (Aguirre and Tanaka, 2007).
Now, looking at Fig. 6, it should be noted that the increase in hypervolume when varying K from 1 to 5 for all algorithms is in accordance with the properties of the landscapes. However, for K ≥ 7 the hypervolume decreases monotonically with K for all algorithms, which means that the performance of all algorithms drops substantially for K ≥ 7. Also, note that there is not a dominant algorithm for all K and M. However, some important trends can be observed. The Q-learning based approach is the best performing algorithm in 3 and 4 objectives for K ≥ 5 and K ≥ 10, respectively, and the second best for 2 objectives and K ≥ 7. It is also noticeably weak in all objectives for K ≤ 3. On the other hand, MOEA/D is a very strong algorithm in 2, 3 and 4 objectives for K ≤ 5, but its performance drops faster than moRBC and the Q-learning approach for K ≥ 7. The moRBC is overall the strongest algorithm in 2 objectives for K ≥ 3 and similar to or better than NSGA-II for all K and M. NSGA-II is competitive only in 2 objectives for K ≤ 2 and scales up badly for 3 and 4 objectives for all K.
The difference in performance among the algorithms is due to the combined effectiveness of the operators of variation and selection included in the algorithms. The Q-learning based approach, moRBC and NSGA-II use Pareto dominance based ranking in their selection mechanism. It is well known that, as the dimension of the objective space increases, algorithms with this kind of ranking scale up poorly compared to a decomposition based approach like MOEA/D. In addition, in smooth landscapes the regions of non-dominance are broad and solutions in the Pareto front are evenly distributed. Thus, it is not surprising that MOEA/D with its uniform distribution of weights outperforms the other algorithms for small K. However, as K increases and the landscapes become rugged, the regions of non-dominance become fragmented and smaller, inducing non-uniform Pareto fronts where solutions are more separated in objective and decision space (Aguirre and Tanaka, 2007). Here the effectiveness of the operators of variation becomes more relevant, in addition to selection. MOEA/D for large K keeps the relative advantage of its selection mechanism for 3 and 4 objectives, but the combination of crossover and mutation loses effectiveness. The better performance of moRBC compared with NSGA-II is explained by the thorough exploration of local optima by one-bit mutations rather than by more disruptive operators like crossover. The actions in the Q-learning approach are also one-bit mutations. The Q-table offers a path to improving moves once an episode is restarted, guiding the exploitation of promising regions and climbing to better local optima, which becomes more difficult without learning, as evidenced by the results for large K. However, different from moRBC, the actions in the Q-learning approach allow transitions to states with non-improving solutions and are far less thorough in exploring local optima.
The results by all the algorithms compared in this section provide valuable insights to improve the Q-learning approach. They suggest that incorporating some of the functionality of the moRBC climber into the transitions allowed for the Q-learning approach could improve its effectiveness. In addition, ways to include selection principles that are more robust in objective spaces of larger dimensions should be considered. This implies different ways to compute the rewards and the selection of the solution to restart an episode.
5 CONCLUSION
In this work, we studied distributed and centralized approaches of Q-learning for multi-objective optimization of binary epistatic problems using MNK-landscapes. We showed that the Q-learning based approaches scale up better than moRBC, NSGA-II and MOEA/D as we increase the number of objectives on problems with large epistasis. Also, we identified their weaknesses, particularly in low epistatic landscapes. In addition, we analyzed results of other MOEAs taking into account their selection method and operators of variation together with properties of MNK-landscapes to better understand the Q-learning based approaches and suggested forms to improve them. Our conclusions regarding the parameters of the Q-learning based approaches are as follows. The action that flips any bit is overall slightly superior to the action that flips the left or right neighboring bit. The centralized approach, using a reward based on Pareto dominance, does not scale up well with the dimension of the objective space.
In the future, we would like to study other ways to assign rewards for the centralized approach, enhance the selection of solutions for the initial state of an episode, and constrain transitions to non-improving states. We would also like to study the Q-learning approaches for many-objective optimization and analyze the optimization history obtained by Q-learning.
REFERENCES
Aguirre, H. and Tanaka, K. (2005). Random Bit Climbers on Multiobjective MNK-Landscapes: Effects of Memory and Population Climbing. IEICE Transactions, 88-A:334-345.
Aguirre, H. and Tanaka, K. (2007). Working Principles, Behavior, and Performance of MOEAs on MNK-landscapes. European Journal of Operational Research, 181:1670-1690.
Barrett, L. and Narayanan, S. (2008). Learning All Optimal Policies with Multiple Criteria. International Conference on Machine Learning, pages 41-47.
Deb, K. (2001). Multi-Objective Optimization using Evolutionary Algorithms. John Wiley & Sons.
Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. (2002). A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182-197.
Drugan, M. (2019). Reinforcement Learning Versus Evolutionary Computation: A Survey on Hybrid Algorithms. Swarm and Evolutionary Computation, 44:228-246.
Gábor, Z., Kalmár, Z., and Szepesvári, C. (1998). Multi-Criteria Reinforcement Learning. International Conference on Machine Learning, 98:197-205.
Hayes, C., Rădulescu, R., Bargiacchi, E., Källström, J., Macfarlane, M., Reymond, M., Verstraeten, T., et al. (2022). A Practical Guide to Multi-Objective Reinforcement Learning and Planning. Autonomous Agents and Multi-Agent Systems, 32(1):26.
Jalalimanesh, A., Haghighi, H. S., Ahmadi, A., Hejazian, H., and Soltani, M. (2017). Multi-Objective Optimization of Radiotherapy: Distributed Q-Learning and Agent-Based Simulation. Journal of Experimental & Theoretical Artificial Intelligence, 29(5):1071-86.
Lizotte, D., Bowling, M., and Murphy, S. (2010). Efficient Reinforcement Learning with Multiple Reward Functions for Randomized Controlled Trial Analysis. International Conference on Machine Learning (ICML), 10:695-702.
Mariano, C. and Morales, E. (2000). Distributed Reinforcement Learning for Multiple Objective Optimization Problems. In Proc. of Congress on Evolutionary Computation (CEC-2000), pages 188-195.
Moffaert, K. V., Drugan, M., and Nowé, A. (2013a). Hypervolume-Based Multi-Objective Reinforcement Learning. Evolutionary Multi-Criterion Optimization, pages 352-66.
Moffaert, K. V., Drugan, M., and Nowé, A. (2013b). Scalarized Multi-Objective Reinforcement Learning: Novel Design Techniques. IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 191-99.
Shen, X., Minku, L., Marturi, N., Guo, Y., and Han, Y. (2018). A Q-Learning-Based Memetic Algorithm for Multi-Objective Dynamic Software Project Scheduling. Information Sciences, 428:1-29.
Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. The MIT Press.
Watkins, C. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279-292.
Zhang, Q. and Li, H. (2008). MOEA/D: A Multiobjective Evolutionary Algorithm Based on Decomposition. IEEE Transactions on Evolutionary Computation, 11(6):712-731.
Zitzler, E. (1999). Evolutionary Algorithms for Multiobjective Optimization: Methods and Applications. PhD thesis, ETH Zurich, Switzerland.
Zou, F., Yen, G., Tang, L., and Wang, C. (2021). A Reinforcement Learning Approach for Dynamic Multi-Objective Optimization. Information Sciences, 546:815-34.