The results of all algorithms compared in this section provide valuable insights for improving the Q-learning approach. They suggest that incorporating some of the functionality of the moRBC climber into the transitions allowed by the Q-learning approach could improve its effectiveness. In addition, selection principles that are more robust in objective spaces of larger dimension should be considered. This implies different ways to compute the rewards and to select the solution from which an episode is restarted.
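To make the last point concrete, the Python sketch below illustrates two possible rules for selecting the solution that restarts an episode from an external archive of nondominated solutions: uniformly at random, or biased toward sparsely covered regions of the objective space. The archive, the function names, and both rules are illustrative assumptions of this sketch, not the mechanism evaluated in this paper.

```python
import random

def restart_uniform(archive, rng=random):
    # archive: list of (solution, objective_vector) pairs, assumed to hold
    # only mutually nondominated solutions found so far (hypothetical structure).
    # Rule 1: restart the episode from an archive member chosen uniformly at random.
    solution, _ = rng.choice(archive)
    return solution

def restart_least_crowded(archive):
    # Rule 2: favor the archive member whose objective vector is, on average,
    # farthest from the others, i.e. an isolated region of the objective space.
    def avg_distance(idx):
        _, f = archive[idx]
        others = [g for j, (_, g) in enumerate(archive) if j != idx]
        dists = [sum((a - b) ** 2 for a, b in zip(f, g)) ** 0.5 for g in others]
        return sum(dists) / max(len(dists), 1)

    best = max(range(len(archive)), key=avg_distance)
    return archive[best][0]
```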
5 CONCLUSION
In this work, we studied distributed and centralized Q-learning approaches for multi-objective optimization of binary epistatic problems using MNK-landscapes. We showed that the Q-learning based approaches scale up better than moRBC, NSGA-II, and MOEA/D as the number of objectives increases on problems with large epistasis. We also identified their weaknesses, particularly on landscapes with low epistasis. In addition, we analyzed the results of other MOEAs, taking into account their selection methods and variation operators together with properties of MNK-landscapes, to better understand the Q-learning based approaches, and we suggested ways to improve them. Our conclusions regarding the parameters of the Q-learning based approaches are as follows. The action that flips any bit is overall slightly superior to the action that flips the left or right neighboring bits. The centralized approach, using a reward based on Pareto dominance, does not scale up well with the dimension of the objective space.
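For reference, the sketch below spells out in Python the two kinds of actions compared above and a simple Pareto-dominance test of the sort a dominance-based reward relies on. The wrap-around indexing in the neighbor action and the exact reward values are assumptions of this sketch and only illustrative.

```python
def flip_bit(solution, i):
    # "Flip any bit" action: flip position i of the binary string.
    s = list(solution)
    s[i] ^= 1
    return s

def flip_neighbor(solution, i, direction):
    # "Flip the left or right neighboring bit" action: flip the left
    # (direction = -1) or right (direction = +1) neighbor of position i;
    # the wrap-around at the ends of the string is an assumption of this sketch.
    return flip_bit(solution, (i + direction) % len(solution))

def dominates(f_a, f_b):
    # Pareto dominance for maximization: f_a is no worse in every objective
    # and strictly better in at least one.
    return all(a >= b for a, b in zip(f_a, f_b)) and any(
        a > b for a, b in zip(f_a, f_b)
    )

def dominance_reward(f_new, f_old):
    # Illustrative dominance-based reward: +1 if the move produced a
    # dominating objective vector, -1 if it produced a dominated one, 0 otherwise.
    if dominates(f_new, f_old):
        return 1.0
    if dominates(f_old, f_new):
        return -1.0
    return 0.0
```

Note that as the number of objectives grows, most pairs of solutions become mutually nondominated, so a reward of this kind is zero most of the time, which is consistent with the scaling issue noted above.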
In the future, we would like to study other ways to assign rewards in the centralized approach, enhance the selection of the solution used as the initial state of an episode, and constrain transitions to non-improving states. We would also like to study the Q-learning approaches for many-objective optimization and analyze the optimization history obtained by Q-learning.
REFERENCES
Aguirre, H. and Tanaka, K. (2005). Random Bit Climbers on Multiobjective MNK-Landscapes: Effects of Memory and Population Climbing. IEICE Transactions, 88-A:334–345.
Aguirre, H. and Tanaka, K. (2007). Working Principles, Behavior, and Performance of MOEAs on MNK-landscapes. European Journal of Operational Research, 181:1670–1690.
Barrett, L. and Narayanan, S. (2008). Learning All Optimal Policies with Multiple Criteria. International Conference on Machine Learning, pages 41–47.
Deb, K. (2001). Multi-Objective Optimization using Evolu-
tionary Algorithms. John Wiley & Sons.
Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. (2002). A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197.
Drugan, M. (2019). Reinforcement Learning Versus Evolutionary Computation: A Survey on Hybrid Algorithms. Swarm and Evolutionary Computation, 44:228–246.
Gábor, Z., Kalmár, Z., and Szepesvári, C. (1998). Multi-Criteria Reinforcement Learning. International Conference on Machine Learning, 98:197–205.
Hayes, C., Rădulescu, R., Bargiacchi, E., Källström, J., Macfarlane, M., Reymond, M., Verstraeten, T., et al. (2022). A Practical Guide to Multi-Objective Reinforcement Learning and Planning. Autonomous Agents and Multi-Agent Systems, 32(1):26.
Jalalimanesh, A., Haghighi, H. S., Ahmadi, A., Hejazian,
H., and Soltani, M. (2017). Multi-Objective Op-
timization of Radiotherapy: Distributed Q-Learning
and Agent-Based Simulation. Journal of Experimen-
tal & Theoretical Artificial Intelligence, 29(5):1071–
86.
Lizotte, D., Bowling, M., and Murphy, S. (2010). Efficient Reinforcement Learning with Multiple Reward Functions for Randomized Controlled Trial Analysis. International Conference on Machine Learning (ICML), 10:695–702.
Mariano, C. and Morales, E. (2000). Distributed Reinforcement Learning for Multiple Objective Optimization Problems. In Proc. of Congress on Evolutionary Computation (CEC-2000), pages 188–195.
Moffaert, K. V., Drugan, M., and Nowé, A. (2013a). Hypervolume-Based Multi-Objective Reinforcement Learning. Evolutionary Multi-Criterion Optimization, pages 352–366.
Moffaert, K. V., Drugan, M., and Nowé, A. (2013b). Scalarized Multi-Objective Reinforcement Learning: Novel Design Techniques. IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 191–199.
Shen, X., Minku, L., Marturi, N., Guo, Y., and Han, Y.
(2018). A Q-Learning-Based Memetic Algorithm for
Multi-Objective Dynamic Software Project Schedul-
ing. Information Sciences, 428:1–29.
Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. The MIT Press.
Watkins, C. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292.
Zhang, Q. and Li, H. (2008). MOEA/D: A Multiobjective Evolutionary Algorithm Based on Decomposition. IEEE Transactions on Evolutionary Computation, 11(6):712–731.
Zitzler, E. (1999). Evolutionary Algorithms for Multiobjec-
tive Optimization: Methods and Applications. PhD
thesis, ETH Zurich, Switzerland.