Authors:
Saba Yahyaa, Madalina Drugan and Bernard Manderick
Affiliation:
Vrije Universiteit Brussel, Belgium
Keyword(s):
Multi-armed Bandit Problems, Multi-objective Optimization, Linear Scalarized Function, Scalarized Function Set, Thompson Sampling Policy.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Computational Intelligence; Evolutionary Computing; Knowledge Discovery and Information Retrieval; Knowledge Representation and Reasoning; Knowledge-Based Systems; Machine Learning; Soft Computing; Symbolic Systems
Abstract:
In the stochastic multi-objective multi-armed bandit (MOMAB), each arm generates a vector of stochastic normally distributed rewards, one per objective, instead of a single scalar reward. As a result, there is not a single optimal arm but a set of optimal arms (the Pareto front) under the Pareto dominance relation. The goal of an agent is to find the Pareto front. To find the optimal arms, the agent can use a linear scalarization function that transforms the multi-objective problem into a single-objective problem by summing the weighted objectives. Selecting the weights is crucial, since different weights result in selecting different optimal arms from the Pareto front. Usually, a predefined set of weights is used, which can be computationally inefficient when different weights optimize the same Pareto optimal arm and some arms in the Pareto front are never identified. In this paper, we propose a
number of techniques that adapt the weights on the fly in order to improve the performance of the scalarized
MOMAB.
We use genetic and adaptive scalarization functions from multi-objective optimization to generate
new weights. We propose to use a Thompson sampling policy to select more frequently the weights that identify new
arms on the Pareto front. We experimentally show that Thompson sampling improves the performance of the
genetic and adaptive scalarization functions. All the proposed techniques improve the performance of the
standard scalarized MOMAB with a fixed set of weights.
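
For reference, the linear scalarization mentioned in the abstract can be written as follows; the symbols f_w, w^d and mu_i^d are our own notation for illustration, not necessarily the paper's:

    f_w(mu_i) = sum_{d=1}^{D} w^d * mu_i^d,   with w^d >= 0 and sum_{d=1}^{D} w^d = 1,

where mu_i = (mu_i^1, ..., mu_i^D) is the mean reward vector of arm i over the D objectives. The scalarized bandit then pulls the arm maximizing f_w(mu_i); different weight vectors generally single out different arms on the Pareto front, which is why the choice of weights matters.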
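
The following is a minimal sketch of how a Thompson sampling policy could prefer weight vectors that keep identifying new Pareto arms, assuming a Beta-Bernoulli model over "this weight found a new Pareto arm"; the function name, the specific weight vectors and the success/failure bookkeeping are illustrative assumptions, not the paper's exact algorithm.

    import random

    def thompson_select_weight(weight_stats):
        """Pick a weight-vector index by Thompson sampling.

        weight_stats: list of (successes, failures) pairs, where a 'success'
        means that playing the scalarized bandit under that weight vector
        identified a new Pareto-optimal arm (Beta-Bernoulli assumption).
        """
        samples = [random.betavariate(s + 1, f + 1) for s, f in weight_stats]
        return max(range(len(samples)), key=lambda i: samples[i])

    # Hypothetical usage: three candidate weight vectors for a 2-objective MOMAB.
    weights = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
    stats = [(0, 0)] * len(weights)   # (new-arm successes, failures) per weight

    for t in range(100):
        j = thompson_select_weight(stats)
        w = weights[j]
        # ... play the scalarized bandit under w and check whether a new
        # Pareto arm was identified (placeholder; depends on the environment) ...
        found_new_arm = False
        s, f = stats[j]
        stats[j] = (s + 1, f) if found_new_arm else (s, f + 1)

Weights whose Beta posterior concentrates on high success probability are sampled more often, so the policy spends its pulls on weight vectors that still reveal new arms of the Pareto front rather than on weights that repeatedly optimize the same arm.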