Fuzzy Rewards on the Multi-Armed Bandits Model
Ciria R. Briones-García¹, Raúl Montes-de-Oca²ᵃ, Víctor H. Vázquez-Guevara¹ᵇ and Hugo Cruz-Suárez¹ᶜ
¹Facultad de Ciencias Físico Matemáticas, Benemérita Universidad Autónoma de Puebla, San Claudio y 18 sur San Manuel 72570, Puebla, Mexico
²Departamento de Matemáticas, Universidad Autónoma Metropolitana-Iztapalapa, Av. Ferrocarril San Rafael Atlixco 186, Col. Leyes de Reforma 1 A Sección, Alcaldía Iztapalapa 09310, Ciudad de México, Mexico
ciria.briones@alumnos.buap.mx, momr@xanum.uam.mx, {vvazquez, hcs}@fcfm.buap.mx
Keywords:
Armed Bandits Model, Gittins Index, Fuzzy Reward, Trapezoidal Fuzzy Number.
Abstract:
In this paper, an extension of the Armed Bandits problem is considered in which reward functions take trapezoidal fuzzy values as the result of a fuzzy affine transformation (which can be interpreted as receiving "approximately" a reward located in some interval instead of the reward itself). The main objective is to find an optimal selection strategy that maximizes the fuzzy total expected discounted reward with respect to both the partial order based on α-cuts and the order provided by the average ranking. To this end, it is shown that the Gittins strategy (which is optimal in the crisp setting) remains optimal in the fuzzy paradigm. In addition, it is found that the optimal stopping time associated with the crisp Gittins index is the same for its fuzzy counterpart, by establishing a link between the fuzzy and crisp versions of the Gittins index. This link leads to the result that the fuzzy value function is connected to its crisp analog via a fuzzy affine transformation; with this in mind, it is possible to ensure that the value function lies approximately in a certain interval related to the fuzzy transformation.
1 INTRODUCTION
This paper considers a Simple Family of Armed Bandits with the expected total discounted reward criterion, in which a trapezoidal fuzzy transformation of the reward is discussed. This topic leads to a transition between the theory of Multi-Armed Bandits and the theory of fuzzy numbers.
On the one hand, the first Multi-Armed Bandits problem was discussed in (Thompson, 1933), in which two treatments of unknown efficiency are considered with the goal of adaptively assigning as many patients as possible to the treatment with the greatest success rate. Notably, in (Gittins and Jones, 1974) and (Gittins, 2018) the so-called Dynamic Allocation Index, later renamed the Gittins index, was developed. It is a scalar quantity associated with each process in the Multi-Armed Bandits setting and leads to the Gittins optimal strategy: at each stage, the controller must select the process with the highest Gittins index.
ᵃ https://orcid.org/0000-0002-7632-9190
ᵇ https://orcid.org/0000-0001-6602-8733
ᶜ https://orcid.org/0000-0002-0732-4943
On the other hand, fuzzy theory was first introduced by Zadeh in his pioneering work "Fuzzy sets" (Zadeh, 1965), where the seminal definitions and operations are stated.
To the best of our knowledge, there are no previous works on Multi-Armed Bandits in a fuzzy environment. However, the Multi-Armed Bandits treated here are related to fuzzy Markov decision processes (MDPs), especially to fuzzy discounted MDPs. Concerning the antecedents of fuzzy MDPs, firstly observe that (Carrero-Vera et al., 2022) and (Cruz-Suárez et al., 2023b) present fuzzy MDPs on discrete spaces with objective functions other than the total discounted cost. Secondly, (Carrero-Vera et al., 2020), (Cruz-Suárez et al., 2023a), (Kurano et al., 1996), (Kurano et al., 2003), (Kurano et al., 1998), and (Semmouri et al., 2020) provide discounted MDPs with different fuzzy characteristics; but, as already said, none of these develops the specific theme of Multi-Armed Bandits.
The main contribution of this paper is to demonstrate that the Gittins strategy is also optimal in the fuzzy framework, which is done by first finding a relation between the fuzzy and crisp Gittins indexes.
DOI: 10.5220/0013160600003893
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 14th International Conference on Operations Research and Enterprise Systems (ICORES 2025), pages 271-277
ISBN: 978-989-758-732-0; ISSN: 2184-4372
Proceedings Copyright © 2025 by SCITEPRESS Science and Technology Publications, Lda.
The rest of the paper is organized as follows: Section 2 discusses preliminary tools of fuzzy theory; Section 3 deals with the crisp Armed Bandits model and the Gittins index; in Section 4 the Armed Bandits model with fuzzy rewards is considered, and it contains the main results of this work. Finally, Section 5 is devoted to an application setting of the theory presented in Section 4.
Notation 1.1. In this article, it will be necessary to establish a difference between crisp and fuzzy operations; hence, the standard mathematical symbols will be marked with an asterisk (∗) in the fuzzy context. Moreover, some special functions that appear as fuzzy quantities, say, the reward function, the optimal value function, and so on, will be distinguished with a "tilde"; for instance, the fuzzy reward function will be written as r̃.
2 PRELIMINARIES ON FUZZY THEORY
In this section, introductory fuzzy theory will be displayed; for a general discussion see (Diamond and Kloeden, 1994) and (Zadeh, 1965).
Let Λ be a non-empty set, which denotes the universal set of the discourse. A fuzzy set Γ on Λ is defined in terms of its membership function m_Γ, which assigns to each element of Λ a real value in [0, 1], i.e. Γ = {(g, m_Γ(g)) : g ∈ Λ}. The α-cut of Γ, denoted by Γ_α, is defined as the set Γ_α = {x ∈ Λ : m_Γ(x) ≥ α} for 0 < α ≤ 1, and Γ_0 is the closure of {x ∈ Λ : m_Γ(x) > 0}, denoted by cl{x ∈ Λ : m_Γ(x) > 0}. In the sequel, it will be assumed that Λ = ℝ.
Definition 2.1. A fuzzy number Γ is a fuzzy set defined on the set of real numbers ℝ which satisfies:
a) m_Γ is normal, i.e. there exists x_0 ∈ ℝ with m_Γ(x_0) = 1;
b) m_Γ is convex, i.e. Γ_α is convex for all α ∈ [0, 1];
c) m_Γ is upper-semicontinuous;
d) Γ_0 is compact.
The set of fuzzy numbers will be denoted by F(ℝ).
The manuscript will focus its attention on trape-
zoidal fuzzy numbers, which are typically considered
when the degree of membership for particular values
is known to be asymmetric. This class of fuzzy num-
bers has been successfully applied; for example, to
transportation problems (Raj et al., 2023) and portfo-
lio selection (Pahade and Jha, 2021).
Definition 2.2. A fuzzy number Γ is called a trapezoidal fuzzy number if its membership function has the following form:
m_Γ(x) = ((x − l)/(m − l)) I_(l,m](x) + I_(m,n](x) + ((p − x)/(p − n)) I_(n,p](x),
where l, m, n and p are real numbers with l < m ≤ n < p, and I_A is the indicator function of the set A. Hence, a trapezoidal fuzzy number will be represented by (l, m, n, p).
Remark 2.3. For a trapezoidal fuzzy number Γ = (l, m, n, p), its corresponding α-cuts are given by
Γ_α = [(m − l)α + l, p − (p − n)α], α ∈ [0, 1]
(Rezvani and Molani, 2014).
It may be demonstrated that the following operations between trapezoidal fuzzy numbers hold (Rezvani and Molani, 2014):
Lemma 2.4. If H = (a_l, a_m, a_n, a_p) and I = (b_l, b_m, b_n, b_p) are trapezoidal fuzzy numbers and λ is a positive number, it follows that
a) λH = (λa_l, λa_m, λa_n, λa_p), and
b) H +* I = (a_l + b_l, a_m + b_m, a_n + b_n, a_p + b_p).
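The α-cut formula of Remark 2.3 and the operations of Lemma 2.4 can be sketched in Python (a minimal illustration; the class and method names are ours, not part of the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trapezoid:
    """Trapezoidal fuzzy number (l, m, n, p) with l < m <= n < p."""
    l: float
    m: float
    n: float
    p: float

    def alpha_cut(self, alpha):
        """Alpha-cut [(m - l)*alpha + l, p - (p - n)*alpha] (Remark 2.3)."""
        return ((self.m - self.l) * alpha + self.l,
                self.p - (self.p - self.n) * alpha)

    def scale(self, lam):
        """Multiplication by a positive scalar (Lemma 2.4 a)."""
        assert lam > 0
        return Trapezoid(lam * self.l, lam * self.m, lam * self.n, lam * self.p)

    def add(self, other):
        """Fuzzy addition +* (Lemma 2.4 b): componentwise sum."""
        return Trapezoid(self.l + other.l, self.m + other.m,
                         self.n + other.n, self.p + other.p)
```

For instance, `Trapezoid(1, 2, 3, 5).alpha_cut(1)` returns the kernel interval `(2, 3)`.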
Let D denote the set of all closed bounded intervals on the real line ℝ. For Ψ = [a_l, a_u], Φ = [b_l, b_u] ∈ D define
d(Ψ, Φ) = max(|a_l − b_l|, |a_u − b_u|).
It is possible to verify that (D, d) is a complete metric space (see (Puri and Ralescu, 1986)). Now, if η̃ ∈ F(ℝ), then, as its membership function satisfies b), c) and d) of Definition 2.1, it follows that η_α ∈ D. Therefore, d̂ : F(ℝ) × F(ℝ) → ℝ is defined by
d̂(η̃, μ̃) = sup_{α∈[0,1]} d(η_α, μ_α),
for η̃, μ̃ ∈ F(ℝ).
Now, in order to establish comparisons between trapezoidal fuzzy numbers, let η̃, μ̃ ∈ F(ℝ), with α-cuts η̃_α = [a_α, b_α] and μ̃_α = [c_α, d_α], α ∈ [0, 1], respectively, and define the partial order "⪯*" on F(ℝ) by
η̃ ⪯* μ̃ if and only if a_α ≤ c_α and b_α ≤ d_α,  (1)
for all α ∈ [0, 1] (Furukawa, 1997).
It is also possible to compare fuzzy numbers via the so-called "average ranking", given for the trapezoidal fuzzy number η̃ = (a_l, a_m, a_n, a_p) by
R(η̃) := (1/4)(a_l + a_m + a_n + a_p).
Hence, it will be said that trapezoidal fuzzy numbers η̃ = (a_l, a_m, a_n, a_p) and μ̃ = (a¹_l, a¹_m, a¹_n, a¹_p) are such that η̃ ⪯** μ̃ if and only if R(η̃) ≤ R(μ̃).
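The two comparison criteria, the α-cut order (1) and the average ranking, can be sketched as follows, with trapezoids encoded as 4-tuples (function names are ours). Since the α-cut endpoints of a trapezoid are linear in α, checking a grid of levels is more than enough (in fact, α = 0 and α = 1 would suffice):

```python
def alpha_cut(t, alpha):
    """Alpha-cut of a trapezoid t = (l, m, n, p), per Remark 2.3."""
    l, m, n, p = t
    return ((m - l) * alpha + l, p - (p - n) * alpha)

def leq_alpha_cuts(t1, t2, grid=101):
    """Partial order (1): endpoint-wise <= at every alpha level."""
    for i in range(grid):
        a = i / (grid - 1)
        (lo1, hi1), (lo2, hi2) = alpha_cut(t1, a), alpha_cut(t2, a)
        if lo1 > lo2 or hi1 > hi2:
            return False
    return True

def average_ranking(t):
    """Average ranking R(t) = (l + m + n + p) / 4."""
    return sum(t) / 4
```

For example, (1, 2, 3, 4) precedes (2, 3, 4, 5) under both criteria.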
For providing a formal treatment of random entities in fuzzy environments, some concepts on fuzzy random variables are presented.
Definition 2.5. Let (Ω, F) be a measurable space and (ℝ, B(ℝ)) be the measurable space of the real numbers. A fuzzy random variable is a function X̃ : Ω → F(ℝ) such that Gr(X̃_α) := {(ω, u) ∈ Ω × ℝ : u ∈ (X̃(ω))_α} ∈ F ⊗ B(ℝ), for all α ∈ [0, 1].
Definition 2.6. Given a probability space (Ω, A, P), a fuzzy random variable Ỹ associated to (Ω, A) is said to be an integrably bounded fuzzy random variable with respect to (Ω, A, P) if there is a function h : Ω → ℝ, h ∈ L¹(Ω, A, P), such that |u| ≤ h(ω) for all (ω, u) ∈ Ω × ℝ with u ∈ (Ỹ(ω))_0 := Ỹ_0(ω).
Definition 2.7. Given an integrably bounded fuzzy random variable Ỹ with respect to the probability space (Ω, A, P), the fuzzy expected value of Ỹ in Aumann's sense is the unique fuzzy set E*[Ỹ] of ℝ such that for each α ∈ [0, 1]:
E*[Ỹ]_α = { ∫_Ω f(ω) dP(ω) | f : Ω → ℝ, f ∈ L¹(P), f(ω) ∈ (Ỹ(ω))_α a.s. [P] }.
3 THE ARMED BANDITS MODEL AND THE GITTINS INDEX
A Simple Family of Armed Bandits Processes is a particular class of discrete-time Markov control process. In this class of process, the controller (player) must play one among K ∈ ℤ₊ available Armed Bandit processes. At each time t ∈ {0, 1, ...} and for each bandit process, the controller has two possible actions: either play it or not. In case of playing the m-th Armed Bandit process, the controller receives a reward (from the m-th process) and only the m-th system moves to another state according to a Markovian dynamic. The objective is to determine an activation policy that maximizes the total expected reward of the overall selection process. Formally:
Let {X(t) = (X_1(t), ..., X_K(t)) : t = 0, 1, ...} be the state of a Simple Family of Armed Bandits Processes, where for each t ∈ {0, 1, ...}, X_i(t) is a discrete random variable defined on the probability space (Ω, F, P) and taking values in a finite non-empty set X called the state space. Let a(t) = (a_1(t), a_2(t), ..., a_K(t)) be the action selected by the controller at time t. For each t ∈ {0, 1, ...}, a_i(t) is a binary function, which takes the value 1 if the player chooses the i-th Armed Bandit process, and a_i(t) = 0 otherwise. Hence, the action space can be defined for each time t as A(t) := {(a_1(t), a_2(t), ..., a_K(t)) : Σ_{i=1}^K a_i(t) = 1}. Consider for each i = 1, 2, ..., K, r_i : X → ℝ the corresponding reward function, and let P_i = [p_i(x, y)] be the Markovian transition law of the stochastic process {X_i(t)}.
Define Π := {{a(t) : t = 0, 1, ...} : a(t) ∈ A(t)}, the set of all admissible policies (or activation strategies). Then for π ∈ Π and x ∈ X^K the expected total discounted value function is defined as follows:
V(π, x) := E^π_x [ Σ_{t=0}^∞ β^t Σ_{i=1}^K r_i(X_i(t)) a_i(t) ],  (2)
where 0 < β < 1 is a discount factor. The expectation operator E^π_x is associated with the product measure P^π_x defined on (Ω^K, F^K), induced by the Ionescu-Tulcea theorem (Puterman, 2014). Then the Armed Bandits problem consists in determining an activation strategy π° ∈ Π that maximizes the total expected discounted reward, i.e.
V(π°, x) = sup_{π∈Π} V(π, x),
x ∈ X^K. The function V°(x) := V(π°, x) is called the optimal value function.
One approach to solve this problem was proposed in (Gittins and Jones, 1974), where for each bandit process one may compute the dynamic allocation index (or simply the Gittins index), which depends only on that process, and then at each time the player operates the bandit process with the highest index. The Gittins index is defined as follows (Gittins, 2018):
G(x) = sup_{τ>0} G(x, τ), x ∈ X,  (3)
where
G(x, τ) := E_x[ Σ_{t=0}^{τ−1} β^t r_i(X_i(t)) ] / E_x[ Σ_{t=0}^{τ−1} β^t ]
and τ is a stopping time associated with the stochastic process {X_i(t)}, i.e. for each n ∈ ℕ, [τ = n] ∈ F_n := σ(X_i(1), X_i(2), ..., X_i(n)).
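For a finite-state bandit, the supremum in (3) can be computed numerically via the restart-in-state characterization of the Gittins index due to Katehakis and Veinott (an equivalent formulation, not used in this paper's proofs). A sketch, with our own naming:

```python
def gittins_index(P, r, x, beta, tol=1e-10):
    """Gittins index G(x) via the restart-in-x MDP (Katehakis-Veinott):
    in every state y the player either continues from y or restarts the
    chain at x, and G(x) = (1 - beta) * V(x), where V solves
        V(y) = max(r[y] + beta * sum_z P[y][z] V(z),
                   r[x] + beta * sum_z P[x][z] V(z)).
    P is a row-stochastic transition matrix, r the reward vector."""
    S = range(len(r))
    V = [0.0] * len(r)
    while True:
        def q(y):
            # Expected one-step reward plus discounted continuation from y.
            return r[y] + beta * sum(P[y][z] * V[z] for z in S)
        restart = q(x)
        V_new = [max(q(y), restart) for y in S]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return (1 - beta) * V_new[x]
        V = V_new
```

For instance, a two-state chain paying 1 in state 0 and then absorbing in state 1 with reward 0 has G(0) = 1 for any discount factor, since stopping after one step attains the supremum of the ratio in (3).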
In the remainder of the manuscript, the subscript i will be removed for the sake of simplicity in the argumentation. Moreover, the following assumption on the reward function is considered.
Assumption 3.1. (a) r(·) ≥ 0 and r(·) has finite support.
(b) E_x[ Σ_{t=0}^∞ r(X(t)) ] < ∞, for all x ∈ X.
Remark 3.2. (a) Assumption 3.1 (b) is satisfied, for instance, under a transient Markov condition (Martínez-Cortés, 2021).
(b) Moreover, observe that Assumption 3.1 (b) implies that
0 ≤ Σ_{t=0}^∞ β^t r(X(t)) ≤ Σ_{t=0}^∞ r(X(t)) < ∞,  (4)
P_x-almost surely.
4 FUZZY ARMED BANDITS
This section is devoted to a fuzzy version of the Armed Bandits problem introduced in the previous section. Firstly, consider the following assumption.
Assumption 4.1. Let Λ̃ = (b, d, f, k) and Λ̃₁ = (b₁, d₁, f₁, k₁) be two fixed trapezoidal fuzzy numbers, with 0 ≤ b < d ≤ f < k and 0 ≤ b₁ < d₁ ≤ f₁ < k₁. It is also supposed that
r̃(x) = Λ̃ r(x) +* Λ̃₁,  (5)
with x ∈ X, where r is a reward function as considered in the previous section and such that Assumption 3.1 holds.
The intuition behind expression (5) is that instead of receiving the reward r(x), one will obtain a reward that is approximately within the interval [d r(x) + d₁, f r(x) + f₁].
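A numerical reading of (5): by Lemma 2.4, each crisp reward r(x) is mapped to the trapezoid (b r(x) + b₁, d r(x) + d₁, f r(x) + f₁, k r(x) + k₁), whose kernel is the interval [d r(x) + d₁, f r(x) + f₁] above. A sketch with trapezoids encoded as 4-tuples (the function name and the numbers below are illustrative only):

```python
def fuzzy_reward(r_x, Lam, Lam1):
    """r~(x) = Lam * r(x) +* Lam1 (equation (5)), for trapezoids
    Lam = (b, d, f, k), Lam1 = (b1, d1, f1, k1) and r_x = r(x) >= 0."""
    return tuple(a * r_x + a1 for a, a1 in zip(Lam, Lam1))
```

With, say, Λ̃ = (0.9, 0.95, 1.05, 1.1) and Λ̃₁ = (0, 0.1, 0.2, 0.3), a crisp reward of 10 becomes approximately (9, 9.6, 10.7, 11.3): "approximately 10", located in the interval [9.6, 10.7]. Note that r(x) = 0 yields Λ̃₁, consistent with the null-reward case of Section 5.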
In a similar way to (2), the fuzzy total expected discounted reward is defined, for π ∈ Π and x ∈ X^K, by:
Ṽ(π, x) := E*_x [ Σ_{t=0}^∞ β^t Σ_{i=1}^K r̃_i(X_i(t)) a_i(t) ].  (6)
Then the fuzzy Armed Bandits problem consists in determining an activation strategy π° ∈ Π that maximizes, with respect to any of the considered fuzzy orders (based on α-cuts and on the average ranking), the fuzzy total expected discounted reward, i.e. such that
Ṽ(π°, x) = sup_{π∈Π} Ṽ(π, x),  (7)
x ∈ X^K. The function Ṽ°(x) := Ṽ(π°, x) is called the fuzzy optimal value function.
The objective now is to propose a fuzzy version of the Gittins index in order to characterize the optimal selection policy in the fuzzy frame. For this, consider the following sets: S_F := {ω ∈ Ω : Σ_{t=0}^∞ r(X(t, ω)) < ∞} and S_I := {ω ∈ Ω : Σ_{t=0}^∞ r(X(t, ω)) = ∞}. Then, define the random variable:
Y(ω) := Σ_{t=0}^{τ−1} β^t r(X(t, ω)) if ω ∈ A(r, τ), and Y(ω) := 0 if ω ∈ B(r, τ),  (8)
where A(r, τ) := [τ < ∞] ∪ ([τ = ∞] ∩ S_F) and B(r, τ) := [τ = ∞] ∩ S_I.
Observe that Y is a well-defined function due to Ω = [τ < ∞] ∪ ([τ = ∞] ∩ S_F) ∪ ([τ = ∞] ∩ S_I) and P([τ = ∞] ∩ S_I) = 0, see Remark 3.2 (b). In consequence,
Σ_{t=0}^{τ−1} β^t r(X(t)) = Y, P_x-a.s.  (9)
Now, some notation about the range of the random variable Y given by (8) will be presented. For this, note that [τ < +∞] = ∪_{n=1}^{+∞} [τ = n]. Hence, the characteristics of Y restricted to each element of the resulting partition of the sample space are as follows:
(a) On [τ = n], for n ≥ 1, the image of Y is denoted by Y[τ = n] = {y^n_1, y^n_2, . . .}.
(b) In addition, Y[[τ = +∞] ∩ S_F] is a denumerable set.
(c) And, finally, Y[[τ = +∞] ∩ S_I] = {0}.
In a similar way, for the random variable
Z := Σ_{t=0}^{τ−1} β^t < ∞, P_x-a.s.,  (10)
whose range adopts the notation Z[τ = n] = {z_n}, n = 1, 2, . . ., and Z[τ = +∞] = {z}.
The previous facts will be utilized in the proof of the following lemma, which is the first step towards a fuzzy version of the Gittins index.
Lemma 4.2. Let
Ỹ(ω) := Σ_{t=0}^{τ−1} β^t r̃(X(t, ω)) if ω ∈ A(r, τ), and Ỹ(ω) := z Λ̃₁ if ω ∈ B(r, τ).  (11)
Then Ỹ is a fuzzy random variable.
Proof. Fix α ∈ [0, 1]. Firstly observe that Ỹ defined in (11) can be represented as
Ỹ = Λ̃ Y +* Λ̃₁ Z,  (12)
as a consequence of (8) and (10). Then, the α-cut of Ỹ is given by
Ỹ_α = [Y b(α) + Z b₁(α), Y c(α) + Z c₁(α)],
where [b(α), c(α)] and [b₁(α), c₁(α)] denote the α-cuts of Λ̃ and Λ̃₁, respectively. Thus the graph of Ỹ_α can be written as follows:
Gr(Ỹ_α) = ∪_{n=1}^∞ ∪_{j=1}^∞ ( ([Y = y^n_j] ∩ [Z = z_n]) × [y^n_j b(α) + z_n b₁(α), y^n_j c(α) + z_n c₁(α)] ) ∪ ( ([Y = 0] ∩ [Z = z]) × [z b₁(α), z c₁(α)] ).
Consequently, Gr(Ỹ_α) ∈ F ⊗ B(ℝ). Hence Ỹ is a fuzzy random variable.
Lemma 4.3. Ỹ is an integrably bounded fuzzy random variable with respect to (Ω, F, P_x) and
E*_x[Ỹ] = E_x[Y] Λ̃ +* E_x[Z] Λ̃₁, x ∈ X.  (13)
Proof. Let x ∈ X be fixed. To prove that Ỹ is integrably bounded, note that
Ỹ_0(ω) = [b Y(ω) + Z(ω) b₁, Y(ω) k + Z(ω) k₁], ω ∈ Ω.
Define h : Ω → ℝ given by
h(ω) = Y(ω) k + Z(ω) k₁, ω ∈ Ω.
Then, if (ω, u) ∈ Ω × ℝ with u ∈ Ỹ_0(ω), it yields that
b Y(ω) + Z(ω) b₁ ≤ u ≤ Y(ω) k + Z(ω) k₁,
which implies that |u| ≤ h(ω). Since E_x[h] < ∞, due to (9) and (10), it is concluded from Definition 2.6 that Ỹ is integrably bounded.
Now, the expected value of Ỹ is to be calculated. Firstly, note that for α ∈ [0, 1] the following identities are valid:
(E*_x[Ỹ])_α = { ∫_Ω f dP | f : Ω → ℝ, f ∈ L¹, f(ω) ∈ [Y(ω) b(α) + Z(ω) b₁(α), Y(ω) c(α) + Z(ω) c₁(α)] }.
Then,
Y(ω) b(α) + Z(ω) b₁(α) ≤ f(ω) ≤ Y(ω) c(α) + Z(ω) c₁(α),
P_x-almost surely. Hence, by taking expectations, we find that
E_x[Y] b(α) + E_x[Z] b₁(α) ≤ ∫_Ω f dP_x ≤ E_x[Y] c(α) + E_x[Z] c₁(α).  (14)
Let I_α be the α-cut of the fuzzy number E_x[Y] Λ̃ +* E_x[Z] Λ̃₁, i.e.
I_α := [E_x[Y] b(α) + E_x[Z] b₁(α), E_x[Y] c(α) + E_x[Z] c₁(α)].
Consequently, it follows from (14) that
(E*_x[Ỹ])_α ⊆ I_α.  (15)
On the other hand, let y ∈ I_α and f : Ω → ℝ be given by f(ω) = y, ω ∈ Ω. Now, suppose that for all ω ∈ Ω, y ∉ [Y(ω) b(α) + Z(ω) b₁(α), Y(ω) c(α) + Z(ω) c₁(α)]; then
∫_Ω f dP_x = y ∉ I_α,
which is a contradiction. Hence, there exists ω̄ ∈ Ω such that
y ∈ [Y(ω̄) b(α) + Z(ω̄) b₁(α), Y(ω̄) c(α) + Z(ω̄) c₁(α)];
however, given that f(ω) = y for all ω ∈ Ω, it implies that for all ω ∈ Ω:
y ∈ [Y(ω) b(α) + Z(ω) b₁(α), Y(ω) c(α) + Z(ω) c₁(α)].
In consequence, y ∈ (E*_x[Ỹ])_α, i.e.
I_α ⊆ (E*_x[Ỹ])_α.  (16)
Then, from (15) and (16), (E*_x[Ỹ])_α = I_α. Since x ∈ X is arbitrary, the above arguments demonstrate the validity of (13).
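Identity (13) reduces the Aumann expectation of Ỹ to the two crisp expectations E_x[Y] and E_x[Z]. A sketch of the resulting α-cut computation, with trapezoids as 4-tuples (the function name is ours):

```python
def expected_fuzzy_Y_alpha(EY, EZ, Lam, Lam1, alpha):
    """Alpha-cut of E*[Y~] from equation (13):
    [E[Y] b(a) + E[Z] b1(a), E[Y] c(a) + E[Z] c1(a)],
    where [b(a), c(a)] and [b1(a), c1(a)] are the alpha-cuts of the
    trapezoids Lam and Lam1 (Remark 2.3)."""
    def cut(t, a):
        l, m, n, p = t
        return ((m - l) * a + l, p - (p - n) * a)
    (b, c), (b1, c1) = cut(Lam, alpha), cut(Lam1, alpha)
    return (EY * b + EZ * b1, EY * c + EZ * c1)
```

For example, with E_x[Y] = 2, E_x[Z] = 1, Λ̃ = (1, 2, 3, 4) and Λ̃₁ = (0, 1, 1, 2), the kernel (α = 1) of E*_x[Ỹ] is [5, 7].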
Now, a fuzzy version of the Gittins index discussed in Section 3 will be defined.
Definition 4.4. Let τ be a stopping time associated with the stochastic process {X(t)}. The fuzzy Gittins index is defined, for all x ∈ X, as
G̃(x) = sup_{τ>0} G̃(x, τ),  (17)
where
G̃(x, τ) := E*_x[ Σ_{t=0}^{τ−1} β^t r̃(X(t)) ] / E_x[ Σ_{t=0}^{τ−1} β^t ].
The following theorem relates the crisp and fuzzy versions of the Gittins index. Furthermore, it shows that the optimal stopping times in both paradigms coincide.
Theorem 4.5. The following statements hold:
(a) G̃(x, τ) = G(x, τ) Λ̃ +* Λ̃₁, for x ∈ X and τ a stopping time associated with the stochastic process {X(t)}.
(b) If τ° is a stopping time such that G(x) = G(x, τ°), x ∈ X, then G̃(x) = G̃(x, τ°), x ∈ X.
Proof. Let x ∈ X and a stopping time τ be fixed. The validity of (a) is a consequence of the following identities:
G̃(x, τ) = (E_x[Y] Λ̃ +* E_x[Z] Λ̃₁) / E_x[Z] = G(x, τ) Λ̃ +* Λ̃₁.
To prove (b), observe that
G̃(x, τ)_α = [G(x, τ) b(α) + b₁(α), G(x, τ) c(α) + c₁(α)].
Then
G(x, τ) b(α) + b₁(α) ≤ G(x, τ°) b(α) + b₁(α) = G(x) b(α) + b₁(α), and
G(x, τ) c(α) + c₁(α) ≤ G(x, τ°) c(α) + c₁(α) = G(x) c(α) + c₁(α).
Therefore, since x and τ are arbitrary, the result follows as a consequence of G̃(x, τ°) = G(x) Λ̃ +* Λ̃₁ (see (a) of this theorem) and Definition 4.4.
In a similar way, the corresponding proof under the average ranking approach can be provided. To this end, observe that
R(G̃(x, τ)) = R(G(x, τ) Λ̃ +* Λ̃₁) = R(Λ̃) G(x, τ) + R(Λ̃₁) ≤ R(Λ̃) G(x) + R(Λ̃₁).
This concludes the proof of the theorem.
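Theorem 4.5 (a) means that the fuzzy index is obtained from the crisp one by the same affine transformation as the rewards. In particular, under the average ranking, the arm maximizing R(G̃) is exactly the arm with the largest crisp Gittins index, since R of the multiplying trapezoid is positive. A sketch (names are ours; trapezoids as 4-tuples):

```python
def fuzzy_gittins(G_crisp, Lam, Lam1):
    """Theorem 4.5 (a): G~(x) = G(x) Lam +* Lam1, componentwise."""
    return tuple(a * G_crisp + a1 for a, a1 in zip(Lam, Lam1))

def best_arm_by_average_ranking(crisp_indexes, Lam, Lam1):
    """Pick the arm maximizing R(G~) = R(Lam) * G + R(Lam1).
    Since R(Lam) > 0, this is the same arm the crisp rule picks."""
    R = lambda t: sum(t) / 4
    return max(range(len(crisp_indexes)),
               key=lambda i: R(fuzzy_gittins(crisp_indexes[i], Lam, Lam1)))
```

For crisp indexes [0.3, 0.9, 0.5] the fuzzy rule selects arm 1, the same as the crisp Gittins strategy.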
5 APPLICATION SETTING
In this section, a fuzzy-reward version of a scheduling example due to Gittins (Gittins, 2018) is considered. To this end, consider a machine capable of processing one job at a time from a total of N ≥ 1. At the beginning of each processing period t, one of these jobs must be selected for execution, whether or not the previous one was completed. Associated with job i and process time t_i ≥ 1 (the number of times that job i has been chosen for execution during the first t units of time), there is a probability P_i(t_i) that t_i + 1 periods are necessary to complete the job, given that more than t_i periods are needed.
In addition, if at process time t_i ≥ 1 job i has not been completed and has been selected for execution, then the following fuzzy reward will be received:
P_i(t_i) V_i Λ̃ +* Λ̃₁,  (18)
where Λ̃ = (b, d, f, k) and Λ̃₁ = (b₁, d₁, f₁, k₁) are trapezoidal fuzzy numbers such that 0 ≤ b and 0 ≤ b₁, with V_i > 0 for each i. Once a job has been terminated, it will produce a null reward.
At this point, it will be considered that Assumption 3.1 holds for every job. Some situations in which such a supposition is satisfied are, for example:
1. (Finite patience) For each job i, there exists a maximum time M_i for its conclusion.
2. (Constant probability of conclusion) For each job i and every process time t_i, it may additionally be assumed that P_i(t_i) = P_i.
3. (Limited ability for conclusion) For every job, there exists a maximum value of the probability of completing it.
Formally, for job i, set S_i = {0, 1, 2, . . . , L_i} ∪ {C}, where L_i = M_i or L_i = ∞, and C is an absorbing state associated with the completion of the corresponding job. Additionally, consider that P[{C}|C] = 1, P[{C}|t_i] = P_i(t_i), P[{t_i + 1}|t_i] = 1 − P_i(t_i), r̃_i(C) = Λ̃₁, and r̃_i(t_i) = P_i(t_i) V_i Λ̃ +* Λ̃₁, t_i ≥ 1.
If, in addition, the "finite patience" paradigm is considered, then P[{M_i}|M_i] = 1 and r̃_i(M_i) = Λ̃₁.
5.1 The Deteriorating Case
In some situations, it might be appropriate to consider that each job is such that, the more attempts are made to complete it, the less likely it is that the job will be terminated: for all i and t, P_i(t − 1) ≥ P_i(t).
In this case, for each uncompleted job (and, in the "finite patience" case, still ongoing: t < M_i), the fuzzy Gittins index coincides with the immediate reward, because in both the crisp and the fuzzy setting the optimal stopping time for job i is τ_i = 1:
G̃(t_i) = G(t_i) Λ̃ +* Λ̃₁ = P_i(t_i) V_i Λ̃ +* Λ̃₁, for each i, and G̃(C) = Λ̃₁.
Hence, the fuzzy Gittins strategy is: select an uncompleted job with the greatest value, with respect to both considered orders, of P_i(t_i) V_i Λ̃ +* Λ̃₁, such that its limit time has not been reached (if applicable). Or, equivalently, given that 0̃ ⪯* Λ̃ and 0̃ ⪯* Λ̃₁, choose an unfinished job for which P_i(t_i) V_i is the largest.
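The fuzzy Gittins strategy of the deteriorating case can be simulated directly: at each period, run the unfinished job with the largest P_i(t_i) V_i. A sketch with hypothetical completion probabilities (all names are ours):

```python
import random

def schedule_deteriorating(P, V, horizon, seed=0):
    """Run the (fuzzy) Gittins strategy on deteriorating jobs: at each
    period pick the unfinished job i maximizing P[i](t_i) * V[i], where
    t_i counts how often job i has been tried; a tried job completes
    with probability P[i](t_i).  Returns the order of completions."""
    rng = random.Random(seed)
    tries = [0] * len(V)
    done, order = [False] * len(V), []
    for _ in range(horizon):
        alive = [i for i in range(len(V)) if not done[i]]
        if not alive:
            break
        i = max(alive, key=lambda j: P[j](tries[j]) * V[j])
        if rng.random() < P[i](tries[i]):
            done[i] = True
            order.append(i)
        tries[i] += 1
    return order
```

With deterministic completion (P_i ≡ 1, a constant and hence non-increasing choice), the strategy finishes the jobs in decreasing order of V_i, as expected.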
Remark 5.1. The deteriorating framework becomes important in view of Theorem 2D in (Kaspi and Mandelbaum, 1998), which states an equivalence between any Armed Bandits process (under suitable conditions) and a deteriorating one, in the sense that the values of both bandits are equal and the classes of optimal strategies for both bandits coincide.
6 CONCLUSIONS
In this paper, a fuzzy affine transformation of the reward functions associated with a Multi-Armed Bandits model was considered. Such a transformation made it possible to model rewards that lie approximately in a certain interval. Furthermore, it guided us to formalize the fuzzy-random structure of its components and to find relationships with their crisp analogues. In particular, it was found that the policy dictated by the Gittins strategy, as well as the corresponding stopping time, are also optimal in the considered fuzzy setting. Further development may be possible by considering a fuzzy reward scheme with a denumerable number of Bandit processes with denumerable state spaces. Another line of work may arise by taking into account reward functions with countable support, or by finding application settings that satisfy Assumption 3.1.
ACKNOWLEDGMENTS
This work was partially supported by Proyecto CONAHCYT: "Procesos de Decisión de Markov en ambiente difuso", Ciencia de Frontera 2023, CF-2023-I-1362.
REFERENCES
Carrero-Vera, K., Cruz-Suárez, H., and Montes-de Oca, R.
(2020). Finite-horizon and infinite-horizon Markov
decision processes with trapezoidal fuzzy discounted
rewards. In International Conference on Operations
Research and Enterprise Systems, pages 171–192.
Springer.
Carrero-Vera, K., Cruz-Suárez, H., and Montes-de Oca, R.
(2022). Markov decision processes on finite spaces
with fuzzy total rewards. Kybernetika (Prague),
58(2):180–199.
Cruz-Suárez, H., Montes de Oca, R., and Ortega Gutiérrez,
R. (2023a). Deterministic discounted Markov deci-
sion processes with fuzzy rewards/costs. Fuzzy Infor-
mation and Engineering, 15(3):274–290.
Cruz-Suárez, H., Montes-de Oca, R., and Ortega-Gutiérrez,
R. (2023b). An extended version of average Markov
decision processes on discrete spaces under fuzzy en-
vironment. Kybernetika (Prague), 59(1):160–178.
Diamond, P. and Kloeden, P. (1994). Metric Spaces of
Fuzzy Sets: Theory and Applications. WORLD SCI-
ENTIFIC.
Furukawa, N. (1997). Parametric orders on fuzzy numbers
and their roles in fuzzy optimization problems. Opti-
mization, 40(2):171–192.
Gittins, J. and Jones, D. (1974). A dynamic allocation index
for the sequential design of experiments. Progress in
Statistics (edited by J. Gani), 241–266.
Gittins, J. C. (2018). Bandit Processes and Dynamic Alloca-
tion Indices. Journal of the Royal Statistical Society:
Series B (Methodological), 41(2):148–164.
Kaspi, H. and Mandelbaum, A. (1998). Multi-armed ban-
dits in discrete and continuous time. The Annals of
Applied Probability, 8(4):1270–1290.
Kurano, M., Song, J., Hosaka, M., and Huang, Y. (1998).
Controlled Markov set-chains with discounting. Jour-
nal of applied probability, 35(2):293–302.
Kurano, M., Yasuda, M., Nakagami, J.-i., and Yoshida, Y.
(1996). Markov-type fuzzy decision processes with
a discounted reward on a closed interval. European
Journal of Operational Research, 92(3):649–662.
Kurano, M., Yasuda, M., Nakagami, J.-i., and Yoshida, Y.
(2003). Markov decision processes with fuzzy re-
wards. Journal of Nonlinear and Convex Analysis,
4(1):105–116.
Martínez-Cortés, V. M. (2021). Bi-personal stochastic transient Markov games with stopping times and total reward criterion. Kybernetika, 57(1):1–14.
Pahade, J. K. and Jha, M. (2021). Credibilistic vari-
ance and skewness of trapezoidal fuzzy variable and
mean–variance–skewness model for portfolio selec-
tion. Results in Applied Mathematics, 11:100159.
Puri, M. L. and Ralescu, D. A. (1986). Fuzzy random vari-
ables. Journal of Mathematical Analysis and Applica-
tions, 114(2):409–422.
Puterman, M. L. (2014). Markov decision processes: dis-
crete stochastic dynamic programming. John Wiley &
Sons.
Raj, M. E. A., Sivaraman, G., and Vishnukumar, P. (2023).
A novel kind of arithmetic operations on trapezoidal
fuzzy numbers and its applications to optimize the
transportation cost. International Journal of Fuzzy
Systems, 25(3):1069–1076.
Rezvani, S. and Molani, M. (2014). Representation of trape-
zoidal fuzzy numbers with shape function. Ann. Fuzzy
Math. Inform, 8(1):89–112.
Semmouri, A., Jourhmane, M., and Belhallaj, Z. (2020).
Discounted Markov decision processes with fuzzy
costs. Annals of Operations Research, 295:769–786.
Thompson, W. R. (1933). On the likelihood that one un-
known probability exceeds another in view of the evi-
dence of two samples. Biometrika, 25(3-4):285–294.
Zadeh, L. (1965). Fuzzy sets. Information and Control,
8(3):338–353.