Fuzzy Rewards on the Multi-Armed Bandits Model
Ciria R. Briones-García¹, Raúl Montes-de-Oca²ᵃ, Víctor H. Vázquez-Guevara¹ᵇ and Hugo Cruz-Suárez¹ᶜ
¹Facultad de Ciencias Físico Matemáticas, Benemérita Universidad Autónoma de Puebla, San Claudio y 18 sur San Manuel 72570, Puebla, Mexico
²Departamento de Matemáticas, Universidad Autónoma Metropolitana-Iztapalapa, Av. Ferrocarril San Rafael Atlixco 186, Col. Leyes de Reforma 1 A Sección, Alcaldía Iztapalapa 09310, Ciudad de México, Mexico
ciria.briones@alumnos.buap.mx, momr@xanum.uam.mx, {vvazquez, hcs}@fcfm.buap.mx
Keywords:
Armed Bandits Model, Gittins Index, Fuzzy Reward, Trapezoidal Fuzzy Number.
Abstract:
In this paper, an extension of the Armed Bandits problem is considered in which reward functions take trapezoidal fuzzy values as the result of a fuzzy affine transformation (which can be interpreted as receiving "approximately" a reward located in some interval instead of the reward itself). The main objective is to find an optimal selection strategy that maximizes the fuzzy total expected discounted reward with respect to both the partial order based on α-cuts and the order provided by the average ranking. To this end, it is shown that the Gittins strategy (which is optimal in the crisp setting) remains optimal in the fuzzy paradigm. In addition, it is found that the optimal stopping time associated with the crisp Gittins index is the same for its fuzzy counterpart, by establishing a link between the fuzzy and crisp versions of the Gittins index. This link leads to the result that the fuzzy value function is connected to its crisp analog via a fuzzy affine transformation; with this in mind, it is possible to ensure that the value function lies approximately in a certain interval related to the fuzzy transformation.
1 INTRODUCTION
This paper considers a Simple Family of Armed Bandits with the expected total discounted reward criterion, in which a trapezoidal fuzzy transformation of the reward is discussed. This topic leads to a transition between the theory of Multi-Armed Bandits and the theory of fuzzy numbers.
On the one hand, the first Multi-Armed Bandits problem was discussed in (Thompson, 1933), in which two treatments of unknown efficiency are considered with the goal of adaptively assigning as many patients as possible to the treatment with the greatest success rate. Notably, in (Gittins and Jones, 1974) and (Gittins, 2018) the so-called Dynamic Allocation Index, later renamed the Gittins index, was developed. It is a scalar quantity associated with each process in the Multi-Armed Bandits setting and leads to the Gittins optimal strategy: at each stage, the controller must select the process with the highest Gittins index.
ᵃ https://orcid.org/0000-0002-7632-9190
ᵇ https://orcid.org/0000-0001-6602-8733
ᶜ https://orcid.org/0000-0002-0732-4943
On the other hand, fuzzy theory was first introduced by Zadeh in his pioneering work "Fuzzy sets" (Zadeh, 1965), where the seminal definitions and operations are stated.
To the best of our knowledge, there are no previous works on Multi-Armed Bandits in a fuzzy environment. However, the Multi-Armed Bandits treated here are related to fuzzy Markov decision processes (MDPs), especially to fuzzy discounted MDPs. Concerning the antecedents of fuzzy MDPs, firstly observe that (Carrero-Vera et al., 2022) and (Cruz-Suárez et al., 2023b) present fuzzy MDPs on discrete spaces with objective functions other than the total discounted cost. Secondly, (Carrero-Vera et al., 2020), (Cruz-Suárez et al., 2023a), (Kurano et al., 1996), (Kurano et al., 2003), (Kurano et al., 1998), and (Semmouri et al., 2020) provide discounted MDPs with different fuzzy characteristics; but, as already said, none of these develops the specific theme of Multi-Armed Bandits.
The main contribution of this paper is to demonstrate that the Gittins strategy is also optimal in the fuzzy framework, which is done by first finding a relation between the fuzzy and crisp Gittins indexes.
DOI: 10.5220/0013160600003893
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 14th International Conference on Operations Research and Enterprise Systems (ICORES 2025), pages 271-277
ISBN: 978-989-758-732-0; ISSN: 2184-4372
Proceedings Copyright © 2025 by SCITEPRESS Science and Technology Publications, Lda.
The rest of the paper is organized as follows: Section 2 discusses preliminary tools of fuzzy theory; Section 3 deals with the crisp Armed Bandits model and the Gittins index; in Section 4 the Armed Bandits model with fuzzy rewards is considered, and it contains the main results of this work. Finally, Section 5 is devoted to an application setting of the theory presented in Section 4.
Notation 1.1. In this article, it will be necessary to establish a difference between crisp and fuzzy operations; hence, the standard mathematical symbols will be marked with an asterisk (∗) in the fuzzy context. Moreover, some special functions that appear as fuzzy quantities, say, the reward function, the optimal value function, and so on, will be distinguished with a "tilde"; for instance, the fuzzy reward function will be written as r̃.
2 PRELIMINARIES ON FUZZY THEORY
In this section, introductory fuzzy theory will be displayed; for a general discussion see (Diamond and Kloeden, 1994) and (Zadeh, 1965).
Let Λ be a non-empty set, which denotes the universal set of the discourse. A fuzzy set Γ on Λ is defined in terms of its membership function m_Γ, which assigns to each element of Λ a real value in [0, 1], i.e. Γ = {(g, m_Γ(g)) : g ∈ Λ}. The α-cut of Γ, denoted by Γ_α, is defined as the set Γ_α = {x ∈ Λ : m_Γ(x) ≥ α} for 0 < α ≤ 1, and Γ_0 is the closure of {x ∈ Λ : m_Γ(x) > 0}, denoted by cl{x ∈ Λ : m_Γ(x) > 0}. In the sequel, it will be assumed that Λ = ℝ.
Definition 2.1. A fuzzy number Γ is a fuzzy set defined on the set of real numbers ℝ which satisfies:
a) m_Γ is normal, i.e. there exists x_0 ∈ ℝ with m_Γ(x_0) = 1;
b) m_Γ is convex, i.e. Γ_α is convex for all α ∈ [0, 1];
c) m_Γ is upper-semicontinuous;
d) Γ_0 is compact.
The set of fuzzy numbers will be denoted by F(ℝ).
The manuscript will focus its attention on trape-
zoidal fuzzy numbers, which are typically considered
when the degree of membership for particular values
is known to be asymmetric. This class of fuzzy num-
bers has been successfully applied; for example, to
transportation problems (Raj et al., 2023) and portfo-
lio selection (Pahade and Jha, 2021).
Definition 2.2. A fuzzy number Γ is called a trapezoidal fuzzy number if its membership function has the following form:
m_Γ(x) = ((x − l)/(m − l)) I_(l,m](x) + I_(m,n](x) + ((p − x)/(p − n)) I_(n,p](x),
where l, m, n and p are real numbers with l < m ≤ n < p, and I_A is the indicator function of the set A. Hence, a trapezoidal fuzzy number will be represented by (l, m, n, p).
Remark 2.3. For a trapezoidal fuzzy number Γ = (l, m, n, p), its corresponding α-cuts are given by
Γ_α = [(m − l)α + l, p − (p − n)α], α ∈ [0, 1]
(Rezvani and Molani, 2014).
It may be demonstrated that the following operations between trapezoidal fuzzy numbers hold (Rezvani and Molani, 2014):
Lemma 2.4. If H = (a_l, a_m, a_n, a_p) and I = (b_l, b_m, b_n, b_p) are trapezoidal fuzzy numbers and λ is a positive number, it follows that
a) λH = (λa_l, λa_m, λa_n, λa_p), and
b) H +* I = (a_l + b_l, a_m + b_m, a_n + b_n, a_p + b_p).
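The α-cut formula of Remark 2.3 and the operations of Lemma 2.4 can be sketched in Python (a minimal illustration; the class and method names are ours, not part of the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trapezoid:
    """Trapezoidal fuzzy number (l, m, n, p) with l < m <= n < p."""
    l: float
    m: float
    n: float
    p: float

    def alpha_cut(self, alpha):
        """Alpha-cut [(m - l)*alpha + l, p - (p - n)*alpha] (Remark 2.3)."""
        return ((self.m - self.l) * alpha + self.l,
                self.p - (self.p - self.n) * alpha)

    def scale(self, lam):
        """Multiplication by a positive scalar (Lemma 2.4 a)."""
        assert lam > 0
        return Trapezoid(lam * self.l, lam * self.m, lam * self.n, lam * self.p)

    def add(self, other):
        """Fuzzy addition +* (Lemma 2.4 b): componentwise sum."""
        return Trapezoid(self.l + other.l, self.m + other.m,
                         self.n + other.n, self.p + other.p)
```

For instance, `Trapezoid(1, 2, 3, 5).alpha_cut(1)` returns the kernel interval `(2, 3)`.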
Let D denote the set of all closed bounded intervals on the real line ℝ. For Ψ = [a_l, a_u], Φ = [b_l, b_u] ∈ D define
d(Ψ, Φ) = max(|a_l − b_l|, |a_u − b_u|).
It is possible to verify that (D, d) is a complete metric space (see (Puri and Ralescu, 1986)). Now, if η̃ ∈ F(ℝ), then, as its membership function satisfies b), c) and d) of Definition 2.1, it follows that η_α ∈ D. Therefore, d̂ : F(ℝ) × F(ℝ) → ℝ is defined by
d̂(η̃, μ̃) = sup_{α∈[0,1]} d(η_α, μ_α),
for η̃, μ̃ ∈ F(ℝ).
Now, in order to establish comparisons between trapezoidal fuzzy numbers, let η̃, μ̃ ∈ F(ℝ), with α-cuts η̃_α = [a_α, b_α] and μ̃_α = [c_α, d_α], α ∈ [0, 1], respectively, and define the partial order "⪯*" on F(ℝ) by
η̃ ⪯* μ̃ if and only if a_α ≤ c_α and b_α ≤ d_α,  (1)
for all α ∈ [0, 1] (Furukawa, 1997).
It is also possible to compare fuzzy numbers via the so-called "average ranking", given for the trapezoidal fuzzy number η̃ = (a_l, a_m, a_n, a_p) by
R(η̃) := (1/4)(a_l + a_m + a_n + a_p).
Hence, it will be said that trapezoidal fuzzy numbers η̃ = (a_l, a_m, a_n, a_p) and μ̃ = (a¹_l, a¹_m, a¹_n, a¹_p) are such that η̃ ⪯** μ̃ if and only if R(η̃) ≤ R(μ̃).
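The two comparison criteria, the α-cut order (1) and the average ranking, can be sketched as follows, with trapezoids encoded as 4-tuples (function names are ours). Since the α-cut endpoints of a trapezoid are linear in α, checking a grid of levels is more than enough (in fact, α = 0 and α = 1 would suffice):

```python
def alpha_cut(t, alpha):
    """Alpha-cut of a trapezoid t = (l, m, n, p), per Remark 2.3."""
    l, m, n, p = t
    return ((m - l) * alpha + l, p - (p - n) * alpha)

def leq_alpha_cuts(t1, t2, grid=101):
    """Partial order (1): endpoint-wise <= at every alpha level."""
    for i in range(grid):
        a = i / (grid - 1)
        (lo1, hi1), (lo2, hi2) = alpha_cut(t1, a), alpha_cut(t2, a)
        if lo1 > lo2 or hi1 > hi2:
            return False
    return True

def average_ranking(t):
    """Average ranking R(t) = (l + m + n + p) / 4."""
    return sum(t) / 4
```

For example, (1, 2, 3, 4) precedes (2, 3, 4, 5) under both criteria.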
For providing a formal treatment of random entities in fuzzy environments, some concepts on fuzzy random variables are presented.
Definition 2.5. Let (Ω, F) be a measurable space and (ℝ, B(ℝ)) be the measurable space of the real numbers. A fuzzy random variable is a function X̃ : Ω → F(ℝ) such that Gr(X̃_α) := {(ω, u) ∈ Ω × ℝ : u ∈ (X̃(ω))_α} ∈ F ⊗ B(ℝ), for all α ∈ [0, 1].
Definition 2.6. Given a probability space (Ω, A, P), a fuzzy random variable Ỹ associated to (Ω, A) is said to be an integrably bounded fuzzy random variable with respect to (Ω, A, P) if there is a function h : Ω → ℝ, h ∈ L¹(Ω, A, P), such that |u| ≤ h(ω) for all (ω, u) ∈ Ω × ℝ with u ∈ (Ỹ(ω))_0 := Ỹ_0(ω).
Definition 2.7. Given an integrably bounded fuzzy random variable Ỹ with respect to the probability space (Ω, A, P), the fuzzy expected value of Ỹ in Aumann's sense is the unique fuzzy set E*[Ỹ] of ℝ such that for each α ∈ [0, 1]:
E*[Ỹ]_α = { ∫_Ω f(ω) dP(ω) | f : Ω → ℝ, f ∈ L¹(P), f(ω) ∈ (Ỹ(ω))_α a.s. [P] }.
3 THE ARMED BANDITS MODEL AND THE GITTINS INDEX
A Simple Family of Armed Bandits Processes is a particular class of discrete-time Markov control process. In this class of process, the controller (player) must play one among K ∈ ℤ₊ available Armed Bandit processes. At each time t ∈ {0, 1, ...} and for each bandit process, the controller has two possible actions: either play it or not. In case of playing the m-th Armed Bandit process, the controller receives a reward (from the m-th process) and only the m-th system moves to another state according to a Markovian dynamic. The objective is to determine an activation policy that maximizes the total expected reward of the overall selection process. Formally:
Let {X(t) = (X_1(t), ..., X_K(t)) : t = 0, 1, ...} be the state of a Simple Family of Armed Bandits Processes, where for each t ∈ {0, 1, ...}, X_i(t) is a discrete random variable defined on the probability space (Ω, F, P) and taking values in a finite non-empty set X called the state space. Let a(t) = (a_1(t), a_2(t), ..., a_K(t)) be the action selected by the controller at time t. For each t ∈ {0, 1, ...}, a_i(t) is a binary function, which takes the value 1 if the player chooses the i-th Armed Bandit process, and a_i(t) = 0 otherwise. Hence, the action space can be defined for each time t as A(t) := {(a_1(t), a_2(t), ..., a_K(t)) : Σ_{i=1}^K a_i(t) = 1}. Consider for each i = 1, 2, ..., K, r_i : X → ℝ the corresponding reward function, and let P_i = [p_i(x, y)] be the Markovian transition law of the stochastic process {X_i(t)}.
Define Π := {{a(t) : t = 0, 1, ...} : a(t) ∈ A(t)}, the set of all admissible policies (or activation strategies). Then for π ∈ Π and x ∈ X^K the expected total discounted value function is defined as follows:
V(π, x) := E^π_x [ Σ_{t=0}^∞ β^t Σ_{i=1}^K r_i(X_i(t)) a_i(t) ],  (2)
where 0 < β < 1 is a discount factor. The expectation operator E^π_x is associated with the product measure P^π_x defined on (Ω^K, F^K), induced by the Ionescu-Tulcea theorem (Puterman, 2014). Then the Armed Bandits problem consists in determining an activation strategy π° ∈ Π that maximizes the total expected discounted reward, i.e.
V(π°, x) = sup_{π∈Π} V(π, x),
x ∈ X^K. The function V°(x) := V(π°, x) is called the optimal value function.
One approach to solve this problem was proposed in (Gittins and Jones, 1974), where for each bandit process one may compute the dynamic allocation index (or simply the Gittins index), which depends only on that process, and then at each time the player operates the bandit process with the highest index. The Gittins index is defined as follows (Gittins, 2018):
G(x) = sup_{τ>0} G(x, τ), x ∈ X,  (3)
where
G(x, τ) := E_x[ Σ_{t=0}^{τ−1} β^t r_i(X_i(t)) ] / E_x[ Σ_{t=0}^{τ−1} β^t ]
and τ is a stopping time associated with the stochastic process {X_i(t)}, i.e. for each n ∈ ℕ, [τ = n] ∈ F_n := σ(X_i(1), X_i(2), ..., X_i(n)).
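For a finite-state bandit, the supremum in (3) can be computed numerically via the restart-in-state characterization of the Gittins index due to Katehakis and Veinott (an equivalent formulation, not used in this paper's proofs). A sketch, with our own naming:

```python
def gittins_index(P, r, x, beta, tol=1e-10):
    """Gittins index G(x) via the restart-in-x MDP (Katehakis-Veinott):
    in every state y the player either continues from y or restarts the
    chain at x, and G(x) = (1 - beta) * V(x), where V solves
        V(y) = max(r[y] + beta * sum_z P[y][z] V(z),
                   r[x] + beta * sum_z P[x][z] V(z)).
    P is a row-stochastic transition matrix, r the reward vector."""
    S = range(len(r))
    V = [0.0] * len(r)
    while True:
        def q(y):
            # Expected one-step reward plus discounted continuation from y.
            return r[y] + beta * sum(P[y][z] * V[z] for z in S)
        restart = q(x)
        V_new = [max(q(y), restart) for y in S]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return (1 - beta) * V_new[x]
        V = V_new
```

For instance, a two-state chain paying 1 in state 0 and then absorbing in state 1 with reward 0 has G(0) = 1 for any discount factor, since stopping after one step attains the supremum of the ratio in (3).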
In the remainder of the manuscript, the subscript i will be removed for the sake of simplicity in the argumentation. Moreover, the following assumption on the reward function is considered.
Assumption 3.1. (a) r(·) ≥ 0 and r(·) has finite support.
(b) E_x[ Σ_{t=0}^∞ r(X(t)) ] < ∞, for all x ∈ X.
Remark 3.2. (a) Assumption 3.1 (b) is satisfied, for instance, under a transient Markov condition (Martínez-Cortés, 2021).
(b) Moreover, observe that Assumption 3.1 (b) implies that
0 ≤ Σ_{t=0}^∞ β^t r(X(t)) ≤ Σ_{t=0}^∞ r(X(t)) < ∞,  (4)
P_x-almost surely.
4 FUZZY ARMED BANDITS
This section is devoted to a fuzzy version of the Armed Bandits problem introduced in the previous section. Firstly, consider the following assumption.
Assumption 4.1. Let Λ̃ = (b, d, f, k) and Λ̃₁ = (b₁, d₁, f₁, k₁) be two fixed trapezoidal fuzzy numbers, with 0 ≤ b < d ≤ f < k and 0 ≤ b₁ < d₁ ≤ f₁ < k₁. It is also supposed that
r̃(x) = Λ̃ r(x) +* Λ̃₁,  (5)
with x ∈ X, where r is a reward function as considered in the previous section and such that Assumption 3.1 holds.
The intuition behind expression (5) is that instead of receiving the reward r(x), one will obtain a reward that is approximately within the interval [d r(x) + d₁, f r(x) + f₁].
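A numerical reading of (5): by Lemma 2.4, each crisp reward r(x) is mapped to the trapezoid (b r(x) + b₁, d r(x) + d₁, f r(x) + f₁, k r(x) + k₁), whose kernel is the interval [d r(x) + d₁, f r(x) + f₁] above. A sketch with trapezoids encoded as 4-tuples (the function name and the numbers below are illustrative only):

```python
def fuzzy_reward(r_x, Lam, Lam1):
    """r~(x) = Lam * r(x) +* Lam1 (equation (5)), for trapezoids
    Lam = (b, d, f, k), Lam1 = (b1, d1, f1, k1) and r_x = r(x) >= 0."""
    return tuple(a * r_x + a1 for a, a1 in zip(Lam, Lam1))
```

With, say, Λ̃ = (0.9, 0.95, 1.05, 1.1) and Λ̃₁ = (0, 0.1, 0.2, 0.3), a crisp reward of 10 becomes approximately (9, 9.6, 10.7, 11.3): "approximately 10", located in the interval [9.6, 10.7]. Note that r(x) = 0 yields Λ̃₁, consistent with the null-reward case of Section 5.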
In a similar way to (2), the fuzzy total expected discounted reward is defined, for π ∈ Π and x ∈ X^K, by:
Ṽ(π, x) := E*_x [ Σ_{t=0}^∞ β^t Σ_{i=1}^K r̃_i(X_i(t)) a_i(t) ].  (6)
Then the fuzzy Armed Bandits problem consists in determining an activation strategy π° ∈ Π that maximizes, with respect to any of the considered fuzzy orders (based on α-cuts and on the average ranking), the fuzzy total expected discounted reward, i.e. such that
Ṽ(π°, x) = sup_{π∈Π} Ṽ(π, x),  (7)
x ∈ X^K. The function Ṽ°(x) := Ṽ(π°, x) is called the fuzzy optimal value function.
The objective now is to propose a fuzzy version of the Gittins index in order to characterize the optimal selection policy in the fuzzy frame. For this, consider the following sets: S_F := {ω ∈ Ω : Σ_{t=0}^∞ r(X(t, ω)) < ∞} and S_I := {ω ∈ Ω : Σ_{t=0}^∞ r(X(t, ω)) = ∞}. Then, define the random variable:
Y(ω) := Σ_{t=0}^{τ−1} β^t r(X(t, ω)) if ω ∈ A(r, τ), and Y(ω) := 0 if ω ∈ B(r, τ),  (8)
where A(r, τ) := [τ < ∞] ∪ ([τ = ∞] ∩ S_F) and B(r, τ) := [τ = ∞] ∩ S_I.
Observe that Y is a well-defined function due to Ω = [τ < ∞] ∪ ([τ = ∞] ∩ S_F) ∪ ([τ = ∞] ∩ S_I) and P([τ = ∞] ∩ S_I) = 0, see Remark 3.2 (b). In consequence,
Σ_{t=0}^{τ−1} β^t r(X(t)) = Y, P_x-a.s.  (9)
Now, some notation about the range of the random variable Y given by (8) will be presented. For this, note that [τ < +∞] = ∪_{n=1}^{+∞} [τ = n]. Hence, the characteristics of Y restricted to each element of the resulting partition of the sample space are as follows:
(a) On [τ = n], for n ≥ 1, the image of Y is denoted by Y[τ = n] = {y^n_1, y^n_2, . . .}.
(b) In addition, Y[[τ = +∞] ∩ S_F] is a denumerable set.
(c) And, finally, Y[[τ = +∞] ∩ S_I] = {0}.
In a similar way, for the random variable
Z := Σ_{t=0}^{τ−1} β^t < ∞, P_x-a.s.,  (10)
whose range adopts the notation Z[τ = n] = {z_n}, n = 1, 2, . . ., and Z[τ = +∞] = {z}.
The previous facts will be utilized in the proof of the following lemma, which is the first step towards a fuzzy version of the Gittins index.
Lemma 4.2. Let
Ỹ(ω) := Σ_{t=0}^{τ−1} β^t r̃(X(t, ω)) if ω ∈ A(r, τ), and Ỹ(ω) := z Λ̃₁ if ω ∈ B(r, τ).  (11)
Then Ỹ is a fuzzy random variable.
Proof. Fix α ∈ [0, 1]. Firstly observe that Ỹ defined in (11) can be represented as
Ỹ = Λ̃ Y +* Λ̃₁ Z,  (12)
as a consequence of (8) and (10). Then, the α-cut of Ỹ is given by
Ỹ_α = [Y b(α) + Z b₁(α), Y c(α) + Z c₁(α)],
where [b(α), c(α)] and [b₁(α), c₁(α)] denote the α-cuts of Λ̃ and Λ̃₁, respectively. Thus the graph of Ỹ_α can be written as follows:
Gr(Ỹ_α) = ∪_{n=1}^∞ ∪_{j=1}^∞ ( ([Y = y^n_j] ∩ [Z = z_n]) × [y^n_j b(α) + z_n b₁(α), y^n_j c(α) + z_n c₁(α)] ) ∪ ( ([Y = 0] ∩ [Z = z]) × [z b₁(α), z c₁(α)] ).
Consequently, Gr(Ỹ_α) ∈ F ⊗ B(ℝ). Hence Ỹ is a fuzzy random variable.
Lemma 4.3. Ỹ is an integrably bounded fuzzy random variable with respect to (Ω, F, P_x) and
E*_x[Ỹ] = E_x[Y] Λ̃ +* E_x[Z] Λ̃₁, x ∈ X.  (13)
Proof. Let x ∈ X be fixed. To prove that Ỹ is integrably bounded, note that
Ỹ_0(ω) = [b Y(ω) + Z(ω) b₁, Y(ω) k + Z(ω) k₁], ω ∈ Ω.
Define h : Ω → ℝ given by
h(ω) = Y(ω) k + Z(ω) k₁, ω ∈ Ω.
Then, if (ω, u) ∈ Ω × ℝ with u ∈ Ỹ_0(ω), it yields that
b Y(ω) + Z(ω) b₁ ≤ u ≤ Y(ω) k + Z(ω) k₁,
which implies that |u| ≤ h(ω). Since E_x[h] < ∞, due to (9) and (10), it is concluded from Definition 2.6 that Ỹ is integrably bounded.
Now, the expected value of Ỹ is to be calculated. Firstly, note that for α ∈ [0, 1] the following identities are valid:
(E*_x[Ỹ])_α = { ∫_Ω f dP | f : Ω → ℝ, f ∈ L¹, f(ω) ∈ [Y(ω) b(α) + Z(ω) b₁(α), Y(ω) c(α) + Z(ω) c₁(α)] }.
Then,
Y(ω) b(α) + Z(ω) b₁(α) ≤ f(ω) ≤ Y(ω) c(α) + Z(ω) c₁(α),
P_x-almost surely. Hence, by taking expectations, we find that
E_x[Y] b(α) + E_x[Z] b₁(α) ≤ ∫_Ω f dP_x ≤ E_x[Y] c(α) + E_x[Z] c₁(α).  (14)
Let I_α be the α-cut of the fuzzy number E_x[Y] Λ̃ +* E_x[Z] Λ̃₁, i.e.
I_α := [E_x[Y] b(α) + E_x[Z] b₁(α), E_x[Y] c(α) + E_x[Z] c₁(α)].
Consequently, it follows from (14) that
(E*_x[Ỹ])_α ⊆ I_α.  (15)
On the other hand, let y ∈ I_α and f : Ω → ℝ be given by f(ω) = y, ω ∈ Ω. Now, suppose that for all ω ∈ Ω, y ∉ [Y(ω) b(α) + Z(ω) b₁(α), Y(ω) c(α) + Z(ω) c₁(α)]; then
∫_Ω f dP_x = y ∉ I_α,
which is a contradiction. Hence, there exists ω̄ ∈ Ω such that
y ∈ [Y(ω̄) b(α) + Z(ω̄) b₁(α), Y(ω̄) c(α) + Z(ω̄) c₁(α)];
however, given that f(ω) = y for all ω ∈ Ω, it implies that for all ω ∈ Ω:
y ∈ [Y(ω) b(α) + Z(ω) b₁(α), Y(ω) c(α) + Z(ω) c₁(α)].
In consequence, y ∈ (E*_x[Ỹ])_α, i.e.
I_α ⊆ (E*_x[Ỹ])_α.  (16)
Then, from (15) and (16), (E*_x[Ỹ])_α = I_α. Since x ∈ X is arbitrary, the above arguments demonstrate the validity of (13).
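Identity (13) reduces the Aumann expectation of Ỹ to the two crisp expectations E_x[Y] and E_x[Z]. A sketch of the resulting α-cut computation, with trapezoids as 4-tuples (the function name is ours):

```python
def expected_fuzzy_Y_alpha(EY, EZ, Lam, Lam1, alpha):
    """Alpha-cut of E*[Y~] from equation (13):
    [E[Y] b(a) + E[Z] b1(a), E[Y] c(a) + E[Z] c1(a)],
    where [b(a), c(a)] and [b1(a), c1(a)] are the alpha-cuts of the
    trapezoids Lam and Lam1 (Remark 2.3)."""
    def cut(t, a):
        l, m, n, p = t
        return ((m - l) * a + l, p - (p - n) * a)
    (b, c), (b1, c1) = cut(Lam, alpha), cut(Lam1, alpha)
    return (EY * b + EZ * b1, EY * c + EZ * c1)
```

For example, with E_x[Y] = 2, E_x[Z] = 1, Λ̃ = (1, 2, 3, 4) and Λ̃₁ = (0, 1, 1, 2), the kernel (α = 1) of E*_x[Ỹ] is [5, 7].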
Now, a fuzzy version of the Gittins index discussed in Section 3 will be defined.
Definition 4.4. Let τ be a stopping time associated with the stochastic process {X(t)}. The fuzzy Gittins index is defined, for all x ∈ X, as
G̃(x) = sup_{τ>0} G̃(x, τ),  (17)
where
G̃(x, τ) := E*_x[ Σ_{t=0}^{τ−1} β^t r̃(X(t)) ] / E_x[ Σ_{t=0}^{τ−1} β^t ].
The following theorem relates the crisp and fuzzy versions of the Gittins index. Furthermore, it shows that the optimal stopping times in both paradigms coincide.
Theorem 4.5. The following statements hold:
(a) G̃(x, τ) = G(x, τ) Λ̃ +* Λ̃₁, for x ∈ X and τ a stopping time associated with the stochastic process {X(t)}.
(b) If τ° is a stopping time such that G(x) = G(x, τ°), x ∈ X, then G̃(x) = G̃(x, τ°), x ∈ X.
Proof. Let x ∈ X and a stopping time τ be fixed. The validity of (a) is a consequence of the following identities:
G̃(x, τ) = (E_x[Y] Λ̃ +* E_x[Z] Λ̃₁) / E_x[Z] = G(x, τ) Λ̃ +* Λ̃₁.
To prove (b), observe that
G̃(x, τ)_α = [G(x, τ) b(α) + b₁(α), G(x, τ) c(α) + c₁(α)].
Then
G(x, τ) b(α) + b₁(α) ≤ G(x, τ°) b(α) + b₁(α) = G(x) b(α) + b₁(α), and
G(x, τ) c(α) + c₁(α) ≤ G(x, τ°) c(α) + c₁(α) = G(x) c(α) + c₁(α).
Therefore, since x and τ are arbitrary, the result follows as a consequence of G̃(x, τ°) = G(x) Λ̃ +* Λ̃₁ (see (a) of this theorem) and Definition 4.4.
In a similar way, the corresponding proof under the average ranking approach can be provided. To this end, observe that
R(G̃(x, τ)) = R(G(x, τ) Λ̃ +* Λ̃₁) = R(Λ̃) G(x, τ) + R(Λ̃₁) ≤ R(Λ̃) G(x) + R(Λ̃₁).
This concludes the proof of the theorem.
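Theorem 4.5 (a) means that the fuzzy index is obtained from the crisp one by the same affine transformation as the rewards. In particular, under the average ranking, the arm maximizing R(G̃) is exactly the arm with the largest crisp Gittins index, since R of the multiplying trapezoid is positive. A sketch (names are ours; trapezoids as 4-tuples):

```python
def fuzzy_gittins(G_crisp, Lam, Lam1):
    """Theorem 4.5 (a): G~(x) = G(x) Lam +* Lam1, componentwise."""
    return tuple(a * G_crisp + a1 for a, a1 in zip(Lam, Lam1))

def best_arm_by_average_ranking(crisp_indexes, Lam, Lam1):
    """Pick the arm maximizing R(G~) = R(Lam) * G + R(Lam1).
    Since R(Lam) > 0, this is the same arm the crisp rule picks."""
    R = lambda t: sum(t) / 4
    return max(range(len(crisp_indexes)),
               key=lambda i: R(fuzzy_gittins(crisp_indexes[i], Lam, Lam1)))
```

For crisp indexes [0.3, 0.9, 0.5] the fuzzy rule selects arm 1, the same as the crisp Gittins strategy.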
5 APPLICATION SETTING
In this section, a fuzzy-reward version of a scheduling example due to Gittins (Gittins, 2018) is considered. To this end, consider a machine capable of processing one job at a time from a total of N ≥ 1. At the beginning of each processing period t, one of these jobs must be selected for execution, whether or not the previous one was completed. Associated with job i and process time t_i ≥ 1 (the number of times that job i has been chosen for execution during the first t units of time), there is a probability P_i(t_i) that t_i + 1 periods are necessary to complete the job, given that more than t_i periods are needed.
In addition, if at process time t_i ≥ 1 job i has not been completed and has been selected for execution, then the following fuzzy reward will be received:
P_i(t_i) V_i Λ̃ +* Λ̃₁,  (18)
where Λ̃ = (b, d, f, k) and Λ̃₁ = (b₁, d₁, f₁, k₁) are trapezoidal fuzzy numbers such that 0 ≤ b and 0 ≤ b₁, with V_i > 0 for each i. Once a job has been terminated, it will produce a null reward.
At this point, it will be considered that Assumption 3.1 holds for every job. Some situations in which such a supposition is satisfied are, for example:
1. (Finite patience) For each job i, there exists a maximum time M_i for its conclusion.
2. (Constant probability of conclusion) For each job i and every process time t_i, it may additionally be assumed that P_i(t_i) = P_i.
3. (Limited ability for conclusion) For every job, there exists a maximum value of the probability of completing it.
Formally, for job i, set S_i = {0, 1, 2, . . . , L_i} ∪ {C}, where L_i = M_i or L_i = ∞, and C is an absorbing state associated with the completion of the corresponding job. Additionally, consider that P[{C}|C] = 1, P[{C}|t_i] = P_i(t_i), P[{t_i + 1}|t_i] = 1 − P_i(t_i), r̃_i(C) = Λ̃₁, and r̃_i(t_i) = P_i(t_i) V_i Λ̃ +* Λ̃₁, t_i ≥ 1.
If, in addition, the "finite patience" paradigm is considered, then P[{M_i}|M_i] = 1 and r̃_i(M_i) = Λ̃₁.
5.1 The Deteriorating Case
In some situations, it might be appropriate to consider that each job is such that, the more attempts are made to complete it, the less likely it is that the job will be terminated: for all i and t, P_i(t − 1) ≥ P_i(t).
In this case, for each uncompleted job (and, in the "finite patience" case, still ongoing: t < M_i), the fuzzy Gittins index coincides with the immediate reward, because in both the crisp and the fuzzy setting the optimal stopping time for job i is τ_i = 1:
G̃(t_i) = G(t_i) Λ̃ +* Λ̃₁ = P_i(t_i) V_i Λ̃ +* Λ̃₁, for each i, and G̃(C) = Λ̃₁.
Hence, the fuzzy Gittins strategy is: select an uncompleted job with the greatest value, with respect to both considered orders, of P_i(t_i) V_i Λ̃ +* Λ̃₁, such that its limit time has not been reached (if applicable). Or, equivalently, given that 0̃ ⪯* Λ̃ and 0̃ ⪯* Λ̃₁, choose an unfinished job for which P_i(t_i) V_i is the largest.
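The fuzzy Gittins strategy of the deteriorating case can be simulated directly: at each period, run the unfinished job with the largest P_i(t_i) V_i. A sketch with hypothetical completion probabilities (all names are ours):

```python
import random

def schedule_deteriorating(P, V, horizon, seed=0):
    """Run the (fuzzy) Gittins strategy on deteriorating jobs: at each
    period pick the unfinished job i maximizing P[i](t_i) * V[i], where
    t_i counts how often job i has been tried; a tried job completes
    with probability P[i](t_i).  Returns the order of completions."""
    rng = random.Random(seed)
    tries = [0] * len(V)
    done, order = [False] * len(V), []
    for _ in range(horizon):
        alive = [i for i in range(len(V)) if not done[i]]
        if not alive:
            break
        i = max(alive, key=lambda j: P[j](tries[j]) * V[j])
        if rng.random() < P[i](tries[i]):
            done[i] = True
            order.append(i)
        tries[i] += 1
    return order
```

With deterministic completion (P_i ≡ 1, a constant and hence non-increasing choice), the strategy finishes the jobs in decreasing order of V_i, as expected.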
Remark 5.1. The deteriorating framework becomes important in view of Theorem 2D in (Kaspi and Mandelbaum, 1998), which states an equivalence between any Armed Bandits process (under suitable conditions) and a deteriorating one, in the sense that the values of both bandits are equal and the classes of optimal strategies for both bandits coincide.
6 CONCLUSIONS
In this paper, a fuzzy affine transformation of the reward functions associated with a Multi-Armed Bandits model was considered. Such a transformation made it possible to model rewards that lie approximately in a certain interval. Furthermore, it guided us to formalize the fuzzy-random structure of its components and to find relationships with their crisp analogues. In particular, it was found that the policy dictated by the Gittins strategy, as well as the corresponding stopping time, are also optimal in the considered fuzzy setting. Further development may be possible by considering a fuzzy reward scheme with a denumerable number of Bandit processes with denumerable state spaces. Another line of work may arise by taking into account reward functions with countable support, or by finding application settings that satisfy Assumption 3.1.
ACKNOWLEDGMENTS
This work was partially supported by Proyecto CONAHCYT: "Procesos de Decisión de Markov en ambiente difuso", Ciencia de Frontera 2023, CF-2023-I-1362.
REFERENCES
Carrero-Vera, K., Cruz-Suárez, H., and Montes-de Oca, R.
(2020). Finite-horizon and infinite-horizon Markov
decision processes with trapezoidal fuzzy discounted
rewards. In International Conference on Operations
Research and Enterprise Systems, pages 171–192.
Springer.
Carrero-Vera, K., Cruz-Suárez, H., and Montes-de Oca, R.
(2022). Markov decision processes on finite spaces
with fuzzy total rewards. Kybernetika (Prague),
58(2):180–199.
Cruz-Suárez, H., Montes de Oca, R., and Ortega Gutiérrez,
R. (2023a). Deterministic discounted Markov deci-
sion processes with fuzzy rewards/costs. Fuzzy Infor-
mation and Engineering, 15(3):274–290.
Cruz-Suárez, H., Montes-de Oca, R., and Ortega-Gutiérrez,
R. (2023b). An extended version of average Markov
decision processes on discrete spaces under fuzzy en-
vironment. Kybernetika (Prague), 59(1):160–178.
Diamond, P. and Kloeden, P. (1994). Metric Spaces of
Fuzzy Sets: Theory and Applications. WORLD SCI-
ENTIFIC.
Furukawa, N. (1997). Parametric orders on fuzzy numbers
and their roles in fuzzy optimization problems. Opti-
mization, 40(2):171–192.
Gittins, J. and Jones, D. (1974). A dynamic allocation index
for the sequential design of experiments. Progress in
Statistics (edited by J. Gani), 241–266.
Gittins, J. C. (2018). Bandit Processes and Dynamic Alloca-
tion Indices. Journal of the Royal Statistical Society:
Series B (Methodological), 41(2):148–164.
Kaspi, H. and Mandelbaum, A. (1998). Multi-armed ban-
dits in discrete and continuous time. The Annals of
Applied Probability, 8(4):1270–1290.
Kurano, M., Song, J., Hosaka, M., and Huang, Y. (1998).
Controlled Markov set-chains with discounting. Jour-
nal of applied probability, 35(2):293–302.
Kurano, M., Yasuda, M., Nakagami, J.-i., and Yoshida, Y.
(1996). Markov-type fuzzy decision processes with
a discounted reward on a closed interval. European
Journal of Operational Research, 92(3):649–662.
Kurano, M., Yasuda, M., Nakagami, J.-i., and Yoshida, Y.
(2003). Markov decision processes with fuzzy re-
wards. Journal of Nonlinear and Convex Analysis,
4(1):105–116.
Martínez-Cortés, V. M. (2021). Bi-personal stochastic transient Markov games with stopping times and total reward criterion. Kybernetika, 57(1):1–14.
Pahade, J. K. and Jha, M. (2021). Credibilistic vari-
ance and skewness of trapezoidal fuzzy variable and
mean–variance–skewness model for portfolio selec-
tion. Results in Applied Mathematics, 11:100159.
Puri, M. L. and Ralescu, D. A. (1986). Fuzzy random vari-
ables. Journal of Mathematical Analysis and Applica-
tions, 114(2):409–422.
Puterman, M. L. (2014). Markov decision processes: dis-
crete stochastic dynamic programming. John Wiley &
Sons.
Raj, M. E. A., Sivaraman, G., and Vishnukumar, P. (2023).
A novel kind of arithmetic operations on trapezoidal
fuzzy numbers and its applications to optimize the
transportation cost. International Journal of Fuzzy
Systems, 25(3):1069–1076.
Rezvani, S. and Molani, M. (2014). Representation of trape-
zoidal fuzzy numbers with shape function. Ann. Fuzzy
Math. Inform, 8(1):89–112.
Semmouri, A., Jourhmane, M., and Belhallaj, Z. (2020).
Discounted Markov decision processes with fuzzy
costs. Annals of Operations Research, 295:769–786.
Thompson, W. R. (1933). On the likelihood that one un-
known probability exceeds another in view of the evi-
dence of two samples. Biometrika, 25(3-4):285–294.
Zadeh, L. (1965). Fuzzy sets. Information and Control,
8(3):338–353.