versatility and power of multi-armed bandit algorithms in dealing with complex, uncertain environments.
The applications of multi-armed bandit
algorithms are extensive and impactful across various
domains. In advertising, these algorithms optimize ad
placements by continuously adjusting to user
responses, thereby maximizing click-through rates
and overall campaign effectiveness. This dynamic
approach reduces wastage of ad spend and improves
the relevance of ads shown to users.
Today, advertising permeates nearly every aspect of daily life, and for most products, advertising is the primary means of gaining visibility. It boosts both production and consumption by helping the right products reach the right people, and it generates enormous revenue: according to Statista, global advertising expenditure is projected to reach $1.089 trillion in 2024 (https://www.statista.com, 2024). In a market this large, even modest improvements in the quality and precision of advertising campaigns can yield substantial benefits, so employing superior algorithms for ad promotion is a strategy that can lead to significant economic advantages.
In the field of applications, numerous scholars have pursued practical implementations in this direction. X. Zhang and colleagues have employed multi-armed bandit algorithms in the broader context of generalized item recommendation, striving to make optimal decisions. Meanwhile, W. Chen and associates have utilized the combinatorial Multi-Armed Bandit (CMAB) framework specifically for targeted advertising recommendations, addressing real-world challenges. This paper specifically targets advertisement deployment on the Amazon shopping website, using Amazon product data to let the algorithms simulate advertising campaigns. By
analyzing the performance of each algorithm based
on data-driven simulations, the study identifies
distinctive features and optimizes parameter settings
for each method. It also provides empirical
recommendations for parameter adjustments tailored
to different scenarios, aiming to maximize
algorithmic performance. This approach not only
enhances the efficiency of advertisement placements
but also contributes to a deeper understanding of how
adaptive algorithms can be fine-tuned for specific
marketing challenges.
2 ALGORITHM MODEL
The Upper Confidence Bound (UCB) algorithm
embodies the optimism principle in uncertain
situations. It acts as though each arm is as rewarding as the observed data plausibly allow, and it exploits current knowledge to minimize long-term regret. By optimistically estimating the potential reward of each action, the UCB algorithm balances exploring new options against exploiting known ones, guiding the decision process toward the most rewarding outcomes. This strategy is
crucial for problems like online ad placement and
clinical trials, where decisions must adapt to evolving
data.
In the UCB algorithm, after each decision-making iteration, the algorithm assigns an upper confidence bound to each arm based on the results observed so far. In the subsequent round, the arm with the highest bound, the one anticipated to yield the greatest return, is selected. This cycle repeats so that each decision is the theoretically best-supported choice. While the specific procedure and the calculation of the bound vary among UCB variants, the underlying philosophy remains consistent.
For each round $t = 1, 2, 3, \ldots, n$, the algorithm must select an arm $A_t$ from the set $\{1, 2, 3, \ldots, k\}$, where $k$ is the number of arms. Upon making a decision, a random reward $X_t$ is received. The reward distributions of the arms are independent. The cumulative reward for arm $i$ is defined as $E_i = \sum_{t:\,A_t = i} X_t$. The average return observed for each arm is given by $\hat{\mu}_i = E_i / T_i(t)$, where $T_i(t)$ is the number of times arm $i$ has been selected up to round $t$. The regret is defined as $R_n = n \max_i \mu_i - \sum_{t=1}^{n} X_t$, which quantifies the difference between the cumulative reward of always choosing the optimal arm and the cumulative reward actually accrued.
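The setup above maps directly to a simulation loop. The following is a minimal sketch, assuming Bernoulli (click / no-click) rewards; the BernoulliBandit class, the arm probabilities, and the generic select_arm callback are illustrative assumptions, not the paper's implementation.

```python
import random

class BernoulliBandit:
    """Simulated k-armed bandit: arm i pays 1 with probability probs[i], else 0."""
    def __init__(self, probs):
        self.probs = probs            # true (unknown) mean reward of each arm
        self.best_mean = max(probs)   # used only afterwards, to compute regret

    def pull(self, arm):
        return 1.0 if random.random() < self.probs[arm] else 0.0

def run(bandit, select_arm, n):
    """Generic bandit loop; select_arm(counts, sums, t) returns an arm index."""
    k = len(bandit.probs)
    counts = [0] * k    # T_i(t): times arm i has been selected
    sums = [0.0] * k    # E_i: cumulative reward of arm i
    total = 0.0
    for t in range(1, n + 1):
        arm = select_arm(counts, sums, t)
        x = bandit.pull(arm)
        counts[arm] += 1
        sums[arm] += x
        total += x
    # Regret R_n = n * max_i(mu_i) minus the reward actually accrued.
    return n * bandit.best_mean - total
```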
2.1 Algorithms Used in the Study
Lattimore presents a classical UCB algorithm and an improved variant derived from it, the Asymptotic Optimality UCB. For the classical UCB algorithm, the bound for each arm follows from the concentration inequality:
$P\left(\mu \geq \hat{\mu} + \sqrt{\tfrac{2\log(1/\delta)}{n}}\right) \leq \delta$ for all $\delta \in (0, 1)$    (1)
Here, following Lattimore's default assumption, the reward of each arm is a 1-subgaussian random variable, and $\hat{\mu}$ is the empirical mean after $n$ observations.
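Inverting inequality (1) yields the index used by the classical UCB rule, $\hat{\mu}_i + \sqrt{2\log(1/\delta)/T_i(t)}$. Below is a minimal sketch that plugs into the run loop above; the fixed confidence level delta supplied by the caller is an illustrative choice, not a value prescribed by the paper.

```python
import math

def classical_ucb(delta):
    """Classical UCB index: mu_hat_i + sqrt(2 * log(1/delta) / T_i).
    A smaller delta widens the confidence bonus and forces more exploration."""
    def select_arm(counts, sums, t):
        # Play every arm once first so each T_i is positive.
        for i, c in enumerate(counts):
            if c == 0:
                return i
        bonus = 2.0 * math.log(1.0 / delta)
        scores = [sums[i] / counts[i] + math.sqrt(bonus / counts[i])
                  for i in range(len(counts))]
        return max(range(len(scores)), key=scores.__getitem__)
    return select_arm
```

For example, run(BernoulliBandit([0.10, 0.05, 0.12]), classical_ucb(delta=0.01), n=10000) would simulate a three-ad campaign and return the realized regret; the click probabilities and delta here are arbitrary placeholders.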
The Asymptotic Optimality UCB is distinct from the classical UCB in that its principal feature is the elimination of the need to specify the horizon $n$, the total number of rounds the algorithm will run. This is greatly beneficial in practical applications, where the appropriate number of explorations is often unknown in advance. The Asymptotic Optimality UCB