deprovision resources (event consumer replicas) so that resource usage is minimized while guaranteeing (1).
Solutions offered by cloud providers and by on-premises cluster orchestrators such as Kubernetes (KEDA, 2023), which scale event consumer replicas when a monitored metric crosses a threshold, are not satisfactory. In essence, these autoscalers assume a linear relationship between the current value of a monitored metric and its desired value to compute the needed number of replicas. Hence, a linear autoscaler for event queues emulating cloud autoscalers will use the ratio of the arrival rate of events to the maximum consumption rate per replica to obtain the needed number of replicas. However, as we describe throughout this paper, computing the number of needed replicas calls for a bin pack solution rather than a linear one.
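To make this concrete, the following minimal sketch (in Python; the per-partition rates and replica capacity are hypothetical numbers of ours, not taken from our evaluation) contrasts the two computations. Because a partition is consumed by exactly one replica, skewed per-partition rates can force more replicas than the linear ratio suggests:

    import math

    # Hypothetical numbers, for illustration: per-partition event
    # arrival rates (events/s) and the max consumption rate per replica.
    partition_rates = [60.0, 60.0, 60.0]
    max_rate_per_replica = 100.0

    # A linear, KEDA-style autoscaler divides the aggregate arrival
    # rate by the per-replica capacity:
    print(math.ceil(sum(partition_rates) / max_rate_per_replica))  # 2

    # But a partition is an indivisible unit of work, so partitions
    # (items) must be packed into replicas (bins of bounded capacity):
    bins = []
    for rate in sorted(partition_rates, reverse=True):
        for b in bins:
            if sum(b) + rate <= max_rate_per_replica:
                b.append(rate)
                break
        else:
            bins.append([rate])
    print(len(bins))  # 3: no two 60 ev/s partitions fit in one replica

Here the linear computation provisions two replicas, yet no assignment of whole partitions to two replicas respects the per-replica capacity; three are needed.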
Furthermore, Kubernetes and cloud autoscalers are not middleware/platform aware. As such, they do not recommend or participate in the assignment of the provisioned event consumer replicas to partitions. Rather, they rely on the distributed event queue middleware logic for assigning partitions to event consumers. This can result in scenarios where event consumer replicas are not assigned to partitions in a latency-aware manner, even if enough replicas are provisioned by such autoscalers. A latency-aware and resource-efficient autoscaler for distributed event queues must provision just enough replicas to achieve the desired latency, and it must recommend to the distributed event queue middleware an assignment of partitions to the provisioned event consumer replicas that guarantees the desired latency.
As we discuss throughout this paper, aiming at a high-percentile latency SLA in distributed event queues is not straightforward even in the presence of a dynamic resource provisioning mechanism. This is because reducing the percentage of events whose latency exceeds the desired latency (that is, the tail latency) while dynamically provisioning and deprovisioning resources (event consumer replicas) are two conflicting objectives in distributed event queues. The conflict stems from the fact that scaling event consumers up or down necessitates a blocking synchronization protocol to distribute the load of the events waiting in queues among the provisioned event consumer replicas. During this synchronization protocol, also called rebalancing or assignment (Blee-Goldman, 2020; Narkhede et al., 2017), all event consumer replicas stop processing events, which eventually contributes to a larger tail latency and a lower percentage of events meeting the latency SLA. The increase in tail latency results from the fact that all events arriving during the execution of the synchronization protocol exhibit a higher latency than events processed during normal operation of the system. Clearly, the relation between the desired latency SLA and the duration of the blocking rebalancing protocol dictates whether there is room for regularly and dynamically modifying the number of event consumers. If the rebalancing time is very high compared to the desired latency SLA, the only deployment that can ensure the desired latency SLA for a high percentile of events is one where all replicas are provisioned from start-up time (an overprovisioned solution), even though this comes at a non-optimal cost, since some of the ready replicas may not operate all the time. But if the ratio of the desired latency SLA to the rebalancing time is greater than 1, a good tradeoff can be sought: a just-needed number of replicas deployed while ensuring a small tail latency. Our research contributes a solution towards finding such a tradeoff, still prioritizing the latency SLA guarantee over cost reduction.
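The ratio argument can be read as a simple feasibility check, sketched below in Python with hypothetical figures of ours (the SLA and rebalancing durations are illustrative, not measurements from this paper):

    # Hypothetical figures, for illustration only.
    desired_latency_sla = 10.0  # seconds, e.g. a high-percentile target
    rebalancing_time = 2.0      # seconds during which consumers block

    if desired_latency_sla / rebalancing_time > 1.0:
        # a rebalance fits within the SLA budget, so replicas can be
        # (de)provisioned dynamically at a bounded tail-latency cost
        print("seek the tradeoff: just-needed replicas, small tail latency")
    else:
        # every event arriving during a rebalance already misses the
        # SLA, so only provisioning all replicas from start-up works
        print("overprovision from start-up")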
To achieve this, we first formulate the problem of autoscaling event consumers of distributed event queues to meet a desired latency as a bin pack problem: it depends on the arrival rate of events into the queues, which can be skewed across partitions, on the number of events in the queue backlogs, and on the maximum consumption rate of the event consumers. We propose an appropriate heuristic (Least Loaded) that solves the bin pack problem in polynomial time while maintaining a balanced load, in terms of events served, across the event consumer replicas.
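A minimal sketch of such a least-loaded packing follows, under our reading of the heuristic; the function name, the descending-demand ordering, and the tie-breaking below are our own assumptions for illustration, not the authors' implementation:

    def least_loaded_assignment(partition_demands, capacity):
        """Assign each partition to the least-loaded replica that can
        absorb it, provisioning a new replica when none can."""
        replicas = []  # replicas[i] = list of (partition_id, demand)
        loads = []     # loads[i] = current total demand on replica i
        # placing heavy partitions first keeps the packing balanced
        for pid, demand in sorted(enumerate(partition_demands),
                                  key=lambda p: -p[1]):
            fitting = [i for i, load in enumerate(loads)
                       if load + demand <= capacity]
            if fitting:
                i = min(fitting, key=lambda i: loads[i])
            else:
                replicas.append([])
                loads.append(0.0)
                i = len(replicas) - 1
            replicas[i].append((pid, demand))
            loads[i] += demand
        return replicas

    # e.g. demands of [60, 60, 60, 15, 5] ev/s with a capacity of
    # 100 ev/s yield three replicas, with the light partitions spread
    # over the least-loaded ones
    print(least_loaded_assignment([60, 60, 60, 15, 5], 100.0))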
As the synchronization protocol upon consumer replica (un-)provisioning is blocking, we extend our initial bin pack solution to take into account the new events that accumulate during the autoscaling operation.
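One plausible way to fold this accumulation into the packing input, given as an illustrative model of ours (the exact formulation is developed later in the paper and may differ), is to inflate each partition's demand with the arrivals expected during the blocking rebalance:

    def effective_demand(arrival_rate, backlog, sla, rebalance_time):
        # events to drain within one SLA window: the current backlog,
        # plus arrivals during the blocking rebalance, plus arrivals
        # during the window itself
        to_drain = backlog + arrival_rate * (rebalance_time + sla)
        return to_drain / sla  # required consumption rate, events/s

    # e.g. 50 ev/s arriving, 200 events queued, a 10 s latency target,
    # and a 2 s blocking rebalance:
    print(effective_demand(50.0, 200.0, 10.0, 2.0))  # 80.0 ev/s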
We also propose several recommendations on the configuration of the rebalancing protocol that contribute to a lower tail latency. We first show experimentally that, on selected workloads, our bin pack solution outperforms a linear autoscaling solution by 3.5% to 10% in terms of latency guarantee in a first system setting where the rebalancing time overhead is much smaller than the desired latency SLA. Then, under other, less favourable system settings regarding the rebalancing overhead, we show that the proposed bin pack extension applied to the same workloads results in a lower tail latency, and thus a better latency SLA guarantee, but at a higher resource utilization cost.
To our knowledge, latency-aware and dynamic
resource provisioning for distributed event queues in
the presence of a blocking resource synchronization/