and off-peak time intervals, it is possible to
distribute the operators of a streaming application
so that peak intervals are combined with off-peak
ones, increasing resource utilization and improving
system performance without degrading the
performance of overloaded resources.
Therefore, the main idea of this work is to
investigate the scheduling problem of streaming data
processing with the ability to overload
computational resources, so that peak and off-peak
workload densities of an application’s operators can
be combined on the basis of predictive and
performance modeling. The contributions of this work
are:
- Modeling and problem statement for scheduling
  streaming data processing applications, taking
  into account forecasts of the incoming
  workload and overloading of computing nodes;
- Development of a distributed streaming
  platform simulator that allows exploring the
  behavior of the system under various conditions
  and scenarios;
- Development of a genetic algorithm for
  scheduling streaming data processing.
The rest of the article is structured as follows.
Section 2 reviews related work on the scheduling of
streaming data processing. Section 3 presents the
background of streaming data processing. The model
and the statement of the scheduling problem are
described in Section 4. Section 5 is devoted to the
development of the simulator and the genetic
algorithm for solving the scheduling problem.
Experimental studies and an analysis of their results
are presented in Section 6. Section 7 contains the
conclusion and future work.
2 RELATED WORKS
Task scheduling in distributed computing
environments has long been a focus of the scientific
community. A large body of work is devoted to the
scheduling of batch data processing in the form of
composite applications or, for example, MapReduce
applications in various computing environments
(Singh and Singh, 2013; Wu et al., 2015). These
works develop and investigate algorithms of
different classes, such as heuristic (Arabnejad,
2013; Topcuoglu et al., 2002) and metaheuristic (Liu
et al., 2013; Nasonov et al., 2015) ones, as well as
their modifications and hybrid schemes (Rahman et
al., 2013; Tsai et al., 2014; Yin et al., 2011).
However, compared to batch processing, the
scheduling of streaming data processing is still
poorly explored and is at the development stage.
Most of the existing algorithms are tailored to
a particular streaming platform (Storm, Spark
Streaming).
A resource-aware scheduling algorithm for Storm
is proposed in (Peng et al., 2015). The algorithm
aims to increase the throughput of the system through
tight placement of the application’s operators. The
allocation is based on finding the minimal
difference between the resources available on a node
and the operator’s requirements for these resources.
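The minimal-difference allocation described above can be illustrated with a short sketch. It is not the implementation from (Peng et al., 2015): the function name `place_operators`, the two-dimensional (CPU, memory) resource vectors, the Euclidean distance, and the greedy processing order are all assumptions made for illustration.

```python
from math import sqrt


def place_operators(operators, nodes):
    """Greedily assign each operator to the feasible node whose free
    resources are closest to the operator's requirements.

    operators: dict name -> (cpu, mem) requirement (hypothetical units)
    nodes:     dict name -> (cpu, mem) available capacity
    Returns a dict operator -> node.
    Raises ValueError if some operator fits on no node.
    """
    # Work on a copy so the caller's capacities are not modified.
    free = {name: list(cap) for name, cap in nodes.items()}
    placement = {}
    for op, req in operators.items():
        best, best_dist = None, None
        for name, avail in free.items():
            # Skip nodes that cannot satisfy every requirement.
            if any(a < r for a, r in zip(avail, req)):
                continue
            # Distance between what is free and what is needed:
            # a small value means a tight fit.
            dist = sqrt(sum((a - r) ** 2 for a, r in zip(avail, req)))
            if best_dist is None or dist < best_dist:
                best, best_dist = name, dist
        if best is None:
            raise ValueError(f"no node can host {op}")
        placement[op] = best
        # Reserve the resources on the chosen node.
        free[best] = [a - r for a, r in zip(free[best], req)]
    return placement
```

For example, with operators `{"spout": (2, 4), "bolt_a": (1, 1), "bolt_b": (2, 2)}` and nodes `{"n1": (4, 8), "n2": (3, 2)}`, the sketch places `spout` and `bolt_b` on `n1` and `bolt_a` on `n2`, since `n2` is the tighter fit for `bolt_a`.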
Another algorithm (Xu et al., 2014), which is also
a modification of the Storm platform, aims at
minimizing inter-node interaction. The algorithm
works with a matrix that allocates operators to
computing nodes. This matrix is calculated and
updated on the basis of monitoring of the system’s
workload.
The authors of (Eskandari et al., 2016) present a
hierarchical algorithm for Storm. The main idea of the
algorithm is a two-phase partitioning of the
application’s topology graph into roughly equal parts
for uniform placement of operators across computing
nodes. The partitioning minimizes the sum of the
weights of edges between subgraphs. Moreover,
before the partitioning, the optimal number of
required nodes is estimated.
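The partitioning objective, the sum of the weights of edges crossing between parts, can be made concrete with a small sketch. It only illustrates the objective on a toy graph and is not the hierarchical two-phase algorithm of (Eskandari et al., 2016): the exhaustive search over equal-size splits, the function names, and the edge representation are assumptions, and exhaustive search is feasible only for very small topologies.

```python
from itertools import combinations


def cut_weight(edges, part_a):
    """Sum of the weights of edges with exactly one endpoint in part_a.

    edges: dict (u, v) -> weight, e.g. the tuple rate between operators.
    """
    return sum(w for (u, v), w in edges.items()
               if (u in part_a) != (v in part_a))


def best_bisection(vertices, edges):
    """Try every split into two equal-size halves (floor for odd sizes)
    and return the one with the minimum cut weight.

    Returns (part_a, part_b, cut).
    """
    vertices = sorted(vertices)
    half = len(vertices) // 2
    best_part, best_cut = None, None
    for combo in combinations(vertices, half):
        part_a = set(combo)
        cut = cut_weight(edges, part_a)
        if best_cut is None or cut < best_cut:
            best_part, best_cut = part_a, cut
    return best_part, set(vertices) - best_part, best_cut
```

On a four-operator ring with edge weights `a-b: 10`, `b-c: 1`, `c-d: 8`, `a-d: 2`, the minimum-cut bisection keeps the heavily communicating pairs together: `{a, b}` and `{c, d}`, with a cut weight of 3.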
Two algorithms for the Storm platform are
suggested in (Aniello et al., 2013). The first is an
offline algorithm that tries to determine the most
closely related parts of the topology and place them
on the same or nearby nodes. The second algorithm uses
monitoring data on resource utilization and traffic
between nodes to periodically adapt previous schedules.
The next work is devoted to Spark Streaming (Liao
et al., 2016). There, one of the most influential
system parameters, which must be selected correctly,
is the time window for microbatching. Thus,
the work focuses on dynamic adaptation of the
microbatching time window depending on the
number of events entering the system.
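To make the idea of adapting the microbatching window to the input rate more tangible, the following sketch shows one possible adaptation rule. It is not the method of (Liao et al., 2016) and does not use any Spark Streaming API: the function `adapt_window`, the target number of events per batch, the smoothing factor, and the window bounds are all hypothetical parameters chosen for illustration.

```python
def adapt_window(window_s, events_in_window, target_events,
                 min_s=0.1, max_s=10.0, smoothing=0.5):
    """Propose the next batch window (in seconds) from the last observation.

    window_s:         length of the window that has just finished
    events_in_window: number of events that arrived during it
    target_events:    desired number of events per batch (assumed tuning knob)
    smoothing:        weight of the new proposal, in [0, 1]
    """
    if events_in_window <= 0:
        # No traffic observed: drift towards the largest allowed window.
        proposed = max_s
    else:
        rate = events_in_window / window_s   # events per second
        proposed = target_events / rate      # window that hits the target
    # Blend with the current window to avoid abrupt oscillations.
    new_window = (1 - smoothing) * window_s + smoothing * proposed
    # Keep the result within the allowed bounds.
    return max(min_s, min(max_s, new_window))
```

With a 2 s window that received 4000 events and a target of 1000 events per batch, the rule proposes 0.5 s and, after smoothing, moves the window to 1.25 s. Other policies, for example ones driven by the measured batch processing time, are equally plausible.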
Besides platform-oriented algorithms, there are
works devoted to generalized modeling and
investigation of streaming data processing. In
such works, the scheduling problem is presented in a
more general form, defining the indicators and
performance characteristics of the system and
ways to evaluate and improve them.
A method for grouping incoming tuples across
operators is proposed in (Rivetti et al., 2016),
allowing the computational workload to be balanced.
By estimating the execution time of tuples
and routing them to less loaded operators, the