2 STREAMING ENGINE OVERVIEW
Stream Processing (SP) is a novel paradigm for analyzing, in real time, data captured from heterogeneous data sources. Instead of storing the data and then processing it, the data is processed on the fly, as soon as it is received; at most, a window of data is kept in memory. SP queries are continuous queries that run on a (potentially infinite) stream of events. Continuous queries are modeled as graphs where nodes are SP operators and arrows are streams of events. SP operators are computational boxes that process events received over the incoming streams and produce output events on the outgoing streams. SP operators can be either stateless (such as projection and filter) or stateful, depending on whether they operate on the current event (tuple) alone or on a set of events (a time-based or count-based window), as sketched below.
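To make the distinction concrete, the following is a minimal sketch in Java of a stateless filter operator next to a stateful count-based window operator; the Operator interface and class names are illustrative, not tied to any particular engine.

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Illustrative operator interface: consumes one input event, may emit output events.
interface Operator<I, O> {
    List<O> process(I event);
}

// Stateless: the decision depends only on the current tuple.
class Filter<T> implements Operator<T, T> {
    private final java.util.function.Predicate<T> predicate;
    Filter(java.util.function.Predicate<T> predicate) { this.predicate = predicate; }
    public List<T> process(T event) {
        return predicate.test(event) ? List.of(event) : List.of();
    }
}

// Stateful: the output depends on a count-based window of past tuples.
class WindowAverage implements Operator<Double, Double> {
    private final int windowSize;
    private final Queue<Double> window = new ArrayDeque<>();
    private double sum = 0.0;
    WindowAverage(int windowSize) { this.windowSize = windowSize; }
    public List<Double> process(Double event) {
        window.add(event);
        sum += event;
        if (window.size() < windowSize) return List.of();  // window not yet full
        double avg = sum / windowSize;
        sum -= window.remove();                            // slide the window by one event
        return List.of(avg);
    }
}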
Several implementations have emerged from both academia and industry, such as Borealis (Ahmad et al., 2005), Infosphere (Pu et al., 2001), Storm (http://storm.apache.org/), Flink (https://flink.apache.org/) and StreamCloud (Gulisano et al., 2012). Storm and Flink followed an approach similar to that of StreamCloud, in which a continuous query runs in a distributed and parallel way over several machines, which in turn increases the system throughput in terms of the number of tuples processed per second. The UPM-CEP (Complex Event Processing) streaming engine adds efficiency to this parallel-distributed processing, being able to reach a higher throughput using fewer resources. It improves network management, reduces the inefficiency of garbage collection through techniques such as object reutilization (sketched below), and takes advantage of novel Non-Uniform Memory Access (NUMA) multicore architectures by minimizing the time spent in context switching of SP threads/processes.
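Object reutilization as a general technique can be illustrated with a simple free-list pool that recycles event objects instead of allocating new ones; this is a generic sketch, not UPM-CEP's actual implementation.

import java.util.concurrent.ConcurrentLinkedQueue;

// Generic pool that recycles event objects instead of allocating new ones,
// reducing garbage-collection pressure on the hot path.
class EventPool<T> {
    private final ConcurrentLinkedQueue<T> free = new ConcurrentLinkedQueue<>();
    private final java.util.function.Supplier<T> factory;

    EventPool(java.util.function.Supplier<T> factory) { this.factory = factory; }

    T acquire() {
        T obj = free.poll();
        return obj != null ? obj : factory.get();  // reuse if available, else allocate
    }

    void release(T obj) {
        free.offer(obj);                           // return the object for later reuse
    }
}

Keeping hot-path allocation rates low in this way shortens and spaces out garbage-collection pauses, which is what makes the technique attractive for high-throughput event processing.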
The UPM-CEP JCEPC (Java CEP Connectivity) driver hides the complexity of the underlying cluster from the applications. Applications can create and deploy continuous queries using the JCEPC driver, as well as register source streams and subscribe to the output streams of these queries. During deployment, the JCEPC driver takes care of splitting a query into sub-queries and deploys them in the CEP cluster. Some of those sub-queries can be parallelized.
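The workflow just described could be captured by interfaces of the following shape; all names here are hypothetical illustrations of the described responsibilities, not the actual JCEPC API.

// Hypothetical interfaces illustrating the described workflow; the names are
// illustrative and do not correspond to the real JCEPC classes.
interface CepDriver {
    ContinuousQuery createQuery(String definition);  // build a continuous query
    void deploy(ContinuousQuery query);              // split into sub-queries, deploy to the cluster
    <T> void registerSource(String stream, java.util.function.Consumer<T> producer);
    <T> void subscribe(String outputStream, java.util.function.Consumer<T> callback);
}

interface ContinuousQuery {
    String outputStream();  // name of the query's output stream
}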
3 METHOD OVERVIEW
The ParCorr time series correlation discovery algorithm (Yagoubi et al., 2018) is based on work on fast window correlations over time series of numerical data (Cole et al., 2005), and concentrates on adapting the approach to the context of a large number of parallel data streams. The analysis is done on sliding windows of time series data, so that recent correlations are continuously discovered in near real-time. At each move of the sliding window, the latest elements of the time series are taken as multi-dimensional vectors. As a similarity measure between such vectors, we take the Euclidean distance, since it is directly related to the Pearson correlation coefficient when applied to normalized vectors.
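Concretely, if \hat{x} and \hat{y} denote the z-normalized versions (zero mean, unit population standard deviation) of two length-m windows x and y, the squared Euclidean distance and the Pearson coefficient \rho(x, y) are tied by the standard identity

D^2(\hat{x}, \hat{y}) = \sum_{i=1}^{m} (\hat{x}_i - \hat{y}_i)^2 = 2m\,(1 - \rho(x, y)),

so small distances between normalized vectors correspond to high correlations, and searching for highly correlated pairs reduces to searching for nearby vectors.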
Since the sliding window can result in a very high number of dimensions of the time series vectors, which makes them very expensive to compare to each other, a major challenge the algorithm addresses is reducing the dimensionality in a way that nearly preserves the Euclidean distances. For this purpose, a random projection approach is adopted, where each high-dimensional vector is transformed into a low-dimensional one (called the "sketch" of the vector) by applying a product with a specific transformation matrix, whose elements are randomly selected from the values -1 and 1. This approach guarantees with high probability that the distance between any pair of original vectors corresponds to the distance between their sketches.
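A minimal sketch of this transformation in Java follows; the sketch size, seed handling, and naming are illustrative assumptions.

import java.util.Random;

// Reduce an m-dimensional time series window to a k-dimensional sketch
// by multiplying it with a random matrix whose entries are -1 or +1.
class RandomProjection {
    private final int[][] matrix;   // k x m transformation matrix

    RandomProjection(int k, int m, long seed) {
        Random rnd = new Random(seed);
        matrix = new int[k][m];
        for (int i = 0; i < k; i++)
            for (int j = 0; j < m; j++)
                matrix[i][j] = rnd.nextBoolean() ? 1 : -1;
    }

    double[] sketch(double[] window) {
        double[] s = new double[matrix.length];
        for (int i = 0; i < matrix.length; i++)
            for (int j = 0; j < window.length; j++)
                s[i] += matrix[i][j] * window[j];  // dot product with row i
        return s;
    }
}

Note that the same transformation matrix (here fixed via the seed) must be applied to all time series, so that distances between sketches remain comparable.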
Furthermore, to simplify comparison across sketches, each sketch vector is partitioned into subvectors (e.g. two-dimensional), so that, for example, a 30-dimensional sketch vector is broken into 15 two-dimensional subvectors. Then, discrete grid structures (in the example, 15 two-dimensional grids) are built and subvectors are assigned to grid cells, so that close subvectors are grouped in the same grid cells. This process essentially performs a locality-sensitive hashing of the high-dimensional time series vectors, where close vectors are discovered by searching for pairs of vectors that appear together in a high number of grid cells. Since this can output false positives, the candidate pairs are explicitly verified by computing the actual distance between them.
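The partitioning and grid assignment can be sketched as follows, with an illustrative cell-size parameter controlling the grid resolution.

import java.util.*;

// Partition a sketch into two-dimensional subvectors and map each one to a
// discrete grid cell; time series whose subvectors land in the same cell of
// the same grid are grouped together as potentially close.
class GridIndex {
    private final double cellSize;   // grid resolution (illustrative parameter)
    // one map per grid: cell id -> ids of the time series collocated in that cell
    private final List<Map<String, List<Integer>>> grids = new ArrayList<>();

    GridIndex(int numGrids, double cellSize) {   // numGrids = sketch length / 2
        this.cellSize = cellSize;
        for (int g = 0; g < numGrids; g++) grids.add(new HashMap<>());
    }

    void add(int seriesId, double[] sketch) {
        // e.g. a 30-dimensional sketch yields 15 two-dimensional subvectors
        for (int g = 0; g < sketch.length / 2; g++) {
            long cx = (long) Math.floor(sketch[2 * g] / cellSize);
            long cy = (long) Math.floor(sketch[2 * g + 1] / cellSize);
            grids.get(g).computeIfAbsent(cx + ":" + cy, k -> new ArrayList<>())
                 .add(seriesId);
        }
    }
}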
This outlines four main steps of the algorithm:
Sketching: computation and partitioning of sketches;
Collocation: grouping together all time series assigned to the same grid cell;
Correlation: finding frequently collocated pairs as correlation candidates;
Verification: computing the actual distance between candidate pairs to filter out false positives (sketched below).
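The verification step can be as simple as computing the true Pearson correlation for each candidate pair of raw windows and keeping the pairs above a threshold; a minimal sketch, where the threshold is an illustrative parameter.

// Verify a candidate pair by computing the actual Pearson correlation
// between the two raw windows, filtering out false positives.
class Verifier {
    static double pearson(double[] x, double[] y) {
        int m = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < m; i++) { mx += x[i]; my += y[i]; }
        mx /= m; my /= m;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < m; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx  += (x[i] - mx) * (x[i] - mx);
            vy  += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }

    static boolean verify(double[] x, double[] y, double threshold) {
        return pearson(x, y) >= threshold;   // keep only truly correlated pairs
    }
}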