In a previous paper (Kolev et al., 2019), we
presented the generic implementation of the method
in detail. In this paper, we focus on a modification
that enables efficient data pipelining, exploiting the
fact that, at each window, the target time series can be
hashed as a first step, so that all the other series can
then be correlated to the targets in a pipeline. The rest
of the paper gives a brief overview of the streaming
engine and the ParCorr method, followed by a
description of the pipelined implementation in
comparison with a naïve one; both are then evaluated
experimentally.
2 UPM-CEP: STREAMING ENGINE OVERVIEW
Stream Processing (SP) is a novel paradigm for
analyzing, in real time, data captured from
heterogeneous data sources. Instead of storing the
data and then processing it, the data is processed on
the fly, as soon as it is received; at most a window
of data is kept in memory. SP queries are
continuous queries that run over an (infinite) stream of
events. Continuous queries are modeled as graphs
whose nodes are SP operators and whose arrows are
streams of events. SP operators are computational boxes
that process events received over the incoming streams
and produce output events on the outgoing streams. SP
operators can be either stateless (such as projection
and filter) or stateful, depending on whether they operate
on the current event (tuple) or on a set of events (a time
window or a count-based window). Several
implementations have emerged from both academia
and industry, such as Borealis (Ahmad et al., 2005),
Infosphere (Pu et al., 2001), Storm¹, Flink² and
StreamCloud (Gulisano et al., 2012). Storm and Flink
followed an approach similar to that of StreamCloud,
in which a continuous query runs in a distributed and
parallel way over several machines, which in turn
increases the system throughput in terms of the number
of tuples processed per second. The UPM-CEP
(Complex Event Processing) streaming engine adds
efficiency to this parallel-distributed processing,
reaching higher throughput with fewer resources. It
improves network management, reduces garbage
collection overhead by implementing techniques such
as object reutilization, and takes advantage of modern
Non-Uniform Memory Access (NUMA) multicore
architectures by minimizing the time spent in context
switching of SP threads/processes.
¹ http://storm.apache.org/
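To make the operator model concrete, the following minimal sketch illustrates a stateless filter and a stateful count-window average wired into a small query graph. All interfaces here are hypothetical illustrations of the operator concepts above, not the UPM-CEP API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Hypothetical illustration of SP operators; not the UPM-CEP/JCEPC API.
public class OperatorSketch {

    // An SP operator consumes input events and emits output events downstream.
    interface Operator<I, O> {
        void process(I event, Consumer<O> downstream);
    }

    // Stateless operator: decides on each tuple in isolation (a filter).
    static class Filter<T> implements Operator<T, T> {
        private final java.util.function.Predicate<T> predicate;
        Filter(java.util.function.Predicate<T> predicate) { this.predicate = predicate; }
        public void process(T event, Consumer<T> downstream) {
            if (predicate.test(event)) downstream.accept(event);
        }
    }

    // Stateful operator: aggregates over a count-based window of events.
    static class WindowAverage implements Operator<Double, Double> {
        private final int windowSize;
        private final Deque<Double> window = new ArrayDeque<>();
        private double sum = 0.0;
        WindowAverage(int windowSize) { this.windowSize = windowSize; }
        public void process(Double event, Consumer<Double> downstream) {
            window.addLast(event);
            sum += event;
            if (window.size() > windowSize) sum -= window.removeFirst();
            if (window.size() == windowSize) downstream.accept(sum / windowSize);
        }
    }

    public static void main(String[] args) {
        Filter<Double> positive = new Filter<>(v -> v > 0);
        WindowAverage avg = new WindowAverage(3);
        // A two-operator query graph: filter -> window average -> sink.
        double[] stream = {1.0, -2.0, 3.0, 5.0, -1.0, 7.0};
        for (double v : stream) {
            positive.process(v, filtered ->
                avg.process(filtered, out -> System.out.println("avg = " + out)));
        }
    }
}
```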
The UPM-CEP JCEPC (Java CEP Connectivity)
driver hides the complexity of the underlying cluster
from applications. Through the JCEPC driver,
applications can create and deploy continuous queries,
register source streams, and subscribe to the output
streams of these queries. During deployment, the
JCEPC driver takes care of splitting a query into
sub-queries and deploys them in the CEP cluster,
where some of those sub-queries can be parallelized.
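A possible shape of this lifecycle is sketched below. Since the concrete JCEPC signatures are not given in this paper, every class and method name in the sketch (CepDriver, deployQuery, publish, subscribe, and the query syntax) is a hypothetical stand-in for the create/deploy/subscribe steps just described.

```java
// Hypothetical driver facade; the real JCEPC API may differ.
public class QueryLifecycleSketch {

    interface StreamListener { void onEvent(Object[] tuple); }

    // Stand-in for the driver: deploy a query, feed a source, subscribe to output.
    static class CepDriver {
        void deployQuery(String queryName, String definition) {
            // Here the driver would split the query into sub-queries and
            // deploy them across the CEP cluster.
        }
        void publish(String sourceStream, Object[] tuple) { /* send to cluster */ }
        void subscribe(String outputStream, StreamListener listener) { /* register */ }
    }

    public static void main(String[] args) {
        CepDriver driver = new CepDriver();
        driver.deployQuery("corrQuery",
            "SELECT ... FROM prices WINDOW 500 EVENTS"); // hypothetical syntax
        driver.subscribe("corrQuery.out",
            tuple -> System.out.println("output: " + java.util.Arrays.toString(tuple)));
        driver.publish("prices", new Object[]{"ts-42", 101.5});
    }
}
```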
3 ParCorr: METHOD OVERVIEW
The ParCorr time series correlation discovery
algorithm (Yagoubi et al., 2018) builds on work
on fast window correlations over time series of
numerical data (Cole et al., 2005) and concentrates
on adapting the approach to the context of a large
number of parallel data streams. The analysis is done
on sliding windows of time series data, so that recent
correlations are continuously discovered in near
real-time. At each move of the sliding window,
the latest elements of each time series are taken as a
multi-dimensional vector. As a similarity measure
between such vectors, we take the Euclidean distance,
since, applied to normalized vectors, it is directly
related to the Pearson correlation coefficient.
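The relation is the standard identity for z-normalized windows (zero mean, unit norm), under which small Euclidean distance is equivalent to high correlation:

```latex
% For z-normalized windows x, y with zero mean and \|x\| = \|y\| = 1,
% the Pearson correlation reduces to the inner product, hence
\[
  d^2(x, y) \;=\; \|x - y\|^2
           \;=\; \|x\|^2 + \|y\|^2 - 2\langle x, y\rangle
           \;=\; 2\bigl(1 - \mathrm{corr}(x, y)\bigr).
\]
```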
Since the sliding window can result in time series
vectors of very high dimensionality, which makes
them very expensive to compare to each other, a
major challenge the algorithm addresses is reducing
the dimensionality in a way that nearly preserves the
Euclidean distances. For this purpose, a random
projection approach is adopted: each high-dimensional
vector is transformed into a low-dimensional one
(called the “sketch” of the vector) by multiplying it
with a transformation matrix whose elements are
randomly selected from the values -1 and 1. This
approach guarantees with high probability that the
distance between any pair of original vectors is
closely approximated by the distance between their
sketches.
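A minimal sketch computation under these assumptions could look as follows; the sketch size, scaling factor and seed are illustrative choices, not parameters taken from the ParCorr implementation. Note that the same random matrix must be applied to all time series so that their sketches remain comparable.

```java
import java.util.Random;

// Illustrative random projection of window vectors into low-dimensional sketches.
public class RandomProjectionSketch {

    // Transformation matrix with entries drawn uniformly from {-1, +1}.
    static int[][] randomSignMatrix(int sketchSize, int windowSize, long seed) {
        Random rnd = new Random(seed);
        int[][] m = new int[sketchSize][windowSize];
        for (int i = 0; i < sketchSize; i++)
            for (int j = 0; j < windowSize; j++)
                m[i][j] = rnd.nextBoolean() ? 1 : -1;
        return m;
    }

    // Sketch = matrix-vector product, scaled by 1/sqrt(sketchSize) so that
    // Euclidean distances are approximately preserved in expectation.
    static double[] sketch(int[][] matrix, double[] window) {
        double[] s = new double[matrix.length];
        for (int i = 0; i < matrix.length; i++) {
            double dot = 0.0;
            for (int j = 0; j < window.length; j++) dot += matrix[i][j] * window[j];
            s[i] = dot / Math.sqrt(matrix.length);
        }
        return s;
    }

    public static void main(String[] args) {
        int windowSize = 256, sketchSize = 30;
        // One shared matrix for every time series in the system.
        int[][] m = randomSignMatrix(sketchSize, windowSize, 42L);
        double[] w = new double[windowSize];
        for (int j = 0; j < windowSize; j++) w[j] = Math.sin(j * 0.1);
        System.out.println("first sketch coordinate: " + sketch(m, w)[0]);
    }
}
```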
Furthermore, to simplify comparisons across
sketches, each sketch vector is partitioned into
subvectors (e.g. two-dimensional), so that, for
example, a 30-dimensional sketch vector is broken
into 15 two-dimensional subvectors. Then, discrete
grid structures (in the example, 15 two-dimensional
grids) are built and the subvectors are assigned to
grid cells, so that close subvectors are grouped in the
same grid cells. This process essentially performs a
hashing of the time series, whereby series whose
sketches are close end up in the same grid cells.
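As an illustration of the grid assignment, each two-dimensional subvector can be mapped to a discrete cell key as in the sketch below; the cell width and key encoding are hypothetical choices, not those of the ParCorr implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative assignment of sketch subvectors to discrete grid cells.
public class GridAssignment {

    // Map a 2-D subvector to its cell in grid 'g', using a fixed cell width.
    static String cellKey(int g, double x, double y, double cellWidth) {
        long cx = (long) Math.floor(x / cellWidth);
        long cy = (long) Math.floor(y / cellWidth);
        return g + ":" + cx + ":" + cy;
    }

    // Break a sketch into consecutive 2-D subvectors and compute one key per grid.
    static List<String> gridKeys(double[] sketch, double cellWidth) {
        List<String> keys = new ArrayList<>();
        for (int g = 0; g < sketch.length / 2; g++)
            keys.add(cellKey(g, sketch[2 * g], sketch[2 * g + 1], cellWidth));
        return keys;
    }

    public static void main(String[] args) {
        // A 30-dimensional sketch yields 15 two-dimensional subvectors,
        // hence one cell key per grid; equal keys flag candidate pairs.
        double[] sketch = new double[30];
        for (int i = 0; i < 30; i++) sketch[i] = Math.cos(i);
        System.out.println(gridKeys(sketch, 0.5));
    }
}
```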
² https://flink.apache.org/