query the JCEPC driver takes care of splitting the
query into sub-queries and deploying them in the CEP.
Some of those sub-queries can be parallelized. For
instance, the query in Figure 3 shows a data streaming
query made up of two input streams and eight
operators. Each operator is either stateless (SL) or
stateful (SF).
Figure 3: Data streaming query.
UPM-CEP provides several stateless and stateful
operators, and users can create their own
customized operators. The stateless operators
are: 1) Map, which selects the desired fields from
the input tuple and creates an output tuple with those
fields; 2) Filter, which sends through the output
stream only the tuples that satisfy a defined
condition and discards the rest; 3) Demux, which
sends each input tuple to every output stream whose
defined condition the tuple satisfies; and 4) Union,
which sends the tuples arriving from different input
streams to a single output stream.
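The behaviour of the four stateless operators can be illustrated with a minimal Python sketch. It is not the UPM-CEP API: the function names (map_op, filter_op, demux_op, union_op) are hypothetical, tuples are modelled as dictionaries, and streams as finite iterables, so Union simply concatenates its inputs instead of interleaving them by arrival time.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Tuple_ = Dict[str, object]  # a stream tuple modelled as a field -> value mapping


def map_op(stream: Iterable[Tuple_], fields: List[str]) -> Iterator[Tuple_]:
    """Map: keep only the selected fields of each input tuple."""
    for t in stream:
        yield {f: t[f] for f in fields}


def filter_op(stream: Iterable[Tuple_],
              predicate: Callable[[Tuple_], bool]) -> Iterator[Tuple_]:
    """Filter: forward the tuples that satisfy the condition, discard the rest."""
    for t in stream:
        if predicate(t):
            yield t


def demux_op(stream: Iterable[Tuple_],
             conditions: List[Callable[[Tuple_], bool]]) -> List[List[Tuple_]]:
    """Demux: copy each tuple to every output whose condition it satisfies."""
    outputs: List[List[Tuple_]] = [[] for _ in conditions]
    for t in stream:
        for out, cond in zip(outputs, conditions):
            if cond(t):
                out.append(t)
    return outputs


def union_op(*streams: Iterable[Tuple_]) -> Iterator[Tuple_]:
    """Union: merge several input streams into one output stream.

    Simplification: finite inputs are concatenated in argument order.
    """
    for s in streams:
        yield from s
```

None of these functions keeps information from one tuple to the next, which is what makes the operators stateless.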
Regarding the stateful operators, two window-
oriented operators are available: 1) Aggregate, which
groups all tuples that fall in a time or size window
and applies the defined functions over the fields of
those tuples. Moreover, tuples can be grouped into
different windows if the group-by parameter is
specified. 2) Join, which correlates tuples from two
different input streams. Two time windows are
created, one per input stream, and when the windows
slide, tuples are joined into one output tuple
according to a specified predicate.
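As an illustration of a window-oriented operator, the following sketch implements a simplified Aggregate in Python. It rests on assumptions that are not taken from UPM-CEP: the window is size-based and tumbling (it is emptied after each result), one window is kept per group-by value, and the name aggregate_op and its parameters are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Tuple_ = Dict[str, object]


def aggregate_op(stream: Iterable[Tuple_],
                 size: int,
                 func: Callable[[List[object]], object],
                 field: str,
                 group_by: Optional[str] = None) -> Iterator[Tuple_]:
    """Apply `func` to `field` over every full window of `size` tuples.

    Assumption: tumbling, size-based windows. When `group_by` is given,
    each distinct value of that field gets its own window.
    """
    windows: Dict[object, List[object]] = defaultdict(list)
    for t in stream:
        key = t[group_by] if group_by is not None else None
        window = windows[key]
        window.append(t[field])
        if len(window) == size:
            # The window is full: emit one result and start a new window.
            yield {"group": key, "value": func(window)}
            window.clear()
```

Unlike the stateless operators, the result depends on the tuples buffered in `windows`, so every tuple of a given group has to reach the same operator instance.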
UPM-CEP partitions queries into subqueries so
that each subquery executes in a different node.
Figure 4 shows
how the previous query is split into four subqueries
(SQ1, SQ2, SQ3 and SQ4). The number of
subqueries of a given query is defined by the number
of stateful operators. All consecutive stateless
operators are grouped together in a subquery till a
stateful operator is reached. That stateful operator is
the first operator of the next subquery. This way of
partitioning queries has proven to be efficient in
distributed scenarios (Gulisano, 2010). We have
applied the same design principles to UPM-CEP.
Although it is not a distributed setup, the same
principles apply: in this case they minimize the
communication across NUMA nodes while keeping the
same semantics a centralized system would provide.
Figure 4: Query partitioning.
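The grouping rule described above can be sketched in a few lines of Python. The sketch assumes a linear chain of operators (the query in Figure 3 has two input streams, which this simplification does not model), and the operator names and the partition function are hypothetical.

```python
from typing import List, Tuple


def partition(operators: List[Tuple[str, bool]]) -> List[List[str]]:
    """Split a linear chain of (name, is_stateful) operators into subqueries.

    Consecutive stateless operators are grouped together; each stateful
    operator closes the current subquery and becomes the first operator
    of the next one.
    """
    subqueries: List[List[str]] = []
    current: List[str] = []
    for name, is_stateful in operators:
        if is_stateful and current:
            subqueries.append(current)
            current = []
        current.append(name)
    if current:
        subqueries.append(current)
    return subqueries


# Hypothetical chain: two stateless operators, an Aggregate, a Map, a Join.
chain = [("map1", False), ("filter1", False), ("aggregate1", True),
         ("map2", False), ("join1", True)]
print(partition(chain))
```

In this hypothetical chain the two leading stateless operators form the first subquery, and each stateful operator starts a new one.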
Subqueries can be parallelized in order to increase
the throughput. Each instance of a subquery can run
on a different core in the same node. Figure 5 shows
how subqueries of the previous example could be
parallelized. There are 3 instances of SQ1, one
instance of SQ2, two instances of SQ3 and three
instances of SQ4.
Figure 5: Query parallelization.
The main challenge in query-parallelization is to
guarantee that the output of a parallel execution is the
same as a centralized one. If we consider a sub-query
made of only one operator, this challenge means that
the output of a parallel operator must be the same as
that of a centralized operator. Window-oriented
operators, in particular, require that all tuples that
have to be aggregated/correlated together are
processed by the same CEP instance. For example, if
an Aggregate operator computing the total monthly
operations of the bank accounts for each client is
parallelized over three CEP instances, all tuples
belonging to the same user account must be processed
by the same CEP instance in order to produce the
correct result.
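One common way to meet this requirement is to route tuples by a hash of the grouping key, as in the Python sketch below. This is an illustrative assumption and not necessarily the mechanism used by UPM-CEP; the names route and partition_stream and the field names are hypothetical. A stable hash (CRC32) is used so that the same key always maps to the same instance.

```python
import zlib
from typing import Dict, Iterable, List

Tuple_ = Dict[str, object]


def route(key: str, num_instances: int) -> int:
    """Return the index of the CEP instance that must process `key`."""
    return zlib.crc32(key.encode("utf-8")) % num_instances


def partition_stream(stream: Iterable[Tuple_], key_field: str,
                     num_instances: int) -> List[List[Tuple_]]:
    """Distribute tuples so that equal keys always reach the same instance."""
    instances: List[List[Tuple_]] = [[] for _ in range(num_instances)]
    for t in stream:
        instances[route(str(t[key_field]), num_instances)].append(t)
    return instances


# Hypothetical bank operations routed over three CEP instances.
operations = [{"account": a, "amount": i}
              for i, a in enumerate(["A", "B", "C", "A", "B", "A"])]
parts = partition_stream(operations, "account", 3)
```

Because the target instance depends only on the account, each parallel Aggregate instance sees every tuple of the accounts assigned to it and computes the same totals a centralized operator would.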
To guarantee the equivalence between centralized
and parallel queries, particular attention must be
given to the communications among sub-queries.
Consider the scenario depicted in Figure 6 where there
are two sub-queries, Sub-query 1 and Sub-query 2,
with a parallelization degree of two and three,
respectively. If Sub-query 2 does not contain any
window-oriented operator, the CEP instances of
Sub-query 1 can arbitrarily decide to which CEP
instance of Sub-query 2 they send their output tuples.
Output tuples of
Sub-query 1 are assigned to buckets. This assignment
ADITCA 2019 - Special Session on Appliances for Data-Intensive and Time Critical Applications