DESIGN AND IMPLEMENTATION OF DATA STREAM
PROCESSING APPLICATIONS
Edwin Kwan, Janusz R. Getta
School of Computer Science and Software Engineering, University of Wollongong, Wollongong, Australia
Ehsan Vossough
Department of Computing and Mathematics, University of Western Sydney, Campbelltown, Australia
Keywords:
Data stream, data stream management system, data stream application, processing data streams.
Abstract:
Processing of data streams requires the continuous execution of end-user applications over long and
steadily growing sequences of data items. This work considers the design and implementation of data stream
processing applications in domains where limited computational resources, constraints imposed on the
implementation techniques, and specific properties of the applications exclude the use of a general purpose
data stream management system. The implementation techniques described in the paper include the
representation of atomic applications as sequences of operations in an XML-based language and the
translation of the XML specifications into programs in an object-oriented programming language.
1 INTRODUCTION
Technological advances in small and energy-efficient electronic sensing devices allow for the
collection and real-time processing of long sequences of data items commonly known as data
streams. A data stream is a theoretically unlimited and continuously expanding sequence of
homogeneous data items (Babcock et al., 2002). Many such sequences are obtained from periodic
measurements of the parameters of physical processes, for instance the values of temperature,
humidity, and air pressure, or even series of radio signals received from outer space. Processing
and managing high frequency data streams is beyond the performance capabilities of present
commercial Database Management Systems (DBMS). It is commonly agreed that a new class of systems,
called Data Stream Management Systems (DSMS) (Motwani et al., 2003), is needed to reach the
performance levels required by data stream processing applications.
The typical application domains where it is hard to
apply the general purpose DSMS include processing
of data streams in the embedded systems and in the
wireless sensor networks.
A methodology for the design and implementation of data stream processing applications in an
object-oriented programming language is still an open problem. Applications expressed in a high
level query and data manipulation language must be translated into programs in a lower level
implementation language, and these programs must then be optimized as well. There is no well
established and commonly accepted language suitable for the programming and optimization of data
stream processing applications at the implementation level.
An interesting question is how to formally express the functionality of a data stream processing
application at the lower levels of abstraction and how to optimize such an application. As data
streams are theoretically unlimited sequences of data items, processing complete streams at any
moment in time is practically impossible. To avoid this problem we assume that only subsets of
data streams, called windows, are processed by an application. The reactivity principle requires
the processing of all windows to be performed whenever the contents of at least one window have
changed. If only approximate results are expected, then only some of the data items from a window
are processed and processing is performed at every n-th modification.
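The window and reactivity ideas above can be sketched in C++ as follows. This is only a minimal illustration under our own assumptions: the class name `SlidingWindow`, the callback interface, and the eviction of the oldest item are hypothetical and are not taken from the formal model. Setting `every_nth` to 1 gives full reactivity, while larger values recompute only at every n-th modification. Sampling of items inside a window is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical fixed-size window over a stream of items of type T.
template <typename T>
class SlidingWindow {
public:
    typedef std::function<void(const std::deque<T>&)> Callback;

    // capacity:  maximum number of items kept in the window
    // every_nth: recompute after every n-th modification (1 = always)
    SlidingWindow(std::size_t capacity, std::size_t every_nth, Callback on_change)
        : capacity_(capacity), every_nth_(every_nth),
          on_change_(std::move(on_change)), modifications_(0) {
        assert(capacity_ > 0 && every_nth_ > 0);
    }

    // Appends a new item, evicting the oldest one when the window is full,
    // and triggers the application at every n-th modification.
    void push(const T& item) {
        if (items_.size() == capacity_) items_.pop_front();
        items_.push_back(item);
        if (++modifications_ % every_nth_ == 0) on_change_(items_);
    }

    const std::deque<T>& items() const { return items_; }

private:
    std::size_t capacity_;
    std::size_t every_nth_;
    Callback on_change_;
    std::deque<T> items_;
    std::size_t modifications_;
};
```

With a window of three items and `every_nth` equal to 2, pushing five items invokes the callback twice, after the second and the fourth insertion.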
This work is based on a formal model of data stream processing presented in (Getta and Vossough,
2004). If an application processes $n$ data streams $x_1, \ldots, x_n$ then we represent it as an
$n$-ary operation $f(x_1, \ldots, x_n)$. To enforce the reactivity of an application we have to
implement $n$ operations $f_1, \ldots, f_n$, each one on a data stream and $n - 1$ windows. The
operations are implemented as the expressions $e_1, \ldots, e_n$ built over the binary operations
on the elements of the processed data stream $x_i$ and the windows on the streams
$w_{x_1}, \ldots, w_{x_n}$.
Kwan E., Getta J. R. and Vossough E. (2007). DESIGN AND IMPLEMENTATION OF DATA STREAM PROCESSING APPLICATIONS.
In Proceedings of the Second International Conference on Software and Data Technologies - Volume ISDM/WsEHST/DC, pages 193-196
DOI: 10.5220/0001326901930196
Copyright © SciTePress
Design and implementation of data stream processing applications in a way consistent with the
formal model described above is suitable for environments where data stream processing software
has to be merged with software implemented in general purpose programming languages.
The paper is organized in the following way. We start with a review of previous work in the area
of data stream processing systems. Section 3 presents the basic concepts of the data stream model
used in the paper. It is followed by a presentation of the operational model of data stream
processing. Section 4 and Section 5 overview the implementation aspects and the experiments
conducted so far. Finally, Section 6 summarizes and concludes the paper.
2 PREVIOUS WORKS
The design and implementation of scalable and distributed data stream processing systems has
attracted a lot of attention in recent years. As many of the fundamental assumptions behind
traditional DBMS no longer hold for data stream processing systems (Babcock et al., 2002), the
implementation of prototype systems presented a significant challenge.
The STREAM system (Motwani et al., 2003) is a general purpose DSMS that supports a declarative
query language and is able to process many continuous queries on data streams with high
frequencies of input data. The system also supports approximate query answering when processing
queries over data streams with very high frequencies.
TelegraphCQ is a dataflow system for the processing of continuous queries over data streams
(Avnur and Hellerstein, 2002). It uses an adaptive query engine based on the concept of an Eddy,
invented earlier for adaptive query processing in relational DBMS.
The Aurora system (Abadi et al., 2003) is designed to process a large number of asynchronous,
push-based data streams. Aurora builds continuous queries out of a small set of well defined
operators that implement standard filtering, mapping, aggregate, and window join operations. The
development of Aurora has since been superseded by the Borealis project. Borealis is a
distributed data stream processing engine that inherits its core functionality from Aurora and
its inter-node communication from the Medusa system (Zdonik et al., 2003).
Figure 1: The compositions of elementary operations (operations max, mwrite, find, and swrite over the data containers $w_{max}$ and $w_{sort}$).
Gigascope is a data stream processing system for
network applications including traffic analysis, intru-
sion detection, router configuration analysis, network
monitoring, and performance monitoring and debug-
ging (Cranor et al., 2003).
3 BASIC CONCEPTS
A data stream is a theoretically unlimited sequence of homogeneous data items, either elementary
or composite. The type of the individual data items determines the special types of data streams:
for instance, a relational data stream is a stream whose data items are tuples of elementary
values, an XML data stream is a stream whose data items are XML documents, and so on.
A system of elementary operations on data items is derived from the formal model of data stream
processing and optimization proposed in (Getta and Vossough, 2004). An elementary operation
always processes one input data stream and at most one data container, and outputs data items to
zero or more output data streams.
There are two types of data containers: fixed size containers, also called windows on data
streams, and variable size containers used to keep the intermediate results of stream processing.
Complex operations on a data stream are implemented through the composition of elementary
operations. For instance, finding the current largest value in a sequence of data items can be
implemented as the composition of a read operation max and a write operation mwrite, see Figure 1.
In another example, an operation find compares an item taken from a stream with the $n$ largest,
sorted items stored in a data container $w_{sort}$ and finds all items smaller than the new one,
see Figure 1. If at least one item is found, a sequence of items that should be written to
$w_{sort}$ to keep it sorted is passed to an operation swrite.
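The two compositions described above can be sketched in C++ as follows. The function names mirror the operations max, mwrite, find, and swrite, but their signatures, the `MaxContainer` and `SortedUpdate` types, and the descending order of $w_{sort}$ are our assumptions made for illustration only. The sketch also assumes that $w_{sort}$ already holds $n$ items.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical container for the current maximum (empty until the first item).
struct MaxContainer {
    bool empty;
    int value;
    MaxContainer() : empty(true), value(0) {}
};

// max: read operation; true when the item exceeds the stored maximum.
bool max_op(int item, const MaxContainer& w_max) {
    return w_max.empty || item > w_max.value;
}

// mwrite: write operation; records the item as the new maximum.
void mwrite_op(int item, MaxContainer& w_max) {
    w_max.empty = false;
    w_max.value = item;
}

// Composition of max and mwrite.
void process_max(int item, MaxContainer& w_max) {
    if (max_op(item, w_max)) mwrite_op(item, w_max);
}

// Sequence of items to be written to w_sort starting at index `position`.
struct SortedUpdate {
    std::size_t position;
    std::vector<int> items;
};

// find: w_sort holds the n largest items in descending order. Returns true
// when at least one stored item is smaller than `item`, and fills `update`
// with the sequence that keeps w_sort sorted (the smallest item drops out).
bool find_op(int item, const std::vector<int>& w_sort, SortedUpdate& update) {
    std::vector<int>::const_iterator pos =
        std::upper_bound(w_sort.begin(), w_sort.end(), item, std::greater<int>());
    if (pos == w_sort.end()) return false;
    update.position = static_cast<std::size_t>(pos - w_sort.begin());
    update.items.assign(1, item);
    update.items.insert(update.items.end(), pos, w_sort.end() - 1);
    return true;
}

// swrite: overwrites the tail of w_sort with the sequence produced by find.
void swrite_op(const SortedUpdate& update, std::vector<int>& w_sort) {
    assert(update.position + update.items.size() <= w_sort.size());
    std::copy(update.items.begin(), update.items.end(),
              w_sort.begin() + static_cast<std::ptrdiff_t>(update.position));
}

// Composition of find and swrite.
void process_top_n(int item, std::vector<int>& w_sort) {
    SortedUpdate update;
    if (find_op(item, w_sort, update)) swrite_op(update, w_sort);
}
```

In both compositions the write operation is reached only when the read operation produces something to write, so items that do not change the container cost a single comparison or a single binary search.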
In the general case, the implementation of an $n$-argument operation is performed by the
decomposition of $f(w_1, \ldots, w_{i-1}, x_i, w_{i+1}, \ldots, w_n)$ into $n - 1$ binary
operations.
ICSOFT 2007 - International Conference on Software and Data Technologies
Figure 2: A data stream processing network implementing $(x * w_y) + (x * w_z)$.
For example, an operation $f(x, w_y, w_z)$ that processes a data stream $x$ and fixed size
windows $w_y$, $w_z$ on the streams $y$ and $z$ can be implemented as $f_2(f_1(x, w_y), w_z)$ and
represented as a path $p: f_1(w_y), f_2(w_z), \varepsilon$. A data stream $x$ is piped into a
path $p$ in order to process the data items. The last symbol in a path identifies the next path
to be used for the processing. A special symbol $\varepsilon$ denotes the end of processing.
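A path of this kind can be sketched in C++ as a sequence of steps, each binding a binary operation to its data container. The types `Item`, `Window`, `Step`, and `Path` are hypothetical names chosen for this sketch. We assume integer data items, and we represent $\varepsilon$ by a null pointer to the next path.

```cpp
#include <cassert>
#include <functional>
#include <vector>

typedef int Item;                  // assumed item type for this sketch
typedef std::vector<Item> Window;  // assumed container representation

// One step of a path: a binary operation bound to its data container.
// The operation may modify the item in place; it returns false when
// nothing is passed on to the next step.
struct Step {
    std::function<bool(Item&, Window&)> op;
    Window* container;
};

struct Path {
    std::vector<Step> steps;
    const Path* next;  // path named by the last symbol; nullptr stands for epsilon

    Path() : next(nullptr) {}

    // Pipes one item through all steps in order, then into the next path.
    bool process(Item& item) const {
        for (const Step& s : steps) {
            assert(s.container != nullptr);
            if (!s.op(item, *s.container)) return false;
        }
        return next == nullptr ? true : next->process(item);
    }
};
```

The path $p: f_1(w_y), f_2(w_z), \varepsilon$ then corresponds to a `Path` with two steps and `next` left as a null pointer, where the callables standing for $f_1$ and $f_2$ are supplied by the application.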
As a simple example consider an operation $f(x, w_y, w_z) = (x * w_y) + (x * w_z)$ that processes
a data stream $x$ against the fixed size windows $w_y$ and $w_z$. The paths:
$p: \mathrm{copy}()(1{:}p_1, 2{:}p_2)$
$p_1: *(w_y), \mathrm{write}(w_{xy}), +(w_{zx}), \mathrm{write}(\mathit{result}), \varepsilon$
$p_2: *(w_z), \mathrm{write}(w_{zx}), +(w_{xy}), \mathrm{write}(\mathit{result}), \varepsilon$
implementing $f(x, w_y, w_z)$ are visualized in Figure 2.
A module is a set of paths encapsulated as a complex operation
$m(p_1, \ldots, p_m, d_1, \ldots, d_n)\ 1{:}\varepsilon, \ldots, p{:}\varepsilon$ where
$p_1, \ldots, p_m$ are the path parameters, $d_1, \ldots, d_n$ are the data container parameters,
and $1{:}\varepsilon, \ldots, p{:}\varepsilon$ are the outputs.
Processing many data streams requires an individual implementation of paths for each of the
streams involved in an application. A data stream processing network is a set of path
expressions together with the data streams "piped" into the paths.
4 DESIGN OF APPLICATIONS
In an operational model of data stream processing, an application acting on the streams
$x_1, \ldots, x_n$ is represented as an $n$-argument operation $f(x_1, \ldots, x_n)$. Due to the
reactivity principle, an application should be able to recompute the operation after a new data
item $\delta_i$ is appended to any one of the streams. Therefore, an application programmer must
provide the implementations of $n$ operations
$f_1(w_{x_1}\delta_1, w_{x_2}, \ldots, w_{x_n}), \ldots, f_n(w_{x_1}, \ldots, w_{x_n}\delta_n)$
where $w_{x_i}\delta_i$ denotes the contents of a window $w_{x_i}$ after the insertion of a data
item $\delta_i$.
In order to speed up the evaluation of the operations, all computations on the windows that have
not been changed since the previous evaluation are taken from the earlier recorded temporary
results, called materializations $d_1, \ldots, d_m$, see Figure 3. Hence, the implementation of
$f_i(w_1, \ldots, w_i\delta_i, \ldots, w_n)$ is performed through the transformation of the
$n$-argument operation into an expression $e_i(d_1, \ldots, d_m, w_i\delta_i)$ over the binary
operations $\alpha_1, \ldots, \alpha_m$, the window $w_i\delta_i$, and the materializations
$d_1, \ldots, d_m$.
Figure 3: Implementation of $f_i(w_1, \ldots, w_i\delta_i, \ldots, w_n)$.
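The role of materializations can be illustrated with a small C++ sketch. The application below is our own example and does not come from the formal model: it computes the sum of the maxima of two windows, and the class name `MaxSumApp` and its members are hypothetical. When an item arrives on one stream, only the window of that stream is re-evaluated, and the contribution of the unchanged window is read from its materialization.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical two-stream application f(w_x, w_y) = max(w_x) + max(w_y).
class MaxSumApp {
public:
    explicit MaxSumApp(std::size_t window_size) : size_(window_size) {
        assert(size_ > 0);
    }

    // f_1: a new item arrived on stream x. Window w_y is unchanged, so its
    // contribution is taken from the materialization d_y_.
    int on_x(int delta) {
        d_x_ = insert_and_max(w_x_, delta);
        return d_x_ + d_y_;
    }

    // f_2: a new item arrived on stream y; d_x_ is reused symmetrically.
    int on_y(int delta) {
        d_y_ = insert_and_max(w_y_, delta);
        return d_x_ + d_y_;
    }

private:
    // Inserts delta into a fixed-size window and returns the new maximum.
    int insert_and_max(std::deque<int>& w, int delta) {
        if (w.size() == size_) w.pop_front();
        w.push_back(delta);
        return *std::max_element(w.begin(), w.end());
    }

    std::size_t size_;
    std::deque<int> w_x_, w_y_;
    int d_x_ = 0;  // materialization of max(w_x); 0 for an empty window here
    int d_y_ = 0;  // materialization of max(w_y); 0 for an empty window here
};
```

Each operation writes the materialization it contributes to, so the other operation can read it without touching the unchanged window.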
A sequence of binary operations is transformed into a path
$p_i: \alpha_1(d_1), \ldots, \alpha_m(d_m)$ where $d_1$ is a window on a data stream, see
Figure 3. A transformation of an expression $e_i$ into a path is performed in the following way.
We start from an operation $\alpha_1(w_i\delta_i, d_1)$ and we construct a path
$p_i: \alpha_1(d_1)$. Next, we consider an operation
$\alpha_2(\alpha_1(w_i\delta_i, d_1), d_2)$ and we extend the path $p_i$ to get
$p_i: \alpha_1(d_1), \alpha_2(d_2)$. We repeat this process until the operation $\alpha_m$ at the
root of the expression $e_i$ is processed. Finally, we append $\mathrm{write}(w_{out})$ to the
path $p_i$. We repeat this process for all expressions $e_i$, $i = 1, \ldots, n$. Next, if $d_j$
is a materialization of intermediate results then we insert an operation $\mathrm{write}(d_j)$
into all path expressions that contribute to the contents of $d_j$. At the end, we add an
operation $\mathrm{write}(w_i)$ at the beginning of all paths $p_i$ whose inputs are directly
taken from the data streams.
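The transformation of a single expression $e_i$ into a path can be sketched symbolically in C++. The `Expr` structure, its field names, and the textual form of the resulting path are our assumptions. In particular, the field `materialized_as` is a hypothetical way of marking that the result of an operation contributes to a materialization $d_j$.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Node of a left-deep expression alpha_k(inner, d_k). For the innermost
// operation alpha_1 the left argument is the updated window, so inner is null.
struct Expr {
    std::string op;               // e.g. "alpha1"
    std::string container;        // e.g. "d1"
    std::string materialized_as;  // container fed by this result, or empty
    std::shared_ptr<Expr> inner;  // nested operation, or null for alpha_1
};

// Builds the path for the expression rooted at `root`.
std::vector<std::string> to_path(const Expr& root,
                                 const std::string& input_window,
                                 const std::string& output_window) {
    assert(!root.op.empty());

    // Collect the operations from the root down to the innermost one.
    std::vector<const Expr*> chain;
    for (const Expr* e = &root; e != nullptr; e = e->inner.get())
        chain.push_back(e);

    std::vector<std::string> path;
    // The new item is first recorded in the window of its own stream.
    path.push_back("write(" + input_window + ")");
    // Operations are emitted innermost first: alpha_1(d_1), alpha_2(d_2), ...
    for (std::vector<const Expr*>::const_reverse_iterator it = chain.rbegin();
         it != chain.rend(); ++it) {
        path.push_back((*it)->op + "(" + (*it)->container + ")");
        if (!(*it)->materialized_as.empty())
            path.push_back("write(" + (*it)->materialized_as + ")");
    }
    // The final result is written to the output.
    path.push_back("write(" + output_window + ")");
    return path;
}
```

Repeating the call for each expression $e_i$, $i = 1, \ldots, n$, yields the full set of paths of an application.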
5 IMPLEMENTATION OF
APPLICATIONS
An implementation stage that follows an application
design includes the preparation of formal specifica-
tion and optimization of paths, generation of imple-
mentation code, and implementation of the opera-
tions. XML is chosen as a language for formal speci-
fication of paths.
An XML document that describes a data stream processing application consists of PORT, DATA-TYPE,
WINDOWS, and PATH elements. An element PORT includes information about the sources from which the
data items are collected and it is described by the attributes IP, TYPE, and TO-PATH.
An element DATA-TYPE contains information about the structures of data items handled by an
application. Its subelements and attributes are modeled in a way similar to C++ class structures.
An element WINDOWS contains information about the different types of windows used by an
application.
An element PATH represents the paths a data stream processing application consists of. It is
described by the attributes NAME and TYPE where NAME identifies a path and TYPE is the type of
data item processed by the path. The subelements of PATH include repetitions of the elements
OPERATION and OUTPUT. An element OPERATION, which represents an operation included in a path,
has its own subelements including COMMENTS, GET, STORE, and BRANCH.
The C++ code generated from an XML specification implements the entire application except the
elementary operations, which have to be provided separately by the application programmers.
Every data item handled by a data stream processing application has its type declared in the
application. Data types are represented by C++ classes, and the variables that are instances of
a particular data type are stored as either private or public member variables.
All paths described in the XML document are represented as sequences of operations that form
segments within a function. The code generated from XML uses sockets for connecting and
listening to different ports.
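The general shape of such generated code might look like the following C++ sketch. It does not show the actual output of the generator: the data type `Reading`, the window size, the operations `above_average` and `write_window`, and the function `path_p1` are all hypothetical, and the socket handling is omitted so that the fragment stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Data type of the items, as it might be declared from a DATA-TYPE element.
class Reading {
public:
    int sensor_id;
    double value;
};

// Fixed-size window, as it might be declared from a WINDOWS element.
const std::size_t kWindowSize = 4;
typedef std::deque<Reading> ReadingWindow;

// Elementary operations: in the approach described above these are supplied
// by the application programmer, not generated.
bool above_average(const Reading& item, const ReadingWindow& w) {
    if (w.empty()) return true;
    double sum = 0.0;
    for (const Reading& r : w) sum += r.value;
    return item.value > sum / static_cast<double>(w.size());
}

void write_window(const Reading& item, ReadingWindow& w) {
    assert(w.size() <= kWindowSize);
    if (w.size() == kWindowSize) w.pop_front();
    w.push_back(item);
}

// A path as one function; each operation of the path is one segment.
bool path_p1(const Reading& item, ReadingWindow& w, Reading& output) {
    // Segment 1: compare the item with the current window contents.
    const bool pass = above_average(item, w);
    // Segment 2: store the item in the window.
    write_window(item, w);
    // Output: forward the item only when it passed the comparison.
    if (pass) output = item;
    return pass;
}
```

Keeping the elementary operations as ordinary functions means that the generated part only fixes the order of the segments and the containers passed to them.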
6 SUMMARY AND FUTURE
WORK
This work considers the design and implementation of data stream processing applications in
environments where limited computational resources or specific requirements imposed on the
applications make the utilization of a complex DSMS impractical. In our approach an application
processing $n$ input data streams is represented as an $n$-ary operation. We show how to
decompose such operations into expressions built of binary operations, materializations, and
input data items, and we then describe the translation of the expressions into sets of data
stream processing paths. In our model one data stream is directed for processing to one path and
a set of paths represents the entire application. The paths are formally described in an
XML-based language and implemented through automatic translation into C++ code.
The following are possible directions for future extensions of our approach to data stream
processing. An interesting idea is to distribute the computations over many processing units. A
closely related problem is the distribution of the processing in sensor networks. Another
problem is related to the simultaneous processing of more than one data stream. In such a case
the synchronization of the flows of data items along the processing paths needs to be thoroughly
addressed.
REFERENCES
Abadi, D., Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Erwin, C., Galvez, E.,
Hauton, M., Maskey, A., Rasin, A., Singer, A., Stonebraker, M., Tatbul, N., Xing, Y., Yan, R.,
and Zdonik, S. (2003). Aurora: A data stream management system. In Proceedings of the 2003 ACM
SIGMOD International Conference on Management of Data, pages 663–663.
Avnur, R. and Hellerstein, J. (2002). Continuously adaptive
continuous queries over streams. In Proceedings of
the 2002 ACM SIGMOD International Conference on
Management of Data, pages 49–60.
Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom,
J. (2002). Models and issues in data stream sys-
tems. In Popa, L., editor, Proceedings of the Twenty-
first ACM SIGACT-SIGMOD-SIGART Symposium on
Principles of Database Systems, pages 1–16. ACM
Press.
Cranor, C., Johnson, T., Spatscheck, O., and Shkapenyuk, V. (2003). Gigascope: A stream database
for network applications. In Proceedings of the 2003 ACM SIGMOD International Conference on
Management of Data, pages 644–648.
Getta, J. R. and Vossough, E. (2004). Optimization of data
stream processing. SIGMOD Record, 33(3):34–39.
Motwani, R., Widom, J., Arasu, A., Babcock, B., Babu, S., Datar, M., Manku, G., Olston, C.,
Rosenstein, J., and Varma, R. (2003). Query processing, resource management, and approximation
in a data stream management system. In Proceedings of the First Biennial Conference on
Innovative Data Systems Research, pages 245–256.
Zdonik, S., Stonebraker, M., Cherniack, M., Cetintemel, U., Balazinska, M., and Balakrishnan, H.
(2003). The Aurora and Medusa projects. Bulletin of the Technical Committee on Data Engineering,
pages 3–10.