DESIGN AND IMPLEMENTATION OF DATA STREAM PROCESSING APPLICATIONS

Edwin Kwan, Janusz R. Getta

School of Computer Science and Software Engineering, University of Wollongong, Wollongong, Australia

Ehsan Vossough

Department of Computing and Mathematics, University of Western Sydney, Campbelltown, Australia

Keywords:

Data stream, data stream management system, data stream application, processing data streams.

Abstract:

Processing of data streams requires the continuous execution of end-user applications over long and steadily growing sequences of data items. This work considers the design and implementation of data stream processing applications in domains where limited computational resources, constraints imposed on the implementation techniques, and specific properties of the applications exclude the use of a general purpose data stream management system. The implementation techniques described in the paper include the representation of atomic applications as sequences of operations in an XML based language and the translation of the XML specifications into programs in an object-oriented programming language.

1 INTRODUCTION

Technological advances in small and energy efficient electronic sensing devices allow for the collection and real time processing of long sequences of data items commonly known as data streams. A data stream is a theoretically unlimited and continuously expanding sequence of homogeneous data items (Babcock et al., 2002). Many such sequences are obtained from periodic measurements of the parameters of physical processes, for instance the values of temperature, humidity, and air pressure, or even series of radio signals received from outer space. Processing and managing high frequency data streams is beyond the performance capabilities of present commercial Database Management Systems (DBMS). It is commonly agreed that a new class of systems, called Data Stream Management Systems (DSMS) (Motwani et al., 2003), is needed to reach the performance levels required by data stream processing applications.

The typical application domains where it is hard to

apply the general purpose DSMS include processing

of data streams in the embedded systems and in the

wireless sensor networks.

A methodology for the design and implementation of data stream processing applications in object-oriented programming languages is still an open problem. Applications expressed in a high level query and data manipulation language must be translated into programs in a lower level implementation language, and these programs must then be optimized as well. There is no well established and commonly accepted language suitable for programming and optimization of data stream processing applications at the implementation level.

An interesting question is how to formally express the functionality of a data stream processing application at the lower levels of abstraction and how to optimize such an application. As data streams are theoretically unlimited sequences of data items, processing complete streams at any moment in time is practically impossible. To avoid this problem we assume that only subsets of data streams, called windows, are processed by an application. The reactivity principle requires all windows to be processed whenever the contents of at least one window have changed. If only approximate results are expected then only some of the data items from a window are processed and the processing is performed at every n-th modification.
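The window and reactivity ideas above can be illustrated with a minimal C++ sketch. It assumes a fixed size window over a stream of doubles and a callback that stands for the application; the names Window, append and recompute are hypothetical and are not taken from the formal model. Only the "every n-th modification" part of approximate processing is shown; sampling of items inside a window is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical fixed size window over a stream of doubles.
class Window {
public:
    Window(std::size_t capacity, std::size_t every_nth,
           std::function<void(const std::deque<double>&)> recompute)
        : capacity_(capacity), every_nth_(every_nth),
          recompute_(std::move(recompute)) {
        assert(capacity_ > 0 && every_nth_ > 0);
    }

    // Append one data item, evict the oldest item when the window is full,
    // and trigger the application on every every_nth-th modification.
    void append(double item) {
        if (items_.size() == capacity_) items_.pop_front();
        items_.push_back(item);
        if (++modifications_ % every_nth_ == 0) recompute_(items_);
    }

    const std::deque<double>& items() const { return items_; }

private:
    std::size_t capacity_;
    std::size_t every_nth_;
    std::size_t modifications_ = 0;
    std::deque<double> items_;
    std::function<void(const std::deque<double>&)> recompute_;
};
```

With every_nth set to 1 the sketch follows the strict reactivity principle; larger values trade freshness of the result for fewer recomputations.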

This work is based on a formal model of data stream processing presented in (Getta and Vossough, 2004). If an application processes n data streams $x_1, \ldots, x_n$ then we represent it as an n-ary operation $f(x_1, \ldots, x_n)$. To enforce the reactivity of an application we have to implement n operations $f_1, \ldots, f_n$, each one on a data stream and n − 1 windows. The operations are implemented as the expressions $e_1, \ldots, e_n$ built over the binary operations on the elements of the processed data stream $x_i$ and the windows $w_1, \ldots, w_n$ on the streams.

Kwan E., R. Getta J. and Vossough E. (2007). DESIGN AND IMPLEMENTATION OF DATA STREAM PROCESSING APPLICATIONS. In Proceedings of the Second International Conference on Software and Data Technologies - Volume ISDM/WsEHST/DC, pages 193-196. DOI: 10.5220/0001326901930196. SciTePress.

Design and implementation of data stream processing applications in a way consistent with the formal model described above is suitable for environments where data stream processing software has to be merged with software implemented in general purpose programming languages.

The paper is organized in the following way. We start with a review of previous work in the area of data stream processing systems. Section 3 presents the basic concepts of the data stream model used in the paper. It is followed by a presentation of the operational model of data stream processing in Section 4. Section 5 overviews the implementation aspects and the experiments conducted so far. Finally, Section 6 summarizes and concludes the paper.

2 PREVIOUS WORKS

Design and implementation of scalable and distributed data stream processing systems has attracted a lot of attention in recent years. As many of the fundamental assumptions behind traditional DBMS no longer hold for data stream processing systems (Babcock et al., 2002), implementation of the prototype systems has presented a significant challenge.

STREAM system (Motwani et al., 2003) is a gen-

eral purpose DSMS that supports a declarative query

language and it is able to process many continuous

queries on the data streams with high frequencies of

input data. The system also supports the approximate

query answering when processing the queries over the

data streams with very high frequencies.

TelegraphCQ is a dataflow system for the processing of continuous queries over data streams (Avnur and Hellerstein, 2002). It uses an adaptive query engine based on the concept of an Eddy, invented earlier for adaptive query processing in relational DBMS.

The Aurora system (Abadi et al., 2003) is designed to process a large number of asynchronous, push based data streams. Aurora builds continuous queries out of a small set of well defined operators that implement standard filtering, mapping, aggregation, and window join operations. The development of Aurora has since been superseded by the Borealis project. Borealis is a distributed data stream processing engine that inherits its core functionality from Aurora and its inter-node communication from the Medusa system (Zdonik et al., 2003).

Figure 1: The compositions of elementary operations.

Gigascope is a data stream processing system for

network applications including trafﬁc analysis, intru-

sion detection, router conﬁguration analysis, network

monitoring, and performance monitoring and debug-

ging (Cranor et al., 2003).

3 BASIC CONCEPTS

A data stream is a theoretically unlimited sequence of homogeneous data items, each either elementary or composite. The type of the individual data items determines the special types of data streams: for instance, a relational data stream is a stream whose data items are tuples of elementary values, an XML data stream is a stream whose data items are XML documents, and so on.

A system of elementary operations on data items is derived from the formal model of data stream processing and optimization proposed in (Getta and Vossough, 2004). An elementary operation always processes one input data stream and at most one data container, and outputs data items to zero or more output data streams.
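The shape of an elementary operation described above (one input stream, at most one data container, zero or more output streams) can be sketched as a small C++ interface. This is only an illustration under stated assumptions: data items are doubles, an output stream is modelled as a callback, and the names ElementaryOperation, Output and GreaterThanAll are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

using Item = double;
using Container = std::vector<Item>;
// An output stream is modelled here simply as a callback receiving items.
using Output = std::function<void(Item)>;

// Hypothetical elementary operation: consumes one input item at a time,
// may consult at most one container (nullptr if none), and emits the
// produced item to zero or more outputs.
class ElementaryOperation {
public:
    ElementaryOperation(Container* container, std::vector<Output> outputs)
        : container_(container), outputs_(std::move(outputs)) {}
    virtual ~ElementaryOperation() = default;

    void process(Item in) {
        Item out = in;
        if (apply(in, container_, out)) {
            for (const Output& o : outputs_) o(out);
        }
    }

protected:
    // Returns true and sets `out` when an item should be emitted.
    virtual bool apply(Item in, Container* container, Item& out) = 0;

private:
    Container* container_;
    std::vector<Output> outputs_;
};

// Example operation: pass the item on only if it exceeds every stored item.
class GreaterThanAll : public ElementaryOperation {
public:
    using ElementaryOperation::ElementaryOperation;

protected:
    bool apply(Item in, Container* container, Item& out) override {
        assert(container != nullptr);
        for (Item v : *container) {
            if (in <= v) return false;
        }
        out = in;
        return true;
    }
};
```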

There are two types of data containers: fixed size containers, also called windows on data streams, and variable size containers used to keep the intermediate results of stream processing.

Complex operations on a data stream are implemented through the composition of elementary operations. For instance, finding the current largest value in a sequence of data items can be implemented as the composition of a read operation max and a write operation mwrite, see Figure 1. In another example, an operation find compares an item taken from a stream with the n largest items stored in sorted order in a data container $w_{sort}$ and finds all items smaller than the new one, see Figure 1. If at least one item is found, a sequence of items that should be written to $w_{sort}$ to keep it sorted is passed to an operation swrite.

In the general case, the implementation of an n-argument operation is performed by the decomposition of $f(w_1, \ldots, w_{i-1}, x_i, w_{i+1}, \ldots, w_n)$ into n − 1 binary operations.
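The composition of find and swrite can be sketched in C++ as follows. The sketch makes several assumptions that are not stated in the text: items are doubles, the container is kept in descending order, find_smaller reports a position rather than the sequence of smaller items, and the container is also filled while it holds fewer than n items. The function names find_smaller, swrite and process are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Holds the largest items seen so far in descending order.
using SortedWindow = std::vector<double>;

// find: position of the first stored item smaller than `item`,
// or w_sort.size() if no stored item is smaller.
std::size_t find_smaller(const SortedWindow& w_sort, double item) {
    assert(std::is_sorted(w_sort.begin(), w_sort.end(),
                          std::greater<double>()));
    auto it = std::upper_bound(w_sort.begin(), w_sort.end(), item,
                               std::greater<double>());
    return static_cast<std::size_t>(it - w_sort.begin());
}

// swrite: insert `item` at `pos` and drop the smallest item when the
// container would exceed its capacity n.
void swrite(SortedWindow& w_sort, std::size_t pos, double item,
            std::size_t n) {
    assert(pos <= w_sort.size());
    w_sort.insert(w_sort.begin() + static_cast<std::ptrdiff_t>(pos), item);
    if (w_sort.size() > n) w_sort.pop_back();
}

// Composition of the two operations for one incoming item.
void process(SortedWindow& w_sort, double item, std::size_t n) {
    std::size_t pos = find_smaller(w_sort, item);
    if (pos < w_sort.size() || w_sort.size() < n) {
        swrite(w_sort, pos, item, n);
    }
}
```

Passing a position instead of a sequence of items keeps the sketch short; the container contents after each step are the same.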

ICSOFT 2007 - International Conference on Software and Data Technologies


Figure 2: A data stream processing network implementing $(x * w_1) + (x * w_2)$.

For example, an operation $f(x, w_1, w_2)$ that processes a data stream x and fixed size windows $w_1$ and $w_2$ on the streams y and z can be implemented as $f_2(f_1(x, w_1), w_2)$ and represented as a path $p : f_1(w_1), f_2(w_2), \varepsilon$. A data stream x is piped into a path p, written $x \to p$, in order to process the data items. The last symbol in a path identifies the next path to be used for the processing. A special symbol $\varepsilon$ denotes the end of processing.
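A path of this kind can be sketched in C++ as a list of steps, each binding a binary operation to its container argument. The sketch assumes double-valued items and windows stored as vectors; the names Step, Path and run_path are hypothetical, and an empty string stands in for the end-of-processing symbol.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

using Item = double;
using Window = std::vector<Item>;

// One step of a path: a binary operation bound to its container argument.
struct Step {
    std::function<Item(Item, const Window&)> op;
    const Window* container;
};

// A path is a sequence of steps plus the name of the path to continue with;
// an empty name plays the role of the end-of-processing symbol.
struct Path {
    std::vector<Step> steps;
    std::string next;
};

// Pipe one data item through a path: the result of each step becomes the
// left argument of the following step.
Item run_path(Item x, const Path& p) {
    for (const Step& s : p.steps) {
        assert(s.container != nullptr);
        x = s.op(x, *s.container);
    }
    return x;
}
```

Running a two-step path corresponds to evaluating the nested form: the first step plays the role of the inner operation and the second step the outer one.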

As a simple example consider an operation $f(x, w_1, w_2) = (x * w_1) + (x * w_2)$ that processes a data stream x against the fixed size windows $w_1$ and $w_2$. The paths:

$p : copy()(1:p_1, 2:p_2)$
$p_1 : *(w_1), write(w_3), +(w_4), write(result), \varepsilon$
$p_2 : *(w_2), write(w_4), +(w_3), write(result), \varepsilon$

implementing $f(x, w_1, w_2)$ are visualized in Figure 2.

A module is a set of paths encapsulated as a complex operation $m(p_1, \ldots, p_k, d_1, \ldots, d_l)\ 1:\varepsilon, \ldots, p:\varepsilon$ where $p_1, \ldots, p_k$ are the path parameters, $d_1, \ldots, d_l$ are the data container parameters and $1:\varepsilon, \ldots, p:\varepsilon$ are the outputs.

Processing of many data streams requires an individual implementation of paths for each of the streams involved in an application. A data stream processing network is a set of path expressions together with the data streams "piped" into the paths.

4 DESIGN OF APPLICATIONS

In an operational model of data stream processing an application acting on the streams $x_1, \ldots, x_n$ is represented as an n-argument operation $f(x_1, \ldots, x_n)$. Due to the reactivity principle, an application should be able to recompute the operation after a new data item $\delta$ is appended to any one of the streams. Therefore, an application programmer must provide the implementations of n operations $f_1(w_1 \oplus \delta, w_2, \ldots, w_n), \ldots, f_n(w_1, \ldots, w_n \oplus \delta)$ where $w_i \oplus \delta$ denotes the contents of a window $w_i$ after the insertion of a data item $\delta$. In order to speed up the evaluation of the operations, all computations on the windows that have not been changed since the previous evaluation are taken from the earlier recorded temporary results, also called materializations $d_1, \ldots, d_m$, see Figure 3. Hence, the implementation of $f_i(w_1, \ldots, w_i \oplus \delta, \ldots, w_n)$ is performed through the transformation of the n-argument operation into an expression $e_i(d_1, \ldots, d_m, w_i \oplus \delta)$ over the binary operations $\alpha_1, \ldots, \alpha_m$, the window $w_i \oplus \delta$, and the materializations $d_1, \ldots, d_m$.

Figure 3: Implementation of $f_i(w_1, \ldots, w_i \oplus \delta, \ldots, w_n)$.
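The role of materializations can be illustrated with a small C++ sketch. It assumes a deliberately simple, hypothetical application in which the result is the sum of the per-window sums, so that each materialization is just the recorded sum of one window; the class name SumOfWindows and the method on_item are not taken from the formal model.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <numeric>
#include <vector>

// Hypothetical application: f(w_1, ..., w_n) = sum of the per-window sums.
class SumOfWindows {
public:
    SumOfWindows(std::size_t n, std::size_t window_size)
        : windows_(n), partial_(n, 0.0), window_size_(window_size) {
        assert(n > 0 && window_size > 0);
    }

    // A new item delta arrives on stream i. Only window i is re-scanned;
    // the contributions of the unchanged windows are taken from the
    // recorded results in partial_.
    double on_item(std::size_t i, double delta) {
        assert(i < windows_.size());
        std::deque<double>& w = windows_[i];
        if (w.size() == window_size_) w.pop_front();
        w.push_back(delta);
        partial_[i] = std::accumulate(w.begin(), w.end(), 0.0);
        return std::accumulate(partial_.begin(), partial_.end(), 0.0);
    }

private:
    std::vector<std::deque<double>> windows_;
    std::vector<double> partial_;  // one recorded result per window
    std::size_t window_size_;
};
```

The cost of one update is the size of the changed window plus n, instead of the total size of all windows.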

A sequence of binary operations is transformed into a path $p_i : \alpha_1(d_1), \ldots, \alpha_m(d_m)$ where $d_j$ is a window on a data stream, see Figure 3. A transformation of an expression $e_i$ into a path is performed in the following way. We start from an operation $\alpha_1(w_i \oplus \delta, d_1)$ and we construct a path $p_i : \alpha_1(d_1)$. Next, we consider an operation $\alpha_2(\alpha_1(w_i \oplus \delta, d_1), d_2)$ and we extend the path $p_i$ to get $p_i : \alpha_1(d_1), \alpha_2(d_2)$. We repeat this process until the operation $\alpha_m$ at the root of expression $e_i$ is processed. Finally, we append $write(w_{out})$ to the path $p_i$. We repeat this process for all expressions $e_i$, $i = 1, \ldots, n$. Next, if $d_j$ is a materialization of the intermediate results then we insert an operation $write(d_j)$ into all path expressions that contribute to the contents of $d_j$. At the end, we add an operation $write(w_i)$ at the beginning of all paths $p_i$ whose inputs are directly taken from the data streams.
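The construction of a path from a chain of binary operations can be sketched in C++ as follows. The sketch assumes that operations and containers are identified by plain strings and that the path is returned in textual form; the function name build_path is hypothetical. The insertion of write operations for materializations fed by several paths is not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Build the textual form of a path for a chain of binary operations,
// where ops[j] is applied to the running result and containers[j].
// The path starts by writing the new item to its window and ends by
// writing the result to the output container.
std::vector<std::string> build_path(const std::string& window,
                                    const std::vector<std::string>& ops,
                                    const std::vector<std::string>& containers,
                                    const std::string& out) {
    assert(ops.size() == containers.size());
    std::vector<std::string> path;
    path.push_back("write(" + window + ")");
    for (std::size_t j = 0; j < ops.size(); ++j) {
        path.push_back(ops[j] + "(" + containers[j] + ")");
    }
    path.push_back("write(" + out + ")");
    return path;
}
```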

5 IMPLEMENTATION OF

APPLICATIONS

The implementation stage that follows application design includes the preparation of a formal specification and optimization of paths, the generation of implementation code, and the implementation of the operations. XML is chosen as the language for the formal specification of paths.

An XML document that describes a data stream processing application consists of PORT, DATA-TYPE, WINDOWS, and PATH elements. An element PORT includes information about the sources from which the data items are collected; its attributes include TYPE and TO-PATH.

An element DATA-TYPE contains information about the structures of data items handled by an application. Its subelements and attributes are modeled in a way similar to C++ class structures.

An element WINDOWS contains information about the different types of windows used by an application.

An element PATH represents the paths a data stream processing application consists of. It is described by the attributes NAME and TYPE, where NAME identifies a path and TYPE is the type of data item processed by the path. The subelements of PATH include repetitions of the elements OPERATION and OUTPUT.

An element OPERATION, which represents an operation included in a path, has its own sub-elements including COMMENTS, GET, STORE, and BRANCH.
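To make the structure of the specification easier to follow, the elements named above can be mirrored by a small in-memory model in C++. This is a sketch under assumptions: the text does not define the contents of GET, STORE, BRANCH or OUTPUT, so they are held as plain strings, DATA-TYPE and WINDOWS are omitted, and the struct and function names (OperationSpec, PathSpec, PortSpec, Specification, target_of, validate) are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// OPERATION element; the meaning of each field is not defined in the
// text, so the fields are kept as uninterpreted strings.
struct OperationSpec {
    std::string comments;  // COMMENTS
    std::string get;       // GET
    std::string store;     // STORE
    std::string branch;    // BRANCH
};

// PATH element with its NAME and TYPE attributes.
struct PathSpec {
    std::string name;
    std::string type;
    std::vector<OperationSpec> operations;
    std::vector<std::string> outputs;  // OUTPUT elements
};

// PORT element with its TYPE and TO-PATH attributes.
struct PortSpec {
    std::string type;
    std::string to_path;
};

struct Specification {
    std::vector<PortSpec> ports;
    std::vector<PathSpec> paths;
};

// Look up the PATH a PORT feeds; nullptr when TO-PATH names no PATH.
const PathSpec* target_of(const Specification& spec, const PortSpec& port) {
    for (const PathSpec& path : spec.paths) {
        if (path.name == port.to_path) return &path;
    }
    return nullptr;
}

// A simple consistency check a code generator might run first.
void validate(const Specification& spec) {
    for (const PortSpec& port : spec.ports) {
        assert(target_of(spec, port) != nullptr);
    }
}
```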

The C++ code generated from an XML specification implements the entire application except the elementary operations, which have to be provided separately by the application programmers. Every data item handled by a data stream processing application has its type declared in the application. Data types are represented by C++ classes, and the variables that are instances of a particular data type are stored as either private or public variables.

All paths described in the XML document are represented as sequences of operations that form segments within a function. The code generated from XML uses sockets for connecting and listening to different ports.
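The following C++ sketch shows one possible shape of such generated code; it is an assumption made for illustration and not the output of the actual generator. A data type becomes a class, the elementary operations scale and write_item stand for programmer-supplied code, and all paths are segments of a single function selected by a path identifier. The names Reading, PathId and run are hypothetical, and the socket handling is left out.

```cpp
#include <cassert>
#include <vector>

// A data type declared in the specification, rendered as a C++ class.
class Reading {
public:
    explicit Reading(double v) : value(v) {}
    double value;
};

// Elementary operations, assumed to be supplied by the programmer.
Reading scale(const Reading& r, const std::vector<double>& w) {
    assert(!w.empty());
    return Reading(r.value * w.back());
}

void write_item(std::vector<double>& container, const Reading& r) {
    container.push_back(r.value);
}

// P_END plays the role of the end-of-processing symbol.
enum PathId { P_SCALE, P_STORE, P_END };

// One function hosts all paths; each case is the segment of one path and
// ends by selecting the path to continue with.
void run(PathId start, Reading item, const std::vector<double>& w,
         std::vector<double>& result) {
    PathId current = start;
    while (current != P_END) {
        switch (current) {
        case P_SCALE:
            item = scale(item, w);
            current = P_STORE;
            break;
        case P_STORE:
            write_item(result, item);
            current = P_END;
            break;
        case P_END:
            break;
        }
    }
}
```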

6 SUMMARY AND FUTURE WORK

This work considers the design and implementation of data stream processing applications in environments where limited computational resources or specific requirements imposed on the applications make the use of a complex DSMS impractical. In our approach an application processing n input data streams is represented as an n-ary operation. We show how to decompose such operations into expressions built of binary operations, materializations, and input data items, and then we describe the translation of the expressions into sets of data stream processing paths. In our model one data stream is directed for processing to one path, and a set of paths represents the entire application. The paths are formally described in an XML based language and implemented through automatic translation into C++ code.

The following are possible directions for future extensions of our approach to data stream processing. An interesting idea is to distribute the computations over many processing units. A closely related problem is the distribution of the processing in sensor networks. Another problem is related to the simultaneous processing of more than one data stream. In such a case the synchronization of the flows of data items along the processing paths needs to be thoroughly addressed.

REFERENCES

Abadi, D., Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Erwin, C., Galvez, E., Hauton, M., Maskey, A., Rasin, A., Singer, A., Stonebraker, M., Tatbul, N., Xing, Y., Yan, R., and Zdonik, S. (2003). Aurora: A data stream management system. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 663–663.

Avnur, R. and Hellerstein, J. (2002). Continuously adaptive continuous queries over streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 49–60.

Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. (2002). Models and issues in data stream systems. In Popa, L., editor, Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 1–16. ACM Press.

Cranor, C., Johnson, T., Spatatschek, O., and Shkapenyuk, V. (2003). Gigascope: A stream database for network applications. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 644–648.

Getta, J. R. and Vossough, E. (2004). Optimization of data stream processing. SIGMOD Record, 33(3):34–39.

Motwani, R., Widom, J., Arasu, A., Babcock, B., Babu, S., Datar, M., Manku, G., Olston, C., Rosenstein, J., and Varma, R. (2003). Query processing, resource management, and approximation in a data stream management system. In Proceedings of the First Biennial Conference on Innovative Data Systems Research, pages 245–256.

Zdonik, S., Stonebraker, M., Cherniack, M., Cetintemel, U., Balazinska, M., and Balakrishnan, H. (2003). The Aurora and Medusa projects. Bulletin of the Technical Committee on Data Engineering, pages 3–10.
