for each program. Therefore, the need arises for a
finer-grained distributed platform for the execution of
statistical analysis programs.
Processing of data can be performed in batches
and in continuous streams. In batch processing, in-
put data is aggregated into a single batch that will all
be analysed at one time. Typically, the output of a
batch is only visible at the end of the run. An ex-
ample of a batch process is a retailer taking all sales
data of the past week and then calculating sales per-
formance. If the start of the week contains anomalous
data, the retailer must wait until the end of the week
for the output of the batch process. Historical analysis
of sensor data, e.g. calculating the standard deviation
for the sensor values of the previous day, fits perfectly
into the category of batch processing.
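The batch example above can be sketched in a few lines of Python, using the standard-library statistics module; the sensor readings are illustrative sample data, not values from the paper:

```python
# Batch processing sketch: compute the standard deviation of one day's
# sensor readings only after the whole batch has been collected.
import statistics

def daily_stddev(readings):
    """Analyse the complete batch at once; output exists only at the end."""
    return statistics.stdev(readings)

# Yesterday's sensor values, aggregated into a single batch.
readings = [20.1, 20.4, 19.8, 21.0, 20.6, 19.9]
print(round(daily_stddev(readings), 3))  # → 0.456
```

Note that nothing is visible until the batch is complete: an anomaly in the first reading only surfaces once the whole day has been processed.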
In stream processing, input data is analysed as it
arrives and (partial) output is available immediately.
This means that one can act on the data as soon as it
arrives. However, this may lead to skewed results as
the analysis is never actually finished. In the case of
the retailer, comparing partial results from one coun-
try to the next is difficult because of possibly different
time zones and business hours. To be able to quickly
respond to anomalous situations, the analysis of cur-
rent sensor data is performed soon after the data ar-
rives. The stream processing approach, e.g. calcu-
lating the moving average of the twenty latest sensor
values, is a good fit for this kind of analysis.
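A minimal Python sketch of such a sliding-window average, assuming a fixed window over the latest values (the window size and input values are illustrative):

```python
# Stream processing sketch: a moving average over the latest sensor
# values, updated as each reading arrives. Partial output is available
# immediately, before the stream is "finished".
from collections import deque

class MovingAverage:
    def __init__(self, window=20):
        # deque with maxlen automatically drops the oldest value.
        self.values = deque(maxlen=window)

    def update(self, value):
        """Ingest one reading and return the current (partial) average."""
        self.values.append(value)
        return sum(self.values) / len(self.values)

ma = MovingAverage(window=3)
for v in [1.0, 2.0, 3.0, 4.0]:
    print(ma.update(v))  # 1.0, 1.5, 2.0, 3.0
```

Each arriving reading immediately yields an updated result, which is what allows a quick response to anomalous values.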
A combination of stream processing (for detect-
ing current anomalous situations) and batch process-
ing (for historical analysis) is therefore needed for a
multi-purpose sensor data analysis platform. This pa-
per describes the analysis cloud, a platform for the
reliable execution of large numbers of small scale sta-
tistical analysis programs for sensor data. The plat-
form supports both batch processing and stream pro-
cessing.
2 RELATED WORK
2.1 Batch Processing
Hadoop (White, 2009) (Hadoop Website, 2012) is one
of the most well-known batch processing systems cur-
rently in use. It is a framework for distributed process-
ing of large data sets across a cluster of machines. The
design and implementation is inspired by the papers
on MapReduce (Dean and Ghemawat, 2004) and the
Google File System (GFS) (Ghemawat et al., 2003).
At its core, Hadoop consists of the MapReduce en-
gine and the Hadoop Distributed File System (HDFS).
The file system ensures that the data is stored reliably
on the nodes in the cluster. The engine allows appli-
cations to be split up into many small fragments of
work. These fragments are executed on nodes cho-
sen so that each fragment is close, in terms of latency,
to its input data.
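The map/reduce model behind this can be illustrated with a plain-Python sketch; this is not Hadoop's Java API, only the pattern of mapping input fragments to key/value pairs and then reducing per key (the sensor records and the per-sensor mean are illustrative):

```python
# MapReduce pattern sketch: map each input record to key/value pairs,
# group the intermediate values by key, then reduce each group.
from collections import defaultdict

def map_fn(record):
    # Emit (sensor_id, value) pairs for one input record.
    sensor_id, value = record
    yield sensor_id, value

def reduce_fn(sensor_id, values):
    # Aggregate all values for one sensor: here, the mean.
    return sensor_id, sum(values) / len(values)

records = [("s1", 20.0), ("s2", 18.5), ("s1", 21.0), ("s2", 17.5)]

# Map phase: each fragment of input produces intermediate pairs.
intermediate = defaultdict(list)
for record in records:
    for key, value in map_fn(record):
        intermediate[key].append(value)

# Reduce phase: aggregate all values that share a key.
results = dict(reduce_fn(k, vs) for k, vs in intermediate.items())
print(results)  # {'s1': 20.5, 's2': 18.0}
```

In Hadoop the map and reduce phases run as many small fragments on different nodes, with the shuffle between them handled by the framework.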
Disco (Disco Website, 2012) is a large scale data
analysis platform. Its goals and design are very simi-
lar to those of Hadoop. Disco also provides a MapRe-
duce engine and the Disco Distributed File System
(DDFS). The main difference lies in the chosen pro-
gramming languages. The MapReduce engine of
Disco is written in Erlang, which is a language de-
signed for building robust, fault-tolerant, distributed
applications. The user applications themselves are
written in Python.
Spark (Zaharia et al., 2010) (Spark Website, 2012)
is a cluster computing system that aims to make data
analysis fast. It provides primitives for in-memory
cluster computing, so that repeated access to data
is much quicker than with disk-based systems like
Hadoop and Disco. Although Spark is a relatively
new system, it can access any data source supported
by Hadoop, making it easy to run over existing data.
Akka (Munish, 2012) (Akka Website, 2012) is a
toolkit for building distributed, fault tolerant, event-
driven applications on the Java Virtual Machine
(JVM). Akka uses actors, lightweight concurrent en-
tities, to process messages asynchronously. This rise
in abstraction level relieves developers of low-level
issues in distributed systems, such as threads and
locks. Actors are location transparent by design,
which means that the distribution of an application
is not hardcoded, but can be configured based on a
certain topology at runtime.
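As an illustration of the actor model Akka builds on, a minimal actor with a mailbox might look like the following; this is plain Python, not Akka (which runs on the JVM), and the handler and poison-pill convention are assumptions for the sketch:

```python
# Actor model sketch: an actor owns a mailbox and processes messages
# asynchronously, one at a time, so user code needs no explicit locks.
import queue
import threading

class Actor:
    def __init__(self, handler):
        self.mailbox = queue.Queue()
        self.handler = handler
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def tell(self, msg):
        """Fire-and-forget, asynchronous message send."""
        self.mailbox.put(msg)

    def _run(self):
        # Process one message at a time from the mailbox.
        while True:
            msg = self.mailbox.get()
            if msg is None:  # poison pill: stop the actor
                break
            self.handler(msg)

received = []
actor = Actor(received.append)
for i in range(3):
    actor.tell(i)
actor.tell(None)
actor.thread.join(timeout=1.0)
print(received)  # [0, 1, 2]
```

Because the sender never blocks on `tell` and never touches the actor's state directly, where the actor actually runs can be left to configuration, which is the location transparency described above.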
Hadoop, Disco and Spark can all be used for his-
torical sensor data analysis. However, because of their
batch processing nature, it is awkward or even impos-
sible to use them for detecting current anomalous sit-
uations. Akka is a toolkit for developing distributed
applications rather than a complete framework, so
much framework functionality is still missing. Akka
also does not guarantee
message arrival, which makes it less suitable for data
analysis.
2.2 Stream Processing
Esper (Esper Website, 2012) is a software component
for processing large volumes of incoming messages
or events. It is not an application or framework in
itself, but can be plugged into an existing Java appli-
cation. Esper is a Complex Event Processing (CEP)
engine with its own domain specific language, called
EPL, for processing events. EPL is a declarative
language for dealing with high frequency time-based