moves data between applications (Apache Software Foundation, 2016b). Kaa supports Kafka as one possible log appender, which forwards the collected data to our service (CyberVision, 2016b). In our IoT context, we need Kafka to deal with incoming data that may arrive asynchronously. The pipeline ingests even large amounts of data and handles them while guaranteeing at-least-once semantics. Kafka runs as a cluster on one or more servers, therefore works in parallel, and stores the data as so-called streams of records in topics, which serve as categories (Apache Software Foundation, 2016b).
As described in Figure 2, Kafka consists of three main parts in our application. The producers 1,...,n receive data from the Kaa cluster and push them to the brokers in the Kafka cluster; a producer pushes the data as fast as the brokers can handle them. There are usually several brokers to maintain load balance. A single broker can handle large amounts of data, up to several hundred thousand read and write actions per second. ZooKeeper coordinates the brokers and informs producers and consumers about new brokers or about the failure of one of them; the Kafka brokers themselves are stateless. Consumers pull the stream of records from the Kafka cluster and can also hand it over to other applications for further processing.
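To make this step more concrete, the following minimal sketch shows how a producer could push one measurement record to the brokers. The broker address, topic name and record layout are assumptions chosen for illustration and are not part of our evaluated setup.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MeasurementProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address; in our setup this would point to the Kafka cluster.
        props.put("bootstrap.servers", "broker1:9092");
        // Wait for acknowledgement from all in-sync replicas to avoid losing records.
        props.put("acks", "all");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Key: sensor id; value: CSV line with timestamp, parameter, value and GPS position.
            producer.send(new ProducerRecord<>("pm-measurements",
                    "sensor-42", "2016-01-01T12:00:00,PM10,35.0,48.7758,9.1829"));
        }
    }
}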
The third step is the Apache Storm [5] cluster. The decision for Storm was made in the same way as the decision for Kafka: we compared the characteristics of different candidate systems. Storm enables us to process unbounded streams of data with at-least-once semantics. Storm consists of nodes and processes data as tuples. The nimbus is the master node; all other nodes are workers. The master distributes data to the workers, assigns tasks to them and checks for node failures. Supervisors receive their tasks from the nimbus; each supervisor manages multiple worker processes to complete these tasks. ZooKeeper monitors the status of the worker nodes and coordinates between the nodes. It maintains the supervisors and the nimbus by taking care of their states.
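As a minimal, hypothetical sketch of this step, the following topology wires a Kafka spout (from the storm-kafka-client module) to a simple parsing bolt. The topic name, broker address, CSV layout and parallelism hints are assumptions for illustration, not the topology we actually deployed.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class MeasurementTopology {

    // Hypothetical bolt that parses one CSV record pulled from Kafka.
    public static class ParseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String record = tuple.getStringByField("value");
            String[] fields = record.split(",");
            System.out.println("Measured value: " + fields[2]);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This sketch only prints; it emits no tuples downstream.
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // The spout pulls the measurement records from the (assumed) Kafka topic.
        builder.setSpout("kafka-spout", new KafkaSpout<>(
                KafkaSpoutConfig.builder("broker1:9092", "pm-measurements").build()), 1);
        // Two parallel bolt instances consume the stream via shuffle grouping.
        builder.setBolt("parse-bolt", new ParseBolt(), 2)
               .shuffleGrouping("kafka-spout");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("measurement-topology", new Config(),
                builder.createTopology());
        Thread.sleep(30000);
        cluster.shutdown();
    }
}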
The last step will be a long-term storage, several services such as search services on the data, and a UI. There are different possibilities for these, but they are not part of our work, as we mainly focus on the architecture and on how it collects and processes data. Storage and presentation services are not considered further.
As can be seen, our requirements have been satisfied. We have a fault-tolerant, distributed, open and easy-to-understand system that, due to these characteristics, should be easy to maintain. Our architecture enables us to use wireless and movable sensors of any manufacturer. The number of sensors can change at any time, and the amount of data being processed can be handled flexibly and in real time. We have included parallel working elements such as a DSMS, so processing data and queries can be done in parallel. The architecture can be combined with many other technologies, modules and languages. Due to its fault tolerance, the system can handle different issues that may be caused by restricted resources and the like.
4 SIMULATION
To test our architecture, we simulated what happens to data sent through our system until they leave the Storm cluster. Here, we identified four categories of possible issues that might come up. We then dealt with those issues by providing solutions; these solutions describe how our architecture handles the issues. We also used them to compare the modules we chose, such as Apache Storm and Kafka, to possible alternatives.
The categories are data issues, hardware issues, synchrony and software issues, as well as miscellaneous issues.
Before presenting further information, we want to emphasize that our issues are easier to understand when having a look at the data of the United States Environmental Protection Agency (US-EPA), which we used. Data of mainly traditional systems are freely available on the web. The US-EPA [6] provides a vast amount of data for download, as do several German cities [7][8] and pages providing overviews [9]. They mostly provide data on PM10 or PM2.5 particulate matter in micrograms per cubic meter (µg/m³). The city of Stuttgart provides values from 1987 until today and even delivers measurements for O2, O3, rainfall and so on. As of 2017, the German government plans to generally publish the weather data of the Deutscher Wetterdienst (DWD) [10]. The data we found usually come in the comma-separated values (CSV) format. For our architecture, we focused on the US-EPA data, as they focus on particulate matter and deliver side information such as the GPS position. This enabled us to pretend that these data come from a WSN as we designed it. An example of the data we used can be found in Table 1 [11].
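For illustration only, the short sketch below reads such CSV records into a simple data structure. The column order (date, parameter, value, latitude, longitude) and the example values are assumptions and do not reproduce the actual US-EPA column layout shown in Table 1.

import java.util.ArrayList;
import java.util.List;

public class MeasurementCsvReader {

    // Hypothetical record with the fields we rely on; the real US-EPA files
    // contain many more columns (see the original data).
    public record Measurement(String date, String parameter,
                              double value, double latitude, double longitude) { }

    // Parses lines of the assumed form: date,parameter,value,latitude,longitude
    public static List<Measurement> parse(List<String> lines) {
        List<Measurement> result = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split(",");
            result.add(new Measurement(f[0], f[1], Double.parseDouble(f[2]),
                    Double.parseDouble(f[3]), Double.parseDouble(f[4])));
        }
        return result;
    }

    public static void main(String[] args) {
        // One invented PM10 reading in µg/m³ with a GPS position.
        List<Measurement> m = parse(List.of("2016-01-01,PM10,35.0,48.7758,9.1829"));
        System.out.println(m.get(0));
    }
}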
[5] https://storm.apache.org/
[6] http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/download_files.html#Blanks
[7] http://www.stadtklima-stuttgart.de/index.php?klima_messdaten_download
[8] http://umweltdaten.nuernberg.de/aussenluft/stadt-nuernberg/messstation-am-flugfeld/feinstaub-pm10/bereich/30-Tages-Ansicht.html (accessed 2017-01-03)
[9] http://aqicn.org/map/world/
[10] https://www.bmvi.de/SharedDocs/DE/Pressemitteilungen/2017/006-dobrindt-dwd-gesetz.html
[11] We shortened the information where necessary. For more details, please see the original data mentioned above.