as a pure storage system that should not perform any
analysis tasks. Instead, analyses are performed in
the stream processing applications using the manifold
advantages of the SPEs for distributed data analysis.
Therefore, entire GPS tracks and other sensor data
(binary data) should be stored and accessed quickly.
Consequently, CRUD operations (especially read, insert,
and update) are the only operations of importance to us.
The software development within our project has
shown that database performance is the key to the
overall performance of our processing. Unexpectedly,
we discovered that the performance characteristics of
the databases change when they are accessed from
streaming applications. We therefore assume that
stream processing changes the access
patterns used to query the databases. This could re-
sult from the fact that the engines use windowing or
micro-batching mechanisms, which lead to short in-
terruptions between the individual processing steps.
In addition to this presumed unusual access behavior,
direct stream handling confronts the databases with
countless small queries, whose number can change
constantly and which would usually be bundled into
larger transactions in a batch processing world.
This results in a relatively uncommon and quite spe-
cial access behavior for which the databases may not
have been optimized.
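
The contrast between many small queries and bundled transactions can be illustrated with a minimal sketch. It is not the setup used in this study: an in-memory SQLite database serves only as a stand-in for a distributed database, and the table name `tracks`, the helper functions, the payload size of 1 KiB and the batch size of 100 are all assumptions made for illustration. Timings obtained this way say nothing about the distributed systems examined here; the sketch only shows the two access patterns side by side.

```python
import sqlite3
import time


def make_store():
    # Assumption: in-memory SQLite stands in for a real distributed database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tracks (id INTEGER PRIMARY KEY, payload BLOB)")
    return conn


def insert_per_record(conn, records):
    # Streaming-style access: one statement and one commit per record.
    for rec_id, payload in records:
        conn.execute(
            "INSERT INTO tracks (id, payload) VALUES (?, ?)", (rec_id, payload)
        )
        conn.commit()


def insert_micro_batched(conn, records, batch_size):
    # Batch-style access: records are bundled into one transaction per batch.
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        conn.executemany(
            "INSERT INTO tracks (id, payload) VALUES (?, ?)", batch
        )
        conn.commit()


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM tracks").fetchone()[0]


def timed(fn, *args):
    # Wall-clock duration of a single call, in seconds.
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


# Hypothetical binary records: 1000 payloads of 1 KiB each.
records = [(i, bytes([i % 256]) * 1024) for i in range(1000)]

per_record_conn = make_store()
batched_conn = make_store()

t_single = timed(insert_per_record, per_record_conn, records)
t_batch = timed(insert_micro_batched, batched_conn, records, 100)
print(f"per-record: {t_single:.4f}s, micro-batched: {t_batch:.4f}s")
```

Both functions store the same data; they differ only in how many statements and commits the database has to handle, which is the aspect of the access pattern discussed above.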
To further analyze this behavior, we have performed
extensive studies on the performance of distributed
databases integrated in streaming applications. We
assumed that analyzing the databases in isolation,
independent of the stream processing, might have led
to unreliable results for our use case, since the
presumed access patterns resulting from the stream
processing would not have been considered.
Consequently, we have analyzed the interaction of
common databases and SPEs on the basis
of database queries typical for our use cases, in which
we mainly work on binary data rather than typed data.
Our study is focused on three research questions:
1) Which distributed databases are best suited for
high-performance processing of binary data?
2) Is there an SPE that offers performance advantages
regarding the integration of distributed databases?
3) Are there specific combinations of SPEs and
databases that work more efficiently than others?
In this paper, we present the results of this study, in
which we benchmarked the databases Cassandra, HBase,
MariaDB, MongoDB and PostgreSQL across the SPEs
Apex, Flink and Spark.
Within the scope of several measurement series,
we have identified the weaknesses and strengths of
the storage systems in distributed streaming environments
when processing binary data in order to achieve
well-tuned and balanced data processing with low
latency and high throughput.
In the following, we will discuss the related work
and introduce the examined software systems before
our test setup is explained in detail. The results of
these tests are presented and discussed afterwards. Fi-
nally, the results will be summarized and an outlook
on our further research will be given.
2 RELATED WORK
The performance of SQL and NoSQL databases for
Big Data processing has already been examined from
several perspectives. The Yahoo! Cloud Serving
Benchmark (YCSB) (Cooper et al., 2010) is widely
used to test storage solutions based on a set of predefined
workloads. It is also extensible with respect
to workloads and connectors to storage solutions and
can thus serve as a basis for comparative benchmarks.
In (Cooper et al., 2010) the YCSB was used to
benchmark Cassandra, HBase, PNUTS and sharded
MySQL as representatives of database systems with
different architectural concepts. Trade-offs hypothesized
from the architectural decisions were confirmed in
practice. For example, Cassandra and HBase showed
higher read latencies for read-heavy workloads than
PNUTS and MySQL, and lower update latencies for
write-heavy workloads. While YCSB
is designed to be extensible, the YCSB client directly
accesses a database interface layer, which does not
allow easy integration into a benchmark for stream
processing. Thus, we adopted several of its workloads
but implemented our benchmark ourselves.
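
To make the notion of a workload concrete, the following sketch generates an operation sequence from a read/update mix. The proportions are those of the YCSB core workloads A (50% read, 50% update), B (95% read, 5% update) and C (100% read); which workloads were adopted for the present benchmark is not stated here, so this selection is an assumption. The function `generate_operations`, its parameters and the uniform key choice are hypothetical simplifications (YCSB itself offers other request distributions, such as Zipfian), and binary payloads are modeled as random bytes.

```python
import random

# Read/update proportions of YCSB core workloads A, B and C.
WORKLOADS = {
    "A": {"read": 0.5, "update": 0.5},
    "B": {"read": 0.95, "update": 0.05},
    "C": {"read": 1.0},
}


def generate_operations(mix, key_count, n_ops, payload_size, seed=0):
    """Return a list of (operation, key, payload) tuples drawn from `mix`.

    Hypothetical helper: keys are chosen uniformly, and updates carry a
    random binary payload of `payload_size` bytes; reads carry None.
    """
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    ops = []
    for _ in range(n_ops):
        op = rng.choices(names, weights=weights)[0]
        key = rng.randrange(key_count)
        if op == "update":
            payload = bytes(rng.getrandbits(8) for _ in range(payload_size))
        else:
            payload = None
        ops.append((op, key, payload))
    return ops


# Example: 1000 operations of the read-heavy mix over 100 keys.
operations = generate_operations(WORKLOADS["B"], 100, 1000, 64)
print(sum(1 for op, _, _ in operations if op == "read"), "reads generated")
```

Such a sequence could be replayed from inside a stream processing job instead of from a standalone client, which is the kind of integration that motivated the custom implementation.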
(Abramova and Bernardino, 2013) analyzed Mon-
goDB and Cassandra regarding the influence of data
size on the query performance in non-cluster setups.
They used a modified version of YCSB with six work-
loads. Their results showed that as data size in-
creased, MongoDB’s performance decreased, while
Cassandra’s performance increased. Cassandra per-
formed better than MongoDB in most experiments.
In (Nelubin and Engber, 2013) the authors examined
the performance of Aerospike, Cassandra, MongoDB
and Couchbase with respect to the differences between
using SSDs as persistent storage and purely in-memory
data management. They also used the YCSB
benchmark, with a cluster of 4 nodes. They found
that Aerospike had the best write performance in dis-
tributed use with SSDs, while still offering ACID
guarantees. However, the authors themselves state
that this result is partly caused by the test conditions,
which closely matched the conditions for which
Performance Analysis of Continuous Binary Data Processing using Distributed Databases within Stream Processing Environments