Authors:
Manuel Weißbach
;
Hannes Hilbert
and
Thomas Springer
Affiliation:
Faculty of Computer Science, Technische Universität Dresden, Germany
Keyword(s):
Stream Processing, Benchmarking, Database Benchmark, Big Data, Performance.
Abstract:
Big data applications must process increasingly large amounts of data within ever shorter time. Often a stream processing engine (SPE) is used to process incoming data with minimal latency. While these engines are designed to process data quickly, they are not made to persist and manage it. Thus, databases are still integrated into streaming architectures, which often becomes a performance bottleneck. To overcome this issue and achieve maximum performance, all system components used must be examined in terms of their throughput and latency, and how well they interact with each other. Several authors have already analyzed the performance of popular distributed database systems. While doing so, we focus on the interaction between the SPEs and the databases, as we assume that stream processing leads to changes in the access patterns to the databases. Moreover, our main focus is on the efficient storing and loading of binary data objects rather than typed data, since in our use cases the
actual data analysis is not to be performed by the database, but by the SPE. We’ve benchmarked common databases within streaming environments to determine which software combination is best suited for these requirements. Our results show that the database performance differs significantly depending on the access pattern used and that different software combinations lead to substantial performance differences. Depending on the access pattern, Cassandra, MongoDB and PostgreSQL achieved the best throughputs, which were mostly the highest when Apache Flink was used.
(More)