Tackling the Six Fundamental Challenges of Big Data in Research Projects by Utilizing a Scalable and Modular Architecture

Andreas Freymann, Florian Maier, Kristian Schaefer and Tom Böhnel

Anwendungszentrum KEIM, Fraunhofer Institute for Industrial Engineering IAO, Esslingen am Neckar, Germany

Keywords:

Big Data Fundamentals, Scalability, Modular Architecture, Research Projects, Data Lake, Real Time, Open Source, Docker Swarm, Micro Services.

Abstract:

Over the last decades, the need to process and store huge amounts of data has increased enormously, especially in fundamental research. Besides managing large volumes of data, research projects face additional fundamental challenges in terms of data velocity, data variety and data veracity when trying to create meaningful data value. Solutions that address these challenges exist. However, they often show shortcomings in adaptability or usability, or they carry high licence fees. Thus, this paper proposes a scalable and modular architecture based on open source technologies using micro-services which are deployed using Docker. The proposed architecture has been adopted, deployed and tested within a current research project. In addition, its deployment and handling are compared with another technology. The results show that the architecture overcomes the fundamental challenges of processing huge amounts of data and of handling Big Data in research projects.

1 INTRODUCTION

Processing and storing today's growing amounts of Big Data has become a key factor in all areas of life such as research, industry, public or social networks (Y. Demchenko et al., 2013). One contributing factor is that data comes from everywhere and from everybody (S. Kaisler et al., 2013). It originates, for example, from an enormous number of dynamic sensors and devices around the world creating massive amounts of data (M. Kiran et al., 2015), (L. Sun et al., 2017). Within companies, Big Data also plays a crucial role, for example in decision-making (Stucke and Grunes, 2016). Thus, new technologies and architectures are necessary to deal with Big Data and to reach valuable results (Katal et al., 2013), (Volk et al., 2019). However, bringing Big Data together with research projects which investigate new technologies and approaches causes additional challenges which need to be handled.

The general handling of Big Data requires considering certain characteristics such as data volume or data velocity (Katal et al., 2013). However, Big Data faces challenges as well (S. Kaisler et al., 2013), (Katal et al., 2013), (Volk et al., 2019). They can be derived from the Big Data characteristics (Ahmed Oussous et al., 2018), which we at the same time identify as fundamental challenges (FCs) of Big Data. They make processing and storing data more difficult. Even the mere processing of data or the variety of its nature might cause difficulties (Volk et al., 2019). In addition, these FCs of Big Data are intensified in conjunction with research projects, which pose additional challenges due to their settings, for instance financial limitations. We call them FCs of research projects.

https://orcid.org/0000-0002-3735-4545

https://orcid.org/0000-0002-5695-6509

https://orcid.org/0000-0002-7855-6741

https://orcid.org/0000-0001-6426-2606

There already are solutions in practice and litera-

ture which try to handle FCs of Big Data (S. Kaisler

et al., 2013), (M. Kiran et al., 2015). However, several

challenges still persist: Firstly, traditional solutions

for Big Data often show shortcomings in efﬁciency,

scalability, ﬂexibility and performance (Ahmed Ous-

sous et al., 2018). Secondly, such solutions do not

consider the additional challenges that come along

with the FCs of research projects. Thirdly, many so-

lutions have cost models instead of having an open-

source character (M. Kiran et al., 2015).

This paper provides an architecture which has been adopted and developed further to overcome the difficulties of handling FCs of Big Data in conjunction with FCs of research projects, and it shows a successful deployment in a current project (i-rEzEPT).

Freymann, A., Maier, F., Schaefer, K. and Böhnel, T.

Tackling the Six Fundamental Challenges of Big Data in Research Projects by Utilizing a Scalable and Modular Architecture.

DOI: 10.5220/0009388602490256

In Proceedings of the 5th International Conference on Internet of Things, Big Data and Security (IoTBDS 2020), pages 249-256

ISBN: 978-989-758-426-8

The content of this work is based on a previous publication which presents a flexible architecture for smart cities by taking up several architectural design patterns (K. Lehmann and A. Freymann, 2018). Our work provides a special architectural design featuring, e.g., a scalable and modular design based on open source technologies and a distributed server cluster. To evaluate the suggested architecture, it is deployed, used and tested within the aforementioned research project. In addition, we compare the deployment of the architecture with two different technologies: Docker Swarm and Kubernetes (Kubernetes Authors, 2020). The results show that the proposed architecture overcomes the FCs of Big Data and FCs of research projects.

The paper is structured as follows: After background information in Section 2, Section 3 describes the FCs of Big Data and the FCs of research projects and how they influence one another. Section 4 presents our architecture, derived from seven identified requirements. The architecture is then evaluated in Section 5. Finally, after the related work in Section 6, Section 7 concludes our work and gives an outlook on future work.

2 BACKGROUND INFORMATION

Dealing with Big Data requires a well-defined architecture and technologies that are able to process huge amounts of data (Katal et al., 2013). Strong basic features of such architectures usually are flexibility and scalability, which help to cope with changes such as changing requirements or new data sources (K. Lehmann and A. Freymann, 2018). Several architectural design patterns have become state of the art over the last years, such as the lambda architecture (a), micro-services (b) and distributed systems (c).

The project i-rEzEPT is promoted by the German Federal Ministry of Transport and Digital Infrastructure. It investigates the participation of battery electric vehicles in the primary reserve market (Funding code: 03EMF0103B).

A Lambda Architecture offers a solution for efficiently processing large amounts of data (a). It enables real-time analysis and, at the same time, more complex and accurate analysis using batch methods. The architecture consists of three layers: speed layer, batch layer and serving layer. The speed layer processes an incoming data stream in real time. The batch layer executes heavy computations at a lower frequency. The output of the speed and batch layers can be joined before presentation. The serving layer stores results of computations, handles queries and provides the interface for the user (M. Kiran et al., 2015).
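The interplay of the three layers can be illustrated with a small, single-process sketch. It is only a toy model under our own assumptions: the class name, the method names and the choice of "latest value" and "mean" as the two views are hypothetical and not taken from the cited work or from the project code.

```python
from collections import defaultdict
from statistics import mean


class LambdaSketch:
    """Toy lambda architecture: speed, batch and serving layer in one process."""

    def __init__(self):
        self.master_data = []   # append-only raw readings, input of the batch layer
        self.speed_view = {}    # latest value per sensor, kept by the speed layer
        self.batch_view = {}    # mean value per sensor, kept by the batch layer

    def ingest(self, sensor, value):
        """Speed layer: update the real-time view as each reading arrives."""
        self.master_data.append((sensor, value))
        self.speed_view[sensor] = value

    def run_batch(self):
        """Batch layer: recompute the heavier aggregate over all stored readings."""
        grouped = defaultdict(list)
        for sensor, value in self.master_data:
            grouped[sensor].append(value)
        self.batch_view = {s: mean(v) for s, v in grouped.items()}

    def query(self, sensor):
        """Serving layer: join both views before presentation."""
        return {
            "latest": self.speed_view.get(sensor),
            "mean": self.batch_view.get(sensor),
        }
```

In this sketch, a reading ingested after the last call to `run_batch()` is already visible in the `latest` field of a query, while the `mean` field only changes with the next batch run. This mirrors the different update frequencies of the speed and batch layers.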

A Micro-service architecture divides a complex system into many small applications, called micro-services (b). They offer an interesting contribution to the architecture, as they only process small independent units and therefore provide a lot of flexibility (Peinl et al., 2016), (L. Sun et al., 2017). In comparison, the traditional monolithic approach combines a software solution in a single unified application. Micro-services benefit from being highly horizontally scalable, flexible and easy to maintain (L. Sun et al., 2017).

’A Distributed System is a collection of independent computers that appears to its users as a single coherent system’ (Tanenbaum and van Steen, 2007, p. 2) (c). Such systems provide high scalability, as computers or servers can be added, changed or removed. The challenge of distributed systems is to manage and allocate tasks (e.g. micro-services) between the available computation resources (Verma et al., 2015).

In addition, managing micro-services in distributed systems is a significant factor for the well-functioning operation of an entire system. For orchestration, micro-services are packed into containers. Those containers enable faster booting of the services and easy deployment (H. Li et al., 2019). This additionally simplifies the service orchestration as a whole, for example through automation functions.

3 FCs IN DETAIL

3.1 FCs of Big Data

Many deﬁnitions mention ﬁve basic characteristics

which are related to Big Data (Y. Demchenko et al.,

2013), (Katal et al., 2013), (S. Kaisler et al., 2013).

These are often described as the "5 Vs" of Big Data:

volume, velocity, variety, veracity and value (cf. Ta-

ble 1). This work also considers the speciﬁc attribute

complexity mentioned in (S. Kaisler et al., 2013), as

research data often has a complex structure which

makes this characteristic especially important. Table

1 represents the Big Data characteristics in detail.

Data Volume. It refers to the huge amount of data which needs to be handled (Volk et al., 2019). At the same time, processing this volume is a challenge that Big Data has to face, because new data is continuously generated everywhere (S. Kaisler et al., 2013). In particular, smartphones and RFID devices produce a massive amount of data around the world (M. Kiran et al., 2015).

Data Velocity. It describes the frequency of incoming data from different sources (Katal et al., 2013). A high velocity requires transmitting and processing data quickly (Ahmed Oussous et al., 2018).

IoTBDS 2020 - 5th International Conference on Internet of Things, Big Data and Security

Table 1: Characteristics of Big Data.

Data ...      Short description
Volume        Available amount of data existing within a certain context (S. Kaisler et al., 2013), (Volk et al., 2019).
Velocity      Speed or frequency at which data originates from a certain data source (Y. Demchenko et al., 2013).
Variety       Diversity the data can be represented by, e.g. images, text or videos. This also addresses the data streaming and data aggregation (Katal et al., 2013).
Complexity    Interconnectedness and interdependence of data content (S. Kaisler et al., 2013).
Veracity      Plausibility and correctness of data (Y. Demchenko et al., 2013).
Value         Creation of valuable information which can be further used such as for decision-making (Y. Demchenko et al., 2013).

Data Variety. Data variety measures the diversity of data (Volk et al., 2019). It comprises, e.g., the data formats being processed, such as documents, time series or videos. The related challenge is that data is often incompatible, non-structured and inconsistent (S. Kaisler et al., 2013). This is also due to the large number of different IoT devices which produce different data formats (L. Sun et al., 2017).

Data Complexity. Relationships and interconnections

between data from various sources represent the data

complexity (S. Kaisler et al., 2013). This means that

data content depends on other data content. Chal-

lenges are linking and changing interconnected data

across a large Big Data system (Katal et al., 2013).

Data Veracity. It is mentioned by (Y. Demchenko et al., 2013) and comprises consistency and trustworthiness of data. Ensuring that data is not manipulated is important throughout data processing, from trusted sources to secure storage. Implausible data needs to be detected while it is being processed. Otherwise, data that lacks trustworthiness or consistency might have negative impacts, e.g., on interpretations.

Data Value. The data value is the reason why all Big

Data efforts are made. It is created through four pro-

cessing steps: collection, cleaning, aggregation and

presentation. The data value focuses on the useful-

ness of data which means to create valuable informa-

tion and knowledge which can be further used such

as for decision-making (Y. Demchenko et al., 2013).

This characteristic depends on a good consideration

of all other Big Data characteristics.

3.2 FCs of Research Projects

Research projects have special settings which differ from those of non-research projects and which lead to different methods and architectures (Y. Demchenko et al., 2013). In our practice, we identified several interdependent characteristics (described in Table 2) which make a research project unique.

Table 2: Characteristics of research projects.

Characteristic           Short description
Large amounts of data    Research projects create complex and large amounts of data (Y. Demchenko et al., 2013).
Volatile requirements    Quickly changing requirements.
Developing prototypes    Focus on research results, less concern for marketable products.
Available budgets        A given scope which limits financial options.
Innovative character     Trying new concepts and technologies.
Research community       Have an open character to share research results (Y. Demchenko et al., 2013).

Large Data Amount. A typical property of research

projects is a large amount of data which needs to be

processed and stored as hypotheses and research goals

are pursued (Y. Demchenko et al., 2013). In conjunc-

tion with a Big Data context, the derived challenge

from research projects is the confrontation with an ad-

ditional large volume of structured and unstructured

data from different data sources.

Volatile Requirements. Research projects have clear research goals. However, how to reach these goals technically (e.g. the software architecture design) is generally determined during the project. This depends on other factors which might change during the project, such as data sources identified later or bad data quality (e.g., unstructured or volatile data).

Developing Prototypes. Research projects have a strong focus on research results and on answering research questions. Thus, less focus is set on a broad functionality of software solutions. The rudimentary functionality is generally implemented by developing prototypes. A resulting challenge is that technologies or methods are searched for on the fly, which can result in a pieced-together solution that might affect scalability or adaptability.

Available Budgets. The financing of research projects is usually characterized by a predefined budget. Changing that budget, especially in public research projects promoted by federal and state governments, might involve higher effort.

Innovative Character. Being innovative is an im-

portant factor within research projects. This means

new technologies or frameworks need to be tested to

achieve new experiences. However, using new in-

novative technologies, software or frameworks might

cause a challenge due to their lesser maturity.

Research Community. Research projects have an open character towards an open research community. This means that the published results can be validated and reproduced by other scientists (Y. Demchenko et al., 2013). This requires producing valuable and meaningful knowledge through a well-defined solution.

3.3 Intensification of Big Data FCs in Research Projects

The FCs of Big Data and of research projects often go hand in hand. Sometimes they influence and in some cases even intensify one another. For example, the data volume characteristic is intensified within research projects. Generally, such intensification requires an architecture that allows an easy and quick inclusion of additional data sources. This complication also affects the complexity of data handling throughout the whole project, as it can dynamically increase or decrease with every additional data source. Ensuring data veracity is affected in the same way. Every new data source requires the implementation of new functions, e.g., to detect outliers or to clean data. The handling of data velocity faces the aforementioned problems as well. If new data sources are acquired which offer data at a higher or lower velocity than the sources already included in the project, more difficulties arise, such as ensuring fast data consumption at different velocities.

4 A FLEXIBLE ARCHITECTURE FOR MANAGING BIG DATA WITHIN RESEARCH PROJECTS

This section presents our architecture which supports managing Big Data (cf. Figure 1) by considering the FCs of Big Data (cf. Section 3.1) and the FCs of research projects (cf. Section 3.2). Generally, the structure of our architecture illustrates the data processing steps from data collection (bottom left), over data cleaning and data aggregation (bottom right), to data presentation (top left). The right side represents how the data is stored, using a data lake and a frontend database for outside requests. When data is passed on between the different data processing steps, it is stored within data queues. Furthermore, an interface separates the frontend (cf. data presentation) from the backend (cf. data collection, cleaning and aggregation) and handles the transfer between them.

In the following, we present identiﬁed architecture

requirements and how they are realized within our

architecture. In general, the identiﬁed requirements

are derived from the described FCs of Big Data, the

FCs of research projects and from the literature. They

comprise modularity, adaptability, scalability, well-

deﬁned data handling, distributed system, computing

capacity, and infrastructure management.

4.1 Modularity

We identified modularity as a required feature, which means dividing and structuring a system into software and hardware modules realized by micro-services. Containers are a common approach, as they offer, e.g., virtualization and lightweight operation in comparison to conventional virtualization techniques (H. Li et al., 2019). This enables scalability and adaptability of a system and helps to cope with data variety. Therefore, our architecture has a modular design which is achieved by using micro-services. We use the Docker container technology to run each software component as a micro-service. This comprises, e.g., Docker containers for databases, for the frontend or for scripts to collect, clean and aggregate data.

4.2 Adaptability

We identified that being adaptable supports the handling of volatile requirements. In general, adaptability describes the ability to modify and extend a system (K. Lehmann and A. Freymann, 2018). This means being able to change, add or remove hardware, software or technologies such as databases, frameworks or programming languages. This also benefits the development of prototypes, whose innovative character typically involves changes, e.g., of the technologies or programming languages used in the prototype.

Figure 1: Overview of our architecture.

In order to support adaptability, we use a standardized syntax for the data format (i.e. JSON) which is used for the data flow between the micro-services. Additionally, a standardized query language at the frontend is realized with GraphQL (GraphQL Foundation, 2019), and automated testing and delivery of Docker images is realized with Drone (Drone, 2019).
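How a common JSON format decouples micro-services can be sketched as follows. The paper does not specify the message layout, so the envelope fields `source`, `timestamp` and `payload` and both function names are purely hypothetical assumptions for illustration.

```python
import json
from datetime import datetime, timezone

# Hypothetical envelope fields; the actual message layout is not given in the paper.
REQUIRED_FIELDS = {"source", "timestamp", "payload"}


def make_message(source, payload):
    """Wrap a reading in a common JSON envelope before handing it to the next service."""
    return json.dumps({
        "source": source,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    })


def read_message(raw):
    """Parse an envelope and fail early if a required field is missing."""
    message = json.loads(raw)
    missing = REQUIRED_FIELDS - set(message)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return message
```

Because every service only depends on the envelope and not on the implementation of its neighbours, a service could be rewritten in another language or replaced by another technology as long as it keeps producing and consuming the same JSON structure.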

4.3 Scalability

Scalability is an important and required feature within Big Data (Ahmed Oussous et al., 2018), (Volk et al., 2019). Offering scalability supports the horizontal and vertical expansion of a solution through its hardware and software components. This makes it possible to store a large and constantly growing amount of data, for instance by adding new database nodes, and helps to cope with volatile requirements. To realize scalability, it is common to have a distributed system with distributed databases and servers to split data processing (S. Kaisler et al., 2013), (Sindhu and Hegde, 2017). We realize scalability by using a server cluster managed by the Docker Swarm orchestration. To scale the data store, we use Elasticsearch (Elasticsearch, 2019), which allows data and manager nodes to be spread arbitrarily over the server cluster.

4.4 Data Handling

A well-defined data processing should comprise the four data processing steps of Big Data (i.e., collection, cleaning, aggregation and presentation). This helps to cope with the FCs of Big Data and with the additional large amount of data related to research projects. Ultimately, proper data handling creates a better research result which might get more attention within the research community. Additionally, the aspect of data flows needs to be addressed. Data flow describes how the data is transported through the four data processing steps. For this transportation, three measures are recommended: Firstly, packing data into small units simplifies the data processing. Secondly, buffering data packages between data processing steps is a common approach, e.g., by using a message broker to save intermediate results. Thirdly, splitting the data flow into several data layers using the lambda architecture is required for Big Data processing (M. Kiran et al., 2015).

The architecture is designed to realize the four data processing steps (cf. Figure 1). For data collection, we realized an individual micro-service for each data source, running in a Docker container, which collects the data and queues it using the message broker RabbitMQ (RabbitMQ, 2019). The data is pulled from the queue by further micro-services for data cleaning (e.g. checking the time format). The data is then embedded within a uniform data structure. Finally, other micro-services store the data within a data lake. Real-time data is sent directly to a frontend database. The data stored within the data lake is then aggregated using different micro-services. Within our architecture, two databases are chosen: Elasticsearch (Elasticsearch, 2019) as data lake and ArangoDB (ArangoDB, 2019) as frontend database. This separates non-aggregated data (cleaned and raw data) from aggregated data (for the frontend). This also relieves the data lake because requests for aggregated data are only sent to the frontend database. For the data presentation, two frameworks are used: Django (django, 2020) and Angular (Angular, 2019).
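The flow from collection over a queue to cleaning, storage and aggregation can be sketched in plain Python. This is a simplified stand-in under explicit assumptions: the standard library `queue.Queue` takes the place of a RabbitMQ queue, a list and a dictionary take the place of Elasticsearch and ArangoDB, and all function and field names are hypothetical.

```python
import queue
from statistics import mean

raw_queue = queue.Queue()   # stand-in for a broker queue between collection and cleaning
data_lake = []              # stand-in for the data lake (cleaned, non-aggregated data)
frontend_db = {}            # stand-in for the frontend database (aggregated data)


def collect(source, readings):
    """Collection service: one per data source, pushes small messages to the queue."""
    for value in readings:
        raw_queue.put({"source": source, "value": value})


def clean_and_store():
    """Cleaning service: pull messages, drop malformed ones, store a uniform record."""
    while not raw_queue.empty():
        message = raw_queue.get()
        try:
            record = {"source": message["source"], "value": float(message["value"])}
        except (TypeError, ValueError):
            continue  # unparseable reading; a real service would log or quarantine it
        data_lake.append(record)


def aggregate(source):
    """Aggregation service: read from the data lake, write the result to the frontend DB."""
    values = [r["value"] for r in data_lake if r["source"] == source]
    if values:
        frontend_db[source] = {"mean": mean(values), "count": len(values)}
```

Each function corresponds to what would be a separate containerized micro-service. Since the stages only communicate through the queue and the two stores, a stage can be restarted, scaled or replaced without changing the others.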

4.5 Distributed System

Distributed systems play a significant and required role within Big Data (Sindhu and Hegde, 2017), (Katal et al., 2013). The software runs on different interconnected servers. Distributed systems enable load balancing, distribution of computational power, data storage and efficient parallel processing. Our architecture realizes this by using a Docker Swarm server cluster.
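The idea of load balancing across servers can be illustrated with a generic greedy placement. This is not how Docker Swarm schedules containers. It is only a minimal sketch of the principle, and the function name, the node names and the cost values are hypothetical.

```python
import heapq


def assign_tasks(task_costs, node_names):
    """Greedy least-loaded placement: each task goes to the currently least busy node."""
    heap = [(0.0, name) for name in node_names]   # entries are (accumulated load, node)
    heapq.heapify(heap)
    placement = {name: [] for name in node_names}
    # Placing expensive tasks first tends to give a more even spread.
    for task, cost in sorted(task_costs.items(), key=lambda item: -item[1]):
        load, name = heapq.heappop(heap)
        placement[name].append(task)
        heapq.heappush(heap, (load + cost, name))
    return placement
```

For example, four services with estimated costs 4, 3, 2 and 1 placed on two nodes end up with a total load of 5 on each node.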

4.6 Infrastructure Management

Realizing the aforementioned requirements needs an overall management of the system to provide transparency (Peinl et al., 2016). Besides controlling and monitoring the system, the orchestration of running Docker containers (e.g. micro-services) is a significant task for such a management (H. Li et al., 2019).

The entire system and the Docker containers are

orchestrated by Docker Swarm and we use Portainer

(Portainer, 2019) to manage (i.e. conﬁgure) the

swarm. Portainer manages distributed servers from

different locations, Docker container images, related

Docker networks and volumes.

4.7 Computing Capacity

A significant requirement and key factor for processing Big Data is to offer appropriate computing capacity (Y. Demchenko et al., 2013). This is often related to parallel processing, especially to enable real-time data processing (S. Kaisler et al., 2013). This helps to process large amounts of data as well as to cope with high data veracity requirements. Given appropriate servers with high computational power, a distributed system can optimize the utilization of that power by load balancing.

5 EVALUATION

In this work, we adapted and further developed our

architecture in a current research project called i-

rEzEPT. The project is characterised by processing

data from various sources with high velocity. Data

types comprise environmental data (e.g. temperature,

humidity or cloudiness), telematic data from electric

vehicles (i.e. GPS, speed or battery state) and smart

metering data (e.g. power inverters from photovoltaic

systems or frequency meters).

Table 3 lists the needed data storage in detail for

the different data types. To ensure maximum avail-

ability, the data gets replicated within the data lake,

which doubles the needed capacity. The raw data (cf.

Figure 1) is saved in a compressed state, as it does

not need to be accessed on a regular basis. Thus, it

is expected to require 190GB (assuming a 90% com-

pression ratio) of storage capacity. Aggregations of

the different timeseries are expected to need another

200MB of storage. This accumulates to a total ex-

pected storage need of around 4.15TB.

Table 3: Storage requirement by data type for 2 years.

Data type                        Gigabyte
+ Smart metering data            1800
+ Environmental data             67
+ Telematic data                 15
+ Raw data                       190
= Subtotal                       2072
+ With replication               4144
+ Aggregation (only ArangoDB)    0.2
= Total                          4144.2
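The figures in Table 3 can be reproduced with a few lines of arithmetic. The variable names are ours, and the cross-check of the raw-data figure assumes that the uncompressed raw data has roughly the same volume as the three listed data types together, which the paper does not state explicitly.

```python
# Expected storage per data type for two years, in gigabytes (values from Table 3).
cleaned_gb = {"smart_metering": 1800, "environmental": 67, "telematic": 15}
raw_gb = 190            # compressed raw data, as stated in the text
aggregation_gb = 0.2    # aggregated time series kept in the frontend database

subtotal_gb = sum(cleaned_gb.values()) + raw_gb
replicated_gb = 2 * subtotal_gb             # replication doubles the data lake footprint
total_gb = replicated_gb + aggregation_gb

# Assumed cross-check: a 90% compression ratio leaves 10% of the uncompressed volume.
raw_estimate_gb = 0.10 * sum(cleaned_gb.values())

print(f"subtotal: {subtotal_gb} GB, replicated: {replicated_gb} GB, total: {total_gb:.1f} GB")
print(f"raw data estimate: {raw_estimate_gb:.1f} GB")
```

Under these assumptions the raw-data estimate comes out at 188.2GB, which is consistent with the rounded 190GB used in the table, and the total of 4144.2GB matches the last row.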

Our evaluation of the presented architecture focuses, firstly, on the fulfillment of the most important Big Data characteristics and, secondly, on the architecture deployment using Docker Swarm. This deployment is then compared with the additional technology Kubernetes (Kubernetes Authors, 2020) by evaluating the handling of both technologies. The project currently runs on a server cluster with Docker Swarm. It consists of seven Ubuntu 18.04 virtual machines, utilizing a total of 42 CPUs as well as 82GB of RAM. The Docker Swarm shares this hardware with another research project, so it does not have exclusive access to the cluster's resources. It remains to be seen whether the architecture performs as well on bigger clusters.

5.1 Big Data Characteristics Evaluation

Variety: Elasticsearch makes it easy to add new data sources regardless of their data format. By the end of the project, it is expected to have 25 different data sources, providing data in 18 different formats.

Velocity: The 25 different data sources each provide measurements at rates ranging from two times per second to once every thirty minutes, which is handled by the message broker. It splits the incoming data streams into easy-to-process data packages and temporarily stores them in queues until another micro-service pulls them from the queues and processes them.

Veracity: This can be checked outside and within the

architecture. For some data sources, veracity can be

ensured before the data even gets pulled from the API.

For other data sources the veracity can be checked

during the data cleaning phases. Simple plausibility

checks can be performed before storing it in the data

lake (e.g. invalid speed values).

Complexity: The evaluation shows that Elasticsearch

is suitable for working with data having different data

structures. Connections between different data types

can easily be represented by adding additional meta-

values to the different timeseries and the Elasticsearch

query system allows for complex aggregations across

multiple indexes.
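The simple plausibility checks mentioned under Veracity can be sketched as a range check that runs before a record is stored. The field names `speed_kmh` and `battery_state_pct` and the limits are illustrative assumptions and are not taken from the project.

```python
def _in_range(value, low, high):
    """True if value is a number within [low, high]; NaN and non-numbers fail."""
    return isinstance(value, (int, float)) and low <= value <= high


def is_plausible(record, max_speed_kmh=250.0):
    """Return True if a telematics record passes simple range checks."""
    if not _in_range(record.get("speed_kmh"), 0.0, max_speed_kmh):
        return False
    charge = record.get("battery_state_pct")
    # The battery state is optional here, but if present it must be a valid percentage.
    if charge is not None and not _in_range(charge, 0.0, 100.0):
        return False
    return True


def filter_plausible(records):
    """Split records into those to store and those to reject before the data lake."""
    accepted = [r for r in records if is_plausible(r)]
    rejected = [r for r in records if not is_plausible(r)]
    return accepted, rejected
```

Keeping the rejected records instead of silently dropping them would make it possible to inspect why a data source delivers implausible values.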


5.2 Deployment Evaluation

With regard to the Portainer deployment, our evaluation showed that adding new servers to the Docker Swarm and new micro-services for data processing is simple. At the beginning of the project, the cluster comprised five virtual machines. During the project, two additional database servers were added to scale the data lake and to set up the frontend database. The cluster also started out with only a couple of micro-services. The number of micro-services has been incrementally expanded by including new data sources, running new data aggregations and adding the frontend. It is expected that the number of micro-services will grow to around 100 by the end of the project. Adding these micro-services showed the adaptability of the architecture but also its limitations. Our current limitation for an in-depth evaluation is the small size of the cluster. Furthermore, the cluster's scalability and load balancing capabilities are limited by the underlying storage layer, since the database nodes are currently pinned to specific virtual machines with additional storage. In our testing deployments, using Kubernetes together with a distributed storage system allows a single database node to move freely within the cluster and between different virtual machines. It therefore offers a promising solution for the further growing architecture, its scalability and its load balancing features.

6 RELATED WORK

In the literature, several publications present architectures and frameworks for Big Data. Regarding the challenges related to Big Data, several publications discuss the Big Data characteristics comprising data volume, data velocity, data variety, data value and data veracity (S. Kaisler et al., 2013), (Katal et al., 2013) and (Y. Demchenko et al., 2013). (Katal et al., 2013) as well as (S. Kaisler et al., 2013) added data complexity as an additional Big Data characteristic. In our work, we took these Big Data characteristics as a fundamental scope that needs to be considered.

The content of this work is based on a previ-

ous publication which presents an architecture for

smart cities in the context of research projects and

takes up several architectural design features such

as a distributed Event Based System, micro-services

and a lambda-architecture for the data handling (K.

Lehmann and A. Freymann, 2018). Scalability and

ﬂexibility are described as basic features. Our archi-

tecture extends this previous work in different parts,

e.g., by using a server cluster to distribute the micro-

services which signiﬁcantly enhances the scalability

or by a proper orchestration to manage the distributed

system. Furthermore, our architecture is designed for

small, medium and large research projects.

The publication (Y. Demchenko et al., 2013) presents an architecture called the Scientific Data Infrastructure (SDI) which tackles challenges of Big Data in the context of science and also focuses on a general approach for data lifecycle management in research and industry. The SDI also comprises the data lifecycle from data collection through processing to presentation. An additional micro-service architecture for IoT applications is proposed by (L. Sun et al., 2017), which also gives strong consideration to scalability and adaptability, addressing them from the service layer down to the physical layer. Furthermore, they address significant challenges which arise with the dynamically growing number of physical IoT devices. In essence, they propose a system design comprising several core micro-services, a service orchestration and a lightweight communication deployed with Docker and Kubernetes. Additionally, (Volk et al., 2019) address difficulties of creating a Big Data architecture with regard to requirements engineering, technology selection and project realization. They provide several references to existing architectures and propose a solution for finding a Big Data architecture by utilizing a decision support system.

In comparison to (L. Sun et al., 2017), our architecture has a strong focus on challenges within research projects and presents a clear comparison between the challenges of Big Data and those of research projects, including how they intensify one another. In addition, our architecture treats scalability and modularity as important features for coping with the mentioned challenges, which is missing in (Y. Demchenko et al., 2013). We also offer a concrete proposal for how to implement and deploy the architecture, which is evaluated and demonstrated in a current research project. This stands in contrast to (Volk et al., 2019), who only propose a solution for finding an architecture, not a concrete architecture itself.

7 CONCLUSIONS AND FUTURE WORK

This work presented an architecture which deals with the fundamental challenges of processing Big Data while also taking the unique characteristics and challenges of modern-day research projects into account. It therefore supports the handling of Big Data in research projects, comprising huge amounts of varied, high-frequency, structured, unstructured and complex

data. At the same time, it can be deployed easily and quickly. This work identified requirements that need to be considered when designing such an architecture, e.g., well-defined data handling, infrastructure management and scalability. Our architecture is scalable both horizontally and vertically.
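
The distinction between the two scaling directions can be sketched as follows. This is only an illustrative model, not part of the deployed system: the `Service` class, its fields and the example values are assumptions made for this sketch. Horizontal scaling adds further instances of a micro-service across the cluster, while vertical scaling grants more resources to each existing instance.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical description of a micro-service (illustrative names only)."""

    name: str
    replicas: int            # number of container instances in the cluster
    cpus_per_replica: float  # CPU share granted to each instance

    def scale_out(self, extra_replicas: int) -> None:
        # Horizontal scaling: run more instances of the same service.
        self.replicas += extra_replicas

    def scale_up(self, extra_cpus: float) -> None:
        # Vertical scaling: give each existing instance more resources.
        self.cpus_per_replica += extra_cpus

    def total_cpus(self) -> float:
        # Overall CPU capacity available to the service.
        return self.replicas * self.cpus_per_replica


ingest = Service(name="ingest", replicas=2, cpus_per_replica=0.5)
ingest.scale_out(2)    # horizontal: 2 -> 4 replicas
ingest.scale_up(0.5)   # vertical: 0.5 -> 1.0 CPU per replica
```

Both operations raise the total capacity reported by `total_cpus()`, but only horizontal scaling spreads the load over additional instances.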

One possible future improvement of the architecture would be to switch the container orchestration from Docker Swarm to Kubernetes, which offers a more robust solution and better fine-tuning. It would also allow the use of a lightweight operating system distribution instead of the Ubuntu distribution that is currently used, which would free up a non-trivial part of the cluster's resources and reduce management effort. Another desirable improvement would be a stronger focus on load balancing, synchronization between the cluster's machines and ensuring service and data consistency within the cluster. Problems with synchronization and consistency are currently handled at the code level and should ideally be shifted towards the cluster management as well, wherever applicable.

In conclusion, our architecture makes it easy to handle all the aforementioned challenges laid out in Section 4. In addition, it is built entirely from open-source solutions, allowing for more freedom in terms of budget allocation.

REFERENCES

Ahmed Oussous, Fatima-Zahra Benjelloun, Ayoub Ait Lahcen, and Samir Belfkih (2018). Big data technologies: A survey. Journal of King Saud University - Computer and Information Sciences, 30(4):431–448.

Angular (2019). One framework. mobile & desktop. URL: https://angular.io, accessed 2019-12-17.

ArangoDB (2019). One engine. one query language. multiple data models. URL: arangodb.com, accessed 2019-12-17.

django (2020). django: The web framework for perfectionists with deadlines. URL: djangoproject.com/, accessed 2020-02-18.

Drone (2019). Automate software testing and delivery. URL: https://drone.io/, accessed 2019-12-17.

Elasticsearch (2019). Get started with elasticsearch. URL: elastic.co, accessed 2019-12-17.

GraphQL Foundation (2019). A query language for your api. URL: https://graphql.org, accessed 2019-12-16.

H. Li, N. Chen, B. Liang, and C. Liu (2019). Rpbg: Intelligent orchestration strategy of heterogeneous docker cluster based on graph theory. In 2019 IEEE 23rd Int. Conf. on Computer Supported Cooperative Work in Design (CSCWD), pages 488–493.

K. Lehmann and A. Freymann (2018). Demo abstract: Smart urban services platform a flexible solution for smart cities. In 2018 IEEE/ACM Third Int. Conf. on Internet-of-Things Design and Implementation (IoTDI), pages 306–307.

Katal, A., Wazid, M., and Goudar, R. H. (2013). Big data: issues, challenges, tools and good practices. In 2013 Sixth int. conf. on contemporary computing (IC3), pages 404–409.

Kubernetes Authors (2020). Production-grade container orchestration: Automated container deployment, scaling, and management. URL: kubernetes.io, accessed 2020-02-21.

L. Sun, Y. Li, and R. A. Memon (2017). An open iot framework based on microservices architecture. China Communications, 14(2):154–162.

M. Kiran, P. Murphy, I. Monga, J. Dugan, and S. S. Baveja (2015). Lambda architecture for cost-effective batch and speed big data processing. In 2015 IEEE Int. Conf. on Big Data (Big Data), pages 2785–2792.

Peinl, R., Holzschuher, F., and Pfitzer, F. (2016). Docker cluster management for the cloud - survey results and own solution. Journal of Grid Computing, 14(2):265–282.

Portainer (2019). Making docker management easy. URL: portainer.io, accessed 2019-12-17.

RabbitMQ (2019). Understanding rabbitmq. URL: rabbitmq.com, accessed 2019-12-17.

S. Kaisler, F. Armour, J. A. Espinosa, and W. Money (2013). Big data: Issues and challenges moving forward. In 2013 46th Hawaii Int. Conf. on System Sci., pages 995–1004.

Sindhu, C. S. and Hegde, N. P. (2017). Handling complex heterogeneous healthcare big data. Int. Journal of Computational Intelligence Research, 13(5):1201–1227.

Stucke, M. E. and Grunes, A. P. (2016). Big data and competition policy. Oxford University Press, Oxford, 1st edition.

Tanenbaum, A. S. and van Steen, M. (2007). Distributed Systems: Principles and Paradigms. Pearson Prentice Hall, Upper Saddle River, NJ, 2nd edition.

Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., and Wilkes, J. (2015). Large-scale cluster management at google with borg. In Proceedings of the Tenth European Conf. on Computer Systems, page 18.

Volk, M., Staegemann, D., Pohl, M., and Turowski, K. (2019). Challenging big data engineering: Positioning of current and future development. In Proceedings of the 4th Int. Conf. on Internet of Things, Big Data and Security, pages 351–358. SCITEPRESS - Science and Technology Publications.

Y. Demchenko, P. Grosso, C. de Laat, and P. Membrey (2013). Addressing big data issues in scientific data infrastructure. In 2013 Int. Conf. on Collaboration Technologies and Systems (CTS), pages 48–55.

IoTBDS 2020 - 5th International Conference on Internet of Things, Big Data and Security
