Data Transformation Methodologies

between Heterogeneous Data Stores

A Comparative Study

Arnab Chakrabarti

1,2

and Manasi Jayapal

Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany

Databases and Information Systems, RWTH Aachen University, Aachen, Germany

Keywords:

Industrial Data Exchange, NoSQL, Hadoop, Pentaho, Talend, Apache Scoop, Cassandra.

Abstract:

With the advent of the NoSQL Data Stores, solutions for data migration from traditional relational databases to

NoSQL databases is gaining more impetus in the recent times. This is also due to the fact that data generated

in recent times are more heterogeneous in nature. In current available literatures and surveys we ﬁnd that

in-depth study has been already conducted for the tools and platform used in handling structured, unstructured

and semi-structured data, however there are no guide which compares the methodologies of transforming and

transferring data between these data stores. In this paper we present an extensive comparative study where we

compare and evaluate data transformation methodologies between varied data sources as well as discuss the

challenges and opportunities associated with it.

1 INTRODUCTION

RDBMS (Relational Database Management System)

have been running in data centers for over 30 years

and this could no longer keep up with the pace at

which data is being created and consumed. Limita-

tions of scalability and storage led to the emergence

of NoSQL databases (Abramova et al., 2015). This

breakthrough technology became increasingly pop-

ular owing to the low cost, increased performance,

low complexity and the ability to store “anything”.

This led the pathway to the solutions that included

distributed application workloads and distributed data

storage platforms such as MapReduce which trans-

formed into an open source project called “Apache

Hadoop” (Schoenberger et al., 2013). These tools

and technologies have increased the need to ﬁnd new

ways of transforming data efﬁciently between differ-

ent heterogeneous data stores for organizations to de-

rive maximum beneﬁts of being a data-driven enter-

prise.

For our work we had to study the platforms in-

volved in transforming data in detail and choose the

most appropriate storage systems by determining a se-

lection criteria. Fundamental reasons like the varied

nature of the platforms and the difﬁculty in choos-

ing the right technology components made the ini-

tial decision making cumbersome. The focus was

then to ﬁnd out the process of transforming data from

one database system to the other. It was impor-

tant to create a systematic procedure of how the pro-

cess of transformation was done. After which the

appropriate technologies involved in transformation

were identiﬁed and then data was seamlessly trans-

ferred between the databases. An important step was

to then determine the characteristics of comparison

which could be system/platform dependent or appli-

cation/algorithm dependent. An experimental set up

to execute the above steps was then done and ﬁnally

transformation technologies were compared. While

much has been done in describing the platforms and

tools in great depth, however, there exists no guide to

a comparative study of data transformation between

different big data stores and there lies the contribution

of our work which is presented in this paper.

2 RELATED WORK

Our comparative study presented in this paper could

serve as guidelines for organizations looking to mi-

grate to other platforms by helping them choose the

most efﬁcient way of transforming their data from

one platform to another. Further, data transformation

also ﬁnds its utilization in “data exchange” which is

the process of taking data structured under a source

Chakrabarti, A. and Jayapal, M.

Data Transformation Methodologies between Heterogeneous Data Stores - A Comparative Study.

DOI: 10.5220/0006438802410248

In Proceedings of the 6th International Conference on Data Science, Technology and Applications (DATA 2017), pages 241-248

ISBN: 978-989-758-255-4

241

schema and transforming it into data structured under

a target schema (Doan et al., 2012). Industrial Data

Exchange(IDX) is one such application that provides

a solution to the problem of moving or sharing real-

time data between incompatible systems (Xchange,

2015). Our work is also a part of another such ap-

plication called “Industrial Data Space” project which

lays the groundwork for the foundation of an Indus-

trial Data Space consortium. This research project by

the Fraunhofer Institutes (Otto et al., 2016) aims at

creating an inter company collaboration with fast, au-

tomated data exchange to help companies achieve an

edge in the international competition.

Our work is steered particularly into the compari-

son of data transformation between the most widely

used platforms and tools which will be used effec-

tively to select methodologies amid different partners

of the Industrial Data Space project

2.1 Data Transformation

The upsurge of NoSQL databases led to the downfall

of relational database. In this scenario many orga-

nizations switched to NoSQL databases and needed

to transform their data at large. For example, Netﬂix

moved from Oracle to Cassandra (Netﬂix, 2015) to

prevent a single point failure with affordable scalabil-

ity. Coursera also experienced unexpected deteriora-

tion with MySQL and after thoroughly investigating

Mongodb, DynamoDB and HBASE shifted to Cas-

sandra (Thusoo et al., 2010). Thus, the study of data

transformation can be useful for many enterprises.

3 OVERVIEW OF THE

APPROACH

In Figure 1 below we depict the solution scope of our

work. It presents a classiﬁcation of the data stores that

are addressed in our work with the red dotted lines in-

dicating the transformation process between the data

stores that we report in this paper.

Figure 1: Data transformation between data stores.

Keeping one data store as the source, all the pos-

sible transformation technologies to the other data

stores were found. For example, transformation from

Cassandra to Neo4j could be done using Talend, Pen-

taho, Neo4j-Cassandra data import tool and such pos-

sible technologies were considered individually.In ad-

dition to the technologies, another important goal was

to ﬁnd an overlap of methods between them. For

instance, if the transformation technologies between

data store D1 and data store D2 were M1, M2 and

M3. i.e,

D1 → D2 = M1, M2, M3 = Set A (Consider)

Similarly,

D3 → D4 = M4, M5 = Set B (Consider)

D4 → D1 = M6, M7 = Set C (Consider)

Set A, Set B and Set C should not be totally indepen-

dent of each other i.e, there should be an overlap of

methods between them. This was to ensure a mean-

ingful comparative study at the end. Figure 1 depicts

all the potential transformation possibilities. How-

ever, an important point to note here is that this list

is not exhaustive and there could be other technolo-

gies available for transformation.

3.1 Selection of the Data Stores

Based on the research done in the ﬁeld of compar-

ative study (Abramova et al., 2015),(Kovacs, 2016),

leading surveys (DBEngines, 2008) and Google trend

based on the most popularly used NoSQL databases

- MongoDB (document database), Cassandra and

HBASE (column-store database) were chosen for our

work. Though HDFS(Hadoop Distributed File Sys-

tem) is good for sequential data access, being a

ﬁle system it lacks the random read/write capabil-

ity. Thereby HBASE, a NoSQL database that runs

on top of the Hadoop cluster was also included in

our work. Keeping in mind the rule of variety in

diversity, including a graph database was important.

Graph databases are fast becoming a viable alterna-

tive to relational databases. They are relatively new,

so most of the comparative studies do not compare

graph databases in their work. One of the primary

goals our work was to study the results and challenges

in data transformation for graph databases. Thus we

also include Neo4j as one of the NoSQL data stores.

3.2 Selection of the Data

Transformation Platforms

The scope of our work was not to study the various

tools individually in detail but limited to choosing the

platforms based on a selection criteria and then com-

paring the transformation technologies between them.

Further for an optimal solution to manage big data,

DATA 2017 - 6th International Conference on Data Science, Technology and Applications

242

choosing the right combination of a software frame-

work and storage system was of utmost importance.

Although the software framework and storage sys-

tem share a very important relationship, data trans-

formation between platforms is mainly dependent on

the storage systems chosen to work with the frame-

work. For instance, if an organization chooses to

run Apache Spark with MongoDB and another orga-

nization chooses Apache Spark with Cassandra, data

transformation techniques need to be applied between

MongoDB and Cassandra. Therefore, the primary fo-

cus of this work is on the transformation between var-

ious data stores and the coming sections head in that

direction.

3.3 Choosing the Right Data Sets

Each of the chosen databases have a unique way of

storing data and it was important to choose the dataset

that can be replicated and stored across all of them

to obtain uniform results. For example, the chosen

dataset would be stored as JSON in MongoDb, CSV

in Cassandra and a graph in Neo4j. Other factors were

also considered like Hadoop’s HDFS demon called

Namenode holds the metadata in memory and thus,

the number of ﬁles in the ﬁlesystem is limited to the

Namenode’s memory. The best option in this case

would be to use large ﬁles instead of multiple small

ﬁles (White and Cutting, 2009). Of all the options

available, the Yahoo Webscope Program provided a

“reference library of interesting and scientiﬁcally use-

ful datasets for non-commercial use by academics and

other scientists” (Labs, 2016). Different sizes of Ya-

hoo! Music ratings data were requested and ﬁnally 3

data sets consisting of 1 million, 2 million and 10 mil-

lion records were selected for the proof of concept.

These data sets consisted of random information like

music artists, songs, user preferences and ratings etc.

This variation of data was needed to test if every cho-

sen technique was able to scale and perform well with

the increase in size.

4 CHARACTERISTICS OF

COMPARISON

In this section we present a set of well deﬁned charac-

teristics that we considered for our comparative study.

Previous research (Tudorica and Bucur, nd) indicates

that NoSQL databases are often evaluated on the basis

of scalability, performance and consistency. In addi-

tion to these, system or platform dependent character-

istics could be complexity, cost, time, loss of informa-

tion, fault tolerance and algorithm dependent charac-

teristics could be real-time processing, data size sup-

port etc. For the scope of this work we only con-

sidered Quantitative Characteristics which is a set of

values that focuses on explaining the actual results ob-

served by performing the transformation. These nu-

merical aspects were carefully studied before collect-

ing the data to give the best comparative picture at the

end. Below we present the metrics that we have used

to evaluate our results.

• Maximum CPU Load: This is the maximum per-

centage of the processor time used by the process

during the transformation. This is a key perfor-

mance metric and useful for investigating issues if

any. There was no quota set and the process was

monitored by shutting down all other unnecessary

processor technologies.

• CPU Time: CPU time is the time that the process

has used to complete the transformation.

• Maximum Memory Usage: Maximum memory

usage is the maximum percentage of the physical

RAM used by the process during the transforma-

tion. An important metric to keep a track of re-

source consumption and impact it has on the time.

Analyzing the changes in the resource consump-

tion is an important performance metric. Maximum

CPU load, CPU time and maximum memory usage

were calculated for each of the transformation tech-

niques using htop command in ubuntu.

• Execution Time: The total time taken to com-

plete the transformation. This was measured us-

ing the respective tools for each for the transfor-

mation techniques to compare the faster means

of transforming data between any two given

databases. This time included the time taken to

establish a connection to the source and destina-

tion databases, reading data from the source and

writing data to the destination. As a common unit,

all the results were converted into seconds. How-

ever, some transformations that took a long time

to complete, were expressed in minutes and hours.

• Speed: Speed is computed as the number of rows

transformed per second. For each of the transfor-

mation techniques, this value was obtained from

the tools using which the transformation was per-

formed. The value of speed was important, for

example, in the transformations to Neo4j. Slow

transformations of 3 - 30 rows/second suggested

the need to ﬁnd alternative faster techniques.

Data Transformation Methodologies between Heterogeneous Data Stores - A Comparative Study

243

5 IMPLEMENTATION DETAILS

In this section we will describe in detail the imple-

mentation of this study in which an experiment to ex-

ecute the transformations between the data stores was

setup.

5.1 Experimental Setup

An ubuntu machine with the following conﬁguration

was chosen to run all the transformations described in

our work:

• Ubuntu : 15.10

• Memory : 15.6 GiB

• Processor : Intel R Core TM i7 -3770 CPU @ 3.40

GHz *8

• OS type : 64-bit

• Disk : HDD 500 GiB, Norminal media rotation

rate : 7200, Cache/ Buffer size : 16 MB, Average

seek read : < 8.5 ms, Average seek write : < 9.5

Only the technology and the concerned databases

were made to run whereas all the other processes were

shut down to make sure that no other variables had

an impact on the results. After the completion of ev-

ery job, the technologies and databases were restarted.

Htop command was used to analyse the processes run-

ning on the machine with respect to the CPU and

memory utilization. The process speciﬁc to the tech-

nology was studied using Htop and the quantitative

characteristics like maximum CPU%, Memory% and

Time are documented as Maximum CPU load, Max-

imum Memory Usage and CPU Time respectively.

Figure 2 shows an instance of the Htop command in

which the characteristics are highlighted.

Figure 2: Htop instance.

5.2 Workﬂow

For the experimental setup as described in the paper a

total of 74 transformations were performed and each

of them were ﬁrst tested using 1 million records and

then subsequently with 2 million records. The initial

plan was to use 10 million records for every transfor-

mation but, some transformations took a long time to

complete and given the time restriction it was not fea-

sible to do so. Every technology underwent a stress

test to check if it could complete a job with 10 mil-

lion records and the other transformations were tested

with 1 and 2 million records. Below as shown in Fig-

ure 3 we give a workﬂow that we adhered to during

the entire course of the experiment. This was made

to systematically run and verify each job as it was es-

sential in concluding this study with a fair comparison

between the technologies.

Figure 3: Workﬂow to run the transformations.

5.3 Designing the Data Store

An obstacle that was not expected in the initial design

phase was ﬁnding the right combination of database

versions to be used with the technologies. The same

version had to be compatible with all the technologies

of transformation. First, the decision to ﬁnd the right

version of the technologies was made.

Talend is an open source tool and the latest avail-

able version was used. However, the latest releases

of Pentaho are not open source anymore and an older

DATA 2017 - 6th International Conference on Data Science, Technology and Applications

244

stable version 5.0.1 was used. Sqoop was used in ac-

cordance with Hadoop.

Thus, the following versions of the technologies

were chosen:

• Pentaho data-integration - 5.0.1 stable version

• Talend open studio for big data - 6.1.1

• Apache Hadoop - 1.2.1

• Apache Sqoop - 1.4.4

Below we report the versions of the databases that

we selected for our study :

• MySQL - 5.5.46

• Cassandra - 1.2.4

• Mongodb - 3.0.9

• Neo4j - 2.2

• Hbase - 0.94.11

It was then vital to store the chosen datasets into

all the databases. Since the datasets procured were in

the TSV format, an initial idea was to import it into

Cassandra and then transform them into the other data

stores. However, a decision was made to generalize

this as a procedure and ﬁrst the dataset was stored into

Hadoop. The big elephant is not datatype dependent

and engulfs everything thrown into it. Further, it was

much faster to start the Hadoop demons and upload

the datasets into HDFS.

After the data was stored into 64 MB chunks in

HDFS, the technologies were started one by one. All

databases were subjected to a connection test to check

if they were successfully connected to the tools. Next,

a trial run of the transformations was done. In this

way, two goals were achieved - one testing the whole

system and other getting the datasets into all the data

stores.

5.4 Manual Methods of Transformation

The results of all transformations between Cassandra

and Neo4j were recorded for 2 million records. How-

ever, Neo4j was becoming extremely unresponsive af-

ter transforming 2 million records. For records upto

580,000, successful results were obtained and a de-

cision was made to restrict the number of records in

transformations involving Neo4j and Talend/Pentaho.

To make matters worse, the number of rows trans-

formed per second into Neo4j using various technolo-

gies was limited between 3-28. For example, HDFS

to Neo4j using Pentaho, data was transformed at an

average speed of 18.5 rows/second and it took a to-

tal of 8.7 hours to complete. Similarly for others, the

job would have taken hours to complete and it was

not in the scope of our work to test each one of them.

Thus, alternative manual methods of transformation

were tested. Surprisingly, the transformation yielded

better results and the jobs ran to completion. Man-

ual method of transformation was able to surpass the

5,80,000 barrier and run jobs with 2 million records.

5.4.1 Cassandra - Neo4j Transformation Process

This is a multi step transformation process. As a ﬁrst

step, it was important to make sure that enough mem-

ory was reserved for the Neo4j-Shell. The memory

mapping sizes were conﬁgured according to the ex-

pected data size. Transforming data from Cassandra

to Neo4j was done in the following steps:

• Exporting Cassandra Table: The data needed to

be transformed was exported into a CSV ﬁle using

a simple cqlsh query. Cassandra also stores data

in a CSV format, hence there was no change in

the structure of the ﬁle.

• Validating CSV File: The exported CSV ﬁle had

to be checked for format description, consistency

and the like. To ensure efﬁcient mapping into

Neo4j, the ﬁle was checked for binary zeros, non-

text characters, inconsistent line breaks, header

inconsistencies, unexpected newlines etc. This

was done with the help of a user friendly tool

called CSVkit.

• Load CSV into Neo4j: Neo4j provides “LOAD

CSV” for the medium sized datasets and “Super

Fast Batch Importer” for very large datasets. For

the datasets chosen by us, LOAD CSV was suf-

ﬁcient to complete the transformation. LOAD

CSV is used to map CSV data into complex graph

structures in Neo4j. According to the Neo4j de-

veloper site, it is not only a basic data import

cypher query but a powerful ETL tool. It is often

used to merge data and convert them into graph re-

lationships. Using the LOAD CSV, the validated

CSV ﬁle was transformed into Neo4j.

Similarly, in the Neo4j - Cassandra transforma-

tion, the graph database was ﬁrst exported from Neo4j

using the copy query and then imported into Cassan-

dra after creating an appropriate table.

6 RESULTS AND DISCUSSION OF

THE CHALLENGES

In this section we discuss the results of the experiment

and also report the challenges that we faced during the

entire phase.

Data Transformation Methodologies between Heterogeneous Data Stores - A Comparative Study

245

6.1 Comparing Transformation

Dependent Quantitative

Characteristics

This formative evaluation was used to monitor if the

study was going in the right direction. The data trans-

formation methodologies that were implemented in

this work were compared with one another and eval-

uated in the matrix as described. This on-going eval-

uation was done in parallel with the implementation

stage to facilitate revision as needed. Since every-

thing could not be anticipated at the start of the study

and due to uncertain changes that occurred at different

phases, a revision of the methodologies was necessary

at every stage.

6.1.1 Transformation Results

An environment as described earlier was setup; the

values of maximum CPU load, CPU time, maxi-

mum memory usage were recorded using the Htop

command, outcome of execution time, speed were

documented from the respective technology used in

the transformation and the results were compiled as

shown from table 4 to table 9. There are 6 data

stores in this study and keeping the source constant

and varying the target data store results in 5 combi-

nations per data store. The technologies involved in

this transformation were Pentaho, Talend and Sqoop

(Mappers from 1 to 4). The number of Mappers are

represented with “M” in the table.

• MySQL – > Other Data Stores: Sqoop was much

easier to implement and use in most cases; how-

ever, an increasing number of mappers did not

show a pattern in the execution time. To HDFS,

although Pentaho performed this transformation

in the best way, the output ﬁle records in HDFS

consisted of extra 0's which were added when it

was converted into BIGINT datatype. To Mon-

goDB, each row of the table in MySQL was

assigned an object id when it was transformed

into JSON objects to be stored in MongoDB. To

Neo4j, Pentaho and Talend were not successful in

transforming large datasets into Neo4j. The trans-

formation was very slow in the order of 3-28 rows/

second. Thus, both the transformations were not

completed till the end.

• HDFS –> Other Data Stores: To MySQL, the out-

put table according to the desired schema should

be declared before starting the transformation. Al-

though, Talend started off with 372 percent max-

imum CPU usage, it went down to 1% CPU util-

isation during most of the time with mysql util-

ising 98% of the CPU load during this time.

To HBASE, Talend was faster when compared

to Pentaho but, it became unresponsive after the

transformation was complete. To Neo4j, since this

was one of the ﬁrst instances to be tested, Pentaho

was allowed to run the transformation into com-

pletion and it took 8.7 hours; Talend transformed

data at 3.2 records per second and it would have

taken approximately a day's time to complete.

• HBASE —> Other Data Stores: On the whole,

performance of HBASE was not up to the mark

with Talend. Most of the transformations involv-

ing Talend and HBASE ended in making Talend

unresponsive.

• MongoDB –> Other Data Stores: To Neo4j, the

quality and structure of the output graph in Neo4j

using Pentaho depends on the cypher query given

manually in the generic database connection. On

the other hand talend has an inbuilt Neo4j connec-

tor and the resulting graph is generated based on

the schema declared by the user.

• Cassandra –> Other Data Stores: To Neo4j, Pen-

taho and Talend were very slow in transforming

data. An alternative manual method was then used

which completed transforming 2 million records

in a total time of 153.224 seconds. However,

the output of the manual method depends on the

cypher query used. Despite completing 2 million

records, Neo4j became unresponsive in the end.

• Neo4j –> Other Data Stores: To MySQL, since

Pentaho 5.0.1 was connected to Neo4j using a

generic database connection, MySQL was throw-

ing errors with respect to the length of the record

speciﬁed from the graph database. The transfor-

mation to HDFS was a little different than the oth-

ers in this category. For all the other databases, the

schema was identiﬁed automatically in Pentaho,

but since Hadoop is not datatype dependent, at-

tributes from Neo4j nodes were directly saved into

HDFS. As Pentaho 5.0.1 does not support Neo4j

directly, there was no option to specify the schema

explicitly.

Figure 4: MySQL to other data stores.

DATA 2017 - 6th International Conference on Data Science, Technology and Applications

246

Figure 5: HDFS to other data stores.

Figure 6: HBASE to other data stores.

Figure 7: MongoDB to other data stores.

Figure 8: Cassandra to other data stores.

Figure 9: Neo4j to other data stores.

6.1.2 Processing Speed of Transformation

Methodologies

The processing speed of all transformation method-

ologies involved in transforming data from one

database to the other have been plotted as shown from

ﬁgure 10 to ﬁgure 15. This gives a clear picture of

which technology was the fastest in comparison to the

others. The average rows transformed per second for

each technology has also been plotted to convey the

capability of each technology.

6.2 Summarization of the Results

Although, a conclusion on the fastest tool amongst

them on the whole cannot be drawn, there were cer-

Figure 10: MySQL to other data stores.

Figure 11: HDFS to other data stores.

Figure 12: HBASE to other data stores.

Figure 13: MongoDB to other data stores.

Figure 14: Cassandra to other data stores.

Figure 15: Neo4j to other data stores.

Data Transformation Methodologies between Heterogeneous Data Stores - A Comparative Study

247

tain results worth noting:

• Talend reached a higher maximum CPU load than

Pentaho in most scenarios.

• In transformations from all data stores (except

Neo4j) to MySQL, Pentaho proved to be faster

than Talend.

• There was a signiﬁcant difference in the results of

the transformation from all stores to Cassandra.

Talend was much faster in performing transfor-

mations, sometimes even 10,000 times faster than

Pentaho in these cases. Ironically, for the transfor-

mations from Cassandra to all other data stores,

Pentaho was much faster compared to Talend.

Throwing light on the other transformation technolo-

gies: Although manual method is a multi-step (ex-

port, data validation, load) process of transformation,

it was much faster than Pentaho or Talend. While its

counterparts were extremely slow in some cases and

transformed only 5,80,000 records in others, manual

method was able to complete the jobs with 2 mil-

lion records. A big drawback was that in some cases

Neo4j became unresponsive.

7 CONCLUSION AND FUTURE

WORK

The main aim of our work was to compare these tech-

nologies in detail using well deﬁned characteristics

and datasets.Though there exist other ways of trans-

forming data like using commercial tools, but the crux

of this study was to compare the open source tools

which are freely available resources for every user.

When tools like Pentaho and Talend did not deliver,

other alternatives like manual methods were deﬁned.

All the 74 transformation methodologies in this work

were implemented individually and evaluated. Fur-

ther as a reference, all the challenges faced during the

course of this work have been documented. Our vi-

sion for our work in this paper is that it could serve as

a guidelines to choose suitable transformation tech-

nologies for organizations looking to transform data,

migrate to other data stores, exchange data with other

organizations and the like. Although the number of

technologies could be limited with some factor, there

can always be more data stores that can be a part of

this comparative study. Every organisation has it’s

own needs and thereby follows different database so-

lutions. Adding more databases will help widen the

study and prove beneﬁcial for future users.

REFERENCES

Abramova, V., Bernardino, J., and Furtado, P. (2015). Sql or

nosql? performance and scalability evaluation. Inter-

national Journal of Business Process Integration and

Management, 7(4):314.

DBEngines (2008). Db-engines ranking. http://db-

engines.com/en/ranking. Accessed: 2016-01-25.

Doan, A., Halevy, A., and Ives, Z. G. (2012). Principles

of data integration. Morgan Kaufmann Publishers In,

Waltham, MA.

Kovacs, K. (2016). Cassandra vs mongodb vs couchdb

vs redis vs riak vs hbase vs couchbase vs hyper-

table vs elasticsearch vs accumulo vs voltdb vs

scalaris comparison: Software architect kristof ko-

vacs. http://kkovacs.eu/cassandra-vs-mongodb-vs-

couchdb-vs-redis. Accessed: 2016-01-20.

Labs, Y. (2016). Webscope datasets. http://webscope. sand-

box.yahoo.com/. accessed: 2016-02-22.

Netﬂix (2015). Case study: Netﬂix. http://www.datas

tax.com/resources/casestudies/netﬂix. Accessed:

2016-02-02.

Otto, B., Juerjens, J., Schon, J., Auer, S., Menz, N., Wenzel,

S., and Cirullies, J. (2016). Industrial data space - dig-

ital sovereignity over data. Fraunhofer-Gesellschaft

zur Foerderung der angewandten Forschung e.V.

Schoenberger, V. M., Cukier, K., and Schonberger, V. M.

(2013). Big data: A revolution that will transform

how we live, work, and think. Eamon Dolan/Houghton

Mifﬂin Harcourt, Boston.

Thusoo, A., Anthony, S., Jain, N., Murthy, R., Shao, Z.,

Borthakur, D., Sharma, J. S., and Liu, H. (2010). Data

warehousing and analytics infrastructure at facebook.

SIGMOD’10, 978-1-4503-0032-2(10):06.

Tudorica, B. G. and Bucur, C. (n.d). A comparison between

several nosql databases with comments and notes.

White, T. and Cutting, D. (2009). Hadoop: The deﬁnitive

guide. O’Reilly Media, Inc, USA, United States.

Xchange, I. D. (2015). Industrial communications, indus-

trial it, opc, proﬁbus - industrial data xchange (idx).

http://www.idxonline.com/Default.aspx?tabid=188.

accessed: 2015-12-04.

DATA 2017 - 6th International Conference on Data Science, Technology and Applications

248