Data Preparation for Tourist Data Big Data Warehousing

Nunziato Cassavia, Pietro Dicosta, Elio Masciari and Domenico Saccà (2,3)

(1) ICAR-CNR, Rende, Italy

(2) DIMES UNICAL, Rende, Italy

(3) Centro di Competenza ICT-SUD, Rende, Italy

Keywords:

ETL for Big Data, NoSQL and Hive.

Abstract:

The pervasive diffusion of new-generation devices such as smartphones and tablets, along with the widespread use of social networks, generates massive data flows containing heterogeneous information produced at different rates and in different formats. These data are referred to as Big Data and call for new storage and analysis approaches. In this paper we describe a system for dealing with massive tourism flows, which we used to analyze tourist behavior in Italy. We defined a framework that exploits a NoSQL approach for data management and MapReduce for improving the analysis of data gathered from different sources.

1 INTRODUCTION

Due to the massive use of new software and hardware tools such as social networks, smartphones and tablets, people leave digital traces of their everyday activities. In this respect, tourists generate huge amounts of valuable data that need to be properly processed (Agrawal et al., 2011).

These heterogeneous, stream-based and complex data are currently referred to as Big Data, and they are receiving a great deal of attention because the above-mentioned features make their management a challenging task when the goal is to create value from data (WWW1, 2008; WWW2, 2010; WWW3, 2011; Agrawal et al., 2012; Lohr, 2012; Manyika et al., 2011; Noguchi, 2011a; Noguchi, 2011b). Indeed, several problems must be addressed, starting from the data acquisition phase, in order to perform meaningful analysis (Labrinidis and Jagadish, 2012). In particular, data collected at high rates from different sources require us to make decisions, currently in an ad hoc manner, about what data to keep and what to discard, and how to store what we keep reliably with the right metadata (Agrawal et al., 2012). Moreover, many problems arise when choosing the proper pre-processing for the data to be analyzed. Indeed, we need to cope with the following aspects:

1. Structure: Data are often generated in an unstructured format (e.g., in sensor networks, data can be generated by heterogeneous sensors, possibly from different vendors);

2. Semantics: Data may refer to different concepts (e.g., in sensor networks, data can refer to different physical properties that are observed for different purposes);

3. Integration: The value of data increases considerably when target data sources can be linked with other data sources; thus data integration is a crucial task in the data value chain.

In this paper, we describe our approach to Big Data analysis in the tourist data scenario. We aim to define and implement models, processes and tools for the sustainable development of an intelligent territory through the exploitation of its cultural heritage and environmental resources and through the promotion and marketing of its tourist offer. The main challenge is to organize and model data so as to improve the later linkage, querying, retrieval and analysis of previously created data. In particular, data analysis can become a bottleneck in many applications, both because the underlying algorithms lack scalability and because of the complexity of the data to be analyzed. Finally, the presentation of the results and their interpretation by non-technical domain experts is crucial for extracting actionable knowledge.

In more detail, our goal is to define and develop an integrated system of novel services and applications for the creation, certification, organization, monitoring and promotion of tourism, together with a real-time platform to support Travel Mobility. In this paper we focus on the effective analysis of tourist data by means of On-Line Analytical Processing (OLAP) tools designed to support the analysis of Big Data gathered from several sources. In particular, we define an end-to-end framework that assists decision makers from the pre-processing phase (which can prove very hard for heterogeneous sources) to the analysis steps. To better explain the features of our framework, we first describe the basic tools devoted to Big Data management and our effort to integrate them properly.

Cassavia N., Dicosta P., Masciari E. and Saccà D.

Data Preparation for Tourist Data Big Data Warehousing.

DOI: 10.5220/0005144004190426

In Proceedings of 3rd International Conference on Data Management Technologies and Applications (KomIS-2014), pages 419-426

ISBN: 978-989-758-035-2

© 2014 SCITEPRESS (Science and Technology Publications, Lda.)

2 BIG DATA: DATA STORAGE AND ANALYSIS FEATURES

Due to advances in data gathering and storage and to the availability of new data sources such as social networks, the volume of data collected by public organizations and private companies is growing rapidly. The number of users interested in accessing and analyzing these data is growing too. As a consequence, data management systems are used intensively to answer an increasing number of queries that solve the complex analysis tasks needed to assist decision makers in crucial business processes. As a result, activities such as data analysis and business reporting call for ever-increasing resources. Only a few years ago the largest Data Warehouses were about 100 Terabytes in size, while nowadays Data Warehouses of Petabytes are frequently built. Therefore, better, faster and more effective techniques are needed to deal with this huge amount of data.

Moreover, the raw data collected in Big Data scenarios come in several formats. In many practical scenarios, these formats are very different from the simple numerical and textual information usually stored in Data Warehouses. Thus, Big Data cannot be analyzed with common SQL-based techniques.

A crucial challenge posed by Big Data is a paradigm shift in how organizations behave with respect to their data assets. In more detail, data specialists must rethink:

• The data acquisition policies;

• The data analysis techniques best suited to the data being gathered;

• The impact of the analysis on business strategies.

To better understand the importance of the above-mentioned issues, we mention here some key applications that are relevant to almost all real-life scenarios:

1. Efficient information search, ranking, ad tracking;

2. Geo-referenced analysis;

3. Causal factor discovery;

4. Social Customer Relationship Management.

All these requirements call for Business Intelligence (BI) systems that provide innovative solutions to complex analysis tasks. In this respect, decision makers need quick access to information that is as complete as possible in order to make accurate and fast decisions in a continuously changing environment. Thus, the challenge is to ensure satisfactory efficiency when querying huge Data Warehouses, which are often built on top of relational structures that store data in a row-oriented manner. Indeed, the relational model is flexible and is tailored to support both transactional and analytical processing. However, as the size and complexity of Data Warehouses increase, an efficient alternative to the row-oriented approach is needed. To this end, a valid approach is to store data in a column-oriented way. The rest of this section describes the main components of a Big Data infrastructure.

2.1 Column Oriented DBMS

In the row-oriented approach, the data storage layer contains records (i.e., rows), while in a column-oriented system it contains columns (i.e., all the values of an attribute across rows). The widespread use of the relational approach is mainly due to its flexibility in representing almost any kind of data. Indeed, users are able to access and manipulate data without being involved in any technical aspect of data storage and access. This is a simple model, but it is not particularly suitable for data repositories used by analytical applications that deal with huge amounts of data. Indeed, row-oriented databases are not adequate for complex analysis of massive datasets because they are designed for transactional processing. This approach is therefore not suitable for analytical systems (for large-scale processing), because a lot of read operations are executed in order to access a small subset of attributes in a big volume of data. In fact, such queries are typically answered by scanning all the database records while processing only a few fields of each. On the contrary, in a column-oriented database all instances of a single data element, such as an account number, are stored together so that they can be accessed sequentially. Therefore, aggregate operations such as MIN, MAX, SUM, COUNT and AVG can be performed very quickly.
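The difference between the two layouts can be illustrated with a minimal Python sketch. The records and field names below are hypothetical and are not taken from the system described in this paper; real column stores also rely on compression and on-disk layouts that are ignored here.

```python
# Hypothetical records; field names are illustrative only.
rows = [
    {"account": 1001, "city": "Rende", "amount": 120.0},
    {"account": 1002, "city": "Cosenza", "amount": 80.5},
    {"account": 1003, "city": "Rende", "amount": 42.0},
]


def to_columns(records):
    """Pivot a row-oriented list of records into one list per attribute."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented aggregate: every full record is visited to read one field.
row_sum = sum(record["amount"] for record in rows)

# Column-oriented aggregate: only the contiguous "amount" values are read.
col_sum = sum(columns["amount"])
col_max = max(columns["amount"])
```

Both sums give the same result; the point of the sketch is only that the column layout touches a single list, while the row layout has to traverse every record.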

Recently, NoSQL (Not Only SQL) approaches have been used to solve the efficiency problems discussed above. The rationale for developing and using NoSQL data stores can be summarized as follows:

• Avoidance of Unneeded Complexity: Relational databases provide a variety of features, which however must obey strict data consistency constraints. This rich feature set and the ACID properties implemented by RDBMSs are mandatory in some settings, while in other application scenarios they can be disregarded;

• High Throughput: NoSQL databases provide a significantly higher data throughput than traditional RDBMSs;

• Horizontal Scalability and Possible Running on Commodity Hardware: In contrast to relational database management systems, most NoSQL databases are designed to scale well horizontally and do not rely on specific hardware features;

• Avoidance of Expensive Object-Relational Mapping: Most NoSQL databases are designed to store data structures that are either simple or closer to those of object-oriented programming languages than relational data structures are. They therefore make expensive object-relational mapping unnecessary.

Non-relational data stores are usually grouped according to their data model:

1. Key-value Stores: These systems store values along with an index based on a key defined by users;

2. Document Stores: These systems store documents. A “document” can contain values that are nested documents or lists of values as well as scalar values. Attribute names are dynamically defined for each document at runtime. Documents are indexed and a simple query mechanism is provided through JavaScript;

3. Column Family Stores: These systems store extensible records that can be partitioned vertically and horizontally (possibly simultaneously on the same table) across nodes, but generally do not support secondary indexes. Rows are split across nodes through sharding on the primary key, while the columns of a table are distributed over multiple nodes by using so-called “column groups”;

4. Graph Stores: These systems provide efficient storage and querying of a graph by exploiting references among nodes. Like relational DBMSs, they usually support ACID transactions.

Systems belonging to categories 1, 2 and 3 achieve scalability by reading (potentially) out-of-date replicas, within the constraints fixed by the CAP theorem. Indeed, the CAP theorem states that a system can exhibit only two of the following three properties: Consistency, Availability and Partition-tolerance. A distributed system is Consistent if the update operations performed by a writer can be seen by all users of the shared data source. Availability refers to the system design: a system is available if it is designed and implemented in a way that allows it to keep operating (i.e., to allow read and write operations) even if some node in the cluster fails or some hardware or software components are down for maintenance. Finally, Partition-tolerance is the ability of the system to keep working when the network is partitioned. The latter should not be confused with the ability of a system to cope with the dynamic addition and removal of nodes. Because a satisfactory trade-off is hard to achieve, NoSQL systems usually relax consistency constraints.
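The effect of relaxed consistency on reads can be sketched with a toy model. The class and key names below are hypothetical, and the explicit `sync()` call merely stands in for asynchronous replication; no real NoSQL system is implemented this simply.

```python
class Replica:
    """One copy of a key-value store held by a single node."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Writes go to the primary; replicas are updated only when sync() runs."""

    def __init__(self, n_replicas=2):
        self.primary = Replica()
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # updates not yet propagated to the replicas

    def write(self, key, value):
        self.primary.data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index=0):
        # Reads never wait for the primary, so they stay available
        # but may return an out-of-date value.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("visitors:Rende", 10)
store.sync()
store.write("visitors:Rende", 11)
stale = store.read("visitors:Rende")  # replica has not seen the update yet
store.sync()
fresh = store.read("visitors:Rende")
```

Here `stale` still holds the old value 10, while `fresh` holds 11 once the update has been propagated: the store answers every read, at the price of temporarily inconsistent replicas.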

2.2 HBase

HBase is an open-source, non-relational, distributed database modeled after Google's BigTable and written in Java. In more detail, it is an Apache project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data. The main features of HBase are:

• Good compression performance;

• In-memory execution of operations;

• Bloom filters on a per-column basis, as in the BigTable specification.

Tables in HBase can serve as the input and output of MapReduce jobs running on Hadoop, and may be accessed through the Java API as well as through REST, Avro or Thrift gateway APIs. It is worth noting that HBase is not a column-oriented database in the typical RDBMS sense, but it uses an on-disk column storage format. Rows are composed of columns, and those, in turn, are grouped into column families in order to build semantic or topical boundaries between the data, as shown in Fig. 1. Furthermore, this data organization makes it possible to improve compression or to apply specific in-memory operations.

Columns are referenced as a family followed by a qualifier, which is represented as an array of bytes. Each column value (or cell) is either implicitly timestamped by the system or explicitly timestamped by the user. Rows in a table are sorted by a row key, and this key provides access to the information contained in the row. Columns, in turn, are grouped into column families and can be updated at runtime (by specifying the column family through a prefix). This model turns out to be efficient and scalable, and thus well suited for Big Data management: in this context a row-based approach is inefficient, and simple column-based approaches are efficient but not scalable, while column-family-based approaches achieve both efficiency and scalability. Fig. 2 summarizes the features of the approaches and the differences among them.

Figure 1: HBase storage organization.

Figure 2: Features of several storage models.

At the physical level, all columns in a column family are stored together in the same low-level storage file, called an HFile. In addition to the notions of column, table and row, HBase uses the so-called “region”. HBase tables are automatically partitioned horizontally into regions that are distributed across the cluster. Each region consists of a subset of the rows of a table; in this way, a table that is too large to be contained in one server can be distributed over different servers in the cluster.
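The data model described in this section (row keys, `family:qualifier` columns, timestamped cells and regions defined by row-key ranges) can be mimicked with a small in-memory Python sketch. `TinyTable`, `region_for`, the row keys and the split key are hypothetical names chosen for illustration; this is not the HBase client API.

```python
from bisect import bisect_right


class TinyTable:
    """Toy model: row key -> 'family:qualifier' -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, column, value, timestamp):
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = self.rows.setdefault(row_key, {}).setdefault(column, {})
        cell[timestamp] = value

    def get(self, row_key, column):
        """Return the most recent version of a cell, or None."""
        versions = self.rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start, stop):
        """Yield row keys in sorted order within [start, stop)."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key


def region_for(row_key, split_keys):
    """Index of the region holding row_key, given sorted split keys."""
    return bisect_right(split_keys, row_key)


table = TinyTable(families=["info", "stats"])
table.put("hotel#0001", "info:name", "Albergo A", timestamp=1)
table.put("hotel#0001", "info:name", "Albergo B", timestamp=2)
table.put("hotel#0450", "stats:visits", 37, timestamp=1)

latest = table.get("hotel#0001", "info:name")
keys = list(table.scan("hotel#0000", "hotel#0100"))
regions = [region_for(k, ["hotel#0300"]) for k in ("hotel#0001", "hotel#0450")]
```

With a single split key the table has two regions, and the two rows fall into different ones, which is how a table too large for one server would be spread over the cluster.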

2.3 The MapReduce Framework

The MapReduce framework is a programming model for processing and generating large data sets. It basically works by implementing two key functions:

1. a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and

2. a reduce function that merges all intermediate values associated with the same intermediate key.

Many real-world tasks can be solved using this model, and algorithms that are MapReduce-compliant can be automatically parallelized and executed on large computing clusters. In fact, the runtime system takes care of the details concerning input data partitioning, scheduling, handling of machine failures and management of the communication overhead (see https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html).

Indeed, a plethora of special-purpose computational tasks have recently been implemented to process large amounts of raw data, such as crawled documents and web request logs, to cite a few. These tasks make it possible to compute many kinds of derived data, e.g., inverted indices, representations of the graph structure of web documents, summaries of the number of pages crawled per host and the set of most frequent queries in a given day.
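As an illustration, one of the derived data sets mentioned above, the number of pages crawled per host, can be written as a map function and a reduce function. The sketch below runs both in a single Python process; `run_mapreduce` is a hypothetical stand-in for the grouping that a real runtime such as Hadoop performs across machines, and the URLs are invented.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_page(url, _content):
    """Map: emit (host, 1) for every crawled page."""
    yield urlparse(url).netloc, 1


def reduce_host(host, counts):
    """Reduce: merge all values that share the same intermediate key."""
    return host, sum(counts)


def run_mapreduce(records, mapper, reducer):
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


crawled = [
    ("http://example.org/a", "..."),
    ("http://example.org/b", "..."),
    ("http://example.com/", "..."),
]
pages_per_host = run_mapreduce(crawled, map_page, reduce_host)
```

Because each call to `map_page` and each call to `reduce_host` depends only on its own arguments, the calls could be distributed over many nodes without changing the result.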

2.4 Data Analysis Tools

A crucial component of our system is the On-Line Analytical Processing (OLAP) module. Due to the features of Big Data, several requirements have to be taken into account. In order to guarantee the flexibility of our system, we chose the free community edition of the Pentaho BI suite. As mentioned above, a crucial activity is the Extraction, Transformation and Loading (ETL) of the raw data collected from several sources. In this respect, Pentaho provides a suite for performing this task, namely Pentaho Data Integration (PDI), also referred to as Kettle (Kettle Extraction, Transformation, Transportation and Loading Environment). We describe this solution in detail here, as it has been profitably exploited in our prototype.

Each PDI task is based on the following building blocks:

1. Step. It represents both input and output data. In more detail, every step performs a specific task on its input data stream, such as data manipulation, filtering or pattern matching;

2. Transformation. It is a wrapper for steps, devoted to operations such as data normalization that prepare data for subsequent analysis;

3. Job. It is a sequence of transformations that are run sequentially, according to the user's setup.
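The relationship among steps, transformations and jobs can be sketched with plain Python functions. This is only an analogy under simplifying assumptions: the function names and the sample rows are hypothetical, and none of this is the Kettle API.

```python
def strip_names(rows):
    """Step: trim whitespace in the 'city' field."""
    for row in rows:
        yield {**row, "city": row["city"].strip()}


def drop_empty(rows):
    """Step: filter out rows without a city."""
    for row in rows:
        if row["city"]:
            yield row


def upper_city(rows):
    """Step: normalize city names to upper case."""
    for row in rows:
        yield {**row, "city": row["city"].upper()}


def transformation(*steps):
    """Wrap steps so that each one consumes the previous step's stream."""
    def run(rows):
        for step in steps:
            rows = step(rows)
        return list(rows)
    return run


def job(*transformations):
    """Run transformations one after another, in the given order."""
    def run(rows):
        for transform in transformations:
            rows = transform(rows)
        return rows
    return run


clean = transformation(strip_names, drop_empty)
normalize = transformation(upper_city)
etl = job(clean, normalize)

result = etl([{"city": " Rende "}, {"city": "  "}, {"city": "Cosenza"}])
```

Inside a transformation the steps are chained as generators, so rows flow through them as a stream, whereas the job runs each transformation to completion before starting the next one.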

To help users specify the operations to be performed, Pentaho provides four additional tools for designing the ETL process according to user needs. In particular, it provides:

• Spoon: the main tool for ETL workflow definition. It provides a graphical user interface that allows simple editing of Kettle operations. Its main features include extracting data from and storing data to a wide range of data sources (databases, spreadsheets, text files, etc.), and manipulating data through tasks with built-in functionality or through custom tasks defined by the user in Java or JavaScript code. Moreover, it performs the data aggregations that feed the data warehouse for later use by Mondrian;

• Pan: the tool that allows users to execute transformations previously designed with Spoon. Usually, operations are scheduled in batch mode and executed according to a user-defined plan;

• Kitchen: the tool that allows users to run, from the command line, the jobs defined with Spoon;

• Carte: an HTTP server for the remote execution of transformations and jobs. It runs on the cluster and distributes the load over the available resources.

As mentioned above, Mondrian is the OLAP server provided by the Pentaho BI suite. It is based on ROLAP technology: it translates MDX queries into SQL on the basis of a user-defined multidimensional model, and in our setting the resulting SQL is executed through Hive. It accesses the information stored in the data repository and performs aggregation operations whose results are then cached. It also makes extensive use of materialized views to optimize response time.

The BI suite also contains Schema Workbench (SW), a graphical tool for creating data cubes. The schema file is generated through a graphical user interface: Schema Workbench produces as output an XML file containing the definition of the cube structure for the OLAP analysis that will be performed by Mondrian. Schema Workbench also provides an integrated environment for validating the specified schema against the source data specified during configuration.

Finally, Pentaho provides two options for displaying data: JPivot, the built-in tool, or Saiku, a plugin that offers several advantages over JPivot, as it supports simple drag and drop and deals with HTML5 and CSS in a flexible way. In the following section we briefly describe the main features of Hive, i.e., the query executor that we used in our prototype.

2.4.1 Hive

Apache Hive is a Data Warehouse infrastructure built on top of Hadoop that facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to map a data structure onto the storage layer, so that data can be queried in a SQL-like language called HiveQL; HiveQL queries are automatically translated into MapReduce jobs executed on Hadoop. A Hadoop cluster typically holds a collection of heterogeneous data, coming from multiple sources and in different formats. Hive allows users to explore these data, analyze them and turn them into business insight.

For each database, tables are serialized and each table is stored in a Hadoop Distributed File System (HDFS) directory. Data storage is governed by two parameters: row format and file format. In more detail, the row format determines how rows, and the fields of a row, are stored. This task is carried out by a SerDe (Serializer-Deserializer). Deserialization is performed when querying a table: the SerDe deserializes a row of data from the bytes contained in the file into the objects used internally by Hive to operate on that row. Conversely, when performing an INSERT, serialization is required: the SerDe serializes Hive's internal representation of a row into the bytes that are written to the output file. The file format dictates the container format for the fields in a row. The simplest format is a plain-text file, but row-oriented and column-oriented binary formats are available, too.

In order to speed up queries, Hive provides indexes, including bitmap indexes, which help reduce response times. Finally, Hive achieves good scalability thanks to its storage model. The overall architecture is reported in Fig. 3.

Figure 3: Hive Architecture.
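The two directions in which a SerDe works, as described in this section, can be illustrated with a minimal Python sketch for a delimited plain-text row format. The schema, the field names and the choice of the Ctrl-A character as field delimiter are assumptions made for this example; the code is an analogy and does not use Hive's SerDe interface.

```python
FIELD_DELIMITER = "\x01"  # assumption: Ctrl-A separated text rows

# Hypothetical table schema: (field name, Python type used in memory).
SCHEMA = [("city", str), ("year", int), ("arrivals", int)]


def serialize(row):
    """In-memory row -> bytes written to the output file (INSERT path)."""
    fields = [str(row[name]) for name, _ in SCHEMA]
    return FIELD_DELIMITER.join(fields).encode("utf-8")


def deserialize(raw):
    """Bytes read from the file -> typed in-memory row (query path)."""
    fields = raw.decode("utf-8").split(FIELD_DELIMITER)
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}


row = {"city": "Rende", "year": 2014, "arrivals": 1250}
raw = serialize(row)
restored = deserialize(raw)
```

The stored bytes carry no type information of their own: the types come from the schema at read time, which is why the same file can be reinterpreted by changing the table definition rather than the data.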

3 SYSTEM ARCHITECTURE

In this section we present the architecture of our analysis engine. It exploits the column-oriented storage model, the scalability of NoSQL systems and the Mondrian OLAP server. In more detail, the system depicted in Fig. 4 is composed of three modules:

1. Mondrian as OLAP Server

DataPreparationforTouristDataBigDataWarehousing

423

2. Hive as Query Executor on Hadoop MapReduce

3. HBase as NoSQL Data Storage

Figure 4: Big Data Analysis Module architecture.

We chose to combine Mondrian and Hive in order to guarantee the distributed processing of queries across the nodes of our cluster, which is composed of 20 nodes with sixteen cores each. We also take advantage of SQL as a common language for both sub-systems.

Moreover, the combined use of HBase and Hive makes it possible to overcome some speed limitations of Hive, thus accelerating data access and querying. This is obtained by exploiting the main features of HBase, such as vertical partitioning into Column Families, horizontal partitioning into Regions, replication, real-time access and indexing. We point out that the integration of these systems is far from trivial, as we need to take into account the different features of each module.

As regards the OLAP analysis, Mondrian issues SQL queries to Hive, which translates them into MapReduce jobs. The jobs then access the data stored in HBase through a JDBC driver, which also maps SQL commands to HBase commands (e.g., get, put, scan, delete).

In order to guarantee scalability, Mondrian offers two different solutions. The first solution is based on so-called aggregate tables, which contain pre-aggregated data. As an example, suppose that the system stores information about sales at an hourly granularity, but managers are interested in analyses at daily and weekly granularity. We can create aggregate tables storing the information at those granularity levels. This reduces the size of the data returned when querying the repository.
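The sales example can be made concrete with a short Python sketch that rolls hourly facts up to daily and weekly aggregates. The sample figures are invented, and the `aggregate` helper is a hypothetical illustration of the idea rather than the way Mondrian builds or uses its aggregate tables.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical fact rows at hourly granularity: (timestamp, amount).
hourly_sales = [
    (datetime(2014, 8, 1, 9), 100.0),
    (datetime(2014, 8, 1, 17), 50.0),
    (datetime(2014, 8, 2, 10), 70.0),
    (datetime(2014, 8, 4, 11), 30.0),
]


def aggregate(rows, key_fn):
    """Pre-aggregate amounts under the coarser key returned by key_fn."""
    totals = defaultdict(float)
    for timestamp, amount in rows:
        totals[key_fn(timestamp)] += amount
    return dict(totals)


# Daily aggregate table, keyed by ISO date string.
daily = aggregate(hourly_sales, lambda ts: ts.date().isoformat())

# Weekly aggregate table, keyed by ISO calendar (year, week) pairs.
weekly = aggregate(hourly_sales, lambda ts: tuple(ts.isocalendar())[:2])
```

A query at weekly granularity can then read two pre-computed rows instead of scanning all hourly facts, which is the source of the reduction in data volume mentioned above.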

A second approach is based on caching. Indeed, Mondrian can cache schemas, members and segments (the objects used for aggregating data). This means that data are materialized in the cache as they are queried.
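The idea of materializing results as they are queried can be illustrated with simple memoization in Python. This is only an analogy: `FACTS`, `arrivals` and the scan counter are hypothetical, and the sketch says nothing about how Mondrian's own cache is organized.

```python
from functools import lru_cache

# Hypothetical fact data: (city, year, number of arrivals).
FACTS = (("Rende", 2013, 40), ("Rende", 2014, 55), ("Cosenza", 2014, 20))

scans = {"count": 0}  # counts how often the fact data is actually scanned


@lru_cache(maxsize=128)
def arrivals(city, year):
    """Aggregate one cell; repeated requests are answered from the cache."""
    scans["count"] += 1
    return sum(n for c, y, n in FACTS if c == city and y == year)


first = arrivals("Rende", 2014)
second = arrivals("Rende", 2014)  # served from the cache, no new scan
```

After the two calls the fact data has been scanned only once, since the second request for the same cell is answered from the cached result.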

The proposed architecture provides a complete tool for Big Data analysis that allows users to specify queries in a simple way, without having to deal with the complexity of data acquisition and cleaning, which are performed in the background under the guidance of a domain expert, as shown in Fig. 5.

Figure 5: Our simpliﬁed workﬂow.

3.1 Intelligent Data Filtering Example

Our main contribution is the development of an intelligent filtering approach to Big Data. Indeed, the availability of such a tool is crucial in many application scenarios in order to suggest valuable information to users (IBM et al., 2011; Herodotou et al., 2011). We exploit the MapReduce approach in order to explore data in a smarter and faster way (Jiang et al.). Instead of describing the technical details of the query processing, we show an illustrative example. Consider a book store selling tourist books, whose information system is distributed over several nodes. For each book, information about title, publisher, support type and authors is stored. Managers would like to understand which dimensions are relevant for analyzing sales performance. In this respect, they initially identify the dimensions Type, Publisher and Usb as relevant. Based on this expert input, we perform a map task, as reported in Fig. 6, and then a reduce step, as shown in Fig. 7, with the goal of refining the initial information.

Figure 6: Map.

DATA2014-3rdInternationalConferenceonDataManagementTechnologiesandApplications

424

As can be seen, the information gathered about the dimension counts suggests that the dimension Usb is useless for analysis purposes. This approach can be profitably exploited in all systems where the efficient analysis of data distributed over several nodes is mandatory for decision support.

Figure 7: Reduce.
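Since the content of Fig. 6 and Fig. 7 is not reproduced in the text, the following Python sketch shows only one plausible reading of the example. It assumes that the map step emits one pair per candidate dimension value, that the reduce step counts the values within each dimension, and that a dimension with a single distinct value is considered useless. The book records, the field values and this criterion are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical book records; values are invented for illustration.
books = [
    {"Type": "guide", "Publisher": "P1", "Usb": "no"},
    {"Type": "map", "Publisher": "P2", "Usb": "no"},
    {"Type": "guide", "Publisher": "P2", "Usb": "no"},
]
candidate_dimensions = ["Type", "Publisher", "Usb"]


def map_record(record):
    """Map: emit one ((dimension, value), 1) pair per candidate dimension."""
    for dim in candidate_dimensions:
        yield (dim, record[dim]), 1


def reduce_counts(pairs):
    """Reduce: count occurrences of each value within each dimension."""
    counts = defaultdict(lambda: defaultdict(int))
    for (dim, value), one in pairs:
        counts[dim][value] += one
    return {dim: dict(values) for dim, values in counts.items()}


pairs = [pair for record in books for pair in map_record(record)]
value_counts = reduce_counts(pairs)

# Assumed criterion: a dimension with a single distinct value cannot
# discriminate between records, so it is flagged as uninformative.
useless = [dim for dim, values in value_counts.items() if len(values) == 1]
```

Under these assumptions only Usb is flagged, because every record carries the same value for it, while Type and Publisher each split the records into more than one group.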

3.2 Preliminary Analysis

Preliminary tests performed on the early data available are quite encouraging; in particular, the efficiency achieved with our approach is satisfactory. In Fig. 8 we show the execution report of two aggregate queries issued on a test set of 240M tuples. The first query, which counts tuples without applying any filter, took 45 seconds, while the second one, which computes a sum with a group by, took only 26 seconds.

Figure 8: Query execution report.
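To put the reported timings in perspective, they can be converted into approximate throughput figures. The calculation below uses only the numbers given in the text; the rate for the second query additionally assumes that it reads the whole 240M-tuple test set, which the text does not state explicitly.

```python
TUPLES = 240_000_000
count_seconds = 45  # unfiltered COUNT query
group_seconds = 26  # SUM with GROUP BY query

# Tuples processed per second by the counting query.
count_rate = TUPLES / count_seconds

# Assumption: the second query also reads all 240M tuples.
group_rate = TUPLES / group_seconds
```

This corresponds to roughly 5.3 million tuples per second for the counting query and, under the stated assumption, roughly 9.2 million tuples per second for the grouped sum.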

Finally, given the overall goal of our project, we provide results in a user-friendly way, i.e., users may specify dimensions by simple drag and drop and navigate through the returned results, as shown in Fig. 9.

Figure 9: A query issued on our system.

4 CONCLUSION

In this paper we presented a framework for the end-to-end analysis of Big Data tourist information. We provided an effective tool to manage the whole Big Data life cycle in the target scenario, from data loading to data analysis. Preliminary results support the validity of our approach both in terms of usability by non-expert users and in terms of query processing efficiency. As future work, we plan to develop a novel data filtering technique intended to assist users in the discovery of new dimensions induced by the data, which is expected to considerably improve the analysis task.

ACKNOWLEDGEMENTS

This work was supported by MIUR Project PON04a2_D DICET INMOTO Organization of Cultural Heritage for Smart Tourism and REal Time Accessibility (OR.C.HE.S.T.R.A.).

REFERENCES

Agrawal, D., Bernstein, P., Bertino, E., Davidson, S., Dayal, U., Franklin, M., Gehrke, J., Haas, L., Halevy, A., Han, J., Jagadish, H. V., Labrinidis, A., Madden, S., Papakonstantinou, Y., Patel, J. M., Ramakrishnan, R., Ross, K., Shahabi, C., Suciu, D., Vaithyanathan, S., and Widom, J. (2012). Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States. http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf.

Agrawal, D., Das, S., and El Abbadi, A. (2011). Big data and cloud computing: Current state and future opportunities. In Proceedings of the 14th International Conference on Extending Database Technology, EDBT/ICDT ’11, pages 530–533, New York, NY, USA. ACM.

Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F. B., and Babu, S. (2011). Starfish: A self-tuning system for big data analytics. In CIDR, pages 261–272.

IBM, Zikopoulos, P., and Eaton, C. (2011). Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Osborne Media, 1st edition.

Jiang, D., Chin, B., Lei, O., and Wu, S. S. The performance of mapreduce: An in-depth study.

Labrinidis, A. and Jagadish, H. V. (2012). Challenges and opportunities with big data. PVLDB, 5(12):2032–2033.

Lohr, S. (2012). The age of big data. http://www.nytimes.com/2012/02/12/sunday-review/bigdatas-impact-in-the-world.html.

DataPreparationforTouristDataBigDataWarehousing

425

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., and Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.

Noguchi, Y. (2011a). Following digital breadcrumbs to big data gold. National Public Radio, http://www.npr.org/2011/11/29/142521910/the-digitalbreadcrumbs-that-lead-to-big-data.

Noguchi, Y. (2011b). The search for analysts to make sense of big data. National Public Radio, http://www.npr.org/2011/11/30/142893065/the-search-foranalysts-to-make-sense-of-big-data.

WWW1 (2008). Big data. Nature. http://www.nature.com/news/specials/bigdata/index.html.

WWW2 (2010). Data, data everywhere. The Economist. http://www.economist.com/node/15557443.

WWW3 (2011). Drowning in numbers: digital data will flood the planet and help us understand it better. The Economist. http://www.economist.com/blogs/dailychart/2011/11/bigdata-0.

DATA2014-3rdInternationalConferenceonDataManagementTechnologiesandApplications

426