Cloud Storage¹⁵. In OpenStack, the object storage
module is Swift (Rupprecht et al., 2017), on which
data analysis can be performed with the ArchaDIA
architecture.
4.3 Data Lake
Data lakes are centralized repositories of enterprise
data, including structured, semi-structured and
unstructured data. This data is usually kept in its
native format and stored on low-cost, high-performance
file systems such as HDFS or on object storage (Dixon,
2010). The purpose of a data lake differs from that of
a data warehouse (DW). In a DW, the data is processed
and structured for querying, and the structure is
defined before ingestion into the system, through ETL
routines. This technique is called schema-on-write; it
is not technically difficult, but it is
time-consuming.
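The schema-on-write approach can be illustrated with a minimal sketch. It assumes an in-memory SQLite table as a stand-in for the data warehouse; the table name `sales`, the sample records and the helper functions are hypothetical and are not part of ArchaDIA.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_ROWS = [
    {"id": "1", "customer": " Alice ", "amount": "19.90"},
    {"id": "2", "customer": "Bob", "amount": "5"},
]


def transform(row):
    """ETL transform step: coerce each field to the type the schema expects."""
    return (int(row["id"]), row["customer"].strip(), float(row["amount"]))


def load(rows):
    """Define the table first (schema-on-write), then load the transformed rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)", [transform(r) for r in rows]
    )
    conn.commit()
    return conn


conn = load(RAW_ROWS)
# Queries run against data that already conforms to the schema.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

The structure is fixed before any record is stored, so every change to the schema requires the ETL routine to be adapted and the data to be reloaded.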
In data lakes, the data is kept in its original format,
with little or no transformation, and the data structure
is defined when the data is read, a technique known as
schema-on-read. Users can quickly define and redefine
data schemas while reading the records. As a result,
the ETL runs from the data lake itself (Fang, 2015).
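A minimal sketch of schema-on-read is given below. It assumes that the raw records are stored as JSON lines; the field names, the two schemas and the function `read_with_schema` are hypothetical and only illustrate the technique.

```python
import json

# Raw records kept in their native format (JSON lines), untouched at ingestion.
RAW_LINES = [
    '{"id": "1", "customer": "Alice", "amount": "19.90", "channel": "web"}',
    '{"id": "2", "customer": "Bob", "amount": "5"}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> converter) while reading the raw records."""
    for line in lines:
        record = json.loads(line)
        # Fields missing from a record become None; fields that are not
        # part of the schema are ignored.
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


# Two different schemas applied to the same stored data, with no re-ingestion.
billing_schema = {"id": int, "amount": float}
marketing_schema = {"customer": str, "channel": str}

billing_view = list(read_with_schema(RAW_LINES, billing_schema))
marketing_view = list(read_with_schema(RAW_LINES, marketing_schema))
```

Redefining a schema only changes the dictionary passed to the reader; the stored records are never rewritten.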
Data lake provisioning and configuration are
performed by the private cloud platform, with the
OpenStack Swift module. Swift is integrated with Hadoop
and Spark in order to allow data analysis with the
main file formats: SequenceFiles, Avro¹⁶ and Parquet¹⁷
(Liu et al., 2014).
The main advantage of the data lake is its flexibility,
which is also a drawback: it makes the analysis very
complete, but also complex. Data lake users therefore
need to be highly specialized, such as data scientists
and developers. Adopting a data lake also carries
risks related to quality assurance, security, privacy
and data governance, which remain open questions.
4.4 NoSQL Databases
Not Only SQL (NoSQL) is the general name for a database
paradigm that does not follow relational algebra. In a
NoSQL database, the data is stored in its raw form and
the result is formatted during the read operation, a
feature called schema-on-read (Chang, 2015a).
NoSQL databases offer fast read and write access and
support large volumes of data and replication, so they
are suitable for big data systems. However, NoSQL
databases do not follow the same rules and standards
as relational databases. For example, there is no
native SQL support, and queries are typically run in
proprietary languages or through third-party tools.
¹⁵ https://cloud.google.com/storage/docs/
¹⁶ https://avro.apache.org/
¹⁷ https://parquet.apache.org/
In this respect, there are big differences between
relational and NoSQL modeling. While a relational
data model is normalized to avoid data redundancy,
NoSQL databases typically do not use normalization,
and data is often duplicated in several tables to
ensure maximum performance (Chebotko et al., 2015).
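The duplication of data across tables can be sketched as follows. The example uses plain Python dictionaries as a stand-in for a NoSQL store; the tables `orders_by_customer` and `orders_by_product` and the sample orders are hypothetical.

```python
from collections import defaultdict

# Two denormalized "tables", each shaped to answer exactly one query.
orders_by_customer = defaultdict(list)
orders_by_product = defaultdict(list)


def insert_order(order_id, customer, product, amount):
    """Write the same order into every table that serves a query."""
    row = {
        "order_id": order_id,
        "customer": customer,
        "product": product,
        "amount": amount,
    }
    # The row is copied, so each table holds its own duplicate of the data.
    orders_by_customer[customer].append(dict(row))
    orders_by_product[product].append(dict(row))


insert_order(1, "alice", "laptop", 900.0)
insert_order(2, "alice", "mouse", 25.0)
insert_order(3, "bob", "laptop", 950.0)

# Each read is a single key lookup; no join is needed.
alice_orders = orders_by_customer["alice"]
laptop_orders = orders_by_product["laptop"]
```

The cost of this design is paid at write time: every insert touches all tables, and the copies must be kept consistent by the application.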
4.5 API Management
The use of web APIs is becoming the standard for
web, mobile, cloud and big data applications (Tan
et al., 2016). APIs make it easy to exchange data and
are used to integrate businesses, make algorithms
available, connect people, and share information
between devices. This new business model, called the
API economy, enables companies to become true data
platforms, which simplifies the creation of new
services, products and business models (Gartner, 2018).
Web APIs are composed of independent services
in the form of reusable components, which can be
combined to create the data platform. For example, a
company can create a new service by using third-party
APIs, such as maps, machine learning, geolocation,
and payments. These services are usually based on
REST and JSON, which allows data and new features to
be shared with high performance. This is
the strategy adopted by major API providers and users
such as Netflix, Google, AWS and eBay.
In this context, it is extremely important that a
big data architecture provide technological support
for API management. In ArchaDIA, the Data Integration
Component is the technical solution for creating data
services that access NoSQL databases or the Hadoop
cluster. The API server runs permanently: its VMs are
never released, only resized during processing peaks.
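As an illustration of a data service of this kind, the sketch below exposes documents from a key-value store as JSON over a REST-style interface. It is not the implementation of the Data Integration Component: the store `SENSOR_STORE`, the route `/sensors/<id>` and the helper `call` are assumptions made for the example, which relies only on the standard WSGI interface of Python.

```python
import json

# Stand-in for a NoSQL key-value store; keys and values are invented.
SENSOR_STORE = {
    "s1": {"sensor": "s1", "temperature": 21.5},
    "s2": {"sensor": "s2", "temperature": 19.0},
}


def respond(start_response, status, payload):
    """Serialize the payload as JSON and send it with the given status."""
    body = json.dumps(payload).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


def data_service(environ, start_response):
    """WSGI app: GET /sensors/<id> returns the stored document as JSON."""
    parts = environ.get("PATH_INFO", "").strip("/").split("/")
    is_get = environ.get("REQUEST_METHOD") == "GET"
    if is_get and len(parts) == 2 and parts[0] == "sensors":
        doc = SENSOR_STORE.get(parts[1])
        if doc is not None:
            return respond(start_response, "200 OK", doc)
    return respond(start_response, "404 Not Found", {"error": "not found"})


def call(app, path):
    """Invoke the WSGI app in-process, without opening a network socket."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


status, document = call(data_service, "/sensors/s1")
```

The same application object could be served over HTTP with `wsgiref.simple_server.make_server` from the standard library; in a deployment such as the one described, the dictionary would be replaced by queries to the NoSQL database or the Hadoop cluster.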
5 ARCHITECTURE EVALUATION
The evaluation of the proposed architecture (ArchaDIA)
used a proof of concept (PoC), in which the usage
scenarios and the behavior of the system were
verified. In this way, it was possible to determine
the strengths and weaknesses of the project. After
defining the functionalities of the BDaaS, experiments
were conducted using techniques and tools to create
big data systems in order to find the most appropriate
combination.
CLOSER 2019 - 9th International Conference on Cloud Computing and Services Science