4.1 Security
Security in cloud computing and big data is a current
and critical research topic (Popović & Hocenski,
2015). It becomes a concern for corporations
when they consider uploading data onto the cloud.
Questions such as who really owns the data,
where the data is, who has access to it and what kind
of permissions they have are hard to answer.
Corporations that are planning to do business with a
cloud provider should be aware of these issues and
ask the following questions:
a) Who is the Real Owner of the Data and Who has
Access to it?
The cloud provider’s clients pay for a service and
upload their data onto the cloud. However, to
which of the two stakeholders does the data really
belong? Moreover, can the provider use the
client’s data? What level of access does it have, and
for what purposes can it use the data? Can the cloud
provider benefit from that data?
In practice, the IT teams responsible for maintaining
the client’s data must have access to the data clusters.
Therefore, it is in the client’s best interest to grant
only restricted access, so that data access is
minimized and only authorized personnel
access the data, and only for a valid reason.
These questions seem easy to answer, but
they should be clarified before contracting a
service. Most security issues come from
inside organizations, so it is reasonable for
companies to analyse all data access policies before
signing a contract with a cloud provider.
b) Where is the Data?
Sensitive data that is legal in one
country may be illegal in another.
Therefore, for the sake of the client, there should
be an agreement on where the data is located, since
the data may be considered illegal in some countries
and lead to prosecution.
The answers to these questions are settled through
agreements (Service Level Agreements, SLAs).
However, these must be checked carefully in order to
fully understand the role of each stakeholder and
which policies the SLAs do and do not cover
concerning the organization’s data. This is typically
something that must be well negotiated.
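As an illustration only, the questions above (who may access the data, for what purpose, and where it may reside) can be written down as a machine-checkable policy. The sketch below is a minimal example under stated assumptions: the class `AccessPolicy`, its fields, and the role, purpose and region names are all hypothetical and are not taken from any SLA standard or provider API.

```python
from dataclasses import dataclass

# Hypothetical policy terms that a client might negotiate into an SLA.
@dataclass(frozen=True)
class AccessPolicy:
    allowed_roles: frozenset      # who may access the data
    allowed_purposes: frozenset   # for what stated purpose
    allowed_regions: frozenset    # where the data may be stored

def is_access_allowed(policy, role, purpose):
    """Grant access only when both the role and the purpose are covered."""
    return role in policy.allowed_roles and purpose in policy.allowed_purposes

def is_location_allowed(policy, region):
    """Check a storage region against the agreed list of locations."""
    return region in policy.allowed_regions

# Illustrative values: provider staff may touch the data for maintenance only,
# and the data must stay in a single agreed region.
policy = AccessPolicy(
    allowed_roles=frozenset({"provider_ops"}),
    allowed_purposes=frozenset({"maintenance"}),
    allowed_regions=frozenset({"eu-west"}),
)
```

Under this policy, a request by `provider_ops` for `maintenance` is accepted, while the same role asking for `analytics` is rejected, which mirrors the idea of granting only restricted access for a valid reason.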
Concerning the limiting of data access, Tu et al.
(2013) and Popa et al. (2011) proposed
effective ways to encrypt data and run analytical
queries over the encrypted data. This way, data access
largely ceases to be a problem, since both data and
queries are encrypted. Nevertheless, encryption comes
at a cost, which often means higher query processing
times.
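The general idea of answering a query without exposing plaintext to the provider can be sketched as follows. This is not the scheme of Tu et al. (2013) or Popa et al. (2011); it is a much simpler stand-in that uses deterministic keyed tokens (HMAC-SHA256 from the Python standard library) so that a server can match equal values without seeing them. The key, the sample values and the function names are assumptions made for the example. Unlike real encryption, the tokens cannot be decrypted, and deterministic tokens reveal which rows share a value.

```python
import hashlib
import hmac

def equality_token(key: bytes, value: str) -> str:
    """Deterministic keyed token: equal plaintexts give equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Client side: the key stays with the client (hypothetical demo key).
client_key = b"demo-key-kept-by-the-client"
departments = ["cardiology", "oncology", "cardiology"]

# What the provider stores: row ids and tokens, never the plaintext values.
stored_rows = [
    (row_id, equality_token(client_key, dept))
    for row_id, dept in enumerate(departments)
]

def server_select_equal(rows, query_token):
    """Server side: match tokens only; the server never sees the plaintext."""
    return [
        row_id for row_id, token in rows
        if hmac.compare_digest(token, query_token)
    ]

# The client sends a token instead of the plaintext predicate value.
matches = server_select_equal(
    stored_rows, equality_token(client_key, "cardiology")
)
```

Here `matches` holds the ids of the rows whose hidden value equals the queried one. The extra keyed hash computed for every stored value and every query is a small instance of the processing overhead mentioned above.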
4.2 Privacy
The harvesting of data and the use of analytical tools
to mine information raise several privacy concerns.
Ensuring data security and protecting privacy have
become extremely difficult as information is spread
and replicated around the globe. Analytics often mine
users’ sensitive information such as their medical
records, energy consumption, online activity,
supermarket records, etc. This information is exposed
to scrutiny, raising concerns about profiling,
discrimination, exclusion and loss of control (Tene
and Polonetsky, 2012). Traditionally, organizations
used various methods of de-identification
(anonymization or encryption of data) to distance data
from real identities. However, in recent years it has
been shown that even when data is anonymized, it can
still be re-identified and attributed to specific
individuals (Tene and Polonetsky, 2012). One way to
address this problem would be to treat all data as
personally identifiable and subject to a regulatory
framework. However, doing so might discourage
organizations from using de-identification methods and,
therefore, increase the privacy and security risks of
accessing data.
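One common way to reason about re-identification risk, brought in here purely as an illustration and not taken from the cited work, is to count how many records share the same combination of indirect attributes (quasi-identifiers). If some combination occurs only once, that record can be singled out even though names were removed. The sketch below assumes made-up records and hypothetical field names.

```python
from collections import Counter

def smallest_group_size(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return min(groups.values())

# Made-up, already "anonymized" records: names removed, indirect attributes kept.
rows = [
    {"zip": "10001", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "10001", "birth_year": 1980, "sex": "F", "diagnosis": "flu"},
    {"zip": "10002", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
]

# A result of 1 means at least one record is unique on these attributes.
k = smallest_group_size(rows, ["zip", "birth_year", "sex"])
```

In this example `k` is 1, because the third record is the only one with its combination of zip code, birth year and sex, so anyone who knows those three facts about a person can link that person to the diagnosis.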
Privacy and data protection laws are premised on
individual control over information and on principles
such as data minimization and purpose limitation.
Nevertheless, it is not clear that minimizing
information collection is always a practical approach
to privacy. Nowadays, approaches to privacy in data
processing seem to be based on user consent
and on the data that individuals deliberately provide.
Privacy is undoubtedly an issue that needs further
work, as systems store huge quantities of
personal information every day.
4.3 Heterogeneity
Big data concerns large volumes of data but also
different velocities (i.e., data arrives at different rates
depending on its source output rate and network
latency) and great variety. The latter comprises
very large and heterogeneous volumes of data coming
from several autonomous sources. Variety is one of
the “major aspects of big data characterization”
(Majhi and Shial, 2015), which is triggered by the
belief that storing all kinds of data may be beneficial
to both science and business.
Data arrives at big data DBMSs at different
velocities and in different formats from various
sources. This is because different information
collectors prefer their own schemata or protocols for
data recording, and the nature of different applications
also results in diverse
data representations (Wu et al., 2014). Dealing with