(ETL). This is confirmed by Intel Corporation, which states that data integration accounts for 80% of the development effort in Big Data projects (Intel, 2013).
The Extract task gathers data from external sources; the Transform task employs a variety of software tools and custom programs to manipulate the collected data, and cleaning is also addressed in this task to remove unnecessary data; finally, the Load task writes the data to permanent storage.
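As a minimal illustration of these three stages, the following Python sketch extracts records from a plain-text log, transforms them, and loads them into permanent storage; the file name, field names, and SQLite target are assumptions chosen for illustration only, not part of the testbed described later.

import csv
import sqlite3

def extract(path):
    # Extract: gather raw records from an external plain-text source.
    with open(path, newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")

def transform(records):
    # Transform: manipulate and clean the collected data.
    for record in records:
        record = {key: (value or "").strip() for key, value in record.items()}
        if record.get("event"):          # drop records without an event type
            yield record

def load(records, db_path="events.db"):
    # Load: write the cleaned records to permanent storage.
    connection = sqlite3.connect(db_path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS events (ts TEXT, event TEXT, src TEXT)")
    connection.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        ((r.get("ts"), r.get("event"), r.get("src")) for r in records))
    connection.commit()
    connection.close()

if __name__ == "__main__":
    load(transform(extract("firewall.log")))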
In the selected framework, the data integration process has a slight variation: first there is Extract, then Load, and finally Transform (ELT). However, after observing the raw data (possible because the log information consists of plain text files), we concluded that some information was irrelevant or redundant, for instance, the producer name, domain names that can be resolved from IP addresses, etc. Thus, with a simple test in which the irrelevant fields were removed manually, the size of the information could be reduced before the Load stage and more data could be stored without quality loss.
This background suggested the development of an intuitive algorithm for removing useless fields after the data extraction process. The results attained with the developed script were satisfactory, but important scalability and flexibility issues remain to be resolved, since changing and maintaining the script requires human intervention.
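A minimal sketch of such a field-removal step is shown below; it is not the authors' actual script, and the tab-separated layout and the names in DROP_FIELDS (e.g. producer, domain) are assumptions. The hand-maintained drop list is exactly the kind of human intervention that raises the scalability and flexibility issues mentioned above.

import sys

# Fields judged irrelevant or redundant for a given log source; this set is
# maintained by hand, which is the flexibility problem noted in the text.
DROP_FIELDS = {"producer", "domain"}
SEP = "\t"

def main(stream=sys.stdin, out=sys.stdout):
    header = stream.readline().rstrip("\n").split(SEP)
    keep = [i for i, name in enumerate(header) if name not in DROP_FIELDS]
    out.write(SEP.join(header[i] for i in keep) + "\n")
    for line in stream:
        values = line.rstrip("\n").split(SEP)
        if len(values) < len(header):      # pass malformed lines through untouched
            out.write(line)
            continue
        out.write(SEP.join(values[i] for i in keep) + "\n")

if __name__ == "__main__":
    main()

Run between extraction and loading, for instance as python clean_fields.py < raw.log > reduced.log, such a filter shrinks every record before the Load stage, at the cost of keeping DROP_FIELDS in sync with each log format by hand.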
Several data cleaning techniques and tools have been developed, but most of them target data deduplication, whereas our requirement is the cleaning of useless data. In this article, we limit ourselves to presenting a Security Big Data ecosystem testbed and an intuitive data cleaning technique, together with some results and improvement proposals, for instance, increasing data retention by at least 25%. This solution has been tested with data provided by those responsible for 10,000 university users.
One of the main challenges in this research is adapting data cleaning techniques to security data in Big Data ecosystems in order to overcome storage space constraints, and we intend to answer the following questions:
Is it possible to define a Big Data security
ecosystem?
Is it possible to apply data cleaning
techniques to security data to reduce storage
space?
2 STATE OF THE ART
Commercially, several solutions are available to process Big Data; however, as already mentioned, they are expensive, and the inclusion of a third party can introduce security and confidentiality liabilities that are unacceptable to some companies. A
Big Data ecosystem must be based on the following
six pillars: Storage, Processing, Orchestration,
Assistance, Interfacing, and Deployment (Khalifa et
al., 2016). The authors mention that solutions must be
scalable and provide an extensible architecture so that
new functionalities can be plugged in with minimal
modifications to the whole framework. Furthermore,
the need for an abstraction layer is highlighted in
order to augment multi-structured data processing
capabilities.
Apache Hadoop is a freely licensed distributed framework that allows thousands of independent computers to work together to process Big Data (Bhandare, Barua and Nagare, 2013). Stratosphere is another system comparable to Apache Hadoop; according to its authors, its main advantage is the existence of an execution pipeline, which improves performance and optimization. Although released in 2013, it has not been as widely used as Hadoop (Alexandrov et al., 2014).
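As a brief, purely illustrative example of how Hadoop distributes such processing, Hadoop Streaming allows plain Python scripts to act as mapper and reducer; the event-counting logic below is an assumption for illustration and is not taken from the cited works.

#!/usr/bin/env python3
# mapper.py: emit one (event_type, 1) pair per tab-separated log line.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print(fields[0] + "\t1")   # assume the first column holds the event type

#!/usr/bin/env python3
# reducer.py: sum the counts per event type (Hadoop sorts mapper output by key).
import sys

current, total = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current and current is not None:
        print(current + "\t" + str(total))
        total = 0
    current = key
    total += int(value or 0)
if current is not None:
    print(current + "\t" + str(total))

These scripts would typically be submitted with the hadoop-streaming jar through its -files, -mapper and -reducer options; the exact jar location depends on the installation.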
Statistical outlier detection, pattern matching, clustering, and data mining are some of the techniques available to support data cleaning tasks. However, a survey (Maletic and Marcus, 2009) shows that customized processes for data cleaning are what is used in real implementations. Thus, there is a need for building high-quality tools.
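As a small, hedged example of the first of these techniques, a z-score filter over a numeric log field can flag and drop statistical outliers; the field (response size in bytes) and the threshold of three standard deviations are assumptions chosen for illustration.

import statistics

def drop_outliers(values, threshold=3.0):
    # Keep only the values whose z-score is within the threshold; the rest are
    # treated as statistical outliers and removed during cleaning.
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return list(values)
    return [v for v in values if abs(v - mean) / spread <= threshold]

# Hypothetical response sizes (bytes) extracted from one log field.
sizes = [512, 498, 505, 530, 501, 517, 496, 503, 511, 499, 520, 508, 99999]
print(drop_outliers(sizes))   # 99999 exceeds the threshold and is dropped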
In a work about data cleaning methodologies for data warehouses, data quality is assured, but there is no clear path on how those techniques can be adapted to our interests (Brizan et al., 2006). The BigDansing technique targets Big Data (Khayyat et al., 2015) but, similarly to other techniques, its main purpose is to remove inconsistencies from stored data. MaSSEETL, an open-source tool, is used over structured data and works on transforming stored data, whereas our research focuses on extraction and cleaning (Gill and Singh, 2014).
Data cleaning is often an iterative process adapted to the requirements of a specific task; a survey of data analysts and industry infrastructure engineers shows that data cleaning is still an expensive and time-consuming activity. Despite the community's research and the development of new algorithms, current methodology still requires human-in-the-loop stages and the repeated evaluation of both data and results; thus, several challenges are faced during design and implementation. Data cleaning is also a complex process that involves extraction, schema/ontology matching, value