(ETL). This is confirmed by Intel Corporation, which
reports that data integration accounts for 80% of the
development effort in Big Data projects (Intel, 2013).
The Extract task gathers data from external sources;
the Transform task employs a variety of software tools
and custom programs to manipulate the collected data,
and also handles cleaning to remove unnecessary data;
finally, the Load task writes the data to permanent
storage.
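The three tasks described above can be sketched as
follows. This is a minimal illustration and not the
pipeline used in this work: the function names, the
tab-separated log layout and the in-memory list that
stands in for permanent storage are all assumptions
made for the example.

```python
from typing import Iterable, Iterator, List

# Hypothetical raw log lines; the tab-separated layout is an assumption.
RAW_LINES = [
    "2016-01-01T10:00:00\t10.0.0.1\tGET /index.html\t200",
    "",  # empty line, treated here as unnecessary data
    "2016-01-01T10:00:05\t10.0.0.2\tGET /login\t403",
]


def extract(source: Iterable[str]) -> Iterator[str]:
    """Extract: gather raw records from an external source."""
    for line in source:
        yield line.rstrip("\n")


def transform(records: Iterable[str]) -> Iterator[List[str]]:
    """Transform: clean the records (skip empty lines) and split them into fields."""
    for record in records:
        if not record.strip():
            continue
        yield record.split("\t")


def load(rows: Iterable[List[str]], storage: List[List[str]]) -> int:
    """Load: append rows to storage and return how many were written."""
    count = 0
    for row in rows:
        storage.append(row)
        count += 1
    return count


# A plain list stands in for permanent storage in this sketch.
storage: List[List[str]] = []
loaded = load(transform(extract(RAW_LINES)), storage)
```

Chaining generators keeps each stage independent, so
a stage can be swapped or reordered without touching
the others.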
In the selected framework, the data integration
process varies slightly: Extract comes first, then
Load, and finally Transform (ELT). However, after
inspecting the raw data (possible because the log
information is stored as plain text files), we
concluded that some information was irrelevant or
redundant, for instance the producer name or domain
names that can be resolved from IP addresses. A simple
test, in which the irrelevant fields were removed
manually, showed that the size of the information
could be reduced before the Load stage, so that more
data could be stored without quality loss.
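The effect of removing irrelevant fields before the
Load stage can be illustrated with a short sketch. The
field names (producer, src_domain, src_ip), the sample
records and the JSON serialization are hypothetical and
chosen only for the example; they are not the log
format or the script used in this work.

```python
import json
from typing import Dict, Iterable, List, Set

# Assumed names of fields judged irrelevant or redundant (illustrative only).
IRRELEVANT_FIELDS: Set[str] = {"producer", "src_domain"}


def drop_fields(record: Dict[str, str], irrelevant: Set[str]) -> Dict[str, str]:
    """Return a copy of the record without the irrelevant fields."""
    return {key: value for key, value in record.items() if key not in irrelevant}


def serialized_size(records: Iterable[Dict[str, str]]) -> int:
    """Size in bytes of the records written as one JSON object per line."""
    return sum(len(json.dumps(r).encode("utf-8")) + 1 for r in records)


# Hypothetical log records.
raw_records: List[Dict[str, str]] = [
    {"time": "2016-01-01T10:00:00", "producer": "fw-01",
     "src_ip": "10.0.0.1", "src_domain": "host1.example.org", "action": "deny"},
    {"time": "2016-01-01T10:00:05", "producer": "fw-01",
     "src_ip": "10.0.0.2", "src_domain": "host2.example.org", "action": "allow"},
]

cleaned_records = [drop_fields(r, IRRELEVANT_FIELDS) for r in raw_records]

raw_size = serialized_size(raw_records)
clean_size = serialized_size(cleaned_records)
reduction = 1 - clean_size / raw_size  # fraction of bytes saved before Load
```

Because the fields to drop are passed as a set rather
than hard-coded in the function, the same function can
be reused when the list of irrelevant fields changes.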
This observation motivated the development of an
intuitive algorithm for removing useless fields after
the data extraction process. The results obtained with
the developed script were satisfactory, but important
scalability and flexibility issues remain to be
resolved, because changing and maintaining the script
requires human intervention.
Several data cleaning techniques and tools have been
developed, but most of them target data deduplication,
whereas our requirement was the removal of useless
data. In this article, we limit ourselves to
presenting a Security Big Data ecosystem testbed and
an intuitive data cleaning technique, together with
some results and improvement proposals, for instance
an increase in data retention of at least 25%. This
solution has been tested with data provided by those
responsible for 10,000 university users.
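The link between a reduction in record size and a gain
in data retention can be made explicit under two
assumptions that are introduced here for illustration
only: storage capacity is fixed and records arrive at
a constant rate. If the average record shrinks by a
fraction r, then 1/(1 - r) times as many records fit in
the same space, so retention grows by 1/(1 - r) - 1.
The function names below are hypothetical and the
numbers are a worked example, not results of this work.

```python
def retention_gain(size_reduction: float) -> float:
    """Relative retention gain when each record shrinks by `size_reduction`.

    Assumes fixed storage capacity and a constant record arrival rate.
    """
    if not 0 <= size_reduction < 1:
        raise ValueError("size_reduction must be in [0, 1)")
    return 1 / (1 - size_reduction) - 1


def required_reduction(gain: float) -> float:
    """Size reduction needed to reach a given relative retention gain."""
    if gain < 0:
        raise ValueError("gain must be non-negative")
    return gain / (1 + gain)


# Under these assumptions, a 20% smaller record yields about 25% more retention.
example_gain = retention_gain(0.20)
example_reduction = required_reduction(0.25)
```

Under these assumptions, a retention gain of 25%
corresponds to an average record that is about 20%
smaller.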
One of the main challenges in this research is
adapting data cleaning techniques to security data in
Big Data ecosystems in order to overcome storage space
constraints. We intend to answer the following
questions:
• Is it possible to define a Big Data security
  ecosystem?
• Is it possible to apply data cleaning techniques to
  security data to reduce storage space?
2  STATE OF THE ART 
Several commercial solutions are available to process
Big Data; however, as already mentioned, they are
expensive, and the inclusion of a third party can
introduce security and confidentiality liabilities
that are unacceptable to some companies. A
Big Data ecosystem must be based on the following 
six  pillars:  Storage,  Processing,  Orchestration, 
Assistance, Interfacing, and Deployment (Khalifa et 
al., 2016). The authors mention that solutions must be 
scalable and provide an extensible architecture so that 
new functionalities can be plugged  in  with minimal 
modifications to the whole framework. Furthermore, 
the  need  for  an  abstraction  layer  is  highlighted  in 
order  to  augment  multi-structured  data  processing 
capabilities. 
Apache Hadoop is a freely licensed distributed
framework that allows thousands of independent
computers to work together to process Big Data
(Bhandare, Barua and Nagare, 2013). Stratosphere is
another system comparable to Apache Hadoop; according
to its authors, its main advantage is the existence of
a pipeline, which improves execution performance and
optimization. Although it was released in 2013, it has
not been as widely used as Hadoop (Alexandrov et al.,
2014).
Statistical outlier detection, pattern matching,
clustering and data mining are some of the techniques
available to support data cleaning tasks. However, a
survey (Maletic and Marcus, 2009) shows that
customized data cleaning processes are used in real
implementations. Thus, there is a need for building
high-quality tools.
A work on data cleaning methodologies for data
warehouses assures data quality, but it does not offer
a clear path for adapting these techniques to our
purposes (Brizan et al., 2006). The BigDansing
technique targets Big Data (Khayyat et al., 2015) but,
like other techniques, its main purpose is to remove
inconsistencies in stored data. MaSSEETL, an
open-source tool, operates on structured data and
works on transforming stored data, whereas our
research focuses on extraction and cleaning (Gill and
Singh, 2014).
Data cleaning is often an iterative process adapted to
the requirements of a specific task; a survey of data
analysts and industry infrastructure engineers shows
that data cleaning is still an expensive and
time-consuming activity. Despite the community's
research and the development of new algorithms,
current methodology still requires human-in-the-loop
stages and repeated evaluation of both data and
results, so several challenges are faced during design
and implementation. Data
cleaning  is  also  a  complex  process  that  involves 
extraction,  schema/ontology  matching,  value