Authors:
Alexandra Pomares-Quimbaya 1; Alejandro Sierra-Múnera 1; Jaime Mendoza-Mendoza 1; Julián Malaver-Moreno 1; Hernán Carvajal 2 and Victor Moncayo 2
Affiliations:
1 Pontificia Universidad Javeriana, Bogotá, Colombia, Center of Excellence and Appropriation in Big Data and Data Analytics (CAOBA), Bogotá, Colombia
2 Pontificia Universidad Javeriana, Cali, Colombia, Center of Excellence and Appropriation in Big Data and Data Analytics (CAOBA), Bogotá, Colombia
Keyword(s):
Anonymization, Analytics, Data Mining, Data Science, Big Data, K-anonymity, Data Privacy, Information Disclosure.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Computer-Supported Education; Data Engineering; Data Mining; Databases and Data Security; Databases and Information Systems Integration; Enterprise Information Systems; Information Systems Analysis and Specification; Information Technologies Supporting Learning; Large Scale Databases; Security; Security and Privacy; Sensor Networks; Signal Processing; Soft Computing
Abstract:
When a company needs to run analytics on data that may include sensitive information, it is important to use a solution that protects the sensitive portions while preserving the usefulness of the data. An analysis of existing anonymization approaches found that some of them only permit the disclosure of aggregated information about large groups, or require the type of analysis to be known in advance, which is not viable in Big Data projects; others scale poorly, which makes them unsuitable for large data sets. Another group of works is presented only theoretically, without any evidence of evaluation or testing in real environments. To fill this gap, this paper presents Anonylitics, an implementation of the k-anonymity principle for small and Big Data settings, intended for contexts where small or large data sets must be disclosed so that supervised or unsupervised techniques can be applied to them. Anonylitics improves on available implementations of k-anonymity by using a hybrid approach during the creation of the anonymized blocks, maintaining the data types of the original attributes, and guaranteeing scalability when used with large data sets. Considering the diverse infrastructure and data volumes managed by companies today, Anonylitics was implemented in two versions: a centralized version, for companies that have small data sets, or large data sets together with good vertical infrastructure capabilities, and a Big Data version, for companies with large data sets and horizontal infrastructure capabilities. Evaluation on different data sets with diverse protection requirements demonstrates that our solution maintains the utility of the data, guarantees its privacy, and has good time-complexity performance.
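As a minimal illustration of the k-anonymity principle named in the abstract, the sketch below checks whether a small table is k-anonymous, that is, whether every combination of quasi-identifier values is shared by at least k records. It is not the Anonylitics implementation: the function name `is_k_anonymous`, the attribute names, and the toy records are all hypothetical and chosen only for this example.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records (hypothetical helper, for illustration)."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in groups.values())


# Hypothetical toy data: age is already generalized to ranges and the
# ZIP code is truncated, so records fall into two blocks of two.
records = [
    {"age": "20-29", "zip": "110**", "diagnosis": "flu"},
    {"age": "20-29", "zip": "110**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "760**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "760**", "diagnosis": "diabetes"},
]

print(is_k_anonymous(records, ["age", "zip"], k=2))  # True
print(is_k_anonymous(records, ["age", "zip"], k=3))  # False
```

The check only verifies the property on data that has already been generalized; producing the anonymized blocks, which is the part the abstract describes as hybrid, is not shown here.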