described different anonymization approaches that focus on
specific algorithms or specific platforms. Unfortunately,
some of them have no implementation or have not been
tested with large datasets.
Converting the originally centralized algorithm to a
distributed algorithm on the Apache Spark platform brought
to light some challenges related to the recursive nature of
the algorithm and the specific data transformation
capabilities of Spark. Although an alternative is to modify
the platform itself, as Katsogridakis et al. did
(Katsogridakis et al., 2017), this is not feasible in most
companies. Instead, we proposed, implemented, and tested
solutions to these challenges, analyzing the final
performance and data utility. Our implementation allows
companies with huge datasets to anonymize them so that
analysis tasks can be performed on the anonymized dataset
while protecting confidentiality.
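To illustrate the kind of restructuring that a recursive algorithm needs before it can be driven by a loop, the following is a minimal sketch in plain Python, not the implementation evaluated in this paper. It assumes a median-split partitioning over numeric quasi-identifiers purely for illustration; the names `partition_iteratively`, `_try_split`, and `generalize` are hypothetical, and no Spark API is used.

```python
from collections import deque
from statistics import median


def partition_iteratively(records, k):
    """Split records (tuples of numeric quasi-identifiers) into groups of size >= k.

    Hypothetical helper: an explicit queue of pending groups replaces the
    recursive calls of a top-down partitioning scheme.
    """
    if k < 1 or len(records) < k:
        raise ValueError("k must be >= 1 and at most the number of records")
    pending = deque([list(records)])
    finished = []
    while pending:
        group = pending.popleft()
        split = _try_split(group, k)
        if split is None:
            finished.append(group)   # no allowable cut left: group is final
        else:
            pending.extend(split)    # both halves are processed later
    return finished


def _try_split(group, k):
    """Return (left, right) for the first allowable median cut, or None."""
    def spread(d):
        values = [r[d] for r in group]
        return max(values) - min(values)

    # Assumption: try the attribute with the widest value range first.
    for d in sorted(range(len(group[0])), key=spread, reverse=True):
        cut = median(r[d] for r in group)
        left = [r for r in group if r[d] <= cut]
        right = [r for r in group if r[d] > cut]
        if len(left) >= k and len(right) >= k:
            return left, right
    return None


def generalize(group):
    """Replace each attribute by the (min, max) range observed in the group."""
    return tuple(
        (min(r[d] for r in group), max(r[d] for r in group))
        for d in range(len(group[0]))
    )
```

Each finished group contains at least `k` records, so publishing `generalize(group)` in place of the individual values gives every record at least `k - 1` indistinguishable companions. On a cluster, each pending group would correspond to a distributed dataset rather than an in-memory list; that mapping is not shown here.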
As future work, it is important to test our algorithm in a
more controlled environment with dedicated machines instead
of virtual machines, to avoid resource sharing between
virtual machines that may affect overall execution
performance. Those tests should also vary the cluster size
and the size of the files used, in order to better
understand the scalability and overall performance of the
proposed approach. Additionally, they will provide more
concrete results that will allow us to compare the
different approaches in greater depth.
Further future work includes adapting the algorithm to
streaming data (velocity) and supporting the anonymization
of unstructured data, both of which are important aspects
of Big Data strategies.
ACKNOWLEDGEMENTS
This research was carried out by the Center of Excellence
and Appropriation in Big Data and Data Analytics (CAOBA).
It was funded partially by the Ministry of Information
Technologies and Telecommunications of the Republic of
Colombia (MinTIC) through the Colombian Administrative
Department of Science, Technology and Innovation
(COLCIENCIAS) contract No. FP44842-anex46-2015.
REFERENCES
Byun, J.-W., Kamra, A., Bertino, E., and Li, N. (2007).
Efficient k-anonymization using clustering techniques.
In International Conference on Database Systems for
Advanced Applications, pages 188–200. Springer.
Ciriani, V., di Vimercati, S. D. C., Foresti, S., and Samarati,
P. (2007). Microdata protection. In Yu, T. and Jajodia,
S., editors, Secure Data Management in Decentralized
Systems, volume 33 of Advances in Information Security,
pages 291–321. Springer.
Clifton, C. and Tassa, T. (2013). On syntactic anonymity
and differential privacy. Trans. Data Privacy,
6(2):161–183.
Dwork, C., McSherry, F., Nissim, K., and Smith, A. D.
(2006). Calibrating noise to sensitivity in private data
analysis. In Theory of Cryptography, Third Theory of
Cryptography Conference, TCC, pages 265–284.
Dwork, C. and Naor, M. (2010). On the difficulties of
disclosure prevention in statistical databases or the case
for differential privacy. Journal of Privacy and
Confidentiality, 2:93–107.
Dwork, C. and Roth, A. (2014). The algorithmic
foundations of differential privacy. Found. Trends Theor.
Comput. Sci., 9(3–4):211–407.
El Ouazzani, Z. and El Bakkali, H. (2018). A new
technique ensuring privacy in big data: K-anonymity
without prior value of the threshold k. Procedia Computer
Science, 127:52–59.
Eyupoglu, C., Aydin, M. A., Zaim, A. H., and Sertbas, A.
(2018). An efficient big data anonymization algorithm
based on chaos and perturbation techniques. Entropy,
20(5):373.
Fung, B. C., Wang, K., and Yu, P. S. (2005). Top-down
specialization for information and privacy preservation.
In Data Engineering, 2005. ICDE 2005. Proceedings.
21st International Conference on, pages 205–216. IEEE.
Katsogridakis, P., Papagiannaki, S., and Pratikakis, P.
(2017). Execution of recursive queries in Apache
Spark. In European Conference on Parallel Processing,
pages 289–302. Springer.
Lambert, D. (1993). Measures of disclosure risk and
harm. Journal of Official Statistics, 9:313–313.
Lee, C. (2015). Security in telecommunications and
information technology. Technical report, ITU-T –
Telecommunication Standardization Bureau (TSB).
LeFevre, K. and DeWitt, D. (2007). Scalable anonymization
algorithms for large data sets. Age, 40:40.
LeFevre, K., DeWitt, D. J., and Ramakrishnan, R. (2005).
Incognito: Efficient full-domain k-anonymity. In
Proceedings of the 2005 ACM SIGMOD international
conference on Management of data, pages 49–60. ACM.
LeFevre, K., DeWitt, D. J., and Ramakrishnan, R. (2006).
Mondrian multidimensional k-anonymity. In Data
Engineering, 2006. ICDE’06. Proceedings of the 22nd
International Conference on, pages 25–25. IEEE.
Luisa Pfeiffer, M. (2008). The right to privacy.
Protecting the sensitive data. Revista Colombiana de
Bioética, 3(1):11–36.
Morisawa, Y. and Matsune, S. (2016). Nestgate—realizing
personal data protection with k-anonymization technology.
FUJITSU Sci. Tech. J., 52(3):37–42.
ICEIS 2019 - 21st International Conference on Enterprise Information Systems