bilistically k-anonymized data in the context of mixed
datasets. An in-depth analysis is then carried out
to evaluate the utility and privacy aspects of proba-
bilistic k-anonymity with respect to PPDP. We then
trained a variety of ML classifiers on probabilistically
k-anonymized data and evaluated the model utility.
When applied with high privacy parameter levels (k)
or a high number of QIDs, probabilistic k-anonymity
has an adverse impact on ML utility. However, com-
pared to the other syntactic privacy models (i.e., k-
anonymity, l-diversity, t-closeness), probabilistic k-
anonymity achieves better ML utility. In conclusion,
probabilistic k-anonymity obtains relatively high
utility for ML while providing data controllers
with numerous advantages, such as high flexibility for
sensitive data analysis under the GDPR, a means for
PPDP with low attribute disclosure risk, and easy
adaptation into the ML context without additional data
pre-processing or post-processing requirements. In
future work, it can be explored whether these classi-
fication accuracies can be improved further via the
noise correction and sample selection methods pre-
sented in the ML literature for learning from noisy
data.
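As a rough illustration of the kind of transformation evaluated above, the following is a minimal sketch of fixed-size microaggregation over numerical QIDs: records are ordered along their main direction of variance, grouped into clusters of at least k, and each group's QID values are replaced by the group centroid, making the k records indistinguishable on those attributes. The function name, the one-dimensional ordering heuristic, and the grouping strategy are illustrative assumptions, not the exact algorithm used in the experiments.

```python
import numpy as np

def microaggregate(X, k):
    """Illustrative fixed-size microaggregation for numeric QIDs:
    order records by their projection on the first principal direction,
    form groups of at least k consecutive records, and replace each
    group's values with the group centroid."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    # first right-singular vector gives the direction of maximum variance
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    order = np.argsort(centred @ vt[0])
    out = np.empty_like(X)
    n, start = len(X), 0
    while start < n:
        # the final group absorbs the remainder so every group has >= k records
        end = n if n - start < 2 * k else start + k
        idx = order[start:end]
        out[idx] = X[idx].mean(axis=0)
        start = end
    return out
```

Because each group is replaced by its own centroid, the overall column means are preserved, which is one reason microaggregated data can retain usable signal for downstream ML training.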
ACKNOWLEDGMENT
This work is supported by the Vetenskapsrådet project
"Disclosure risk and transparency in big data pri-
vacy" (VR 2016-03346, 2017-2020).
Systematic Evaluation of Probabilistic k-Anonymity for Privacy Preserving Micro-data Publishing and Analysis