Semantic Anonymisation of Set-valued Data

Montserrat Batet, Arnau Erola, David Sánchez, Jordi Castellà-Roca


Companies and organisations commonly need to release and exchange information about individuals. Because these data are usually sensitive, appropriate measures should be applied to reduce the risk of re-identification while preserving as much data utility as possible. Many anonymisation mechanisms have been developed to date, but most of them focus on structured/relational databases containing numerical or categorical data. The anonymisation of transactional data, also known as set-valued data, has received much less attention. Managing and transforming these data poses additional challenges because of their variable cardinality and their usually textual and unbounded nature. Current approaches to set-valued data are based on the generalisation of original values; however, generalisation incurs high information loss because it reduces the granularity of the output values. To tackle this problem, in this paper we adapt a well-known microaggregation anonymisation mechanism so that it can be applied to textual set-valued data. Moreover, since the utility of textual data is closely related to their meaning, special care has been taken to preserve data semantics. To do so, appropriate semantic similarity and aggregation functions are proposed. Experiments conducted on a real set-valued data set show that our proposal preserves data utility better than non-semantic approaches.
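To make the idea of semantic microaggregation over set-valued records more concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes a small hypothetical taxonomy (`PARENT`), uses the Wu and Palmer similarity between terms, an average best-match distance between term sets, a greedy fixed-size grouping, and a medoid as the group representative in place of a dedicated semantic aggregation function. All names and parameters are illustrative assumptions.

```python
# Hypothetical toy taxonomy (child -> parent); not taken from the paper.
PARENT = {
    "sports": "root", "health": "root",
    "football": "sports", "tennis": "sports",
    "diet": "health", "flu": "health",
}

def path_to_root(term):
    """Return [term, parent, ..., 'root']."""
    path = [term]
    while path[-1] != "root":
        path.append(PARENT[path[-1]])
    return path

def depth(term):
    # The root has depth 1, so the similarity denominator is never zero.
    return len(path_to_root(term))

def wu_palmer(a, b):
    """Wu & Palmer similarity: 2*depth(LCS) / (depth(a) + depth(b))."""
    ancestors_a = path_to_root(a)
    # First ancestor of b that is also an ancestor of a: the least common subsumer.
    lcs = next(t for t in path_to_root(b) if t in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

def set_distance(s, t):
    """Symmetric average best-match distance between two non-empty term sets."""
    def directed(x, y):
        return sum(1 - max(wu_palmer(a, b) for b in y) for a in x) / len(x)
    return (directed(s, t) + directed(t, s)) / 2

def medoid(records):
    """Record with the smallest summed distance to the records in its group."""
    return min(records, key=lambda r: sum(set_distance(r, o) for o in records))

def microaggregate(records, k):
    """Greedy fixed-size grouping; each record is replaced by its group medoid.

    Every group has at least k records provided len(records) >= k.
    """
    remaining = list(range(len(records)))
    groups = []
    while len(remaining) >= 2 * k:
        seed = remaining[0]
        # The seed plus its k-1 semantically closest records form one group.
        nearest = sorted(
            remaining, key=lambda i: set_distance(records[seed], records[i])
        )[:k]
        groups.append(nearest)
        remaining = [i for i in remaining if i not in nearest]
    if remaining:
        groups.append(remaining)  # leftover records form the last group
    released = [None] * len(records)
    for group in groups:
        representative = medoid([records[i] for i in group])
        for i in group:
            released[i] = representative
    return released

# Illustrative input: each record is the set of topics linked to one individual.
records = [{"football"}, {"flu", "diet"}, {"tennis", "football"}, {"diet"}]
released = microaggregate(records, k=2)
```

In this sketch the two sports-related records end up in one group and the two health-related records in another, so each released record is shared by at least `k` individuals. A non-semantic distance that only compares terms for equality would treat `{"football"}` as equally far from `{"flu", "diet"}` and `{"diet"}`, and would lose the distinction that the taxonomy provides.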



Paper Citation

in Harvard Style

Batet M., Erola A., Sánchez D. and Castellà-Roca J. (2014). Semantic Anonymisation of Set-valued Data. In Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-758-015-4, pages 102-112. DOI: 10.5220/0004811901020112

in Bibtex Style

author={Montserrat Batet and Arnau Erola and David Sánchez and Jordi Castellà-Roca},
title={Semantic Anonymisation of Set-valued Data},
booktitle={Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},

in EndNote Style

JO - Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - Semantic Anonymisation of Set-valued Data
SN - 978-989-758-015-4
AU - Batet M.
AU - Erola A.
AU - Sánchez D.
AU - Castellà-Roca J.
PY - 2014
SP - 102
EP - 112
DO - 10.5220/0004811901020112