Privacy- & Utility-Preserving Data Releases over Fragmented Data Using Individual Differential Privacy

Luis Del Vasto-Terrientes (1, a), Sergio Martínez (1, b) and David Sánchez (1, 2, c)

(1) Universitat Rovira i Virgili, Departament d’Enginyeria Informàtica i Matemàtiques, Av. Països Catalans 26, Tarragona, 43007, Catalonia, Spain

(2) CYBERCAT-Center for Cybersecurity Research of Catalonia, Av. Països Catalans 26, Tarragona, 43007, Catalonia, Spain

Keywords:

Data Analysis, Privacy-Preserving Data Release, Individual Differential Privacy, Data Fragmentation.

Abstract:

Data fragmentation is the process of splitting data into either attributes or records across multiple databases,

thereby improving operational efﬁciency, minimizing processing requirements, and enhancing data privacy.

However, under this approach, data aggregation becomes complex, particularly in environments where adher-

ence to regulatory compliance is essential for organizational data analysis and decision-making tasks. Since

the dataset held by each party may contain sensitive information, simply joining local datasets and releasing

the aggregated result will inevitably reveal such sensitive information to other parties. Differential Privacy

(DP) has become the de facto standard for data protection due to its rigorous notion of privacy. However,

the strong privacy guarantees it offers result in a deterioration of data utility in several scenarios, such as data

releases in either centralized or fragmented data scenarios. This paper explores the application of Individual

Differential Privacy (iDP)—a formulation of DP conceived to better preserve data utility while still provid-

ing strong privacy guarantees to individuals—for data releases in either horizontally or vertically fragmented

scenarios. In combination with individual ranking (IR) microaggregation, an iDP-IR privacy-preserving data

release system is presented, in which multiple data owners can safely share datasets. Our experiments on

the Adult and Wine Quality datasets demonstrate that the proposed system for fragmented data can provide

reasonable information loss with robust ε privacy values.

1 INTRODUCTION

In the information society, researchers and the general

public have been demanding more data, promoting

the creation of public repositories such as the Harvard

Dataverse, Dryad, and government open data portals.

However, these massive data repositories may include both re-identifying and confidential attributes, which can jeopardize the privacy of the individuals they refer to. For instance, (Sweeney, 2000) demonstrates that

87% of the U.S. population from the Census 1990 had

reported characteristics that likely made them unique

by combining ZIP code, date of birth, and gender.

Also, (Golle, 2006) shows that 63% of the U.S. popu-

lation from the Census 2000 can be identiﬁed by com-

bining these 3 quasi-identiﬁers.

a https://orcid.org/0000-0003-0483-8559

b https://orcid.org/0000-0002-3941-5348

c https://orcid.org/0000-0001-7275-7887

Data anonymization has proposed a variety of solutions to protect privacy in data releases (Samarati and Sweeney, 1998; Wang and Xu, 2017). Among

these, Differential Privacy (DP) has gained popular-

ity across various ﬁelds due to its strong privacy guar-

antees. However, strong privacy comes at the cost

of low data utility preservation due to the perturba-

tion applied to data. Despite criticism regarding its

usefulness and applicability (Bambauer et al., 2013;

Clifton and Tassa, 2013; Blanco-Justicia et al., 2022;

J. Domingo-Ferrer, 2021), the data privacy commu-

nity continues to propose different relaxations and

alternative formulations of DP to improve its utility

(Dwork et al., 2009; Friedman and Schuster, 2010;

Cummings et al., 2024). Some well-known DP relaxations include (ε,δ)-DP (Dwork et al., 2006), Rényi DP (Mironov, 2017), and Zero-Concentrated DP (Bun

and Steinke, 2016). These three examples introduce

a “small” chance that standard DP guarantees may be

broken for the promise of better utility.

Vasto-Terrientes, L., Martínez, S. and Sánchez, D.

Privacy- & Utility-Preserving Data Releases over Fragmented Data Using Individual Differential Privacy.

DOI: 10.5220/0013141200003899

In Proceedings of the 11th International Conference on Information Systems Security and Privacy (ICISSP 2025) - Volume 2, pages 318-329

ISBN: 978-989-758-735-1; ISSN: 2184-4356

An alternative approach to DP relaxation is Individual Differential Privacy (iDP). iDP is a different formulation that allows the data controller to utilize actual knowledge of the dataset (in contrast to the DP relaxations mentioned above) when calibrating the noise added to generate protected datasets. A notable advantage of iDP is that it provides the strong privacy guarantees of DP to individuals without introducing any risk of breaking these guarantees, while offering better utility. However, solely applying iDP to each attribute value for every record in the dataset would result in low utility due to the high sensitivity of each individual value. To reduce this sensitivity, and hence the noise to be added, we employ individual ranking (IR) microaggregation.

Presented in (Sánchez et al., 2016), DP-IR is a

privacy-preserving technique aimed at enhancing util-

ity for data releases by combining standard DP and

IR microaggregation. IR replaces detailed data with

centroids calculated from clusters of similar values

for each attribute in an independent and consecutive

manner. Centroids effectively reduce the amount of

noise required to achieve DP by decreasing data sen-

sitivity, thereby offering a considerable improvement

in utility, which is particularly desirable for data anal-

ysis tasks in organizations and industry (Ghazi et al.,

2023).

In the literature on DP, its application is com-

monly assumed in a centralized scenario—referred to

as Centralized Differential Privacy (CDP)—wherein

an aggregator collects raw data prior to the application

of DP (Yang et al., 2024). However, in many practical applications, datasets are stored in distributed databases, a trend accelerated by the decreasing costs of on-premise infrastructures and cloud computing.

This architecture complicates the previously proposed

scenario, as data owners may be unwilling to release

raw data, even to a trusted participant. The decen-

tralized nature of these datasets introduces additional

challenges in ensuring data privacy without compro-

mising utility for data analysis, which is essential for

driving informed decisions and fostering innovation.

1.1 Contributions and Plan

In this paper, we propose a privacy-preserving data

release system where multiple data owners employ an

iDP-IR mechanism—a privacy mechanism that com-

bines iDP and IR microaggregation—to generate ε-

iDP datasets that can be safely shared with an aggre-

gator. This aggregator is responsible for orchestrat-

ing, gathering, and combining the ε-iDP datasets from

the contributing data owners.

We develop a protocol for each of the two data

fragmentation scenarios analyzed in this paper: hori-

zontal and vertical fragmentation. Depending on the

data fragmentation scenario, the aggregator must co-

ordinate the ε privacy value and the microaggregation

conﬁguration that each data party must locally imple-

ment to generate a global ε-iDP.

The implementation of iDP-IR for fragmented data not only maximizes utility without compromising the use of strong ε values, but is also scalable (no restrictions on the number of data owners) and has a low computational cost compared with approaches that rely on cryptographic techniques, which are commonly inefficient for a large number of data owners.

The paper is organized as follows. First, Sec-

tion 2 presents a background on distributed privacy-

preserving data releases. Section 3 introduces iDP

and IR separately and explains how they are com-

bined into iDP-IR to protect sensitive data. Next, Sec-

tion 4 proposes an honest-but-curious system based

on iDP-IR for data releases over fragmented data.

Horizontal and vertical fragmentation data scenarios

are discussed for the proposed system. Section 5 eval-

uates the information loss calculation of the global

protected dataset generated in the distributed system

proposed compared to that of a centralized environ-

ment using two datasets available at the UCI Ma-

chine Learning Repository: Adult and Wine Qual-

ity. The empirical results obtained are then discussed

to analyze the feasibility of applying the proposed

iDP-IR-based data release system for fragmented data

stored across multiple data owners. Finally, Section 6

presents the conclusions and potential future research

directions.

2 RELATED WORK

In a centralized scenario, where data is assumed to be

stored in a single repository, a dataset is formalized as

follows:

• A is a finite set of attributes $a_g$, where $g \in \{1,2,\dots,l\}$.

• R is a finite set of records $r_h$, where $h \in \{1,2,\dots,n\}$. Each record $r_h$ is a tuple of values corresponding to each attribute $a_g$.

• D is a dataset containing the set of records R.

In cases where datasets are fragmented across

multiple repositories, we consider two approaches:

horizontal fragmentation and vertical fragmentation.

Horizontal fragmentation entails storing records with

the same attribute schema in multiple databases,

whereas vertical fragmentation involves storing dif-

ferent attributes of the same records in multiple

databases.


The formalization of datasets considering multiple

data owners is as follows:

• $A_j$ is a finite set of attributes $a_g$, where $g \in \{1,2,\dots,l\}$.

• $R_j$ is a finite set of records $r_h$, where $h \in \{1,2,\dots,n\}$. Each record $r_h$ is a tuple of values corresponding to each attribute $a_g \in A_j$.

• $D_j$ is a dataset containing the set of records $R_j$. Here, j denotes the specific dataset in which $R_j$ and $A_j$ are defined.

• For simplicity, $l = |A_j|$ and $n = |R_j|$, where l and n may vary for each j.

Regarding privacy-preserving data releases across

data fragmentation scenarios, most previous works

have focused on the application of standard DP and

Secure Multiparty Computation (SMC) for either hor-

izontal or vertical data fragmentation.

For the horizontal fragmentation scenario, (Alha-

didi et al., 2012) presents a model for data releases

limited to two-party collaboration only. The algo-

rithm relies on generalizing raw data before apply-

ing DP. (Cheng et al., 2020) introduces a differen-

tially private sequential update of Bayesian networks

(DP-SUBN), where the parties and a curator collabo-

ratively quantify the correlations of all attribute pairs

across all local datasets, which may require signif-

icant resources depending on the number of parties

and attributes.

For the case of vertically fragmented data, (Mo-

hammed et al., 2014) presents DistDiffGen, the ﬁrst

data release system for this scenario limited to two-

party cooperation. It generates an anonymous ε-DP

data table tailored for classiﬁcation tasks. A differen-

tially private latent tree (DPLT) approach is described

in (Tang et al., 2021) for solving data publishing in a

secure two-party scenario, with an extension to multi-

party cases. However, this extension may perform

well only when the number of parties is not large.

(Wang et al., 2021) presents a semi-honest model

called ArbDistDP for two-party collaboration, which

privately publishes arbitrarily partitioned data by ap-

plying standard ε-DP. This model relies on multiple

steps of top-down generalization before noise addi-

tion, consuming most of the processing time.

In contrast to the previous works mentioned,

our work focuses on maximizing utility preservation

while providing strong guarantees to individuals for

fragmented data by employing iDP and microaggre-

gation. Due to the composition property of iDP, the

approach presented in this work is applicable to both

horizontal and vertical data fragmentation, with no

limit on the number of data owner participants, and

operates at a low computational cost.

3 INDIVIDUAL DIFFERENTIAL

PRIVACY AND

MICROAGGREGATION

The DP model (Dwork, 2006) was originally de-

signed for the interactive setting, protecting the re-

sults of database queries requesting speciﬁc data. In

this setting, a sanitizer providing ε-DP outputs sits

between the user querying the data and the dataset.

However, its application was extended to other con-

texts such as data releases, differentially private ma-

chine learning models, data collection, among others.

In this work we focus on data releases at microdata

level, which is information at the level of individual

respondents.

3.1 Differential Privacy

DP has gained popularity among the scientiﬁc com-

munity due to its strong privacy guarantee, which en-

sures that the presence or absence of an individual in

the dataset has little effect on the output. In Deﬁnition

1, DP is formally deﬁned.

Definition 1. A differential privacy (DP) mechanism ν gives ε-DP if, for all datasets $D_1$ and $D_2$ differing in at most one record (i.e., neighbor datasets) and all $S \subset Range(\nu)$, we have:

$$\Pr(\nu(D_1) \in S) \le \exp(\varepsilon)\,\Pr(\nu(D_2) \in S) \qquad (1)$$

The privacy parameter ε, also known as privacy

budget, determines the level of disclosure the sys-

tem will tolerate. A smaller budget corresponds to

stronger privacy and consequently, less data utility.

According to (Dwork, 2008), values of ε in the range [0.1, 1] provide robust privacy guarantees.

The amount of noise added to the original values is proportional to the sensitivity of those values to modifications. This noise represents how much the value may be distorted in the output. The global sensitivity is defined as the largest difference in the output of a function between any two neighboring datasets within the same domain $\mathcal{D}$, and it is formalized as follows:

Definition 2. Let u be a positive integer and $\mathcal{D}$ be a collection of datasets containing $D_1$ and $D_2$. The global sensitivity of a function $f : \mathcal{D} \to \mathbb{R}^u$ is

$$\Delta f = \max_{\substack{D_1, D_2 \in \mathcal{D} \\ d(D_1, D_2) = 1}} \| f(D_1) - f(D_2) \|_1, \qquad (2)$$

where $d(D_1, D_2) = 1$ represents that datasets $D_1$ and $D_2$ differ in one record.

Assuming that $r_h^g$ is a real-valued response located in a dataset D, in record h and attribute g, for a certain query Q that must be masked by adding random noise Y(D), a randomized response $\nu(D) = Q_{(r_h^g)}(D) + Y(D)$ is computed.

In the literature, the most common method applied

to attain DP is the Laplace mechanism, which gener-

ates random noise using the Laplace distribution. By

controlling the scale of the noise, the Laplace mecha-

nism can balance the trade-off between privacy and

utility, making it a versatile and widely-used tech-

nique in the context of DP.
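As an illustration of how the noise scale governs this trade-off, the following minimal sketch adds Laplace noise of scale sensitivity/ε to a single numeric value. The function name `laplace_mechanism` and its parameters are our own illustrative choices, not part of the system described in this paper; the noise is drawn as the difference of two exponential variates, which follows a Laplace distribution.

```python
import random

def laplace_mechanism(value, sensitivity, epsilon, rng=random):
    """Return value + Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if sensitivity < 0:
        raise ValueError("sensitivity must be non-negative")
    scale = sensitivity / epsilon
    if scale == 0:
        return float(value)  # a constant query needs no noise
    # The difference of two i.i.d. exponential draws with mean `scale`
    # is Laplace-distributed, centred at 0, with that scale.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return value + noise

rng = random.Random(7)  # fixed seed only to make the illustration repeatable
for eps in (0.1, 1.0):
    samples = [laplace_mechanism(40.0, 1.0, eps, rng) for _ in range(5)]
    print(eps, [round(s, 2) for s in samples])
```

Smaller values of ε yield a larger scale and therefore outputs that spread further from the true value.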

3.2 Individual Differential Privacy and

Microaggregation

iDP (Soria-Comas et al., 2017) is a privacy model

that can incur less information loss than standard DP,

while giving individuals the same privacy protection

as DP. In contrast to standard DP, iDP utilizes knowl-

edge from the actual dataset containing the individu-

als to protect, while DP follows the worst case sce-

nario, i.e., it is designed to provide robust privacy

protection under the most challenging and adversar-

ial conditions.

iDP is formalized as follows:

Definition 3. Given a dataset $D_1$, an iDP mechanism ν gives ε-iDP if, for any dataset $D_2$ that is a neighbor of $D_1$, and any $S \subset Range(\nu)$, we have:

$$\exp(-\varepsilon)\,\Pr(\nu(D_2) \in S) \le \Pr(\nu(D_1) \in S) \le \exp(\varepsilon)\,\Pr(\nu(D_2) \in S) \qquad (3)$$

For data releases, iDP can be attained by calibrat-

ing the noise addition to local sensitivity instead of

global sensitivity. This results in better preservation

of utility because it typically requires much smaller

noise addition than global sensitivity, where the ac-

tual dataset speciﬁcations are not considered.

The local sensitivity is defined as the largest difference in the output of a function between the actual dataset $D_1$ and any neighbor $D_2$ within the domain $\mathcal{D}$, and it is formalized as follows:

Definition 4. Let u be a positive integer and $\mathcal{D}$ be a collection of datasets containing $D_1$ as the actual dataset and $D_2$ as any neighbor dataset. The local sensitivity of a function $f : \mathcal{D} \to \mathbb{R}^u$ at $D_1$ is

$$LS_f(D_1) = \max_{\substack{D_2 \in \mathcal{D} \\ d(D_1, D_2) = 1}} \| f(D_1) - f(D_2) \|_1 \qquad (4)$$
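To make the gap between the two notions of sensitivity concrete, the sketch below uses the median query, a standard textbook example that is not taken from this paper. It assumes that neighbor datasets are obtained by replacing the value of one record, that the dataset has an odd number of values, and that all values lie in a bounded domain; the function names are hypothetical.

```python
def median_local_sensitivity(values):
    """Local sensitivity of the median at this dataset (odd size, one record replaced)."""
    s = sorted(values)
    n = len(s)
    if n < 3 or n % 2 == 0:
        raise ValueError("this sketch expects an odd number (>= 3) of values")
    m = n // 2  # index of the median in the sorted list
    # Replacing one record can move the median at most to an adjacent order statistic.
    return max(s[m] - s[m - 1], s[m + 1] - s[m])

def median_global_sensitivity(lower, upper):
    """Worst case over all datasets in [lower, upper]: the median can cross the whole domain."""
    return upper - lower

ages = [23, 31, 35, 36, 58]
print(median_local_sensitivity(ages))     # 4
print(median_global_sensitivity(17, 90))  # 73
```

In this example, noise calibrated to the actual dataset would be scaled to 4 rather than to the domain width of 73.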

One of the most interesting properties of iDP (and

DP) is composability, which we develop in this work

to generate integrated ε-iDP datasets in data fragmen-

tation scenarios. Composability is considered for a

sequence of iDP mechanisms, such as consecutive

queries executed over the sanitizer.

Theorem 1. Sequential composition: Let $\nu_1$ and $\nu_2$ be randomized functions accessing non-disjoint datasets $D_1$ and $D_2$, satisfying $\varepsilon_1$-iDP and $\varepsilon_2$-iDP respectively. Then the combined output satisfies $(\varepsilon_1 + \varepsilon_2)$-iDP.

Theorem 2. Parallel composition: Let $\nu_1$ and $\nu_2$ be randomized functions accessing disjoint datasets $D_1$ and $D_2$, satisfying $\varepsilon_1$-iDP and $\varepsilon_2$-iDP respectively. Then the combined output satisfies $\max(\varepsilon_1, \varepsilon_2)$-iDP.

The composition property of the iDP model is cru-

cial in this study to effectively guarantee a global ε-

iDP data release output. Depending on the type of

data fragmentation, an aggregator must accurately assign the privacy budget that each data owner needs to implement. Further elaboration on this process is

provided in Section 4.2.
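The budget arithmetic implied by the two composition theorems can be sketched as follows. The helper names are hypothetical, and the equal split is only one possible allocation, chosen here for illustration.

```python
def sequential_epsilon(budgets):
    """Theorem 1: mechanisms over non-disjoint data add up their budgets."""
    return sum(budgets)

def parallel_epsilon(budgets):
    """Theorem 2: mechanisms over disjoint data cost only the largest budget."""
    return max(budgets)

def split_equally(epsilon, parts):
    """Equal split of a global budget across sequentially composed mechanisms."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    return [epsilon / parts] * parts

# Three mechanisms over disjoint records, each run with epsilon = 1.0.
print(parallel_epsilon([1.0, 1.0, 1.0]))      # 1.0
# Four mechanisms over the same individuals sharing a global epsilon = 1.0.
shares = split_equally(1.0, 4)
print(shares, sequential_epsilon(shares))     # [0.25, 0.25, 0.25, 0.25] 1.0
```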

iDP was originally designed for the interactive setting, as was DP. In this scenario, a user poses a query over a dataset to receive an answer in the form of aggregated data (e.g., averages and medians). For microdata-level data releases over distributed data using iDP, we rely on a pre-processing step based on data microaggregation to provide greater data utility without compromising strong privacy.

3.3 Individual Ranking

Microaggregation

In data releases, to generate an ε-iDP record $r_h$, by sequential composition (see Theorem 1), it is required to apply (ε/l)-iDP to each attribute value $r_h^g$, ∀g, where l is the number of attributes in D. Thus, for h=1 the consecutive set of queries $Q_{(r_1^g)}(D)$, ∀g, results in an ε-individual differentially private record $r_1$. Such queries are very sensitive and require a considerable amount of ε to provide reasonable utility, even if we consider local sensitivity rather than global sensitivity.

To reduce the sensitivity, this work applies the mechanism proposed in (Sánchez et al., 2016), which relies on a microaggregation-based approach called individual ranking (IR) as a pre-processing step.

Microaggregation is a perturbative masking tech-

nique where the attribute values of each record are

replaced by aggregating values of similar records.

It encompasses two methods: univariate and multivariate microaggregation. The univariate method, known as IR, aggregates the values of each attribute one at a time in a consecutive and independent manner, while multivariate methods aggregate all attributes simultaneously, making the latter computationally more costly.

IR clusters a dataset $D_j$ so that values for each attribute $a_g \in A_j$ are sorted either in increasing or decreasing order for subsequent aggregation into successive clusters of k elements. From each of the clusters formed, a centroid is generated, typically through

a simple arithmetic average. This centroid then re-

places the original values belonging to the cluster. IR

produces low utility loss but a high disclosure risk

when applied in isolation. However, this is compen-

sated by the less sensitive centroids it produces com-

pared to the original values.

The low computational cost required by IR is op-

timal for enhancing data utility in combination with

iDP for fragmented data in a distributed environment,

where, instead of relying on one powerful server spec-

iﬁcation, the workload is commonly shared across

several less powerful parties.

The IR microaggregation method is explained in

the following algorithm, assuming that the computa-

tion is performed by a single data owner.

Data: Dataset D, parameter value k
Result: IR dataset D*
D* ← D; // Copy of the original dataset to microaggregate
numOfClusters ← ⌊n/k⌋;
for a_g ∈ A* do
    D* ← ascendingSortForAttribute(D*, a_g);
    for cluster ← 1 to numOfClusters do
        partialCluster, centroid ← null; // Cluster and centroid
        partialCluster ← computeCluster(cluster, D*, a_g, k);
        centroid ← calculateCentroid(partialCluster);
        replaceValuesByCentroid(D*, partialCluster, centroid);
    end
end
reorderRecordsToOriginalOrder(D, D*);

Algorithm 1: Individual Ranking Microaggregation.

The algorithm receives the dataset D and the parameter value k as inputs. First, a copy of D, called $D^*$, is created. $D^*$ is then the dataset to be microaggregated and returned once the process is completed. The number of clusters generated depends on the number of records n and the value k. The floor function returns the largest integer that is less than or equal to its argument, so that floor(102/5) returns 20. In this example, 20 clusters are computed and the two remaining values are grouped within the last cluster.

Continuing with Algorithm 1, for each attribute $a_g \in A^*$, the records are sorted in ascending order of that attribute's values, so the remaining values of each record move with it. Next, the algorithm continues with the clustering and centroid calculations for each attribute $a_g$ independently. partialCluster groups the set of values that belongs to cluster, which ranges between 1 and numOfClusters. Function computeCluster calculates and returns at least k similar values from attribute $a_g \in D^*$, forming the cluster.

computeCluster returns an array of values, from which the original values are mapped and replaced by the computed centroid. Once all centroids are calculated for all $a_g \in A^*$, Algorithm 1 reorders each record to its original position as in the dataset D, resulting in the microaggregated dataset $D^*$.
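Algorithm 1 can be sketched in Python for numerical attributes as shown below. This is a minimal illustration under our own assumptions rather than the authors' implementation: the dataset is represented as a dictionary mapping each attribute name to its list of values, the centroid is the arithmetic mean, and the function names `ir_microaggregate` and `ir_dataset` are hypothetical. Writing each centroid back by record index plays the role of the final reordering step.

```python
def ir_microaggregate(values, k):
    """Individual ranking for one numeric attribute; output keeps the input order."""
    n = len(values)
    if k < 1 or n == 0:
        raise ValueError("k must be >= 1 and values must be non-empty")
    # Record indices sorted by ascending attribute value.
    order = sorted(range(n), key=lambda i: values[i])
    num_clusters = max(1, n // k)  # floor(n / k), with at least one cluster
    result = [0.0] * n
    for c in range(num_clusters):
        start = c * k
        # The last cluster absorbs the n mod k leftover records.
        end = n if c == num_clusters - 1 else start + k
        members = order[start:end]
        centroid = sum(values[i] for i in members) / len(members)
        for i in members:
            result[i] = centroid  # writing by index restores the original order
    return result

def ir_dataset(columns, k):
    """Apply IR to each attribute independently; `columns` maps attribute -> values."""
    return {name: ir_microaggregate(vals, k) for name, vals in columns.items()}

print(ir_microaggregate([10, 1, 2, 3, 11, 12, 100], k=3))
# [33.25, 2.0, 2.0, 2.0, 33.25, 33.25, 33.25]
```

With seven values and k = 3, two clusters are formed and the last one holds four values, mirroring the floor(102/5) example above.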

4 PRIVACY-PRESERVING DATA

RELEASES FOR

FRAGMENTED DATA

APPLYING iDP-IR

This section introduces an honest-but-curious iDP-IR-based system model and its associated protocols for data releases. The model considers data distributed under either horizontal or vertical fragmentation, which must be locally protected prior to being shared with a trusted party to guarantee privacy. It also assumes that each participant in the system runs the protocol exactly as specified (no deviations or malicious parties).

4.1 Distributed System Model

Let us consider an honest-but-curious distributed sys-

tem where several data owners collaborate to con-

struct a robust private dataset for analysis without

compromising data privacy. In this architecture, it is

assumed that both the aggregator and the participating

parties adhere to the protocol.

Figure 1 depicts the architecture of the system and

illustrates the participants and the basic communica-

tion ﬂow.

The proposed architecture encompasses the fol-

lowing participants:

• Data owners φ: Consider the set of data owners $\phi_j$, where $j \in \{1, 2, \dots, m\}$. These data owners may involve on-premise databases as well as cloud databases holding datasets $D_j$. Each $\phi_j$ possesses an iDP-IR mechanism $\nu_j$ that generates an ε-individual differentially private dataset $D'_j$. This work assumes that |φ| ≥ 2.

• Aggregator κ: It is in charge of orchestrating the aggregation of ε-individual differentially private datasets $D'_j$ into a combined protected dataset $D'$ so that $D' = D'_1 \cup D'_2 \cup \dots \cup D'_m$. This work assumes that |κ| = 1.

• Users ρ: External users represent individuals who are willing to use the released data for analysis. Their participation in the model may not be mandatory, meaning that data releases may be shared only among data owners to enhance their knowledge for decision-making.

Figure 1: Architecture of the proposed model.

In the proposed system, the aggregator sets the privacy parameters to be applied by each data owner and validates the data schema between parties. Each data owner is equipped with an iDP-IR mechanism. Initially, the mechanism computes IR microaggregation over $D_j$, which represents the dataset each data owner is willing to share, thereby generating a microaggregated dataset $D^*_j$.

Finally, ε-iDP is applied to each centroid computed for the clusters generated in $D^*_j$. The noisy centroid value obtained then replaces the values of the cluster associated with the centroid in the microaggregated dataset, resulting in an iDP-IR dataset $D'_j$. This dataset is sent to the aggregator, who is responsible for enforcing the protocol for delivering all data owners' protected datasets before generating a global protected dataset for data release. This work assumes a secure channel connects all parties involved in the model.
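The local mechanism at each data owner can be sketched as follows for numerical attributes. This is a hedged illustration, not the authors' implementation: the sensitivity of each centroid is passed in by the caller because its derivation is not covered here, noise is drawn from a Laplace distribution, the budget ε is split evenly over the attributes following sequential composition, and all function and parameter names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) draw as the difference of two exponential draws."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def idp_ir_attribute(values, k, epsilon_attr, sensitivity, rng):
    """Microaggregate one numeric attribute by IR, then perturb each centroid once.

    `sensitivity` is an input of this sketch (assumption): how the local
    sensitivity of a centroid is obtained is left to the caller.
    """
    if epsilon_attr <= 0:
        raise ValueError("epsilon_attr must be positive")
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    num_clusters = max(1, n // k)
    protected = [0.0] * n
    for c in range(num_clusters):
        start = c * k
        end = n if c == num_clusters - 1 else start + k
        members = order[start:end]
        centroid = sum(values[i] for i in members) / len(members)
        noisy = centroid + laplace_noise(sensitivity / epsilon_attr, rng)
        for i in members:
            protected[i] = noisy  # the whole cluster shares one noisy centroid
    return protected

def idp_ir_dataset(columns, k, epsilon, sensitivities, rng):
    """Split epsilon evenly over the attributes (sequential composition)."""
    eps_attr = epsilon / len(columns)
    return {a: idp_ir_attribute(v, k, eps_attr, sensitivities[a], rng)
            for a, v in columns.items()}

rng = random.Random(1)  # fixed seed only to make the illustration repeatable
data = {"age": [25, 47, 31, 52, 38, 29], "hours": [40, 35, 60, 20, 45, 40]}
print(idp_ir_dataset(data, k=3, epsilon=1.0,
                     sensitivities={"age": 2.0, "hours": 3.0}, rng=rng))
```

Because noise is added once per centroid rather than once per value, each attribute of the output contains at most as many distinct values as there are clusters.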

4.2 Composition Property for Data

Fragmentation & iDP-IR

We now explore two possible scenarios of privacy-

preserving data releases in fragmented data: horizon-

tal and vertical fragmentation. Depending on the sce-

nario, the communication ﬂow and data exchange be-

tween φ and κ differ.

Let the global dataset D be distributed between data owners φ in either a horizontal (Joyce and Nirmalrani, 2015; Sauer and Hao, 2015) or vertical way (Vaidya, 2008). In horizontally fragmented data, two data owners $\phi_1$ and $\phi_2$ hold disjoint sets of individuals with the same schema, so that $A_1 = A_2$ and $R_1 \cap R_2 = \emptyset$. These parties are willing to cooperate by combining samples from multiple sources to obtain enhanced datasets for robust data analysis. For instance, two or more small marketing companies holding different datasets with the same attributes may wish to gain knowledge by collecting different insights.

On the other hand, vertically fragmented data comprises different sets of attributes about the same individuals, stored by multiple data owners $\phi_j$. For two data owners $\phi_1$ and $\phi_2$, $A_1 \cap A_2 = \emptyset$ and $R_1 = R_2$. As

an example, two subsidiary companies belonging to

the same parent company may have stored different

features about the same clients. The parent company

may decide to join these datasets to perform descrip-

tive and predictive analyses for decision-making.

In the context of these two data fragmentation

scenarios, an aggregated ε-iDP dataset in κ may be

achieved for data release by correctly applying the

composition property as described in Theorem 1 and

Theorem 2. This process is explained and illustrated

in the following sections, along with the protocols

between the participants for each data fragmentation

scenario.

4.2.1 Horizontal Fragmentation

For horizontally fragmented data, parallel composition is utilized to obtain an ε-individual differentially private dataset $D'$ from protected datasets $D'_j$. Following Theorem 2, if $\varepsilon_1 = \varepsilon_2 = \dots = \varepsilon_m = \varepsilon$, an ε-iDP dataset $D'$ is computed by ensuring that each $\nu_j$ provides ε-iDP records in $D'_j$ accordingly.

Protocol for Horizontal Fragmentation

1. κ requests $A_j$, ∀j, i.e., the attributes each data owner $\phi_j$ is willing to share.

2. ∀j, $\phi_j$ sends its $A_j$ metadata to κ.

3. κ verifies if $A_1 = A_2 = \dots = A_m$.

4. κ requests from every $\phi_j$ a protected dataset $D'_j$ with parameters k ≥ 3 and equal ε-iDP, following Theorem 2.

5. ∀j, $\phi_j$ computes clusters of size k ∀g, $a_g \in A_j$, generating a microaggregated dataset $D^*_j$.

6. ∀j, $\phi_j$ applies ε-iDP over $D^*_j$ by adding noise to each centroid, which then replaces the attribute values in the cluster, generating a protected dataset $D'_j$.

7. κ retrieves $D'_j$, ∀j, aggregating them into a protected dataset $D'$.

8. κ releases $D'$ to ρ and φ.
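The aggregator side of the horizontal protocol (schema verification and aggregation) can be sketched as below. The record representation, a list of dictionaries per data owner, and the function name `aggregate_horizontal` are assumptions made for illustration; the protected fragments are taken as already produced locally by each data owner.

```python
def aggregate_horizontal(protected_fragments):
    """Union of the owners' protected records after checking that the schemas agree.

    Each fragment is a non-empty list of records; each record maps attribute -> value.
    """
    if len(protected_fragments) < 2:
        raise ValueError("the model assumes at least two data owners")
    if any(len(frag) == 0 for frag in protected_fragments):
        raise ValueError("every data owner must contribute at least one record")
    # Schema check: every record of every owner must expose the same attributes.
    reference = set(protected_fragments[0][0].keys())
    for frag in protected_fragments:
        for record in frag:
            if set(record.keys()) != reference:
                raise ValueError("schemas differ between data owners")
    # Aggregation: the records are disjoint, so the global dataset is their concatenation.
    combined = []
    for frag in protected_fragments:
        combined.extend(frag)
    return combined

owner_1 = [{"age": 30.5, "hours": 41.2}, {"age": 30.5, "hours": 38.9}]
owner_2 = [{"age": 44.1, "hours": 41.2}]
print(len(aggregate_horizontal([owner_1, owner_2])))  # 3
```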

4.2.2 Vertical Fragmentation

For the case of vertical data fragmentation (i.e., disjoint attributes), where different mechanisms $\nu_j$ independently compute different sets of attribute outputs for the same individual, let us consider Theorem 1 to compute an ε-iDP-IR private dataset $D'$. Assuming that the aggregator equitably divides ε between all data owners $\phi_j$, so that each receives $\varepsilon_j = \varepsilon/m$, a $(\sum_{j=1}^{m} \varepsilon_j)$-iDP-IR dataset $D'$ is obtained from the combination of their individual outputs. Locally in $\phi_j$, each masked attribute value output is computed with budget $\varepsilon_j / l$, where $l = |A_j|$, resulting in a $(\sum_{g=1}^{l} \varepsilon_j / l)$-iDP record, i.e., an $\varepsilon_j$-iDP record.

In vertical fragmentation, either all data owners

must share a common identiﬁer or the data must be

fragmented in a way that allows for record recon-

struction during the defragmentation process at the

aggregator. As a result, this fragmentation method

is computationally more demanding than horizontal

fragmentation.

Protocol for Vertical Fragmentation

1. κ requests $A_j$, ∀j, i.e., the attributes each data owner $\phi_j$ is willing to share.

2. ∀j, $\phi_j$ sends its $A_j$ metadata to κ.

3. κ verifies if $A_1 \cap A_2 \cap \dots \cap A_m = \emptyset$.

4. κ requests from every $\phi_j$ a protected dataset $D'_j$ with parameters k ≥ 3 and $\varepsilon_j$-iDP (the share of ε assigned to $\phi_j$), following Theorem 1.

5. ∀j, $\phi_j$ computes clusters of size k ∀g, $a_g \in A_j$, generating a microaggregated dataset $D^*_j$.

6. ∀j, $\phi_j$ applies $\varepsilon_j$-iDP over $D^*_j$ by adding noise to each centroid, which then replaces the attribute values in the cluster, generating a protected dataset $D'_j$.

7. κ retrieves $D'_j$, ∀j, constructs records and aggregates them into a protected dataset $D'$.

8. κ releases $D'$ to ρ and φ.
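The record reconstruction performed by the aggregator in the vertical protocol can be sketched as follows, assuming that all data owners share a common record identifier. The representation (one dictionary per data owner, keyed by identifier) and the function name `aggregate_vertical` are illustrative assumptions; the sketch checks pairwise disjointness of the attribute sets, which implies the emptiness condition verified in the protocol.

```python
from itertools import combinations

def aggregate_vertical(protected_fragments):
    """Rebuild full records from owners holding disjoint attributes of the same individuals.

    Each fragment maps a shared record identifier -> {attribute: value}.
    """
    if len(protected_fragments) < 2:
        raise ValueError("the model assumes at least two data owners")
    ids = set(protected_fragments[0])
    if any(set(frag) != ids for frag in protected_fragments):
        raise ValueError("data owners do not hold the same set of individuals")
    # Attribute set contributed by each data owner.
    schemas = [set().union(*(rec.keys() for rec in frag.values()))
               for frag in protected_fragments]
    for left, right in combinations(schemas, 2):
        if left & right:
            raise ValueError("attribute sets overlap between data owners")
    # Join the fragments on the shared identifier.
    return {rid: {attr: val
                  for frag in protected_fragments
                  for attr, val in frag[rid].items()}
            for rid in sorted(ids)}

owner_1 = {1: {"age": 30.5}, 2: {"age": 41.0}}
owner_2 = {1: {"hours": 38.2}, 2: {"hours": 44.9}}
print(aggregate_vertical([owner_1, owner_2]))
```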

5 EXPERIMENTAL EVALUATION

This section evaluates the iDP-IR-based privacy-

preserving data release system proposed for frag-

mented data. The focus is on assessing the informa-

tion loss obtained from the protected datasets using

iDP-IR in a centralized environment with different (ε,

k) pairs.

This is compared to the information loss obtained

from the aggregated protected datasets following the

horizontal and vertical fragmentation protocols pre-

sented, using the same (ε, k) pairs.

The tests were conducted on a Windows 11 Home

PC with an Intel i7-1355U CPU @5.00 GHz and 16

GB DDR5 RAM.

5.1 Evaluation Datasets

The following UCI datasets have been selected for the experimental evaluation due to their diverse attribute data types, ensuring a more robust analysis across different scenarios:

• Adult is a well-known dataset available on the

UCI Machine Learning Repository, comprising

48,842 records of census income information. Its

objective is to predict whether income exceeds

$50,000 per year. For the Adult dataset evalua-

tion, the following attributes are used: age, work-

class, education, marital status, occupation, rela-

tionship, race, sex, hours per week, and native

country. The attributes in this dataset include dis-

crete numerical values and categorical data.

A pre-processing step was carried out to remove

records with missing values. The ﬁnal dataset

used in our evaluation consists of 45,222 records.

• The Wine Quality dataset (Cortez et al., 2009)

contains Portuguese red and white wine sam-

ples, collected from protected designation of ori-

gin samples that were tested at the ofﬁcial certiﬁ-

cation entity from May 2004 to February 2007.

The dataset was originally split into two separate datasets, Red and White wine. In our study, we

combined the two samples into a single dataset

containing 6,497 records. The following attributes

are used: ﬁxed acidity, volatile acidity, citric acid,

residual sugar, chlorides, free sulfur dioxide, total

sulfur dioxide, density, pH, sulphates, and alco-

hol. This dataset includes continuous and discrete

numerical attributes.

The evaluation compares the information loss of the protected dataset $D_{base}$, which results from applying the data masking process to the original Adult and Wine Quality datasets using iDP-IR in a centralized manner, with that of the aggregated protected dataset D′ obtained using the presented protocols. In the centralized scenario, we assume a single party holds the entire dataset and an iDP-IR mechanism.

Given the composition property of iDP, we expect a comparable level of information loss between $D_{base}$ and the global protected dataset D′, which results from aggregating the protected subsets released by the data owners.

To quantify the information loss of the anonymized datasets $D_{base}$ and D′, we used the Sum of Squared Errors (SSE) in our tests. SSE represents the sum of squared distances between the original and anonymized dataset records. To provide a more informative measure, the mean SSE is computed by dividing the SSE by the number of records in the dataset:

$$\mathrm{MeanSSE} = \frac{1}{n} \sum_{h=1}^{n} \mathrm{dist}(r_h, r'_h)^2 \qquad (5)$$

To calculate the distance between the original record $r_h$ and the anonymized record $r'_h$, we considered the average of the distances between attributes. Then, for two records $r_h$ and $r'_h$ we have

$$\mathrm{dist}(r_h, r'_h) = \frac{1}{l}\left(\frac{d_{a_1}\big((r_h)_{a_1}, (r'_h)_{a_1}\big)}{\sigma^2_{a_1}} + \dots + \frac{d_{a_l}\big((r_h)_{a_l}, (r'_h)_{a_l}\big)}{\sigma^2_{a_l}}\right) \qquad (6)$$

where $d_{a_j}$ is the distance between values of attribute $a_j$, $\sigma^2_{a_j}$ is the sample variance of attribute $a_j$ in the original dataset, and $l$ is the number of attributes in $D$.
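As an illustration of Eqs. (5) and (6), the following Python sketch computes the mean SSE for purely numerical records. It assumes that the per-attribute distance is the absolute difference, which this section does not specify, and it squares the record distance following the description of SSE as a sum of squared distances. The function names `record_distance` and `mean_sse` are hypothetical.

```python
from statistics import variance


def record_distance(r, r_prime, variances):
    """Average of the per-attribute distances, each divided by the
    sample variance of that attribute in the original dataset (Eq. 6).
    Assumption: the attribute distance is the absolute difference."""
    l = len(r)
    return sum(abs(x - y) / v for x, y, v in zip(r, r_prime, variances)) / l


def mean_sse(original, anonymized):
    """Sum of squared record distances divided by the number of records (Eq. 5)."""
    n = len(original)
    # Sample variance of every attribute, taken over the original records.
    variances = [variance(column) for column in zip(*original)]
    sse = sum(record_distance(r, rp, variances) ** 2
              for r, rp in zip(original, anonymized))
    return sse / n


original = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
anonymized = [(2.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
result = mean_sse(original, anonymized)
```

Categorical attributes would need their own distance function; the sketch leaves them out.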

Centroids Computation for Categorical Attributes. For categorical attributes, we sort the attribute values based on their frequency of appearance. Clusters are then formed by selecting k contiguous values. For example, assuming the frequencies of $a_1 = 10$ and $a_2 = 15$, we have $a_1^1$ and $a_2^2$, where the superscripts indicate the order. For a cluster of 3 elements, an example may be $\{a_1^1, a_1^1, a_2^2\}$. Since it is not possible to operate directly on the values, we operate on the indices instead.

The centroid of a cluster is the value corresponding to the average index. From the previous example, $\frac{1+1+2}{3} = 1.33$. Notice that the centroid does not correspond to an actual value. To convert it into an actual value, we round the index to the nearest integer (i.e., 1 in this example). However, this rounding step is performed only after noise has been added.
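The index-based centroid described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, frequency ties are broken alphabetically, and out-of-range indices are clamped to the valid range, both of which are assumptions. Noise addition is omitted.

```python
from collections import Counter


def frequency_order(values):
    """Map each categorical value to a 1-based index, sorted by ascending
    frequency. Ties are broken alphabetically (an assumption)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (counts[v], v))
    return {value: i + 1 for i, value in enumerate(ordered)}


def cluster_centroid_index(cluster, order):
    """Average index of the values in one cluster, not yet rounded."""
    return sum(order[v] for v in cluster) / len(cluster)


def index_to_value(index, order):
    """Round a (possibly noisy) index to an integer and return the value
    stored at that index. Python's round() sends exact .5 ties to the even
    integer; clamping to the valid range is an assumption."""
    inverse = {i: v for v, i in order.items()}
    nearest = min(max(round(index), 1), len(inverse))
    return inverse[nearest]


# Two values with frequencies 10 and 15, as in the example in the text.
values = ["a"] * 10 + ["b"] * 15
order = frequency_order(values)
centroid = cluster_centroid_index(["a", "a", "b"], order)
```

Here `centroid` is the average index (1 + 1 + 2) / 3, and `index_to_value(centroid, order)` maps it back to the first value in the order.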

5.1.1 Baseline - Centralized Scenario

The mean SSE values obtained for the centralized scenario, used as a baseline for comparison, are shown in Figure 2 for the Adult dataset and in Figure 3 for the Wine Quality dataset. Each figure presents a heatmap of the mean SSE calculated for different (ε, k) pairs, where each value is the average of fifty (50) consecutive runs of iDP-IR.

The color gradient in the heatmaps runs from cream to red: the redder the cell, the larger the information loss between the original and anonymized datasets. We also included the mean SSE obtained for k = 1, which is equivalent to applying plain iDP to each attribute value, i.e., without the microaggregation step.

Notice that the privacy budgets are set to ε =

[0.1,1], a range commonly used in the literature for

robust privacy protection. Because of the difference

in the number of records between the two datasets an-

alyzed, the k values are set to [50, 2000] and [50,300]

for the Adult and Wine Quality datasets, respectively.


Figure 2: Adult iDP-IR SSE Baseline.

Figure 3: Wine Quality iDP-IR SSE Baseline.

Theoretically, for larger ε, we expect a decrease in SSE because the relaxed privacy budget requires less noise injection. Likewise, for larger k, we expect a decrease in SSE because the reduced data sensitivity requires less noise. This pattern is exactly what is observed in Figures 2 and 3.
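The expected trend can be illustrated with a rough sketch based on Laplace noise. It assumes, as a simplification not stated in this section, that the sensitivity of a centroid of k values shrinks to `delta / k`, where `delta` is a hypothetical sensitivity bound for a single attribute value. It also uses the fact that zero-mean Laplace noise with scale b has variance 2b². The function names are illustrative only.

```python
def laplace_scale(delta, epsilon, k=1):
    """Scale b of the Laplace noise for a centroid of k values,
    assuming the sensitivity of the centroid is delta / k."""
    return delta / (k * epsilon)


def expected_squared_noise(delta, epsilon, k=1):
    """Expected squared noise: the variance 2 * b**2 of zero-mean
    Laplace noise with scale b."""
    b = laplace_scale(delta, epsilon, k)
    return 2 * b ** 2


# Both a larger epsilon and a larger k reduce the expected squared noise.
loose_budget = expected_squared_noise(1.0, epsilon=1.0)
tight_budget = expected_squared_noise(1.0, epsilon=0.1)
tight_budget_k50 = expected_squared_noise(1.0, epsilon=0.1, k=50)
```

Under these assumptions the expected squared noise falls quadratically in both ε and k, which matches the direction of the trend reported for the heatmaps, though not necessarily its magnitude.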

These results now serve as the baseline for evalu-

ating the proposed iDP-IR system model. The data

fragmentation conﬁguration of the datasets is de-

scribed in detail for both scenarios, outlining how the

datasets are partitioned and distributed among multi-

ple predeﬁned parties.

Before comparing $D_{base}$ with the data fragmentation scenarios, we illustrate the advantages that iDP-IR offers in reducing information loss compared to plain iDP applied to each attribute value and to DP-IR, which applies IR microaggregation and DP rather than iDP (i.e., global sensitivity).

Evaluation of iDP and iDP-IR. Applying IR as a pre-processing step significantly reduces the information loss in data releases. This effect is depicted in Figures 2 and 3. Specifically, for plain iDP (i.e., k = 1), the information loss decreases only minimally as ε increases.

This observation indicates that, even though local

sensitivity is employed, each attribute value indepen-

dently maintains a high sensitivity to noise modiﬁca-

tions. Consequently, achieving meaningful utility ne-

cessitates higher ε values. However, this requirement

inherently results in a trade-off where the utility gains

are offset by a corresponding decrease in privacy pro-

tection.

Evaluation of DP-IR and iDP-IR. We now compare the information loss obtained with iDP-IR (local sensitivity) and DP-IR (global sensitivity), the latter as presented in (Sánchez et al., 2016). Figure 4 shows the differences between iDP-IR and DP-IR for both Adult (left) and Wine Quality (right). The values represent the difference between the mean SSE depicted in Figures 2 and 3 for iDP-IR and the mean SSE obtained using DP-IR. Notice that, for all (ε, k) pairs, iDP-IR obtains a lower SSE than DP-IR, so every difference is negative.

5.1.2 Horizontally Fragmented Datasets Configuration

In this scenario, we aim to generate unbalanced sub-

sets of data to reﬂect common situations in collab-

orative environments where organizations contribute

varying amounts of data. Additionally, a shufﬂing

mechanism has been implemented, allowing records

to be exchanged among data owners during each eval-

uation iteration. This may help us examine how

record redistribution affects information loss, assess

the robustness of the proposed iDP-IR system, and

understand the impact of data variability on evalua-

tion outcomes.

Adult. The evaluation considers four (4) data owners storing randomized subsets of the dataset in an unbalanced split: $\phi_1$ contains 11,100 records, $\phi_2$ contains 15,000 records, $\phi_3$ contains 12,000 records, and $\phi_4$ contains 7,122 records.

Wine Quality. The evaluation considers three (3) data owners storing randomized subsets of Wine Quality in an unbalanced split: $\phi_1$ contains 500 records, $\phi_2$ contains 1,000 records, and $\phi_3$ contains 4,997 records.

The domain of the attributes in this dataset has

been limited to twice the maximum attribute value.
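The unbalanced horizontal split with shuffling described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `horizontal_split` and the use of a seeded `random.Random` are hypothetical choices, and integers stand in for the actual records. Reshuffling with a different seed on each iteration would mimic the exchange of records among data owners.

```python
import random


def horizontal_split(records, sizes, seed=None):
    """Shuffle the records and deal them out to the data owners
    in the given (unbalanced) subset sizes."""
    records = list(records)
    if sum(sizes) != len(records):
        raise ValueError("sizes must add up to the number of records")
    rng = random.Random(seed)
    rng.shuffle(records)
    subsets, start = [], 0
    for size in sizes:
        subsets.append(records[start:start + size])
        start += size
    return subsets


# Adult configuration from the text: four owners, 45,222 records in total.
adult_sizes = [11_100, 15_000, 12_000, 7_122]
subsets = horizontal_split(range(sum(adult_sizes)), adult_sizes, seed=0)
```

Every record ends up with exactly one owner, so the union of the subsets is the full dataset.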

5.1.3 Vertically Fragmented Datasets Configuration

Adult. We considered ten (10) data owners, each storing the same number of records belonging to the same individuals, but with different attributes. Therefore, each data owner holds a single attribute, such that $\phi_1$ contains attribute age, $\phi_2$ contains attribute workclass, ..., $\phi_{10}$ contains native country.

Figure 4: Adult (left) and Wine Quality (right) Comparison of iDP-IR and DP-IR.

Wine Quality. The dataset’s attributes are distributed across three (3) data owners, grouping individuals’ attribute values as follows: $\phi_1$ contains attributes { fixed acidity, volatile acidity, citric acid }, $\phi_2$ contains { residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide }, and $\phi_3$ contains { density, pH, sulphates, alcohol }.

As a pre-processing step, each record is assigned an identifier prior to the split, allowing for the defragmentation of the records when κ receives the subsets of protected data from the data owners φ.
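A minimal sketch of the identifier-based vertical split and the later defragmentation is given below. The function names, the `"id"` key, and the use of the record position as identifier are assumptions made for illustration. The masking step each owner would apply before sharing is omitted.

```python
def vertical_split(records, attribute_groups):
    """Give each owner the record identifier plus its own group of attributes.
    Assumption: the identifier is the position of the record in the dataset."""
    fragments = []
    for group in attribute_groups:
        fragments.append([
            {"id": rid, **{a: record[a] for a in group}}
            for rid, record in enumerate(records)
        ])
    return fragments


def defragment(fragments):
    """Rebuild full records by joining the fragments on the shared identifier."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            attributes = {a: v for a, v in row.items() if a != "id"}
            merged.setdefault(row["id"], {}).update(attributes)
    return [merged[rid] for rid in sorted(merged)]


records = [
    {"pH": 3.1, "alcohol": 9.4, "density": 0.99},
    {"pH": 3.3, "alcohol": 10.1, "density": 0.98},
]
fragments = vertical_split(records, [["pH"], ["alcohol", "density"]])
rebuilt = defragment(fragments)
```

Because the join relies only on the identifier, the owners may send their rows in any order.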

5.1.4 Evaluation Results

Figures 5 and 6 present the results obtained for the two datasets by comparing the average mean SSE of the data fragmentation scenarios with that of the baseline (i.e., the centralized scenario) for each (ε, k) pair. Hence, each cell in the heatmap is computed as the difference between the average obtained from the data fragmentation scenarios and the baseline average (as presented in Figures 2 and 3).

The closer a cell is to cream, the closer the difference between the mean SSE of the centralized and data fragmentation scenarios is to 0. Cells where the mean SSE is lower than the baseline are also set to 0 (i.e., cream cells). The redder the cell, the larger the information loss of the data fragmentation scenario relative to the baseline.

Notice that the mean SSE for the data fragmentation configurations is also calculated as the average over fifty (50) runs.
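The way a single heatmap cell is derived can be sketched as follows. The function name `heatmap_cell` and the input lists are hypothetical; each list is assumed to hold the mean SSE of the individual runs for one (ε, k) pair.

```python
def heatmap_cell(fragmented_runs, baseline_runs):
    """Difference between the average mean SSE of the fragmented scenario
    and the baseline average, with negative differences set to 0."""
    fragmented_avg = sum(fragmented_runs) / len(fragmented_runs)
    baseline_avg = sum(baseline_runs) / len(baseline_runs)
    return max(0.0, fragmented_avg - baseline_avg)


# Fragmented scenario slightly worse than the baseline: positive cell value.
worse = heatmap_cell([1.2, 1.4], [1.0, 1.0])
# Fragmented scenario better than the baseline: the cell is clipped to 0.
better = heatmap_cell([0.8, 0.9], [1.0, 1.0])
```

Clipping hides cases where fragmentation happens to do better than the baseline, so cream cells mean "no worse than centralized" rather than "identical".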

Adult. Overall, the differences between the mean

SSE averages for the horizontally fragmented data

scenarios are minimal compared to the baseline for

low k values (i.e., k = [50,250]). However, some in-

formation loss can be observed for larger k values,

where it increases compared to the baseline. This

behavior may occur for data owners with a lower

number of records to share because microaggregation

depends on data distribution, data homogeneity, and

group size, which can result in less optimal data clus-

tering for subsets of records.

In this particular case, considering that Adult

mostly contains categorical attributes (a type that has

no natural order and has been artiﬁcially sorted based

on data distribution), the results are prone to larger

differences with respect to the baseline, as the sorted

values depend on the actual distribution on each data

owner.

For vertical fragmentation, the variation with re-

spect to the baseline is negligible, with the maximum

differences obtained being around 0.04, as shown in

Figure 5. Parameter k does not signiﬁcantly affect the

resulting outcome in this scenario because the entire

set of values for each attribute remains the same as in

the baseline. This occurs because IR microaggrega-

tion treats each attribute independently. Hence, a cor-

rect implementation of the iDP-IR composition prop-

erty for this type of fragmentation results in similar

information loss compared to the centralized baseline.
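The argument that vertical fragmentation leaves IR microaggregation unchanged can be illustrated with a small sketch for numerical attributes. It is a simplified illustration, not the authors' implementation: the function name is hypothetical, the last cluster absorbs any remainder (an assumption), and noise addition is omitted.

```python
def ir_microaggregate(values, k):
    """Individual-ranking microaggregation of one numerical attribute.
    Values are sorted, grouped into clusters of k consecutive values
    (the last cluster absorbs any remainder), and each value is
    replaced by the mean of its cluster."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_clusters = max(1, len(values) // k)
    result = [None] * len(values)
    for c in range(n_clusters):
        start = c * k
        end = (c + 1) * k if c < n_clusters - 1 else len(values)
        members = order[start:end]
        centroid = sum(values[i] for i in members) / len(members)
        for i in members:
            result[i] = centroid
    return result


table = [(39, 40), (50, 13), (38, 40), (53, 40)]
# Centralized: one party microaggregates every column of the full table.
centralized = [ir_microaggregate([row[j] for row in table], 2) for j in range(2)]
# Vertical fragmentation: each owner holds one full column and processes it alone.
owner_columns = [[row[0] for row in table], [row[1] for row in table]]
fragmented = [ir_microaggregate(column, 2) for column in owner_columns]
```

Since each owner sees exactly the same column as the centralized party, the clusters and centroids coincide before noise is added. With k = 1 every value forms its own cluster and is left unchanged.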

Wine Quality. For the Wine Quality dataset, the horizontal fragmentation scenario provides robust results, similar to those of the vertical fragmentation scenario, as both show similar differences with respect to the centralized outputs. Notice that for this dataset, most of the cells in the heatmap are cream. As such, the results are either as good as in the centralized scenario or slightly better due to the random nature of the noise addition. For ε = {0.01, 0.1}, where the mean SSE in the centralized scenario is in the thousands, the differences with respect to both data fragmentation scenarios are below 300. For ε = {0.5, 1}, we mostly observe slightly different mean SSE values, indicating that neither horizontal nor vertical splitting significantly impacts IR clustering for this dataset.


Figure 5: Adult SSE comparison between the iDP-IR data fragmentation scenarios and the centralized scenario.

Figure 6: Wine Quality SSE comparison between the iDP-IR data fragmentation scenarios and the centralized scenario.

6 CONCLUSIONS

This work proposes a distributed data release model in

which a set of parties is willing to release a protected

dataset to an aggregator. Following an honest-but-

curious model, data owners may only share data us-

ing iDP-IR as a utility-preserving privacy mechanism,

which relies on the application of IR microaggrega-

tion as a pre-processing step to iDP, thereby reduc-

ing data sensitivity to noise injection. Consequently, a

lesser amount of noise injection is required to achieve

ε-iDP datasets.

The results from the experimental scenarios for both horizontally and vertically fragmented data show that distributed iDP-protected datasets can achieve information loss similar to that obtained in a centralized setting for the same (ε, k) pairs. This is achieved thanks to the composition properties that iDP inherits from standard DP.

ACKNOWLEDGEMENTS

This research was funded by the European Com-

mission (project H2020-871042 “SoBigData++”), the

Government of Catalonia (ICREA Acadèmia Prize to D. Sánchez), MCIN/AEI/10.13039/501100011033

and “ERDF A way of making Europe” under

grant PID2021-123637NB-I00 “CURLING”, and by

Project HERMES, funded by the European Union

NextGenerationEU/PRTR via INCIBE.

REFERENCES

Alhadidi, D., Mohammed, N., Fung, B., and Debbabi, M. (2012). Secure distributed framework for achieving ε-differential privacy. In Fischer-Hübner, S. and Wright, M., editors, Privacy Enhancing Technologies, pages 120–139, Berlin, Heidelberg. Springer Berlin Heidelberg.

Bambauer, J., Muralidhar, K., and Sarathy, R. (2013).

Fool’s Gold: An Illustrated Critique of Differential

Privacy. Arizona Legal Studies Discussion Paper No.

13-47.

Blanco-Justicia, A., Sánchez, D., Domingo-Ferrer, J., and Muralidhar, K. (2022). A critical review on the use (and misuse) of differential privacy in machine learning. ACM Comput. Surv., 55(8).

Bun, M. and Steinke, T. (2016). Concentrated differen-

tial privacy: Simpliﬁcations, extensions, and lower

bounds. http://arxiv.org/abs/1605.02065.

Cheng, X., Tang, P., Su, S., Chen, R., Wu, Z., and Zhu, B.

(2020). Multi-party high-dimensional data publish-

ing under differential privacy. IEEE Transactions on

Knowledge and Data Engineering, 32(8):1557–1571.

Clifton, C. and Tassa, T. (2013). On syntactic anonymity and differential privacy. In 2013 IEEE 29th International Conference on Data Engineering Workshops (ICDEW), pages 88–93. IEEE.

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis,

J. (2009). Modeling wine preferences by data mining

from physicochemical properties. Decision Support

Systems, 47(4):547 – 553.

Cummings, R., Desfontaines, D., Evans, D., Geambasu,

R., Huang, Y., Jagielski, M., Kairouz, P., Ka-

math, G., Oh, S., Ohrimenko, O., Papernot, N.,

Rogers, R., Shen, M., Song, S., Su, W., Terzis, A.,

Thakurta, A., Vassilvitskii, S., Wang, Y.-X., Xiong,

L., Yekhanin, S., Yu, D., Zhang, H., and Zhang,

W. (2024). Advancing Differential Privacy: Where

We Are Now and Future Directions for Real-World

Deployment. Harvard Data Science Review, 6(1).

https://hdsr.mitpress.mit.edu/pub/sl9we8gh.

Dwork, C. (2006). Differential Privacy, pages 1–12.

Springer Berlin Heidelberg, Berlin, Heidelberg.

Dwork, C. (2008). Differential Privacy: A Survey of Re-

sults, pages 1–19. Springer Berlin Heidelberg, Berlin,

Heidelberg.

Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., and

Naor, M. (2006). Our data, ourselves: Privacy via

distributed noise generation. In Annual International

Conference on the Theory and Applications of Cryp-

tographic Techniques, pages 486–503. Springer.

Dwork, C., Naor, M., Reingold, O., Rothblum, G. N., and

Vadhan, S. (2009). On the complexity of differentially

private data release: efﬁcient algorithms and hardness

results. In Proceedings of the Forty-First Annual ACM

Symposium on Theory of Computing, STOC ’09, page

381–390, New York, NY, USA. Association for Com-

puting Machinery.

Friedman, A. and Schuster, A. (2010). Data mining

with differential privacy. In Proceedings of the 16th

ACM SIGKDD International Conference on Knowl-

edge Discovery and Data Mining, KDD ’10, pages

493–502, New York, NY, USA. ACM.

Ghazi, B., Hu, X., Kumar, R., and Manurangsi, P. (2023).

Differentially private data release over multiple tables.

In Proceedings of the 42nd ACM SIGMOD-SIGACT-

SIGAI Symposium on Principles of Database Systems,

PODS ’23, page 207–219, New York, NY, USA. As-

sociation for Computing Machinery.

Golle, P. (2006). Revisiting the uniqueness of simple de-

mographics in the us population. In Proceedings of

the 5th ACM Workshop on Privacy in Electronic So-

ciety, WPES ’06, pages 77–80, New York, NY, USA.

ACM.

Domingo-Ferrer, J., Sánchez, D., and Blanco-Justicia, A. (2021). The limits of differential privacy (and its misuse in data release and machine learning). Communications of the ACM, 64(7):33–35.

Joyce, M. S. and Nirmalrani, V. (2015). Privacy in horizon-

tally distributed databases based on association rules.

In 2015 International Conference on Circuits, Power

and Computing Technologies [ICCPCT-2015], pages

1–6.

Mironov, I. (2017). Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275.

Mohammed, N., Alhadidi, D., Fung, B. C., and Debbabi,

M. (2014). Secure two-party differentially private

data release for vertically partitioned data. IEEE

Transactions on Dependable and Secure Computing,

11(1):59–71.

Samarati, P. and Sweeney, L. (1998). Protecting privacy

when disclosing information: k-anonymity and its en-

forcement through generalization and suppression.

Sánchez, D., Domingo-Ferrer, J., Martínez, S., and Soria-Comas, J. (2016). Utility-preserving differentially private data releases via individual ranking microaggregation. Information Fusion, 30:1–14.

Sauer, B. and Hao, W. (2015). Horizontal cloud database

partitioning with data mining techniques. In 2015

12th Annual IEEE Consumer Communications and

Networking Conference (CCNC), pages 796–801.

Soria-Comas, J., Domingo-Ferrer, J., Sánchez, D., and Megías, D. (2017). Individual differential privacy: A utility-preserving formulation of differential privacy guarantees. IEEE Transactions on Information Forensics and Security, 12(6):1418–1429.

Sweeney, L. (2000). Simple demographics often iden-

tify people uniquely. Health (San Francisco),

671(2000):1–34.

Tang, P., Cheng, X., Su, S., Chen, R., and Shao, H. (2021).

Differentially private publication of vertically parti-

tioned data. IEEE Transactions on Dependable and

Secure Computing, 18(2):780–795.

Vaidya, J. (2008). A Survey of Privacy-Preserving Meth-

ods Across Vertically Partitioned Data, pages 337–

358. Springer US, Boston, MA.

Wang, H. and Xu, Z. (2017). Cts-dp: Publishing correlated

time-series data via differential privacy. Knowledge-

Based Systems, 122:167 – 179.

Wang, R., Fung, B. C., Zhu, Y., and Peng, Q. (2021). Dif-

ferentially private data publishing for arbitrarily parti-

tioned data. Information Sciences, 553:247–265.

Yang, M., Guo, T., Zhu, T., Tjuawinata, I., Zhao, J., and

Lam, K. (2024). Local differential privacy and its ap-

plications: A comprehensive survey. Computer Stan-

dards & Interfaces, 89:103827.
