A Federated K-Means-Based Approach in eHealth Domains with

Heterogeneous Data Distributions

Giovanni Paragliola

1 a

, Patrizia Ribino

2 b

and Maria Mannone

2,3 c

National Research Council (CNR), Institute for High-Performance Computing and Networking (ICAR), Naples, Italy

National Research Council (CNR), Institute for High-Performance Computing and Networking (ICAR), Palermo, Italy

Institute of Physics and Astronomy, University of Potsdam, Potsdam, Germany

{giovanni.paragliola, patrizia.ribino, maria.mannone}@icar.cnr.it

Keywords:

Federated Clustering, Healthcare, Heterogeneous Data Distribution.

Abstract:

Healthcare organizations collect and store signiﬁcant amounts of patient health information. However, sharing

or accessing this information outside of their facilities is often hindered by factors such as privacy concerns.

Federated Learning (FL) data systems are emerging to overcome the siloed nature of health data and the

barriers to sharing it. While federated approaches have been extensively studied, especially in classiﬁcation

problems, clustering-oriented approaches are still relatively few and less widespread, both in formulating al-

gorithms and in their application in eHealth domains. The primary objective of this paper is to introduce a

federated K-means-based approach for clustering tasks within the healthcare domain and explore the impact of

heterogeneous health data distributions. The evaluation of the proposed federated K-means approach has been

conducted on several health-related datasets through comparison with the centralized version and by estimating

the trade-off between privacy and performance. The preliminary ﬁndings suggest that in the case of hetero-

geneous health data distributions, the difference between the centralized and federated approach is marginal,

with the federated approach outperforming the centralized one on some healthcare datasets.

1 INTRODUCTION

Healthcare organizations typically gather large

amounts of patient health data. Multi-centre research

plays a crucial role in developing machine learning

(ML) algorithms for real-world scenarios. However,

various factors hinder the dissemination or retrieval

of this information outside the organization, such

as privacy concerns (Bonawitz et al., 2021). The

Health Insurance Portability and Accountability Act

(HIPAA) (Act, 1996) and the General Data Protection

Regulation (GDPR) (Voigt and Von dem Bussche,

2017) have established regulations that restrict the ex-

change of electronic health records (EHRs) between

stakeholders and healthcare providers without patient

consent (Sheller et al., 2020). Protecting conﬁdential

medical information while leveraging the collective

knowledge of healthcare facilities presents a complex

and demanding challenge (Dhade and Shirke, 2024).

Federated Learning (FL) systems are emerging as

https://orcid.org/0000-0003-3580-9232

https://orcid.org/0000-0003-3266-9617

https://orcid.org/0000-0003-3606-3436

a promising solution to overcome the siloed nature of

health data and the associated barriers to sharing them

(Bharati et al., 2022). FL enables the decentralized

training of ML models without transferring medical

data to a central server. Each healthcare institution

is a client node, which independently trains its model

and transmits it to a central server for aggregation.

A global model is formed by integrating local mod-

els from all nodes, which are then disseminated to the

nodes by a centralized server responsible for coordi-

nating and aggregating the models.

However, while federated approaches have been

extensively studied, mainly in the context of classiﬁ-

cation problems (Marulli et al., 2021a; Marulli et al.,

2021b), clustering-oriented approaches remain rela-

tively scarce and less widespread. The formulation

of algorithms tailored explicitly for federated cluster-

ing and their application in e-health domains are areas

that are still in their early stages of exploration.

Despite the beneﬁts of federated learning, such as

preserving privacy and reducing communication over-

head, a more comprehensive investigation is required

to fully understand and leverage the advantages of

federated clustering in healthcare domains. As the

550

Paragliola, G., Ribino, P. and Mannone, M.

A Federated K-Means-Based Approach in eHealth Domains with Heterogeneous Data Distributions.

DOI: 10.5220/0012981200003837

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 16th International Joint Conference on Computational Intelligence (IJCCI 2024), pages 550-559

ISBN: 978-989-758-721-4; ISSN: 2184-3236

healthcare landscape evolves toward more collabora-

tive and data-driven models, addressing these issues

will become increasingly critical to realizing the full

potential of federated learning to improve patient out-

comes and advance medical research.

Furthermore, when investigating methodologies

for distributed clustering, it is imperative to con-

sider the distinct challenges associated with federated

learning, such as data heterogeneity, meaning that

data are not uniformly distributed among the partic-

ipants. This problem, known as the non-independent

and identically distributed (non-IID) data challenge

(Wahab et al., 2021), occurs when the local data of

individual clients does not accurately reﬂect the en-

tire dataset due to heterogeneous class imbalances,

distribution variations, and data size. These chal-

lenges can lead to signiﬁcant performance degrada-

tion in federated learning models because traditional

machine learning algorithms assume uniform data

distribution across clients. In particular, local updates

may diverge during model aggregation, resulting in

slow convergence or even degraded performance in

the global model. To address these issues, various

strategies have been proposed, including methods for

grouping clients with similar data distributions. These

weighted aggregation schemes assign importance to

clients based on data size or distribution and ﬁne-

tuning global models on local data.

This paper presents a study focused on evaluating

a federated clustering approach in the e-health domain

in non-IID scenarios. Speciﬁcally, the primary objec-

tive is to introduce a federated K-means-based (FKM)

approach for clustering within the healthcare domain

and explore the impact of heterogeneous data distri-

butions. The evaluation of the proposed approach

has been conducted on several health-related datasets

through comparison with the centralized version and

by estimating the trade-off between privacy and per-

formance. The approach was evaluated mainly un-

der three different data distribution scenarios to more

accurately assess the impact of data heterogeneity

across clients on the federated model. These scenar-

ios range from independent and identically distributed

data across clients to a non-independent and iden-

tically distributed scenario, within which we distin-

guish soft and hard heterogeneity.

The preliminary ﬁndings suggest that in the case

of heterogeneous health data distributions, the differ-

ence between the centralized and federated approach

is marginal, with the federated approach outperform-

ing the centralized one on some healthcare datasets.

The paper’s contribution can be summarized into

two main points: 1) Establish a federated K-means

approach to evaluate the effectiveness of clustering

models in the federated eHealth domain; 2) Inves-

tigate how the hypothesis of heterogeneous health

data distribution affects the convergence of local and

global models.

2 RELATED WORKS

Centralized clustering approaches require storing and

accessing all raw data from a single central node. The

K-means clustering algorithm, introduced more than

six decades ago, continues to be widely favored and

utilized in contemporary research and practice.

Federated clustering is a framework within the

ﬁeld of federated learning whose objective is to clus-

ter data distributed across multiple devices or loca-

tions while preserving privacy and data security. The

process involves aggregating local data points that

exhibit global similarity. The distributed nature of

the data points allows for their clustering based on

a global similarity measure, as they are distributed

among multiple clients. Notably, the data remains lo-

cal on client devices despite the clustering process.

Based on our current understanding, a limited body

of literature is devoted to investigating this issue.

Federated K-means clustering can be used to per-

form unsupervised learning (for a variable number of

clusters between centers), also clustering on multiple

datasets avoiding sharing the underlying data (Garst

and Reinders, 2024). The idea of avoidance of lo-

cal, sensitive data sharing is also used in the algo-

rithm for Federated, Fair, and Fast K-means (F

KM)

(Zhu et al., 2023). With this approach, the K-means

are efﬁciently solved in vertical FL. The overall prob-

lem is decomposed into multiple sub-problems that

are solved at the level of single clients. Thus, the

clients only transmit their results to the server rather

than the original sensitive data. Clustering precision

may be enhanced by exploiting local data stored on

each client to help deﬁne better initial centroids (Yang

et al., 2024). Deﬁning the initial centroids directly

at the server level results in the initial features be-

ing devoid of insights derived from the clients. Then,

the server ﬁnds the initial global centroids, feeding a

greedy algorithm with the client-based information.

To improve the ﬁrst step, the clients share the data

point number for each centroid, allowing better parti-

tioning. The problem of privacy and security in FL

is addressed in (Pedrycz, 2022) using fuzzy logic,

and in particular, a problem of unsupervised feder-

ating learning is solved via fuzzy sets based on fed-

erated clustering, named Fuzzy C-Means Federated

(F-FCM). In (Pedrycz, 2022), an objective function is

proposed. Clusters are formed via a partition function

A Federated K-Means-Based Approach in eHealth Domains with Heterogeneous Data Distributions

551

that assigns points to them, and a set of cluster proto-

types is updated according to the exchanges between

the server and clients. Another fuzzy approach is used

for multistep federated clustering, validating clusters

through the DB index (Stallmann and Wilbik, 2022).

An alternative methodology addressing the pri-

vacy concerns associated with FKM within proac-

tive caching for cellular networks is put forth by (Liu

et al., 2020). The base stations collect users’ data,

which may cause an information leakage. The so-

lution of (Liu et al., 2020) uses privacy-preserving

federated K-means and secret-sharing protocols. The

passage of secret sharing is needed to avoid recon-

structing users’ data from the shared gradients up-

loaded to the macro-cell base station to update cluster

centroids. Splitting the gradient into random shared,

its reconstruction is only made possible when a mini-

mal number of the shared is collected, thus guarantee-

ing the privacy of user data. Moreover, secure aggre-

gation and homomorphic encryption strategies mak-

ing K-means more robust concerning non-IID data

(nonindependent and identically distributed) are ad-

dressed in (Brand

ao et al., 2021).

The current study introduces a novel K-means ap-

proach based on a global centroids similarity measure

to improve the balance between maintaining privacy

and achieving effective clustering. We start with the

standard K-means clustering steps and add modiﬁca-

tions to integrate them with federated learning. Addi-

tionally, our study explores the potential inﬂuence of

the heterogeneous health data distribution on the con-

vergence of local and global models using real-world

healthcare datasets.

3 MATERIALS AND METHODS

3.1 Federated K-Means Approach

Beyond the aforementioned advantage concerning

privacy issues, we focus on federated clusters for

their multiple beneﬁts. To name but a few, we

mention their scalability efﬁciency (no need to store

and process the whole dataset in the same location),

data control (to remain in possession of individual

clients), and communication efﬁciency (reduction of

the amount of shared data). Other advantages in-

clude decreased large-scale data breaches, real-time

latency reduction, and local pattern discovery. This

last element is particularly relevant in medical ap-

plications. The investigation of heterogeneously-

distributed medical datasets is, in fact, the primary

motivation of our research.

The main steps of the federated K-means algo-

rithm presented here are the following: In the ﬁrst

phase, (Broadcast Parameters), the central server pro-

vides the different clients with training parameters,

including the number of clusters to set for each lo-

cal model, the maximum number of iterations, and

the centroid initialization method. After this phase,

each client independently performs clustering by ap-

plying the K-means algorithm to its data. Each client

transmits its set of centroids to the central server (Col-

lect Local Centroids), randomly selecting a set of cen-

troids from one of the n clients (Random Selection

of Centroid C

∗

), hereinafter referred to as C

∗

, indi-

cating the initialization set of centroids to deﬁne the

global centroids. Indeed, the global centroids deﬁ-

nition needs an initialization of centroids’ selections

as the local K-means; the selection of C

∗

addresses

this requirement. Then, for each centroid c belong-

ing to C

∗

, the central server creates a set of centroids

belonging to other clients closest to cosine similarity

according to the following formula:

∀c ∈ C

∗

, S

= min

(cosine(c, c

i j

)), (1)

where C

∗

is the initial set of centroids for the deﬁni-

tion of global centroids, c is the generic centroid be-

longing to C

∗

, and S

is the set of the closer centroids

to c belonging to C

, with C

̸= C

∗

After identifying for each centroid c ∈ C

∗

the cor-

respondence set S

, the global centroids are computed

as the average (Calculate global centroids). At this

point, local centroids are evaluated and sent to local

nodes (Broadcast global centroids), which repeat the

local clustering until the rounds are completed.

3.2 Datasets

The following four publicly available datasets and a

synthetic dataset are used to evaluate the FKM model:

• OASIS2 (Battineni et al., 2019) is a dataset re-

lated to Alzheimer’s disease (AD) containing pa-

tient sociodemographic characteristics and clini-

cal variables. Size: 354 samples x 12 features.

• Heart Disease (Abid Ali Awan, 2021) is a dataset

with features of heart disease patients. Size: 303

samples x 11 features.

• Obesity (Palechor and De la Hoz Manotas, 2019)

is a dataset containing data to estimate obesity in

patients according to their eating habits and phys-

ical condition. Size: 2111 samples x 17 features.

• Breast Cancer (W. H. Wolberg and Mangasarian,

2017) is a dataset of digitized images of ﬁne nee-

dle aspirates from breast masses. Size: 569 sam-

ples x 30 features.

NCTA 2024 - 16th International Conference on Neural Computation Theory and Applications

552

• Custom is a synthetic dataset. Size: 500 samples

x 10 features.

3.3 Data Distribution Scenarios

The data contained in each dataset have been used to

deﬁne three scenarios corresponding to three differ-

ent data distributions. Speciﬁcally, we deﬁne a uni-

form distribution, where the data are uniformly dis-

tributed between clients; a soft-heterogeneous distri-

bution, which exhibits a slight heterogeneity of data

distribution; and ﬁnally, a hard-heterogeneous distri-

bution, which is characterized by a pronounced het-

erogeneity of data distribution. Each scenario repre-

sents a different usage context for our model and will

allow us to evaluate its performance under realistic

and varied conditions. Figure 1 shows the architec-

ture used for the Federated Clustering. It is delineated

by two primary nodes: local nodes, also known as

client nodes, and server nodes, which act as aggrega-

tors. Without loss of generality, we assume to work

with three client nodes and one aggregator node.

Figure 1: Federated architecture with 3 client nodes.

The clients emulate three entities where data are

stored and are subject to mobility constraints to facili-

tate machine-learning training. The aggregator serves

as the focal point for the aggregation process, receiv-

ing the local centroids generated by the clients during

each round of communication.

In each of the three considered scenarios, each

client stores a data portion according to a type of het-

erogeneous data distribution (HDD), as follows: (i)

Homogeneous (H)- Each client has the same number

of samples; (ii) Soft-Heterogeneous (SH)- Each sam-

ple in the dataset is randomly assigned to one of the 3

clients according to a uniform distribution; (iii) Hard-

Heterogeneous (HH)- First, 80% of each dataset is

equally distributed among all clients. Then, half of the

clients are randomly selected to receive the remaining

20% of the data, again using a uniform distribution.

A sample allocation example is shown in Table 1.

It is worth introducing the hypothesis underlying

Table 1: Client’s samples for each dataset.

Client Datatet H SH HH

1 OASIS2 118 108 165

2 OASIS2 118 138 98

3 OASIS2 118 108 91

1 Heart Disease 101 105 146

2 Heart Disease 101 105 63

3 Heart Disease 101 93 94

1 Obesity 703 664 1003

2 Obesity 703 731 568

3 Obesity 703 716 540

1 Breast Cancer 189 176 238

2 Breast Cancer 189 166 135

3 Breast Cancer 189 158 127

1 Custom 166 173 257

2 Custom 166 201 173

3 Custom 166 195 139

our proposed approach. Since we are deﬁning our use

case in a federated environment with multiple clients,

each client performs independent local clustering on

its data. To compare clustering results across clients,

we assume that the probability distributions on the lo-

cal clients are identical. This assumption implies that

the underlying probability distributions governing the

data points are consistent across clients:

(x) = F

(x) ∀x ∈ X with X ̸= Y, (2)

where F is the probability distribution function, and

X and Y are two generic clients. By assuming the

same probability distribution, the comparison of the

clusters produced by different clients becomes mean-

ingful. Consequently, we can evaluate the effective-

ness and consistency of the clustering algorithms em-

ployed across the distributed system.

4 COMPARISON BETWEEN

CENTRALISED AND

FEDERATED K-MEANS

We perform a comparative analysis from two perspec-

tives to evaluate the proposed federated K-means ap-

proach. First, we compare the performance of the two

approaches using widely known metrics for unsuper-

vised techniques, assessing the quality and effective-

ness of the clustering algorithms. Then, we compare

the two approaches considering cluster composition

and the similarity of items within clusters.

Experiments were conducted to evaluate different

cluster conﬁgurations, with cluster numbers ranging

from two to ﬁve. A random initialization was used

alongside a max iteration parameter set at 100.

A Federated K-Means-Based Approach in eHealth Domains with Heterogeneous Data Distributions

553

Table 2: Overview of the Performance Results - Hard Heterogeneous Distribution.

Dataset Clusters Approach Silhouette

Calinski

Harabasz

Davies

Bouldin

Approach Silhouette

Calinski

Harabasz

Davies

Bouldin

oasis2 2 Centralized 0.19 76.40 2.00 Federated 0.19 73.63 2.06

oasis2 3 Centralized 0.20 71.87 1.80 Federated 0.26 93.52 1.58

oasis2 4 Centralized 0.26 87.22 1.53 Federated 0.40 141.77 1.27

oasis2 5 Centralized 0.44 143.20 1.13 Federated 0.61 238.33 1.05

ObesityDataSet 2 Centralized 0.19 277.98 2.53 Federated 0.11 225.02 2.92

ObesityDataSet 3 Centralized 0.12 228.80 2.37 Federated 0.34 174.66 1.71

ObesityDataSet 4 Centralized 0.13 283.36 2.13 Federated 0.16 349.53 2.08

ObesityDataSet 5 Centralized 0.17 327.37 1.94 Federated 0.25 467.40 1.75

heart disease 2 Centralized 0.17 63.26 2.11 Federated 0.19 68.36 2.03

heart disease 3 Centralized 0.19 55.59 1.87 Federated 0.20 62.43 1.93

heart disease 4 Centralized 0.17 182.40 2.01 Federated 0.43 123.97 1.32

heart disease 5 Centralized 0.45 121.40 1.20 Federated 0.63 209.60 0.96

custom 2 Centralized 0.59 726.53 0.63 Federated 0.60 731.56 0.62

custom 3 Centralized 0.76 2992.18 0.35 Federated 0.56 1748.00 0.80

custom 4 Centralized 0.59 2343.33 0.84 Federated 0.66 2597.46 0.73

custom 5 Centralized 0.56 2344.11 0.78 Federated 0.68 2936.42 0.77

Breast Cancer 2 Centralized 0.34 267.69 1.32 Federated 0.30 260.51 1.33

Breast Cancer 3 Centralized 0.30 208.42 1.41 Federated 0.29 219.32 1.44

Breast Cancer 4 Centralized 0.32 212.07 1.20 Federated 0.32 220.81 1.29

Breast Cancer 5 Centralized 0.41 268.48 0.99 Federated 0.52 354.95 1.01

4.1 Performance Analysis

Tables 2, 3, and 4 provide an overview of the per-

formance achieved with the centralized and federated

approaches across three data distribution scenarios.

Each row represents a separate run with different set-

tings in terms of dataset and number of clusters, so

each run was evaluated with both centralized and fed-

erated approaches. We consider here, as clustering-

quality indexes, the Calinski-Harabasz (CH), the Sil-

houette, and the Davies-Bouldin ones, respectively.

CH provides information on how well-separated and

compact the clusters are. The lower the CH, the better.

The silhouette index summarizes clusters’ cohesion

(proximity of points inside the same cluster) and sep-

aration (distance between clusters). Its values range

between -1 and 1, where -1 indicates a misclassiﬁ-

cation. The higher the value, the better. Finally, the

Davies-Bouldin (DB) index tells us how similar the

clusters are, with intra-cluster dispersion and inter-

cluster separation. In this case, the lower the value,

the better. Focusing on hard-heterogeneous data dis-

tribution (Table 2), we notice an improvement of all

the indices while considering the federated clustering

method against the clustering performed classically.

The improvement is still evident for data soft distri-

bution (Table 4) and less noticeable for heterogeneous

data distribution (Table 3).

The data reported in the tables represent averages

calculated from multiple training runs. Performance

was evaluated using three metrics: 1) the Silhouette

score utilized to assess the cohesion and separation of

clusters; it ranges from -1 to 1, with higher scores in-

dicating better clustering performance. 2) the Davies-

Bouldin index (DBI) that measures the ratio of within-

cluster distances to between-cluster distances; a lower

score indicates better clustering quality, and 3) the

Calinski-Harabasz index (CHI) evaluates the ratio of

between-cluster dispersion to within-cluster disper-

sion; a higher score indicates better-deﬁned clusters.

From a high-level analysis, the results clearly

show that the centralized approach does not consis-

tently outperform the federated approach; on the con-

trary, federated approaches are proving to be highly

competitive. Looking at the results in Table 2 for the

hard distribution scenario, it is noticeable that feder-

ated approaches outperform centralized approaches in

14 out of 20 reported runs. Here, ’outperform’ means

that federated approaches give better results in at least

2 out of 3 metrics.

For H and SH scenarios (see tables 3 and 4), cen-

tralized approaches are more effective in 11 out of 20

runs and 13 out of 20 runs, respectively.

To further support the competitiveness of the ap-

proach, the average performance gap between the

centralized and federated approaches is generally

modest across all scenarios. Here, the gap is cal-

culated as the average difference between the corre-

sponding pairs of metrics for each run, considering

NCTA 2024 - 16th International Conference on Neural Computation Theory and Applications

554

Table 3: Overview of the Performance Results - Homogeneous Distribution.

Dataset Clusters Approach Silhouette

Calinski

Harabasz

Davies

Bouldin

Approach Silhouette

Calinski

Harabasz

Davies

Bouldin

oasis2 2 Centralized 0.19 76.40 2.00 Federated 0.19 76.12 1.99

oasis2 3 Centralized 0.18 74.28 1.78 Federated 0.18 74.21 1.78

oasis2 4 Centralized 0.18 67.19 1.64 Federated 0.17 65.48 1.73

oasis2 5 Centralized 0.15 60.67 1.69 Federated 0.15 58.98 1.76

ObesityDataSet 2 Centralized 0.19 277.98 2.53 Federated 0.10 201.84 2.94

ObesityDataSet 3 Centralized 0.14 244.42 2.32 Federated 0.16 163.43 2.28

ObesityDataSet 4 Centralized 0.15 226.26 2.11 Federated 0.14 180.76 1.90

ObesityDataSet 5 Centralized 0.13 222.67 1.96 Federated 0.15 141.32 2.12

heart disease 2 Centralized 0.17 63.26 2.11 Federated 0.17 63.22 2.1

heart disease 3 Centralized 0.14 47.53 2.11 Federated 0.13 47.28 2.22

heart disease 4 Centralized 0.14 43.60 2.04 Federated 0.13 41.07 2.14

heart disease 5 Centralized 0.12 38.28 2.07 Federated 0.13 37.98 2.05

custom 2 Centralized 0.59 726.53 0.63 Federated 0.52 485.43 0.79

custom 3 Centralized 0.75 2918.89 0.36 Federated 0.67 2408.70 0.66

custom 4 Centralized 0.53 2048.49 1.39 Federated 0.54 2043.10 1.38

custom 5 Centralized 0.34 1618.97 1.95 Federated 0.45 1599.02 1.64

Breast Cancer 2 Centralized 0.34 267.69 1.32 Federated 0.35 267.44 1.30

Breast Cancer 3 Centralized 0.31 197.11 1.53 Federated 0.31 196.93 1.53

Breast Cancer 4 Centralized 0.27 158.68 1.51 Federated 0.20 156.09 1.76

Breast Cancer 5 Centralized 0.16 140.16 1.76 Federated 0.16 137.99 1.81

Table 4: Overview of the Performance Results - Soft Distribution.

Dataset Clusters Approach Silhouette

Calinski

Harabasz

Davies

Bouldin

Approach Silhouette

Calinski

Harabasz

Davies

Bouldin

oasis2 2 Centralized 0.19 76.40 2.00 Federated 0.19 76.37 2.00

oasis2 3 Centralized 0.18 74.28 1.78 Federated 0.19 74.22 1.77

oasis2 4 Centralized 0.18 67.19 1.64 Federated 0.17 65.00 1.70

oasis2 5 Centralized 0.15 60.67 1.69 Federated 0.16 59.16 1.72

ObesityDataSet 2 Centralized 0.19 277.98 2.53 Federated 0.19 277.76 2.53

ObesityDataSet 3 Centralized 0.14 244.42 2.32 Federated 0.15 228.32 2.33

ObesityDataSet 4 Centralized 0.1 5 226.26 2.11 Federated 0.12 223.13 2.21

ObesityDataSet 5 Centralized 0.13 222.67 1.96 Federated 0.15 213.95 2.07

heart disease 2 Centralized 0.17 63.26 2.11 Federated 0.17 63.24 2.11

heart disease 3 Centralized 0.14 47.53 2.11 Federated 0.14 47.18 2.25

heart disease 4 Centralized 0.14 43.60 2.04 Federated 0.14 42.63 2.04

heart disease 5 Centralized 0.12 38.28 2.07 Federated 0.13 38.35 2.03

custom 2 Centralized 0.59 726.53 0.63 Federated 0.56 625.57 0.69

custom 3 Centralized 0.75 2918.89 0.36 Federated 0.75 2918.89 0.36

custom 4 Centralized 0.53 2048.49 1.39 Federated 0.50 1597.72 1.45

custom 5 Centralized 0.34 1618.97 1.95 Federated 0.48 1594.48 1.54

Breast Cancer 2 Centralized 0.34 267.69 1.32 Federated 0.34 267.65 1.31

Breast Cancer 3 Centralized 0.31 197.11 1.53 Federated 0.32 196.47 1.54

Breast Cancer 4 Centralized 0.27 158.68 1.51 Federated 0.19 156.15 1.79

Breast Cancer 5 Centralized 0.16 140.16 1.76 Federated 0.16 137.18 1.79

all runs performed for each scenario. In fact, in Ta-

ble 5, it is noticeable that the average gaps are small.

For example, in the SH scenario, the average gap for

the Silhouette is 0, indicating that even when the cen-

tralized approach performs better, the difference from

the federated approach is minimal. From Table 5, we

observe that, in the homogeneous case, the silhou-

ette worst gap is circa 0, that is, the point is close to

the decision boundary between the considered clus-

ters. Moving from the homogeneous data distribution

to the soft heterogeneous one, the silhouette worst

gap is equal to 0 and slightly below 0 for the hard-

A Federated K-Means-Based Approach in eHealth Domains with Heterogeneous Data Distributions

555

Table 5: Performance Gap Between Data Distribution Scenario.

Data

Distribution

Silhouette

Average Gap

Calinski-Harabasz

Average Gap

Davies-Bouldin

Average Gap

Silhouette

Worst Gap

Calinski-Harabasz

Worst Gap

Davies-Bouldin

Worst Gap

Homogeneous 0.0085 53.63 -0.053 0.08 0.30 -2.58

Hard Heterogeneous -0.04 -10.68 0.02 0.08 -6.84 -2.56

Soft Heterogeneous 0 30.78 -0.02 0 -0.06 -2.17

Table 6: Cluster similarity analysis.

ARI FMI

Alzheimer’s Disease

H-scenario 0.95 0.98

SH-scenario 1 1

HH-scenario 0.91 0.96

Heart Disease

H-scenario 0.93 0.96

SH-scenario 1 1

HH-scenario 0.88 0.94

Obesity

H-scenario 0.1 0.58

SH-scenario 0.96 0.98

HH-scenario 0.72 0.86

Breast Cancer

H-scenario 0.94 0.97

SH-scenario 1 1

HH-scenario 0.94 0.97

Custom

H-scenario 1 1

SH-scenario 1 1

HH-scenario 1 1

heterogeneous case, thus showing a certain stability

with respect to the passage toward the inhomogene-

ity. A higher instability is shown by the average gap in

the Calinski-Harabasz index, from soft to hard homo-

geneous. Finally, the Davies-Bouldin index presents

only a small clustering-quality diminution concerning

the worst gap, even between homogeneous and hard-

heterogeneous data distribution. It’s also interesting

to note that the average gap value for the Calinski-

Harabasz metric ranges from -0.053 to 0.02, indicat-

ing that the clusters are well separated with low intra-

cluster variance and high inter-cluster variance.

In addition, Table 5 also reports the gap between

the worst-performing runs among all runs for each

data distribution. Even in these cases, the gaps for

the Silhouette and Calinski-Harabasz metrics remain

small. However, the Davies-Bouldin gap is larger

than the average gap.

These results aim to demonstrate the effectiveness

of the proposed approach by comparing it with tra-

Table 7: Clusters composition of Alzheimer’s patients ob-

tained with the Centralised approach. EDUC- Education

level; SES - Socio-Economic Status; eTIV - estimated To-

tal Intracranial Volume; Normalized Whole Brain Volume

- nWBV; MMSE - Mini-Mental State Examination; CDR -

Clinical Dementia Rating.

Cluster 1 Cluster 2

Age

Mean (Std) 76.29 ± 7.3 77.46± 8.0

Range [61, 98] [60, 97]

Sex

Female # (%) 51 (39.53%) 153 (68%)

Male # (%) 78 (60.47%) 72 (32%)

EDUC (years)

Mean (Std) 13.79 ± 3.01 15.23 ± 2.69

Range [6, 20] [8, 23]

SES

Mean (Std) 2.79 ± 1.19 2.27 ± 1.05

Range [1, 5] [ 1, 5]

eTIV

Mean (Std) 1494.34 ± 172.84 1487.5 ± 176.99

Range [1143.0, 1957.0] [1106.0, 2004.0]

nWBV

Mean (Std) 0.71 ± 0.03 0.74 ± 0.04

Range [0.65, 0.81] [0.64, 0.84]

MMSE

Mean (Std) 24.39 ± 4.63 ) 29.14 ( 1.04 )

Range [4, 30] [24, 30]

CDR

Mean (Std) 0.67 ±0.31 0.04 ± 0.14

Range [ 0, 2] [0, 0.5]

Diagnosis

CN 2 (1.6%) 188 (83.6%)

AD 127 (98.4%) 0 (0%)

Converted 0 (0%) 37 (16.4%)

ditional centralized K-means. The performance from

a more general perspective provides insight into the

results in terms of the evaluation metrics. Thus, the

evaluation of the average and worst gaps highlights

the approach’s consistency, showing that even in aver-

age and worst-case scenarios, the proposed federated

approach converges to the centralized one.

4.2 Similarity Clusters Analysis

To determine the quality of the cluster composition

generated by the federated approach, we compare the

similarity between the clusters identiﬁed by the cen-

NCTA 2024 - 16th International Conference on Neural Computation Theory and Applications

556

Table 8: Clusters of AD patients obtained with the federated

approach under homogeneous distribution settings.

Cluster 1 Cluster 2

Age

Mean (Std) 76.53 ± 7.21 77.31 ± 89

Range [61, 98] [60, 97]

Sex

Female # (%) 49 (39.2%) 155 (67.69%)

Male # (%) 76 (60.8) % 74 (32.31%)

EDUC

Mean (Std) 13.69 ± 2.99 15.26 ± 2.68

Range [6, 20] [8, 23]

SES

Mean (Std) 2.85 ± 1.17 2.25 ± 15

Range [1, 5] [1, 5]

eTIV

Mean (Std) 1492.74 ± 170 1488.49 ± 178.44

Range [1143 1957] [1106 2004]

nWBV

Mean (Std) 0.71 ± 03 0.74 ± 04

Range [0.65, 0.81] [0.64, 0.84]

MMSE

Mean (Std) 24.25 ± 4.63 29.14 ± 14

Range [4, 30] [24, 30]

CDR

Mean ± Std 0.67 ± 0.31 05 ± 0.15

Range [0, 2] [0, 0.5]

Diagnosis

CN 2 (1.6%) 188 (82.1%)

AD 123 (98.4%) 4 (1.7%)

Converted 0 (0%) 37 (16.2%)

tralized approach and those identiﬁed by the feder-

ated one using two similarity measures: 1) Adjusted

Rand Index (ARI), while Rand Index (RI) measures

cluster similarity through the percentage of consistent

decisions between two clustering, the Adjusted Rand

Index (ARI) corrects the RI by the chance grouping

of elements, providing more robust statistics for com-

paring different clustering algorithms or methods; 2)

Fowlkes–Mallows index (FMI) is a metric used to

evaluate the clusters similarity obtained through vari-

ous clustering algorithms. It is typically used to evalu-

ate the clustering performance of a speciﬁc algorithm

by assuming that the obtained cluster is compared to

the ground truth–i.e., the perfect cluster.

To adopt such indexes, we assume that the results

of the centralised clustering approach are the bench-

mark clusters to be compared.

The indexes obtained from the comparison be-

tween the centralised approach and the federated

approach in three different scenarios and for each

dataset are presented in Table 6. It can be noted that

signiﬁcant heterogeneity within the data may result

in diminished cluster composition quality in feder-

ated k-means, analogously to the case of traditional

supervised federated-learning methodologies. Never-

Table 9: Clusters of AD patients obtained with the federated

approach under soft distribution settings.

Cluster 1 Cluster 2

Age

Mean (Std) 76.29 ± 7.3 77.46± 8.0

Range [61, 98] [60, 97]

Sex

Female # (%) 51 (39.53%) 153 (68%)

Male # (%) 78 (60.47%) 72 (32%)

EDUC

Mean (Std) 13.79 ± 3.01 15.23 ± 2.69

Range [6, 20] [8, 23]

SES

Mean (Std) 2.79 ± 1.19 ) 2.27 ± 1.05

Range [1, 5] [ 1, 5]

eTIV

Mean (Std) 1494.34 ± 172.84 1487.5 ± 176.99

Range [1143.0, 1957.0] [1106.0, 2004.0]

nWBV

Mean (Std) 0.71 ± 0.03 0.74 ± 0.04

Range [0.65, 0.81] [0.64, 0.84]

MMSE

Mean (Std) 24.39 ± 4.63 ) 29.14 ( 1.04 )

Range [4, 30] [24, 30]

CDR

Mean (Std) 0.67 ±0.31 0.04 ± 0.14

Range [ 0, 2] [0, 0.5]

Diagnosis

CN 2 (1.6%) 188 (83.6%)

AD 127 (98.4%) 0 (0 %)

Converted 0 (0%) 37 (16.4%)

theless, a noteworthy outcome indicates that soft het-

erogeneity provides advantages in federated settings.

This speciﬁc ﬁnding manifests with greater promi-

nence in the Obesity dataset, which encompasses a

fourfold greater number of samples than the Breast

Cancer and Custom datasets, six times more than the

Oasis2 and Heart Disease datasets. Indeed, the anal-

ysis reveals that the ARI for the Obesity dataset dis-

plays a notably low performance in the homogeneous

scenario. At the same time, there is a notable increase

in the SH scenario and a somewhat lesser increase in

the HH scenario.

Finally, to further demonstrate the efﬁcacy of the

federated approach in generating comparative out-

comes with the classical centralized K-means method,

we present a descriptive analysis of the clusters iden-

tiﬁed through both approaches across three speci-

ﬁed scenarios using the OASIS2 dataset related to

Alzheimer’s disease. Such a dataset was selected

based on its unique provision of patients’ diagnoses,

which allows for a more practical demonstration of

similarities and differences among clusters compared

to other available datasets.

The clusters resulting from the centralized K-

means approach and the federated one with homoge-

A Federated K-Means-Based Approach in eHealth Domains with Heterogeneous Data Distributions

557

Table 10: Clusters of AD patients using the federated ap-

proach under Hard Heterogeneous distribution settings.

Cluster 1 Cluster 2

Age

Mean (Std) 76.94 ± 8.34 79.26 ± 8.21

Range [ 66.0 , 98.0 ] [ 60.0 , 97.0 ]

Sex

Female # (%) 45 ( 40.18 %) 161 ( 66.53 %)

Male # (%) 67 ( 59.82 %) 81 ( 33.47 %)

EDUC

Mean (Std) 13.16 ± 3.14 15.1 ± 2.76

Range [ 6.0 , 20.0 ] [ 8.0 , 23.0 ]

SES

Mean (Std) 2.96 ± 1.16 2.34 ± 1.09

Range [ 1.0 , 5.0 ] [ 1.0 , 5.0 ]

eTIV

Mean ± SD 1465.46 ± 154.55 1489.93 ± 199.58

Range [ 1143.0 , 1911.0 ] [ 1154.0 , 2004.0 ]

nWBV

Mean (Std) 0.71 ± 0.04 0.73 ± 0.04

Range [ 0.65 , 0.81 ] [ 0.66 , 0.84 ]

MMSE

Mean (Std) 23.04 ± 5.18 28.95 ± 1.17

Range [ 4.0 , 30.0 ] [ 24.0 , 30.0 ]

CDR

Mean (Std) 0.67 ± 0.29 0.06 ± 0.16

Range [ 0.5 , 2.0 ] [ 0.0 , 0.5 ]

Diagnosis

CN 0 (0%) 201 (83.1%)

AD 112 (100%) 3 (1.2%)

Converted 0 (0%) 38 (15.7%)

neous data distribution are shown in Tables 7 and 8.

The clusters identiﬁed by the proposed feder-

ated K-means closely resemble those identiﬁed by

the classical K-means, as evidenced by the indexes

presented in Table 6. The federated K-means al-

gorithm successfully clustered 123 individuals with

Alzheimer’s disease, while the classical K-means al-

gorithm clustered 127 individuals with the same con-

dition. Both methods categorized an equal number of

cognitively normal subjects and subjects likely to ex-

perience a conversion into cluster #2.

Moreover, Tables 9 and 10 show the clusters

detected by the federated approach under hetero-

geneous data distribution, Soft-Heterogeneous and

Hard-Heterogeneous distribution, respectively. As

observed, in cases of a soft distribution condition, the

federated approach demonstrates performance com-

parable to that of the traditional K-means algorithm.

Any slight discrepancies identiﬁed in the preceding

condition are mitigated, as evidenced by the indexes

presented in Table 6. On the contrary, when the data

distribution is heavily unbalanced under the previous

conditions, the discriminatory power of the federated

K-means decreases.

5 CONCLUSIONS

Federated Learning (FL) represents an emerging trend

in machine learning. This approach allows for the de-

velopment of a global model without sharing private

data distributed among multiple data owners. Signiﬁ-

cant research in the ﬁeld of FL has primarily focused

on applying supervised learning techniques. On the

contrary, the available literature about federated un-

supervised learning is still limited.

In particular, among unsupervised techniques,

clustering has shown numerous beneﬁcial applica-

tions in the healthcare domain. Hence, the advance-

ment of federated clustering has the potential to ad-

dress certain limitations associated with data usage,

particularly privacy concerns, which currently hinder

the full realization of the vast potential of health data-

clustering approaches. Moreover, the performance

of Federated Learning is found to be satisfactory in

scenarios where data are independent and identically

distributed (IID). Conversely, in cases where data

are non-independent and non-identically distributed

(Non-IID), it becomes challenging to effectively train

a machine learning algorithm that relies on global

measures while ensuring that all data remain local.

The current study presents a new federated K-

means clustering framework based on a global cosine-

similarity measure to enhance the trade-off between

privacy preservation and clustering effectiveness.

Furthermore, our investigation explores the impact of

the heterogeneous health data-distribution hypothesis

on the convergence of both local and global models.

The efﬁcacy of the proposed federated K-means

algorithm has been evaluated across diverse health-

care datasets and under various experimental condi-

tions, with a comparative analysis of its performance

against the centralized K-means algorithm. Accord-

ing to the initial results, while dealing with hetero-

geneous distributions of health data, the difference

between the centralized and federated approaches is

minimal, and the federated approach demonstrates

promising potential. The decentralized approach ex-

hibits superior performance when evaluated on cer-

tain healthcare datasets compared to the centralised

one. Finally, the results of this study provide empiri-

cal evidence that diversity within federated networks

can have a beneﬁcial effect on the overall quality of

cluster composition.

Future developments of this research can also ad-

dress the challenges associated with heterogeneous

data distributions in federated learning, for instance,

divergence of local models, and slow or unstable con-

vergence, or class imbalance. The detailed explo-

ration of solutions to undertake in these cases can

NCTA 2024 - 16th International Conference on Neural Computation Theory and Applications

558

strengthen the proposed clustering approach.

Future work will focus on evaluating the approach

on larger datasets while removing the assumption of

equal probability-distributions among clients.

ACKNOWLEDGEMENTS

This paper was developed within the project funded

by Next Generation EU - “Age-It - Ageing well in

an ageing society” project (PE0000015), National Re-

covery and Resilience Plan (NRRP) - PE8 - Mission

4, C2, Intervention 1.3. The views and opinions ex-

pressed are only of the authors and do not necessarily

reﬂect those of the European Union or the European

Commission.

REFERENCES

Abid Ali Awan (2021). Heart Disease patients -

Targeting treatment for heart disease patients.

URL: https://www.kaggle.com/datasets/kingabzpro/

heart-disease-patients.

Act, A. (1996). Health insurance portability and account-

ability act of 1996. Public law, 104:191.

Battineni, G., Chintalapudi, N., and Amenta, F. (2019).

Data for: Machine learning in medicine: performance

calculation of dementia prediction by support vector

machines (SVM). Mendeley Data.

Bharati, S., Mondal, M. R. H., Podder, P., and Prasath, V. S.

(2022). Federated learning: Applications, challenges

and future directions.

Bonawitz, K., Kairouz, P., McMahan, B., and Ramage,

D. R. (2021). Federated learning and privacy. ACM

Queue, 19.

Brand

ao, A., Mendes, R., and Vilela, J. P. (2021). Ef-

ﬁcient Privacy Preserving Distributed K-Means for

Non-IID Data. In Abreu, P. E., Pereira Rodrigues, P.,

Fern

andez, A., and Gama, J., editors, Advances in In-

telligent Data Analysis XIX, 19th International Sym-

posium on Intelligent Data Analysis, IDA 2021, vol-

ume 12695, pages 439–451. Springer.

Dhade, P. and Shirke, P. (2024). Federated learning for

healthcare: A comprehensive review. Engineering

Proceedings, 59(1):230.

Garst, S. and Reinders, M. (2024). Federated K-Means

Clustering. ArXiV preprint, arXiv:2310.01195v2.

Liu, Y., Ma, Z., Yan, Z., Wang, Z., Liu, X., and a, J. (2020).

Privacy-preserving federated k-means for proactive

caching in next generation cellular networks. Infor-

mation Sciences, 521:14–31.

Marulli, F., Balzanella, A., Campanile, L., Iacono, M., and

Mastroianni, M. (2021a). Exploring a federated learn-

ing approach to enhance authorship attribution of mis-

leading information from heterogeneous sources. In

2021 International Joint Conference on Neural Net-

works (IJCNN), pages 1–8.

Marulli, F., Verde, L., Marrone, S., Barone, R., and De Bi-

ase, M. S. (2021b). Evaluating efﬁciency and effec-

tiveness of federated learning approaches in knowl-

edge extraction tasks. In 2021 International Joint

Conference on Neural Networks (IJCNN), pages 1–6.

Palechor, F. M. and De la Hoz Manotas, A. (2019). Dataset

for estimation of obesity levels based on eating habits

and physical condition in individuals from colombia,

peru and mexico. Data in brief, 25:104344.

Pedrycz, W. (2022). Federated FCM: Clustering Under Pri-

vacy Requirements. IEEE Transactions on Fuzzy Sys-

tems, 30(8):3384–3388.

Sheller, M. J., Edwards, B., Reina, G. A., Martin, J., Pati,

S., Kotrotsou, A., Milchenko, M., Xu, W., Marcus,

D., Colen, R. R., et al. (2020). Federated learning

in medicine: facilitating multi-institutional collabora-

tions without sharing patient data. Scientiﬁc reports,

10(1):12598.

Stallmann, M. and Wilbik, A. (2022). On a Framework

for Federated Cluster Analysis. Applied Sciences,

12(10455).

Voigt, P. and Von dem Bussche, A. (2017). The eu gen-

eral data protection regulation (gdpr). A Practical

Guide, 1st Ed., Cham: Springer International Pub-

lishing, 10(3152676):10–5555.

W. H. Wolberg, W. N. S. and Mangasarian, O. L.

(2017). Breast cancer Wisconsin (diagnos-

tic) data set. URL: https://www.kaggle.com/

datasets/nancyalaswad90/breast-cancer-dataset.

Wahab, O. A., Mourad, A., Otrok, H., and Taleb, T. (2021).

Federated machine learning: Survey, multi-level clas-

siﬁcation, desirable criteria and future directions in

communication and networking systems. IEEE Com-

munications Surveys Tutorials, 23(2):1342–1397.

Yang, K., Mohammadi Amiri, M., and Kulkarni, S. R.

(2024). Greedy centroid initialization for federated K

-means. Knowledge and Information Systems.

Zhu, S., Xu, Q., Zeng, J., Wang, S., Sun, Y., Yang, Z., and

Oeng, Z. (2023). F

KM: Federated, Fair, and Fast k-

means. Proc. ACM Manag. Data, 1(4):241.

A Federated K-Means-Based Approach in eHealth Domains with Heterogeneous Data Distributions

559