Privacy-Preserving Mortality Prediction in ICUs Using Federated Learning

Pedro Vieira, Eva Maia and Isabel Praça

Instituto Superior de Engenharia do Porto, Politécnico do Porto, Porto, Portugal

{pmrrv, egm, icp}@isep.ipp.pt

Keywords:

Federated Learning, Machine Learning, Artiﬁcial Intelligence, Mortality Prediction.

Abstract:

Managing multiple patients in an Intensive Care Unit (ICU) can be extremely challenging. By predicting

patient mortality, healthcare professionals can provide more efﬁcient treatment and manage resources more

effectively. This allows for more precise and useful interventions, potentially preventing fatalities. Although

artiﬁcial intelligence (AI) is making signiﬁcant advancements in this ﬁeld, traditional Machine Learning (ML)

continues to be the most widely used AI method, though it raises concerns about data security in collaborative

environments. Since ensuring the safe handling of patients’ private data is crucial, Federated Learning (FL)

has emerged as a viable alternative. Its intrinsic characteristics offer a valuable solution for training predictive

models securely, as raw data does not need to be shared between participants. In this study, FL was used

to develop models capable of predicting ICU patient mortality while protecting data privacy. Using data

from the MIMIC-IV dataset, the most accurate model achieved an accuracy of 0.886, a recall of 0.817, and

a speciﬁcity of 0.965, surpassing all the analyzed studies. A comparison between FL and traditional ML

approaches revealed similar performance results. Moreover, three FL aggregation algorithms were evaluated,

a less common focus in this area of research. Federated Averaging performed best with some classiﬁers,

while delivering results comparable to FedAdagrad and FedAdam with others. In conclusion, the findings

demonstrate that FL can be as effective as traditional ML for mortality prediction, with the added beneﬁt of

enhanced data privacy.

1 INTRODUCTION

Predicting mortality in Intensive Care Units (ICUs) can offer various advantages to healthcare professionals and facilities. Firstly, it enables hospitals and medical professionals to allocate resources more effectively by identifying patients at higher risk of deterioration. This allows for prioritization in monitoring, treatment, and intervention strategies, potentially reducing preventable deaths. Furthermore, mortality prediction tools, often powered by Machine Learning (ML), can enhance decision-making regarding treatment plans and end-of-life care, providing families and medical teams with data-driven insights to support difficult choices. In this context, Artificial Intelligence (AI) is a valuable tool to forecast and prevent deadly outcomes (Holmström et al., 2023).

https://orcid.org/0009-0003-9103-896X

https://orcid.org/0000-0002-8075-531X

https://orcid.org/0000-0002-2519-9859

However, a critical issue that must always be

addressed when applying AI in healthcare is pri-

vacy (Khalid et al., 2023). While the vast amount

of healthcare data offers immense potential for ad-

vancing medical research and innovations, it also re-

quires measures to protect the security and privacy

of individuals (Rieke et al., 2020). Ensuring the

conﬁdentiality, integrity, and secure management of

patient data is essential, especially considering the

highly sensitive nature of health information (Stan-

ﬁll and Marc, 2019). To address this challenge,

Federated Learning (FL) has emerged as an innova-

tive ML approach designed to balance data privacy

with data storage needs (Mammen, 2021). In FL,

multiple clients collaborate with one or more central

servers in decentralized ML setups. This approach

enables models to learn from distributed data sources

without exposing sensitive information, ensuring pri-

vacy while promoting collaborative insights (Mam-

men, 2021).

Vieira, P., Maia, E. and Praça, I.

Privacy-Preserving Mortality Prediction in ICUs Using Federated Learning.

DOI: 10.5220/0013120600003911

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2025) - Volume 2: HEALTHINF, pages 87-95

ISBN: 978-989-758-731-3; ISSN: 2184-4305

Although FL brings several advantages, mainly regarding data privacy, it also presents some issues. Integrating FL seamlessly into the healthcare

domain while navigating the details of varied data

sources, guaranteeing model accuracy, and address-

ing regulatory conformity requires innovative strate-

gies and robust frameworks. The decentralization of

the data encompasses different types of data partition-

ing, and selecting the appropriate partitioning method

is critical to the success of the model. The most

common approaches include: Horizontal Federated

Learning (HFL), where different clients have data

on the same features but for different individuals or

cases (Yang et al., 2019); Vertical Federated Learning

(VFL), where multiple sources each provide differ-

ent features for the same group of individuals or cases

(Zhuang et al., 2016); and Federated Transfer Learn-

ing (FTL), which applies transfer learning techniques

to build a new model using a pre-trained one (Yang

et al., 2023). HFL is the most common approach in

the scope of healthcare in the literature (Sharma and

Guleria, 2024). VFL is pertinent in scenarios where

different domains collaborate to train a global model

using shared data that are not linked (Zhuang et al.,

2016). This methodology allows the collaboration

and utilization of data across unrelated domains while

preserving the conﬁdentiality of sensitive information

unique to each domain. In other words, in HFL the

features are the same in every client, while in VFL

the clients have different features. FTL applies the conventional ML transfer learning technique: a model that has already been trained on a similar dataset is adapted to a new task, which makes it possible to address an entirely distinct problem. The fundamental concept behind FTL revolves around the diversity in characteristics among different participants. It addresses issues related to limited or inadequate data by effectively leveraging knowledge transfer while simultaneously preserving data security (Yang et al., 2023).
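The difference between horizontal and vertical partitioning can be made concrete with a small sketch. The records, field names, and helper functions below are invented for illustration and are not taken from the study: in the horizontal case each client holds different patients with the same features, while in the vertical case each client holds different features for the same patients.

```python
# Toy records: every name and value here is invented for illustration.
records = [
    {"patient_id": 1, "age": 64, "heart_rate": 88, "died": 0},
    {"patient_id": 2, "age": 71, "heart_rate": 110, "died": 1},
    {"patient_id": 3, "age": 55, "heart_rate": 76, "died": 0},
    {"patient_id": 4, "age": 80, "heart_rate": 95, "died": 1},
]


def horizontal_split(rows, n_clients):
    """HFL: each client holds different patients with the same features."""
    return [rows[i::n_clients] for i in range(n_clients)]


def vertical_split(rows, feature_groups, key="patient_id"):
    """VFL: each client holds different features for the same patients."""
    return [
        [{key: row[key], **{f: row[f] for f in features}} for row in rows]
        for features in feature_groups
    ]


hfl_clients = horizontal_split(records, n_clients=2)
vfl_clients = vertical_split(records, [["age"], ["heart_rate", "died"]])
```

In this toy setup, `hfl_clients[0]` contains patients 1 and 3 with every column, whereas `vfl_clients[0]` contains all four patients but only the `age` column alongside the shared identifier.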

Beyond the choice of partitioning strategy, the ag-

gregation of model updates from decentralized data

sources presents additional challenges. To address

this, various aggregation techniques are employed,

each designed to cope with the issues of non-IID

(non-independent and identically distributed) data

across clients (Lazzarini et al., 2023). Federated Av-

eraging (FedAvg) is one of the most common meth-

ods, where client updates are averaged to create the

global model, but it may struggle with heterogeneous

data (Nilsson et al., 2018). More advanced optimizers like FedAdam (Çelik and Güllü, 2023) and FedAdagrad (Çelik and Güllü, 2023) apply adaptive learning rate strategies from classical optimization techniques to better handle variations in data distribution. FedAdam builds on the Adam optimizer, combining momentum with per-parameter adaptive learning rates, while FedAdagrad adapts learning rates per parameter based on accumulated squared updates, improving convergence for clients with differing data scales. Selecting the right aggregation technique is crucial for ensuring model performance, fairness, and generalizability across diverse and partitioned healthcare datasets.
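As a rough illustration of how these three aggregation rules differ, the sketch below implements them for a flat parameter vector using the commonly described formulation, in which the averaged client change acts as a pseudo-gradient on the server. The function names and hyperparameter values (`lr`, `beta1`, `beta2`, `tau`) are assumptions chosen for the example, and the code does not claim to reproduce how the study's framework applies these rules to each classifier.

```python
import math


def weighted_average(client_weights, client_sizes):
    """FedAvg: average client parameter vectors, weighted by sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


def adaptive_step(global_w, avg_w, state, rule,
                  lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """One server update for FedAdam or FedAdagrad (hyperparameters assumed).

    The averaged client change is treated as a pseudo-gradient.
    """
    delta = [a - g for a, g in zip(avg_w, global_w)]
    # Momentum term shared by both adaptive rules.
    m = [beta1 * mi + (1 - beta1) * d for mi, d in zip(state["m"], delta)]
    if rule == "fedadagrad":
        # Adagrad-style: accumulate squared updates without decay.
        v = [vi + d * d for vi, d in zip(state["v"], delta)]
    elif rule == "fedadam":
        # Adam-style: exponential moving average of squared updates.
        v = [beta2 * vi + (1 - beta2) * d * d
             for vi, d in zip(state["v"], delta)]
    else:
        raise ValueError(f"unknown rule: {rule}")
    state["m"], state["v"] = m, v
    return [g + lr * mi / (math.sqrt(vi) + tau)
            for g, mi, vi in zip(global_w, m, v)]


global_w = [0.0, 0.0]
client_w = [[1.0, 2.0], [3.0, 4.0]]
client_n = [100, 300]

avg_w = weighted_average(client_w, client_n)  # plain FedAvg result
adam_w = adaptive_step(global_w, avg_w, {"m": [0.0, 0.0], "v": [0.0, 0.0]}, "fedadam")
adagrad_w = adaptive_step(global_w, avg_w, {"m": [0.0, 0.0], "v": [0.0, 0.0]}, "fedadagrad")
```

With FedAvg the new global model is simply `avg_w`; the adaptive rules instead move the global model only part of the way toward that average, with a step size scaled per parameter.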

The aim of this work is to develop an FL approach that predicts the mortality of ICU patients while preserving the privacy of health data. To that end, an HFL solution was developed to analyse the performance of different classifiers, combined with different FL aggregation algorithms, using a subset of the Medical Information Mart for Intensive Care (MIMIC) IV dataset (Johnson et al., 2023). Logistic Regression, Decision Tree, Random Forest, Support Vector Classifier (SVC), and Multi-Layer Perceptron (MLP) models were employed, alongside different aggregation methods, to assess and compare the impact of FL on prediction accuracy. This way

we were able to perform a comparative analysis be-

tween different aggregation algorithms, an aspect of-

ten missing in most studies in this ﬁeld, and between

traditional ML and FL approaches. Furthermore, un-

like most previous research, this work also includes a

time efﬁciency comparison, highlighting which clas-

siﬁers and algorithms provide faster predictions of pa-

tient mortality. By optimizing time, we also reduce

energy and resource consumption, a crucial factor in

today’s context. This not only results in signiﬁcant

cost savings (Kim et al., 2023) but also contributes to

environmental sustainability (Rosen, 2021), aligning

ﬁnancial efﬁciency with ecological responsibility.

The remainder of the paper is structured as follows. Section 2 presents and discusses related work. Section 3 focuses on the methodology employed in the study. Section 4 presents and discusses the results obtained. Lastly, Section 5 concludes the paper and outlines options for future work.

2 RELATED WORK

FL has been used to predict patient mortality in recent years. The work of Randl et al. (Randl et al., 2023) concludes that FL can be efficiently employed to predict early mortality in ICU patients. This conclusion is supported by the fact that the FL approach utilized by the authors presented results comparable to those of traditional ML. Moreover, the study adds that utilizing the F1-Score metric can result in more efficient model performance. Therefore, Randl et al. were able to conclude that FL can be a reliable solution to predict patient outcomes while securing their privacy. Furthermore, Georgoutsos (Georgoutsos, 2023) established another comparison between FL and traditional ML. This time, the author concluded that, in settings with non-IID datasets, FL models outperform privacy-preserving Local Machine Learning (LML) models in terms of AUROC, AUPRC, and F1-Score. Nevertheless, he also concluded that they generally perform slightly worse than Centralized Machine Learning (CML) models. Furthermore, his study stated that the effectiveness of specific FL algorithms depends on the characteristics of the FL elements, such as data size and class representation. Mondrejevski et al. (Mondrejevski et al., 2022) also delved into the comparison between traditional ML and FL. The authors concluded that CML and FL have comparable performance on metrics such as AUPRC and F1-Score, which goes against the findings of the previous study. Just as Georgoutsos concluded, they also stated that FL performs better than LML, which seems to confirm this tendency. Despite focusing on patient mortality prediction, none of the three studies examined the performance of different aggregation algorithms, which is a pivotal matter in this study.

HEALTHINF 2025 - 18th International Conference on Health Informatics

Vieira et al. (Vieira et al., 2024) evaluated the

performance of several algorithms, with and without

FL techniques, to assess and compare the impact of

FL on predicting mortality in acute pancreatitis pa-

tients. Their results indicate that both traditional and

FL methods are highly effective in predicting mor-

tality in this patient population. The authors exam-

ined different aggregation algorithms, with FedAvg

emerging as the best choice, although the three ag-

gregation methods yielded similar results. However,

the study is limited to the speciﬁc medical condition

of acute pancreatitis, a very focused area. Expanding

the scope to encompass a broader range of diseases

would not only increase the dataset size but also al-

low for a deeper analysis of how FL models perform

with larger and more diverse datasets.

3 METHOD

The following subsections explain the approach used to create both the FL and ML models, which were built for comparison and evaluation purposes. First, the selected dataset is described. The pre-processing steps are then detailed, followed by the FL procedure.

3.1 Dataset

Several healthcare datasets have been employed in

previous studies, such as various versions of the

MIMIC dataset family and the eICU Collabora-

tive Research Database (eICU-CRD) (Pollard et al.,

2018). In this study, the MIMIC-IV dataset (John-

son et al., 2023) was used, which is one of the largest

publicly accessible healthcare datasets. This dataset

includes detailed records of patients admitted to In-

tensive Care Units (ICUs) at a major tertiary care hos-

pital. It offers a vast range of medical data, such as vi-

tal signs, medications, laboratory test results, provider

notes, ﬂuid balance, procedure and diagnostic codes,

imaging reports, hospital stay durations, survival in-

formation, and more. It is important to highlight,

however, that access to this dataset is restricted and

requires prior approval. To gain access, individuals

must complete a credentialing process, undergo re-

quired training, and sign a data use agreement.

MIMIC-IV is organized into various tables, each containing distinct types of data. For the purposes of this study, only the subset of tables containing data relevant to the research objectives was selected. The tables used in this research include the following:

admissions. This table holds data regarding the admission of patients to the hospital. Even though a patient can be admitted more than once, each row represents a single admission.

patients. The patients table contains demographic information for each admitted patient, including details such as gender and age.

chartevents. This table records measurements and observations about patients during their hospital stay. It is one of the most crucial tables, as it contains a wide range of medical data, primarily vital signs, along with laboratory results, intake/output measurements, and other clinical observations.

d_items. This table functions as a map between code and name for items recorded in chartevents. Each item ID in chartevents is mapped to its corresponding meaning here.

The admissions and patients tables were used to identify the first admission for each adult patient. Additionally, the patients table provided key demographic information such as age and gender, both of which were utilized as categories in the model's training and testing phases. The necessary medical data for model creation was extracted from the chartevents table, made possible through the d_items table, which provided the definitions for each code found in chartevents.
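The cohort selection described above (first admission of each adult patient, joined with demographics) could look roughly like the following. This is only a sketch on invented rows: the column names (`subject_id`, `hadm_id`, `admittime`, `anchor_age`, `gender`, `hospital_expire_flag`) are assumed to mirror the MIMIC-IV tables, the 18-year threshold is an assumption, and the helper `first_adult_admissions` is hypothetical rather than the authors' code.

```python
from datetime import datetime

# Toy rows; column names are assumed to mirror the MIMIC-IV tables.
admissions = [
    {"subject_id": 1, "hadm_id": 11, "admittime": "2130-05-02 10:00:00", "hospital_expire_flag": 0},
    {"subject_id": 1, "hadm_id": 10, "admittime": "2130-01-15 08:30:00", "hospital_expire_flag": 0},
    {"subject_id": 2, "hadm_id": 20, "admittime": "2141-07-09 23:10:00", "hospital_expire_flag": 1},
    {"subject_id": 3, "hadm_id": 30, "admittime": "2150-03-01 12:00:00", "hospital_expire_flag": 0},
]
patients = [
    {"subject_id": 1, "gender": "F", "anchor_age": 67},
    {"subject_id": 2, "gender": "M", "anchor_age": 54},
    {"subject_id": 3, "gender": "M", "anchor_age": 16},
]


def first_adult_admissions(admissions, patients, min_age=18):
    """Return one row per adult patient: earliest admission plus demographics."""
    adults = {p["subject_id"]: p for p in patients if p["anchor_age"] >= min_age}
    first = {}
    for adm in admissions:
        sid = adm["subject_id"]
        if sid not in adults:
            continue  # skip underage patients
        when = datetime.strptime(adm["admittime"], "%Y-%m-%d %H:%M:%S")
        if sid not in first or when < first[sid][0]:
            first[sid] = (when, adm)
    return [{**adm, **adults[sid]} for sid, (_, adm) in sorted(first.items())]


cohort = first_adult_admissions(admissions, patients)
```

On these toy rows the cohort keeps admission 10 for patient 1 (the earlier of two stays) and admission 20 for patient 2, and drops patient 3 as underage.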

3.2 Pre-Processing

The first step when dealing with the data provided by the MIMIC-IV dataset was removing all underage patients. This was needed because minors may introduce variables that risk impacting the prediction significantly (Bavdekar, 2013). All rows that included null values were also removed, as the dataset was large enough that the removal did not harm its quality or usable size. Next, all the variables were converted into binary format. The majority of models benefit from binary data, which allows them to achieve better results. Moreover, the dataset exhibits an imbalance in the “hospital_expire_flag” category, which serves as the dependent variable indicating whether a patient survived or died. This imbalance could potentially introduce bias into the model's predictions. Therefore, undersampling and SMOTE techniques were both tested in order to balance it. As undersampling, with 50/50 representation, presented the best results, it was chosen to address the imbalance problem, resulting in the final dataset utilized to train the models. Additionally, unlike SMOTE, undersampling has the advantage of relying solely on real data, without the need for synthetically generated samples.
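A minimal sketch of random undersampling to a 50/50 class ratio is shown below. It assumes rows are dictionaries with a binary `hospital_expire_flag` label; the `undersample` helper, the seed, and the toy data are illustrative choices, not the authors' implementation.

```python
import random


def undersample(rows, label="hospital_expire_flag", seed=0):
    """Randomly drop majority-class rows until both classes are equal in size."""
    positives = [r for r in rows if r[label] == 1]
    negatives = [r for r in rows if r[label] == 0]
    minority, majority = sorted((positives, negatives), key=len)
    rng = random.Random(seed)
    # Keep every minority row and an equally sized random sample of the majority.
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Toy data: 20 deaths and 80 survivors (invented proportions).
rows = [{"id": i, "hospital_expire_flag": int(i < 20)} for i in range(100)]
balanced = undersample(rows)
```

Every row in `balanced` is an original record, which reflects the property noted above: no synthetic samples are created, at the cost of discarding part of the majority class.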

3.3 Federated Learning

FL enables distributed model training on multiple devices or servers, each of which retains its local data, eliminating the need for data sharing between locations. Instead of transferring raw data to a central server, FL allows models to be trained locally, and only the resulting model updates or gradients are sent to a central server for aggregation. This work focuses on a network of three hospitals collaborating to create a unified model for predicting mortality in ICU patients. To coordinate the three hospitals, a server was linked to them, resulting in a healthcare network, as illustrated in Figure 1. The server starts by initializing a model and distributing it to the three clients, which train the model locally and send it back to the server. The server then aggregates the models received and once again distributes the result to the clients (Nilsson et al., 2018). This happens for six rounds, which are iterations of the training process, until a final model is obtained. Moreover, each client's data was split into 70% for training and 30% for testing. FL is particularly suited for this scenario, as it balances the need for data privacy with the collaborative nature of the hospitals' efforts in building a shared predictive model.

Figure 1: FL Conﬁguration.
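The round structure described above (three clients, six rounds, a 70/30 local split, and server-side averaging) can be simulated in a few lines. The sketch below is framework-free and uses synthetic one-feature data with a tiny logistic regression; the data generator, learning rate, and epoch count are assumptions made for the example and do not reflect the study's actual models or data.

```python
import math
import random

ROUNDS, N_CLIENTS, TRAIN_FRACTION = 6, 3, 0.7


def make_client_data(seed, n=200):
    """Synthetic (feature, label) pairs; larger features are more often positive."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-2, 2)
        p = 1 / (1 + math.exp(-3 * x))
        data.append((x, 1 if rng.random() < p else 0))
    cut = round(TRAIN_FRACTION * n)  # 70% train, 30% test per client
    return data[:cut], data[cut:]


def predict(params, x):
    w, b = params
    return 1 / (1 + math.exp(-(w * x + b)))


def local_train(params, train, epochs=5, lr=0.5):
    """Client step: start from the global parameters, fit on local data only."""
    w, b = params
    for _ in range(epochs):
        gw = sum((predict((w, b), x) - y) * x for x, y in train) / len(train)
        gb = sum(predict((w, b), x) - y for x, y in train) / len(train)
        w, b = w - lr * gw, b - lr * gb
    return (w, b)


def fedavg(updates, sizes):
    """Server step: sample-weighted average of the client parameters."""
    total = sum(sizes)
    return tuple(
        sum(u[k] * n for u, n in zip(updates, sizes)) / total for k in range(2)
    )


def accuracy(params, test):
    return sum((predict(params, x) >= 0.5) == (y == 1) for x, y in test) / len(test)


clients = [make_client_data(seed) for seed in range(N_CLIENTS)]
global_params = (0.0, 0.0)  # the server initialises the model
for _ in range(ROUNDS):
    # The server sends the global parameters; each client trains locally.
    updates = [local_train(global_params, train) for train, _ in clients]
    # Only parameters travel back to the server, which aggregates them.
    global_params = fedavg(updates, [len(train) for train, _ in clients])

mean_accuracy = sum(accuracy(global_params, test) for _, test in clients) / N_CLIENTS
```

Note that the raw `(x, y)` pairs never leave `make_client_data`'s per-client lists; the server only ever sees the `(w, b)` tuples returned by `local_train`.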

For the technical implementation of this scenario,

Flower (Beutel et al., 2022), a multi-platform, open-

source framework designed for secure execution of

FL, was selected. The FL architecture built using

Flower consists of various components for both the

server and the clients (Figure 2). On the server

side, three aggregation algorithms are represented:

FedAvg, FedAdam, and FedAdagrad. These algo-

rithms, which are utilized one at a time, provide dif-

ferent methods for aggregating the model updates

from the clients. Pivotal to the server’s operations

is the FL Loop, which coordinates the iterative pro-

cess of distributing the global model to the clients

and aggregating their locally trained models, accord-

ing to the selected aggregation algorithm. Communi-

cation between the server and clients is facilitated by

the gRPC server, which manages the transmission of

global model parameters to the clients and the recep-

tion of updated parameters from them. On the client

side, each identical client comprises three main com-

ponents: the gRPC client, the Flower client, and the

local data. The gRPC client handles the communi-

cation with the server’s gRPC server, ensuring the

smooth exchange of model parameters. The Flower

client, implemented in Python, is responsible for in-

tegrating the local ML environment with the FL pro-

cess. It receives the global model parameters from

the server, applies them to the local model, trains the

model using the local data, and sends the updated

model parameters back to the server.


Figure 2: Flower Architecture.

4 EXPERIMENTAL RESULTS

In addition to implementing FL models, the authors

opted to develop traditional ML models using the

same pre-processed dataset to ensure a fair compar-

ison between the two approaches. The models de-

veloped for both methods include Logistic Regres-

sion, Decision Tree Classiﬁer, Random Forest Clas-

siﬁer, Support Vector Classiﬁer (SVC), and Multi-

Layer Perceptron (MLP) Classiﬁer. As previously

mentioned, for the FL approach, three aggregation

algorithms were utilized: FedAvg, FedAdam, and

FedAdagrad. It is important to note that due to the

lengthy training time required for the SVC model, the

scenario was divided into two approaches: in the ﬁrst,

only 10% of the dataset was used to allow for SVC

training; in the second, the complete dataset was used,

excluding the SVC model from training. This two-

pronged strategy enabled the authors to obtain results

for the SVC model while still using the full dataset for

the other classiﬁers.

Table 1 presents the results obtained for the ML and FL approaches with only 10% of the dataset. It allows a fair comparison between the classifiers and aggregation algorithms, as well as between the ML and FL approaches, and provides insight into how each performs when predicting mortality for ICU patients.
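For reference, the five performance metrics reported in the tables can all be derived from the four counts of a binary confusion matrix. The sketch below assumes that the positive class corresponds to a patient who died; the function name and the counts are invented for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the reported metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity: share of positives detected
    specificity = tn / (tn + fp)   # share of negatives correctly identified
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": specificity,
    }


# Invented counts for a balanced test set of 200 patients.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

Reading recall and specificity together is useful here: a model can reach a very high recall simply by predicting the positive class almost always, which then shows up as a low specificity and accuracy.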

Upon comparing all the trained models, it be-

comes evident that the Random Forest and Decision

Tree models, both trained using traditional ML, ex-

hibit the highest metric values. The Decision Tree

had the highest Precision and F1-Score, with Random

Forest having similar values. Additionally, the FL

MLP paired with FedAdam demonstrated the highest

speciﬁcity, and the FL SVC combined with FedAda-

grad achieved the best recall.

Nonetheless, other conclusions can be drawn. It

can be observed that, in general, the FedAvg mod-

els perform comparably to traditional ML models, as

supported by literature (Tu et al., 2022). However, the

MLP experienced a signiﬁcant decline in metric val-

ues relative to other models. Furthermore, the MLP

also performed worse when using other aggregation

algorithms. It is also worth noting that three out of

ﬁve classiﬁers exhibited similar performance across

the three aggregation algorithms. However, two exceptions stand out: FedAdagrad produced notably poor results when training both the SVC and MLP compared to other algorithms, a trend also observed in work on other scopes (Vieira et al., 2024), indicating this may be a recurring pattern with this algorithm. Similarly, the MLP also underperformed with FedAdam compared to FedAvg, as previously noted.

Another key goal of this experiment was to analyze the training times of the various algorithms. Traditional ML models were found to train faster than FL models, which is expected since ML models do not require aggregation of multiple models and do not involve multiple training rounds, unlike FL. Despite this, the time required for the best-performing models was not excessively long. The SVC is the slowest classifier, making it an unsuitable choice if time, energy, and resources are a priority. In contrast, Logistic Regression and Decision Tree were the fastest, completing training in at most 0.02 minutes. Random Forest also trained quickly in the ML setting, though it was slower in FL, taking around 0.30 minutes. Nevertheless, considering its strong performance metrics, it remains a viable option. It is important to note that only 10% of the dataset was used in this analysis, meaning the models would require more time to train on the complete pre-processed dataset, as will be discussed next. Finally, the results show that the three aggregation algorithms required a similar amount of time to finish training. Therefore, when it comes to time, the choice between them makes almost no difference.

Table 1: ML and FL results for General Diseases Mortality Prediction – 10% of the dataset.

Algorithm             Accuracy  Precision  Recall  F1-score  Specificity  Training Time (Minutes)
Machine Learning
Logistic Regression   0.839     0.832      0.790   0.810     0.863        0.00
Decision Tree         0.856     0.892      0.794   0.842     0.918        0.00
Random Forest         0.857     0.883      0.802   0.841     0.909        0.03
SVC                   0.843     0.852      0.798   0.824     0.881        0.89
MLP                   0.843     0.865      0.784   0.822     0.895        0.55
Federated Learning - FedAvg
Logistic Regression   0.833     0.822      0.806   0.814     0.855        0.02
Decision Tree         0.860     0.883      0.796   0.838     0.913        0.02
Random Forest         0.859     0.869      0.812   0.840     0.898        0.30
SVC                   0.841     0.836      0.811   0.823     0.868        66.89
MLP                   0.747     0.844      0.544   0.662     0.917        3.32
Federated Learning - FedAdam
Logistic Regression   0.827     0.820      0.798   0.809     0.805        0.02
Decision Tree         0.860     0.884      0.797   0.839     0.911        0.02
Random Forest         0.858     0.868      0.812   0.839     0.897        0.29
SVC                   0.841     0.832      0.814   0.823     0.863        65.10
MLP                   0.660     0.816      0.324   0.468     0.939        3.30
Federated Learning - FedAdagrad
Logistic Regression   0.818     0.819      0.795   0.807     0.845        0.02
Decision Tree         0.860     0.883      0.798   0.838     0.912        0.02
Random Forest         0.857     0.866      0.811   0.838     0.896        0.30
SVC                   0.457     0.455      0.996   0.625     0.451        68.83
MLP                   0.552     0.890      0.014   0.028     0.936        3.25

The results presented in Table 2 facilitate a sim-

ilar comparison to those in Table 1, but utilizing the

complete dataset. By comparing the trained models,

it becomes evident that the Random Forest trained

with traditional ML outperforms the others, achiev-

ing the highest Accuracy, Precision, and F1-Score.

The best Recall was found in both the FedAvg and

FedAdagrad Decision Trees, while FedAvg’s Random

Forest showed the highest Speciﬁcity. Once again,

the MLP underperformed in the FL setting, particu-

larly when trained with FedAdam and FedAdagrad, as

compared to traditional ML. However, all other clas-

siﬁers demonstrated comparable performance to tra-

ditional ML models and among themselves. Despite

this, FedAvg showed a slight advantage, although the

difference was minimal across three out of four mod-

els. It is also clear that FL requires signiﬁcantly more

time to train than traditional ML, due to its iterative

and aggregative processes, a tendency also shown in

the previous case. In terms of training time, Logistic Regression and Decision Tree were the fastest, while MLP was the slowest of the classifiers trained on the complete dataset, matching the relative order of these classifiers in the analysis of 10% of the dataset. Random Forest, on the other hand, struck a good balance between performance and training time.

As this topic has been explored in previous ML

studies, a comparison was made between the best ML

model trained in this work and traditional ML models

from other studies. Table 3 presents this comparison,

featuring some of the most notable works focused

on training ML models for ICU mortality prediction.

The ML Random Forest from this work provided the best values in three out of the five analyzed metrics: Accuracy, Precision, and F1-Score. Iwase et al.'s Random Forest presented the best Recall value. This work's Random Forest also presented the second-best Specificity, trailing behind Alghatani et al.'s Random Forest.

Table 2: ML and FL results for General Diseases Mortality Prediction – complete dataset.

Algorithm             Accuracy  Precision  Recall  F1-score  Specificity  Training Time (Minutes)
Machine Learning
Logistic Regression   0.843     0.823      0.794   0.808     0.875        0.03
Decision Tree         0.892     0.933      0.820   0.873     0.951        0.01
Random Forest         0.893     0.936      0.821   0.875     0.953        0.34
MLP                   0.843     0.865      0.784   0.822     0.895        4.575
Federated Learning - FedAvg
Logistic Regression   0.840     0.824      0.796   0.810     0.868        0.08
Decision Tree         0.892     0.933      0.822   0.874     0.948        0.17
Random Forest         0.886     0.935      0.817   0.870     0.965        3.04
MLP                   0.756     0.720      0.749   0.734     0.818        36.78
Federated Learning - FedAdam
Logistic Regression   0.830     0.821      0.799   0.810     0.845        0.08
Decision Tree         0.890     0.930      0.821   0.871     0.951        0.02
Random Forest         0.884     0.928      0.808   0.864     0.938        3.02
MLP                   0.546     0.900      0.454   0.603     0.832        35.88
Federated Learning - FedAdagrad
Logistic Regression   0.829     0.815      0.795   0.808     0.834        0.08
Decision Tree         0.889     0.928      0.822   0.856     0.950        0.19
Random Forest         0.885     0.927      0.820   0.875     0.942        3.12
MLP                   0.546     0.645      0.102   0.176     0.435        36.56

Table 3: General Diseases Mortality Prediction – State-of-the-art ML results comparison.

Classifier                                                         Accuracy  Precision  Recall   F1-Score  Specificity  Training Time (Minutes)
This Work's Random Forest                                          0.893     0.936      0.821    0.875     0.953        0.34
Iwase et al.'s Random Forest (Iwase et al., 2022)                  Unknown   Unknown    0.865    Unknown   0.875        Unknown
Nistal-Nuño's Extreme Gradient Boosting (XGB) (Nistal-Nuño, 2022)  0.855     0.528      0.831    0.645     0.860        Unknown
Pang et al.'s XGBoost (Pang et al., 2022)                          0.834     0.842      0.822    0.831     0.846        Unknown
Chia et al.'s best XGB (Chia et al., 2021)                         0.819     0.420      0.615    0.499     0.689        Unknown
Alghatani et al.'s Random Forest (Alghatani et al., 2021)          0.885     0.840      0.095    0.171     0.997        Unknown

In a similar way, a comparison was conducted between the best FL model trained in this study (the Random Forest aggregated with FedAvg) and FL models from other research, as shown in Table 4. This table includes all the relevant works that employed at least one of the metrics used in this work. However, since none of them applied all the same metrics, a direct and complete comparison was challenging. Furthermore, even though some studies used certain metrics to evaluate their models, a comparison with their findings was not possible, as those particular metrics were not used in this work. First, Randl et al. (Randl et al., 2023) provided the most comprehensive set of metrics, allowing for a fairer comparison with their work than with that of other authors. Although they presented multiple results due to the variety of models used, only their best FL model was selected for this comparison. It becomes evident that their results are generally inferior to most of the models trained in this study, both in ML and FL. Unfortunately, Georgoutsos (Georgoutsos, 2023) and Mondrejevski et al. (Mondrejevski et al., 2022) only reported the F1-Score, out of all the metrics used in this work. Georgoutsos showed a low F1-Score, which underperformed compared to most of the models discussed here (with the exception of the MLP model using the FedAdagrad algorithm). Lastly, Mondrejevski et al. (Mondrejevski et al., 2022) reported the highest F1-Score among the three state-of-the-art works. While this result surpasses some of the models presented, it does not outperform the top-performing models from Table 2, specifically the Random Forest and Decision Tree trained with any of the aggregation algorithms.

Table 4: General Diseases Mortality Prediction – State-of-the-art FL results comparison.

Classifier                                                     Accuracy  Precision  Recall   F1-Score  Specificity  Training Time (Minutes)
This Work's Random Forest - FedAvg                             0.886     0.935      0.817    0.870     0.965        3.04
Randl et al. best FL model (Randl et al., 2023)                Unknown   0.520      0.460    0.480     Unknown      Unknown
Georgoutsos best FL model (Georgoutsos, 2023)                  Unknown   Unknown    Unknown  0.512     Unknown      Unknown
Mondrejevski et al. best FL model (Mondrejevski et al., 2022)  Unknown   Unknown    Unknown  0.830     Unknown      Unknown

After carefully analysing the results for both approaches (10% of the dataset and the full dataset), it was possible to confirm some of the results previously presented. The MLP tends to be less accurate in an FL setting, and the SVC with FedAdagrad once again performed much worse than in any other approach (both in FL and in traditional ML). Moreover, as expected, the SVC is the slowest classifier, while Logistic Regression is the fastest. No significant difference was observed in the time needed. In terms of performance, the models from most of the aggregation algorithms performed comparably to each other and to the ML models. However, traditional ML models still had the best metrics, even though the difference was not significant. Within FL, FedAvg was slightly better than the other two aggregation algorithms, even though the performances were practically equivalent.
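For reference, the aggregation step of FedAvg can be sketched as a sample-weighted average of the clients' parameters. The sketch below is framework-independent and assumes each client's model has been flattened to a list of floats. The function name `fedavg` and the client sizes in the usage example are illustrative assumptions, not the configuration used in this work.

```python
def fedavg(client_params, client_sizes):
    """Sample-weighted average of client parameter vectors.

    client_params: list of parameter vectors (one list of floats per client).
    client_sizes:  number of local training samples held by each client.
    """
    if len(client_params) != len(client_sizes):
        raise ValueError("one size is required per client")
    total = sum(client_sizes)
    n_params = len(client_params[0])
    # Each parameter is averaged with weights proportional to local dataset size.
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical hospitals: the second holds three times more data,
# so its parameters dominate the aggregate.
print(fedavg([[1.0, 2.0], [3.0, 4.0]], [100, 300]))  # prints [2.5, 3.5]
```

In each round the server would send the aggregated vector back to the clients as the starting point for the next round of local training.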

5 CONCLUSIONS AND FUTURE WORK

In conclusion, this study aimed to predict ICU patient mortality using a privacy-preserving approach through FL, comparing different aggregation algorithms and traditional ML methods. The MIMIC-IV dataset was used, and a network of hospitals was simulated, where multiple clients collaborated to train models using Horizontal Federated Learning.

The experimental results indicated that FL models performed comparably to traditional ML models in several cases, particularly when using the FedAvg aggregation algorithm, consistent with findings in the existing literature. The study also highlighted the trade-off between model performance and training time, noting that FL models generally require more time due to the aggregation and iterative processes involved. Nevertheless, since the differences in time and performance are not substantial, FL remains a viable option for scenarios where privacy is a significant concern, which applies to most cases within the scope of this research. Additionally, FedAvg was observed to be the most reliable aggregation method, producing stable results across the various classifiers. Conversely, the MLP classifier underperformed under FL, particularly with the FedAdam and FedAdagrad algorithms.
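To make the difference between the aggregation algorithms concrete, the sketch below shows the server-side update of FedAdagrad and FedAdam as they are commonly formulated in the adaptive federated optimization literature: the averaged client model is treated as a pseudo-gradient and passed through an adaptive optimizer instead of replacing the global model directly. This is an illustrative sketch, not the implementation used in this work; the function names, the flat list-of-floats parameter layout, and all hyperparameter defaults (`lr`, `beta1`, `beta2`, `tau`) are assumptions.

```python
import math


def init_state(n_params, tau=1e-3):
    # First-moment (m) and second-moment (v) accumulators kept by the server.
    # v starts at tau**2, a common initialisation (assumed here).
    return {"m": [0.0] * n_params, "v": [tau ** 2] * n_params}


def adaptive_server_update(global_params, averaged_params, state, method,
                           lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """One server round of FedAdagrad or FedAdam on flat parameter lists.

    averaged_params is the sample-weighted average of the client models
    for this round; hyperparameter defaults are hypothetical.
    """
    if method not in ("fedadagrad", "fedadam"):
        raise ValueError(f"unknown method: {method}")
    updated = []
    for i, (x, x_avg) in enumerate(zip(global_params, averaged_params)):
        delta = x_avg - x  # pseudo-gradient: where the clients want to move
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * delta
        if method == "fedadagrad":
            # Adagrad: squared updates accumulate without decay.
            state["v"][i] += delta ** 2
        else:
            # Adam: exponential moving average of squared updates.
            state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * delta ** 2
        step = lr * state["m"][i] / (math.sqrt(state["v"][i]) + tau)
        updated.append(x + step)
    return updated


# Hypothetical single-parameter round: the global model moves only part of
# the way towards the client average, scaled by the adaptive step size.
state = init_state(1)
print(adaptive_server_update([0.0], [1.0], state, "fedadam"))
```

Unlike FedAvg, these updates introduce extra server-side hyperparameters, and the step taken by the global model depends on the accumulated update history, which is one possible source of the sensitivity observed for some classifiers.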

Although this study provides important insights into the use of FL for predicting ICU mortality, future research could focus on optimizing FL models and incorporating data from multiple datasets to enhance model performance further. The model sharing between server and clients could also benefit from cryptography, as it would lead to an even safer and more privacy-friendly approach.

ACKNOWLEDGEMENTS

This research was funded by the Celtic-Next Project iCare4NextG (C2020/1-8) and Project iCare4NextG (POCI-01-0247-FEDER-072265) co-funded by Portugal 2020. Furthermore, this work also received funding from the project UIDB/00760/2020.

HEALTHINF 2025 - 18th International Conference on Health Informatics

REFERENCES

Alghatani, K., Ammar, N., Rezgui, A., and Shaban-Nejad, A. (2021). Predicting intensive care unit length of stay and mortality using patient vital signs: Machine learning model development and validation. JMIR Medical Informatics, 9(5):e21347.

Bavdekar, S. B. (2013). Pediatric clinical trials. Perspectives in clinical research, 4(1):89–99.

Beutel, D. J., Topal, T., Mathur, A., Qiu, X., Fernandez-Marques, J., Gao, Y., Sani, L., Li, K. H., Parcollet, T., de Gusmão, P. P. B., et al. (2022). Flower: A friendly federated learning framework.

Chia, A. H. T., Khoo, M. S., Lim, A. Z., Ong, K. E., Sun, Y., Nguyen, B. P., Chua, M. C. H., and Pang, J. (2021). Explainable machine learning prediction of ICU mortality. Informatics in Medicine Unlocked, 25:100674.

Georgoutsos, A. (2023). Analysis of deep federated learning on early prediction of ICU mortality risk.

Holmström, L., Zhang, F. Z., Ouyang, D., Dey, D., Slomka, P. J., and Chugh, S. S. (2023). Artificial intelligence in ventricular arrhythmias and sudden death. Arrhythmia & Electrophysiology Review, 12.

Iwase, S., Nakada, T.-a., Shimada, T., Oami, T., Shimazui, T., Takahashi, N., Yamabe, J., Yamao, Y., and Kawakami, E. (2022). Prediction algorithm for ICU mortality and length of stay using machine learning. Scientific Reports, 12(1).

Johnson, A. E., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T. J., Hao, S., Moody, B., Gow, B., et al. (2023). MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10(1):1.

Khalid, N., Qayyum, A., Bilal, M., Al-Fuqaha, A., and Qadir, J. (2023). Privacy-preserving artificial intelligence in healthcare: Techniques and applications. Computers in Biology and Medicine, page 106848.

Kim, Y., Torroba Hennigen, L., and Wang, P. (2023). New technique enables faster training of machine-learning models. MIT News. Accessed: 2024-09-26.

Lazzarini, R., Tianfield, H., and Charissis, V. (2023). Federated learning for IoT intrusion detection. AI.

Mammen, P. M. (2021). Federated learning: Opportunities and challenges. arXiv preprint arXiv:2101.05428.

Mondrejevski, L., Miliou, I., Montanino, A., Pitts, D., Hollmén, J., and Papapetrou, P. (2022). FLICU: a federated learning workflow for intensive care unit mortality prediction. In 2022 IEEE 35th International Symposium on Computer-Based Medical Systems (CBMS), pages 32–37. IEEE.

Nilsson, A., Smith, S., Ulm, G., Gustavsson, E., and Jirstrand, M. (2018). A performance evaluation of federated learning algorithms. In Proceedings of the second workshop on distributed infrastructures for deep learning, pages 1–8.

Nistal-Nuño, B. (2022). Developing machine learning models for prediction of mortality in the medical intensive care unit. Computer Methods and Programs in Biomedicine, 216:106663.

Pang, K., Li, L., Ouyang, W., Liu, X., and Tang, Y. (2022). Establishment of ICU mortality risk prediction models with machine learning algorithm using MIMIC-IV database. Diagnostics, 12(5):1068.

Pollard, T. J., Johnson, A. E. W., Raffa, J. D., Celi, L. A., Mark, R. G., and Badawi, O. (2018). The eICU collaborative research database, a freely available multi-center database for critical care research. Scientific Data, 5(1).

Randl, K., Armengol, N. L., Mondrejevski, L., and Miliou, I. (2023). Early prediction of the risk of ICU mortality with deep federated learning. In 2023 IEEE 36th International Symposium on Computer-Based Medical Systems (CBMS), pages 706–711. IEEE.

Rieke, N., Hancox, J., Li, W., Milletari, F., Roth, H. R., Albarqouni, S., Bakas, S., Galtier, M. N., Landman, B. A., Maier-Hein, K., et al. (2020). The future of digital health with federated learning. NPJ digital medicine, 3(1):1–7.

Rosen, M. A. (2021). Energy sustainability with a focus on environmental perspectives. Earth Systems and Environment, 5(2):217–230.

Sharma, S. and Guleria, K. (2024). A comprehensive review on federated learning based models for healthcare applications. Artif. Intell. Med., 146(C).

Stanfill, M. H. and Marc, D. T. (2019). Health information management: implications of artificial intelligence on healthcare data and information management. Yearbook of medical informatics, 28(01):056–064.

Tu, K., Zheng, S., Wang, X., and Hu, X. (2022). Adaptive federated learning via mean field approach. In 2022 IEEE International Conferences on Internet of Things (iThings) and IEEE Green Computing & Communications (GreenCom) and IEEE Cyber, Physical & Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics), pages 168–175. IEEE.

Vieira, P., Maia, E., and Praça, I. (2024). Acute pancreatitis mortality prediction with federated learning. In Progress in Artificial Intelligence - 23rd EPIA Conference on Artificial Intelligence, EPIA 2024, Viana do Castelo, Portugal, September 3-6, 2024, Proceedings, volume II of Lecture Notes in Computer Science, pages 3–15. Springer.

Yang, A., Ma, Z., Zhang, C., Han, Y., Hu, Z., Zhang, W., Huang, X., and Wu, Y. (2023). Review on application progress of federated learning model and security hazard protection. Digital Communications and Networks, 9(1):146–158.

Yang, Q., Liu, Y., Chen, T., and Tong, Y. (2019). Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19.

Zhuang, Y., Li, G., and Feng, J. (2016). A survey on entity alignment of knowledge base. Journal of Computer Research and Development, 53(1):165–192.

Çelik, E. and Güllü, M. K. (2023). Comparison of federated learning strategies on ECG classification. In 2023 Innovations in Intelligent Systems and Applications Conference (ASYU), pages 1–4.
