A Federated Approach to Enhance Calibration of Distributed ML-Based Intrusion Detection Systems

Jacopo Talpini, Nicolò Civiero, Fabio Sartori and Marco Savi
Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano, Italy
{firstname.lastname}@unimib.it
Keywords: Intrusion Detection, Federated Learning, Machine Learning, Model Calibration.
Abstract: Network intrusion detection systems (IDSs) are a major component of network security, aimed at protecting
network-accessible endpoints, such as IoT devices, from malicious activities that compromise confidentiality,
integrity, or availability within the network infrastructure. Machine Learning models are becoming a popular
choice for developing an IDS, as they can handle large volumes of network traffic and identify increasingly
sophisticated patterns. However, traditional ML methods often require a large centralized dataset, thus raising
privacy and scalability concerns. Federated Learning (FL) offers a promising solution by enabling a collabora-
tive training of an IDS, without sharing raw data among clients. However, existing research on FL-based IDSs
primarily focuses on improving accuracy and detection rates, while little or no attention is given to a proper
estimation of the model's uncertainty in making predictions. This is, however, fundamental for increasing the
model's reliability, especially in safety-critical applications, and can be addressed by appropriate model
calibration. This paper introduces a federated calibration approach that ensures the efficient distributed train-
ing of a calibrator while safeguarding privacy, as no calibration data has to be shared by clients with external
entities. Our experimental results confirm that the proposed approach not only preserves the model's performance,
but also significantly enhances confidence estimation, making it ideal to be adopted by IDSs.
1 INTRODUCTION
Network intrusions represent a significant threat to
modern communication systems, with their frequency
and complexity steadily increasing (European Union
Agency for Cybersecurity, 2023). These incidents
compromise data, disrupt services, and erode trust in
digital infrastructures. In this landscape, the Internet
of Things (IoT) represents a growing paradigm that
facilitates the connection of diverse devices and com-
putational capabilities through the Internet.
However, the ongoing expansion of IoT systems,
which often involve a substantial number of devices,
further heightens the risk of cyber-attacks. As a
result, the development of effective detection strate-
gies and resilient countermeasures has become criti-
cal: proactively identifying vulnerabilities and imple-
menting adaptive defense mechanisms are essential to
safeguarding IoT networks (Hassija et al., 2019).
Intrusion Detection Systems (IDSs) play a primary
role to accomplish this task (Khraisat et al., 2019).
Traditional intrusion detection methods have mainly
relied on knowledge-based systems (Hassija et al.,
2019). However, as networks become more com-
plex, these methods are increasingly prone to errors
(Shone et al., 2018; Tsimenidis et al., 2022). To ad-
dress these challenges, data-driven approaches lever-
aging machine learning (ML) have gained significant
attention in recent years (Saranya et al., 2020), with a
strong focus on detecting attacks in IoT environments
(Al-Garadi et al., 2020), a task made difficult by the
high data heterogeneity.
However, a limitation of current ML-based ap-
proaches is their reliance on centralized training,
where data and computational resources are pro-
cessed and managed by a single central node, such
as a server. Such a centralized framework is asso-
ciated with several challenges, including high com-
putational requirements, extended training times, and
heightened concerns over the security and privacy of
users’ data (Mothukuri et al., 2020).
To address these issues, federated learning (FL)
was originally proposed in (McMahan et al., 2017)
and has recently emerged as an effective model train-
ing paradigm in the context of network intrusion de-
tection to address the issues recalled above (Campos
et al., 2021; Talpini et al., 2023). It embodies the
principles of focused collection and data minimization, and can mitigate many of the systemic privacy
risks and costs of traditional, centralized machine
learning, while also offering high communication efficiency and low-latency data processing (Kairouz et al., 2021).
In addition, it is fundamental to note that net-
work intrusion detection is a safety-critical applica-
tion, where the consequences of wrong predictions
(i.e., attack yes/no, or which kind of attack) can be
severe and costly. Indeed, a network administrator
would require a model that not only is accurate, but
also trustworthy, i.e., able to indicate when its pre-
dictions are likely to be incorrect. In the literature, some
works have already pointed out the strong need for
models that are trustworthy (or reliable) in the con-
sidered domain (Talpini et al., 2024).
A fundamental aspect of enhancing the trustworthiness of an ML model is its calibration property. A model is said to be calibrated if the probabil-
ity associated with the predicted class label matches
the empirical frequency of its ground truth correct-
ness (Guo et al., 2017). As a concrete example, let
us consider an Internet Service Provider that wants to
offer an IDS service to its customers and suppose that
the system relies on a ML model to identify whether
network traffic belongs to an attack or not. It would
be beneficial to ensure that the model is calibrated,
so that, out of a certain number of predictions made
with a given confidence level, say 90%, we expect
roughly 90% of the samples to be correctly classified.
Ensuring such a property makes it possible to understand to
what extent the ML model can be trusted in its classi-
fication activity.
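This property can also be checked empirically. The following minimal sketch (illustrative code of ours, assuming arrays of per-prediction confidences and correctness flags) estimates the accuracy of predictions made at a given confidence level:

import numpy as np

def accuracy_at_confidence(confidences, correct, level=0.9, tol=0.05):
    """Empirical accuracy of predictions whose confidence is close to `level`."""
    mask = np.abs(confidences - level) <= tol
    return correct[mask].mean() if mask.any() else float("nan")

# For a well-calibrated model, accuracy_at_confidence(conf, correct, 0.9)
# returns a value close to 0.9.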
However, calibrating a model is not always an
easy task. A model is typically calibrated after its
training by using a calibration dataset (Böken, 2021).
While this is not an issue in centralized settings,
where a lot of users’ data is made available for model
training, and part could also be used for calibration,
it becomes unfeasible in privacy-preserving scenarios
such as those targeted by FL, where data cannot
be shared and must be kept local. On the other hand,
performing calibration locally on the globally-trained
model by means of FL may be only partially effective.
To tackle this problem we designed a novel feder-
ated calibration module based on a well-known cali-
bration method, i.e., Platt scaling (Böken, 2021). Our
strategy embraces the federated learning principles,
and it can be applied to any given pre-trained model.
Given its peculiarities, it is especially suitable in fed-
erated settings where also the calibration operation,
and not only the learning, exploits data from the local
clients. Notably, this module can operate in feder-
ated scenarios without requiring modifications to the
standard FedAvg (McMahan et al., 2017) model ag-
gregation procedure and without the need to change
or retrain the base model.
We evaluated the effectiveness of our approach
using the ToN-IoT dataset (Sarhan et al., 2021), a
widely adopted benchmark for studying attacks on
IoT infrastructures (Alsaedi et al., 2020), focusing on
how data heterogeneity affects the calibration proper-
ties of the proposed approach. For comparison, we
included a baseline classifier trained in a federated
learning framework, which demonstrated poor cali-
bration properties, as well as a centralized calibration
approach where the calibration module is trained us-
ing data retrieved from clients (i.e., IoT devices) at
a centralized location. Additionally, we tested a per-
sonalized calibration approach where each client in-
dependently trains its own calibrator (i.e., local cali-
bration). Our results show that calibrating the model
is effective to enhance its reliability and that our fed-
erated calibration approach provides (i) better calibra-
tion performance than local calibration, and (ii) only
slightly worse performance than centralized calibra-
tion, which however breaks privacy-preserving con-
straints. In addition, this is obtained without signifi-
cantly altering the model’s classification performance
in terms of F1-Score and accuracy.
To summarize, our main contributions are:
• The definition of a federated calibration module that works with any pre-trained classifier, but that is best suited to a federated learning environment.
• An evaluation of the proposed calibration approach in IoT intrusion detection scenarios with different clients' data heterogeneity, and its comparison to other relevant baselines.
The paper is organized as follows. Section 2 reviews
the related work, while Section 3 introduces the proposed federated calibration ap-
proach. Section 4 describes the experimental setup
and the dataset utilized. In Section 5 we provide and
discuss the numerical results, and conclude the pa-
per by highlighting the main takeaways and lessons
learned in Section 6.
2 RELATED WORK
Intrusion Detection Systems are traditionally re-
garded as a secondary layer of defense, designed to
monitor network traffic and identify malicious activities that primary security measures (e.g., firewalls)
are not able to identify (Moustafa et al., 2018). IDSs
are typically categorized as either signature-based or
anomaly-based (Tauscher et al., 2021). Signature-
based IDSs, also known as misuse detection systems,
use pattern recognition techniques to compare cur-
rent network traffic against known attack signatures.
In contrast, anomaly-based IDSs build a model of
normal network behavior and identify any deviations
from this baseline as potential intrusions. However,
they often suffer from a high false positive rate (Al-
Garadi et al., 2020).
In this paper, we focus on signature-based intrusion detection, particularly in IoT environments,
and in the following subsection we report
on the relevant work related to data-driven systems,
based on machine learning, in this context. Later, we
also recall relevant recent strategies that have been
proposed to improve the calibration of a model in a
federated setting.
2.1 ML-Based Intrusion Detection Systems
In recent years, data-driven approaches for develop-
ing IDSs have received a lot of attention from the
research community (Shone et al., 2018; Liu and
Lang, 2019; Tauscher et al., 2021) considering differ-
ent methods such as random forests, support vector
machines, neural networks or clustering techniques.
In particular, machine learning and deep learning are
emerging as powerful data-driven approaches capable
of learning and identifying malicious patterns in net-
work traffic, making them highly effective for detect-
ing security threats in networks (Shone et al., 2018),
and in particular in IoT environments (Chaabouni
et al., 2019; Sarhan et al., 2022).
However, the majority of intrusion detection ap-
proaches proposed for the IoT domain rely on cen-
tralized architectures, where IoT devices transmit
their local datasets to cloud datacenters or centralized
servers. This setup leverages the substantial com-
puting power of these centralized systems for model
training (Campos et al., 2021). As a result, the fed-
erated learning paradigm emerged as a promising al-
ternative to traditional centralized approaches in this
domain, and it is possible to find a few examples in
the literature exploring its feasibility (Rahman et al.,
2020; Campos et al., 2021; Aouedi et al., 2022; Rey
et al., 2022; Talpini et al., 2023).
For instance, in (Rey et al., 2022) the authors pro-
pose a FL-based framework to detect malware, based
on both supervised and unsupervised models. More
precisely, the authors perform a binary classification
with a multi-layer perceptron and an autoencoder on
balanced datasets, which present the same class pro-
portions for every client. Another contribution is rep-
resented by the paper (Campos et al., 2021), in which
the authors investigated a FL approach on a realistic
IoT dataset and showed the issues related to the so-
called statistical heterogeneity, arising from the fact
that typically different devices experience different
kinds of attacks.
Another relevant contribution in the realm of reli-
able models for intrusion detection is (Talpini et al.,
2024), where the authors show how uncertainty quan-
tification can enhance the trustworthiness of an IDS,
with a focus on the usual centralized setting. Our pa-
per is aligned with the vision of (Talpini et al., 2024),
as we propose a federated approach to obtain and de-
ploy calibrated models, which is a primary require-
ment for an uncertainty-aware model.
2.2 Model Calibration in Federated Settings
Calibration has been extensively studied in the lit-
erature within the traditional centralized setting (see
(Guo et al., 2017) for a comprehensive overview), as
machine learning models, including deep neural net-
works, are often poorly calibrated. Approaches to ad-
dress calibration issues can either involve modifica-
tions to the model itself or adjustments to the standard
loss function (e.g., temperature scaling, focal loss), or
they can be applied post-training, assuming the avail-
ability of a fresh dataset independent of the training
set. In line with the latter approach, several meth-
ods exist to improve calibration, including isotonic
regression and Platt scaling (Guo et al., 2017).
However, the calibration of machine learning
models in federated settings has received limited at-
tention so far. Among the few relevant efforts, recent
research (Peng et al., 2024) demonstrated that existing
federated learning schemes often fail to ensure proper
model calibration following weight aggregation, rais-
ing concerns about the deployment of FL models in
safety-critical domains. To address this, they pro-
posed a post-hoc calibrator based on a multi-layer per-
ceptron with a substantial number of free parameters,
which introduces a risk of overfitting and leads to a
non-negligible communication overhead between the
clients and the central server.
Similarly, (Qi et al., 2023) explored the prob-
lem of model calibration in FL by proposing a strat-
egy that leverages prototype-based summaries of each
client’s data cluster to facilitate effective calibration.
However, their approach is designed for a cross-silo
FL setting, which permits peer-to-peer communication between clients. In contrast, this work focuses on
the cross-device FL scenario, where clients commu-
nicate exclusively with a central server.
Another recent study (Chu et al., 2024) proposed
modifying the local training loss by incorporating a
calibration loss to encourage better model calibra-
tion. While effective for neural network models, this
method is unsuitable when pre-trained classifiers are
involved, as envisioned in this paper, since it requires
access to the training process.
All the previous works are related to model cali-
bration without focusing on any specific application
domain. To the best of our knowledge, in the domain
of FL-based network intrusion detection, the issue of
proper model calibration remains unexplored, and this
paper tries to fill this gap.
3 OUR APPROACH TO FEDERATED CALIBRATION
As mentioned earlier, proper calibration is a prereq-
uisite for reliable learning performance. While model
trustworthiness is typically studied in centralized set-
tings, here we propose a federated calibration mod-
ule (or federated calibrator) that can take as input
any pre-trained model. It operates in federated sce-
narios without requiring modifications to the standard
FedAvg (McMahan et al., 2017) aggregation architec-
ture. More specifically, the module adopts a novel
approach: not only is the model trained in a feder-
ated way, but it is also calibrated in the same manner,
leveraging calibration information from the local clients.
The desired goal of the calibrator module is as follows: given any classifier that outputs predictive scores $[f_0, \ldots, f_{C-1}]$ across $C$ classes (i.e., benign or specific attacks) for a given input $x$ (i.e., a network traffic sample), establish a meaningful mapping transforming these scores into well-calibrated probabilities. To achieve this goal, we chose to rely on Platt scaling, also known as sigmoid or logistic calibration, due to its simplicity and parametric nature that naturally fits the federated learning workflow.
More specifically, suppose a binary classification problem and that our base classifier (e.g., a neural network or a decision tree) predicts some score for the 0-th class, denoted as $f_0(x)$, for a given input $x$. Platt scaling maps this score to a probability by means of a sigmoid function with two learnable parameters, denoted here as $a$ and $b$, as follows:

$$p(y = 0 \mid f_0(x)) = \mathrm{sigmoid}(a \cdot f_0(x) + b) \quad (1)$$
Those parameters are estimated through maximum likelihood on a validation set $D = \{(x_i, y_i)\}_{i=1}^{N}$ of input-output pairs ($x_i$ and $y_i$, respectively) that, in this context, we call calibration dataset. In the case of a classification task (i.e., the task considered in this paper), the loss function reduces to the usual cross-entropy loss between the predicted probabilities for a given input and the corresponding class label.

Algorithm 1: Federated Calibrator.
Require: Pre-trained classifier $\tilde{y} = f(x)$ and the calibration module $\hat{y} = \mathrm{calibrator}(\tilde{y}, w)$
Require: For each client $k$, a fresh calibration dataset $D_k = \{(x_i, y_i)\}_{i=1}^{N_k}$ and the corresponding predictions provided by the base classifier $\{\tilde{y}_i\}_{i=1}^{N_k}$
Require: $M$ federated training rounds, learning rate $\eta$
Server randomly initializes global model weights $w^1$
for $m = 1, \ldots, M$ do
    for $k = 1, \ldots, K$ do
        $\{\hat{y}_i = \mathrm{calibrator}(f(x_i), w^m)\}_{i=1}^{N_k}$  ▷ calibrator predictions
        $L_k(w) = \sum_{i=1}^{N_k} \text{cross-entropy}(y_i, \hat{y}_i)$  ▷ local loss estimation
        $w_k^m \leftarrow w_k^m - \eta \nabla L_k(w)$  ▷ weights update
    end for
    Server aggregates clients' updates: $w^{m+1}_{\text{global}} \leftarrow \mathrm{FedAvg}(\{w_k^m, N_k\}_{k=1}^{K})$  ▷ Eq. (2)
end for
return $w^M_{\text{global}}$  ▷ calibrator parameters global estimate
The formulation of Equation (1) can be easily ex-
tended to a multi-class classification task by applying
the same equation to each class independently, and
then normalizing the output to ensure a valid probabil-
ity interpretation. The great advantage of Platt scaling
is that it has only a few parameters (their number scales
linearly with the number of classes), making it suitable
even for calibration datasets of small size. Addition-
ally, it is versatile and, as already pointed out, it can
be applied to any pre-trained model.
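As an illustration, in the binary case the parameters of Equation (1) can be fitted with off-the-shelf tools. The following sketch (our own illustrative code, not the paper's implementation) estimates $a$ and $b$ by maximum likelihood, using logistic regression on the base classifier's scores:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores, labels):
    """Fit the a, b of Eq. (1) on a held-out calibration set."""
    # Logistic regression minimizes the cross-entropy; a large C
    # approximates unregularized maximum-likelihood estimation.
    lr = LogisticRegression(C=1e6)
    lr.fit(scores.reshape(-1, 1), (labels == 0).astype(int))  # class 0 as positive
    return lr.coef_[0, 0], lr.intercept_[0]  # a, b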
Our simple yet effective proposal is to train this
calibrator following the standard federated learning
pipeline, e.g., by adopting FedAvg (McMahan et al.,
2017) with a central server. Specifically, the server
(i.e., FL model aggregator) sends to the clients (or
to a random subset of them) a pre-defined calibra-
tor model (in terms of parameters a and b of the sig-
moid function) for local training. Each client trains
the given model with its own data, by minimizing
a given loss function (i.e., cross-entropy) and hence
computing a local update of the weights, here denoted
as $w \equiv [a, b]$, following Equation (1).
At the end of each round, the server collects all
local updates related to a and b and combines them
to update the central model parameters. Following
this iterative process, an arbitrary number of clients
can contribute to model calibration without transferring
the collected data (i.e., calibration dataset) to a cen-
tralized location, since only locally-trained sigmoid
function parameters need to be sent. As said, the common aggregation function adopted in this paper for federated calibration is FedAvg, which computes the globally updated weights $w_{\text{global}}$ as a weighted average over all clients' weights $w_k$:

$$w_{\text{global}} = \frac{1}{\sum_{k=1}^{K} n_k} \sum_{k=1}^{K} n_k \, w_k \quad (2)$$
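For concreteness, here is a small worked example of Equation (2) with three hypothetical clients (the numbers are purely illustrative):

import numpy as np

w_k = np.array([[1.2, -0.3], [0.8, 0.1], [1.0, 0.0]])  # per-client (a, b)
n_k = np.array([500, 300, 200])                         # calibration set sizes
w_global = np.average(w_k, axis=0, weights=n_k)         # size-weighted mean
print(w_global)  # -> [1.04, -0.12]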
Overall, the proposed calibration workflow is co-
ordinated by the central server. It takes as input a
pre-trained base model and provides, as output, a cali-
brated version of it by executing the federated calibra-
tion module and the related workflow, whose details
are reported in Algorithm 1.
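To make the workflow concrete, the following is a minimal binary-case simulation of Algorithm 1 (a sketch of ours under simplifying assumptions: full client participation and plain gradient steps; client_data is a list of per-client (scores, labels) pairs):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_update(w, scores, y, lr=0.1, epochs=10):
    """One client's cross-entropy training of the Platt parameters w = (a, b)."""
    a, b = w
    for _ in range(epochs):
        p = sigmoid(a * scores + b)  # Eq. (1)
        grad = p - y                 # derivative of the cross-entropy w.r.t. the logit
        a -= lr * np.mean(grad * scores)
        b -= lr * np.mean(grad)
    return np.array([a, b])

def federated_calibration(client_data, rounds=20):
    """Algorithm 1: FedAvg over locally-updated Platt parameters."""
    w_global = np.array([1.0, 0.0])
    sizes = [len(y) for _, y in client_data]
    for _ in range(rounds):
        w_clients = [local_update(w_global.copy(), s, y) for s, y in client_data]
        w_global = np.average(np.stack(w_clients), axis=0, weights=sizes)  # Eq. (2)
    return w_global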
Note that the base model may have been trained
following any possible training paradigm. However,
given its peculiarities, our calibration procedure is es-
pecially recommended in the case of a model trained
by means of federated learning, as it follows the
same workflow. Specifically, the model could first be
trained using FL, and then calibrated giving it as input
to our federated calibration module.
4 DATASET AND EXPERIMENTAL SETUP
4.1 Dataset Description
To explore the feasibility of our approach, the
choice of a realistic dataset plays a crucial role. In
(Campos et al., 2021) it is possible to find an exten-
sive review of different existing datasets related to in-
trusion detection in the IoT domain. As done in that
paper and given its properties, here we exploit the NF-
ToN-IoT dataset (Sarhan et al., 2021) version 2, which
is based on the ToN IoT set (Alsaedi et al., 2020). Ev-
ery row of the dataset represents a network flow char-
acterized by 43 features, and each flow is labeled as
belonging to one over a total of 10 classes, including
benign traffic and nine different attacks. In the dataset
there are approximately 17 million rows with 63.99%
representing attack samples and 36.01% representing
benign samples.
To distribute data among clients we exploited
a common partition method used in the literature,
which is based on a symmetrical Dirichlet distribu-
tion, governed by a concentration parameter α. Such
a parameter controls the degree of statistical heterogeneity: lower values of α (say, < 1) lead
to more heterogeneous settings (i.e., more dissimilar
class distributions across clients). In particular, we considered 10 clients with
varying α between 0.2 and 0.9. It is worth mention-
ing that in a network intrusion detection scenario it
is common to have statistical heterogeneity (Campos
et al., 2021) and therefore we concentrated our analy-
sis on a heterogeneous scenario, i.e., with α < 1.
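For reference, here is a sketch of one common way to implement this Dirichlet-based partitioning (the exact routine used in our experiments may differ in its details):

import numpy as np

def dirichlet_partition(labels, n_clients=10, alpha=0.3, seed=0):
    """Split sample indices across clients with per-class Dirichlet proportions."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Lower alpha -> more skewed shares of class c across clients.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_idx[k].extend(part.tolist())
    return client_idx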
Moreover, to leverage the simplicity of the cali-
bration module, which allows training the calibrator
with a small amount of data, we applied a random
sub-sampling of the classes except for the minority
one, so that all the classes are equally represented at
calibration time.
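A minimal sketch of this balancing step (an assumed implementation, where every class is downsampled to the size of the minority class):

import numpy as np

def balance_classes(labels, seed=0):
    """Return indices of a class-balanced subsample for calibration."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_min = counts.min()  # minority class is kept in full
    keep = [rng.choice(np.where(labels == c)[0], n_min, replace=False)
            for c in classes]
    return np.concatenate(keep)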
4.2 Adopted Classifier and Calibrator Implementation
As a base model, we considered an extreme gradient-
boosted decision tree classifier implemented in the
library XGBoost (Chen and Guestrin, 2016). This
model often outperforms more complex models (e.g.,
deep learning models) on tabular data such as the
one considered here (Shwartz-Ziv and Armon, 2022;
Markovic et al., 2022), making it very appealing for
network traffic classification.
Each client locally trains its own XGBoost model,
then the central server aggregates the local models to
define a single global model following the federated
learning schema. Boosted decision trees are aggre-
gated as follows: at each FL round, each client sends
its boosted tree to a central server; the server then
treats these trees as bootstrapped versions of a classi-
fier and combines them into an ensemble to create the
final model. So, if we have K clients and M rounds,
the final classifier will consist of K · M trees. The
model aggregation procedure is performed exploiting
the Flower library (Beutel et al., 2020), and the fed-
erated training of the tree-based model required 50
training rounds and 10 local epochs per round.
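As an illustration of this aggregation, the sketch below (an assumed realization, not our exact code) ensembles the per-client boosters by averaging their predicted class probabilities:

import numpy as np
from xgboost import XGBClassifier

def train_local(X, y, n_rounds=10):
    """A client's local boosting step (10 rounds, as in our setup)."""
    model = XGBClassifier(n_estimators=n_rounds)
    model.fit(X, y)
    return model

def ensemble_predict_proba(client_models, X):
    """Server-side ensemble over all clients' trees."""
    return np.mean([m.predict_proba(X) for m in client_models], axis=0)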
Regarding the calibration module, Platt scaling
has been implemented through Pytorch (Paszke et al.,
2019) as a fully-connected neural network without
hidden layers, with a diagonal weight matrix and with
a sigmoid (i.e., the Platt scaling sigmoid of Equation
(1)) as an activation function. This allowed an easy
implementation of the calibration module directly in
the Flower workflow. The calibrator's FL training
took 20 rounds with 10 local epochs per round.
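For reference, here is a minimal PyTorch sketch of such a calibrator (diagonal weights a and biases b, i.e., Equation (1) applied per class followed by normalization; illustrative code):

import torch
import torch.nn as nn

class PlattCalibrator(nn.Module):
    """Per-class Platt scaling with a diagonal weight matrix."""
    def __init__(self, num_classes):
        super().__init__()
        self.a = nn.Parameter(torch.ones(num_classes))   # diagonal weights
        self.b = nn.Parameter(torch.zeros(num_classes))  # per-class biases

    def forward(self, scores):
        p = torch.sigmoid(self.a * scores + self.b)      # Eq. (1), per class
        return p / p.sum(dim=1, keepdim=True)            # normalize to probabilities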
4.3 Baseline Strategies
To evaluate the effectiveness of our proposed ap-
proach, we considered the following baselines:
• Base model. This is the classifier obtained in a federated way, as specified in Section 4.2, without any calibration module.
• Centralized calibration. It represents the most favorable scenario in terms of calibration performance, where the central server has access to a calibration dataset, so that calibration can be performed in the usual centralized setting. However, this is not a privacy-preserving strategy, as a portion of clients' data needs to be shared with the central server.
• Local calibration. In this strategy, each client independently trains its own calibrator using its own local data. In this way, the calibration is personalized with respect to each client.
• Proposed approach. This is the federated calibration approach proposed in Section 3.

Figure 1: Architectural view of our federated calibration approach (a), centralized calibration (b) and local calibration (c).
A comparison of the different calibration ap-
proaches is shown in Fig. 1. All the model param-
eters, such as the number of local training epochs
and the number of federated training rounds, are kept
fixed to the values specified in Section 4.2 across the
different baselines and across each value of α, to en-
sure a fair comparison of the models’ performance.
5 PERFORMANCE EVALUATION
The main goal of this paper is to improve the cali-
bration of a given classifier. To assess how well a
model is calibrated, a commonly adopted metric is
the so-called Expected Calibration Error (ECE) (Guo
et al., 2017). To compute it, the predicted proba-
bility is divided into a set of bins, and the discrepancy between the empirical accuracy (i.e., the fraction of correctly-classified samples belonging to the given confidence range) and the predicted probability is evaluated. More precisely, it is defined as:

$$\mathrm{ECE} = \sum_{n=1}^{N_b} \frac{|B_n|}{N_s} \left| \mathrm{acc}(B_n) - \mathrm{conf}(B_n) \right| \quad (3)$$

where $N_b$ is the number of bins (typically 10), $N_s$ is the total number of samples, and $\mathrm{acc}$ and $\mathrm{conf}$ represent the accuracy and the confidence of the samples belonging to the $n$-th bin, denoted as $B_n$. ECE values lie in the range [0, 1], with lower values indicating a better calibrated model.
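A minimal implementation of Equation (3) (our own illustrative code, given arrays of per-sample confidences and correctness indicators):

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE as in Eq. (3): size-weighted gap between accuracy and confidence."""
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / len(confidences)) * gap
    return ece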
In addition, we also evaluated the resulting mod-
els with standard predictive performance metrics such
as accuracy and F1-scores, including both the Macro
F1-Score (F1-Macro) and the Weighted F1-Score
(F1-Weighted), which accounts for the class imbal-
ance by weighting each class’s contribution propor-
tionally to its sample size.
Figure 2: Expected Calibration Error for the considered models as a function of the clients' data statistical heterogeneity. The curves represent the mean ECE along with the 95% confidence intervals, calculated over 20 experiments with different random seeds.

Figure 2 shows the ECE as a function of the heterogeneity parameter α for the considered baselines. It can be seen that the base model is highly uncalibrated for all the values of α, thus raising concerns about the reliability of standard FL-based models.

Figure 3: Accuracy and F1-Scores for the considered models as a function of the clients' data statistical heterogeneity. The curves represent the mean values of each metric along with the 95% confidence interval, computed over 20 different experiments with varying random seeds.

On
the other hand, we can see that the proposed approach
shows a comparable behavior to centralized calibra-
tion, but without the need for a centralized calibra-
tion dataset, thus guaranteeing privacy and efficiency
since data remains local.
Moreover, the proposed approach leads to better
calibration with respect to a personalized, local cali-
bration. The difference, as expected, is higher in more
heterogeneous settings, where local data are not rep-
resentative of the overall population as typically hap-
pens in a network intrusion detection scenario. In
fact, for instance, certain regions may be more ex-
posed to certain cyber attacks than others, leading to
imbalances in the data collected from each client. The ability of the model to perform well under these conditions is important to ensure its effectiveness in practical, large-scale intrusion detection systems.

Figure 4: Per-class ECE for the evaluated models with fixed data heterogeneity (α = 0.3). The classes are ordered in decreasing order based on their sample size. The ECE represents a mean value, computed over 20 different experiments with varying random seeds.
It is worth noting that the calibration module does
not negatively impact the classification performance,
as shown in Figure 3, for accuracy, F1-Macro and F1-
Weighted. Here, we can see that our approach guaran-
tees similar classification performance to the central-
ized calibration case and to the base model. Again,
local calibration shows a sub-optimal behavior, es-
pecially in more heterogeneous settings. In addition,
similar results are obtained for the Weighted F1-score,
with even closer performance between our approach,
centralized calibration and base model.
Finally, to further investigate the relationship be-
tween ECE and class imbalance, we report in Figure
4 the per-class ECE, with classes sorted in decreasing
order of sample size. The results highlight that mi-
nority classes (e.g., ransomware, mitm, etc.) exhibit
significantly higher ECE values, indicating a stronger
calibration challenge in these cases for all the base-
lines, showing that Platt scaling struggles in performing its task for these classes.
6 CONCLUSION
In this paper, we proposed a novel federated calibra-
tion module relying on a calibrator that can be trained
following the standard federated learning pipeline.
The proposed module demonstrates good calibration
performance at small α values, i.e., with heterogeneous clients' data, and maintains comparable perfor-
mance to server-based centralized calibration as α in-
creases (i.e., α > 0.4), while offering advantages in
terms of privacy, as no calibration data must be shared
with the server. At the same time, our federated cali-
bration outperforms a local calibration strategy, where
each client separately calibrates the base model (e.g.,
trained by means of FL): separate calibration steps at
different nodes might leverage partially representative
data and, hence, result in non-trustworthy models.
In addition, the classification performance in
terms of accuracy and F1-score of our proposed ap-
proach is not degraded with respect to the base model,
and it is comparable to that of a model obtained by
performing centralized calibration.
These features make our approach particularly
well-suited for the deployment of reliable ML-based
Intrusion Detection Systems, where data are typically
unbalanced and where privacy and efficient resource
usage are essential. However, it still struggles with
under-represented classes, highlighting an area for
potential improvement. In addition, as future work,
we plan to investigate other
approaches to calibration in a federated learning set-
ting, like isotonic regression and/or methods based on
the conformal prediction framework.
ACKNOWLEDGMENT
The research leading to these results has been par-
tially funded by the Italian Ministry of University and
Research (MUR) under the PRIN 2022 PNRR frame-
work (EU Contribution NextGenerationEU, M. 4, C. 2, I. 1.1), SHIELDED project, ID P2022ZWS82.
REFERENCES
Al-Garadi, M. A., Mohamed, A., Al-Ali, A. K., Du, X., Ali,
I., and Guizani, M. (2020). A Survey of Machine and
Deep Learning Methods for Internet of Things (IoT)
Security. IEEE Communications Surveys Tutorials,
22(3):1646–1685.
Alsaedi, A., Moustafa, N., Tari, Z., Mahmood, A., and
Anwar, A. (2020). TON-IoT Telemetry Dataset: A
New Generation Dataset of IoT and IIoT for Data-
Driven Intrusion Detection Systems. IEEE Access,
8:165130–165150.
Aouedi, O., Piamrat, K., Muller, G., and Singh, K. (2022).
Intrusion detection for Softwarized Networks with
Semi-supervised Federated Learning. In ICC 2022.
Beutel, D. J., Topal, T., Mathur, A., Qiu, X., Fernandez-
Marques, J., Gao, Y., Sani, L., Li, K. H., Parcol-
let, T., de Gusmão, P. P. B., et al. (2020). Flower:
A Friendly Federated Learning Research Framework.
arXiv preprint arXiv:2007.14390.
Böken, B. (2021). On the Appropriateness of Platt Scal-
ing in Classifier Calibration. Information Systems,
95:101641.
Campos, E. M., Saura, P. F., González-Vidal, A.,
Hernández-Ramos, J. L., et al. (2021). Evaluat-
ing Federated Learning for Intrusion Detection in In-
ternet of Things: Review and Challenges. Computer
Networks, page 108661.
Chaabouni, N., Mosbah, M., Zemmari, A., Sauvignac,
C., and Faruki, P. (2019). Network Intrusion De-
tection for IoT Security Based on Learning Tech-
niques. IEEE Communications Surveys Tutorials,
21(3):2671–2701.
Chen, T. and Guestrin, C. (2016). Xgboost: A Scalable Tree
Boosting System. In KDD 2016.
Chu, Y.-W., Han, D.-J., Hosseinalipour, S., and Brin-
ton, C. (2024). Unlocking the Potential of Model
Calibration in Federated Learning. arXiv preprint
arXiv:2409.04901.
European Union Agency for Cybersecurity (2023). ENISA
Threat Landscape 2023. https://www.enisa.europa.eu/
publications/enisa-threat-landscape-2023. [Accessed:
09-December-2024].
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017).
On Calibration of Modern Neural Networks. In ICML
2017.
Hassija, V., Chamola, V., Saxena, V., Jain, D., et al.
(2019). A Survey on IoT Security: Application Areas,
Security Threats, and Solution Architectures. IEEE
Access, 7:82721–82743.
Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., et al.
(2021). Advances and Open Problems in Feder-
ated Learning. Foundations and Trends® in Machine
Learning, 14(1–2):1–210.
Khraisat, A., Gondal, I., Vamplew, P., and Kamruzzaman,
J. (2019). Survey of Intrusion Detection Systems:
Techniques, Datasets and Challenges. Cybersecurity,
2(1):1–22.
Liu, H. and Lang, B. (2019). Machine Learning and Deep
Learning Methods for Intrusion Detection Systems: A
Survey. Applied Sciences, 9:4396.
Markovic, T., Leon, M., Buffoni, D., and Punnekkat, S.
(2022). Random Forest based on Federated Learning
for Intrusion Detection. In AIAI 2022. Springer.
McMahan, B., Moore, E., Ramage, D., Hampson, S.,
and y Arcas, B. A. (2017). Communication-efficient
Learning of Deep Networks from Decentralized Data.
In Artificial Intelligence and Statistics, pages 1273–
1282.
Mothukuri, V., Parizi, R., Pouriyeh, S., Huang, Y., et al.
(2020). A Survey on Security and Privacy of Federated
Learning. Future Generation Computer Systems.
Moustafa, N., Hu, J., and Slay, J. (2018). A Holistic Review
of Network Anomaly Detection Systems: A Compre-
hensive Survey. Journal of Network and Computer
Applications, 128.
Paszke, A. et al. (2019). PyTorch: An Imperative
Style, High-Performance Deep Learning Library. In
NeurIPS 2019.
Peng, H., Yu, H., Tang, X., and Li, X. (2024). FedCal:
Achieving Local and Global Calibration in Federated
Learning via Aggregated Parameterized Scaler. In
ICML 2024.
Qi, Z., Meng, L., Chen, Z., Hu, H., Lin, H., and Meng, X.
(2023). Cross-silo Prototypical Calibration for Feder-
ated Learning with Non-iid Data. In Multimedia 2023.
Rahman, S. A., Tout, H., Talhi, C., and Mourad, A. (2020).
Internet of Things Intrusion Detection: Centralized,
On-Device, or Federated Learning? IEEE Network,
34(6):310–317.
Rey, V., Sánchez, P. M. S., Celdrán, A. H., and Bovet, G.
(2022). Federated Learning for Malware Detection in
IoT devices. Computer Networks, 204:108693.
Saranya, T., Sridevi, S., Deisy, C., Chung, T. D., and Khan,
M. (2020). Performance Analysis of Machine Learn-
ing Algorithms in Intrusion Detection System: A Re-
view. Procedia Computer Science, 171:1251–1260.
Sarhan, M., Layeghy, S., Moustafa, N., Gallagher, M., and
Portmann, M. (2022). Feature Extraction for Machine
Learning-based Intrusion Detection in IoT Networks.
Digital Communications and Networks.
Sarhan, M., Layeghy, S., and Portmann, M. (2021).
Evaluating Standard Feature Sets Towards Increased
Generalisability and Explainability of ML-based
Network Intrusion Detection. arXiv preprint
arXiv:2104.07183.
Shone, N., Ngoc, T. N., Phai, V. D., and Shi, Q. (2018). A
Deep Learning Approach to Network Intrusion Detec-
tion. IEEE Transactions on Emerging Topics in Com-
putational Intelligence, 2(1):41–50.
Shwartz-Ziv, R. and Armon, A. (2022). Tabular Data: Deep
Learning is not All You Need. Information Fusion,
81:84–90.
Talpini, J., Sartori, F., and Savi, M. (2023). A Clustering
Strategy for Enhanced FL-Based Intrusion Detection
in IoT Networks. In ICAART 2023.
Talpini, J., Sartori, F., and Savi, M. (2024). Enhancing
Trustworthiness in ML-Based Network Intrusion De-
tection with Uncertainty Quantification. Journal of
Reliable Intelligent Environments, pages 1–20.
Tauscher, Z., Jiang, Y., Zhang, K., Wang, J., and Song, H.
(2021). Learning to Detect: A Data-driven Approach
for Network Intrusion Detection. In IPCCC 2021.
Tsimenidis, S., Lagkas, T., and Rantos, K. (2022). Deep
Learning in IoT Intrusion Detection. Journal of Net-
work and Systems Management, 30.