Bernoulli Distribution-Based Maximum Likelihood Estimation for
Dynamic Coefficient Optimization in Model-Contrastive Federated
Learning
Sichong Liao
Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield, S10 2TN, U.K.
Keywords: Bernoulli Distribution, Maximum Likelihood Estimation, Loss Term, Federated Learning
Abstract: In federated learning, the non-identically and independently distributed (non-IID) nature of data presents a formidable challenge, often leading to suboptimal model performance. This study introduces a novel Bernoulli Distribution-Based Maximum Likelihood Estimation for Dynamic Coefficient Optimization method in Model-Contrastive Federated Learning to address these inherent difficulties. At the center of the proposed approach is the dynamic adjustment of the loss terms according to a quantified deviation between the global model and the local models. When the data across clients varies widely, this adjustment improves the robustness and adaptability of the model. Building on the Model-Contrastive Federated Learning (MOON) framework, this paper proposes a Dynamic Coefficient Optimized MOON (DCO-MOON) framework that incorporates a dynamic coefficient adjustment mechanism for the supervised loss term and the model-contrastive loss term. The efficacy of this approach is illustrated through simulations on different datasets, including the Modified National Institute of Standards and Technology (MNIST), Fashion-MNIST, and Canadian Institute for Advanced Research (CIFAR-10) datasets. Experimental results show improvements in test accuracy and communication efficiency, and illustrate that DCO-MOON adapts well to real-world scenarios confronting data-driven challenges with non-IID and unbalanced datasets.
1 INTRODUCTION
In recent years, data privacy and security have
become fundamental concerns within the field of
machine learning (Voigt & Bussche, 2017, Kingston,
Conventional centralized training strategies regularly require the aggregation of large datasets, posing substantial risks to data privacy and security. Moreover, these strategies can be inefficient because of the need to transfer vast amounts of data to a central server. Federated learning addresses these issues by training models across different devices while keeping data local (Li et al, 2021).
Federated learning is a distributed machine-learning paradigm that places a high priority on privacy (Li et al, 2020, Yang et al, 2019). Clients, also known as parties, collaborate in the training of a centralized model. This collaboration is coordinated through the sharing of model-related information: parameters or updates are exchanged instead of the parties' private datasets. Each client uses its local data to train a local model, and the central server then aggregates the parameters of the local models to form a global model. The aggregated parameters are subsequently communicated back to the clients. Owing to its privacy and efficiency in distributed settings, federated learning is regarded as a major advancement in distributed machine learning (Tyagi et al, 2023). It allows two or more parties to collaboratively train models without sharing raw data, which is vital in today's data-driven world where data privacy and security are fundamental.
However, one of the primary challenges in federated learning is the non-identically and independently distributed (non-IID) nature of data across different clients (Kairouz et al, 2021, Zhu et al, 2021). Owing to different user behaviors and preferences, the data generated by distinct parties often varies widely, leading to non-IID distributions.
This non-IID issue regularly leads to a decline in model performance because of the varying data distributions among participating clients (Li et al, 2020, Li et al, 2019). The resulting statistical heterogeneity is a direct consequence of the non-IID issue, and it is what causes federated models to suffer from uneven performance and convergence problems.
As a pivotal method in federated learning,
FedAvg aims to address the challenges of
communication efficiency and data privacy
(McMahan et al, 2017). It works by averaging local
stochastic gradient descent (SGD) updates for the
primal problem. Over numerous experiments, it has
been found effective for non-convex problems (Su et
al, 2023). FedAvg combines models by averaging
local model parameters from all clients. However, FedAvg can struggle to converge in settings with heterogeneous data: data heterogeneity frequently leads to suboptimal convergence or even divergence of the global model. To address this issue, recent research has focused on Bayesian non-parametric methods for aggregating two or more models, including neuron matching and merging of local models, as seen in approaches such as PFNM and the work of Claici et al. (Yurochkin et al, 2019, Claici et al, 2020). In addition, Shukla et al. presented the Infogain FedMA algorithm, which utilizes an information-gain sampling strategy for the selection of model parameters and combines it with probabilistic federated neural matching (Shukla & Srivastava, 2021). Although these approaches are workable and innovative, they have mainly not been applied to more complex neural networks, and extending them to such networks is a developing research area.
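For reference, the parameter-averaging step at the heart of FedAvg can be sketched as follows; weighting each client by its local sample count follows the standard formulation, while the dictionary-of-tensors model representation is an illustrative assumption.

import torch

def fedavg_aggregate(client_states, client_sizes):
    """Weighted average of client model parameters, FedAvg-style.

    client_states: list of state_dicts (name -> tensor), one per client.
    client_sizes:  list of local dataset sizes, used as aggregation weights.
    """
    total = float(sum(client_sizes))
    global_state = {}
    for name in client_states[0]:
        # Each parameter is the average of the clients' parameters,
        # weighted by the client's share of the overall data.
        global_state[name] = sum(
            (n_k / total) * state[name].float()
            for state, n_k in zip(client_states, client_sizes)
        )
    return global_state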
Li, He, and Song's Model-Contrastive Federated Learning (MOON) strategy is an outstanding advance in dealing with non-IID data (Li et al, 2021). By introducing a model-contrastive loss, similar to the NT-Xent loss used in contrastive representation learning, it optimizes the learning process over distributed networks and thereby improves the effectiveness of federated learning. Tasks such as image classification benefit considerably from the MOON technique.
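For orientation, MOON's model-contrastive loss takes an NT-Xent-like form over the representations produced by the current local model, the global model, and the previous local model; the sketch below follows that general form, with the temperature value and tensor shapes chosen purely for illustration.

import torch
import torch.nn.functional as F

def model_contrastive_loss(z_local, z_global, z_prev, temperature=0.5):
    """NT-Xent-style loss that pulls the local representation z_local toward
    the global model's representation z_global and away from the previous
    local model's representation z_prev (all of shape [batch, dim])."""
    sim_glob = F.cosine_similarity(z_local, z_global, dim=-1) / temperature
    sim_prev = F.cosine_similarity(z_local, z_prev, dim=-1) / temperature
    logits = torch.stack([sim_glob, sim_prev], dim=1)        # positive pair first
    labels = torch.zeros(z_local.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)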
This paper deals with non-IID issues by utilizing a Bernoulli Distribution-Based Maximum Likelihood Estimation for Dynamic Coefficient Optimization (DCO), a more comprehensive framework built on MOON. This new framework merges statistical techniques with federated learning: based on the difference between the local and global models, it adjusts the supervised loss term and the model-contrastive loss term accordingly, and thus improves accuracy and communication efficiency. This article also illustrates that DCO-MOON adapts well to real-world scenarios confronting data-driven challenges with non-IID and unbalanced datasets.
2 METHODOLOGY
The integration of dynamic coefficient optimization into the model-contrastive federated learning framework is premised on a principal observation: in non-IID data settings, the local models in a federated learning system show varying degrees of deviation from the global model. Given this condition, the deviation between the global model and a local model can be evaluated by a quantification method. For example, considering the deviations introduced by model training and aggregation, it is unwise to aggregate the features learned by poorly behaving local models into the global model in an unbalanced way, and the model updates driven by gradient descent should not uncontrollably aggravate the polarization between the global model and the local models. In other words, these deviations can be detected through a statistical method, and proactive adjustments can then be taken to moderate gradient descent to a certain degree.
The common cycle of federated learning is shown
in Figure 1. The proposed system follows the cycle of
its initialization, build-up, and repeat stages.
Figure 1: The cycle of common federated learning
(Photo/Picture credit: Original).
2.1 Quantifying Deviation with
Bernoulli Distribution-Based MLE
To address non-IID issues, this paper presents a novel
mechanism for quantifying and reacting to these
deviations. Since there is always drift in the phase of
local training, this approach first characterizes a
metric for critical deviation. For each local model
i
at round
t , this work calculates the Euclidean
distance
t
i
d between its parameter vector
t
i
w and the
parameter vector
t
g
lobal
w
of global model. In the
following, mitigating the impact of scale differences
is needed. The Euclidean distance is then normalized
according to the norm of the global model's
parameters. So far, the Euclidean distance
t
i
d is
calculated to quantify the deviation:
()
2
(, )
1
Δ
K
tt
iik
k
dw
=
=
(1)
where \Delta w_i^t = w_i^t - w^{t-1} denotes the difference between the parameters of the current local model and those of the global model from the previous round, and K is the total number of parameters. Normalization then ensures comparability:

\mathrm{Normalized}\; d_i^t = \frac{d_i^t}{\left\| w^{t-1} \right\|}    (2)
A binary outcome function b_i^t is then defined to indicate significant deviation. The threshold \theta serves as a hyper-parameter, which allows it to be customized to a given real-world situation:

b_i^t = \begin{cases} 1, & \text{if } d_i^t > \theta \\ 0, & \text{otherwise} \end{cases}    (3)

The normalized deviation is thus used to inform a binary outcome function b_i^t that adheres to a Bernoulli distribution. This function essentially categorizes each local model's update as significantly deviating or not, based on the pre-set threshold \theta.
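A minimal sketch of Formulas 1-3, assuming the model parameters are compared as flattened vectors; the threshold value is left as a hyper-parameter.

import torch

def deviation_indicator(w_local, w_global_prev, theta):
    """Return the normalized deviation d_i^t (Formulas 1-2) and the Bernoulli
    outcome b_i^t (Formula 3) for one client.

    w_local:        flattened parameters w_i^t after local training.
    w_global_prev:  flattened global parameters w^{t-1} received at round start.
    theta:          deviation threshold (hyper-parameter).
    """
    delta = w_local - w_global_prev              # delta w_i^t = w_i^t - w^{t-1}
    d = torch.norm(delta, p=2)                   # Euclidean distance (Formula 1)
    d_norm = d / torch.norm(w_global_prev, p=2)  # scale normalization (Formula 2)
    b = int(d_norm > theta)                      # Bernoulli outcome (Formula 3)
    return d_norm.item(), b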
Maximum Likelihood Estimation (MLE) is a
statistical method used to estimate the parameters of
a statistical model. It selects the parameter values that
maximize the likelihood function, representing the
most probable values given the observed data. In the
context of federated learning, MLE can be utilized to
estimate the probability of deviation in local models
from a global model. Since each local model in the federated learning framework is trained independently, the binary deviation outcomes are treated as independently and identically distributed samples. This article utilizes the Bernoulli distribution to derive the MLE of the parameter p^t. The observed sample set is denoted as B = \{b_1^t, b_2^t, \ldots, b_N^t\}, where each b_i^t represents the binary outcome indicating significant deviation of the i-th local model at round t. The likelihood function, i.e., the probability of observing the sample set B given the parameter p, is expressed as:

L(p \mid B) = f(B \mid p) = \prod_{i=1}^{N} p^{b_i^t} (1-p)^{1-b_i^t}    (4)
The log-likelihood function, that is, the logarithm of the likelihood function, is defined as:

\log L(p \mid B) = \sum_{i=1}^{N} \left[ b_i^t \log p + (1-b_i^t) \log (1-p) \right]    (5)
To obtain the MLE of p^t, the value of p that maximizes the log-likelihood function is determined: the derivative of the log-likelihood function with respect to p is taken and set equal to zero:

\frac{\partial}{\partial p} \log L(p \mid B) = \sum_{i=1}^{N} \left( \frac{b_i^t}{p} - \frac{1-b_i^t}{1-p} \right) = 0    (6)
Solving Formula 6 for p yields the MLE of the Bernoulli parameter:

p^t = \frac{1}{N} \sum_{i=1}^{N} b_i^t    (7)
In this way, the MLE of the deviation probability p^t is calculated over all local models: it is the empirical probability of observing a significant deviation among the local models at round t. Under non-IID and unbalanced dataset settings, this is a reasonable statistical metric that gives a clear picture of the dynamics and behavior of the local models. In what follows, this paper introduces how dynamic coefficient optimization is implemented and then proposes the DCO-MOON framework, in which the learning process is continuously fine-tuned in reaction to the deviation.
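The closed-form MLE in Formula 7 is simply the fraction of clients flagged as deviating in the current round, as the small sketch below illustrates.

def estimate_deviation_probability(b_list):
    """Bernoulli MLE of p^t (Formula 7): the empirical fraction of local models
    whose normalized deviation exceeded the threshold in this round."""
    return sum(b_list) / len(b_list)

# Example: 3 of 10 clients deviated significantly this round, so p^t = 0.3.
p_t = estimate_deviation_probability([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])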
2.2 Proposed Dynamic Coefficient
Optimized MOON
To achieve a seamless integration of the Bernoulli Distribution MLE with the MOON framework, this paper proposes a Dynamic Coefficient Optimized MOON (DCO-MOON) framework. The integration is carefully designed to dynamically adjust the coefficients of the two loss terms. According to the deviation probability p^t of the local models, the proposed framework can better adapt to real-world scenarios that confront data-driven challenges with non-IID and unbalanced datasets.
DCO-MOON starts with the central server initializing the global model w^0, as in the MOON system. However, the DCO-MOON framework distinguishes itself through the introduction of a dynamic adjustment process for the learning coefficients. Initially, at t = 0, the base learning coefficient is set to a predefined value \mu. As training progresses, the framework deviates from the traditional MOON model by employing the Bernoulli MLE to calculate the deviation probability p^t as follows:

p^t = \frac{1}{N} \sum_{i=1}^{N} b_i^t    (8)

This probability then informs the adjustment of the learning coefficient used in the next round, defined as:

\mu_t = p^t \mu    (9)
In the local training phase, each client model computes the difference vector \Delta w_i^t = w_i^t - w^{t-1} and the normalized Euclidean distance d_i^t as in Formula 2. A binary outcome b_i^t, indicating a significant model deviation, is determined based on the threshold \theta.
A critical innovation in the DCO-MOON framework is the redefinition of the total loss function l to incorporate dynamically adjusted coefficients, termed "Dynamic Coefficients". These coefficients, (2\mu - \mu_t) and (\mu + \mu_t), are applied to the supervised and model-contrastive components of the loss function, respectively. The loss function of MOON and the modified loss function of DCO-MOON are given by:

l_{MOON} = l_{sup}(w_i^t; x, y) + \mu\, l_{con}(w_i^t; w^t; w_i^{t-1}; x)    (10)

l_{DCO\text{-}MOON} = (2\mu - \mu_t)\, l_{sup}(w_i^t; x, y) + (\mu + \mu_t)\, l_{con}(w_i^t; w^t; w_i^{t-1}; x)    (11)
This alteration allows for a flexible adjustment of the
learning process, adapting to the varying degrees of
deviation in the local models.
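A minimal sketch of the weighted objective in Formula 11, assuming a cross-entropy supervised loss and taking the model-contrastive term (for example, the NT-Xent-style term sketched in the introduction) as a precomputed tensor.

import torch.nn.functional as F

def dco_moon_loss(logits, targets, l_con, mu, mu_t):
    """DCO-MOON local objective (Formula 11).

    logits, targets: predictions and labels for the supervised term.
    l_con:           model-contrastive loss term, already computed for this batch.
    mu:              base coefficient (hyper-parameter).
    mu_t:            dynamic coefficient derived from the deviation probability.
    """
    l_sup = F.cross_entropy(logits, targets)   # supervised loss term
    # Larger deviation shifts weight from the supervised term to the contrastive term.
    return (2 * mu - mu_t) * l_sup + (mu + mu_t) * l_con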
After a round of local training, the local models update their parameters using this adaptive loss function and return the updated model w_i^t, along with the deviation outcome b_i^t, to the server. The server then aggregates these models to form the updated global model w^{t+1}, employing a weighted scheme reflective of their contributions.
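Putting the pieces together, one plausible reading of the server-side round is sketched below; it reuses the size-weighted averaging sketched in the introduction, and the coefficient update reflects Formula 9 as reconstructed here, so it should be read as an assumption rather than a definitive implementation.

def dco_moon_server_round(client_updates, mu):
    """One server-side aggregation round.

    client_updates: list of (state_dict, n_samples, b) tuples returned by the
                    clients, where b is the Bernoulli deviation outcome b_i^t.
    Returns the new global model state and the coefficient mu_t for the next round.
    """
    states = [s for s, _, _ in client_updates]
    sizes = [n for _, n, _ in client_updates]
    new_global = fedavg_aggregate(states, sizes)                      # size-weighted averaging
    p_t = sum(b for _, _, b in client_updates) / len(client_updates)  # Formula 8
    mu_t = p_t * mu                                                   # assumed update per Formula 9
    return new_global, mu_t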
The overall DCO-MOON framework is shown in
Figure 2. During each round, the central server
dispatches the global model to the clients and
subsequently gathers the updated local models from
them. The global model is then refined through a
process of weighted average computation. In the
phase of local model training, each client applies SGD to adjust the global model using its unique dataset. The objective of this adjustment is shown in Formula 12:
\min_{w_i^t} \; \mathbb{E}_{(x,y)\sim D_i} \left[ (2\mu - \mu_t)\, l_{sup}(w_i^t; x, y) + (\mu + \mu_t)\, l_{con}(w_i^t; w^t; w_i^{t-1}; x) \right]    (12)
The Dynamic Coefficient Optimized MOON
framework brings forth an adaptive, responsive
approach to model training and aggregation,
specifically tailored to overcome the complexities
associated with non-IID and unbalanced datasets. By
dynamically modulating learning coefficients based
on real-time model behavior, the DCO-MOON
framework promises improved convergence and
model performance in diverse federated learning
environments.
Figure 2: The DCO-MOON framework (Photo/Picture credit: Original).
3 RESULTS & EVALUATION
This section evaluates the performance of FL algorithms, including FedAvg, FedProx, MOON, SOLO, and the proposed DCO-MOON (Li et al, 2020, McMahan et al, 2017, Li et al, 2021). First, the experimental setup is introduced. The accuracy and communication efficiency of the proposed DCO-MOON framework are then compared with those of other up-to-date federated learning algorithms. For a fair comparison, DCO-MOON and all baselines are evaluated in non-IID settings.
To validate the effectiveness of DCO-MOON, this research conducted a series of simulations on a customized federated learning platform, designed to address non-IID and unbalanced datasets. The simulations were carried out in PyCharm and utilized three datasets, MNIST, Fashion-MNIST, and CIFAR-10, to ensure a comprehensive evaluation. For CIFAR-10, the non-IID data partition among clients is generated using a Dirichlet distribution.
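A common recipe for such a partition draws per-client class proportions from a Dirichlet prior; the sketch below follows that standard approach, with the concentration parameter alpha and the random seed chosen for illustration rather than taken from the paper.

import numpy as np

def dirichlet_partition(labels, n_clients, alpha=0.5, seed=0):
    """Split sample indices into non-IID client shards using a Dirichlet prior.

    For each class, a proportion vector over clients is drawn from Dir(alpha);
    smaller alpha yields more skewed (more strongly non-IID) partitions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        # Cut this class's samples at the cumulative proportions.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, shard in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(shard.tolist())
    return client_indices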
For the CIFAR-10 dataset, the proposed approach utilizes a CNN as the base encoder. It comprises two convolutional layers: the first has 32 filters with a kernel size of 5x5, followed by a ReLU activation and a 2x2 max pooling layer; the second consists of 64 filters, also with a 5x5 kernel, followed by a ReLU activation and another 2x2 max pooling layer. Following the convolutional layers, the network includes a fully connected layer with an input dimensionality of 1600, flattened from the output of the convolutional layers, and an output size of 512. A final fully connected layer maps to the number of classes, which is 10 for CIFAR-10. For this dataset, the projection head in the convolutional federated learning model is configured as a single fully connected layer, originally serving as the final layer of the model. This configuration is distinct from a traditional 2-layer MLP, with the output dimension of the projection head aligned with the number of classes defined in the model's architecture. For a fair comparison, all baselines, including FedAvg and MOON, adopt this network architecture and the same structure for the projection head.
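A PyTorch sketch of the base encoder described above; the layer sizes follow the text (two 5x5 convolutions with 32 and 64 filters, 2x2 max pooling, a 1600-to-512 fully connected layer, and a final 10-way layer serving as the single-layer projection head), while details such as the ReLU after the 512-unit layer are assumptions made for illustration.

import torch
import torch.nn as nn

class CifarEncoder(nn.Module):
    """Base encoder described in the text: two 5x5 conv blocks (32 and 64
    filters, each followed by ReLU and 2x2 max pooling), a 1600->512 fully
    connected layer, and a final 10-way layer that also serves as the
    single-layer projection head."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc1 = nn.Linear(1600, 512)   # 64 * 5 * 5 = 1600 for 32x32 inputs
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)
        z = torch.relu(self.fc1(h))       # 512-dimensional representation
        logits = self.fc2(z)              # projection head / classifier output
        return z, logits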
The present study rigorously investigates the
influence of the hyperparameter µ on the performance
of DCO-MOON. Experimental adjustments of µ
within the set {0.1, 1, 5, 10} were conducted, and the
optimal results are documented in Table 1. This table
illustrates the test accuracy of DCO-MOON with
varying µ values across datasets such as MNIST,
Fashion-MNIST, and CIFAR-10. Note that the
optimal value of µ for DCO-MOON was consistently
identified as 5 for all three datasets. It is pertinent to
mention that similar hyperparameters exist in MOON
and FedProx, with MOON having a µ value of 1 and
FedProx a µ value of 0.001. Unless specified otherwise, the remaining experiments in this article adhere to these default settings.
Table 1: Test accuracy of DCO-MOON with \mu from {0.1, 1, 5, 10} on different datasets.

\mu     MNIST    Fashion-MNIST    CIFAR-10
0.1     96.9%    81.7%            65.7%
1       97.4%    84.2%            68.3%
5       98.3%    85.4%            69.5%
10      98.0%    84.8%            68.6%
Table 2 presents the top-1 test accuracies of
various federated learning algorithms. Under non-IID
settings, the SOLO algorithm showed obviously
lower accuracy compared to other FL algorithms.
DCO-MOON consistently outperformed other FL
algorithms in all tasks. When assessing the average
accuracy across all datasets, DCO-MOON surpassed
MOON by 1.4%. While MOON's performance was
inferior to DCO-MOON, it was superior to other FL
algorithms. For FedProx, its test accuracy closely
aligned with FedAvg, with slightly higher accuracy
on the Fashion-MNIST and CIFAR-10 datasets.
Given the small µ value, the proximal term in FedProx (i.e., l_{FedProx} = l_{FedAvg} + \mu\, l_{prox}) had a minimal impact on the training process.
Table 2: Test accuracy of different FL algorithms on different datasets.

FL algorithm    MNIST    Fashion-MNIST    CIFAR-10
FedAvg          97.9%    79.8%            65.2%
FedProx         97.7%    81.9%            65.9%
MOON            98.1%    83.1%            67.8%
SOLO            89.8%    76.4%            45.7%
DCO-MOON        98.3%    85.4%            69.5%
Figure 3 shows the top-1 accuracy per training round. Compared to MOON, the model-contrastive loss in DCO-MOON had a more positive effect on the convergence speed of the algorithm. Initially, the rate of accuracy improvement of DCO-MOON was almost the same as that of MOON; however, DCO-MOON achieved better accuracy later thanks to the more positive impact of the model-contrastive loss. Consequently, the test accuracy of MOON and FedAvg gradually diverged from that of DCO-MOON as the number of communication rounds increased. For FedProx, the optimal µ value is typically small, so its accuracy curve closely resembles that of FedAvg, especially on Fashion-MNIST. However, with the setting µ = 1, FedProx converges at a considerably slower pace because of the added proximal term. This illustrates that a large µ value in FedProx leads to slow convergence and poor accuracy, indicating that restricting the L2-norm distance between local and global models is not an effective solution. While MOON's model-contrastive loss can effectively enhance accuracy without decelerating convergence,
DCO-MOON amplifies this positive effect. The
dynamic constraint of model-contrastive loss and
supervised loss actively adjusts with deviation,
particularly noticeable when there is a significant
change in deviation between the global and local
models. Furthermore, DCO-MOON required
significantly fewer communication rounds to achieve
similar accuracy levels compared to FedAvg. On
CIFAR-10, DCO-MOON necessitated approximately
half the communication rounds of FedAvg. This
highlights its superior communication efficiency.
Figure 3: Top-1 accuracy (%) versus communication rounds for FedAvg, FedProx, MOON, SOLO, and DCO-MOON on (a) MNIST, (b) Fashion-MNIST, and (c) CIFAR-10, with the number of communication rounds T = 100 (Photo/Picture credit: Original).
4 CONCLUSION
This paper proposed DCO-MOON, which achieves better performance by adaptively merging the supervised loss and the model-contrastive loss. It designs a dynamic adjustment mechanism based on a quantified deviation between the global model and the local models. Experimental results demonstrate that DCO-MOON achieves better accuracy and communication efficiency, and that it can better adapt to real-world scenarios facing data-driven challenges with non-IID and unbalanced datasets. In future work, evaluating the deviation between global and local models in a more detailed and quantitative way, and designing corresponding loss-term adjustment mechanisms, are research directions worth exploring.
REFERENCES
P. Voigt, and A. Von dem Bussche, The EU General Data Protection Regulation (GDPR): A Practical Guide, 1st Ed., Cham: Springer International Publishing, (2017).
J. Kingston, Artificial Intelligence and Law, 25(4), 429–
443 (2017).
Q. Li, Z. Wen, Z. Wu, S. Hu, N. Wang, Y. Li, X. Liu, and
B. He, IEEE Trans. Knowl. Data Eng., (2021).
T. Li, A.K. Sahu, A. Talwalkar, and V. Smith, IEEE Signal
Process. Mag. 37(3), 50–60, (2020).
Q. Yang, Y. Liu, T. Chen, and Y. Tong, ACM Transactions
on Intelligent Systems and Technology (TIST) 10(2),
1–19, (2019).
S. Tyagi, I.S. Rajput, and R. Pandey, “Federated learning:
Applications, Security hazards and Defense measures”.
in 2023 International Conference on Device
Intelligence, Computing and Communication
Technologies, (DICCT) (2023), pp. 477–482.
P. Kairouz, H.B. McMahan, B. Avent, A. Bellet, M. Bennis, A.N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, and others, Foundations and Trends® in Machine Learning, 14(1–2), 1–210, (2021).
H. Zhu, J. Xu, S. Liu, and Y. Jin, Neurocomputing, 465,
371–390, (2021).
T. Li, A.K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, Proceedings of Machine Learning and Systems, 2, 429–450, (2020).
X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, arXiv
Preprint arXiv:1907.02189, (2019).
B. McMahan, E. Moore, D. Ramage, S. Hampson, and B.A. y Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data". in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, edited by A. Singh and J. Zhu (PMLR, 2017), pp. 1273–1282.
S. Su, B. Li, and X. Xue, Neural Networks, 164, 203–215,
(2023).
M. Yurochkin, M. Agarwal, S. Ghosh, K. Greenewald, N.
Hoang, and Y. Khazaeni, “Bayesian Nonparametric
Federated Learning of Neural Networks”. in
Proceedings of the 36th International Conference on
Machine Learning, edited by K. Chaudhuri and R.
Salakhutdinov, (PMLR, 2019), pp. 7252–7261.
S. Claici, M. Yurochkin, S. Ghosh, and J. Solomon, “Model
Fusion with Kullback-Leibler Divergence”. in
Proceedings of the 37th International Conference on
Machine Learning, edited by H.D. III and A. Singh,
(PMLR, 2020), pp. 2038–2047.
S. Shukla, and N. Srivastava, “Federated matched
averaging with information-gain based parameter
sampling”. in Proceedings of the First International
Conference on AI-ML Systems, (Association for
Computing Machinery, New York, NY, USA, 2021),
pp. 1-7.
Q. Li, B. He, and D. Song, “Model-contrastive federated
learning”. in Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, (2021),
pp. 10713–10722.