
Decentralized Federated Learning of Probabilistic Generative Classifiers

Aritz Pérez, Carlos Echegoyen and Guzmán Santafé

Aritz Pérez is at the Basque Center for Applied Mathematics, 48009 Bilbao, Spain. Email: aperez@bcamath.org

Carlos Echegoyen and Guzmán Santafé are with the Spatial Statistics Group and INAMAT2, Public University of Navarre, 31006 Pamplona, Spain. Email: {carlos.echegoyen, guzman.santafe}@unavarra.es
Abstract

Federated learning is a paradigm of increasing relevance in real-world applications, aimed at building a global model across a network of heterogeneous users without requiring the sharing of private data. We focus on model learning over decentralized architectures, where users collaborate directly to update the global model without relying on a central server. In this context, the current paper proposes a novel approach to collaboratively learn probabilistic generative classifiers with a parametric form. The framework is composed of a communication network over a set of local nodes, each of which has its own local data, and a local updating rule. The proposal involves sharing local statistics with neighboring nodes, where each node aggregates the neighbors' information and iteratively learns its own local classifier, which progressively converges to a global model. Extensive experiments demonstrate that the algorithm consistently converges to a globally competitive model across a wide range of network topologies, network sizes, local dataset sizes, and extreme non-i.i.d. data distributions.

Index Terms:
Decentralized federated learning, supervised classification, empirical risk minimization, probabilistic models, non-i.i.d. data.

I Introduction

In recent years, federated learning (FL) [1, 2] has gained increasing attention from both the research community [3, 4] and private companies [5, 6], as it enables the development of machine learning models across multiple users without requiring data centralization. This design inherently offers a fundamental layer of privacy while reducing the costs associated with massive data storage. FL traditionally achieves this by using a user-server architecture, where users train local models and share updates with a central server that aggregates them to build a global model [7, 8]. In contrast, decentralized FL [9, 10, 4] eliminates the need for a central server by enabling users to communicate directly and collaboratively train machine learning models. This decentralized paradigm not only removes key risks associated with centralization, such as single points of failure and server communication bottlenecks, but also enhances efficiency and robustness through direct peer-to-peer communication. Moreover, it allows the system to operate over existing network topologies, reducing the need for centralized infrastructure.

The network architecture has a crucial impact on the model convergence [4]. In centralized FL, the central server coordinates the aggregation of local models, ensuring a direct and consistent flow of information between users and the server. However, in decentralized architectures, nodes rely on peer-to-peer communication, which entails additional challenges. The network topology plays a critical role in determining how quickly and effectively local models converge to a global solution. For example, highly connected networks facilitate faster convergence by enabling rapid information exchange among neighbors, whereas sparsely connected networks may lead to delayed updates and slower convergence.

Beyond the network topology, FL faces challenges across multiple dimensions, including handling heterogeneous data distributions, optimizing communication, managing security risks, operating in resource-constrained environments, and ensuring fair learning, among others [11]. A fundamental methodological challenge is the development of general learning schemes that can effectively accommodate diverse model families, as data characteristics and task requirements can vary significantly depending on the problem at hand. In this context, we focus on decentralized federated learning of probabilistic generative classifiers, which model the joint probability distribution of the data in an efficient, flexible, and interpretable manner, making their application both feasible and valuable across a wide variety of classification tasks.

In the current work, we introduce a novel collaborative method designed for learning probabilistic generative classifiers over decentralized networks, enabling effective model training across distributed data owners. We argue that learning probabilistic generative classifiers in a serverless architecture is an important research line that has received limited attention due to the predominant use of deep neural network models in FL. This kind of model is not always appropriate, especially when the nature of the data deviates from the typical setting of large-scale local datasets with highly complex instances. Probabilistic models, in turn, emerge as an effective alternative when, for instance, we work with limited local datasets, deal with missing data, require model interpretability, need to model dependencies between variables, aim to incorporate prior knowledge, or when nodes have limited computing power, such as in edge devices or the Internet of Things [12]. A representative case is the medical domain, where data scarcity and privacy constraints often make simpler and more interpretable models preferable [13, 14, 15]. Although FL is undergoing a very rapid development of new methods, only a few works have explored alternatives to neural networks [16, 9, 17], with most of them focusing on Bayesian models within a user-server architecture, e.g., implementing federated variational inference [18] or learning posterior distributions on the server side [19].

While other decentralized federated learning approaches have been proposed, they differ substantially in their assumptions and model scope. For instance, in [4], a decentralized scheme based on undirected graphs is proposed to implement a federated version of gradient descent for empirical risk minimization. Although their method can theoretically be applied to any classifier learned via gradient-based optimization, their empirical validation is limited to neural networks trained on three datasets using relatively small communication networks (20 nodes) and large local training sets (thousands of samples). Gradient-based methods benefit from directly optimizing classification-specific objectives and can yield better classification performance in some cases. However, they lack many of the key benefits of probabilistic generative classifiers discussed earlier, such as interpretability, robustness to missing data, and the ability to incorporate prior knowledge.

On the other hand, gradient descent is not well-suited for learning probabilistic generative classifiers due to the structural constraints these models impose. In particular, as shown in [20], risk-based calibration is more appropriate than gradient-based optimization for training probabilistic generative classifiers, which often require parameter constraints to be satisfied, e.g., positive semi-definite covariance matrices in quadratic discriminant analysis (QDA), or probability distributions that sum to one in naive Bayes (NB). Gradient descent may violate these constraints, requiring costly corrective projections that impair convergence and degrade classification performance.

In contrast, our work introduces a novel decentralized FL algorithm named Collaborative Risk-based Calibration (CRC). This proposal is based on the risk-based calibration (RC) framework [20] and is designed to optimize the empirical risk when learning probabilistic generative classifiers, such as quadratic discriminant analysis or Bayesian network classifiers, across a network of users without requiring a central server. Unlike traditional Federated Averaging (FedAvg) schemes [1, 4], which aggregate model parameters, CRC shares and aggregates local sufficient statistics to enable decentralized training.

We demonstrate that the CRC scheme effectively implements probabilistic models to address classification problems without the need for a central server. The analysis focuses on the model learning process, showing that local models converge to a common model that reduces the overall error of the network competitively compared to its centralized counterpart. This is validated through extensive experiments across a range of scenarios, including: i) diverse classification problems involving varying numbers of continuous and discrete variables, as well as different numbers of classes, ii) a range of graph structures providing communication networks of different topologies and sizes, including dynamic connectivity, iii) non-identically distributed user datasets generated under various challenging conditions, iv) different sizes and degrees of fragmentation of local datasets, and v) different numbers of local iterations per round of learning.

The remainder of this paper is organized as follows. Section II introduces probabilistic generative classifiers, the Risk-based Calibration (RC) algorithm for learning this type of classifier, and the decentralized federated learning framework. Section III presents Collaborative Risk-based Calibration (CRC), the novel federated decentralized version of RC. Section IV empirically evaluates CRC, mainly in terms of the error gap between CRC and RC. Finally, Section V summarizes the main contributions of this work. Additionally, the Appendix provides details on the naive Bayes (NB) classifier and basic theoretical properties of CRC.

II Preliminaries

In this section, we introduce supervised learning of probabilistic generative classifiers and the risk-based calibration algorithm, which is adapted in this paper to learn probabilistic generative classifiers over decentralized architectures. We also introduce the framework used to conduct decentralized federated learning.

II-A Learning probabilistic generative classifiers

Supervised classification focuses on training a classifier from data to minimize the expected loss (or risk). A classifier $h \in \mathcal{H}$ is a function that maps instances to labels, $h: \mathcal{X} \rightarrow \mathcal{Y}$, where $\mathcal{X} \subset \mathbb{R}^{q}$ is the input space and $\mathcal{Y}=\{1,\cdots,r\}$ the set of class labels. The loss, $l$, of a classifier $h$ at instance $(\bm{x},y)$, with $\bm{x} \in \mathcal{X}$ and $y \in \mathcal{Y}$, is defined as $l: \mathcal{H} \times (\mathcal{X} \times \mathcal{Y}) \rightarrow \mathbb{R}^{+}$. Thus, supervised learning aims to find a classifier, $h \in \mathcal{H}$, that minimizes the risk (expected loss):

$$\arg\min_{h} E_{p}[l(h,(\bm{x},y))], \qquad (1)$$

where $p \in \Delta(\mathcal{X} \times \mathcal{Y})$ is the underlying joint distribution of the data.

In practice, the underlying joint distribution, $p$, is unknown, and we only have access to a supervised training set, $(X,Y)=\{(\bm{x}^{i},y^{i})\}_{i=1}^{m}$, with independent and identically distributed (i.i.d.) instances according to $p$. Generally, the 0-1 loss is considered, $l_{01}(h,(\bm{x},y))=\mathbb{1}(y\neq h(\bm{x}))$, and then the objective is to minimize the empirical risk under the 0-1 loss, or empirical error:

$$\arg\min_{h} \frac{1}{m}\sum_{\bm{x},y \in X,Y} l_{01}(h,(\bm{x},y)). \qquad (2)$$

The learning problem is further simplified to be tractable. For instance, probabilistic generative classifiers solve the problem by approximating the underlying joint distribution using some specific parametric family of distributions $p_{\bm{\theta}} \in \Delta(\mathcal{X} \times \mathcal{Y})$, where $\bm{\theta} \in \Theta$ represents the parameters of the considered family of distributions. Usually, maximum likelihood or maximum a posteriori parameters are taken for $\bm{\theta}$, and the classifier $h_{\bm{\theta}}$ assigns to each $\bm{x} \in \mathcal{X}$ the class label that maximizes the class conditional probability, $h_{\bm{\theta}}(\bm{x})=\arg\max_{y} p_{\bm{\theta}}(\bm{x},y)$. When it is clear from the context, we denote the classifier $h_{\bm{\theta}}$ simply as $h$. The intuition is that by having an accurate modeling of $p$ given by $p_{\bm{\theta}}$, probabilistic generative classifiers can reach the minimum error, a.k.a. the Bayes error. Typical examples of probabilistic generative classifiers include those based on joint probability distributions from the exponential family, such as quadratic discriminant analysis and the NB classifiers from the family of classifiers based on Bayesian networks [21, 22].

By modeling the joint probability distribution, probabilistic generative classifiers offer several practical advantages over alternative approaches. This modeling enables intuitive interpretation of the parameters (e.g., means and covariances), facilitates the incorporation of prior knowledge via Bayesian frameworks (especially when using conjugate priors), and provides a comprehensive characterization of the data-generating process [23, 24]. Their relatively low number of parameters (e.g., NB with linear parameter growth) acts as a natural regularizer, helping to reduce overfitting and ensuring stable parameter updates. These models often generalize well even with small training sets, sometimes achieving optimal performance with sample sizes logarithmic in the number of parameters [25]. Moreover, probabilistic generative models support synthetic data generation [26], handle missing data robustly via the EM algorithm [27, 28, 29], and are well-suited to Bayesian decision theory, providing optimal predictions under cost-sensitive loss functions when class-conditional distributions are accurately modeled [30, 31].

Probabilistic generative classifiers typically offer closed-form algorithms that estimate parameters in a single pass over the training data. These algorithms aim to maximize data-fitting objectives, such as the likelihood function. For certain parametric families of distributions, such as the exponential family, closed-form solutions based on data statistics exist that allow the computation of maximum likelihood or maximum a posteriori estimates given the data. We formalize the closed-form algorithms $a: \mathcal{X}^{m} \times \mathcal{Y}^{m} \rightarrow \mathbb{R}^{o}$ as the composition of a statistics mapping,

$$s: \mathcal{X}^{m} \times \mathcal{Y}^{m} \rightarrow \mathbb{R}^{k},$$

and a parameter mapping

$$\theta: \mathbb{R}^{k} \rightarrow \mathbb{R}^{o}.$$

We assume that the statistics mapping is additive

$$s(X,Y)=\sum_{\bm{x},y \in X,Y} s(\bm{x},y),$$

where $s(\bm{x},y): \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}^{k}$. By letting $a=\theta \circ s$, the parameters obtained by a closed-form algorithm are independent of the data $X,Y$ given the statistics $s(X,Y)$. The computational complexity of a closed-form algorithm $a$ is typically $\mathcal{O}(m \times k + k \times o)$. For instance, the maximum likelihood parameters of NB can be obtained using a closed-form algorithm (see Appendix A).
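To make the decomposition $a=\theta \circ s$ concrete, the following minimal Python sketch implements it for a naive Bayes classifier with categorical features. The function names and the smoothing constant are illustrative assumptions, not the paper's notation or implementation.

```python
import numpy as np

def statistics_mapping(X, Y, r, n_values):
    """Additive statistics s(X, Y): class counts and per-class feature-value counts."""
    q = X.shape[1]
    class_counts = np.zeros(r)
    feature_counts = np.zeros((r, q, n_values))
    for x, y in zip(X, Y):
        class_counts[y] += 1.0
        for j, v in enumerate(x):
            feature_counts[y, j, v] += 1.0
    return class_counts, feature_counts

def parameter_mapping(stats, eps=1e-9):
    """Maximum-likelihood parameters theta(s), computed from the statistics alone."""
    class_counts, feature_counts = stats
    p_y = class_counts / class_counts.sum()
    p_x_given_y = feature_counts / (class_counts[:, None, None] + eps)
    return p_y, p_x_given_y

def closed_form_algorithm(X, Y, r, n_values):
    """a = theta o s: one pass over the data followed by the parameter mapping."""
    return parameter_mapping(statistics_mapping(X, Y, r, n_values))

def predict(theta, x):
    """h_theta(x) = argmax_y p_theta(x, y) under the naive Bayes factorization."""
    p_y, p_x_given_y = theta
    log_joint = np.log(p_y + 1e-12)
    for j, v in enumerate(x):
        log_joint = log_joint + np.log(p_x_given_y[:, j, v] + 1e-12)
    return int(np.argmax(log_joint))
```

Here the statistics are additive over instances, so they can be computed in a single pass and, as exploited later by CRC, summed or averaged across nodes.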

Unfortunately, as probabilistic generative classifiers optimize likelihood-based objectives, they may obtain a higher classification error than other kinds of classifiers that directly minimize a classification-specific loss [32]. For example, it is well established that the influence of the likelihood function on classification performance can diminish as the dimensionality of the feature space increases [33]. The next section describes a recently proposed procedure that allows learning the parameters of probabilistic generative classifiers by minimizing the empirical risk, which improves classification performance while preserving the advantages of probabilistic generative classifiers.

II-B Risk-based calibration

The risk-based calibration (RC) algorithm [20] is an iterative algorithm designed to reduce the empirical risk of a probabilistic generative classifier by leveraging an existing closed-form algorithm. RC progressively updates the statistics used to compute the parameter estimators, guided by the soft 0-1 loss, to find the parameters $\bm{\theta}^{*}$ that minimize the corresponding empirical risk. The soft 0-1 loss is defined as:

$$l_{s01}(h,(\bm{x},y)) = 1 - p_{\bm{\theta}}(y|\bm{x}) = \sum_{y'\neq y} p_{\bm{\theta}}(y'|\bm{x}),$$

where $p_{\bm{\theta}}(y|\bm{x})$ is obtained from the joint distribution by Bayes' rule. Intuitively, the soft 0-1 loss corresponds to the expected 0-1 loss of a randomized classifier that selects $y$ according to the class conditional distribution $p_{\bm{\theta}}(y|\bm{x})$.

Algorithm 1 Risk-based Calibration (RC)
1:  Input: Training set $(X,Y)$ with $m$ labeled instances; learning algorithm $a=\theta \circ s$, defined by the statistics mapping $s$ and the parameter mapping $\theta$; learning rate $lr$.
2:  Output: Classifier $h$.
3:  $\bm{s}^{0} \leftarrow s(X,Y)$
4:  $\bm{\theta}^{0} \leftarrow \theta(\bm{s}^{0})$
5:  $t \leftarrow 1$
6:  for $t=1,2,\cdots,t^{max}$ do
7:     $\bm{s}^{t} \leftarrow \bm{s}^{t-1} + lr \cdot \big(s(X,Y) - s(X,\bm{\theta}^{t-1})\big)$
8:     $\bm{\theta}^{t} \leftarrow \theta(\bm{s}^{t})$
9:  end for
10: return $h$ with parameters $\bm{\theta}^{t^{max}}$

Algorithm 1 shows the pseudo-code of RC. After initializing the statistics and the probabilistic generative classifier (steps 3 and 4), the algorithm iteratively updates the statistics and the associated classifier using the training data and the classifier obtained in the previous iteration (steps 7 and 8). The iterative calibration of the statistics is repeated for $t^{max}$ iterations, and RC returns the classifier with parameters $\bm{\theta}^{t^{max}}$ (step 10).

The intuition behind the updating rule of RC is to increase $p_{\bm{\theta}}(y|\bm{x})$ and reduce $p_{\bm{\theta}}(y'|\bm{x})$ for each instance $(\bm{x},y)$ by calibrating the set of statistics, $\bm{s}$, used to calculate the model parameters $\bm{\theta}=\theta(\bm{s})$, according to the soft 0-1 loss. Assuming that $s(\bm{x},y)$ denotes the statistics mapping of instance $(\bm{x},y)$, the calibration of $\bm{s}$ is conducted by adding $s(\bm{x},y)$ with weight $1-p_{\bm{\theta}}(y|\bm{x})$ and subtracting $s(\bm{x},y')$ with weight $p_{\bm{\theta}}(y'|\bm{x})$ for all $y' \neq y$. This leads to the following update rule given the parameters $\theta(\bm{s})$ and the labeled instance $(\bm{x},y)$:

$$\bm{s} = \bm{s} + s(\bm{x},y) - \sum_{y' \in \mathcal{Y}} p_{\bm{\theta}}(y'|\bm{x}) \cdot s(\bm{x},y'). \qquad (3)$$

Given the additive nature of the statistics, $\bm{s}$, the RC update rule for the whole dataset $(X,Y)$ is given by step 7 in Algorithm 1, where $\bm{\theta}^{t-1}$ denotes the parameters at iteration $t-1$, i.e., $\theta(\bm{s}^{t-1})$. In this algorithm, with a slight abuse of notation, $s(X,Y)$ denotes the statistics mapping over the training set $(X,Y)$, while $s(X,\bm{\theta})$ represents the probabilistic statistics mapping according to the class conditional distribution $p_{\bm{\theta}}(y|\bm{x})$, defined as

$$s(X,\bm{\theta}) = \sum_{\bm{x} \in X} \sum_{y' \in \mathcal{Y}} p_{\bm{\theta}}(y'|\bm{x}) \cdot s(\bm{x},y').$$
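Continuing the naive Bayes sketch from Section II-A, the probabilistic statistics mapping $s(X,\bm{\theta})$ and the RC loop of Algorithm 1 can be written compactly as below. This is a hedged illustration that reuses the earlier helper functions; the clipping of statistics to positive values is a practical safeguard of this sketch, not part of the formal update rule.

```python
def posterior(theta, x):
    """Class posterior p_theta(y | x), obtained from the joint distribution by Bayes' rule."""
    p_y, p_x_given_y = theta
    log_joint = np.log(p_y + 1e-12)
    for j, v in enumerate(x):
        log_joint = log_joint + np.log(p_x_given_y[:, j, v] + 1e-12)
    joint = np.exp(log_joint - log_joint.max())
    return joint / joint.sum()

def probabilistic_statistics(X, theta, r, n_values):
    """s(X, theta): per-instance statistics weighted by the posterior p_theta(y' | x)."""
    q = X.shape[1]
    class_counts = np.zeros(r)
    feature_counts = np.zeros((r, q, n_values))
    for x in X:
        p = posterior(theta, x)
        class_counts += p
        for j, v in enumerate(x):
            feature_counts[:, j, v] += p
    return class_counts, feature_counts

def rc(X, Y, r, n_values, lr=0.05, t_max=64):
    """Risk-based calibration (Algorithm 1) for the discrete naive Bayes sketch."""
    s_data = statistics_mapping(X, Y, r, n_values)   # s(X, Y), computed once
    s_t = s_data                                     # s^0
    theta_t = parameter_mapping(s_t)                 # theta^0
    for _ in range(t_max):
        s_model = probabilistic_statistics(X, theta_t, r, n_values)
        # s^t <- s^{t-1} + lr * (s(X, Y) - s(X, theta^{t-1})); clipped to stay positive
        s_t = tuple(np.maximum(a + lr * (b - c), 1e-6)
                    for a, b, c in zip(s_t, s_data, s_model))
        theta_t = parameter_mapping(s_t)
    return theta_t
```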

II-C Federated decentralized learning

Decentralized FL is a collaborative, distributed machine learning approach designed to address classification problems across a network of users, without sharing private data or relying on a central server. In centralized FL, the nodes own the data and the server performs the aggregation of the user parameters to build the global model. In the absence of a server, the nodes both own the data and perform aggregation tasks, following a graph that defines the connections between them in the network.

The network is defined by an undirected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}=\{1,\ldots,n\}$ is the set of nodes and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges that represent the existing connections between nodes. The set of neighbors of a node $v \in \mathcal{V}$ is $\mathcal{N}_{v}=\{u : (v,u) \in \mathcal{E}\}$. Each node of the network has its own local training set $(X_{v},Y_{v})$ with $m_{v}$ instances. We call the union of the local training datasets the global training data, $(X,Y) = (\bigcup_{v \in \mathcal{V}} X_{v}, \bigcup_{v \in \mathcal{V}} Y_{v})$.

In general, the objective in federated learning [7] is to find a classifier, $h$, that minimizes the weighted sum of the empirical risks over the local datasets:

$$\arg\min_{h} \sum_{v \in \mathcal{V}} \frac{w_{v}}{m_{v}} \sum_{\bm{x},y \in X_{v},Y_{v}} l_{01}(h,(\bm{x},y)), \qquad (4)$$

where, with a slight abuse of notation, $l_{01}(h,(X_{v},Y_{v}))$ denotes the empirical error of $h$ over the local training data of node $v$, and $w_{v}$ is the associated weight. The weight $w_{v}$ is commonly defined as $w_{v}=m_{v}/m$, giving equal importance to each data sample in the network, or as $w_{v}=1/n$, treating every node as equally important regardless of the number of samples [34]. In this work, we focus on minimizing the soft 0-1 error over the global training set, corresponding to the federated error with weights $w_{v}=m_{v}/m$ for all $v \in \mathcal{V}$ (Equation 4).
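As a small illustration of this objective with the sample-proportional weights $w_{v}=m_{v}/m$, the weighted federated error reduces to the 0-1 error over the pooled global training data; a hypothetical helper might look as follows.

```python
def federated_error(h, local_datasets):
    """Weighted empirical 0-1 error of classifier h over local datasets [(X_v, Y_v), ...].
    With w_v = m_v / m every sample contributes 1/m, i.e. the error on the global data."""
    m = sum(len(Y_v) for _, Y_v in local_datasets)
    wrong = sum(1.0 for X_v, Y_v in local_datasets
                for x, y in zip(X_v, Y_v) if h(x) != y)
    return wrong / m
```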

III Collaborative risk-based calibration

In this section, we present the novel federated decentralized version of RC called collaborative RC (CRC).

Algorithm 2 Collaborative RC (CRC)
1:  Input: Local training sets $(X_{v},Y_{v})$ with $m_{v}$ labeled instances for $v \in \mathcal{V}$; communication network given by the neighbors $\mathcal{N}_{v}$ for $v \in \mathcal{V}$; number of communication rounds $t^{max}$; number of local iterations $iter$; equivalent sample size $m^{0}$; learning algorithm $a=\theta \circ s$, defined by the statistics mapping $s$ and the parameter mapping $\theta$.
2:  Output: Local classifiers $h_{v}$ and local statistics $\bm{s}_{v}$ for $v \in \mathcal{V}$.
3:  for $v \in \mathcal{V}$ do
4:     $\bm{s}_{v}^{0} \leftarrow initialize(m^{0})$
5:  end for
6:  for $t=1,2,\cdots,t^{max}$ do
7:     for $v \in \mathcal{V}$ do
8:        $\bar{\bm{s}}_{v}^{t} \leftarrow \frac{1}{|\mathcal{N}_{v}|} \sum_{u \in \mathcal{N}_{v}} \bm{s}_{u}^{t-1}$
9:        $h_{v}^{t}, \bm{s}_{v}^{t} \leftarrow LocalRC(\bar{\bm{s}}_{v}^{t}, X_{v}, Y_{v}, iter, a)$
10:    end for
11: end for
12: return $h_{v}^{t^{max}}, \bm{s}_{v}^{t^{max}}$ for $v \in \mathcal{V}$

Algorithm 2 shows the pseudo-code of CRC. The algorithm consists of a sequence of rounds in which each node aggregates the statistics of its neighborhood by averaging them and then updates the statistics and the associated classifier with its local training data using LRC, given in Algorithm 3. The local statistics $\bm{s}_{v}^{0}$ are initialized so that they have an equivalent sample size of $m^{0}$ and the conditional distribution with parameters $\theta(\bm{s}_{v}^{0})$ is uniform in $\mathcal{Y}$ for any $\bm{x} \in \mathcal{X}$, for all $v \in \mathcal{V}$ (line 4, Algorithm 2). Thus, initially, all local probabilistic generative models are identical and have no preference towards any class label. Appendix A gives the details of the uniform initialization for NB. The aggregation rule of CRC is simply the average of the neighborhood statistics. All the local statistics share the same equivalent sample size $m^{0}$, and the updating rule of LRC does not modify the equivalent sample size of the updated statistics. Thus, the averaging aggregation maintains the equivalent sample size of the aggregated statistics.

Algorithm 3 Local RC (LRC)
1:  Input: Aggregated statistics $\bar{\bm{s}}_{v}$; local training set $(X_{v},Y_{v})$; number of iterations $iter$; learning algorithm $a=\theta \circ s$, defined by the statistics mapping $s$ and the parameter mapping $\theta$.
2:  Output: Local classifier $h_{v}$; updated local statistics $\bm{s}_{v}$.
3:  $\bm{s}_{v} \leftarrow \bar{\bm{s}}_{v}$
4:  for $i=1,2,\cdots,iter$ do
5:     $\bm{s}_{v} \leftarrow \bm{s}_{v} + s(X_{v},Y_{v}) - s(X_{v},\theta(\bm{s}_{v}))$
6:  end for
7:  return $h_{v}$ with parameters $\theta(\bm{s}_{v})$, $\bm{s}_{v}$

Algorithm 3 shows the pseudo-code of LRC. LRC is a simplified version of RC that updates the input statistics using the RC updating rule a fixed number of times, given by the parameter $iter$. The updating rule of LRC inherits from RC the property of maintaining the equivalent sample size of the input statistics. The equivalent sample size of the input aggregated statistics, $m^{0}$, represents their inertia with respect to the local updates of LRC. It plays the role of a local learning rate that depends on the sample size of the local data, $m_{v}$, and is given by $m_{v}/m^{0}$. Thus, the updates of a user with more local data have a stronger impact on the resulting local statistics.
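Reusing the naive Bayes and RC helpers sketched earlier, one CRC round for a single node (lines 8-9 of Algorithm 2 together with Algorithm 3) can be illustrated as follows. The dictionary-based bookkeeping of per-node statistics and the positivity clipping are implementation assumptions of this sketch.

```python
def local_rc(s_bar, X_v, Y_v, r, n_values, iters=1):
    """LRC (Algorithm 3): calibrate the aggregated statistics with the local data."""
    s_v = s_bar
    s_data = statistics_mapping(X_v, Y_v, r, n_values)
    for _ in range(iters):
        theta_v = parameter_mapping(s_v)
        s_model = probabilistic_statistics(X_v, theta_v, r, n_values)
        # s_v <- s_v + s(X_v, Y_v) - s(X_v, theta(s_v)); clipped to stay positive
        s_v = tuple(np.maximum(a + b - c, 1e-6)
                    for a, b, c in zip(s_v, s_data, s_model))
    return parameter_mapping(s_v), s_v

def crc_round(v, neighbors, stats, data, r, n_values, iters=1):
    """One CRC round for node v: average the neighbors' statistics, then run LRC locally.
    `stats` maps node -> statistics of the previous round; `data` maps node -> (X_v, Y_v)."""
    s_bar = tuple(sum(parts) / len(neighbors)
                  for parts in zip(*(stats[u] for u in neighbors)))
    X_v, Y_v = data[v]
    return local_rc(s_bar, X_v, Y_v, r, n_values, iters)
```

Because the statistics are additive and all nodes share the same equivalent sample size, the plain average in `crc_round` preserves that sample size, as noted above.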

In line with the decentralized nature of CRC, Figure 1 shows an example of round $t$ of CRC for node $4$ with neighborhood $\mathcal{N}_{4}=\{1,2,3\}$. At each round, the behavior of a user is independent of the rest of the network given its neighbors. First, node $4$ receives the statistics of $\mathcal{N}_{4}$ produced in the previous round, $\{\bm{s}_{1}^{t-1},\bm{s}_{2}^{t-1},\bm{s}_{3}^{t-1}\}$; then node $4$ combines the neighbor statistics using the aggregation rule (line 8, Algorithm 2) to obtain $\bar{\bm{s}}_{4}^{t}$. Next, node $4$ updates the aggregated statistics with its local data using the update rule (line 5, Algorithm 3) to obtain $\bm{s}_{4}^{t}$, and it communicates the updated statistics to its neighbors $\mathcal{N}_{4}$.

Figure 1: Example of round $t$ of CRC with a single local update of the statistics for node $4$, using the local data $(X_{4},Y_{4})$ and the communications with its neighbors, $\mathcal{N}_{4}=\{1,2,3\}$. The update is given by the aggregation rule and the update rule.

III-A On the equivalence of CRC and RC

Despite its extreme simplicity, CRC is able to replicate RC by considering a complete graph as the communication network, without further assumptions on the available local training data.

Theorem 1.

Let $(\mathcal{V},\mathcal{E})$ be a communication network with $\mathcal{N}_{v}=\mathcal{V}$ for all $v \in \mathcal{V}$. Let $\{(X_{v},Y_{v}) : v \in \mathcal{V}\}$ be a partition of the global training data $(X,Y)$, where $m=\sum_{v \in \mathcal{V}} m_{v}$. Consider RC with learning rate $lr>0$, using the global training data $(X,Y)$ with uniform initialization and a single iteration, i.e., $iter=1$. Now consider CRC with equivalent sample size $m^{0}=\frac{m}{lr \cdot n}$, using the local data $(X_{v},Y_{v})$ for each $v \in \mathcal{V}$ with uniform initialization and one local iteration ($iter=1$). Then, all local parameters in the aggregation step of CRC coincide with the parameters of RC at the previous iteration:

$$\theta(\bar{\bm{s}}^{t}_{v}) = \theta(\bm{s}^{t-1}_{g}), \quad \text{for } t=1,\dots,t^{\max} \text{ and } v \in \mathcal{V},$$

where $\bm{s}^{t-1}_{g}$ denotes the statistics maintained by RC on the global training data at iteration $t-1$.

This result shows that, in the extreme case of a complete communication network, CRC replicates the behavior of RC. In most practical scenarios, a complete communication network is of limited interest and only sparse communication networks are available. However, in general, the denser the communication network (the closer it is to the complete graph), the closer the sequence of local classifiers obtained by CRC is to that obtained by RC and, at the same time, the more similar the local classifiers are to each other. The sparsest communication network with a single connected component is a tree, which has only $n-1$ edges. In the experiments, we empirically show that the gap between CRC and RC is smaller than $0.01$ for most of the datasets. We also show that the differences become smaller just by adding a small percentage of the possible edges.

Theorem 1 does not require assumptions about the local data generation process, and thus the equivalence between CRC with a complete communication network and RC holds even for non-i.i.d. local training sets. In the experiments, we show that this behavior extends empirically to the sparsest communication networks and that, by adding a small percentage of the possible edges, CRC can obtain gaps smaller than $0.01$ for most of the datasets.

Using Theorem 1, we propose the following heuristic to select $m^{0}$ for a communication network, given a learning rate $lr$ of RC:

$$m^{0}=\frac{m}{lr \cdot n}, \qquad (5)$$

where m𝑚mitalic_m is the size of the global training data in the network. For instance, for lr=0.05𝑙𝑟0.05lr=0.05italic_l italic_r = 0.05 we have m0=20mnsuperscript𝑚020𝑚𝑛m^{0}=\frac{20*m}{n}italic_m start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT = divide start_ARG 20 ∗ italic_m end_ARG start_ARG italic_n end_ARG, and for the particular case of constant local training set size m0=20mvsuperscript𝑚020subscript𝑚𝑣m^{0}=20\cdot m_{v}italic_m start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT = 20 ⋅ italic_m start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT.

IV Experimental results

This section presents experimental results that show the effectiveness of CRC in learning probabilistic generative classifiers in federated decentralized settings. More specifically, we evaluate the robustness, scalability, and convergence behavior of CRC under a broad variety of configurations of the communication network and local training sets relevant to real-world federated decentralized scenarios.

All experiments were conducted using the NB classifier with both discrete and continuous variables, trained via a closed-form maximum likelihood algorithm. The objective is to evaluate the collaborative risk-based calibration (CRC) learning algorithm under this same training procedure. Appendix A provides the details on the NB classifier and the maximum likelihood closed-form learning algorithm. CRC learns a model for each user in a decentralized network of size $n$, where each user has a local training set of size $m_{v}$. CRC is compared against the risk-based calibration algorithm (RC) using the same closed-form algorithm. RC learns NB using the global training set given by the union of the local training data from all nodes in the network. NB learned using RC represents the gold standard, since it is a centralized version of CRC with access to the entire training data in the network, and it provides an upper bound on the performance of the local NB models learned using CRC. We also include, as a reference, the results obtained by NB with the maximum likelihood parameters using the global training set (ML).

Training and test errors are the average 0-1 loss of NB on the global training data and on a disjoint test set, respectively. In the case of CRC, we have $n$ different NB classifiers, one for each user, and the training and test errors correspond to the average training and test errors of the local NB classifiers across the different users. We would like to highlight that the training error for CRC is the average error of NB classifiers calibrated using only local training sets but evaluated on the global training set. We call the differences between the training and test errors of CRC and RC the training and test gaps, respectively. We are primarily interested in the training and test gaps between CRC and RC, since the errors of RC are lower bounds on those of CRC. All the experiments have been repeated 5 times using different splits for the global training and test sets, and randomly generated communication networks. All the scores that we report in the tables are averaged across the 5 repetitions. In the tables summarizing the results, we highlight gaps in $(0.01, 0.05]$ in bold gray and those lower than $0.01$ in bold black. We consider that CRC has converged to RC when the gaps are smaller than $0.01$.

The default parameters of the experiments are: number of nodes $n=50$, size of the local training datasets $m_{v}=50$ (i.i.d.), maximum number of communication rounds $t^{max}=64$, number of LRC iterations $iter=1$, learning rate of RC with global data $lr=0.05$, and equivalent sample size $m^{0}=m_{v}/lr$. We show the results of CRC and RC for round $t^{max}=64$. The default communication topology is a tree, generated at random. Trees are the sparsest connected graphs, with only $n-1$ edges out of the $n(n-1)/2$ possible. We have selected the sparsest connected graphs to expose CRC to the most challenging communication network topology. As $n$ increases, the sparseness of the tree increases, where sparseness is the ratio of included edges to possible edges, given by $2/(n-1)$ for trees. Figure 2 shows the effect of the tree size $n$ on the average proportion of nodes within a given communication distance, which highlights the increasing difficulty of cooperative learning as the size of the tree grows. To further increase the difficulty of learning accurate local NB classifiers using CRC, we have selected a relatively large default size for the communication network ($n=50$) and a small default size for the local training sets ($m_{v}=50$) [4].
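For reference, random tree topologies like the default one used here can be generated with a simple attachment process; this is an illustrative construction, not necessarily the sampler used in the paper.

```python
import random

def random_tree_neighbors(n, seed=0):
    """Random tree over nodes 0..n-1: each new node attaches to a uniformly chosen earlier
    node. The result has n-1 edges, i.e. sparseness 2/(n-1) of the n(n-1)/2 possible edges."""
    rng = random.Random(seed)
    neighbors = {v: set() for v in range(n)}
    for v in range(1, n):
        u = rng.randrange(v)      # attach node v to an already-placed node u
        neighbors[v].add(u)
        neighbors[u].add(v)
    return neighbors

# Default setting used in the experiments: n = 50 nodes connected as a random tree.
N = random_tree_neighbors(50)
```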

Figure 2: Average ratio of nodes at a distance less than or equal to a given value, relative to $n$, in random trees. This ratio indicates the fraction of nodes that can influence a given node and relates to the difficulty of convergence of CRC.

Experiments are conducted on 16 datasets. To allow experiments with large communication networks, we have selected datasets with a minimum of 16000 samples. Continuous variables with ten or fewer distinct values have been treated as discrete variables. Table I summarizes the main features of the datasets. Table II shows the test gaps between RC using different training set sizes $m \in \{20, 40, 80, 160, 320, 640, 1280, 2560, 5120\}$ and RC trained using $m=10240$, taken as the gold standard. These gaps indicate the amount of training data required to obtain errors close to those of RC with the largest training set, which is a key determining factor in federated decentralized learning; recall that the default local training set size is $m_{v}=50$. The last column in Table II shows the test error of the gold standard for reference. Table I lists the datasets ordered by the gaps of Table II to ease the analysis of the experimental results. Note that the datasets at the bottom of Table II correspond to more complex classification problems. For example, the Letter dataset involves predicting 26 distinct class labels. Consequently, small local datasets are not sufficient to train accurate classifiers. Even with a training set of size 1280, the gap relative to the reference remains substantial (0.13).

Table I: Dataset information. Columns: name, the name of the dataset; $m$, the number of instances; $d$, the number of features; disc., the number of discrete features; cont., the number of continuous features; and $|\mathcal{Y}|$, the number of class labels.
name $m$ $d$ disc. cont. $|\mathcal{Y}|$
skin 245056 3 0 3 2
pulsar stars 17897 8 0 8 2
catsvsdogs 23262 512 0 512 2
bank marketing 45211 16 7 9 2
default credit card 29999 23 5 18 2
soft heart disease 319795 17 12 5 2
adult 48841 14 5 9 2
run or walk 88588 6 0 6 2
smartgrid stability 59999 13 0 13 2
smoking 55691 24 6 18 2
yearbook 37921 512 0 512 2
secondary mushroom 61067 19 12 7 2
fashion mnist 70000 512 0 512 10
cifar10 60000 512 0 512 10
mnist 70000 510 0 510 10
letter recognition 20000 16 0 16 26
Table II: Test gaps between RC trained with different global training set sizes, $m$, and RC trained with $m=10240$. The last column shows the test error of RC for $m=10240$. Gaps $<0.01$ are bold black; gaps $<0.05$ are bold gray.
Dataset 20 40 80 160 320 640 1280 2560 5120 10240
skin 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.07
pulsar 0.02 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.03
catsvsdogs 0.03 0.02 0.02 0.01 0.01 0.01 0.00 0.00 0.00 0.01
bank 0.05 0.03 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.11
default 0.07 0.07 0.05 0.03 0.02 0.00 0.00 0.00 0.00 0.18
soft 0.09 0.06 0.03 0.01 0.00 0.00 0.00 0.00 0.00 0.10
adult 0.09 0.08 0.05 0.04 0.02 0.01 0.00 0.00 0.00 0.16
run 0.11 0.05 0.05 0.03 0.02 0.00 0.00 0.00 0.00 0.18
smartgrid 0.14 0.07 0.04 0.02 0.01 0.01 0.00 0.00 0.00 0.02
smoking 0.11 0.09 0.07 0.05 0.03 0.02 0.01 0.01 0.00 0.26
yearbook 0.14 0.10 0.07 0.05 0.05 0.04 0.03 0.02 0.01 0.08
secondary 0.18 0.16 0.11 0.07 0.04 0.02 0.01 0.00 0.00 0.22
fashion 0.36 0.26 0.18 0.13 0.08 0.06 0.04 0.03 0.01 0.14
cifar10 0.42 0.33 0.22 0.15 0.11 0.08 0.05 0.03 0.02 0.15
mnist 0.49 0.38 0.27 0.18 0.10 0.06 0.04 0.02 0.01 0.04
letter 0.77 0.77 0.73 0.44 0.33 0.22 0.13 0.06 0.02 0.19

Organization: To provide a comprehensive and systematic evaluation of CRC, the experiments have been divided into subsections covering the following fundamental aspects of federated decentralized learning. We first analyze the convergence of local models by examining how their errors evolve during training under our default experimental conditions. Next, we examine the robustness of CRC to different communication network topologies ($\mathcal{N}_v$) by using different undirected graphs. This is motivated by the critical role that the network plays in information propagation and convergence speed. Then, we address robustness to non-i.i.d. local datasets, reflecting the practical challenge of heterogeneous data distributions across clients. We also explore the effect of the size of the local dataset ($m_v$) and the effect of the size of the communication network ($n$), as both influence efficiency and scalability. In addition, we analyze the effect of data fragmentation with constant global training data ($m = n \cdot m_v$), to better understand how splitting the same amount of data among more clients affects the learning process. The impact of the number of local iterations per round ($iter$) is also investigated, given its importance in the trade-off between local computation and communication frequency. Finally, we evaluate the robustness to dynamic changes in the communication network topology ($\mathcal{N}_v^t$), which simulates real-world scenarios where connections may vary over time.

IV-A Convergence of local models

In federated learning, convergence of decentralized models to the global centralized optimum is a fundamental property. Understanding how quickly local models trained with CRC approach the gold standard RC model is key to validating its efficiency.

The experiments have been carried out with the default parameter values. Table III shows the training and test gaps of ML and of CRC with respect to RC. In addition, we include the standard deviation of the training and test errors of the local NB classifiers across the different users (std.). The standard deviation measures the similarity of the local models trained with CRC, while the gap measures the convergence to RC. The table also includes the training and test errors of RC as a reference.

CRC outperforms ML (trained with 2500 samples) in 15 out of 16 datasets in terms of both training and test errors. RC shows similar training and test errors, and CRC shows the same training and test gaps, which highlights the robustness of CRC to overfitting. CRC attains training and test gaps smaller than 0.01 (0.05) in 10 (13) and 9 (14) out of 16 datasets, respectively. In addition, the standard deviations of the training and test errors of the local models trained with CRC are smaller than 0.01 (0.05) in 13 (16) out of 16 datasets. These results clearly show that the local models obtained by CRC have converged to a similar model that is close to the global model.
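For reference, the convergence measures reported in Table III can be obtained directly from the per-node errors. The following minimal sketch is our own illustrative code, not the accompanying library, and it assumes the gap is taken between the average local error and the RC error (our reading of the table):

```python
# Minimal sketch (illustrative) of the Table III metrics: the gap of the local models
# with respect to RC and their dispersion across nodes.
import numpy as np

def convergence_metrics(local_errors, rc_error):
    """local_errors: test (or training) error of each node's local classifier;
    rc_error: error of the gold-standard RC model on the same split."""
    local_errors = np.asarray(local_errors, dtype=float)
    gap = local_errors.mean() - rc_error   # distance of the average local error to RC
    std = local_errors.std()               # similarity of the local models across nodes
    return gap, std

# Example with 50 hypothetical local errors around 0.16 and an RC error of 0.15
gap, std = convergence_metrics(np.random.normal(0.16, 0.005, size=50), 0.15)
```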

Table III: Convergence of local models: training and test gaps of ML and CRC with respect to RC at round $t=64$. The column std. shows the standard deviation of the errors obtained by the local NB classifiers learned using CRC.
Training Test
ML CRC RC ML CRC RC
Dataset gap gap std. error gap gap std. error
skin 0.01 0.00 0.00 0.07 0.01 0.00 0.00 0.07
pulsar 0.02 0.00 0.00 0.02 0.03 0.00 0.00 0.03
catsvsdogs 0.04 0.01 0.00 0.00 0.02 0.00 0.00 0.01
bank 0.03 0.00 0.00 0.10 0.03 0.00 0.00 0.11
default 0.19 0.01 0.01 0.18 0.19 0.01 0.01 0.18
soft 0.03 0.00 0.00 0.10 0.03 0.00 0.00 0.10
adult 0.06 0.01 0.00 0.15 0.05 0.01 0.00 0.16
run 0.04 0.00 0.01 0.19 0.04 0.00 0.01 0.19
smartgrid 0.01 0.00 0.00 0.02 0.01 0.00 0.00 0.02
smoking 0.07 0.01 0.01 0.24 0.05 0.01 0.01 0.27
yearbook 0.15 0.03 0.01 0.04 0.08 0.01 0.01 0.10
secondary 0.15 0.02 0.01 0.21 0.14 0.02 0.01 0.22
fashion 0.38 0.21 0.02 0.00 0.24 0.12 0.01 0.17
cifar10 0.42 0.15 0.01 0.00 0.26 0.08 0.00 0.19
mnist 0.40 0.34 0.04 0.00 0.35 0.31 0.03 0.07
letter 0.50 0.75 0.00 0.06 0.38 0.58 0.00 0.26

Given the absence of overfitting, from here on we focus the analysis of CRC on the test gap with respect to the gold standard, namely RC trained with the global training data.

IV-B Sparse communication networks

In this section, we evaluate the robustness of CRC to different communication network topologies, which strongly affect learning performance in decentralized federated learning. The communication graph structure determines how efficiently information propagates across the network. Sparse or poorly connected networks may slow convergence or even prevent consensus, while well-connected graphs facilitate faster and more stable convergence. In fact, Theorem 1 proves that, with a full communication network, CRC obtains local models equivalent to the RC global model. Thus, this section focuses on empirically analyzing the behavior of CRC on sparse communication networks.

In this experiment, we vary the communication topology while keeping the rest of the parameters at their default values. Specifically, we start from the sparsest possible connected communication network, a tree containing only 4% of the possible edges, and increase connectivity by adding 10 (5%), 20 (6%), 40 (7%), and 80 (11%) random edges among the 1225 possible edges for $n=50$. These graphs represent randomly generated sparse communication networks with a single connected component. We also include the chain structure, a particular tree with the worst communication properties.
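The sparse topologies used in this section can be generated, for instance, as a random spanning tree plus a given number of extra random edges. The following minimal sketch reflects our own assumptions about the generator (the paper does not specify it), and all helper names are illustrative:

```python
# Minimal sketch (assumed construction) of the communication networks in Table IV:
# a random tree over n nodes, optionally augmented with k extra random edges, and a chain.
import random

def random_tree(n, rng):
    # random recursive tree: each node v > 0 attaches to a uniformly chosen earlier node
    return {(rng.randrange(v), v) for v in range(1, n)}

def tree_plus_edges(n, k, seed=0):
    rng = random.Random(seed)
    edges = random_tree(n, rng)
    candidates = [(u, v) for u in range(n) for v in range(u + 1, n) if (u, v) not in edges]
    edges |= set(rng.sample(candidates, k))   # add k extra random edges
    return edges

def chain(n):
    return {(v, v + 1) for v in range(n - 1)}

# e.g. n = 50: the tree has 49 of the 1225 possible edges (~4%), and "+10" adds ten more
network = tree_plus_edges(50, 10)
```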

Table IV summarizes the test gaps between CRC under different network topologies and RC using the global training set. CRC with the tree structure obtains gaps smaller than 0.01 (0.05) in 9 (15) out of 16 datasets, and with chains it obtains gaps close to those obtained with trees. Moreover, CRC using trees with 10 randomly added edges obtains gaps smaller than 0.01 (0.05) in 13 (15) out of 16 datasets. These results highlight the robustness of CRC on sparse communication networks and the benefit of adding a very small number of edges.

Table IV: Influence of communication network sparsity: test gaps between CRC using different sparse communication network topologies and RC at round $t=64$.
Topology of $\mathcal{N}_l$
Dataset tree chain +10 +20 +40 +80
skin 0.00 0.00 0.00 0.00 0.00 0.00
pulsar 0.00 0.00 0.00 0.00 0.00 0.00
catsvsdogs 0.00 0.00 0.00 0.00 0.00 0.00
bank 0.00 0.00 0.00 0.00 0.00 0.00
default 0.01 0.01 0.01 0.00 0.00 0.00
soft 0.00 0.00 0.00 0.00 0.00 0.00
adult 0.01 0.01 0.01 0.00 0.00 0.00
run 0.01 0.01 0.00 0.00 0.00 0.00
smartgrid 0.00 0.00 0.00 0.00 0.00 0.00
smoking 0.02 0.02 0.01 0.01 0.01 0.00
yearbook 0.01 0.02 0.01 0.00 0.00 0.00
secondary 0.03 0.03 0.02 0.01 0.01 0.01
fashion 0.02 0.03 0.01 0.01 0.00 0.00
cifar10 0.02 0.03 0.01 0.01 0.00 0.01
mnist 0.05 0.03 0.01 0.01 0.00 0.00
letter 0.46 0.37 0.46 0.49 0.51 0.52

IV-C Non-i.i.d. data

Real-world federated data is rarely i.i.d., which can severely impact decentralized learning. Thus, testing CRC under extreme non-i.i.d. scenarios is critical to assess its robustness.

We generate three types of particularly extreme non-i.i.d. local datasets: (i) drift in $p(\bm{x})$, by splitting the data sorted according to the first PCA component; (ii) drift in $p(y)$, by assigning each user samples from a single class; (iii) drift in $p(\bm{x},y)$, combining input and label drifts. The local datasets are generated in such a way that the global training set remains i.i.d. and, thus, RC learns the NB classifier using i.i.d. training data. According to Theorem 1, CRC obtains the same models as RC with the global training set when the communication network is full. Thus, in these experiments we also vary the communication topology, starting from trees and adding a small number of edges, while keeping the rest of the parameters at their default values.
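The following minimal sketch illustrates how such extreme partitions could be produced; the exact splitting scheme (contiguous blocks after sorting by the first PCA component, and a round-robin assignment of classes to users) is our assumption, and the function names are illustrative:

```python
# Minimal sketch (assumed splitting scheme) of the extreme non-i.i.d. partitions:
# (i) drift in p(x): sort by the first PCA component and split into contiguous blocks;
# (ii) drift in p(y): each user receives samples from a single class.
import numpy as np
from sklearn.decomposition import PCA

def split_px_drift(X, y, n_users):
    order = np.argsort(PCA(n_components=1).fit_transform(X).ravel())
    return [(X[idx], y[idx]) for idx in np.array_split(order, n_users)]

def split_py_drift(X, y, n_users):
    labels = np.unique(y)
    chunks_per_class = int(np.ceil(n_users / len(labels)))
    parts = []
    for u in range(n_users):
        cls = labels[u % len(labels)]                      # a single class per user
        idx = np.where(y == cls)[0]
        chunk = np.array_split(idx, chunks_per_class)[u // len(labels)]
        parts.append((X[chunk], y[chunk]))
    return parts
```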

Table V: Influence of non-i.i.d. data with extreme drift in $p(\bm{x})$: test gaps between CRC using different sparse communication network topologies and RC at round $t=64$.
Topology of $\mathcal{N}_l$
Dataset tree +10 +20 +40 +80
skin 0.00 0.00 0.00 0.00 0.00
pulsar 0.03 0.02 0.01 0.01 0.00
catsvsdogs 0.01 0.00 0.00 0.00 0.00
bank 0.01 0.00 0.00 0.00 0.00
default 0.02 0.02 0.01 0.01 0.00
soft 0.00 0.00 0.00 0.00 0.00
adult 0.03 0.02 0.01 0.01 0.00
run 0.03 0.01 0.01 0.00 0.00
smartgrid 0.00 0.00 0.00 0.00 0.00
smoking 0.02 0.02 0.01 0.01 0.01
yearbook 0.02 0.01 0.01 0.01 0.01
secondary 0.07 0.05 0.04 0.03 0.02
fashion 0.03 0.02 0.01 0.01 0.00
cifar10 0.02 0.01 0.01 0.01 0.00
mnist 0.06 0.01 0.01 0.01 0.00
letter 0.47 0.56 0.61 0.63 0.64
Table VI: Influence of non-i.i.d. data with extreme drift in $p(y)$: test gaps between CRC using different sparse communication network topologies and RC at round $t=64$.
Topology of $\mathcal{N}_l$
Dataset tree +10 +20 +40 +80
skin 0.01 0.01 0.00 0.00 0.00
pulsar 0.00 0.00 0.00 0.00 0.00
catsvsdogs 0.00 0.00 0.00 0.00 0.00
bank 0.01 0.01 0.01 0.01 0.01
default 0.04 0.04 0.04 0.04 0.04
soft 0.00 0.00 0.00 0.00 0.00
adult 0.08 0.08 0.08 0.08 0.08
run 0.03 0.03 0.02 0.01 0.00
smartgrid 0.01 0.01 0.00 0.00 0.00
smoking 0.10 0.10 0.10 0.10 0.10
yearbook 0.02 0.01 0.01 0.01 0.00
secondary 0.33 0.33 0.33 0.33 0.33
fashion 0.42 0.34 0.30 0.24 0.05
cifar10 0.42 0.35 0.31 0.18 0.01
mnist 0.48 0.39 0.33 0.11 0.01
letter 0.70 0.70 0.70 0.70 0.70
Table VII: Influence of non-i.i.d. data with extreme drift in $p(\bm{x},y)$: test gaps between CRC using different sparse communication network topologies and RC at round $t=64$.
Topology of $\mathcal{N}_l$
Dataset tree +10 +20 +40 +80
skin 0.03 0.02 0.01 0.01 0.00
pulsar 0.03 0.02 0.01 0.00 0.00
catsvsdogs 0.01 0.00 0.00 0.00 0.00
bank 0.01 0.01 0.01 0.01 0.01
default 0.04 0.04 0.04 0.04 0.04
soft 0.00 0.00 0.00 0.00 0.00
adult 0.08 0.08 0.08 0.08 0.08
run 0.06 0.05 0.03 0.02 0.01
smartgrid 0.02 0.01 0.01 0.00 0.00
smoking 0.10 0.10 0.10 0.10 0.10
yearbook 0.04 0.02 0.02 0.01 0.01
secondary 0.33 0.33 0.33 0.33 0.33
fashion 0.44 0.36 0.30 0.25 0.02
cifar10 0.42 0.36 0.27 0.20 0.04
mnist 0.53 0.45 0.38 0.14 0.01
letter 0.70 0.70 0.70 0.70 0.70

Despite the extreme drifts, CRC is resilient: using sparse communication networks with just 11% (+80) of the possible edges, it obtains gaps smaller than 0.01 (0.05) in 14 (15), 9 (11), and 8 (12) out of 16 datasets with drift in $p(\bm{x})$, $p(y)$, and $p(\bm{x},y)$, respectively. Notably, the effect of the extremely non-i.i.d. local training sets can be drastically mitigated by adding a small fraction of the total edges.

IV-D Size of local data, $m_v$

The availability of local data at each node directly impacts model quality. Larger local datasets should intuitively lead to better estimates and faster convergence of CRC. Table II shows the effect of the training set size on the test error of RC.

In these experiments, we vary the size of the local datasets $m_v \in \{20, 40, 80, 160, 320\}$, while keeping the rest of the parameters at their default values. Table VIII shows the test gap between CRC for the different values of $m_v$ and the associated gold standard given by RC with $m = n \cdot m_v$ and $n=50$. The gold standard improves with $m_v$, since the amount of global training data $m$ increases linearly with the size of the local training data $m_v$; thus, the test gaps between CRC and RC could increase with $m_v$ (see Table II).

As the size of the local training data $m_v$ increases, the gap between CRC and RC with $m = n \cdot m_v$ decreases. Even with only $m_v=20$, CRC obtains gaps smaller than 0.01 (0.05) in 8 (12) out of 16 datasets with respect to RC using $m=1000$ samples, while for $m_v=320$, CRC obtains gaps smaller than 0.01 (0.05) in 12 (15) datasets with respect to RC using $m=16000$ samples.

Table VIII: Influence of local data size: test error gaps at round $t=64$ between CRC with different local training sizes $m_v$ and RC trained with $m = n \cdot m_v$ samples.
$m_v$=20 $m_v$=40 $m_v$=80 $m_v$=160 $m_v$=320
Dataset $m$=1000 $m$=2000 $m$=4000 $m$=8000 $m$=16000
skin 0.00 0.00 0.00 0.00 0.00
pulsar 0.00 0.00 0.00 0.00 0.00
catsvsdogs 0.00 0.00 0.00 0.00 0.00
bank 0.00 0.00 0.00 0.00 0.00
default 0.01 0.01 0.01 0.00 0.00
soft 0.01 0.00 0.00 0.00 0.00
adult 0.02 0.01 0.01 0.00 0.00
run 0.01 0.01 0.00 0.00 0.00
smartgrid 0.01 0.00 0.00 0.00 0.00
smoking 0.02 0.02 0.01 0.01 0.01
yearbook 0.01 0.01 0.01 0.01 0.01
secondary 0.05 0.02 0.02 0.01 0.01
fashion 0.16 0.09 0.02 0.02 0.01
cifar10 0.28 0.01 0.02 0.02 0.02
mnist 0.66 0.03 0.10 0.02 0.10
letter 0.61 0.68 0.09 0.07 0.04

IV-E Size of communication network, $n$

Scalability is a key requirement for federated learning. It is therefore important to verify that CRC maintains performance as the number of users increases.

In these experiments, we vary the size of the communication network $n \in \{20, 40, 80, 160, 320\}$ nodes, while keeping the rest of the parameters at their default values. The learning scenarios become drastically more difficult, in terms of the gap between CRC and RC, as $n$ increases, for the following reasons. First, the sparsity of a tree increases with $n$: for $n=20$ a tree has $n-1=19$ edges, which represent 10% of the total possible edges, while for $n=320$ they represent less than 1%. Second, as $n$ increases, the average fraction of nodes within a given distance of any node decreases, making convergence more difficult (see Figure 2). Third, the global training set used by RC increases linearly with $n$ while the size of the local training sets remains constant at $m_v=50$; for $n=20$ the global set has $m=1000$ samples and for $n=320$ it has $m=16000$.

Table IX shows the test gaps between CRC and RC for different tree sizes $n$. The test gap slightly increases as the size of the communication network $n$ increases: with the smallest communication network, $n=20$, the test gaps are smaller than 0.01 (0.05) in 13 (15) out of 16 datasets, and with the largest, $n=320$, in 7 (12) out of 16 datasets. These results highlight the robustness of CRC on large communication networks with hundreds of nodes.

Table IX: Influence of communication network size: test error gaps between CRC and RC (best round per seed) for different network sizes $n$.
$n$=20 $n$=40 $n$=80 $n$=160 $n$=320
Dataset $m$=1000 $m$=2000 $m$=4000 $m$=8000 $m$=16000
skin 0.00 0.00 0.00 0.00 0.00
pulsar 0.00 0.00 0.00 0.00 0.00
catsvsdogs 0.00 0.00 0.00 0.00 0.00
bank 0.00 0.00 0.00 0.00 0.00
default 0.00 0.01 0.01 0.01 0.01
soft 0.00 0.00 0.00 0.00 0.00
adult 0.01 0.01 0.01 0.01 0.01
run 0.00 0.01 0.00 0.00 0.00
smartgrid 0.00 0.00 0.00 0.00 0.00
smoking 0.01 0.02 0.02 0.02 0.02
yearbook 0.00 0.01 0.02 0.02 0.03
secondary 0.02 0.02 0.03 0.03 0.03
fashion 0.01 0.01 0.03 0.07 0.06
cifar10 0.01 0.01 0.02 0.04 0.10
mnist 0.00 0.02 0.06 0.08 0.14
letter 0.39 0.50 0.28 0.28 0.24

IV-F Fragmentation of global training data, $n \cdot m_v$ constant

In this section, we analyze the impact of splitting a fixed global training set across communication networks of varying sizes. This experiment is particularly valuable for understanding how data fragmentation influences learning in decentralized networks.

Given a fixed training set of size $m$ and a federated decentralized scenario with $n$ users, data fragmentation increases with $n$, since the corresponding local size is $m_v = m/n$. Note that fragmentation does not affect RC because the global training data remains the same independently of the value of $n$, and thus the gold standard is the same. In these experiments, we vary the number of users $n \in \{20, 40, 80, 160, 320\}$ while the local training set size decreases linearly, so that the global training set remains constant. All other parameters are set to their default values. As $n$ increases, the experiments become more challenging due to greater tree sparsity and a reduced average fraction of nodes within a given distance of any node (see Section IV-E). The difficulty is further amplified by the linear decrease in the size of the local training sets, while the global training set remains fixed at $m=12800$.

Table X shows the test gaps between CRC, for different values of $n$ using local training sets of size $m/n$, and RC with a global training set of $m=12800$ samples. In the most favorable conditions ($n=20$, $m_v=640$), CRC achieves test gaps smaller than 0.01 (0.05) in 15 (16) out of 16 datasets, while in the most extreme case of data fragmentation ($n=320$, $m_v=40$), the algorithm is still able to achieve test gaps below 0.01 (0.05) in 7 (12) out of 16 datasets.

Table X: Influence of global data fragmentation: test error gaps of CRC with respect to RC, trained with $m$=12800, for different levels of data fragmentation.
$n$=20 $n$=40 $n$=80 $n$=160 $n$=320
Dataset $m_v$=640 $m_v$=320 $m_v$=160 $m_v$=80 $m_v$=40
skin 0.00 0.00 0.00 0.00 0.00
pulsar 0.00 0.00 0.00 0.00 0.00
catsvsdogs 0.00 0.00 0.00 0.00 0.00
bank 0.00 0.00 0.00 0.00 0.00
default 0.00 0.00 0.00 0.01 0.01
soft 0.00 0.00 0.00 0.00 0.00
adult 0.00 0.00 0.00 0.01 0.01
run 0.00 0.00 0.00 0.00 0.00
smartgrid 0.00 0.00 0.00 0.00 0.00
smoking 0.00 0.01 0.01 0.02 0.03
yearbook 0.00 0.01 0.01 0.02 0.03
secondary 0.00 0.01 0.01 0.02 0.04
fashion 0.00 0.01 0.02 0.03 0.11
cifar10 0.01 0.02 0.03 0.04 0.09
mnist 0.01 0.01 0.02 0.07 0.17
letter 0.02 0.04 0.07 0.11 0.74

IV-G Number of local iterations in each round, $iter$

Performing multiple local updates before communicating could accelerate convergence by reducing the frequency of costly communication rounds. However, this strategy may also introduce instability in the learning process, as local models can diverge from each other during independent updates. Thus, in this section we analyze the impact of varying the number of local iterations of CRC, $iter \in \{1, 2, 3\}$, while the rest of the parameters take their default values.

Table XI shows the test gaps between CRC, for different numbers of local iterations $iter$, and RC. In particular, for each value of $iter$ it shows the round $t$ at which the gap drops below 0.01 and, when CRC does not reach a gap below 0.01, the gap at round 64. The results show that increasing the number of local iterations ($iter$) can speed up convergence, reducing the number of communication rounds needed to achieve a gap $<0.01$.

Table XI: Influence of the number of local iterations: test error gaps, and the round $t$ where the gap drops below 0.01 (if any), for different numbers of local iterations.
$iter$=1 $iter$=2 $iter$=3
Dataset gap $t$ gap $t$ gap $t$
skin 0.01 32 0.01 16 0.01 11
pulsar 0.01 13 0.01 8 0.01 7
catsvsdogs 0.01 5 0.01 4 0.01 3
bank 0.01 11 0.01 6 0.01 5
default 0.01 29 0.01 64 0.02 64
soft 0.00 2 0.00 1 0.00 1
adult 0.01 56 0.01 43 0.01 45
run 0.01 48 0.01 25 0.01 18
smartgrid 0.01 42 0.01 24 0.01 33
smoking 0.02 64 0.01 64 0.02 64
yearbook 0.01 64 0.01 64 0.01 64
secondary 0.03 64 0.02 64 0.01 64
fashion 0.02 64 0.09 64 0.02 64
cifar10 0.02 64 0.02 64 0.02 64
mnist 0.10 64 0.26 64 0.34 64
letter 0.59 64 0.59 64 0.58 64

IV-H Dynamic changes in the communication network

Finally, we study the impact of dynamic changes in the communication network, which are common in many real-world scenarios due to user mobility or network variability. In the experiments, we analyze the effect of a communication network that changes with period $\delta$, which determines how frequently the network edges are changed, while the rest of the parameters take their default values.
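A minimal sketch of the dynamic topology schedule is given below; it reuses the hypothetical tree_plus_edges helper from the sketch in Section IV-B and assumes that the extra random edges are simply redrawn every $\delta$ rounds:

```python
# Minimal sketch (assumed schedule) of a dynamic communication network: the extra edges
# are redrawn every delta rounds, while delta = infinity keeps the network static.
def dynamic_network(round_t, delta, n, k, base_seed=0):
    if delta == float("inf"):
        return tree_plus_edges(n, k, seed=base_seed)   # static network
    epoch = round_t // delta                            # increases every delta rounds
    return tree_plus_edges(n, k, seed=base_seed + epoch)
```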

Table XII shows the test gaps between CRC and RC. In these results, for $\delta \in \{1, 2, 4, 8\}$, the set of connections between nodes is updated every $\delta$ communication rounds, while for $\delta=\infty$ the network remains static. In 9 out of 16 datasets, CRC achieves gaps smaller than 0.01 with both dynamic and static communication networks. In 6 out of 16 datasets, dynamic communication networks reduce the gap compared to the static case, while in only 1 dataset does the gap increase under dynamic conditions. These results suggest that CRC can benefit from dynamic communication networks, because they can reduce the effective communication distances among nodes, reinforcing CRC's robustness.

Table XII: Influence of dynamic changes in the communication network: test error gaps between CRC and RC at round $t=64$ as a function of the network update period $\delta$.
Dataset $\delta$=$\infty$ $\delta$=8 $\delta$=4 $\delta$=2 $\delta$=1
skin 0.00 0.00 0.00 0.00 0.00
pulsar 0.00 0.00 0.00 0.00 0.00
catsvsdogs 0.00 0.00 0.00 0.00 0.00
bank 0.00 0.00 0.00 0.00 0.00
default 0.01 0.01 0.00 0.01 0.01
soft 0.00 0.00 0.00 0.00 0.00
adult 0.01 0.01 0.00 0.00 0.00
run 0.00 0.00 0.00 0.00 0.00
smartgrid 0.00 0.00 0.00 0.00 0.00
smoking 0.02 0.01 0.01 0.01 0.01
yearbook 0.01 0.00 0.00 0.01 0.01
secondary 0.03 0.02 0.02 0.01 0.01
fashion 0.02 0.04 0.01 0.44 0.45
cifar10 0.02 0.00 0.00 0.14 0.43
mnist 0.10 0.57 0.34 0.34 0.67
letter 0.59 0.57 0.57 0.57 0.57

V Conclusions

The current research proposes collaborative risk-based calibration (CRC), a decentralized federated learning implementation of the risk-based calibration (RC) algorithm. The goal of CRC is to minimize the empirical error of probabilistic generative classifiers over an arbitrary communication network of data owners. CRC learns local models by sharing statistics between neighbors, which are iteratively updated using the local training datasets in a decentralized way. CRC can be applied in any network setting defined over an undirected graph that can change over time, and to any probabilistic generative classifier with a closed-form algorithm for learning the model parameters from statistics.

Despite the simplicity of CRC, we prove that it is able to replicate the RC algorithm applied to all the training data in the network when the communication network is fully connected. Through extensive experimentation on 16 datasets, we analyze the behavior of CRC on trees, the sparsest communication networks with a single connected component, and measure the error difference (gap) with respect to the gold standard represented by RC using all the training data available in the network. The experiments are designed to impose challenging conditions as a baseline, including limited data per node and sparsely connected networks; when evaluating non-i.i.d. scenarios, we set extreme conditions to clearly assess the robustness of CRC. The main findings can be summarized as follows:

  • Convergence of local models: CRC achieves convergence to the global RC model with remarkable accuracy and consistency, outperforming ML in nearly all datasets while maintaining extremely low variance across local models.

  • Sparse communication networks: CRC demonstrates strong resilience to sparse topologies, achieving high accuracy even with minimal connectivity, and further improving with only a small increase in edges.

  • Non-i.i.d. data: CRC remains robust under severe non-i.i.d. conditions, recovering performance close to centralized RC in most datasets when just a small fraction of additional edges is added to the network.

  • Size of local data: the gap between CRC and RC decreases as the local training data increases.

  • Size of communication network: CRC maintains competitive performance across increasing network sizes, showcasing its scalability and robustness even under highly sparse and challenging network conditions.

  • Fragmentation of global training data: CRC effectively handles extreme data fragmentation, preserving low error gaps with RC even when the global dataset is split across hundreds of nodes.

  • Number of local iterations per round: Increasing local iterations can accelerate CRC’s convergence, enabling faster approach to the RC reference with fewer communication rounds.

  • Dynamic changes in the communication network: CRC benefits from dynamic communication networks, often improving accuracy by reducing communication distances and enhancing resilience to topology variability.

All the classifiers, learning algorithms, datasets, and experiments included in this paper can be found in the open-source Python library available at https://gitlab.bcamath.org/aperez/decentralized_risk-based_calibration.

Appendix A The naive Bayes classifier

A popular example of a probabilistic generative classifier based on Bayesian networks is naive Bayes (NB). In this work, we consider NB based on conditional Gaussian Bayesian networks, which can deal with both continuous and discrete features. In particular, NB is based on the following factorization of the joint probability distribution:

$p_{\bm{\theta}}(\bm{x},y) = p_{\bm{\theta}}(y)\prod_{i=1}^{d} p_{\bm{\theta}}(x_i \mid y)$,   (6)

where $p_{\bm{\theta}}(x_i \mid y)$ follows a Gaussian distribution or a categorical distribution for each $y$, when $x_i$ is a continuous or a discrete feature, respectively. NB assumes that the features are conditionally independent given the class, which effectively controls the number of parameters of the probabilistic model and thus the overfitting. Despite its strong independence assumption, NB has shown good performance in many practical scenarios [35, 36, 37]. Moreover, in [38] the authors show that, for continuous variables, NB and logistic regression represent the same class of models when additional constraints are imposed on the variances of the Gaussian distributions $p_{\bm{\theta}}(x_i \mid y)$ (homoscedasticity assumption).
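To make the factorization concrete, the following minimal sketch (our own illustrative code, not the accompanying library) evaluates the logarithm of (6) and the resulting class prediction, with a Gaussian term per continuous feature and a categorical term per discrete feature; the parameter containers class_prior, cond_params and is_discrete are assumptions about how the parameters could be stored:

```python
# Minimal sketch of NB prediction under the factorization (6): the log joint adds the
# class log-prior and one class-conditional log-density per feature.
import numpy as np

def log_joint(x, y, class_prior, cond_params, is_discrete):
    """class_prior[y] is p(y); cond_params[i][y] is a probability table for a discrete
    feature or a (mean, variance) pair for a continuous one."""
    lp = np.log(class_prior[y])
    for i, xi in enumerate(x):
        if is_discrete[i]:
            lp += np.log(cond_params[i][y][int(xi)])
        else:
            mu, var = cond_params[i][y]
            lp += -0.5 * (np.log(2 * np.pi * var) + (xi - mu) ** 2 / var)
    return lp

def predict(x, class_prior, cond_params, is_discrete):
    scores = [log_joint(x, y, class_prior, cond_params, is_discrete)
              for y in range(len(class_prior))]
    return int(np.argmax(scores))   # class maximizing p(x, y), equivalently p(y | x)
```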

The maximum likelihood learning algorithm of NB is closed form. The statistics mapping can be decomposed into components corresponding to the factors in (6), namely $p_{\bm{\theta}}(y)$ and the class-conditional distribution of each feature $p_{\bm{\theta}}(x_i \mid y)$ for $i=1,\dots,d$: $s(\bm{x},y) = (s_0(y), s_1(x_1,y), \dots, s_d(x_d,y))$. The statistic for the class, $s_0(y) = \bm{e}_y$, is the $y$-th vector of the standard basis of $\mathbb{R}^r$, and $s_i$ depends on the continuous or discrete nature of the feature $x_i$.

When the $i$-th input feature $x_i$ is discrete, we denote its support by $\mathcal{X}_i = \{1,\dots,r_i\}$ and $p_{\bm{\theta}}(x_i \mid y)$ is a categorical distribution for each $y \in \mathcal{Y}$. The component $s_i(x_i,y) = \bm{e}_y \otimes \bm{e}_{x_i}$, where $\otimes$ is the Kronecker product, represents the counting statistics associated with each value of $(x_i,y) \in (\mathcal{X}_i,\mathcal{Y})$. When the $i$-th feature is continuous, its support is $\mathbb{R}$ and $p_{\bm{\theta}}(x_i \mid y)$ is a univariate Gaussian distribution with mean $\mu_y$ and variance $\sigma_y^2$ for each $y \in \mathcal{Y}$. The component $s_i(x_i,y) = \bm{e}_y \otimes (1, x_i, x_i^2)$ allows the computation of the zeroth-, first-, and second-order moments required to obtain the maximum likelihood parameters of the Gaussian distribution for each value of $y$.

The mapping $\theta(\bm{s})$ is then given by: $p_{\bm{\theta}}(y) = s_{0,y} / \sum_{y' \in \mathcal{Y}} s_{0,y'}$ for the probabilities of the categorical distribution of the class; $p_{\bm{\theta}}(x_i \mid y) = s_{i,y,x_i} / \sum_{x_i' \in \mathcal{X}_i} s_{i,y,x_i'}$ for the probabilities of the categorical distribution of $x_i$ given the class $y \in \mathcal{Y}$; and $\mu_{i,y} = s_{i,y,2}/s_{i,y,1}$ and $\sigma_{i,y}^2 = s_{i,y,3}/s_{i,y,1} - \mu_{i,y}^2$ for the mean and variance of the Gaussian density of $x_i$ given the class label $y \in \mathcal{Y}$.
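A minimal sketch of these statistics and of the mapping $\theta(\bm{s})$ for the class prior and one continuous feature is shown below; the per-class storage of the (zeroth, first, second) moments and all function names are our own illustrative choices:

```python
# Minimal sketch of the statistics mapping theta(s) for a continuous feature: the
# per-class moments give the ML mean and variance, and the class counts give the prior.
import numpy as np

def continuous_statistics(xi, y, n_classes):
    s = np.zeros((n_classes, 3))            # per class: (count, sum, sum of squares)
    for value, c in zip(xi, y):
        s[c] += (1.0, value, value * value)
    return s

def continuous_parameters(s):
    counts, sums, squares = s[:, 0], s[:, 1], s[:, 2]
    mu = sums / counts
    var = squares / counts - mu ** 2        # sigma^2 = E[x^2] - E[x]^2
    return mu, var

def class_prior(class_counts):
    return class_counts / class_counts.sum()
```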

A-A Uniform initialization

In the experiments, we use the same initialization of the local statistics in all nodes: the counting statistics for the class are $m_y = m^0/r$ for $y=1,\dots,r$; the counting statistics for discrete variables are $m^0/(r \cdot r_i)$ for $x_i=1,\dots,r_i$ and $y=1,\dots,r$; and the statistics for the zeroth, first, and second moments of continuous variables are $m^0/r$, $m^0/r$, and $0$, respectively. This leads to a classifier with uniform probabilities $p_{\bm{\theta}}(y \mid \bm{x})$ for all $\bm{x} \in \mathcal{X}$, with $\bm{\theta} = \theta(\bm{s})$.

Appendix B CRC with full communication network

This section proves the equivalence between CRC with a full communication network and RC, without requiring further assumptions on the local data generation process.

Theorem 1. Let $(\mathcal{V},\mathcal{E})$ be a communication network with $\mathcal{N}_v = \mathcal{V}$ for all $v \in \mathcal{V}$. Let $\{(X_v,Y_v) : v \in \mathcal{V}\}$ be a partition of the global training data $(X,Y)$, where $m = \sum_{v \in \mathcal{V}} m_v$. Consider RC with learning rate $lr > 0$, using the global training data $(X,Y)$ with uniform initialization and a single iteration, i.e., $iter=1$. Now consider CRC with equivalent sample size $m^0 = \frac{m}{lr \cdot n}$, using the local data $(X_v,Y_v)$ for each $v \in \mathcal{V}$ with uniform initialization and one iteration ($iter=1$). Then, all local parameters in the aggregation step of CRC coincide with the parameters of RC at the previous iteration:

$\theta(\bar{\bm{s}}^t_v) = \theta(\bm{s}^{t-1}_g)$, for $t = 1, \dots, t^{\max}$, and $v \in \mathcal{V}$.
Proof.

We proceed by induction on the local aggregated statistic $\bar{\bm{s}}_v^t$ for $v \in \mathcal{V}$.

We first observe the following:

  1. All aggregated statistics are identical across nodes: $\bar{\bm{s}}^t_v = \bar{\bm{s}}^t$ for all $v \in \mathcal{V}$.

  2. The aggregated statistics $\bar{\bm{s}}^t$ are updated locally using $(X_v,Y_v)$ and a classifier with parameters $\theta(\bar{\bm{s}}^t)$.

  3. The parameters of a classifier obtained from a statistic $\bm{s}$ are invariant under positive rescaling of $\bm{s}$.

Base case ($t=1$): At initialization, all local statistics are equal to $\bm{s}^0$, and the global statistic $\bm{s}_g^0$ is a rescaled version of the local one:

$\bm{s}^0 = \text{initialize}(m^0)$, $\quad \bm{s}^0_g = \text{initialize}(m) = \frac{m}{m^0} \cdot \bm{s}^0$.

Then, at round $t=1$, the aggregated local statistics are $\bar{\bm{s}}^1 = \bm{s}^0$ for all $v \in \mathcal{V}$, and thus $\theta(\bm{s}_g^0) = \theta(\bar{\bm{s}}^1)$.

Inductive step: Assume that at round $t$ the aggregated local statistics are $\bar{\bm{s}}^t$ for all $v \in \mathcal{V}$, and that the global statistic at $t-1$ satisfies $\bm{s}_g^{t-1} = \frac{m}{m^0} \cdot \bar{\bm{s}}^t$. Then both RC and CRC compute the same classifier parameters, $\theta(\bar{\bm{s}}^t) = \theta(\bm{s}_g^{t-1}) = \bm{\theta}$.

Next, compute the RC and CRC updates with parameters $\bm{\theta}$ for each node $v \in \mathcal{V}$:

$\bm{s}_v^t = \bar{\bm{s}}^t + s(X_v,Y_v) - s(X_v,\bm{\theta})$,   (7)
$\bm{s}_g^t = \bm{s}_g^{t-1} + lr \cdot \left( s(X,Y) - s(X,\bm{\theta}) \right)$.   (8)

At the beginning of round $t+1$, compute the CRC aggregation step for each node $v$:

$\bar{\bm{s}}_v^{t+1} = \frac{1}{n} \sum_{v \in \mathcal{V}} \bm{s}_v^t$   (9)
$\quad = \bar{\bm{s}}^t + \frac{1}{n} \sum_{v \in \mathcal{V}} \left( s(X_v,Y_v) - s(X_v,\bm{\theta}) \right)$   (10)
$\quad = \bar{\bm{s}}^t + \frac{1}{n} \left( s(X,Y) - s(X,\bm{\theta}) \right)$,   (11)

where the last equality holds because the statistics mapping is additive. This final expression does not depend on node $v$, so the aggregated statistics at round $t+1$ are the same for all $v \in \mathcal{V}$, i.e., $\bar{\bm{s}}^{t+1}$.

Multiplying both sides by $\frac{m}{m^0}$, we obtain:

$\frac{m}{m^0} \cdot \bar{\bm{s}}^{t+1} = \bm{s}_g^{t-1} + \frac{m}{m^0 \cdot n} \left( s(X,Y) - s(X,\bm{\theta}) \right) = \bm{s}_g^t$,

for $lr = \frac{m}{m^0 \cdot n}$. ∎

This result proves that, by selecting the equivalent sample size $m^0 = \frac{m}{lr \cdot n}$, CRC with a full communication network is equivalent to RC with learning rate $lr$, assuming only that RC uses the global training data available in the network of users.
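Following the updates in (7)-(11), one CRC round at a node can be sketched as below. This is our own illustrative reading, not the authors' implementation: we assume the aggregation averages the statistics over the neighborhood including the node itself, and that $s(X_v,\bm{\theta})$ denotes the statistics of $X_v$ computed with the class posteriors of the current classifier in place of the observed labels, as in RC; all function and argument names are hypothetical:

```python
# Minimal sketch of one CRC round at node v: average the neighbors' statistics, then
# correct them locally by adding observed statistics and subtracting the statistics
# expected under the current classifier.
import numpy as np

def crc_round(node_stats, neighbors, v, Xv, Yv, stats_fn, expected_stats_fn, theta_fn):
    # aggregation step over the neighborhood of v (assumed to include v itself)
    s_bar = np.mean([node_stats[u] for u in neighbors[v] | {v}], axis=0)
    theta = theta_fn(s_bar)                     # local classifier parameters theta(s_bar)
    # local update: s_v = s_bar + s(X_v, Y_v) - s(X_v, theta)
    return s_bar + stats_fn(Xv, Yv) - expected_stats_fn(Xv, theta)
```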

References

  • [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54, 2017.
  • [2] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM Transactions on Intelligent Systems and Technology, vol. 10, no. 2, pp. 1–19, 2019.
  • [3] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” in NIPS Workshop on Private Multi-Party Machine Learning, 2016.
  • [4] T. Sun, D. Li, and B. Wang, “Decentralized federated averaging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 4, no. 5, pp. 4289 – 4301, 2023.
  • [5] H. B. McMahan, D. M. Bacon, J. Konecny, and X. Yu, “Communication efficient federated learning,” May 2020.
  • [6] F. Granqvist, C. Song, A. Cahill, R. van Dalen, M. Pelikan, Y. S. Chan, X. Feng, N. Krishnaswami, V. Jina, and M. Chitnis, “pfl-research: simulation framework for accelerating research in private federated learning,” in Advances in Neural Information Processing Systems, vol. 37, 2024.
  • [7] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
  • [8] P. Kairouz et al., Advances and Open Problems in Federated Learning.   Now Foundations and Trends, 2021, vol. 14, no. 1–2.
  • [9] A. Lalitha, X. Wang, O. Kilinc, Y. Lu, T. Javidi, and F. Koushanfar, “Decentralized Bayesian learning over graphs,” arXiv preprint, 2019.
  • [10] S. Kalra, J. Wen, J. C. Cresswell, M. Volkovs, and H. R. Tizhoosh, “Decentralized federated learning through proxy model sharing,” Nature Communications, vol. 14, no. 2899, 2023.
  • [11] S. Banabilah, M. Aloqaily, E. Alsayed, N. Malik, and Y. Jararweh, “Federated learning review: Fundamentals, enabling technologies, and future applications,” Information Processing & Management, 2022.
  • [12] J.-C. Gamazo-Real, R. Torres Fernández, and A. Murillo Armas, “Comparison of edge computing methods in internet of things architectures for efficient estimation of indoor environmental parameters with machine learning,” Engineering Applications of Artificial Intelligence, vol. 126, no. 107149, 2023.
  • [13] R. S. Antunes, C. André da Costa, A. Küderle, I. A. Yari, and B. Eskofier, “Federated learning for healthcare: Systematic review and architecture proposal,” ACM Transactions on Intelligent Systems and Technology, vol. 13, no. 4, 2022.
  • [14] A. Jochems, T. M. Deist, I. El Naqa, M. Kessler, C. Mayo, J. Reeves, S. Jolly, M. Matuszak, R. Ten Haken, J. van Soest, C. Oberije, C. Faivre-Finn, G. Price, D. de Ruysscher, P. Lambin, and A. Dekker, “Developing and validating a survival prediction model for NSCLC patients through distributed learning across 3 countries,” International Journal of Radiation Oncology*Biology*Physics, vol. 99, no. 2, pp. 344–352, 2017.
  • [15] V. C. Pezoulas, K. D. Kourou, F. Kalatzis, T. P. Exarchos, E. Zampeli, S. Gandolfo, A. Goules, C. Baldini, F. Skopouli, S. De Vita, A. G. Tzioufas, and D. I. Fotiadis, “Overcoming the barriers that obscure the interlinking and analysis of clinical data through harmonization and incremental learning,” IEEE Open Journal of Engineering in Medicine and Biology, vol. 1, pp. 83–90, 2020.
  • [16] V. Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar, “Federated multi-task learning,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, vol. 30, 2017.
  • [17] A. Navia-Vázquez, R. Díaz-Morales, and M. Fernández-Díaz, “Budget distributed support vector machine for non-id federated learning scenarios,” ACM Trans. Intell. Syst. Technol., vol. 13, no. 6, 2022.
  • [18] M. G. Boroujeni, A. Krause, and G. F. Trecate, “Personalized federated learning of probabilistic models: A PAC-Bayesian approach,” arXiv preprint, 2025.
  • [19] M. Ashman, T. D. Bui, C. V. Nguyen, S. Markou, A. Weller, S. Swaroop, and R. E. Turner, “Partitioned variational inference: A framework for probabilistic federated learning,” arXiv preprint, 2022.
  • [20] A. Pérez, C. Echegoyen, and G. Santafé, “Risk-based calibration for generative classifiers,” arXiv preprint, 2025.
  • [21] C. Bielza and P. Larrañaga, “Discrete Bayesian network classifiers: A survey,” ACM Computing Surveys, vol. 47, no. 1, 2014.
  • [22] A. Pérez, P. Larrañaga, and I. Inza, “Supervised classification with conditional Gaussian networks: Increasing the structure complexity from naive Bayes,” International Journal of Approximate Reasoning, vol. 43, no. 1, pp. 1–25, 2006.
  • [23] S. Augenstein, H. B. McMahan, D. Ramage, S. Ramaswamy, P. Kairouz, M. Chen, R. Mathews, and B. A. y Arcas, “Generative models for effective ML on private, decentralized datasets,” in International Conference on Learning Representations, 2020.
  • [24] H. GM, M. K. Gourisaria, M. Pandey, and S. S. Rautaray, “A comprehensive survey and analysis of generative models in machine learning,” Computer Science Review, vol. 38, p. 100285, 2020.
  • [25] A. Ng and M. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes,” in Advances in Neural Information Processing Systems, vol. 14, 2001.
  • [26] Y. Sun, A. Cuesta-Infante, and K. Veeramachaneni, “Learning vine copula models for synthetic data generation,” AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019.
  • [27] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B, vol. 39, no. 1, pp. 1–22, 1977.
  • [28] T. Shadbahr, M. Roberts, J. Stanczuk, J. Gilbey, P. Teare, S. Dittmer, M. Thorpe, R. Torné, E. Sala, P. Lió, M. Patel, J. Preller, I. Selby, A. Breger, J. Weir-McCall, E. Gkrania-Klotsas, A. Korhonen, E. Jefferson, G. Langs, G. Yang, H. Prosch, J. Babar, L. Escudero Sánchez, M. Wassin, M. Holzer, N. Walton, J. Rudd, T. Mirtti, A. Rannikko, J. Aston, J. Tang, and C.-B. Schönlieb, “The impact of imputation quality on machine learning classifiers for datasets with missing values,” Communications Medicine, vol. 3, no. 1, 2023.
  • [29] S. Kim, H. Kim, E. Yun, H. Lee, J. Lee, and J. Lee, “Probabilistic imputation for time-series classification with missing data,” in Proceedings of the 40th International Conference on Machine Learning, 2023.
  • [30] K. P. Murphy, Machine Learning: A Probabilistic Perspective.   The MIT Press, 2012.
  • [31] C. Elkan, “The foundations of cost-sensitive learning,” in International Joint Conference on Artificial Intelligence, vol. 17, 2001.
  • [32] T. Jebara, Machine Learning: Discriminative and Generative.   Springer New York, NY, 2012.
  • [33] N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Machine Learning, vol. 29, no. 2–3, pp. 131–163, 1997.
  • [34] O. Marfoq, G. Neglia, L. Kameni, and R. Vidal, “Federated learning for data streams,” in Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, 2023.
  • [35] P. Domingos and M. Pazzani, “On the optimality of the simple Bayesian classifier under zero-one loss,” Machine learning, vol. 29, pp. 103–130, 1997.
  • [36] I. Rish et al., “An empirical study of the naive Bayes classifier,” in IJCAI workshop on empirical methods in artificial intelligence, vol. 3, 2001.
  • [37] H. Zhang, “The optimality of naive Bayes,” in Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference, 2004.
  • [38] R. Greiner and W. Zhou, “Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers.” 2002.