A calibration test for evaluating set-based epistemic uncertainty representations

Jürgens, Mira; Mortier, Thomas; Hüllermeier, Eyke; Bengs, Viktor; Waegeman, Willem

doi:10.1007/s10994-025-06844-8

A calibration test for evaluating set-based epistemic uncertainty representations

Open access
Published: 05 August 2025

Volume 114, article number 202, (2025)
Cite this article

You have full access to this open access article

Download PDF

Machine Learning Aims and scope Submit manuscript

A calibration test for evaluating set-based epistemic uncertainty representations

Download PDF

1336 Accesses
1 Citation
1 Altmetric
Explore all metrics

Abstract

The accurate representation of epistemic uncertainty is a challenging yet essential task in machine learning. A widely used representation corresponds to convex sets of probabilistic predictors, also known as credal sets. One popular way of constructing these credal sets is via ensembling or specialized supervised learning methods, where the epistemic uncertainty can be quantified through measures such as the set size or the disagreement among members. In principle, these sets should contain the true data-generating distribution. As a necessary condition for this validity, we adopt the strongest notion of calibration as a proxy. Concretely, we propose a novel statistical test to determine whether there is a convex combination of the set’s predictions that is calibrated in distribution. In contrast to previous methods, our framework allows the convex combination to be instance-dependent, recognizing that different ensemble members may be better calibrated in different regions of the input space. Moreover, we learn this combination via proper scoring rules, which inherently optimize for calibration. Building on differentiable, kernel-based estimators of calibration errors, we introduce a nonparametric testing procedure and demonstrate the benefits of capturing instance-level variability on synthetic and real-world experiments.

On the Calibration of Epistemic Uncertainty: Principles, Paradoxes and Conflictual Loss

Estimating Expected Calibration Errors

Diversified Ensemble of Independent Sub-networks for Robust Self-supervised Representation Learning

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

In supervised machine learning, it has become more and more important not only to have accurate predictors, but also to provide a reliable quantification of predictive uncertainty, i.e., the learner’s uncertainty in the outcome $y \in \mathcal {Y}$ given a query instance $x \in \mathcal {X}$ for which a prediction is sought. Predictive uncertainty is often divided into aleatoric and epistemic uncertainty (Senge et al., 2014; Hüllermeier & Waegeman, 2021; Kendall & Gal, 2017; Gruber et al., 2023), where the former corresponds to uncertainty that cannot be reduced with further information (e.g. more training data), as it originates from inherent randomness in the relationship between features X and labels Y. Therefore, we assume that the ground truth is a conditional probability distribution $\mathbb {P}_{Y \, | \,X}$ on $\mathcal {Y}$, i.e. given an input sample $x \in \mathcal {X}$, each outcome y has a certain probability to occur, given by $\mathbb {P}_{Y \, | \,X=x}$. Even with perfect knowledge about the underlying data-generating process, the outcome cannot be predicted with certainty. However, in a typical machine learning scenario, the learner does not know $\mathbb {P}_{Y|X}$. Having a space of possible hypotheses, an estimator of the underlying probability distribution f within this space typically consists of a mapping $f: \mathcal {X} \rightarrow \mathbb {P}(\mathcal {Y})$, where $\mathcal {X}$ denotes the feature space and $\mathbb {P}(\mathcal {Y})$ the space of all probability distributions over the target space $\mathcal {Y}$. In essence, epistemic uncertainty refers to the uncertainty about the true $\mathbb {P}_{Y \, | \,X}$, or the “gap” between $\mathbb {P}_{Y \, | \,X}$ and f. One approach to represent this gap is via second-order probability distributions, assigning a probability for each of the first-order predicted probability distributions, i.e. the candidates for f. This is commonly done in Bayesian methods, such as Gaussian processes and Bayesian neural networks (Gelman et al., 2004), but also in evidential deep learning methods (Ulmer et al., 2023). However, the former usually involves computationally costly methods to approximate the intractable posterior distribution, while the latter has been criticised for producing unfaithful or unreliable representations of epistemic uncertainty (Bengs et al., 2022, 2023; Meinert et al., 2023; Juergens et al., 2024). An alternative way of representing epistemic uncertainty—which will be the topic of this paper—is through sets of probability distributions. Such sets are often referred to as credal sets (Walley, 1991) in the imprecise probability literature, and are commonly assumed to be closed and convex (Cozman, 2000). In essence, they are designed to represent ignorance, i.e. a lack of knowledge about the underlying ground truth, by not committing to one, but a set of potential probability distributions. They can be obtained in a direct manner, as via credal classifiers (Hüllermeier et al., 2022; Javanmardi et al., 2024; Wang et al., 2025; Caprio et al., 2024; Nguyen et al., 2022), which have recently gained popularity, or in an indirect manner, via various types of ensemble methods, such as bootstrapped ensembles, deep ensembles (Lakshminarayanan et al., 2017) and randomization techniques based on Monte Carlo dropout (Gal & Ghahramani, 2016). In credal sets, epistemic uncertainty is quantified via the size of the credal set (Sale et al., 2023) or the diversity of the corresponding ensemble (Lakshminarayanan et al., 2017).

Due to a lack of an objective ground-truth, epistemic uncertainty representations are often evaluated in an indirect manner, using downstream tasks such as out-of-distribution detection (Ovadia et al., 2019), robustness to adversarial attacks (Kopetzki et al., 2021), and active learning (Nguyen et al., 2022). However, recent studies have raised concerns about the usefulness of such tasks w.r.t. epistemic uncertainty evaluation (Abe et al., 2022; Meinert et al., 2023; Bengs et al., 2022). A crucial open question is whether existing representations of epistemic uncertainty can be interpreted in a statistically profound way. For the case of credal sets, one might wonder whether such representations are statistically valid, i.e., whether a credal set contains with high probability the true underlying conditional target distribution. One way to evaluate this validity criterion is via statistical tests, as exemplary visualized in Fig. 1. Chau et al. (2024) address this issue by proposing a two-sample test framework for credal sets. However, they assume having a sample of sufficient size from the same underlying conditional distribution. In the “classical” machine learning scenario that we look at, we do not have access to the ground truth conditional distribution nor to more than one realization $(x_i,y_i)$ from it, making this direct evaluation infeasible.

One necessary condition for validity is calibrationNiculescu-Mizil and Caruana (2005), which measures the consistency between predicted probabilities and actual frequencies. In this paper we will utilize in this paper the notion of distribution calibration (Vaicenavicius et al., 2019) as a surrogate method for validity. Failing calibration necessarily means failing validity, hence in this case one can be confident that the true data generating distribution is not contained inside the credal set. Hence, by introducing a calibration test for set-based epistemic uncertainty representations, we aim for a more direct evaluation than measuring the performance on downstream tasks.

Our work builds further upon the work of Mortier et al. (2023), who first proposed the usage of calibration as a proxy for epistemic uncertainty evaluation of credal sets. The test introduced in that study analysed whether, for a given convex set of probabilistic models, there exists a calibrated convex combination in the set. However, this existing test has important limitations, which motivate our current work. Specifically, the original test focuses on determining whether a single convex combination of ensemble predictions can achieve calibration, but it does so in an instance-agnostic manner. This limitation means that the approach does not account for the variability of model calibration in dependence on the instance space. Moreover, the previous test tries to simulate the distribution of the calibration error estimators under the null hypothesis via sampling predictions in the credal set at random, which possibly leads to an overestimation of how calibrated the found convex combination is. In this paper, we address these shortcomings by introducing an instance-dependent approach to calibration testing, with two important modifications: First, due to the instance-dependency of the underlying convex combination of probabilistic predictions, we consider situations where predictions of the set are “differently well” calibrated for different regions in the instance space. Second, by using an optimization-based instead of a sampling-based algorithm, we achieve a more reliable estimate of the calibration statistic under the null hypothesis. Together the two modifications result in better control of the statistical Type I error, while simultaneously increasing the power of the test.

The paper is organized as follows. In Sect. 2 we review the literature on calibration, and we formally discuss the calibration measures that are used further on. In Sect. 3 we introduce both the concept of instance-dependent validity as well as the (weaker) notion of calibration for credal sets. We then propose our novel, nonparametric calibration test that is able to assess whether an epistemic uncertainty representation via credal sets is calibrated. Furthermore, we introduce an optimization algorithm for finding the most calibrated convex combination. It includes training a neural network as a meta-learner, using proper scoring rules as loss functions. In Sect. 4 we empirically evaluate the statistical Type I and Type II error of our test in different scenarios, thereby showing its improvement over the test proposed by Mortier et al. (2023) We further demonstrate its usefulness on validating common ensemble-based epistemic uncertainty representations of models trained on real-world datasets.

2 Calibration of first-order predictors

We start with reviewing common calibration metrics that have been introduced for evaluating distribution calibration in multi-class classification. We also discuss differentiable estimators of these metrics, and statistical tests that assess calibration for a single probabilistic model. The extension to sets of probabilistic models will be discussed in Sect. 3.

2.1 General setting

We assume a multi-class classification setting with feature space $\mathcal {X} \subseteq \mathbb {R}^d$ and label space $\mathcal {Y}=\{1, \dots , K\}$, consisting of K classes. Let X and Y be random variables that are distributed according to a joint distribution $\mathbb {P}_{X,Y}$ on $\mathcal {X} \times \mathcal {Y}$. A probabilistic model can be represented as $f: \mathcal {X} \rightarrow \mathbb {P}(\mathcal {Y})$, which is a map from the feature space to the space of probability distributions over the output space $\mathcal {Y}$. For multi-class classification problems with K classes, $\mathbb {P}(\mathcal {Y}) = \Delta _K$ (the $(K-1)$-dimensional probability simplex. Roughly speaking, a classifier is calibrated if its outputs coincide with probability distributions that match the empirical frequencies observed from realized outcomes. In multi-class classification, one distinguishes between different types of calibration. Confidence calibration (Niculescu-Mizil & Caruana, 2005; Guo et al., 2017; Kumar et al., 2018) only analyses the confidence score, i.e., the probability of the predicted class being calibrated. It is therefore the weakest notion of calibration. Distribution calibration (Bröcker, 2009; Vaicenavicius et al., 2019), which will be the focus in this paper, analyses whether all class probabilities are calibrated. Instead of only requiring calibrated marginal probabilities as in the definition of classwise calibration (Zadrozny & Elkan, 2002), it is defined via conditioning on the full probability vector; hence, it is the strongest notion of calibration.

Definition 1

A probabilistic multi-class classifier with output vector $f(X) \in \Delta _K$ is calibrated in distribution if for all $k \in \mathcal {Y}=\{1, \dots , K\}$ it holds that

$$\mathbb {P}(Y = k | f(X) = \varvec{s}) = s_k,$$

with $s_k$ being the k-th component of the probability vector $\varvec{s} \in \Delta _K$.

To evaluate calibration, one typically makes use of calibration errors. In general, these errors measure the (average) discrepancy between the conditional probability distribution $\mathbb {P}_{Y|f(X)}$ and the predictions f(X). Different calibration errors have been introduced to measure distribution calibration. One can in general differentiate between calibration errors defined as an expectation over a divergence $d: \Delta _K \times \Delta _K \rightarrow [0, \infty )$ between f(X) and $\mathbb {P}_{Y|f(X)}$ (Popordanoska et al., 2022, 2024):

$$\begin{aligned} \textrm{CE}_d(f) = \mathbb {E}\Big [d\big (f(X), \mathbb {P}_{Y|f(X)}\big )\Big ], \end{aligned}$$

(1)

and calibration errors defined via integral probability metrics (Müller, 1997), also called kernel calibration errors (Kumar et al., 2018; Widmann et al., 2019; Marx et al., 2024):

$$\begin{aligned} \textrm{CE}_{\mathcal {H}}(f) = \sup _{\phi \in \mathcal {H}}\Big |\mathbb {E} \Big [\phi (f(X), Y)\Big ] - \mathbb {E}\Big [\phi (f(X), Z)\Big ]\Big |, \end{aligned}$$

(2)

where $\mathcal {H}$ is typically chosen to be the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal {H}$, and Z is a conditioning variable that is assumed to be calibrated.

For a given dataset $\mathcal {D}=\{(x_i, y_i)\}_{i=1}^N$, we denote $\widehat{\textrm{CE}}(f):= \widehat{\textrm{CE}}(f, \mathcal {D})$ as an estimator of $\textrm{CE}$ based on the dataset $\mathcal {D}$.

A very common way to estimate the calibration error is to use binning (Zadrozny & Elkan, 2001; Naeini et al., 2015; Guo et al., 2017), i.e., partitioning the domain of the predictor variable f(X) into a finite number of discrete intervals or bins, and estimating the calibration error by aggregating data within each bin. However, the piecewise nature of binning functions makes binned estimators non-differentiable and unsuitable for optimization tasks requiring gradient information, a property that we will need in Sect. 3. Binning also introduces a bias in estimating the conditional expectation $\mathbb {E}[Y | f(X)]$, because it replaces continuous variations with average values within bins. Furthermore, binned estimators usually measure a weaker form of calibration than distribution calibration as defined in Definition 1, namely classwise calibration, which only demands calibrated marginal probabilities. For these reasons, binning is not discussed in the following overview of existing estimators. Due to the nature of our problem, we make use of calibration estimators that are consistent, (asymptotically) unbiased and differentiable.

2.2 Overview of calibration errors and estimators

One of the first and still widely used metrics (Gupta et al., 2020; Murphy, 1973; Rahimi et al., 2020) that intrinsically also assesses calibration is the (expected) Brier score (Brier, 1950) . It is defined as

$$\begin{aligned} \text {BS}(f) = \mathbb {E}\Big [\Big \Vert f(X)-\varvec{e}_Y \Big \Vert _2^2 \Big ], \end{aligned}$$

(3)

with $\varvec{e}_Y$ being the one-hot-encoded vector of Y. Its estimator is given by the mean squared error between the predicted probability and actual outcome, averaged over all classes:

$$\begin{aligned} \widehat{\textrm{BS}}(f) = \sum _{i=1}^N \sum _{j=1}^K (f_{ij} - \mathbb {I}_{(y_i =j)})^2, \end{aligned}$$

(4)

where $f_{ij}$ denotes the j-th entry of the vector $f(x_i)$. Being a proper scoring rule (Gneiting & Raftery, 2007), it does not only measure calibration, but can be decomposed into a calibration and a refinement loss term (Murphy, 1973; Kull & Flach, 2015), where the calibration loss consists of the expected $L_2$ error between the conditional distribution $P_{X|f(X)}$ and f(X), hence Eq. (1) with d being the Euclidean distance. In their framework of proper calibration errors, Gruber and Buettner (2022) show that the Brier score also serves as an upper bound for other common calibration errors.

Popordanoska et al. (2022) proposed an estimator for the $L^p$ calibration error, which for the case $p=2$ directly corresponds to the calibration term in the decomposition of the Brier score:

$$\begin{aligned} \textrm{CE}_p(f) = \Big (\mathbb {E}\Big [\Big \Vert f(X) - \mathbb {P}_{Y|f(X)}\Big \Vert _p^p\Big ]\Big )^{\frac{1}{p}}. \end{aligned}$$

(5)

They formulate an estimator of $\textrm{C}\textrm{E}_{p}(f)$ using Beta and Dirichlet kernel density estimates for the binary and multi-class classification case, respectively. Precisely, they prove that

$$\begin{aligned} \widehat{\textrm{CE}}_p(f) =\left( \frac{1}{n}\sum _{j=1}^{n}\left[ \Big \Vert \widehat{{\mathbb {E}}[y|f(x_j)}] -f(x_{j})\Big \Vert _{p}^{p}\right] \right) ^{\frac{1}{p}} \end{aligned}$$

(6)

is a point-wise consistent and asymptotically unbiased estimator, where the estimator of the conditional expectation is defined via kernel density estimation.

Using the Kullback–Leibler divergence $D_{KL}$ as the divergence measure, one can define another proper calibration error (Popordanoska et al., 2024):

$$\begin{aligned} \text {CE}_{KL}(f):= \mathbb {E}\Big [D_{KL}(\mathbb {E}[Y|f(X)], f(X))\Big ], \end{aligned}$$

(7)

which exactly forms the calibration error term of the popular log loss. An estimator is given by

$$\begin{aligned} \widehat{\textrm{CE}}_{KL}(f) =\frac{1}{n}\sum _{j=1}^{n}\left\langle \widehat{{\mathbb {E}}[y|f(x_j)}], \log \frac{\mathbb {E}[y|f(x_j)]}{f(x_j)}\right\rangle , \end{aligned}$$

(8)

where the estimator $\widehat{{\mathbb {E}}[y|f(x_j)}]$ is again defined via kernel density estimation.

Widmann et al. (2019) introduced the kernel calibration error and multiple differentiable estimators for it. In its kernel-based formulation, it is defined using a matrix-valued kernel $k: \Delta _K \times \Delta _K \rightarrow \mathbb {R}^{K \times K}$ as follows:

$$\begin{aligned} \textrm{CE}_k(f)=\left( \mathbb {E}\left[ (\varvec{e}_{Y}-f(X))^{\textsf{T}}k(f(X),f(X^{\prime }))(\varvec{e}_{Y^{\prime }} -f(X^{\prime }))\right] \right) ^{1/2}, \end{aligned}$$

(9)

where $(X', Y')$ is an independent copy of (X, Y) and $\varvec{e}_Y$, $\varvec{e}_{Y'}$ are the one-hot-encoded vectors of the random variables Y and $Y'$. We will make use of the computationally more feasible unbiased estimator they propose, defined as

$$\begin{aligned} \widehat{\textrm{CE}}_k(f) = \frac{1}{\lfloor n/2\rfloor } \sum _{i=1}^{\lfloor n/2 \rfloor } h_{2i-1,2i}, \end{aligned}$$

with k being a universal matrix-valued kernel and $h_{i,j}$ corresponds to the term in the expectation (9) evaluated on two instance-label pairs.

Marx et al. (2024) introduced a framework of calibration estimators for different types of calibration, which considers it as a distribution matching problem between (true) conditional distribution $\mathbb {P}_{Y|X}$ and the predicted distribution $\hat{\mathbb {P}}_{Y|X}$ induced by f(X). Similar to Widmann et al. (2019), they proposed integral probability metrics to measure the distance between real conditional and predicted distribution. For $(X, Y) \sim \mathbb {P}$ and $(X, \hat{Y}) \sim \hat{\mathbb {P}}$ they define the calibration error as the Maximum-Mean-Discrepancy (MMD) between $\mathbb {P}$ and $\hat{\mathbb {P}}$:

$$\begin{aligned} \text {CE}_{MMD}(f) = \sup _{\phi \in \mathcal {H}} \Big |\mathbb {E}\big [\phi (Y, f(X))\big ]- \mathbb {E}\big [\phi (\hat{Y}, f(X))\big ] \Big |, \end{aligned}$$

(10)

where $\mathcal {H}$ is again the unit ball of an RKHS. For the classification case, they introduce a trainable calibration estimate for distribution calibration which measures the squared error as

$$\begin{aligned} \widehat{\text {CE}}_{MMD}^2 = \frac{1}{n(n-1)} \sum _{i=1}^n \sum _{j=1, j \ne i}^n h_{ij}, \end{aligned}$$

(11)

where the terms $h_{ij}$ are defined dependent on the type of calibration. They define $z_i$ being a conditional random variable distributed according to $f(x_i)$, for $i \in \{1, \dots , N \}$ and

$$\begin{aligned} h_{ij} = k((y_i, z_i), (y_j, z_j)) + \sum _{y \in \mathcal {Y}}\sum _{y' \in \mathcal {Y}} q_i(y)q_j(y')k((y, z_i), (y', z_j)) - 2 \sum _{y\in \mathcal {Y}}q_i(y)k((y, z_i), (y_j, z_j)), \end{aligned}$$

with $k: \Delta _K \times \Delta _K \rightarrow \mathbb {R}$ being a universal kernel function, and $q_i(y)$ the predicted probability for y given $f(x_i)$.

2.3 Statistical tests for calibration

As the estimators $\widehat{\textrm{CE}}(f,\mathcal D)$ reviewed in the previous section are computed from finite samples, their realised values inherit sampling variability, and hence comparing them in terms of their realized values does not take this randomness into account. Appendix A visualises the distribution of the estimators under the null hypothesis that they are calibrated; the dispersion shrinks with the sample size N, but even for large N randomness is non-negligible. Consequently a formal hypothesis test is needed when evaluating the calibration of a classifier $f\!:\mathcal X\!\rightarrow \!\Delta _{K}$ on a finite data set $\mathcal D=\{(x_i,y_i)\}_{i=1}^{N}$:

$$\begin{aligned} H_{0}: f \text { is calibrated} \quad \quad H_{1}: \lnot H_0. \end{aligned}$$

There are a number of established tests for calibration in classification problems. The classical Hosmer–Lemeshow test (Hosmer et al., 1997), initially developed as a goodness-of-fit test for the logistic regression model, considers a chi-squared distributed test statistic based on observed and expected frequencies of events. However, as it requires binning and can only be used to test for confidence calibration, we do not include it in this analysis. Widmann et al. (2019) derive a test for the kernel calibration error ($\text {CE}_k$), which takes the asymptotic normality of their proposed estimator into account. While it yields analytic p-values and is computationally light once the test statistic is calculated, its validity holds only for the specific kernel calibration estimator and its asymptotic normality.

Vaicenavicius et al. (2019) developed a general framework to evaluate calibration based on consistency sampling (Bröcker & Smith, 2007), a method that applies bootstrapping to estimate the distribution of the calibration estimator’s values under the null hypothesis that f is calibrated. This is done by resampling new labels in each bootstrap iteration, based on the distribution that f predicts for the respective bootstrap sample. Given the (empirical) distribution function of the calibration estimator under the null hypothesis, one can then check how likely the given calibration estimate is under the assumption that f is calibrated. Being a nonparametric test, it can therefore be used to test for distribution calibration in combination with any of the calibration estimators mentioned above. For these reasons we adopt the nonparametric bootstrap test of Vaicenavicius et al. (2019) throughout the paper: it is the only approach that (a) is valid for all considered calibration errors, (b) remains distribution-free, and (c) integrates seamlessly with our proposed algorithm.

3 Calibration of sets of probabilistic predictors

We will now come to the main part of this paper, namely the evaluation of calibration of not only one, but a set of classifier models. To this end, after introducing the necessary methodology, we introduce our proposed test in a step-wise manner.

3.1 Credal sets

Credal sets, generally defined as sets of probability distributions, form a way to represent disbelief or uncertainty about the true underlying probability distribution. Instead of committing to a single point prediction, using credal sets to represent uncertainty allows to stay imprecise, thereby avoiding prediction error if the uncertainty is too high. The sets can be obtained in various ways (Wang et al., 2025; Caprio et al., 2024), one direct way being via sets of probabilistic classifiers. In the following, let $\mathcal {F}:= \{f^{(1)}, \dots , f^{(M)}\}$, with $f^{(i)} : \mathcal {X} \rightarrow \Delta _K$ the i-th probabilistic model in a set that contains M models in total. For each feature vector $x \in \mathcal {X}$, this yields a (credal) set of probability distributions $\mathcal {F} |_x = \{f^{(1)}(x), \dots , f^{(M)}(x)\} \subseteq \Delta _K$. A natural way to validate a representation of epistemic uncertainty through sets of predictors is to look at their possible convex combinations: If there is at least one prediction in the convex hull of $\mathcal {F}|_x$ that is calibrated, then one can argue that the set of predictors contains the ground truth aleatoric part of the uncertainty, hence fulfils a necessary requirement for representing epistemic uncertainty. Calibration here serves as a relevant necessary condition for validity; a calibrated combination indicates that the predictors are not systematically biased and that they can approximate the true data-generating process in a consistent manner. For each instance $x\in \mathcal {X}$, we now define the (credal) set $\mathcal {S}(\mathcal {F},x)$ as the set of all possible convex combinations of predictors in $\mathcal {F}|_x$.

Definition 2

For a feature vector $x \in \mathcal {X}$, the credal set $\mathcal {S}(\mathcal {F}, x)$ is the set of all convex combinations of $\mathcal {F} |_x$:

$$\begin{aligned} \mathcal {S}(\mathcal {F}, x) = \left\{ f_{\varvec{\lambda }}(x) \in \mathcal {H} \left| \, f_{\varvec{\lambda }}(x) = \sum _{i=1}^M \lambda _i(x)f^{(i)}(x) \, \right| \,(\lambda _1, \dots , \lambda _M) \in \Delta _{M, \mathcal {X}} \right\} , \end{aligned}$$

where $\Delta _{M, \mathcal {X}} = \Big \{\varvec{\lambda } = (\lambda _1, \dots , \lambda _M) \Big |\, \lambda _i: \mathcal {X} \mapsto [0, 1] \, \text {and}\, \sum _{i=1}^M \lambda _i(x) = 1 \, \forall x \in \mathcal {X} \Big \}$ denotes the set of all functions with co-domain being the $(M-1)$-simplex.

Here $\Delta _{M,\mathcal X}$ is a function space over $\mathcal X$, while in previous work (Mortier et al., 2023), $\Delta _{M}$ was the $(M-1)$-simplex of all (constant) convex combinations, that is, the same convex combination for all $x \in \mathcal {X}$ was analysed in order to test for calibration. Instance-dependent convex combinations offer the flexibility needed to capture varying degrees of aleatoric and epistemic uncertainty across the input domain. Different regions of the instance space may require different degrees of model blending to appropriately represent the level of uncertainty, particularly when some models are better calibrated than others. Def. 2 constructs credal sets via realizations of sets of predictors, which yield sets of probability distributions. From the perspective of imprecise probabilities, one might also define a credal predictor, which yields as output a set of probability distributions. The proposed framework also holds for this scenario, given the predictor outputs a convex set of extreme points.

3.2 Validity and calibration for sets of predictions

In the following, we formally define our notion of validity for set representations of epistemic uncertainty. It implies that for each $x \in \mathcal {X}$, the true underlying probability distribution $\mathbb {P}_{Y|X=x}$ is contained in the (credal) set of all possible convex combinations $\mathcal {S}(\mathcal {F}, x)$.

Definition 3

(Validity of credal sets) Let $\mathcal {F}= \{f^{(1)}, \dots , f^{(M)}\}$ be a set of predictors, with $f^{(i)}: \mathcal {X} \rightarrow \mathbb {P}(\mathcal {Y})$, and for each $x\in \mathcal {X}$, define the induced credal set $\mathcal {F}|_x$ as in Definition 2. Then $\mathcal {F}$ is valid if $\mathbb {P}_{Y|X=x} \in \mathcal {S}(\mathcal {F}, x)$ for all $x \in \mathcal {X}$.

In practice, testing validity requires having either access to the ground truth conditional distribution $\mathbb {P}_{Y|X}$ or a sufficient number of samples thereof, as assumed in Chau et al. (2024). Hence, in our setting, we make use of the weaker notion of calibration which results from conditioning on the predictions f(X) instead of X. This way, we analyse whether there is a $\varvec{\lambda } \in \Delta _{M, \mathcal {X}}$ such that $\mathbb {P}_{Y|f_{\lambda }(x)} \in \mathcal {S}(\mathcal {F}, x)$. Similar as in Mortier et al. (2023), we now define a set of classifiers, or equivalently a credal classifier, as calibrated if there exists (at least) one calibrated convex combination of predictors. Definition 4 forms a generalization of Definition 4 of Mortier et al. (2023), where the coefficients of the convex combination do not form functions on the instance space, but are constants, i.e., $\varvec{\lambda } \in \Delta _M$.

Definition 4

Let $\mathcal {F}=\{f^{(1)}, \dots , f^{(M)}\}$ be a set of predictors, $f^{(i)}: \mathcal {X} \rightarrow \Delta _K$. We say that $\mathcal {F}$ is calibrated in distribution if there exists $\varvec{\lambda } \in \Delta _{M, \mathcal {X}}$ such that the combined predictor $f_{\varvec{\lambda }} : \mathcal {X} \rightarrow \mathcal {S}(\mathcal {F}, \mathcal {X})$ defined as

$$f_{\varvec{\lambda }}(x) = \sum _{i=1}^M \lambda _i (x) f^{(i)}(x)$$

is calibrated in distribution (Definition 1).

Note that calibration provides a necessary, but not sufficient condition for validity: A valid predictor is always calibrated, yet the reverse does not have to hold. In fact, there are (possibly) many calibrated predictors, as shown in Vaicenavicius et al. (2019). In Appendix B, we show that this is also the case for convex combinations of probabilistic models. However, if a predictor is not calibrated, one knows that it is also not valid.

3.3 A novel calibration test for sets of predictors

Our null and alternative hypothesis for testing the calibration of a set of predictors $\mathcal {F}$ can now be formulated as

$$\begin{aligned} H_0: \quad \exists \varvec{\lambda } \in \Delta _{M, \mathcal {X}} \, \text {s.t.} f_{\varvec{\lambda }} \, \text {is calibrated}, \quad H_1: \lnot H_0. \end{aligned}$$

(12)

In order to analyse whether there is a calibrated convex combination, a natural approach is to analyse the one with the lowest calibration error. Hence, we define the minimum calibration error as follows:

$$\begin{aligned} \min _{\varvec{\lambda } \in \Delta _{M, \mathcal {X}}} g(f_{\varvec{\lambda }}), \end{aligned}$$

(13)

where g is a calibration error.

For the experiments, we choose g to be equal to the errors proposed in Sect. 2, i.e., $g \in \{\text {CE}_p, \text {CE}_{KL}, \text {CE}_{MMD},\text {CE}_k\}$. These have under suitable conditions the desirable property that f is calibrated if and only if $\text {CE} = 0$, making them suitable for optimization. Furthermore, the respective estimators described in Sect. 2 are both consistent and at least asymptotically unbiased. Having an optimisation dataset $\mathcal {D}_{opt} = \{(x_i, y_i)\}_{i=1}^{\tilde{N}}$, one can represent the evaluations of the weight function $\varvec{\lambda }: \mathcal {X}\rightarrow \Delta _{M, \mathcal {X}}$ by an $\tilde{N} \times M$ matrix:

$$\begin{aligned} \Lambda = \begin{pmatrix} \lambda _1(x_1) & \quad \dots & \quad \lambda _M(x_1) \\ \vdots & \quad \ddots & \quad \vdots \\ \lambda _1(x_{\tilde{N}}) & \quad \dots & \quad \lambda _M(x_{\tilde{N}}) \\ \end{pmatrix}. \end{aligned}$$

(14)

Hence, for $\hat{g}$ being the (data-dependent) estimator of g, finding the minimum empirical calibration error can be formulated as follows:

$$\begin{aligned} \min _{\Lambda _{i, j}, \, i \in {1, \dots , M}, j \in {1, \dots , \tilde{N}}}\hat{g}(f_{\Lambda }, \mathcal {D}_{val}) \end{aligned}$$

(15)

with $f_{\Lambda }(x_i)=\sum _{m=1}^M \lambda _m(x_i) f^{(m)}(x_i) \in [0,1]^K$, having the constraint $\sum _{j=1}^M \Lambda _{ij} =1$ for all $j \in \{1, \dots , \tilde{N}\}$.

As already described in Sect. 3.2, there are in general many calibrated functions f, implying the absence of a unique global minimum for the optimization problem (16). Therefore, in the optimisation, we make use of a combined loss function consisting of a proper scoring rule and the respective calibration estimator, weighted by a constant. Proper scoring rules intrinsically optimise for both a calibration error and an accuracy term and provide a stable optimisation. The specific optimisation method is described in Sect. 3.5.

3.4 Algorithmic procedure

We now explain in detail our new, adapted version of the statistical test proposed in Mortier et al. (2023), which in turn is based on the bootstrap test introduced by Vaicenavicius et al. (2019). Algorithm 1 shows the pseudo code of the test, which consists of the following steps:

First, using an appropriate optimization algorithm, the per-instance convex combination $\Lambda ^*$ with minimal calibration error is found (line 3). Using $\Lambda ^*$, the predictions of the combined classifier $f_{\Lambda ^*}$ are computed (line 4). Third, the calibration test of Vaicenavicius et al. (2019) is employed to assess whether the predictions of $f_{\Lambda ^*}$ are calibrated (line 5–19). This statistical test is based on bootstrapping and consistency resampling, that is, resampling the labels from the distribution induced by $f_{\Lambda ^*}$. This way, one is able to estimate the distribution of the calibration error estimator $\hat{g}$ under the null hypothesis that $f_{\Lambda ^*}$ is calibrated (see also Fig. 5 for an example).

For a given significance level $\alpha$, the test rejects the null hypothesis if the value of the calibration error estimator is higher than the $(1-\alpha )$-quantile of the bootstrapped empirical distribution. In the other case, it cannot reject it. Algorithm 1 differs from the proposed algorithm in Mortier et al. (2023) in two important aspects:

1.
It allows for the case where the weights for the most calibrated convex combination depend on the instance space, i.e., for different regions in it, it accounts for the fact that some predictors might be better calibrated than others and vice versa.
2.
It uses an optimization instead of a sampling-based approach: For constructing the distribution under the null hypothesis, the previous test used uniform sampling of the weights. Directly optimizing over the calibration error and performing the test on the found convex combination has two main advantages: (a), the uniform sampling of weights used in Mortier et al. (2023) does not lead to uniform sampling in the credal set, which we further elaborate on in Appendix D. By putting emphasis on combinations in the inner of the polytope, it could lead to a biased estimate of the distribution under the null hypothesis. Advanced sampling techniques that could result in a more uniform sampling in the credal sets, like rejection-sampling or triangulizations, are computationally costly. (b), the minimization step automatically “guides” the algorithm towards the region with low calibration errors, avoiding the need to explore the whole credal set. In Appendix C we show that using a universal approximator to learn the underlying convex combination leads to the test being asymptotically valid, i.e., controlling the Type 1 error. The optimisation step is explained in more detail in the following section.

3.5 A robust way of finding the most calibrated convex combination

We will now analyse further how optimization problem (15) can be solved in an efficient and robust manner. Specifically, we try to avoid the problem of overfitting on the empirical calibration error on the respective dataset: Since the latter is finite, there might be a mismatch between the estimated and the population-level calibration error. Classical gradient-based solvers, which purely optimise the calibration error on the same dataset that is used for the test, might therefore run into the problem of underestimating the true calibration error, thereby making the test more conservative. This can also be seen in the empirical results of Mortier et al. (2023).

Therefore, we suggest an alternative approach find the minimum in (15), which incorporates two important aspects:

We use a neural network for learning the weight function $\varvec{\lambda }: \mathcal {X} \rightarrow \Delta _{M, \mathcal {X}}$, exemplary visualized in Fig. 2. This approach is similar to stacking (Wolpert, 1992), where a second meta-learner is trained to find the optimal combination of predictions. It is trained on a separate optimization dataset, and then used to predict the optimal convex combination for our test.
As a loss function, we use the risk of a proper scoring rule $\ell$, combined with a term controlling the calibration error, such that the true risk minimizer is given by
$$\begin{aligned} \varvec{\lambda }^* \in \mathop {\mathrm {arg\,min}}\limits _{\varvec{\lambda }\in \Delta _{M, \mathcal {X}}} \mathbb {E}\big [\ell (f_{\varvec{\lambda }}(x), y)\big ] + \gamma \cdot g(f_{\varvec{\lambda }}). \end{aligned}$$
(16)
We use $\hat{g} \in \{\widehat{\textrm{CE}}_{KL}, \widehat{\text {CE}}_2, \widehat{\text {CE}}_k, \widehat{\text {CE}}_{MMD}\}$ as estimates for the respective calibration error of interest, while $\gamma \ge 0$ denotes a weight factor. This approach takes recent insights (Popordanoska et al., 2022; Marx et al., 2024) that adding a calibrating penalty to a proper scoring rule can help ensuring calibration, into account, while avoiding trivial or degenerate solutions. In particular, many convex combinations can be calibrated (Appendix B), hence, the proper scoring rule serves as a regularizer, guiding the optimizer toward solutions that not only reduce calibration error but also maintain high predictive accuracy. Furthermore, we learn $\varvec{\lambda }^*$ on a validation set but evaluate calibration on a separate test set. This split reduces the risk of overfitting to our chosen calibration metric, ensuring the resulting test does not become overly conservative.

4 Experimental results

We evaluate our test both on synthetic and real-world data. A more detailed description of the experimental setup can also be found in Appendix E and the code for reproducing the experiments is available on GitHub.^{Footnote 1} For the experiments on synthetic data, a MLP architecture with 3 hidden layers and 16 neurons is used. Experimentally, we found that this simple architecture was sufficient to solve the optimization problem in Eq. (16) and learn the underlying weight function. For the experiments on real data, we use a more complex network architecture, which is described in Sect. 4.3. As a loss function, we use the combined loss as in (16), with the Brier score as a proper scoring rule, and $\gamma =0.01$.

4.1 Binary classification

For illustration purposes, we first examine the case of binary classification by simulating $M=2$ probabilistic predictors, each outputting a probability for the positive class. For each input x, the predictor’s probabilities $f^{(1)}(x)$ and $f^{(2)}(x)$ are sampled from a Gaussian process (GP) constrained to [0, 1]. We then define a true calibrated predictor $f^*$ that lies inside (under $H_0$) or outside (under $H_1$) the convex hull of the two predictors. Figure 9 shows the exemplary setting for this experiment. Under $H_0$, we consider two cases of non-instance and instance-dependence of $\varvec{\lambda }=(\lambda _1, \lambda _2)^T$:

1.
$\varvec{H_{0,1}}$: The calibrated predictor $f^*$ is a constant convex combination of $f^{(1)}$ and $f^{(2)}$, where $\lambda _1^*(x) \equiv c \sim Unif([0,1])$, $\lambda _2^*(x) = 1 - \lambda _1^*(x)$.
2.
$\varvec{H_{0,2}}$: $\lambda ^*(x)$ is a randomly generated polynomial function and $f^*(x)= \lambda ^*(x) f^{(1)}(x) +(1-\lambda ^*(x))f^{(2)}(x)$.

Under $H_1$, $f^*(x)$ lies strictly outside or near the boundary of the credal set. We generate 3 scenarios of increasing distance from it:

1.
$\varvec{H_{1,1}}$: $f^*(x)$ is set at a small $\epsilon$-distance to one boundary.
2.
$\varvec{H_{1,2}}$: is sampled from a GP that remains close but outside the credal set.
3.
$\varvec{H_{1,3}}$: similar to $\varvec{H_{1,2}}$, but allowing a larger maximum distance.

Using the described scenarios, Fig. 3 shows the Type 1 and Type 2 error of the proposed test, averaged over 200 runs of the experiment. For comparison, we include the results given by the algorithm of Mortier et al. (2023), as well as the algorithm of Vaicenavicius et al. (2019) applied to the mean of the ensemble, $f_{\lambda _{AVG}}(x) = \frac{1}{M}\sum _{i=1}^M f^{(i)}(x)$. The average Type 1 error of our testlies close to the chosen significance level. In some cases, it lies slightly above it, which is is due to the generalization error on the unseen data: Since we learn the optimally calibrated convex combination on a separate dataset, the learned convex combination on the test set might be not the one with lowest calibration error, due to the inherent randomness within the data. Nonetheless, with increasing sample size, we expect the Type 1 error to be controlled by the significance level (due to the test’s asymptotic validity). The results show that the test of Mortier et al. is overly conservative, with a rejection rate far lower than the significance level. The test of Vaicenavicius applied to the mean prediction has a Type 1 error far higher than the significance level; which is expected, as the mean prediction might not lie close enough to the underlying conditional distribution for the test to not reject. Both our test and the test of Vaicenavicius applied to the mean show a high power for this setting, with almost zero Type 2 error also in the case where the true distribution lies in very close distance to the polytope. However, the conservative nature of the test of Mortier et al. leads to a higher Type 2 error, thereby revealing the provided benefit of our test in terms of power.

4.2 Multi-class classification

We consider $K > 2$ classes, and follow a Dirichlet-based scheme to generate an ensemble of M probabilistic predictors. Specifically, we generate a prior $\varvec{p} \sim Dir(\varvec{1})$ and predictions $f^{(1)}(x), \dots , f^{(M)}(x) \sim Dir(\frac{\varvec{p}|x \cdot K}{u(x)})$, where $u: \mathcal {X} \rightarrow \mathbb {R}_{>0}$ is a function defining the epistemic uncertainty (i.e. the larger u, the higher is the spread between the predictions within the set) over the instance space. For the experiments, we use a constant $u(x) \equiv 0.5$. Under $H_0$, which we found to be a good tradeoff to evaluate Type 1 and 2 error of the test. We distinguish between

1.
$\varvec{H_{0,1}}$: $\varvec{\lambda ^*}(x) \equiv \varvec{c} \in [0,1]$ where $\varvec{c} \sim Dir(1, \dots ,1) \in \Delta _M$ ($\varvec{\lambda ^*}$ is constant across the instance space), and
2.
$\varvec{H_{0,2}}$: $\varvec{\lambda ^*}(x) = (\lambda _1^*(x), \dots , \lambda _M^*(x)) \in \Delta _M$ with $\lambda ^*_m(x) = \sum _{i=0}^D \beta _i x^i$ for $m =1, \dots ,M$ and $\sum _{m=1}^M \lambda _m^*(x) =1$ (the components of $\varvec{\lambda ^*}$ form scaled polynomials of a certain degree).

For the alternative hypothesis, we select a random corner $f_c \notin \mathcal {F}$ of the simplex and then set the true underlying conditional distribution $f^*$ as a point on the connecting line between the corner and the boundary point $f_b \in \mathcal {F}$ of the credal set: $f^*(x) = \delta \cdot f_c(x) + (1- \delta ) \cdot f_b(x)$. We define three cases with increasing distance to the credal set by varying the mixing coefficient $\delta$: $\varvec{H_{1,1}}$: $\delta = 0.01$, $\varvec{H_{1,2}}$: $\delta = 0.1$ and $\varvec{H_{1,3}}$: $\delta = 0.2$.

The resulting Type 1 and Type 2 errors of this experiment are shown in Fig. 4, again for our test and the two baselines. We see that the proposed test yields more reliable decisions, not heavily exceeding the given significance level while following it closely. The test of Mortier et al. is again overly conservative for all estimators, except for the kernel calibration error estimator $\widehat{\text {CE}}_k$, which in general leads to unreliable results and low power. While the test of Vaicenavicius et al. applied to the mean prediction yields the highest power, it cannot be seen as a valid test, as it does not control the Type 1 error rate. Note here that we use the same datasets for all three tests, but we perform the optimisation for our proposed test on a separate dataset of the same size. The Type 2 error analysis shows that, except for the kernel calibration error, our proposed test has higher power than the previously proposed one, and it aligns better with the chosen significance level.

4.3 Experiments on real-world data

Table 1 Calibration test results on CIFAR-10 and CIFAR-100 for the significance level $\alpha =0.05$ and $D=100$ bootstrap iterations

Full size table

Since the true data-generating distribution is unknown in real-world datasets, we cannot directly quantify Type 1 or Type 2 errors. Instead, we demonstrate the practical usefulness of our test by applying it to the two standard image classification tasks CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). More specifically, we apply our test on the predictions of three ensemble methods, combined with two different architectures that are trained on the two datasets, respectively. We train a deep ensemble (DE, (Lakshminarayanan et al., 2017)), a dropout network with dropout rate 0.5 (DN(0.5), (Gal & Ghahramani, 2016)) and a dropout network with dropout rate 0.2 (DN(0.2)). For the deep ensemble, we train 10 different models using different weight initializations, while in the dropout networks, dropout is applied after specific layers. A higher dropout rate injects more parameter noise during training, increasing predictor diversity and thus enlarging the credal set, which is why we analyse the two different cases. After training, we obtain a set of different predictions from these models, which we then use to run Algorithm 1. As model architectures, we use two architectures of different model complexity: the ResNet-18 (He et al., 2016) ($11.6 \times 10^6$ parameters) and the VGG-19 (Simonyan & Zisserman, 2014) ($138 \times 10^6$ parameters). For the optimisation part, we leverage the same architecture, i.e. a pre-trained ResNet-18 or VGG-19, to embed images and attach a fully-connected layer with 32 neurons to predict the weight function $\varvec{\lambda }$. Section E describes the full experimental setup. The results are shown in Table 1. For each model and calibration estimator, we run the respective bootstrapping test (line 5–19 of Algorithm 1), with the optimal weight function $\varvec{\lambda }^*$ learned by the neural network. Similarly as in the synthetic experiments, we also perform the bootstrapping part of the test with the mean predictor (Algorithm 1 from line 5 with $f_{\varvec{\lambda }_{AVG.}}=\frac{1}{M}\sum _{j=1}^M f^{(j)}$). The results differ significantly, depending on the used calibration error. In general, using the calibration error $\text {CE}_{KL}$ led to the fewest rejections (i.e. the highest p values), potentially because the models were trained with a log-loss objective as proper scoring rule, which aligns more closely with KL-based calibration measures. In general, for the CIFAR-100 dataset with more classes, the test rejects more often, and the dropout networks are less calibrated. In most cases, the learned convex combination $\varvec{\lambda }^*$ leads to a slight increase in accuracy for the combined prediction $f_{\varvec{\lambda }^*}$.

5 Discussion

We developed a novel statistical test for the calibration of epistemic uncertainty representations based on sets of probability distributions. It is defined on an instance-based level, thereby making the test more flexible. As calibration forms a necessary condition for validity, we claim that by testing for it, one can safely detect scenarios where the credal set is not valid. For this case, the next step to include would be actionability, i.e., ways to increase the size of the credal set to include at least one calibrated convex combination. Here, we model calibration as a property that applies across the entire instance space. Alternatively, calibration could be viewed as an instance-specific concept, allowing analysis in different regions of the instance space. However, there has been limited research on this form of calibration, sometimes referred to as conditional or local calibration (Luo et al., 2022), and the challenge of partitioning the instance space in a way that enables the computation of expectations remains unresolved. Interesting future work includes the application of the test to other credal set representations such as interval-neural networks (Wang et al., 2025), credal sets based on relative log-likelihood (Löhr et al., 2025) and bootstrapped ensembles, as well as the analysis of the regression case.

Data availability

Data is provided within the manuscript or supplementary information files.

Notes

https://github.com/mkjuergens/EnsembleCalibration.

References

Abe, T., Buchanan, E. K., Pleiss, G., Zemel, R., & Cunningham, J. P. (2022). Deep ensembles work, but are they necessary? In Proceedings of NeurIPS, 35th advances in neural information processing systems.
Bengs, V., Hüllermeier, E., & Waegeman, W. (2022). Pitfalls of epistemic uncertainty quantification through loss minimisation. In Proceedings of NeurIPS, 35th advances in neural information processing systems.
Bengs, V., Hüllermeier, E., & Waegeman, W. (2023). On second-order scoring rules for epistemic uncertainty quantification. In Proceedings of ICML, 40th international conference on machine learning.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1.
Article Google Scholar
Bröcker, J. (2009). Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135, 1512–1519.
Article Google Scholar
Bröcker, J., & Smith, L. A. (2007). Increasing the reliability of reliability diagrams. Weather and Forecasting, 22, 651–661.
Article Google Scholar
Caprio, M., Dutta, S., Jang, K. J., Lin, V., Ivanov, R., Sokolsky, O., & Lee, I. (2024). Credal Bayesian deep learning. Transactions on Machine Learning Research.
Chau, S. L., Schrab, A., Gretton, A., Sejdinovic, D., & Muandet, K. (2024). Credal two-sample tests of epistemic ignorance. arXiv preprint arXiv:2410.12921.
Cozman, F. G. (2000). Credal networks. Artificial Intelligence, 120(2), 199–233.
Article MathSciNet Google Scholar
Dümbgen, L. (2017). Empirische Prozesse. Institut für Mathematische Statistik und Versicherungsmathematik der Universität Bern, Bern, Switzerland. https://books.google.de/books?id=tmFMygEACAAJ.
Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of ICML, 32nd international conference on machine learning.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). Boca Raton: Chapman and Hall.
Google Scholar
Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–378.
Article MathSciNet Google Scholar
Gruber, S., & Buettner, F. (2022). Better uncertainty calibration via proper scores for classification and beyond. In Proceedings of NeurIPS, 35th advances in neural information processing systems.
Gruber, C., Schenk, P. O., Schierholz, M., Kreuter, F., & Kauermann, G. (2023). Sources of uncertainty in machine learning—a statisticians’ view. arXiv preprint arXiv:2305.16703.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. In Proceedings of ICML, 34th international conference on machine learning.
Gupta, K., Rahimi, A., Ajanthan, T., Mensink, T., Sminchisescu, C., & Hartley, R. (2020). Calibration of neural networks using splines. In Proceedings of ICML, international conference on learning representations.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
Hosmer, D. W., Hosmer, T., Le Cessie, S., & Lemeshow, S. (1997). A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine, 16, 965–980.
Article Google Scholar
Hüllermeier, E., Destercke, S., & Shaker, M. H. (2022). Quantification of credal uncertainty in machine learning: A critical analysis and empirical comparison. In Proceedings of UAI, 34th conference on uncertainty in artificial intelligence.
Hüllermeier, E., & Waegeman, W. (2021). Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110, 457–506.
Article MathSciNet Google Scholar
Javanmardi, A., Stutz, D., & Hüllermeier, E. (2024). Conformalized credal set predictors. In Proceedings of NeurIPS, 37th advances in neural information processing systems.
Juergens, M., Meinert, N., Bengs, V., Hüllermeier, E., & Waegeman, W. (2024). Is epistemic uncertainty faithfully represented by evidential deep learning methods? In Proceedings of ICML, 41st international conference on machine learning.
Kendall, A., & Gal, Y. (2017). What uncertainties do we need in bayesian deep learning for computer vision? In Proceedings of NeurIPS, 30th advances in neural information processing systems.
Kopetzki, A.-K., Charpentier, B., Zügner, D., Giri, S., & Günnemann, S. (2021). Evaluating robustness of predictive uncertainty estimation: Are Dirichlet-based models reliable? In Proceedings of ICML, 38th international conference on machine learning.
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.
Kull, M., & Flach, P. (2015). Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. In Machine learning and knowledge discovery in databases.
Kumar, A., Sarawagi, S., & Jain, U. (2018). Trainable calibration measures for neural networks from kernel mean embeddings. In Proceedings of ICML, 35th international conference on machine learning.
Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of NeurIPS, 30th advances in neural information processing systems.
Löhr, T., Hofman, P., Mohr, F., & Hüllermeier, E. (2025). Credal prediction based on relative likelihood. arXiv preprint arXiv:2505.22332.
Luo, R., Bhatnagar, A., Bai, Y., Zhao, S., Wang, H., Xiong, C., Savarese, S., Ermon, S., Schmerling, E., & Pavone, M. (2022). Local calibration: Metrics and recalibration. In Proceedings of UAI, uncertainty in artificial intelligence.
Marx, C., Zalouk, S., & Ermon, S. (2024). Calibration by distribution matching: Trainable kernel calibration metrics. In Proceedings of NeurIPS, 36th advances in neural information processing systems.
Meinert, N., Gawlikowski, J., & Lavin, A. (2023). The unreasonable effectiveness of deep evidential regression. In Proceedings of AAAI, 37th proceedings of the AAAI conference on artificial intelligence.
Mortier, T., Bengs, V., Hüllermeier, E., Luca, S., & Waegeman, W. (2023). On the calibration of probabilistic classifier sets. In Proceedings of the 26th international conference on artificial intelligence and statistics.
Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29, 429–443.
Article MathSciNet Google Scholar
Murphy, A. H. (1973). A new vector partition of the probability score. Journal of Applied Meteorology (1962–1982), 12(4), 595–600.
Article Google Scholar
Naeini, M. P., Cooper, G. F., & Hauskrecht, M. (2015). Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of AAAI, proceedings of the twenty-ninth AAAI conference on artificial intelligence.
Nguyen, V.-L., Shaker, M. H., & Hüllermeier, E. (2022). How to measure uncertainty in uncertainty sampling for active learning. Machine Learning, 111, 89–122.
Article MathSciNet Google Scholar
Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. In Proceedings of ICML, 22nd international conference on machine learning.
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., & Snoek, J. (2019). Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Proceedings of NeurIPS, 31st advances in neural information processing systems.
Popordanoska, T., Gregor Gruber, S., Tiulpin, A., Buettner, F., & Blaschko, M. B. (2024). Consistent and asymptotically unbiased estimation of proper calibration errors. In Proceedings of the 27th international conference on artificial intelligence and statistics.
Popordanoska, T., Sayer, R., & Blaschko, M. (2022). A consistent and differentiable lp canonical calibration error estimator.
Rahimi, A., Shaban, A., Cheng, C.-A., Hartley, R., & Boots, B. (2020). Intra order-preserving functions for calibration of multi-class neural networks. In Proceedings of NeurIPS, 33rd advances in neural information processing systems.
Sale, Y., Caprio, M., & Hüllermeier, E. (2023). Is the volume of a credal set a good measure for epistemic uncertainty? In Proceedings of UAI, 39th conference on uncertainty in artificial intelligence.
Senge, R., Bösner, S., Dembczyński, K., Haasenritter, J., Hirsch, O., Donner-Banzhoff, N., & Hüllermeier, E. (2014). Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty. Information Sciences, 255, 16–29.
Article MathSciNet Google Scholar
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition.
Ulmer, D., Hardmeier, C., & Frellsen, J. (2023). Prior and posterior networks: A survey on evidential deep learning methods for uncertainty estimation. Transactions on Machine Learning Research.
Vaart, A. W. (2000). Asymptotic statistics. Cambridge: Cambridge University Press.
Google Scholar
Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J., & Schön, T. (2019). Evaluating model calibration in classification. In Proceedings of the twenty-second international conference on artificial intelligence and statistics.
Walley, P. (1991). Statistical reasoning with imprecise probabilities. London: Chapman and Hall.
Book Google Scholar
Wang, K., Shariatmadar, K., Manchingal, S. K., Cuzzolin, F., Moens, D., & Hallez, H. (2025). Creinns: credal-set interval neural networks for uncertainty estimation in classification tasks. In Neural networks.
Widmann, D., Lindsten, F., & Zachariah, D. (2019). Calibration tests in multi-class classification: A unifying framework. In Proceedings of NeurIPS, 32nd advances in neural information processing systems
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241–259.
Article Google Scholar
Zadrozny, B., & Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of ICML, 18th international conference on machine learning.
Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of ACM SIGKDD, 8th international conference on knowledge discovery and data mining.

Download references

Acknowledgements

M.J. and W.W. received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.

Author information

Authors and Affiliations

Department of Data Analysis and Mathematical Modeling, Ghent University, Coupure Links 653, 9000, Ghent, Belgium
Mira Jürgens & Willem Waegeman
Department of Environment, Ghent University, Coupure Links 653, 9000, Ghent, Belgium
Thomas Mortier
Institute of Informatics, LMU Munich, Akademiestr. 7, 80799, Munich, Germany
Eyke Hüllermeier & Viktor Bengs

Authors

Mira Jürgens
View author publications
Search author on:PubMed Google Scholar
Thomas Mortier
View author publications
Search author on:PubMed Google Scholar
Eyke Hüllermeier
View author publications
Search author on:PubMed Google Scholar
Viktor Bengs
View author publications
Search author on:PubMed Google Scholar
Willem Waegeman
View author publications
Search author on:PubMed Google Scholar

Contributions

MJ conducted the experiments and wrote the main manuscript. MJ, TM and WW conceived the experiments and analyzed the results. All authors helped with conceptualization and reviewing the manuscript.

Corresponding author

Correspondence to Mira Jürgens.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Editors: Concha Bielza, Ana Carolina Lorena, Tiago Almeida.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Estimator analysis

In this section we empirically analyse the distribution of the chosen calibration estimators and their behaviour with respect to the distance from the underlying ground truth probability distribution. Figure 5 shows the distribution of the estimators under the null hypothesis. In the figure, we see that the mean of the estimators $\widehat{\text {CE}}_{2}$ and $\widehat{\text {CE}}_{KL}$ is slightly above zero—since they are only asymptotically unbiased—while $\widehat{\text {CE}}_{MMD}$ and $\widehat{\text {CE}}_{k}$ also empirically show their unbiasedness. The kernel calibration estimator $\widehat{CE}_k$ is symmetrically distributed around zero—its asymptotic normality was also shown by Widmann et al. (2019). As the estimators $\widehat{CE}_k$ and $\widehat{CE}_{MMD}$ can obtain negative values, we use their squared versions for the optimization in Eq. (16).

Figure 6 shows the values of different calibration estimators in dependence of the position in the simplex, where the true underlying calibrated convex combination $f_{\varvec{\lambda }^*}$ is set with $\varvec{\lambda }^*=(0.1, 0.1, 0.8)$, and the predictions are sampled from a Dirichlet distribution. We see that by optimizing over $\varvec{\lambda }$, the true calibrated convex combination is learned differently “well enough”. However, $\varvec{\lambda }^*$ still lies within a certain region of low estimator values. For $\widehat{CE}_k$, we see in general more noisy behaviour in the region around the ground truth distribution (red dot).

We also do a similar analysis in $\lambda$-space: In Fig. 7, we show the value of the calibration estimators not in distance to the ground truth probability distribution, but the ground truth weight vector $\varvec{\lambda }\in \Delta _3$, given the predictions of two predictors $\{f^{(1)}(x), f^{(2)}(x), f^{(3)}(x)\}$ and a ground truth probability distribution $f_{\varvec{\lambda }^*}(x) = 0.1 \cdot f^{(1)}(x) + 0.1 \cdot f^{(2)}(x) + 0.8 \cdot f^{(3)}(x)$, that is $\varvec{\lambda }^*=(0.1,0,1,0.8)^T$. We see that here the heatmaps vary more significantly for the different calibration estimators.

Appendix B Non-uniqueness of calibrated models

Using simple examples, we will show the non-uniqueness of calibrated convex combinations, that is, that there are often many different (possibly instance-dependent) convex combinations leading to a calibrated predictor. Using a proposition proven by Vaicenavicius et al. (2019), which states that conditioning Y on any measurable function h yields a calibrated predictor, we show this with a concrete example.

Example 1

(Many calibrated classifiers) For the following, assume the case of having 3 classes, i.e. $\mathcal {Y}= \{1,2,3\}$. We have (Vaicenavicius et al. 2019, Proposition 1) for any measurable function $h: \mathcal {X} \mapsto \mathcal {Z}$ with $\mathcal {Z}$ being some measurable space, that the function

$$\begin{aligned} f(X):= \mathbb {P}[Y \in \cdot | h(X)] \end{aligned}$$

is a calibrated classifier. As an example, taking a constant $h(x) = c \in \mathbb {R} \, \forall x \in \mathcal {X}$, the classifier that simply predicts the marginal class probabilities

$$\begin{aligned} f \equiv \begin{pmatrix} \mathbb {P}(Y=1) \\ \mathbb {P}(Y=2) \\ \mathbb {P}(Y=3) \\ \end{pmatrix} \end{aligned}$$

is calibrated for $\mathcal {Y}= \{1, 2, 3 \}$. Further, if we take $h: \mathcal {X}\mapsto \mathcal {X}$ with $h(x)=x$, then

$$\begin{aligned} f' \equiv \begin{pmatrix} \mathbb {P}(Y=1|X) \\ \mathbb {P}(Y=2|X) \\ \mathbb {P}(Y=3|X) \\ \end{pmatrix} \end{aligned}$$

is also a calibrated classifier. To make this example more concrete, let $\mathcal {X}=\{1,2,3\}$ with $\mathbb {P}(X=x)=\frac{1}{3} \forall \, x \in \mathcal {X}$. Further let again $\mathcal {Y}=\{1,2,3\}$ and

$$\begin{aligned} \mathbb {P}(Y=i|X=j) = {\left\{ \begin{array}{ll} 1, \quad i=j \\ 0, \quad \text {else} \end{array}\right. } \end{aligned}$$

for $i, j \in \{1,2,3\}$. Then $\mathbb {P}(Y=i)= \sum _{j=1}^3 \mathbb {P}(Y=i|X=j)\mathbb {P}(X=j)= \frac{1}{3}$ for $i \in \{1,2,3\}$, and

$$\begin{aligned} f(x) \equiv \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}, \quad f'(x) = \textbf{e}_{x} \end{aligned}$$

where $\textbf{e}_{x}$ denotes the x-th unit vector, are both calibrated predictors.

Going from the case of one single classifier, similarly, one can also show that there usually exists various calibrated convex combinations for a set of classifiers. The following example illustrates this for the simple case of binary classification.

Example 2

Let us look at a very simple problem of binary classification with $\mathcal {Y} = \{1, 2 \}$ and $\mathcal {X}= \{-1,1\}$. Assume

$$\begin{aligned} \mathbb {P}(Y=i|X=x_j)= {\left\{ \begin{array}{ll} 1, \quad (i, x_j) \in \{(1, -1), (2,1)\} \\ 0, \quad \text {else}, \end{array}\right. } \end{aligned}$$

and $\mathbb {P}(X=x) = \frac{1}{2}, \quad x \in \mathcal {X}$. Then the two classifiers

$$\begin{aligned} f_1: \,&\mathcal {X} \rightarrow \Delta ^2 \\&x \mapsto \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \end{pmatrix}, \quad x \in \{-1,1\} \end{aligned}$$

and

$$\begin{aligned} f_2: \,&\mathcal {X} \rightarrow \Delta ^2 \\&x \mapsto {\left\{ \begin{array}{ll} \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad x = -1 \\ \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad x = 1 \end{array}\right. } \end{aligned}$$

are both calibrated (which can easily be seen by Proposition 1 of Vaicenavicius et al. (2019) and conditioning on $h_1(x)=c$ and $h_2(x)=x$). Further, they can be written as convex combinations of the two classifiers $m_1(x) = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \forall x$ and $m_2(x) = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \forall x$ using the two different convex combinations

$$\begin{aligned} \lambda _1:\,&\mathcal {X} \rightarrow \Delta ^2 \\&x \mapsto {\left\{ \begin{array}{ll} \begin{pmatrix} 0 \\ 1 \end{pmatrix}, x = -1 \\ \begin{pmatrix} 1 \\ 0 \end{pmatrix}, x = 1 \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} \lambda _2: \,&\mathcal {X} \rightarrow \Delta ^2 \\&x \mapsto \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \end{pmatrix}, x \in \{-1,1\}. \end{aligned}$$

Appendix C Validity of the proposed algorithm

In the following, we prove the asymptotic validity of the derived test.

Theorem 1

(Asymptotic validity of Algorithm 1) Let $\mathcal {D}_{opt} =\{(x_i, y_i)\}_{i=1}^{n}$ and $\mathcal {D}_{val} = \{(x_i', y_i')\}_{i=1}^{m}$ be i.i.d. data generated from the underlying distribution $\mathbb {P}_{X,Y}$. Denote by

$$\begin{aligned} \hat{\varvec{\lambda }}_{n}\;\in \; \arg \min _{\Lambda } \frac{1}{n} \; \sum _{i=1}^n \ell (f_{\Lambda }(x_i),y_i) +\gamma \,\widehat{g}_{n}(f_{\Lambda },\mathcal {D}_{opt}) \end{aligned}$$

the weight function that is the risk minimizer of the empirical version of Eq. (16) evaluated at $\mathcal {D}_{opt}$, where $\ell$ is the (strictly) proper Brier score or log loss and $\widehat{g}_n$ is a consistent, differentiable and (asymptotically) unbiased estimator of one of the calibration errors in Sect. 2. Define $t_{m,n} := \hat{g}_m(f_{\hat{\varvec{\lambda }}_n}, \mathcal {D}_{val})$ the test statistic evaluated on the combined predictor, and let $c_{m,D,\alpha }=\hat{F}^{-1}_{m,D}(1-\alpha )$ be the $(1-\alpha )$-quantile of the empirical cumulative distribution function $\hat{F}_{m,D}$ of the bootstrap replicates $t_{0,1},\dots ,t_{0,D}$ generated by the consistency–resampling scheme of Vaicenavicius et al. (2019) (line $5-19$ of Algorithm 1). Furthermore, assume that the set of predictors $\mathcal {F}=\{f^{(1)}, \dots , f^{(M)}\}$ is such that the mapping $\varvec{\lambda } \mapsto f_{\varvec{\lambda }}$ is continuous over $\mathcal {X}$ and for the respective calibration function g all regularity conditions are fulfilled such that f is calibrated if and only if $\text {g} = 0$. Then Algorithm 1 rejects $H_0$ whenever $t_{m,n}\ge c_{m,D,\alpha }$. Then, under $H_0$,

$$\begin{aligned} \lim _{\begin{array}{c} m,n\rightarrow \infty \\ D\rightarrow \infty \end{array}} \mathbb {P}( t_{m,n}\ge c_{m,D,\alpha }|H_0 ) \;=\;\alpha . \end{aligned}$$

Proof

In the following, we assume that the null hypothesis is true, i.e., there exists (at least) one underlying function $\varvec{\lambda }$ such that the combined predictor $f_{\varvec{\lambda }}$ is calibrated.

1.
Consistency: Because $\ell$ is strictly proper, the population level risk in Eq. (16) is uniquely minimised by some $\varvec{\lambda }^\star \!\in \!\Delta _{M,\mathcal {X}}$ with respective combined predictor $f_{\varvec{\lambda }^\star }$ . As the expectation of $\ell$ can be decomposed into a calibration and refinement term, and $g(f) = 0$ for any calibrated predictor f, we have that $g(f_{\varvec{\lambda }^\star })=0$.

By the arg-min consistency theorem (Thm. 5.7 of van der Vaart (2000)) it holds that the empirical risk minimizer converges to the population level one, i.e. $\Vert \hat{\varvec{\lambda }}_{n}-\varvec{\lambda }^\star \Vert _{L_1} \rightarrow 0$ as $n\!\rightarrow \!\infty$, and hence we have $\Vert f_{\hat{\varvec{\lambda }}_{n}}-f_{\varvec{\lambda }^*}\Vert _{L_1}\rightarrow 0.$
2.
Asymptotic distribution: Under the null hypothesis, for each of the calibration errors we have $g(f_{\varvec{\lambda }^*})=0$. Further, taking the asymptotic behavior of U-statistics (van der Vaart, 2000) into account, we have $\sqrt{m}\,\hat{g}_m(f_{\varvec{\lambda }^*}, \mathcal {D}_{val}) \rightarrow Z$, $m\rightarrow \infty$, for some random variable with limiting distribution $F_Z$. By Rubin’s theorem (see Theorem 7.9 in Dümbgen (2017)) it holds that $\sqrt{m}\, t_{m,n} \rightarrow Z$, $m,n\rightarrow \infty$.
3.
Bootstrap validity: For sake of simplicity, set $D=m$. Under $H_0$, the consistency-bootstrap (line 5 to 19 of Algorithm 1, Vaicenavicius et al. (2019)) guarantees that conditionally on the validation data set it holds that $\sqrt{m} \, c_{m,D,\alpha }\rightarrow F_Z^{-1}(1-\alpha ).$ This is because $\sup _t |\hat{F}_{m,D}^{-1}(t)-F_{t_{m,n}}^{-1}(t)|\rightarrow 0$ (compare to Section 23.2.1 in van der Vaart (2000)).
4.
Control of Type 1 error: The convergence of the bootstrap quantile against the quantile of the limiting distribution implies $\mathbb {I}\{t_{m,n}\ge c_{m,D,\alpha }\}\rightarrow \mathbb {I}\{Z\ge F_Z^{-1}(1-\alpha )\}.$ Taking expectations yields: $\mathbb {P}(\text {reject}| H_0) = \mathbb {P}(t_{m,n} \ge c_{m.B. \alpha }) \rightarrow 1-F_Z(F_Z^{-1}(1-\alpha ))=\alpha .$

$\square$

Appendix D Uniform sampling in the credal set

In the work of Mortier et al. (2023), a sampling within the polytope of convex combinations of predictions is done by sampling weights $\varvec{\lambda } \in \Delta _M\sim Dir(1, \dots , 1)$. This however does in general not lead to a uniform sampling within the polytope, as is exemplary shown in Fig. 8. As soon as we have more than $M=3$ predictors, the sampled predictions start “accumulating” in the centre of the simplex, due to the fact that there are more possible combinations in the centre than for the regions in the boundary. Possible ways to achieve uniform sampling is by using triangulization methods, rejection sampling or Markov Chain Monte Carlo sampling methods, however, these methods come with increased computational effort.

Appendix E Experimental setup

In this section we explain the experimental setup for the experiments conducted in Sect. 4.

1.1 E.1 Synthetic data

For optimising the weights $\varvec{\lambda }$ in the synthetic experiments, we use an MLP with 3 hidden layers each consisting of 16 neurons that is trained on an optimization dataset of size $N_{opt}=400$. Both tests (the previous one proposed by Mortier et al. and ours) are performed on a validation set set of size $N_{val}=400$ (Fig. 9).

For the case of binary classification, we draw $\{x_i\}_{i=1}^N \sim U([0,5])$, and generate predictions from two probabilistic predictors $\{f^{(1)}(x), f^{(2)}(x)\}$ by sampling from a Gaussian process with an rbf kernel, whose outputs are constrained to output probabilities [0, 1], using min-max sclaing. Under $H_0$, we generate the probability for class 1, $f^*(x) = \lambda ^*\,f^{(1)}(x) + (1-\lambda ^*)\,f^{(2)}(x)$ with $\lambda ^*$ generated as

$\varvec{H_{0,1}}$: $\lambda ^*$ is a constant and sampled as $\lambda ^* \sim U(0,1)$,
$\varvec{H_{0,2}}$: $\lambda ^*: \mathcal {X}\rightarrow [0,1]$ is a polynomial of degree D. In the experiments, we set $D=2$.

Under $H_1$, $f^*(x)$ lies outside the credal set with

$\varvec{H_{1,1}}$: $f^*(x)$ lies at an $\epsilon$-distance to one of the boundaries of the credal set. We sample $\epsilon \sim U([0,0.02])$.
$\varvec{H_{1,2}}-\varvec{H_{1,3}}$ reflect increasing distance to the boundary by sampling a new GP that intentionally remains outside the credal set.

For the multi-class classification case, we again draw $\{x_i\}_{i=1}^N \sim U([0,5])$ and generate the set of probabilistic predictors $f^{(1)}, \dots , f^{(M)}\}$ as follows: For each $x_i$, draw a prior $\varvec{p}\in \mathbb {R}^K \sim Dir(1, \dots , 1)$, then sample $f^{(m)}(x_i) \sim Dir(\frac{\varvec{p} \cdot K}{0.5})$. Under $H_0$, the weight function $\varvec{\lambda }^*$ is generated as

$\varvec{H_{0,1}}$: $\varvec{\lambda }^*\sim Dir(1, \dots , 1)$
$\varvec{H_{0,2}}$: $\varvec{\lambda }^*=(\lambda _1^*, \dots , \lambda _M^*)$ with $\lambda _m^* : \mathcal {X} \rightarrow [0,1]$; the components of $\varvec{\lambda }^*$ are randomly generated and scaled polynomials of degree $D=2$.

Under $H_1$, $f^c(x)$ is randomly chosen as one of the corners of the simplex, outside of the convex set. $f^*(x)$ is then given by the point prediction on the line segment between $f^c(x)$ and the boundary point $f^b(x)$ in the credal set, $f^*(x)(x) = \delta f^c(x) + (1-\delta )\,f^b(x)$, with

$\varvec{H_{0,1}}$: $\delta =0.01$,
$\varvec{H_{0,2}}$: $\delta =0.1$,
$\varvec{H_{0,3}}$: $\delta =0.2$.

In both settings, the labels $\{(y_i)\}_{i=1}^N$ are then sampled from the resulting categorical distribution with parameter $f^*(x_i)$.

1.2 E.2 Real data

Here we provide details on the data preprocessing, model architectures, hyper-parameters, and calibration-test implementation used in the real-world experiments.

Data splitting: We split the data into training set, (used to train the ensembles), validation and test set. The validation set together with the predictions of the models are then used to optimise for $\varvec{\lambda }^*$, and the test is performed on the test data.

CIFAR-10: 50000 training, 5000 validation and 5000 test samples,
CIFAR-100: 60000 training, 5000 validation and 5000 test samples.

Models: We use the following training parameters to train the deep ensemble and dropout models, in order to obtain the credal set of predictions:

adam optimizer with $\text {lr}=0.001$,
loss function: cross-entropy (log loss)
batch size: 128
number of epochs: 50
early stopping: monitor validation loss, stop if does not decrease for $\text {patience}=10$ epochs.

Optimization of weights: We concatenate a small MLP consisting of 3 hidden layers and 32 neurons to the same VGG-19/ResNet-18 architecture that has been used for learning the prediction sets. For optimization, we use the combined loss as in (16) with

log loss and $\widehat{\text {CE}}_{KL}$ with $\gamma =0.01$ for testing $\text {CE}_{KL}$
Brier score and $\widehat{\text {CE}}_2$ with $\gamma =0.01$ for testing $\text {CE}_2$.

As optimisation parameters, we use a learning rate $\text {lr}=0.0001$, a $\text {batch size} = 256$, the adam optimizer and train the neural network for 200 epochs.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Jürgens, M., Mortier, T., Hüllermeier, E. et al. A calibration test for evaluating set-based epistemic uncertainty representations. Mach Learn 114, 202 (2025). https://doi.org/10.1007/s10994-025-06844-8

Download citation

Received: 09 June 2025
Revised: 14 July 2025
Accepted: 15 July 2025
Published: 05 August 2025
Version of record: 05 August 2025
DOI: https://doi.org/10.1007/s10994-025-06844-8

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Part of a collection:

A calibration test for evaluating set-based epistemic uncertainty representations

Abstract

Similar content being viewed by others

On the Calibration of Epistemic Uncertainty: Principles, Paradoxes and Conflictual Loss

Estimating Expected Calibration Errors

Diversified Ensemble of Independent Sub-networks for Robust Self-supervised Representation Learning

Explore related subjects

1 Introduction

2 Calibration of first-order predictors

2.1 General setting

Definition 1

2.2 Overview of calibration errors and estimators

2.3 Statistical tests for calibration

3 Calibration of sets of probabilistic predictors

3.1 Credal sets

Definition 2

3.2 Validity and calibration for sets of predictions

Definition 3

Definition 4

3.3 A novel calibration test for sets of predictors

3.4 Algorithmic procedure

3.5 A robust way of finding the most calibrated convex combination

4 Experimental results

4.1 Binary classification

4.2 Multi-class classification

4.3 Experiments on real-world data

5 Discussion

Data availability

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendices

Appendix A Estimator analysis

Appendix B Non-uniqueness of calibrated models

Example 1

Example 2

Appendix C Validity of the proposed algorithm

Theorem 1

Proof

Appendix D Uniform sampling in the credal set

Appendix E Experimental setup

1.1 E.1 Synthetic data

1.2 E.2 Real data

Rights and permissions

About this article

Cite this article

Share this article

Keywords