1 Introduction

Fig. 1

We generate virtual outliers directly by perturbing ID samples, e.g., by placing a cat’s face on a dog’s face or masking most of a deer, leaving only the tail. These virtual outliers therefore retain ID features, so we adopt a more nuanced label assignment strategy (see Sect. 4.3). This is more in line with human perception: a young child who knows a cat would judge a tiger to be more like a cat than a fish

Deep neural networks (DNNs), benefiting from the abundance of large-scale labeled samples, have achieved remarkable success across diverse fields. These models, however, are conventionally trained and evaluated under the assumption of a closed-set environment, where the label space remains consistent throughout the training and testing stages (Huang et al., 2017). This assumption often falls short in real-world applications, where samples from unseen classes can appear unexpectedly. This discrepancy has led to a surge of interest in out-of-distribution (OOD) detection (Bendale & Boult, 2016), a task that requires models to accurately classify in-distribution (ID) samples while effectively identifying OOD samples.

The seminal work proposes a score function for detecting OOD samples (Hendrycks & Gimpel, 2017), where the maximum softmax probability is leveraged as an indicator for OOD detection. In this context, samples with a low maximum softmax probability are considered OOD samples. However, this approach is limited by the tendency of DNNs to produce overly confident predictions for OOD samples (Nguyen et al., 2015; Hein et al., 2019). This overconfidence stems from the intrinsic similarity in patterns between OOD and ID samples, which can make ID and OOD samples indistinguishable even with well-designed scoring functions (Lee et al., 2018; Liu et al., 2020).

To mitigate the overconfidence issue, subsequent works propose to incorporate auxiliary OOD samples during training (Hendrycks et al., 2019; Wang et al., 2023), where the introduced outliers are assigned an equal likelihood of belonging to any ID class. Consequently, DNNs trained with these outliers are expected to produce low confidence on OOD samples. The performance of this approach is therefore inherently tied to the selection of outliers. However, identifying outliers that share patterns with ID samples poses a significant challenge.

To address this challenge, in this paper we introduce a novel approach, termed virtual outlier smoothing (VOSo). Instead of searching for additional OOD samples, VOSo constructs auxiliary outliers directly from ID samples, thus endowing the auxiliary outliers with patterns similar to those of ID samples and removing the need for an exhaustive search for natural OOD samples. A similar idea appears in VOS (Du et al., 2022) and NPOS (Tao et al., 2023), which obtain virtual outliers by sampling around boundary ID points in the feature space. Although compelling, this feature space depends heavily on the ID samples. For instance, in a binary classification task on cats and tables, the two classes are so different that the network can easily separate them, so the boundary between cats and tables does not really reflect the boundary between cats and non-cats. Features sampled around the cat class boundary are therefore not good OOD samples for cats. In contrast, VOSo mitigates this issue by perturbing ID samples in the image space. Specifically, VOSo constructs virtual OOD samples by perturbing the semantic regions of ID samples and infusing patterns from other ID samples. For instance, a virtual outlier could be an image combining a cat’s face with a dog’s nose, where the cat’s face serves as the primary semantic feature for model prediction. To efficiently locate semantically relevant regions in the images, we use Class Activation Maps (CAMs) (Zhou et al., 2016), a visualization technique that measures the contribution of different image regions to the model prediction. During training, we randomly select prediction-related regions of different sizes on the input images for perturbation.

Existing outlier exposure methods (Hendrycks et al., 2019; Tao et al., 2023) typically treat all auxiliary outliers equally and assign them an equal likelihood of belonging to every ID class, i.e., their labels are uniformly distributed over the ID classes. Unlike these methods, VOSo assigns different labels to the constructed virtual outliers. Because VOSo constructs virtual outliers from ID samples, the outliers retain patterns similar to those of ID samples. Consequently, an intuitive approach is to assign these virtual outliers labels that reflect the remaining ID patterns. In this way, the boundary between ID and OOD transitions more smoothly, as shown in Fig. 2. Specifically, we introduce a dynamic label assignment method in which the labels of the virtual OOD samples are adjusted based on the extent of semantic region perturbation. This reflects the fact that virtual outliers may still contain patterns similar to ID samples and thus require a nuanced labeling strategy, which is consistent with human intuition, as shown in Fig. 1.

Fig. 2

NPOS and VOS treat all virtual outliers equally, while VOSo assigns different soft labels to virtual outliers depending on the degree of perturbation. The color gradient from blue to red indicates the label assigned to each virtual outlier: the smaller the perturbation, the closer the label of the virtual outlier is to the original ID label (Color figure online)

Our main contributions are listed as follows:

  • We propose virtual outlier smoothing (VOSo), a method that creates auxiliary outliers resembling ID samples by perturbing the semantic regions of ID samples.

  • The constructed virtual outliers provide a novel direction for using outliers to smooth model prediction, where the labels assigned to virtual outliers are related to the extent of semantic region perturbation.

  • Extensive experiments show that the proposed VOSo strategy greatly improves the OOD uncertainty estimation, and an ablation study is conducted to understand the efficacy of VOSo.

2 Related Work

OOD detection. The goal of OOD detection is to enable the model to distinguish between ID and OOD samples while maintaining classification accuracy on ID samples. Many works try to mitigate the overconfidence of neural networks by designing different scoring functions, such as the maximum softmax probability (Hendrycks & Gimpel, 2017), the energy score (Liu et al., 2020), ReAct (Sun et al., 2021) and the GradNorm score (Huang et al., 2021). Despite their simplicity and convenience, these methods are essentially post-hoc fixes: they may increase detection time, and the effectiveness of a given scoring function can vary across scenarios, sometimes requiring manual selection in practical applications. Another promising line of work improves OOD detection through training-time regularization (Hendrycks et al., 2019; Tao et al., 2023; Wei et al., 2022; Wang et al., 2023; Du et al., 2022; Pinto et al., 2022). One group directly constrains the training process on ID data. For example, LogitNorm (Wei et al., 2022) enforces a constant vector norm on the logits during training, and RegMixup (Pinto et al., 2022) uses Mixup (Zhang et al., 2018) as an additional regularizer on top of the standard cross-entropy loss. Another group uses outliers to constrain the training process, regularizing the model to produce lower confidence on them. These outliers can be additional manually collected surrogate OOD data (Hendrycks et al., 2019; Wang et al., 2023) or virtual outliers generated by the model (Tao et al., 2023; Du et al., 2022; Kong & Ramanan, 2021). VOS (Du et al., 2022) and NPOS (Tao et al., 2023) synthesize virtual outliers from the low-likelihood region of the ID feature space, while OpenGAN (Kong & Ramanan, 2021) trains a GAN in the classifier’s feature space to generate virtual outliers. However, sampling virtual outliers in feature space relies on the quality of the feature space, and generative models often exhibit training instability. In contrast, VOSo circumvents these drawbacks by directly perturbing ID samples in pixel space to obtain virtual outliers (Zhang et al., 2022). Beyond these, reconstruction-based methods (Yang et al., 2022; Perera et al., 2020; Zhou, 2022; Gao et al., 2023) also rely on generative models. They assume that a generative model trained on ID samples cannot reconstruct OOD samples with high quality, but this assumption may not always hold (Nalisnick et al., 2019). For this reason, MOODCat (Yang et al., 2022) and DiffGuard (Gao et al., 2023) further use conditional synthesis to mitigate this problem. However, reconstructing samples with generative models at test time increases detection time.

Confidence calibration. Many previous works have shown that neural networks tend to be overconfident in their predictions (Hein et al., 2019; Nguyen et al., 2015). Some works address this problem with post-hoc methods such as temperature scaling (Platt et al., 1999). In addition, some methods focus on regularizing the model, for example through weight decay (Guo et al., 2017) or label smoothing (Szegedy et al., 2016).

Label smoothing. Label smoothing generates soft labels by taking a weighted average of the uniform distribution and the hard label. It has been shown to improve model calibration (Pereyra et al., 2017; Müller et al., 2019; Lukasik et al., 2020). Müller et al. (2019) show that label smoothing encourages the penultimate-layer features of training examples from the same class to form tight clusters. Lukasik et al. (2020) show that label smoothing can mitigate label noise. Yuan et al. (2020) show that part of knowledge distillation’s success derives from its soft labels acting as a regularizer, much as label smoothing does. Yuan et al. (2023) study the effectiveness of biased soft labels in knowledge distillation and weakly-supervised learning. In this paper, we assign soft labels to virtual outliers.

3 Preliminaries

Let \(\mathcal {X}\) and \(\mathcal {Y}=\{1,\ldots ,K\}\) represent the input image space and ID label space, respectively. Here, K represents the number of classes of ID samples. We consider the ID distribution \(D_{X_{\textrm{I}}Y_{\textrm{I}}}\) as a joint distribution defined over \(\mathcal {X} \times \mathcal {Y}\), where \(X_{\textrm{I}}\) and \(Y_{\textrm{I}}\) are random variables whose outputs are from spaces \(\mathcal {X}\) and \(\mathcal {Y}\). During testing time, there are some OOD joint distributions \(D_{X_\textrm{O}Y_\textrm{O}}\) defined over \(\mathcal {X}^C \times \mathcal {Y}^C\), where \(X_\textrm{O}\) and \(Y_\textrm{O}\) are random variables whose outputs are from semantic-shifted space \(\mathcal {X}^C\) and label-shifted space \(\mathcal {Y}^C\). Here, \(\mathcal {X}^C\) represents OOD data space in which the data may come from classes that are unknown during training, and \(\mathcal {Y}^C\) represents OOD label space with \(\mathcal {Y}^C \cap \mathcal {Y} = \emptyset \). Then following (Fang et al., 2022), OOD detection can be defined as follows:

Problem 1

(OOD Detection) Given the labeled ID samples

$$\begin{aligned} S_{I} = \{(\textbf{x}^1,\textbf{y}^1),\ldots ,(\textbf{x}^n,\textbf{y}^n) \} \sim D_{X_{\textrm{I}}Y_{\textrm{I}}}^n,~i.i.d., \end{aligned}$$

OOD detection aims to learn an OOD detector G using \(S_{I}\) such that for any test sample \(\textbf{x}\):

  • if \(\textbf{x}\) is drawn from \(D_{X_\textrm{I}}\), then G can classify \(\textbf{x}\) into correct ID classes;

  • if \(\textbf{x}\) is drawn from \(D_{X_\textrm{O}}\), then G can detect \(\textbf{x}\) as OOD sample.

Note that in Problem 1, we use the one-hot vector to represent the label \(\textbf{y}\).

Model and loss function. In this work, we utilize \(\textbf{f}_{\varvec{\theta }}(\cdot )\) to represent the deep neural network-based model with parameters \(\varvec{\theta }\in \varvec{\Theta }\), where \(\varvec{\Theta }\) denotes the parameter space. We also set \(\ell (\cdot ,\cdot ): \mathbb {R}^K \times \mathbb {R}^K \rightarrow \mathbb {R}_+\) to be the loss function.

Score-based OOD detection strategy. Many representative OOD detection methods (Hendrycks & Gimpel, 2017; Liang et al., 2018; Liu et al., 2020) follow a score-based strategy, i.e., given a model \(\textbf{f}_{\varvec{\theta }}\) trained using \(S_I\), a scoring function \(S(\cdot )\) and a threshold \(\tau \), then \(\textbf{x}\) is detected as ID data iff \(S(\textbf{x};\textbf{f}_{\varvec{\theta }})\ge \tau \), i.e.,

$$\begin{aligned} \begin{aligned} G_{\tau }(\textbf{x}) = \text {ID},~&\text {if}~S(\textbf{x};\textbf{f}_{\varvec{\theta }}) \ge \tau ;\\ G_{\tau }(\textbf{x}) = \text {OOD},~&\text {if}~S(\textbf{x};\textbf{f}_{\varvec{\theta }}) < \tau . \end{aligned} \end{aligned}$$
(1)

In this paper, we use maximum softmax probability (MSP) (Hendrycks & Gimpel, 2017) as the scoring function to design our OOD detector, i.e.,

$$\begin{aligned} S_\textrm{MSP}(\textbf{x};\textbf{f}_{\varvec{\theta }}) = \max _{k\in \mathcal {Y}} \ \textrm{softmax}_k (\textbf{f}_{\varvec{\theta }}(\textbf{x})), \end{aligned}$$
(2)

where \(\textrm{softmax}_k (\textbf{f}_{\varvec{\theta }}(\textbf{x}))\) is the k-th coordinate function of \(\textrm{softmax} (\textbf{f}_{\varvec{\theta }}(\textbf{x}))\).
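For concreteness, a minimal PyTorch-style sketch of the MSP score in Eq. (2) and the detector \(G_{\tau }\) in Eq. (1) is given below; the function names are illustrative and not taken from any released implementation.

```python
import torch
import torch.nn.functional as F

def msp_score(f_theta, x):
    """Maximum softmax probability (Eq. (2)): max_k softmax_k(f_theta(x))."""
    with torch.no_grad():
        logits = f_theta(x)                   # shape [batch, K]
        probs = F.softmax(logits, dim=-1)
    return probs.max(dim=-1).values           # shape [batch]

def detect(f_theta, x, tau):
    """Score-based detector G_tau in Eq. (1): a sample is ID iff S(x; f_theta) >= tau."""
    return msp_score(f_theta, x) >= tau       # True -> ID, False -> OOD
```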

Classical training strategy. In the majority of score-based strategies, researchers primarily emphasize the design of effective scoring functions, aiming to extract the detection potential of \(\textbf{f}_{\varvec{\theta }}(\cdot )\). In general, the model \(\textbf{f}_{\varvec{\theta }}\) is trained based on a unified learning strategy—empirical risk minimization (ERM) principle, i.e.,

$$\begin{aligned} \min _{\varvec{\theta }\in \varvec{\Theta }} \frac{1}{n} \sum _{i=1}^n \ell (\textbf{f}_{\varvec{\theta }}(\textbf{x}^i), \textbf{y}^i). \end{aligned}$$
(3)

In this work, our primary focus is to design a more effective training strategy that enhances the separation of ID and OOD samples for model \(\textbf{f}_{\varvec{\theta }}\).

4 Methodology

In this section, we mainly introduce our proposed method, virtual outlier smoothing (VOSo).

4.1 Outlier Exposure

Deep neural networks (DNNs) trained using cross-entropy loss tend to exhibit overconfidence when encountering OOD samples. To this end, outlier exposure (OE) (Hendrycks et al., 2019) proposes to regularize the training process by introducing an auxiliary OOD distribution \(D_{X_OY_O}^a\), which makes it possible to lower the prediction confidence on OOD samples. Specifically, to mitigate the overconfidence issue, OE mainly focuses on the following learning objective:

$$\begin{aligned} \begin{aligned} \mathcal {L}(\textbf{f}_{\varvec{\theta }})&= \mathbb {E}_{(\textbf{x},\textbf{y})\sim D_{X_IY_I}} \ell _\textrm{ce}(\textbf{f}_{\varvec{\theta }}(\textbf{x}), \textbf{y}) \\ {}&\quad + \ \lambda \mathbb {E}_{(\textbf{x},\textbf{y})\sim D_{X_OY_O}^a} \ell _\textrm{oe}(\textbf{f}_{\varvec{\theta }}(\textbf{x})), \end{aligned} \end{aligned}$$

where \(\ell _\textrm{ce}(\cdot ,\cdot )\) is the cross-entropy loss, \(\lambda \) is a hyperparameter, and \(\ell _\textrm{oe}(\cdot )\) represents the Kullback–Leibler divergence to the uniform distribution, i.e., \(-\sum _k \log {\text {softmax}}_k \textbf{f}_{\varvec{\theta }}(\textbf{x}) / K\). This means that the model is encouraged to produce low confidence on these auxiliary OOD samples. Thanks to the integration of auxiliary outliers during training, OE usually achieves reliable detection performance on the auxiliary OOD distribution.
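A minimal sketch of this objective, assuming a batch of ID images with hard labels and a batch of auxiliary outliers, is shown below; the variable names and the weighting \(\lambda \) are placeholders rather than the settings used in the cited works.

```python
import torch.nn.functional as F

def oe_loss(f_theta, x_id, y_id, x_out, lam=0.5):
    """Outlier-exposure objective: cross-entropy on ID samples plus a term that
    pushes outlier predictions toward the uniform distribution over K classes."""
    loss_id = F.cross_entropy(f_theta(x_id), y_id)

    log_probs_out = F.log_softmax(f_theta(x_out), dim=-1)
    # Cross-entropy to the uniform distribution: -(1/K) * sum_k log softmax_k
    loss_oe = -log_probs_out.mean(dim=-1).mean()
    return loss_id + lam * loss_oe
```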

4.2 Virtual Outlier Construction

OE is a promising approach to improve OOD detection. Despite its success, a significant limitation is that true OOD samples may deviate from the auxiliary OOD samples, leading to overconfident judgments when OOD samples bear resemblance to ID samples. To address this problem, one approach is to construct virtual outliers similar to ID samples directly from ID samples. The principle of constructing virtual outliers is simple but effective. Previous work (Du et al., 2022; Tao et al., 2023) obtains virtual outliers by sampling around marginal ID features in the feature space. However, as stated before, this approach relies on the quality of the feature space. To avoid this drawback, we propose a simple alternative: injecting an anomalous pattern (a randomly selected part of another image, or zeros) into the ID sample, which introduces little additional computational overhead. However, naive injection may not be effective; e.g., injecting the background region of one image into the background region of another does not build a valid OOD sample. To construct an effective OOD sample, we need to inject the anomalous pattern into the semantic region of the ID sample, so that the ID attribute is broken.

Therefore, the construction of outliers involves two steps: i) locating semantic regions and ii) injecting patterns. Thanks to the development of DNN visualization techniques, we can locate semantic regions in the image space through Class Activation Maps (CAMs), a weakly-supervised localization method that identifies discriminative regions, as shown in Fig. 3. Specifically, locating semantic regions can be formulated as:

$$\begin{aligned} M(\textbf{x};t)[i,j] = {\left\{ \begin{array}{ll}0, &{} \text {if } C_\textbf{x}[i,j] \ge t \\ 1, &{} \text {otherwise.}\end{array}\right. } \end{aligned}$$
(4)

where \(M(\textbf{x};t)\) is the mask of an input image \(\textbf{x}\), \(C_\textbf{x}\) is the CAM of \(\textbf{x}\), and \(t \in [0,1]\) denotes the threshold used to transform the CAM into a binary mask. Here, the CAM is re-scaled into the range [0, 1]. In each iteration, we sample the threshold t from a Beta distribution, i.e., \(t \sim \textit{Beta}(\alpha , \beta )\). Since t controls the size of the perturbed region, this sampling produces perturbed regions of varying sizes.

Fig. 3

CAM on CIFAR and the generated masked or mixed outliers

Various patterns can be injected into the located regions (image patches). For instance, one ideal approach is to inject semantic regions from images with different labels, e.g., merging a cat’s nose with a dog’s head. An on-the-fly approach is to randomly sample patches from different images. The simplest approach is to inject zeros or noise sampled from a Gaussian distribution, sharing the same spirit as random erasing (Zhong et al., 2020). This can be formulated as:

$$\begin{aligned} \tilde{\textbf{x}}(\textbf{z}) = M(\textbf{x},t) \odot \textbf{x} + (1 - M(\textbf{x},t)) \odot \textbf{z}, \end{aligned}$$
(5)

where \(\tilde{\textbf{x}}(\textbf{z})\) is the constructed virtual outlier, \(M(\textbf{x},t)\) stands for the mask indicating the semantic region, and \(\textbf{z}\) represents the injected patterns.
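The two steps in Eqs. (4)–(5) can be sketched as follows, assuming the CAM has already been computed, re-scaled to [0, 1], and is broadcastable to the image shape; the helper names are illustrative.

```python
import torch

def semantic_mask(cam, t):
    """Eq. (4): the mask is 0 on semantic regions (CAM >= t) and 1 elsewhere."""
    return (cam < t).float()

def make_virtual_outlier(x, cam, t, z):
    """Eq. (5): keep the non-semantic regions of x and inject the pattern z
    into the semantic region; z can be zeros, noise, or (a mix of) other ID images."""
    m = semantic_mask(cam, t)
    return m * x + (1.0 - m) * z

# Example: a masked (z_2-type) outlier with the threshold drawn from a Beta distribution.
# t = torch.distributions.Beta(alpha, beta).sample()
# x_tilde = make_virtual_outlier(x, cam, t, torch.zeros_like(x))
```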

4.3 Virtual Outlier Smoothing

In existing OE approaches, all OOD samples are assigned the same label, which can be inappropriate. For example, both tigers and fish can serve as OOD classes for cats, but in the feature space tigers should be closer to cats than fish, so it is not reasonable to assign the same OOD label to tigers and fish. In VOSo, virtual outliers are obtained by perturbing ID samples, so we should likewise assign appropriate OOD labels to virtual outliers according to the injected pattern. To this end, we propose a dynamic label assignment method.

In VOSo, the label of a virtual outlier varies with both the injected pattern \(\textbf{z}\) and the size of the masked region. Specifically, given the injected pattern \(\textbf{z}\), the label \({\varvec{\tilde{{\textbf {y}}}}}\) of the outlier is a function of the mask \(M(\textbf{x}, t)\) and the label \(\textbf{y}\),

$$\begin{aligned} {{\varvec{\tilde{{\textbf {y}}}}}} (\textbf{x}, \textbf{z}) = \phi _{\textbf{z}}(M(\textbf{x}, t), \textbf{y}), \end{aligned}$$
(6)

where \(\phi _{\textbf{z}}\) is the function required to be designed according to the type of injected pattern \(\textbf{z}\).

Given an image \(\textbf{x}\), we introduce three types of patterns \(\textbf{z}\) to enforce the model prediction to vary with the patterns appearing in the image. The simplest case is to set \(\textbf{z}_1=\textbf{x}\), in which case \({{\varvec{\tilde{{\textbf {y}}}}}}=\textbf{y}\): DNNs should predict the one-hot label for complete ID patterns. Meanwhile, we set \(\textbf{z}_2=\textbf{0}\), where the label is soft and varies with the size of the perturbed region, captured by \(\epsilon (t)\) and controlled by the threshold t in Eq. (4). This can be formulated as follows:

$$\begin{aligned} {\varvec{\tilde{{\textbf {y}}}}(\textbf{x},\textbf{z}_2)} = (1-\epsilon (t)) \cdot \textbf{y} + \epsilon (t) \cdot \textbf{u}, \end{aligned}$$
(7)

where \(\textbf{u} = [1/K,\ldots ,1/K]\in \mathbb {R}^{K}\) and the function \(\epsilon (t)\) is designed as:

$$\begin{aligned} \epsilon (t) = 1 - \exp ((t - 1) / T). \end{aligned}$$
(8)

Here, we introduce the temperature coefficient T. The most complex pattern is set as follows:

$$\begin{aligned} \textbf{z}_3 = \lambda \textbf{x} + (1 - \lambda ) \textbf{x}', \end{aligned}$$
(9)

where \(\lambda \) is drawn from \(\textrm{Beta}(\alpha ,\beta )\) and \(\textbf{x}'\) is a randomly selected ID sample. Accordingly, the label is designed as follows:

$$\begin{aligned} {\varvec{\tilde{{\textbf {y}}}}}(\textbf{x},\textbf{z}_{3}) = (1-\epsilon (t)) \cdot \textbf{y} + \epsilon (t) (\lambda \textbf{y} + (1 - \lambda ) \textbf{y}'), \end{aligned}$$
(10)

where \(\textbf{y}'\) is the label of the input \(\textbf{x}'\). As shown in Fig. 4, the most complex patterns are good outliers, because they are mainly in the low-likelihood region of the ID sample space.
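A sketch of this label assignment, using \(\epsilon (t)\) from Eq. (8) and one-hot label vectors, is given below; the helper names are illustrative.

```python
import torch

def epsilon(t, T):
    """Eq. (8): perturbation weight epsilon(t) = 1 - exp((t - 1) / T)."""
    return 1.0 - torch.exp(torch.as_tensor((t - 1.0) / T))

def soft_label_masked(y_onehot, t, T, K):
    """Eq. (7): label of a z_2-type (masked) virtual outlier."""
    u = torch.full_like(y_onehot, 1.0 / K)    # uniform distribution over the K ID classes
    eps = epsilon(t, T)
    return (1.0 - eps) * y_onehot + eps * u

def soft_label_mixed(y_onehot, y_prime_onehot, t, T, lam):
    """Eq. (10): label of a z_3-type (mixed) virtual outlier."""
    eps = epsilon(t, T)
    return (1.0 - eps) * y_onehot + eps * (lam * y_onehot + (1.0 - lam) * y_prime_onehot)
```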

According to the construction approach, introducing \(\textbf{z}_1\)-type patterns makes DNNs predict the one-hot label for ID patterns, since the complete ID patterns are given. In turn, introducing \(\textbf{z}_2\)-type patterns lowers the prediction confidence on ID samples due to the missing ID patterns, and introducing \(\textbf{z}_3\)-type patterns creates OOD samples that still contain ID patterns, lowering the prediction confidence on such samples. Therefore, the objective function can be written as:

$$\begin{aligned} \begin{aligned} \mathcal {L}(\textbf{f}_{\varvec{\theta }}) =&\frac{1}{n} \sum _{i=1}^n \ \Big [\ell _\textrm{ce}(\textbf{f}_{\varvec{\theta }}(\textbf{x}^i), \textbf{y}^i) \!+\! \gamma _1 \ell _\textrm{ce}(\textbf{f}_{\varvec{\theta }} ({{\varvec{\tilde{{\textbf {x}}}}}}^i(\textbf{z}_2)), {{\varvec{\tilde{{\textbf {y}}}}}}(\textbf{x}^i, \textbf{z}_2)) \\&+ \gamma _2\ell _\textrm{ce}(\textbf{f}_{\varvec{\theta }} ({\varvec{{\tilde{{\textbf {x}}}}}}^i(\textbf{z}_3)), {{\varvec{\tilde{{\textbf {y}}}}}}(\textbf{x}^i, \textbf{z}_3))\Big ], \end{aligned} \end{aligned}$$
(11)

where \(\gamma _1\) and \(\gamma _2\) are hyperparameters. Note that in Eq. (11), we omit the utilization of \(\textbf{z}_1\) for simplicity.

4.4 Tackling Distribution Shift

Fig. 4

t-SNE visualization of virtual outliers on CIFAR10. VOSo successfully creates OOD samples (black dots) in the low-likelihood region of the ID sample space (colored dots) (Color figure online)

Fig. 5

The auxiliary BN layers are employed to address the issue of distribution inconsistency between the mixed OOD samples and the original training samples

Fig. 6

An overview of VOSo. We use a pre-trained network to obtain the heat map of the input image and then sample a threshold. The parts of the image with heat values greater than this threshold are masked, and its soft label is computed according to the threshold. The original image (with a hard label) and the masked image (with a soft label) are used simultaneously to train the deep neural network

Training DNNs with outliers can degrade prediction accuracy on ID samples and thereby detection performance. This aligns with the rationale of domain adaptation (Gong et al., 2016): training models over data sampled from different distributions causes performance degradation. Thus, naively leveraging both ID and OOD samples to train DNNs degrades prediction accuracy on ID samples and leads to poor OOD detection performance. This is consistent with our experimental observations, where naive incorporation of virtual outliers into the training process resulted in only a limited improvement.

To address this problem, we employ a dual batch normalization layer (DuBN) (Xie et al., 2020) to model the two distributions simultaneously. The intuition behind this approach is illustrated in Fig. 5. Specifically, at each batch normalization layer, the features of ID-like samples (i.e., \({{\varvec{\tilde{{\textbf {x}}}}}}(\textbf{z}_1)\) and \({{\varvec{\tilde{{\textbf {x}}}}}}(\textbf{z}_2)\)) are fed into the normal BN, while the features of virtual outliers (i.e., \(\varvec{\tilde{{\textbf {x}}}}(\textbf{z}_3)\)) are fed into an auxiliary BN. In the testing stage, DNNs use only the normal BN for prediction.
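A minimal sketch of such a dual BN layer, in the spirit of Xie et al. (2020), is shown below; the module and flag names are our own illustration.

```python
import torch.nn as nn

class DualBatchNorm2d(nn.Module):
    """One BN position with two branches: a normal BN for ID-like inputs
    (z_1-, z_2-type) and an auxiliary BN for virtual outliers (z_3-type)."""

    def __init__(self, num_features):
        super().__init__()
        self.bn_normal = nn.BatchNorm2d(num_features)
        self.bn_aux = nn.BatchNorm2d(num_features)

    def forward(self, x, use_aux=False):
        # At test time only the normal BN is used (use_aux=False).
        return self.bn_aux(x) if use_aux else self.bn_normal(x)
```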

In each iteration, we sample masking thresholds and the mixing coefficient from Beta distributions, i.e., \(t_1 \sim \textrm{Beta}(\alpha _1, \beta _1), t_2 \sim \textrm{Beta}(\alpha _2, \beta _2), \lambda \sim \textrm{Beta}(\alpha _3, \beta _3)\), and compute the risk in Eq. (11), using gradient backpropagation to update our model \(\textbf{f}_{\varvec{\theta }}\). An overview of our method is presented in Fig. 6.
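Putting the pieces together, one training iteration can be sketched as follows, reusing the helpers from the earlier sketches (make_virtual_outlier, soft_label_masked, soft_label_mixed) and assuming the model exposes a use_aux switch that routes inputs through the auxiliary BN; all names and the exact interface are illustrative rather than the released implementation.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Beta

def soft_ce(logits, target):
    """Cross-entropy with soft target distributions."""
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def voso_step(model, x, y_onehot, cam, x_prime, y_prime_onehot,
              gamma1, gamma2, T1, T2, a1, b1, a2, b2, a3, b3):
    """One iteration of Eq. (11) with Beta-sampled thresholds and mixing coefficient."""
    t1, t2 = Beta(a1, b1).sample(), Beta(a2, b2).sample()
    lam = Beta(a3, b3).sample()

    # Original samples with hard (one-hot) labels, normal BN.
    loss_id = soft_ce(model(x, use_aux=False), y_onehot)

    # z_2-type: mask semantic regions with zeros, soften the label (Eq. (7)), normal BN.
    x_mask = make_virtual_outlier(x, cam, t1, torch.zeros_like(x))
    y_mask = soft_label_masked(y_onehot, t1, T1, y_onehot.size(-1))
    loss_mask = soft_ce(model(x_mask, use_aux=False), y_mask)

    # z_3-type: inject a mixed image (Eq. (9)), soften the label (Eq. (10)), auxiliary BN.
    x_mix = make_virtual_outlier(x, cam, t2, lam * x + (1.0 - lam) * x_prime)
    y_mix = soft_label_mixed(y_onehot, y_prime_onehot, t2, T2, lam)
    loss_mix = soft_ce(model(x_mix, use_aux=True), y_mix)

    return loss_id + gamma1 * loss_mask + gamma2 * loss_mix
```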

5 Experiments

In this section, experiments are conducted to validate the effectiveness of the proposed method.

5.1 Experiment Setup

Datasets. Following the common benchmarks used in previous work (Zhang et al., 2023b), we use CIFAR10 and CIFAR100 (Krizhevsky et al., 2009) as our major ID datasets. We use five common benchmarks as the OOD test datasets: Textures (Cimpoi et al., 2014), SVHN (Netzer et al., 2011), iSUN (Xu et al., 2015), Places365 (Zhou et al., 2018) and LSUN (Yu et al., 2015). Besides, we further conduct experiments on large-scale datasets. Following NPOS, we use ImageNet-1k (Deng et al., 2009) dataset as the in-distribution data. For OOD datasets, we use iNaturalist (Horn et al., 2018), SUN (Xiao et al., 2010), Places (Zhou et al., 2018) and Texture (Cimpoi et al., 2014).

Evaluation metrics. For evaluation, we follow the commonly-used metrics in OOD detection: (1) the false positive rate of OOD samples when the true positive rate of ID samples is at \(95\%\) (FPR95), and (2) the area under the receiver operating characteristic curve (AUROC). We also report ID classification accuracy (ID-ACC) to reflect the preservation level of the performance for the original classification task on ID samples.
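For reference, FPR95 and AUROC can be computed from per-sample scores as sketched below, treating ID as the positive class and assuming higher scores indicate more ID-like samples (as with MSP); this is a generic sketch, not the evaluation code used for the reported numbers.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def ood_metrics(scores_id, scores_ood):
    """FPR95 and AUROC from ID/OOD score arrays (higher score = more ID-like)."""
    labels = np.concatenate([np.ones_like(scores_id), np.zeros_like(scores_ood)])
    scores = np.concatenate([scores_id, scores_ood])

    auroc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    # FPR at the first threshold where the TPR on ID samples reaches 95%.
    fpr95 = fpr[np.argmax(tpr >= 0.95)]
    return fpr95, auroc
```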

OOD detection baselines. We use both post-hoc inference methods and training-time methods as baselines. For post-hoc methods, we take the MSP score (Hendrycks & Gimpel, 2017), ODIN (Liang et al., 2018), ReAct (Sun et al., 2021), the energy score (Liu et al., 2020) and ASH (Djurisic et al., 2023) as baselines. For training-time methods, we use MOODCAT (Yang et al., 2022), OpenGAN (Kong & Ramanan, 2021), RegMixup (Pinto et al., 2022), VOS (Du et al., 2022), LogitNorm (Wei et al., 2022) and NPOS (Tao et al., 2023) as baselines. We also compare the performance of VOSo under different OOD detection scoring functions. Besides, we further compare our method with OE methods (Hendrycks et al., 2019; Zhang et al., 2023a; Wang et al., 2023).

Table 1 OOD detection performance comparison between using softmax cross-entropy loss and VOSo loss
Table 2 OOD detection performance on CIFAR10 benchmarks
Table 3 OOD detection performance on CIFAR100 benchmarks

Training details. For our main results, we use ResNet18 (He et al., 2016) to train models on both the CIFAR10 and CIFAR100 datasets. Initially, we follow the standard training procedure to establish a baseline classification model. Subsequently, we train the final model using the proposed loss function (as detailed in Eq. (11)). The training spans 300 epochs, utilizing SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. We start with an initial learning rate of 0.1 with cosine decay  (Loshchilov & Hutter, 2017). For CIFAR10, hyperparameters \(\alpha _1\) and \(\beta _1\) are adjusted to 50 and 20, respectively, while \(\alpha _2\) and \(\beta _2\) are set to 0.5 each. The loss coefficients \(\gamma _1\) and \(\gamma _2\) are determined as 0.2 and 1. For the temperature coefficients \(T_1\) and \(T_2\), a value of 10 is chosen. In the case of CIFAR100, hyperparameters \(\alpha _1\) and \(\beta _1\) are adjusted to 30 and 18, respectively, with \(\alpha _2\) and \(\beta _2\) set to 0.5 each. Here, \(\gamma _1\) and \(\gamma _2\) are set at 0.6 and 1, respectively, with the temperature coefficients \(T_1\) and \(T_2\) set at 100. For the mixture sampler parameters \(\alpha _3\) and \(\beta _3\), in line with RegMixup (Pinto et al., 2022), we fix them at 10 each. To facilitate the selection of the temperature coefficient, we adjust CAMs to 0–255 and set the threshold based on this. These experiments are conducted using PyTorch on various NVIDIA GeForce RTX GPUs, including the 2080, 3090, and 4090 models.

5.2 Experimental Results

How does VOSo influence OOD detection performance? In Table 1, we present a comparison of the OOD detection performance between models trained using cross-entropy loss and those trained with VOSo. Throughout these tests, we consistently employed MSP as the OOD scoring function. Our observations reveal that VOSo significantly enhances OOD detection capabilities. Specifically, in the CIFAR10 benchmark, VOSo lowers average FPR95 from \(51.13\%\) to \(11.09\%\), marking a substantial improvement of \(40.04\%\). In the CIFAR100 benchmark, VOSo achieves an \(11.93\%\) improvement in average FPR95.

Comparison with other baselines. We conducted comparative experiments on CIFAR10 and CIFAR100 datasets. As illustrated in Tables 2 and 3, VOSo achieves the best average performance on CIFAR10. On the CIFAR100 dataset, VOSo also demonstrates strong performance. This superior performance underscores the effectiveness of VOSo. Training with the virtual outliers successfully mitigates overconfident predictions for OOD samples and enhances OOD detection.

VOSo makes the softmax scores more distinguishable between in- and out-of-distributions. To gain deeper insights into the effectiveness of VOSo, we visualize and compare the distributions of softmax confidence scores for ID and OOD samples. As illustrated in Fig. 7, models trained with cross-entropy loss tend to generate high softmax scores for both ID and OOD samples, leading to suboptimal OOD detection performance. In contrast, VOSo distinctly differentiates the softmax scores between ID and OOD samples, demonstrating its efficacy.

Comparison with OE methods. We also compare VOSo with OE methods on the CIFAR10 dataset, including OE (Hendrycks et al., 2019), MixOE (Zhang et al., 2023a), and DOE (Wang et al., 2023). It is important to note that this comparison may not be entirely fair, as VOSo does not utilize an additional auxiliary OOD dataset. Nevertheless, as demonstrated in Table 5, VOSo still achieves results comparable to these methods.

The effect of the soft labeling strategies. Previous methods assign hard OOD labels to all virtual outliers, essentially treating them as a uniform distribution across all ID classes. This ignores the connection between outliers and ID samples, leading to sub-optimal results. To verify this, we assign hard OOD labels, uniformly distributed over the ID classes, to all generated virtual outliers in our experiments. The results, presented in Table 7, show a significant improvement in performance when soft OOD labels are assigned to virtual outliers instead. We also test other perturbations, such as Gaussian noise and elastic transform, and explore their labeling strategies. Hard OOD labels for the outliers generated with these perturbations do not perform well because the outliers share patterns with ID samples; ignoring this disrupts training. We further investigate labeling strategies for vanilla OE (Hendrycks et al., 2019). The original paper assigns uniform labels to all outliers, which is inappropriate because some outliers resemble ID samples more than others. We instead use a pre-trained model such as CLIP to assign soft labels. Since this is slow, we fine-tune the network for 10 epochs instead of training from scratch. Table 7 confirms the effectiveness of soft labels.

Fig. 7

The MSP score distribution of training with cross-entropy loss and VOSo. VOSo makes the softmax scores more distinguishable between in- and out-of-distributions. Values are percentages

Experimental results in near-OOD setting. Besides the widely adopted far OOD setting, we further verify the proposed VOSo’s effectiveness in the near OOD scenario. Following OpenOOD (Zhang et al., 2023b), we use CIFAR10 as ID data while employing CIFAR100 and TinyImageNet as OOD data. As shown in Table 8, VOSo still achieves the best performance, demonstrating VOSo’s superiority in challenging scenarios.

Table 4 VOSo with different scoring functions
Table 5 Comparison with OE methods

Experimental results on large-scale datasets. To demonstrate VOSo’s scalability and generalizability, we further evaluate VOSo on a large-scale dataset, i.e., ImageNet-1k. In this context, we follow the experimental settings used in NPOS and conduct experiments on the ImageNet-1k dataset with CLIP (Radford et al., 2021). Instead of fine-tuning CLIP’s image encoder as NPOS does, VOSo leverages prompt learning, similar to CoOp (Zhou et al., 2022), as illustrated in Fig. 8, which requires less data and fewer computational resources. As shown in Table 9, VOSo achieves the best performance, demonstrating its robustness and applicability across various scales. However, since prompt learning has only a small number of learnable parameters, the improvements on ImageNet-1k are not as substantial as those on the CIFAR benchmarks. Greater performance gains could probably be obtained by training from scratch, but this is not practical on large-scale datasets.

Table 6 The effect of virtual outliers and DuBN
Table 7 VOSo with different labeling strategies
Table 8 VOSo in near-OOD settings

5.3 Ablation Study

In this section, we perform ablation experiments. Unless otherwise stated, experiments are conducted on the CIFAR10 dataset.

VOSo with different scoring functions. In Table 4, we present a comparison of OOD detection performance between neural networks trained with VOSo loss and those trained with cross-entropy loss, using various scoring functions. The experimental results indicate that the OOD detection performance of neural networks trained with cross-entropy loss varies depending on the scoring function used. In contrast, the performance of networks trained with VOSo loss remains consistently high across different scoring functions. Notably, with VOSo, employing MSP alone is sufficient to achieve robust OOD detection, thereby reducing testing time and eliminating the need to select among different scoring functions.

Table 9 OOD detection performance on ImageNet-1k as ID. Except for CoOp, other baseline results are from NPOS (Tao et al., 2023)

The effect of virtual outliers and DuBN. In Table 6, we illustrate how the integration of virtual outliers affects the OOD detection performance, focusing our analysis on the CIFAR10 dataset. Our results reveal that employing either masked samples or mixed samples independently as virtual outliers can enhance the model’s OOD detection capabilities. However, simply combining these two types of samples results in only a marginal improvement in performance. Notably, the addition of the DuBN module leads to a significant enhancement. In the final row, labeled VOSo+, masked samples are additionally processed through the auxiliary BN; as the results show, this variant tends to underperform.

Comparison with vanilla Mixup. Using CAMs to identify semantic regions adds computational cost. However, perturbing only part of these regions creates virtual outliers that resemble ID samples, making the OOD samples more challenging; such tailored outliers can regularize training more effectively. For comparison, we add ablation experiments using only mixed outliers, comparing VOSo with Mixup and RegMixup. Mixup trains the network with mixed samples to improve robustness, and RegMixup further shows that using both original and mixed samples improves robustness further. In VOSo, Mixup is used as one of the perturbation functions to generate virtual outliers. Consistent with our soft labeling strategy, RegMixup also mixes the labels of the samples. For a more thorough comparison, we additionally assign hard OOD labels to mixed samples. As shown in Table 10, vanilla Mixup and RegMixup do not improve OOD detection significantly, and assigning hard OOD labels to mixed samples fails completely, highlighting the importance of considering the relationship between outliers and ID samples. Further, given the distribution difference between mixed and ID samples, we employ DuBN to model the two distributions simultaneously, and the performance improves accordingly. Finally, we use CAM to guide the mixing of only part of the regions, making the generated outliers more challenging and further improving performance.

The effect of the sampling function. In Table 11, we present the impact of the sampling function’s parameters on the OOD detection performance, with the analysis conducted using the CIFAR10 dataset. Generally, a sampling distribution characterized by higher variance results in a greater diversity of samples. This increased diversity often necessitates a larger network capacity and extended training duration to effectively accommodate and learn from the training data.

The effect of temperature T. In Fig. 9, we further explore how the parameter T influences OOD detection performance, focusing our analysis on CIFAR datasets. On CIFAR10, we achieve the optimal OOD detection performance when T is set to 10. For the CIFAR100 dataset, a larger value of T is employed. This adjustment is made because networks dealing with a smaller number of classes tend to exhibit overconfidence, and in such scenarios, a greater penalty is beneficial.

Fig. 8

Training a model from scratch on ImageNet-1k is too costly. Instead, we leverage prompt learning: we use the proposed VOSo loss to learn a prompt’s context words as learnable vectors while all pre-trained parameters are kept fixed

Table 10 Comparison with vanilla Mixup
Table 11 The effect of the sampling function
Table 12 Test accuracy on ID samples
Fig. 9

The effect of the temperature T

VOSo improves the model’s classification performance. As shown in Table 12, VOSo also improves the model’s classification performance.

Comparison of training time. We study the training-time efficiency of the proposed VOSo to explore whether it introduces extra overhead. Specifically, we report the training time of our method and compare it with other representative works. The results are shown in Table 13. In terms of training time, VOS and NPOS are not superior because they require dense sampling of virtual outliers around marginal feature points. Hence, VOSo exhibits acceptable training-time efficiency.

Comparison with generative models. Virtual outliers can also be obtained with generative models, and such virtual outliers are usually more semantically realistic. As shown in Fig. 11, the virtual outliers generated by OpenGAN appear more realistic than those created by VOSo. However, realistic virtual outliers are not necessarily more effective than unrealistic ones; what matters more is that the outliers share features with ID samples, so that they can constrain the decision boundaries of the network. VOSo directly perturbs ID samples to obtain virtual outliers, which share a large number of features with ID samples, making the knowledge boundaries of the network more tightly constrained.

The effect of filling content. In our main results, the masked areas of the masked samples are filled with zero pixels. To explore the effect of the filling content, we conduct comparison experiments that fill with random pixels instead. As shown in Table 16, filling with random pixels has a similar effect to filling with zero pixels.

Table 13 Comparison of training time

5.4 Additional Experiments on Model Calibration and Data-Shift Robustness

To explore potential adverse effects as limitations, we evaluate the proposed VOSo under distribution-shift scenarios. Specifically, we evaluate the model trained with the VOSo loss on test data with different distribution shifts in Table 14. The results show that VOSo also improves distributional robustness compared to a vanilla classifier trained using only cross-entropy loss. We further measure the calibration error of VOSo using the Expected Calibration Error (ECE) (Guo et al., 2017) and report the results in Table 15. The results show that VOSo leads to poorer calibration of the model. To explore the cause, we visualize the relationship between average accuracy and confidence in Fig. 10. We find that models trained with the VOSo loss are overcautious, leading to poorer calibration. However, this careful decision-making allows the model to be more aware of its knowledge boundaries, as evidenced by VOSo’s good OOD detection performance. Besides, we further explore the performance of VOSo on corruption datasets. Specifically, we train the network with VOSo on the CIFAR10 dataset and test on CIFAR10C (Hendrycks & Dietterich, 2019). CIFAR10C contains 15 types of corruption with 5 severity levels, applied to images from the test set of the clean CIFAR10 dataset. We report the classification accuracy of the models under the largest corruption severity level 5. As shown in Table 17, VOSo achieves better classification performance than training the network with cross-entropy loss, demonstrating its robustness on corruption datasets.
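For completeness, a standard way to compute ECE from per-sample confidences (e.g., MSP values) and correctness indicators is sketched below; the bin count is a common default and not necessarily the setting used for Table 15.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: weighted average of |accuracy - confidence| over equal-width confidence bins."""
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()
            avg_conf = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(acc - avg_conf)
    return ece
```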

Table 14 Evaluations on data-shift robustness (numbers are in %)
Fig. 10

Comparison of ECE score

Fig. 11

Comparison of virtual outliers generated by VOSo and OpenGAN

Table 15 Calibration performance
Table 16 VOSo with different filling content
Table 17 Robustness on corrupted datasets

6 Discussions

6.1 Selection of Hyperparameters

VOSo introduces a relatively large number of hyperparameters. Specifically, the size of the perturbed area in each epoch is controlled by a threshold sampled from a \(\textit{Beta}(\alpha , \beta )\) distribution, contributing the two hyperparameters \(\alpha \) and \(\beta \). Meanwhile, the temperature factor T controls the degree of label smoothing.

We can control the dispersion of the sampling distribution by adjusting \(\alpha \) and \(\beta \). When the sampling distribution is more dispersed, we obtain more varied samples. This diversity can improve the network’s adaptability to new data, but it also makes learning more challenging, leading to difficulties in convergence. In essence, a higher dispersion in the distribution expands the parameter space the network needs to adapt to during training, increasing the complexity of the learning process. Usually, we recommend designing a sampling distribution with a small variance for a small network to avoid the network failing to converge and a sampling function with a slightly larger variance for a large network to get more diverse OOD samples. The choice of temperature factor T is related to the number of classes. When the number of classes is smaller, the neural network is more likely to be overconfident, in which case we should give a larger penalty. This usually leads to better results. In our experiments, we incorporate virtual OOD samples without semantic information to construct a validation set to assist in hyperparameter selection.

6.2 Limitations

In this section, we report the limitations of the proposed VOSo.

  • VOSo uses CAM to guide the generation of virtual outliers and requires training the neural network with both the original images and the virtual outliers. This introduces additional computational overhead, which is a drawback of VOSo.

  • VOSo assigns appropriate labels to virtual outliers through a hand-designed link function. How to automatically assign appropriate labels to virtual outliers deserves further exploration, but has not been explored in the current version. We will investigate the possibility of designing a learnable link function in future work.

  • On large-scale datasets such as ImageNet-1k, we apply prompt learning on the CLIP model to realize VOSo, which greatly reduces the computational cost, but less training data and fewer learnable parameters may make the performance gains less noticeable than training from scratch.

7 Conclusion

In this paper, we propose virtual outlier smoothing (VOSo), a simple and effective method that constructs virtual outliers by perturbing the semantic regions of ID samples. Besides, considering that these virtual outliers still contain some ID patterns, we develop a more appropriate label assignment strategy for them. The virtual outliers are highly correlated with the ID samples and thus provide strong regularization of the training process. Extensive experiments show that VOSo can significantly improve the OOD detection performance of the model while maintaining the classification accuracy on ID samples.