Abstract
Existing Out-of-Distribution (OoD) detection methods distinguish OoD samples from In-Distribution (InD) data mainly by exploiting differences in the features, logits, and gradients of Deep Neural Networks (DNNs). In this work, we propose a new perspective on OoD detection based on the loss landscape and mode ensembles. In the optimization of DNNs, there exist many local optima in the parameter space, namely modes. Interestingly, we observe that these independent modes, which all reach low-loss regions on InD data (training and test data), yield significantly different loss landscapes on OoD data. This observation provides a novel view of OoD detection through the loss landscape, and further implies that OoD detection performance fluctuates significantly across modes. For instance, the FPR of the RankFeat method (Song et al. in Advances in Neural Information Processing Systems 35:17885–17898, 2022) ranges from 46.58% to 84.70% across 5 modes, showing uncertain detection performance evaluations across independent modes. Motivated by such diversity in the OoD loss landscape across modes, we revisit the deep ensemble method for OoD detection through mode ensemble, which improves performance and equips the OoD detector with reduced variance. Extensive experiments covering varied OoD detectors and network structures illustrate the high variance across modes and validate the superiority of mode ensemble in boosting OoD detection. We hope this work draws attention to independent modes in the loss landscape of OoD data and to more reliable evaluations of OoD detectors.
References
Cimpoi, M., Maji, S., Kokkinos, I., et al. (2014) Describing textures in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3606–3613
Deng, J., Dong, W., Socher, R., et al. (2009) ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 248–255
Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations
Draxler, F., Veschgini, K., Salmhofer, M., et al. (2018) Essentially no barriers in neural network energy landscape. In: International conference on machine learning, PMLR, pp 1309–1318
Fang, K., Tao, Q., Wu, Y., et al. (2022a) On multi-head ensemble of smoothed classifiers for certified robustness. arXiv preprint arXiv:2211.10882
Fang, K., Tao, Q., Wu, Y., et al. (2024). Towards robust neural networks via orthogonal diversity. Pattern Recognition, 149, 110281.
Fang, Z., Li, Y., Lu, J., et al. (2022). Is out-of-distribution detection learnable? Advances in Neural Information Processing Systems, 35, 37199–37213.
Fort, S., Hu, H., Lakshminarayanan, B. (2019) Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757
Garipov, T., Izmailov, P., Podoprikhin, D., et al. (2018) Loss surfaces, mode connectivity, and fast ensembling of DNNs. Advances in Neural Information Processing Systems 31
Goodfellow, I. J., Vinyals, O., Saxe, A. M. (2015) Qualitatively characterizing neural network optimization problems. In: International Conference on Learning Representations
Han, T., & Li, Y. F. (2022). Out-of-distribution detection-assisted trustworthy machinery fault diagnosis approach with uncertainty-aware deep ensembles. Reliability Engineering & System Safety, 226, 108648.
He, K., Zhang, X., Ren, S., et al. (2016) Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
Hendrycks, D., Gimpel, K. (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In: International Conference on Learning Representations
Horváth, M. Z., Mueller, M. N., Fischer, M., et al. (2021) Boosting randomized smoothing with variance reduced classifiers. In: International Conference on Learning Representations
Hsu, Y. C., Shen, Y., Jin, H., et al. (2020) Generalized ODIN: Detecting out-of-distribution image without learning from out-of-distribution data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10951–10960
Huang, G., Liu, Z., Van Der Maaten, L., et al. (2017) Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4700–4708
Huang, R., Geng, A., & Li, Y. (2021). On the importance of gradients for detecting distributional shifts in the wild. Advances in Neural Information Processing Systems, 34, 677–689.
Krizhevsky, A. (2009) Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto
Lakshminarayanan, B., Pritzel, A., Blundell, C. (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems 30
Lee, J., AlRegib, G. (2020) Gradients as a measure of uncertainty in neural networks. In: 2020 IEEE International Conference on Image Processing (ICIP), IEEE, pp 2416–2420
Lee, K., Lee, K., Lee, H., et al. (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in Neural Information Processing Systems 31
Li, H., Xu, Z., Taylor, G., et al. (2018) Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems 31
Liang, S., Li, Y., Srikant, R. (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In: International Conference on Learning Representations
Liu, W., Wang, X., Owens, J., et al. (2020). Energy-based out-of-distribution detection. Advances in neural information processing systems, 33, 21464–21475.
Van der Maaten, L., & Hinton, G. (2008) Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579–2605.
Miller, J. P., Taori, R., Raghunathan, A., et al. (2021) Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization. In: International Conference on Machine Learning, PMLR, pp 7721–7735
Netzer, Y., Wang, T., Coates, A., et al. (2011) Reading digits in natural images with unsupervised feature learning. In: Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning
Nguyen, A., Yosinski, J., Clune, J. (2015) Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 427–436
Peebles, W., Xie, S. (2023) Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 4195–4205
Rame, A., Kirchmeyer, M., Rahier, T., et al. (2022). Diverse weight averaging for out-of-distribution generalization. Advances in Neural Information Processing Systems, 35, 10821–10836.
Shen, Z., Liu, J., He, Y., et al. (2021) Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624
Song, Y., Sebe, N., & Wang, W. (2022). Rankfeat: Rank-1 feature removal for out-of-distribution detection. Advances in Neural Information Processing Systems, 35, 17885–17898.
Sun, Y., Guo, C., & Li, Y. (2021). React: Out-of-distribution detection with rectified activations. Advances in Neural Information Processing Systems, 34, 144–157.
Sun, Y., Ming, Y., Zhu, X., et al. (2022) Out-of-distribution detection with deep nearest neighbors. In: International Conference on Machine Learning, PMLR, pp 20827–20840
Tonin, F., Pandey, A., Patrinos, P., et al. (2021) Unsupervised energy-based out-of-distribution detection using Stiefel-restricted kernel machine. In: 2021 International Joint Conference on Neural Networks (IJCNN), IEEE, pp 1–8
Van Horn, G., Mac Aodha, O., Song, Y., et al. (2018) The inaturalist species classification and detection dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8769–8778
Vyas, A., Jammalamadaka, N., Zhu, X., et al. (2018) Out-of-distribution detection using an ensemble of self supervised leave-out classifiers. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 550–564
Wang, H., Li, Z., Feng, L., et al. (2022) ViM: Out-of-distribution with virtual-logit matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 4921–4930
Wang, R., Li, Y., Liu, S. (2023) Exploring diversified adversarial robustness in neural networks via robust mode connectivity. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2345–2351
Wortsman, M., Horton, M.C., Guestrin, C., et al. (2021) Learning neural network subspaces. In: International Conference on Machine Learning, PMLR, pp 11217–11227
Wortsman, M., Ilharco, G., Gadre, S. Y., et al. (2022) Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In: International Conference on Machine Learning, PMLR, pp 23965–23998
Xiao, J., Hays, J., Ehinger, K. A., et al. (2010) SUN database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, pp 3485–3492
Xu, P., Ehinger, K. A., Zhang, Y., et al. (2015) TurkerGaze: Crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755
Xue, F., He, Z., Xie, C., et al. (2022) Boosting out-of-distribution detection with multiple pre-trained models. arXiv preprint arXiv:2212.12720
Yang, D., Mai Ngoc, K., Shin, I., et al. (2021). Ensemble-based out-of-distribution detection. Electronics, 10(5), 567.
Yang, J., Zhou, K., Li, Y., et al. (2021b) Generalized out-of-distribution detection: A survey. International Journal of Computer Vision, pp 1–28
Yang, J., Zhou, K., Liu, Z. (2023) Full-spectrum out-of-distribution detection. International Journal of Computer Vision, pp 1–16
Ye, H., Xie, C., Cai, T., et al. (2021). Towards a theoretical framework of out-of-distribution generalization. Advances in Neural Information Processing Systems, 34, 23519–23531.
Yu, F., Seff, A., Zhang, Y., et al. (2015) LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365
Yu, Y., Shin, S., Lee, S., et al. (2023) Block selection method for using feature norm in out-of-distribution detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 15701–15711
Yuan, L., Chen, Y., Wang, T., et al. (2021) Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 558–567
Zagoruyko, S., Komodakis, N. (2016) Wide residual networks. In: British Machine Vision Conference 2016, British Machine Vision Association
Zhang, C., Bengio, S., Hardt, M., et al. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115.
Zhao, P., Chen, P. Y., Das, P., et al. (2020) Bridging mode connectivity in loss landscapes and adversarial robustness. In: International Conference on Learning Representations
Zhou, B., Lapedriza, A., Khosla, A., et al. (2017). Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6), 1452–1464.
Zhou, Z. H. (2012). Ensemble methods: foundations and algorithms. CRC Press.
Zhu, Y., Chen, Y., Xie, C., et al. (2022). Boosting out-of-distribution detection with typical features. Advances in Neural Information Processing Systems, 35, 20758–20769.
Zhu, Y., Chen, Y., Li, X., et al. (2024) Rethinking out-of-distribution detection from a human-centric perspective. International Journal of Computer Vision, pp 1–18
Acknowledgements
This work is jointly supported by the National Natural Science Foundation of China (62376155, 62376153), the Shanghai Municipal Science and Technology Research Program (22511105600), and the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102). Jie Yang and Xiaolin Huang are the corresponding authors.
Appendices
Appendix A Baseline OoD Detectors and Mode Ensemble
We outline how the scoring function \(S(\cdot )\) is designed in each selected baseline OoD detector and describe the corresponding mode ensemble strategy. In the following, the output of a mode \(f:\mathbb {R}^D\rightarrow \mathbb {R}^C\) is a C-dimensional vector of logits, one per class.
MSP (Hendrycks and Gimpel 2016) takes the maximum softmax probability over the output logits as the scoring function. Given a new sample \(\varvec{x}\in \mathbb {R}^D\), its MSP score w.r.t. a single mode \(f(\cdot )\) is
\[ S_\textrm{MSP}(\varvec{x}) = \max _{c=1,\dots ,C}\ \textrm{softmax}\big (f(\varvec{x})\big )_c. \tag{A1} \]
Ensemble on MSP adopts the average logits from N modes \(f_{s_i}, i=1,\cdots ,N\) to calculate the maximum probability as the score for \(\varvec{x}\):
\[ S(\varvec{x}) = \max _{c=1,\dots ,C}\ \textrm{softmax}\Big (\frac{1}{N}\sum _{i=1}^{N} f_{s_i}(\varvec{x})\Big )_c. \tag{A2} \]
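For concreteness, a minimal PyTorch sketch of Eqs. (A1) and (A2); the list `modes` of trained networks and the function names are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def msp_score(logits: torch.Tensor) -> torch.Tensor:
    # Eq. (A1): maximum softmax probability over the C classes.
    return F.softmax(logits, dim=-1).max(dim=-1).values

def msp_ensemble_score(x: torch.Tensor, modes) -> torch.Tensor:
    # Eq. (A2): average the logits of the N modes, then take the MSP.
    with torch.no_grad():
        avg_logits = torch.stack([f(x) for f in modes]).mean(dim=0)
    return msp_score(avg_logits)
```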
ODIN (Liang et al. 2018) introduces temperature scaling and adversarial perturbations into MSP and proposes the following score:
\[ S_\textrm{ODIN}(\varvec{x}) = \max _{c=1,\dots ,C}\ \textrm{softmax}\big (f(\bar{\varvec{x}})/T\big )_c, \tag{A3} \]
where T denotes the temperature and \(\bar{\varvec{x}}\) denotes the perturbed adversarial example of \(\varvec{x}\). Following the settings in Liang et al. (2018) and Song et al. (2022), we set \(T=1000\) and do not perturb \(\varvec{x}\) for the ImageNet-1K benchmark in experiments.
Ensemble on ODIN shares a similar scoring function with that of the MSP ensemble:
\[ S(\varvec{x}) = \max _{c=1,\dots ,C}\ \textrm{softmax}\Big (\frac{1}{NT}\sum _{i=1}^{N} f_{s_i}(\bar{\varvec{x}}_{s_i})\Big )_c. \tag{A4} \]
The adversarial attack is executed individually on each mode \(f_{s_i}\), yielding a perturbed input \(\bar{\varvec{x}}_{s_i}\), and the ODIN score is then calculated on the average predictive logits over these perturbed inputs.
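Under the ImageNet-1K setting above (\(T=1000\), no input perturbation), Eq. (A4) reduces to the following hedged sketch; on CIFAR10, each mode would instead receive its own perturbed input \(\bar{\varvec{x}}_{s_i}\), which is omitted here.

```python
import torch
import torch.nn.functional as F

def odin_ensemble_score(x: torch.Tensor, modes, T: float = 1000.0) -> torch.Tensor:
    # Eq. (A4) without input perturbation: temperature-scale the average
    # logits of the N modes, then take the maximum softmax probability.
    with torch.no_grad():
        avg_logits = torch.stack([f(x) for f in modes]).mean(dim=0)
    return F.softmax(avg_logits / T, dim=-1).max(dim=-1).values
```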
Energy (Liu et al. 2020) improves MSP via an energy function, since the energy is better aligned with the probability density of the inputs:
\[ S_\textrm{Energy}(\varvec{x}) = \log \sum _{i=1}^{C} e^{f^i(\varvec{x})}, \tag{A5} \]
where \(f^i(\varvec{x})\) denotes the i-th element of the C-dimensional output logits.
Ensemble on Energy first averages the N logits and then computes the energy score:
\[ S(\varvec{x}) = \log \sum _{i=1}^{C} \exp \Big (\frac{1}{N}\sum _{j=1}^{N} f_{s_j}^i(\varvec{x})\Big ). \tag{A6} \]
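A corresponding sketch of Eqs. (A5) and (A6), under the same illustrative conventions as the MSP snippet above.

```python
import torch

def energy_score(logits: torch.Tensor) -> torch.Tensor:
    # Eq. (A5): a log-sum-exp over the C logits (the negative free energy).
    return torch.logsumexp(logits, dim=-1)

def energy_ensemble_score(x: torch.Tensor, modes) -> torch.Tensor:
    # Eq. (A6): first average the N logits, then compute the energy score.
    with torch.no_grad():
        avg_logits = torch.stack([f(x) for f in modes]).mean(dim=0)
    return energy_score(avg_logits)
```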
Mahalanobis (Lee et al. 2018) models the network features at different layers as mixtures of multivariate Gaussian distributions and uses the Mahalanobis distance as the scoring function:
\[ S_\textrm{Maha}(\varvec{x}) = \max _{c}\ -\big (\varvec{h}(\varvec{x})-\varvec{\mu }_c\big )^\top \Sigma ^{-1}\big (\varvec{h}(\varvec{x})-\varvec{\mu }_c\big ), \tag{A7} \]
where \(\varvec{h}(\varvec{x})\) denotes the features at the considered layer, \(\varvec{\mu }_c\) denotes the mean feature vector of class c, and \(\Sigma \) denotes the covariance matrix shared across classes. In experiments, following the settings in Lee et al. (2018) and Song et al. (2022), adversarial examples generated from 500 randomly-selected clean samples are used to train the logistic regression, with a perturbation size of 0.05 on CIFAR10 and 0.001 on ImageNet-1K.
Ensemble on Mahalanobis leverages the average output features at the same layers of the N modes and attacks the N modes simultaneously to calculate the Mahalanobis score. Details can be found in the released code.
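As a rough sketch of Eq. (A7) at a single feature layer, omitting the adversarial input perturbation and the logistic regression over layers; for the ensemble variant, the test features would first be averaged over the N modes. The class means and shared covariance here are standard plug-in estimates, and all names are illustrative.

```python
import torch

def fit_gaussians(feats: torch.Tensor, labels: torch.Tensor, C: int):
    # Class-wise means and a class-shared covariance from training features.
    mus = torch.stack([feats[labels == c].mean(dim=0) for c in range(C)])
    centered = feats - mus[labels]
    cov = centered.T @ centered / feats.shape[0]
    prec = torch.linalg.inv(cov + 1e-6 * torch.eye(feats.shape[1]))
    return mus, prec

def mahalanobis_score(h: torch.Tensor, mus: torch.Tensor, prec: torch.Tensor) -> torch.Tensor:
    # Eq. (A7): maximum over classes of the negative Mahalanobis distance.
    diffs = h.unsqueeze(1) - mus.unsqueeze(0)                # (batch, C, d)
    d2 = torch.einsum('bcd,de,bce->bc', diffs, prec, diffs)  # squared distances
    return (-d2).max(dim=-1).values
```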
KNN (Sun et al. 2022) is a simple but time-consuming and memory-inefficient detector, since it performs a nearest-neighbor search over the \(\ell _2\)-normalized penultimate features between the test sample and all training samples. The negative of the k-th smallest \(\ell _2\) distance between the features \(\varvec{h}^*\) of a new sample \(\varvec{x}^*\) and all the training features \(\varvec{h}^i\) is set as the score:
\[ S_\textrm{KNN}(\varvec{x}^*) = -\min _{i=1,\dots ,n_\textrm{tr}}{}^{(k)}\ \Big \Vert \frac{\varvec{h}^*}{\Vert \varvec{h}^*\Vert _2} - \frac{\varvec{h}^i}{\Vert \varvec{h}^i\Vert _2}\Big \Vert _2, \tag{A8} \]
where \(\min ^{(k)}\) denotes the k-th smallest value, \(\varvec{h}\) denotes the penultimate features of the DNN, and \(\varvec{h}^i\) denotes the penultimate features of the i-th training sample in the training set of size \(n_\textrm{tr}\). The key to this detector is the \(\ell _2\) normalization of the features.
Ensemble on KNN improves performance by replacing the penultimate features from one single mode with the average penultimate features from the N modes:
\[ S(\varvec{x}^*) = -\min _{i=1,\dots ,n_\textrm{tr}}{}^{(k)}\ \Big \Vert \frac{\bar{\varvec{h}}^*}{\Vert \bar{\varvec{h}}^*\Vert _2} - \frac{\bar{\varvec{h}}^i}{\Vert \bar{\varvec{h}}^i\Vert _2}\Big \Vert _2, \quad \bar{\varvec{h}} = \frac{1}{N}\sum _{j=1}^{N}\varvec{h}_{s_j}. \tag{A9} \]
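A sketch of Eq. (A9), assuming the per-mode penultimate features of the test batch have already been extracted and the training features have been averaged over the same N modes offline; the value k = 50 is a placeholder, not the paper's setting.

```python
import torch
import torch.nn.functional as F

def knn_ensemble_score(h_modes: torch.Tensor, h_train: torch.Tensor,
                       k: int = 50) -> torch.Tensor:
    # h_modes: (N, batch, d) penultimate features of the test batch from N modes;
    # h_train: (n_tr, d) mode-averaged penultimate features of the training set.
    z = F.normalize(h_modes.mean(dim=0), dim=-1)  # average, then l2-normalize
    z_tr = F.normalize(h_train, dim=-1)
    dists = torch.cdist(z, z_tr)                  # (batch, n_tr) distances
    # Eq. (A9): negative k-th smallest distance to the training features.
    return -dists.kthvalue(k, dim=-1).values
```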
RankFeat (Song et al. 2022) removes the rank-1 matrix from the feature matrix \(\textbf{X}\) of each individual sample in the mini-batch during forward propagation at test time, since the rank-1 feature drastically perturbs the predictions on OoD samples:
\[ \textbf{X}' = \textbf{X} - \varvec{s}_1 \varvec{u}_1 \varvec{v}_1^\top , \quad \textbf{X} = \textbf{U}\textbf{S}\textbf{V}^\top . \tag{A10} \]
In Eq. (A10), the singular value decomposition is first executed on the feature matrix \(\textbf{X}\) of an individual sample, yielding the left and right orthogonal singular vector matrices \(\textbf{U}\) and \(\textbf{V}\) and the rectangular diagonal singular value matrix \(\textbf{S}\). Then, the rank-1 matrix computed from the largest singular value \(\varvec{s}_1\) and the two corresponding singular vectors \(\varvec{u}_1\) and \(\varvec{v}_1\) is subtracted from the original \(\textbf{X}\). Such removals are recommended at the 3rd and 4th blocks of DNNs. Finally, an Energy score (Liu et al., 2020) is calculated on the resulting output logits.
Ensemble on RankFeat executes the rank-1 feature removal on each mode individually and then averages the N resulting logits to compute the Energy score as in Eq. (A6). Details can be found in the released code.
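The core rank-1 removal of Eq. (A10) can be sketched as below; in the ensemble, each mode would apply this removal inside its own forward pass (at the recommended 3rd and 4th blocks) before the logits are averaged and scored with the energy function. The function name and feature layout are illustrative.

```python
import torch

def remove_rank1(X: torch.Tensor) -> torch.Tensor:
    # Eq. (A10): subtract the rank-1 matrix s1 * u1 * v1^T from the feature
    # matrix X of one sample (e.g. channels x flattened spatial positions).
    U, S, Vh = torch.linalg.svd(X, full_matrices=False)  # S sorted descending
    return X - S[0] * torch.outer(U[:, 0], Vh[0, :])
```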
GradNorm (Huang et al. 2021) leverages gradient information for OoD detection, taking as the score of \(\varvec{x}\) the \(\ell _1\) norm of the gradients w.r.t. a KL divergence loss:
\[ S_\textrm{GradNorm}(\varvec{x}) = \Big \Vert \nabla _{\varvec{\omega }}\, D_\textrm{KL}\big (\varvec{u} \,\Vert \, \textrm{softmax}(f(\varvec{x}))\big )\Big \Vert _1, \tag{A11} \]
where \(\varvec{u}=[1/C,\cdots ,1/C]\in \mathbb {R}^C\) denotes the uniform distribution and \(\varvec{\omega }\) is set as the weight parameters of the last fully-connected layer of the DNN.
Ensemble on GradNorm first calculates the KL divergence between \(\varvec{u}\) and the softmax probability of the average logits of the N modes. The final score for \(\varvec{x}\) is the average gradient norm over the selected parameters \(\varvec{\omega }_{s_i}\) of each mode:
\[ S(\varvec{x}) = \frac{1}{N}\sum _{i=1}^{N}\Big \Vert \nabla _{\varvec{\omega }_{s_i}}\, D_\textrm{KL}\Big (\varvec{u} \,\Vert \, \textrm{softmax}\big (\frac{1}{N}\sum _{j=1}^{N} f_{s_j}(\varvec{x})\big )\Big )\Big \Vert _1. \tag{A12} \]
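A minimal sketch of the ensemble score in Eq. (A12), assuming a batch of one sample and that each mode exposes its last fully-connected layer as `fc`; both assumptions are illustrative, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def gradnorm_ensemble_score(x: torch.Tensor, modes) -> torch.Tensor:
    # KL divergence between the uniform vector u and the softmax of the
    # average logits, computed up to the additive constant -log C, which
    # has zero gradient and hence does not affect the score.
    avg_logits = torch.stack([f(x) for f in modes]).mean(dim=0)
    C = avg_logits.shape[-1]
    kl = -F.log_softmax(avg_logits, dim=-1).sum() / C
    norms = []
    for f in modes:
        w = f.fc.weight  # assumed name of the last fully-connected layer
        (g,) = torch.autograd.grad(kl, w, retain_graph=True)
        norms.append(g.abs().sum())  # l1 norm of the gradients, Eq. (A11)
    return torch.stack(norms).mean()  # average over the N modes, Eq. (A12)
```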
Appendix B Experiment Setup Details
B.1 Details on Data Sets
The data sets used for OoD detection are outlined below; all settings follow prior works.
For the CIFAR10 benchmark, the InD data comprises the training and test sets of CIFAR10, with 50,000 and 10,000 \(32\times 32\times 3\) images of 10 categories, respectively. In this experiment, all images from the OoD data sets are resized to \(32\times 32\times 3\). The evaluated OoD data sets are introduced below.
- SVHN (Netzer et al., 2011) data set includes images of street-view house numbers. The test set of SVHN is adopted for OoD detection, with 26,032 images of digits 0–9.
- LSUN (Yu et al., 2015) data set targets large-scale scene classification and understanding. The test set of LSUN is employed for OoD detection, with 10,000 images of 10 categories.
- iSUN (Xu et al., 2015) data set consists of 8,925 images with gaze traces, all of which are employed for OoD detection.
- Textures (Cimpoi et al., 2014) data set covers various types of surface textures, with 5,640 images of 47 categories. The whole data set is adopted in the evaluation of OoD detection performance.
- Places365 (Zhou et al., 2017) data set is for scene recognition. A subset of 328,500 images is curated for OoD detection by Sun et al. (2022).
For the ImageNet-1K benchmark, the InD data comprises the training and test sets of ImageNet-1K, with 1,281,167 and 50,000 images of 1,000 categories, respectively. In experiments, all images from the OoD data sets are resized to \(224\times 224\times 3\). The evaluated OoD data sets are introduced below.
- iNaturalist (Van Horn et al., 2018) data set contains natural fine-grained images of different species of plants and animals. For OoD detection, 10,000 images are sampled from the concepts selected by Sun et al. (2022).
- SUN (Xiao et al., 2010) data set covers a large variety of environmental scenes, places, and the objects within. For OoD detection, 10,000 images are sampled from the concepts selected by Sun et al. (2022).
- Places (Zhou et al., 2017) data set is for scene recognition. For OoD detection, 10,000 images are sampled by Sun et al. (2022).
- Textures (Cimpoi et al., 2014) data set in the ImageNet-1K benchmark is the same as described above in the CIFAR10 benchmark.
B.2 Details on Model Training
Detailed training settings of the adopted modes are given below. In particular, obtaining multiple independent modes requires re-training models from scratch on the training sets of CIFAR10 (Krizhevsky, 2009) and ImageNet-1K (Deng et al., 2009) with different random seeds. For reproducibility, all training and evaluation code and the trained checkpoints are available at the publicly released link given in the main text.
For the 10 independent modes of ResNet18 (He et al., 2016) and Wide ResNet28X10 (Zagoruyko & Komodakis, 2016) trained on CIFAR10 (Krizhevsky, 2009), each DNN is optimized via SGD for 150 epochs with a batch size of 256 and weight decay \(10^{-4}\). The initial learning rate is 0.1 and reduced via a cosine scheduler to 0 during training. Each DNN is trained on one single NVIDIA GeForce RTX 3090 GPU.
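As a sketch, the CIFAR10 recipe above corresponds to the following PyTorch setup; the momentum value is our assumption (it is not stated here), and the model constructor is a placeholder.

```python
import torch

def make_optimizer(model: torch.nn.Module, epochs: int = 150):
    # CIFAR10 recipe above: SGD, initial LR 0.1, weight decay 1e-4, and a
    # cosine annealing of the LR to 0 over 150 epochs. Momentum 0.9 is an
    # assumed (common) value, not taken from the text.
    opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs,
                                                       eta_min=0.0)
    return opt, sched

# Independent modes differ only in the random seed, e.g. calling
# torch.manual_seed(seed) before model construction and data shuffling.
```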
For the 5 independent modes of ResNet50 (He et al., 2016) and DenseNet121 (Huang et al., 2017) trained from scratch on ImageNet-1K (Deng et al., 2009), we follow the official training script provided by PyTorch. Each DNN is optimized via SGD for 90 epochs with weight decay \(10^{-4}\). The initial learning rate is 0.1 and is reduced by a factor of 10 every 30 epochs. The batch sizes for training ResNet50 and DenseNet121 are 1000 and 800, respectively. Each training run is executed in parallel on 4 NVIDIA V100 GPUs.
For the 3 independent modes of T2T-ViT-14 (Yuan et al., 2021) trained from scratch on ImageNet, we follow the training script provided in the official GitHub repository and adopt the default recommended settings. Each T2T-ViT-14 model is trained in parallel on 8 NVIDIA V100 GPUs for 310 epochs with a batch size of 64, an initial learning rate of \(5\times 10^{-4}\), and weight decay 0.05.
The final models at the end of training are used for OoD detection at inference.
Computational overhead For training, ensemble methods inevitably require extra time and memory for the multiple employed models compared with single-model-based methods (Horváth et al., 2021; Wortsman et al., 2022). Likewise, our proposed mode ensemble requires training multiple DNNs. Table 12 reports the time cost of training a single model; an N-mode ensemble takes N such training runs. In practice, given sufficient computational resources, the training can proceed in parallel rather than sequentially, so the multiple DNNs can be obtained in the time of training one DNN. Once the models are trained, the OoD detectors are ready to use, and only inference is needed to score any given data.