1 Introduction

Semantic segmentation represents a significant advancement in deep learning, with a pixel-level classifier densely mapping learned features to a pre-defined set of classes. The remarkable performance of end-to-end models on this closed set has led researchers to consider the next challenge: extending semantic segmentation to the open-world setting, where objects of unknown classes also need to be segmented. One of the biggest challenges in segmenting unknown objects is the lack of outlier data.

In this work, we first attack the lack of data for unknown segmentation by utilizing a large foundation model, DINOv2 Oquab et al. (2024), for a robust representation space. The availability of internet-scale data has enabled the training of large visual foundation models, known for their generalization capabilities across various tasks (Zhang et al., 2023; Blumenkamp et al., 2024; Aydemir et al., 2023; Nguyen et al., 2023). Despite these promising generalization capabilities, their potential for unknown object segmentation remains mostly unexplored. Only recently has PixOOD Vojíř et al. (2024) used DINOv2 without any training to avoid biases in industrial settings; however, its performance falls significantly behind methods that use outlier supervision on the commonly used SMIYC benchmark.

While collecting representative data for all possible classes in an open-world setting is impractical, existing methods perform significantly better when trained using proxy outlier data (Grcić et al., 2023; Nayal et al., 2023; Rai et al., 2023), for example, obtained with the cut-and-paste method. Retraining with outlier supervision improves unknown segmentation but degrades performance on known classes by reshaping the representation space. Furthermore, retraining the entire model becomes infeasible in the case of large foundational models. We propose a novel way of utilizing proxy outlier data to improve the segmentation of unknown classes without compromising the performance of known classes.

Semantic segmentation models are typically trained to predict class probabilities with a softmax classifier. With a cross-entropy loss on the predicted class probabilities, the model learns to discriminate features of a certain class from the others. Such models excel in learning discriminative representations for the known classes but struggle to generalize to unknown classes due to partitioning the feature space between known classes. As an alternative, deep generative models directly learn a density model to predict the likelihood of a data sample. This likelihood is expected to be lower for outliers, such as samples from unknown classes. However, generative models often require more computational resources and can be challenging to train effectively.

Due to their potential to learn well-calibrated scores, deep generative models have been widely explored for out-of-distribution (OoD) tasks. However, in segmentation, their performance is often inferior to that of discriminative counterparts (Lee et al., 2018; Haldimann et al., 2019; Xia et al., 2020; Vojir et al., 2021). To benefit from the best of both worlds, GMMSeg Liang et al. (2022) presents a hybrid approach by augmenting the GMM-based generative model with discriminatively learned features. While discriminative features boost the inlier performance, GMM helps achieve an impressive OoD performance without explicitly training for it.

Nalisnick et al. (2019) test the ability of deep generative models to detect OoD samples. They show that a generative model trained on one dataset can assign higher likelihoods to samples from another dataset than to those from the training dataset itself. Zhang and Wischik (2022) first explain this phenomenon by showing that the expected log-likelihood is mathematically larger for out-of-distribution data and then propose to differentiate between outlier and OoD detection. While the learned density function can be used to detect outliers with respect to a single distribution, OoD detection requires comparing two distributions. As initially proposed by Bishop (1993), OoD detection can be considered model selection between in-distribution and out-of-distribution data. Although the out-of-distribution is often not explicitly modeled, Zhang and Wischik (2022) show that several existing works in OoD perform a likelihood ratio test with a proxy distribution for OoD, e.g., from auxiliary OoD datasets Hendrycks et al. (2019) or using background statistics Ren et al. (2019).

In this paper, we propose applying the likelihood ratio as a principled way of detecting OoD in semantic segmentation. To calculate the likelihood ratio, we propose to train a lightweight unknown estimation module (UEM) on top of an already trained semantic segmentation model with a fixed number of semantic classes. UEM estimates an OoD distribution using proxy outlier data and a class-agnostic inlier distribution to calculate the likelihood ratio score. We also propose an objective to optimize the likelihood ratio score and train UEM with this objective. We show that our formulation is general enough to apply to both discriminative and generative segmentation models, with an example for each in the experiments. Our proposed method achieves state-of-the-art performance on multiple benchmarks while maintaining the same inlier performance.

We summarize our contributions as follows:

  • leveraging the large-scale visual foundational model DINOv2, we compensate for the lack of data in unknown segmentation,

  • we propose using a likelihood formulation for unknown segmentation to utilize proxy outlier data without sacrificing known performance,

  • we propose a lightweight Unknown Estimation Module incorporated into state-of-the-art generative and discriminative models, trained with a novel objective to optimize likelihood ratio,

  • averaged across several benchmarks, our Unknown Estimation Module achieves new state-of-the-art results in unknown segmentation while maintaining inlier performance.

2 Related Work

OoD without Outlier Data Earlier approaches for OoD detection rely on uncertainty estimation methods to model predictive uncertainty. The uncertainty of a model can be estimated through maximum softmax probabilities Hendrycks and Gimpel (2017), ensembles Lakshminarayanan et al. (2017), MC-dropout Gal and Ghahramani (2016), or by learning to estimate the confidence directly Kendall and Gal (2017). However, posterior probabilities in a closed-set setting may not always be well-calibrated for an open-world setting, potentially leading to overly confident predictions for unfamiliar categories (Guo et al., 2017; Jiang et al., 2018; Minderer et al., 2021).

OoD with Outlier Data Hendrycks et al. (2019) introduce outlier exposure to improve OoD detection. Outlier exposure leverages a proxy dataset of outliers to discover signals and learn heuristics for OoD samples. Chan et al. (2021) use a proxy dataset and entropy maximization to fine-tune the model to give high entropy scores to unknown samples. Similarly, RbA Nayal et al. (2023) uses a proxy dataset to fine-tune the model to produce low logit scores on unknown objects. We follow a similar approach in our work and use a proxy dataset to learn a proxy distribution of OoD. However, our proxy dataset is only used to adjust the parameters of a small discriminator model, so it does not affect the performance of the inlier model.

Deep Generative Models for OoD Generative models have been used to identify outliers based on the estimated probability density of the inlier training data distribution. Liang et al. (2022) use a mixture of Gaussians to represent the data distribution within each class and model OoD instances as low-density regions. Other methods use normalizing flows (Blum et al., 2021; Grcic et al., 2024) or an energy-based model Grcić et al. (2022) to estimate inlier data density. However, estimating the density of inlier data alone does not behave as expected for OoD detection, as Nalisnick et al. (2019) show in their analysis of several deep generative models. Instead of a single density estimation, we treat OoD detection as model selection between two distributions, as proposed in Zhang and Wischik (2022). We directly train the model to optimize the likelihood ratio between an in-distribution and an out-of-distribution for a better separation of outliers. To our knowledge, this is the first work to consider the likelihood ratio for segmenting outliers.

Mask-Based OoD A recent trend in OoD segmentation is to use mask-based models that predict and classify masks (Cheng et al., 2021, 2022; Li et al., 2023). In mask-based models such as Mask2Former Cheng et al. (2022), each query specializes in detecting a certain known class (Nayal et al., 2023; Ackermann et al., 2023). Based on this property of mask-based models, RbA Nayal et al. (2023) proposes an outlier scoring function based on the probability of not belonging to any known class. Utilizing the same property, Maskomaly Ackermann et al. (2023) selects outlier masks by thresholding the per-class mIoU on a validation set. Mask2Anomaly Rai et al. (2023) augments Mask2Former with a global masked-attention mechanism and trains it using a contrastive loss on outlier data. EAM Grcić et al. (2023) performs OoD detection via an ensemble over mask-level scores. Almost all of these methods, except for Maskomaly Ackermann et al. (2023), which is a simple inference-time post-processing technique, show the importance of utilizing OoD data during training. In this paper, we propose a better way of utilizing outlier data with the likelihood ratio, outperforming mask-based models in most metrics with pixel-based classification.

Foundational Models for OoD Foundational models trained on large datasets have shown impressive zero-shot performance on downstream tasks like classification and segmentation (Radford et al., 2021; Oquab et al., 2024; Ranzinger et al., 2024). For image-level OoD classification, Vojíř et al. (2023) leverage generic pre-trained representations from CLIP Radford et al. (2021). Wang et al. (2023) train a negation text-encoder to equip CLIP with the ability to separate OoD samples from in-distribution samples. Recently, PixOOD Vojíř et al. (2024) utilizes DINOv2 Oquab et al. (2024) for modeling the in-distribution data and achieves competitive results for OoD segmentation without any outlier training. Initial works have thus started exploring the potential of foundational models for OoD by building on their powerful representations. In this work, we take it further and improve outlier performance by retraining with outlier supervision without affecting the representation space of the foundational model.

Fig. 1 Overview. Our proposed unknown estimation module (UEM) takes the input from the frozen encoder backbone and learns the outlier and inlier distributions \(\tilde{p}_{\text {out}}\) and \(\tilde{p}_{\text {in}}\). Then, we calculate the log-likelihood ratio by combining the outputs of UEM with the class probabilities of the inlier model \(\hat{p}_{\text {in}}\)

3 Methodology

3.1 Overview

We propose a two-stage approach: In the first stage, a semantic segmentation model is trained solely on the known data with the standard segmentation losses. In the second stage, the semantic segmentation model is fully frozen to maintain its exact inlier performance; we inject the training dataset with pseudo-unknown pixels and train an adaptive, lightweight unknown estimation module that estimates a proxy OoD distribution \(\tilde{p}_{\text {out}}\) and a generic inlier distribution \(\tilde{p}_{\text {in}}\). With this setup, we propose an OoD scoring function based on the likelihood ratio, combining the output of this module with the inlier model, and propose a loss function to optimize it. In Fig. 1, we provide an overview of our proposed approach.

3.2 Notation and Preliminaries

Given an input image \(\textbf{x}\in \mathbb {R}^{3 \times H \times W}\) and its corresponding label map \(\textbf{y}\in \mathcal {Y}^{H \times W}\), a closed-set semantic segmentation model learns a mapping from the input pixels to the class logits \(\textbf{F}_{\theta }(\textbf{x}): \mathbb {R}^{3 \times H \times W} \rightarrow \mathbb {R}^{K \times H \times W}\), where \(\mathcal {Y}= \{1, \dots , K\}\) is the set of known class labels during training. In OoD segmentation, we extend the label space to \(\mathcal {Y}^{'} = \mathcal {Y}\cup \{K + 1\}\), where \(K + 1\) represents semantic categories unseen during training or the OoD class. To identify pixels belonging to the class \(K + 1\), we define a scoring function \(\mathcal {S}_{\text {out}}(\textbf{x}) \in \mathbb {R}^{H \times W}\) that assigns high values to OoD pixels and low values to inlier pixels belonging to \(\mathcal {Y}\).

Likelihood Ratio Previous work in image-level OoD detection Nalisnick et al. (2019) has shown that when \(\mathcal {S}_{\text {out}}(x)\) for an image is defined using the likelihood density of the training data, it assigns high likelihood values to some OoD samples. This limitation of likelihood-based methods has been mitigated by defining \(\mathcal {S}_{\text {out}}(x)\) as the likelihood ratio (LR) between two distributions: \(p_{\text {in}}\), representing the likelihood that the sample belongs to the inlier distribution, and \(p_{\text {out}}\), representing the likelihood that a pixel x is an outlier. Formally:

$$\begin{aligned} \text {LR}(x) ~=~ \frac{p_{\text {out}}(x)}{p_{\text {in}}(x)} \end{aligned}$$
(1)

In this formulation, the likelihood of a sample being an inlier is reinforced by the likelihood of it not being an outlier, and vice versa. While \(p_{\text {in}}\) can be defined using the inlier dataset, defining \(p_{\text {out}}\) is challenging due to its unbounded diversity compared to \(p_{\text {in}}\). Therefore, using different assumptions, previous work explores approximating \(p_{\text {out}}\) (Ren et al., 2019; Zhang et al., 2021). In this work, we explore representing \(p_{\text {out}}\) by utilizing pseudo-unknown data consisting of objects that are semantically disjoint from the training data distribution (Rai et al., 2023; Nayal et al., 2023; Grcić et al., 2023). We show that this formulation applies to segmentation models that use a standard discriminative classifier as well as to generative classifiers such as in GMMSeg Liang et al. (2022).

3.3 Learning an Inlier Segmentor

The existing pixel-level inlier segmentation models typically consist of three parts:

  (i) a feature extractor \(\textbf{E}: \mathbb {R}^{3 \times H \times W} \mapsto \mathbb {R}^{C_e \times \dot{H} \times \dot{W}}\), reducing the spatial dimension to \(\dot{H} \times \dot{W}\),

  (ii) a decoder \(\textbf{D}_{\theta _d}: \mathbb {R}^{C_e \times \dot{H} \times \dot{W}} \mapsto \mathbb {R}^{C_d \times H \times W}\), increasing it back to the original \(H \times W\), and

  (iii) a classification head \(\textbf{G}_{\theta _g} : \mathbb {R}^{C_d \times H \times W} \mapsto \mathbb {R}^{K \times H \times W}\), mapping features to class logit scores.

\(C_e\) and \(C_d\) denote the encoder and decoder’s hidden dimension size, respectively. Hence, the mapping \(\textbf{F}_{\theta }(\textbf{x}): \mathbb {R}^{3 \times H \times W} \mapsto \mathbb {R}^{K \times H \times W}\) is defined as \(\textbf{F}_{\theta } = \textbf{G}_{\theta _g} \circ \textbf{D}_{\theta _d} \circ \textbf{E}\), where \(\theta \) is the set of all learnable parameters, i.e., the union of \(\theta _d\) and \(\theta _g\). In some cases, features from multiple layers of the encoder \(\textbf{E}\) can be passed on to the decoder \(\textbf{D}\) to process features in a multi-scale fashion. We omit this in the notation for simplicity.
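To make the decomposition concrete, the following is a minimal PyTorch sketch of the \(\textbf{F}_{\theta } = \textbf{G}_{\theta _g} \circ \textbf{D}_{\theta _d} \circ \textbf{E}\) pipeline with a frozen backbone. The module choices and sizes (a single-convolution stand-in for the FPN decoder, hidden widths, patch-size-14 upsampling) are illustrative assumptions, not the exact architecture used here.

```python
import torch
import torch.nn as nn

class InlierSegmentor(nn.Module):
    """Sketch of F = G o D o E with a frozen encoder (illustrative sizes)."""

    def __init__(self, encoder: nn.Module, c_e: int = 768, c_d: int = 256, num_classes: int = 19):
        super().__init__()
        self.encoder = encoder                   # E: frozen backbone, e.g. DINOv2,
        for p in self.encoder.parameters():      # assumed to return a spatial feature map
            p.requires_grad = False
        self.decoder = nn.Sequential(            # D: stand-in for the FPN decoder,
            nn.Conv2d(c_e, c_d, kernel_size=1),  # upsampling back to the input resolution
            nn.Upsample(scale_factor=14, mode="bilinear", align_corners=False),  # ViT patch size 14
        )
        self.classifier = nn.Conv2d(c_d, num_classes, kernel_size=1)  # G: per-pixel class logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(x)          # (B, C_e, H', W')
        decoded = self.decoder(feats)    # (B, C_d, H, W)
        return self.classifier(decoded)  # (B, K, H, W) class logit scores
```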

For the backbone, we use DINOv2 Oquab et al. (2024), a self-supervised ViT Dosovitskiy et al. (2020) that has been shown to produce robust and rich visual representations Ranzinger et al. (2024). To maintain its rich representation, we freeze the backbone throughout all stages of training. For the decoder, we utilize a standard Feature Pyramid Network (FPN) Lin et al. (2017) that takes features from multiple layers of the encoder and fuses them to produce an output feature map. For the classification head, we explore two types of classifiers: generative and discriminative. Although the discriminative version seems less suitable for likelihood computations, since likelihood is strictly defined only within a generative framework, we show that it performs exceptionally well once the notion of likelihood is relaxed to include the confidence of a discriminative classifier.

Overall, we consider two families of inlier segmentation models: discriminative and generative. The main difference between the two lies in how the classification head \(\textbf{G}_{\theta _g}\) is implemented.

Generative Classifier We adopt the generative classification formulation proposed in Liang et al. (2022), which replaces the linear softmax classification head by learning the class-conditional density p(x|k) of each pixel with Gaussian Mixture Models (GMMs), where each class is represented by a separate GMM with a uniform prior on the component weights. Formally:

$$\begin{aligned} p(x | k, \theta _g) = \sum _{c=1}^{C}{\pi _{kc}~\mathcal {N}(x ; \mu _{kc}, \Sigma _{kc})} \end{aligned}$$
(2)

where C is the number of components per GMM, \(\pi _{kc}\) is the mixture weight of component c of class k, \(\mu _{kc}\) and \(\Sigma _{kc}\) are its mean and covariance matrix, respectively, and \(\mathcal {N}\) is the Gaussian distribution. The GMM parameters are learned with a variant of the Expectation-Maximization (EM) algorithm called Sinkhorn EM, which adds constraints that enforce an even assignment of features to mixture components, thereby improving training stability. For more details, please refer to Liang et al. (2022).
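As a concrete reference for Eq. (2), the snippet below evaluates the per-class GMM log-density of a batch of pixel features in PyTorch. It only evaluates an already fitted mixture; fitting the parameters with (Sinkhorn) EM is a separate step, and all tensor shapes are illustrative assumptions.

```python
import torch

def gmm_class_log_density(x: torch.Tensor, means: torch.Tensor,
                          covs: torch.Tensor, log_weights: torch.Tensor) -> torch.Tensor:
    """Log-density log p(x | k) under the per-class GMMs of Eq. (2).

    Assumed shapes: x (N, D) pixel features; means (K, C, D); covs (K, C, D, D);
    log_weights (K, C). A uniform prior corresponds to log_weights = -log(C).
    """
    # Per-component Gaussian log-densities, broadcast to (N, K, C)
    comp = torch.distributions.MultivariateNormal(
        means.unsqueeze(0), covariance_matrix=covs.unsqueeze(0)
    ).log_prob(x[:, None, None, :])
    # Mixture over components via log-sum-exp for numerical stability
    return torch.logsumexp(log_weights.unsqueeze(0) + comp, dim=-1)  # (N, K)
```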

Discriminative Classifier We train a single linear layer as a discriminative classifier. In this version, the parameters of the model \(\theta \) are supervised by the cross-entropy loss:

$$\begin{aligned} \theta ^{*} = \text {argmin}_{\theta } -\sum _{(x, k) \in \mathcal {D}} \log p(k | x, \theta ) \end{aligned}$$
(3)

where \(\mathcal {D}\) is the set of image-label pairs, and \(p(k | x, \theta )\) is the softmax output for class k given the class logits \(\hat{y} = \textbf{F}_{\theta }(x)\):

$$\begin{aligned} p(k | x, \theta ) = \frac{\exp (\hat{y}_k)}{\sum _{k'} \exp (\hat{y}_{k'})} \end{aligned}$$
(4)
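A minimal sketch of this discriminative variant follows: a single linear (1x1 convolution) head over decoder features, trained with the cross-entropy objective of Eqs. (3) and (4). The channel sizes and ignore index are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# G: single linear layer mapping decoder features to K class logits per pixel
head = nn.Conv2d(256, 19, kernel_size=1)  # C_d = 256, K = 19 are assumptions

def ce_loss(decoder_feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    logits = head(decoder_feats)  # (B, K, H, W)
    # cross_entropy applies the softmax of Eq. (4) and the sum of Eq. (3) internally
    return F.cross_entropy(logits, labels, ignore_index=255)
```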

3.4 Unknown Estimation Module (UEM)

At this stage, we assume the existence of an inlier segmentation model trained as described in Section 3.3. The unknown estimation module (UEM) consists of a projection module \(\textbf{P}_{\phi _p}\), a 3-layer Multi-Layer Perceptron (MLP) that takes the output of the frozen backbone and produces a projected feature map \(\textbf{P}_{\phi _p}(\textbf{x}) \in \mathbb {R}^{C_p \times H \times W}\), where \(C_p\) is the hidden size of the projection module:

$$\begin{aligned} \textbf{P}_{\phi _p}(\textbf{x}) = \text {MLP}\big (\textbf{E}(\textbf{x})\big ) \end{aligned}$$
(5)

After that, the projected features are fed to a classification head \(\textbf{G}_{\phi _g}\) with two classes: one class maps to the OoD distribution and the other to a generic inlier distribution learned directly from the backbone. Hence, the output of UEM, \(\textbf{U}\in \mathbb {R}^{2 \times H \times W}\), is defined as follows:

$$\begin{aligned} \textbf{U}_{\phi }(\textbf{x}) = \textbf{G}_{\phi _g}\big (\textbf{P}_{\phi _p}(\textbf{x})\big ) \end{aligned}$$
(6)

The classifier head \(\textbf{G}_{\phi _g}\) can be defined as either a generative or discriminative classifier, as detailed in Section 3.3. This means that regardless of whether the inlier segmentation model is generative or discriminative, we can train either a generative or discriminative UEM on top of it, which allows for more flexibility in the design choices. We assume that the outputs of UEM represent the likelihoods of a sample under both the inlier and the unknown distribution. More specifically, we denote \(\tilde{p}_{\text {out}} \in \mathbb {R}^{H \times W}\) as the likelihood of \(\textbf{x}\) being an outlier, and \(\tilde{p}_{\text {in}} \in \mathbb {R}^{H \times W}\) as the likelihood of a sample \(\textbf{x}\) being an inlier. They are computed from the UEM module as follows:

$$\begin{aligned} \begin{aligned} \tilde{p}_{\text {out}}&= \textbf{U}_{\phi _g}^1(\textbf{x}) \\ \tilde{p}_{\text {in}}&= \textbf{U}_{\phi _g}^0(\textbf{x}) \end{aligned} \end{aligned}$$
(7)
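The following PyTorch sketch shows one possible realization of UEM with a discriminative two-class head (Eqs. 5-7). Implementing the MLP as 1x1 convolutions, the hidden width \(C_p\), and the activation are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class UEM(nn.Module):
    """Sketch of the unknown estimation module: projection P + two-class head G."""

    def __init__(self, c_e: int = 768, c_p: int = 256):
        super().__init__()
        self.projection = nn.Sequential(  # P: 3-layer per-pixel MLP (1x1 convs)
            nn.Conv2d(c_e, c_p, 1), nn.GELU(),
            nn.Conv2d(c_p, c_p, 1), nn.GELU(),
            nn.Conv2d(c_p, c_p, 1),
        )
        self.head = nn.Conv2d(c_p, 2, 1)  # G: channel 0 -> inlier, channel 1 -> outlier

    def forward(self, backbone_feats: torch.Tensor):
        u = self.head(self.projection(backbone_feats))  # U in R^{2 x H x W} per sample
        p_in, p_out = u[:, 0], u[:, 1]                  # Eq. (7)
        return p_in, p_out
```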

3.5 Log-Likelihood Ratio Score

First, we outline the formulation assuming that both the inlier segmentation and the UEM module have a generative classification head.

Generative We propose the log-likelihood ratio as an OoD scoring function \(\mathcal {S}_{\text {out}}\), where the likelihood ratio is defined in (1). For this, we need to define \(p_{\text {out}}\) and \(p_{\text {in}}\). We simply set the outlier distribution to the one predicted by UEM, \(p_{\text {out}} = \tilde{p}_{\text {out}}\), where \(\tilde{p}_{\text {out}}(\textbf{x}) \sim \text {GMM}\) with a uniform prior \(\pi _{c}^{\text {out}} = \frac{1}{C}\):

$$\begin{aligned} \begin{aligned} \tilde{p}_{\text {out}}(\textbf{x})&= \sum _{c=1}^{C}\pi _{c}^{\text {out}} \mathcal {N}(\textbf{x};\mu _c^{\text {out}}, \Sigma _c^{\text {out}}) \\ \end{aligned} \end{aligned}$$
(8)

where C is the number of components and \(\mathcal {N}\) is the Gaussian distribution. As for the inlier distribution \(p_{\text {in}}\), we define it by combining \(\tilde{p}_{\text {in}}\) with the likelihood that a sample is an inlier according to the inlier segmentation model. Due to the independence of the two sources of inlier confidence, \(p_{\text {in}}\) can be defined as their product: \(p_{\text {in}} = \tilde{p}_{\text {in}}\cdot \hat{p}_{\text {in}}\). In this case, \(\tilde{p}_{\text {in}}(\textbf{x})\) follows the same form as \(\tilde{p}_{\text {out}}(\textbf{x})\) in (8). As for \(\hat{p}_{\text {in}}(\textbf{x})\), we have:

$$\begin{aligned} \begin{aligned} \hat{p}_{\text {in}}(\textbf{x})&= \max _k{\log p(k|\textbf{x})} \end{aligned} \end{aligned}$$
(9)

Hence, we can write more explicitly:

$$\begin{aligned} \begin{aligned} p_{\text {in}}(\textbf{x})&= \tilde{p}_{\text {in}} \cdot \hat{p}_{\text {in}} \\&= \sum _{c=1}^{C}\pi _{c}^{\text {in}} \mathcal {N}(\textbf{x};\mu _c^{\text {in}}, \Sigma _c^{\text {in}}) \cdot \max _k{\log p(k|\textbf{x})} \end{aligned} \end{aligned}$$
(10)

However, since the inlier segmentation model is a generative classifier, we have \(p(k|\textbf{x}) \sim \text {GMM}\), and following Liang et al. (2022), the log probability is used as the class logit score, i.e., \(\textbf{F}_k(\textbf{x}) = \log p(k|\textbf{x})\). The full log-likelihood ratio score then becomes:

$$\begin{aligned} \begin{aligned} \text {LLR}(\textbf{x})&= \log {\frac{p_{\text {out}}(\textbf{x})}{p_{\text {in}}(\textbf{x})}} \\&=\log p_{\text {out}}(\textbf{x}) - \log {p_{\text {in}}(\textbf{x})} \\&= \log \tilde{p}_{\text {out}}(\textbf{x}) - \log {\tilde{p}_{\text {in}}(\textbf{x})} - \max _k \log p(k|\textbf{x}) \\&= \log \tilde{p}_{\text {out}}(\textbf{x}) - \log {\tilde{p}_{\text {in}}(\textbf{x})} - \max _k \textbf{F}_k(\textbf{x}) \end{aligned} \end{aligned}$$
(11)

LLR unifies the confidence values of the inlier model and our proposed UEM in a single objective.

Discriminative In this case, \(p_{\text {in}}\) and \(p_{\text {out}}\) are defined as \(\tilde{p}_{\text {in}}\) and \(\tilde{p}_{\text {out}}\), respectively, as given in (7). The difference compared to the generative case is that these terms are not defined as GMMs, but rather as logits computed through a linear classifier. More formally, the main difference between the discriminative and the generative cases is the form of the classifier \(\textbf{G}_{\phi _g}\), which is defined as in (2) for the generative case and as in (4) for the discriminative case. As for \(\hat{p}_{\text {in}}\), we define it to be the maximum of the class probabilities defined in (4). Hence, the LLR score in this case can be written as:

$$\begin{aligned} \begin{aligned} \text {LLR}(\textbf{x})&= \log {\frac{p_{\text {out}}(\textbf{x})}{p_{\text {in}}(\textbf{x})}} \\&= \log p_{\text {out}}(\textbf{x}) - \log {p_{\text {in}}(\textbf{x})} \\&= \log \tilde{p}_{\text {out}}(\textbf{x}) - \log \tilde{p}_{\text {in}}(\textbf{x}) - \max _k{\log p(k|x)} \\&= \log \tilde{p}_{\text {out}}(\textbf{x}) - \log \tilde{p}_{\text {in}}(\textbf{x})\\&\quad - \max _k \Big ({\textbf{F}_k(\textbf{x})} - \log {\sum _{k' \in \mathcal {Y}}{\exp \big (\textbf{F}_{k'}(\textbf{x})\big )}}\Big ) \end{aligned} \end{aligned}$$
(12)

As the normalization term \(\log {\sum _{k' \in \mathcal {Y}}{\exp \big (\textbf{F}_{k'}(\textbf{x})\big )}}\) is constant in k and does not affect the maximum, we obtain the final form of the scoring function as follows:

$$\begin{aligned} \text {LLR}(\textbf{x}) = \log \tilde{p}_{\text {out}}(\textbf{x}) - \log \tilde{p}_{\text {in}}(\textbf{x}) - \max _k{\textbf{F}_k(\textbf{x})} \end{aligned}$$
(13)

This shows that the generative and discriminative case formulations converge to the same equation.
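Since both formulations reduce to Eq. (13), the scoring function can be implemented once for both model families. Below is a small sketch, assuming the UEM outputs are already in log space (GMM log-densities in the generative case, raw logits in the discriminative case):

```python
import torch

def llr_score(log_p_out: torch.Tensor, log_p_in: torch.Tensor,
              inlier_logits: torch.Tensor) -> torch.Tensor:
    """Eq. (13): LLR(x) = log p~_out(x) - log p~_in(x) - max_k F_k(x).

    log_p_out, log_p_in: (B, H, W) UEM outputs; inlier_logits: (B, K, H, W)
    from the frozen inlier model. Higher values indicate OoD pixels.
    """
    max_inlier_logit = inlier_logits.max(dim=1).values  # max_k F_k(x)
    return log_p_out - log_p_in - max_inlier_logit
```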

3.6 Log-Likelihood Ratio Loss

The proposed unknown estimation module \(\textbf{U}_{\phi }\), for both discriminative and generative cases, is supervised by the LLR loss generically defined as follows:

$$\begin{aligned} \mathcal {L}_{\text {LLR}}(\textbf{x}, \tilde{\textbf{y}}) = \text {BCE}(\text {LLR}(\textbf{x}), \tilde{\textbf{y}}) + \alpha ~\mathcal {L}_{\text {GMM}} \end{aligned}$$
(14)

where \(\tilde{\textbf{y}}\) is a binary label map denoting known and pseudo-outlier pixels, BCE is the binary cross-entropy loss, \(\mathcal {L}_{\text {GMM}}\) is the loss used to train the GMM components when a generative classifier is used, as in Liang et al. (2022), and \(\alpha \) is a coefficient weighting the GMM loss; we set \(\alpha = 0\) when training a discriminative classifier. \(\mathcal {L}_{\text {GMM}}\) consists of two terms:

$$\begin{aligned} \mathcal {L}_{\text {GMM}} = \mathcal {L}_{\text {CE}} + \beta \mathcal {L}_{\text {contrast}} \end{aligned}$$
(15)

where \(\beta \) is a weighting coefficient, \(\mathcal {L}_{\text {CE}}\) is the cross-entropy loss applied to the output logit scores of the generative classifier head \(\textbf{F}(\textbf{x})\), and \(\mathcal {L}_{\text {contrast}}\) contrasts every component of every class GMM with all other components, both within the same class and across classes. Please refer to Liang et al. (2022) for details.
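A sketch of the combined objective of Eq. (14) follows, assuming the LLR score of Eq. (13) is used directly as the logit of "pixel is an outlier"; the optional \(\mathcal {L}_{\text {GMM}}\) term is passed in precomputed:

```python
import torch
import torch.nn.functional as F
from typing import Optional

def llr_loss(llr: torch.Tensor, outlier_mask: torch.Tensor,
             gmm_loss: Optional[torch.Tensor] = None, alpha: float = 0.0) -> torch.Tensor:
    """Eq. (14): BCE between the LLR score and the binary pseudo-outlier map y~,
    plus alpha * L_GMM for the generative UEM head (alpha = 0 otherwise)."""
    loss = F.binary_cross_entropy_with_logits(llr, outlier_mask.float())
    if gmm_loss is not None:
        loss = loss + alpha * gmm_loss
    return loss
```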

4 Experiments

4.1 Experimental Setup

In our experiments, we use an inlier segmentation network composed of a feature extractor, an FPN Lin et al. (2017) pixel decoder, and a generative classification head (GMMSeg Liang et al. (2022)). The feature extractor is frozen, and we train the pixel decoder and segmentation head in the first stage on random patches of size \(518 \times 1036\) taken from the Cityscapes dataset Cordts et al. (2016). In the second stage, we train our unknown estimation module using a modified version of AnomalyMix Tian et al. (2022), where we randomly cut and paste objects from the COCO dataset Lin et al. (2014) onto the training data for outlier supervision. During outlier supervision, all the trained parameters of the main segmentation network are frozen to maintain inlier performance. Finally, we maintain the training resolution during inference, using a sliding-window approach to cover the whole image.
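For illustration, a minimal sketch of this cut-and-paste outlier injection is shown below. The random placement, the binary object mask, and the pseudo-outlier label id are assumptions in the spirit of AnomalyMix, not its exact implementation:

```python
import torch

def anomaly_mix(image: torch.Tensor, label: torch.Tensor,
                ood_image: torch.Tensor, ood_mask: torch.Tensor,
                ood_id: int = 254) -> tuple:
    """Paste a COCO object crop into a Cityscapes crop and mark its pixels.

    Assumed shapes: image (3, H, W); label (H, W); ood_image (3, h, w) with
    h <= H, w <= W; ood_mask (h, w) binary object mask. The binary target of
    Eq. (14) is then y~ = (label == ood_id)."""
    _, H, W = image.shape
    _, h, w = ood_image.shape
    top = torch.randint(0, H - h + 1, (1,)).item()   # random placement
    left = torch.randint(0, W - w + 1, (1,)).item()
    m = ood_mask.bool()
    image[:, top:top + h, left:left + w][:, m] = ood_image[:, m]  # paste object pixels only
    label[top:top + h, left:left + w][m] = ood_id                 # mark as pseudo-outlier
    return image, label
```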

Table 1 Ablation of Backbone Feature Extractor. We compare the inlier and OoD detection performance using different backbones. We find the DINOv2 backbone to offer the best inlier and outlier performance. The model variation reported in this table is (G-D), with a generative inlier model and a discriminative UEM
Table 2 Likelihood Ratio Gains Ablation: We compare the OoD detection performance of different backbones and with different scoring functions on the Road Anomaly and FS LaF datasets. Gains/losses to the base ID scoring are highlighted in green/red, respectively. The model variation used in these experiments is (G-D), with a generative inlier classifier and a discriminative UEM.

Evaluation Datasets and Metrics We report the performance on the SMIYC Chan et al. (2021) Anomaly Track (SMIYC-AT), Obstacle Track (SMIYC-OT), RoadAnomaly Lis et al. (2019), and the validation set of Fishyscapes Lost and Found (FS LaF) Blum et al. (2021). SMIYC-AT and RoadAnomaly consist of real-world images featuring one or several OoD objects of varying sizes and categories. SMIYC-OT and FS LaF assess the model’s capability to identify small obstacles on the road. We evaluate the performance of our method using common pixel-wise anomaly segmentation metrics: Average Precision (AP) and False Positive Rate (FPR) at a true positive rate of 95%.
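For reference, both metrics can be computed from flattened per-pixel scores as sketched below; this is the standard formulation and may differ in details from the official benchmark code:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_curve

def ood_metrics(scores: np.ndarray, is_ood: np.ndarray) -> tuple:
    """AP and FPR@95TPR over flattened per-pixel anomaly scores.

    scores: float array (N,), higher = more anomalous; is_ood: binary array (N,)."""
    ap = average_precision_score(is_ood, scores)
    fpr, tpr, _ = roc_curve(is_ood, scores)
    fpr95 = fpr[np.searchsorted(tpr, 0.95)]  # FPR at the first threshold with TPR >= 0.95
    return ap, fpr95
```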

4.2 Quantitative Results

Backbone Feature Extractor The proposed Unknown Estimation Module (UEM) builds on a strong backbone model as the feature extractor. The backbone plays a critical role by encoding images into a rich representation space, which helps first model the inliers and then differentiate the outliers with the UEM. We compare the performance of three different backbones for feature extraction, including a self-supervised one, DINOv2 Oquab et al. (2024); a contrastive one, CLIP Radford et al. (2021); and a supervised one, Hierarchical Swin Transformer Liu et al. (2021) for the baseline segmentation network.

Table 1 shows the mIoU performance of the inlier network using different backbones on Cityscapes and the anomaly segmentation performance of UEM trained on top of the inlier network on RoadAnomaly and Fishyscapes. While DINO and Swin show comparable performance on inlier data, DINO significantly outperforms Swin in handling outliers. CLIP shows lower inlier performance than both but surpasses Swin in outlier detection. This difference in outlier performance can be attributed to the pre-training of DINO and CLIP on larger and more diverse datasets, which results in more robust feature representations capable of effectively modeling both in-distribution and out-of-distribution data.

Fig. 2 Qualitative Results on SMIYC-AT (G-G). The second column (ID Score) shows the outlier score from using the GMM without outlier supervision. The third column (OoD Score) shows the anomaly score from the fine-tuned OoD detection head. The fourth column (LR Score) shows our proposed likelihood formulation. The likelihood formulation combines information from both and predicts more accurate OoD score maps

Improvements from Likelihood Ratio We question whether the likelihood ratio is necessary for unknown segmentation. To investigate this, we push the performance of a generative model that requires no additional outlier training by using more powerful backbones. GMMSeg estimates class probability densities, allowing it to directly compute an anomaly score based on the likelihood of the maximum component without requiring outlier training. We use GMMSeg’s density estimate (ID) as a baseline. We also consider the density estimate of the proxy OoD distribution alone as a scoring function (OoD). Lastly, we use the likelihood ratio scoring (LR), which integrates information from both distributions. Table 2 illustrates the performance improvements of the two scoring functions compared to the density estimates from the GMMSeg. The results consistently demonstrate that the likelihood ratio formulation provides better performance over inlier density estimates or the OoD scoring alone, highlighting the advantages of our approach.

We qualitatively compare the three scoring functions in Fig 2. The in-distribution (ID) score demonstrates lower precision due to its tendency to favor known classes. In contrast, the out-of-distribution (OoD) scoring detects outliers very confidently but at the cost of increasing false positives so as not to miss any outliers. The proposed likelihood ratio (LR) balances the two, leveraging the strengths of each to achieve the best results in terms of both inliers and outliers.

Fig. 3 Qualitative Comparison to RbA and PEBAL. We compare our method to RbA Nayal et al. (2023) and PEBAL Tian et al. (2022) with outlier (OoD) supervision. Our method results in smoother masks, better distinguishing outliers from inliers, and reducing false positives

Table 3 Quantitative Results on SMIYC-AT, SMIYC-OT, Road Anomaly, and FS LaF. We compare our approach against existing OoD segmentation methods. Averaged across the four datasets, our approach sets a new state-of-the-art for both AP and FPR. The best result for each dataset is highlighted in bold, and the second best is underlined. Our method UEM (X-Y), has the flexibility to use a discriminative (D) or generative (G) modeling for the inlier classification head (X) and unknown estimation module (Y). We report the results of three possible combinations. Methods in the top group do not use outlier supervision. Methods in the middle group use outlier supervision

Quantitative Comparison to State-of-the-Art Table 3 shows our results compared to state-of-the-art methods on four datasets, with the average performance over datasets in the last column. As each dataset has different characteristics, the existing methods behave differently across the datasets. The top-performing methods include the recently proposed mask-based models RbA Nayal et al. (2023), EAM Grcić et al. (2023), and Mask2Anomaly Rai et al. (2023). While these methods achieve impressive performance in terms of AP, reasoning at the mask level hurts FPR, as considering a mask an outlier introduces several false positives at once. Its negative effect on small objects can be seen in the high FPR on SMIYC-OT and FS LaF. Our method achieves significantly lower FPR on these two datasets while being among the top-performing methods in terms of AP. Our method also achieves impressive AP on the real-world images of SMIYC-OT and Road Anomaly, increasing AP by 1.85 and 1.18, respectively, without causing high FPR. Averaged across the four datasets in the last column, our method sets a new state-of-the-art in both metrics, outperforming the previous state-of-the-art by \(3.71\%\) in AP and \(0.27\%\) in FPR.

Training data impacts performance significantly. Both RbA and EAM are trained on Mapillary and Cityscapes datasets, whereas we train our inlier model only on Cityscapes. Additionally, EAM uses ADE20K Zhou et al. (2017) for outlier supervision, which contains a broader range of classes than COCO. We only use COCO to ensure a fair comparison to other methods. We also note that outlier supervision used in most other methods negatively impacts the performance of the inlier segmentation network as reported in Tian et al. (2022), Nayal et al. (2023).

Qualitative Comparison to State-of-the-Art In Fig. 3, we qualitatively compare UEM (G-G) with the previous state-of-the-art methods, mask-level RbA Nayal et al. (2023) and pixel-level PEBAL Tian et al. (2022). First, there is a difference in the range of anomaly scores produced by these methods, leading to large variations in color, especially for RbA. RbA shows a highly skewed anomaly score margin with large variations in the anomaly scores for the inlier objects. Our method, UEM, produces more calibrated anomaly maps, as its likelihood ratio-based scoring yields a more balanced anomaly score distribution with cleaner, more interpretable masks. The highly noisy results of PEBAL highlight the challenge of achieving smooth predictions with pixel-level methods like PEBAL or ours, compared to mask-level approaches such as RbA. It is important to note that these significant visual differences are not reflected in the quantitative results shown in Table 3, as the evaluation region is limited to the road.

Table 4 Other Methods with DINOv2. We compare the OoD performance of two other scoring functions using DINOv2. Gains/losses to the original method are highlighted in green/red, respectively. While DINOv2 improves the results of other methods, especially PEBAL, our method consistently achieves top results on all four datasets without affecting the inlier performance
Fig. 4 Qualitative Comparison of Our Model Variations. The discriminative variation generally assigns higher anomaly scores, leading to better detection of certain objects. The generative variation is more conservative, producing cleaner masks but sometimes missing objects. The generative-inlier and discriminative-outlier variation balances between the two, offering intermediate results

Is DINOv2 All You Need? To assess the backbone’s impact, we compare our approach to PEBAL Tian et al. (2022) and RbA Nayal et al. (2023). We use their scoring functions to train our segmentation network with the DINOv2 backbone and adjust the outlier supervision process to match their original implementations. As shown in Table 4, DINOv2 significantly improves PEBAL’s performance on Road Anomaly, SMIYC-AT, and SMIYC-OT across both metrics and results in a lower FPR on FS LaF. We attribute these improvements to the more robust backbone. For RbA, AP on Road Anomaly improves, but the other metrics are better with the original model using Mask2Former. This performance drop is likely due to the lack of multi-resolution hierarchical feature maps in the DINOv2 architecture, which are essential for the Mask2Former decoder to process multi-scale features effectively. Additionally, we note that the original Swin-L backbone in RbA has 197M parameters, while our version of DINOv2 has only 86M. Finally, our outlier scoring function performs best overall without modifying any original network parameters, a critical constraint for real-world applications. Both other methods are reported to lose at least 1% mIoU during fine-tuning. We found this effect exacerbated with DINO, requiring careful adjustments to mitigate inlier performance loss.

Discriminative vs. Generative Modeling of the Estimator Module Our unknown estimation module models two distributions during fine-tuning. Each distribution can be modeled as a data density using generative GMMs or explicitly with a linear mapping layer. In Table 3, we evaluate the possible combinations of each. We find that both discriminative and generative classifiers outperform the previous state-of-the-art methods, with the fully discriminative classifier for OoD modeling being slightly better. We omit the discriminative-inlier and generative-outlier (D-G) combination, as we find the GMM takes too long to converge due to the unbounded range of values coming from the MLP.

We visualize the different combinations of our model in Fig. 4. Qualitatively, discriminative heads tend to assign higher anomaly scores, which helps to capture some anomalies more effectively, increasing the accuracy. This could potentially lead to a higher false positive rate; however, because the evaluation benchmark restricts the assessment to the region of interest, i.e., the road, discriminative heads achieve better performance on the benchmark.

On the Number of Parameters in UEM The original segmentation network consists of 101M parameters. Our UEM module introduces an additional 788K parameters, representing a less than 1% increase in the overall model size. Despite this minimal parameter overhead, the UEM module significantly enhances OoD detection performance.

5 Conclusion and Future Work

In this work, we propose a novel strategy to utilize proxy outlier data for improved OoD detection without retraining the entire network. This allows us to build on the robust representation space of large foundational models, significantly enhancing the generalization capability of the proposed approach. We propose an unknown estimation module (UEM) that can be integrated into the existing segmentation networks to identify OoD objects effectively. We develop an OoD scoring function based on the likelihood ratio by combining UEM’s outputs with inlier predictions. Our method sets a new state-of-the-art in outlier segmentation across multiple datasets, without causing any drops in the inlier performance.

For future work, we aim to investigate how the choice of proxy OoD dataset influences the generalization performance of our method. In this study, we utilized the COCO dataset as the proxy OoD data for a fair comparison with the other approaches. We plan to investigate the effect of mining more realistic outliers from real-world OoD datasets Shoeb et al. (2024).