Abstract
Addressing the Out-of-Distribution (OoD) segmentation task is a prerequisite for perception systems operating in an open-world environment. Large foundational models are frequently used in downstream tasks; however, their potential for OoD segmentation remains mostly unexplored. We seek to leverage a large foundational model to achieve a robust representation space. Outlier supervision is a widely used strategy for improving the OoD detection of existing segmentation networks. However, current approaches for outlier supervision involve retraining parts of the original network, which is typically disruptive to the model’s learned feature representation. Furthermore, retraining becomes infeasible in the case of large foundational models. Our goal is to retrain for outlier segmentation without compromising the strong representation space of the foundational model. To this end, we propose an adaptive, lightweight unknown estimation module (UEM) for outlier supervision that significantly enhances OoD segmentation performance without affecting the learned feature representation of the original network. UEM learns a distribution for outliers and a generic distribution for the known classes. Using the learned distributions, we propose a likelihood-ratio-based outlier scoring function that fuses the confidence of UEM with that of the pixel-wise inlier segmentation network to detect unknown objects. We also propose an objective to optimize this score directly. Our approach achieves a new state-of-the-art across multiple datasets, outperforming the previous best method by 5.74% average precision points while having a lower false-positive rate. Importantly, strong inlier performance remains unaffected. The code and pre-trained models are available at: https://github.com/NazirNayal8/UEM-likelihood-ratio.
1 Introduction
Semantic segmentation is one of the major successes of deep learning: learned features are densely mapped to a pre-defined set of classes by a pixel-level classifier. The remarkable performance of end-to-end models on this closed set has led researchers to consider the next challenge: extending semantic segmentation to the open-world setting, where objects of unknown classes also need to be segmented. One of the biggest challenges in segmenting unknown objects is the lack of outlier data.
In this work, we first attack the lack of data for unknown segmentation by utilizing a large foundation model, DINOv2 Oquab et al. (2024), for a robust representation space. The availability of internet-scale data has enabled the training of large visual foundation models, known for their generalization capabilities across various tasks (Zhang et al., 2023; Blumenkamp et al., 2024; Aydemir et al., 2023; Nguyen et al., 2023). Despite these promising generalization capabilities, their potential for unknown object segmentation remains mostly unexplored. Only recently, PixOOD Vojíř et al. (2024) has used DINOv2 without any training to avoid biases in industrial settings; however, its performance falls significantly behind methods that use outlier supervision on the commonly used SMIYC benchmark.
While collecting representative data for all possible classes in an open-world setting is impracticable, existing methods perform significantly better when trained using proxy outlier data (Grcić et al., 2023; Nayal et al., 2023; Rai et al., 2023), for example, obtained with the cut-and-paste method. Retraining with outlier supervision improves unknown segmentation but causes problems for known classes due to the reshaping of the representation space. Furthermore, retraining the entire model becomes infeasible in the case of large foundational models. We propose a novel way of utilizing proxy outlier data to improve the segmentation of unknown classes without compromising the performance of known classes.
Semantic segmentation models are typically trained to predict class probabilities with a softmax classifier. With a cross-entropy loss on the predicted class probabilities, the model learns to discriminate features of a certain class from the others. Such models excel in learning discriminative representations for the known classes but struggle to generalize to unknown classes due to partitioning the feature space between known classes. As an alternative, deep generative models directly learn a density model to predict the likelihood of a data sample. This likelihood is expected to be lower for outliers, such as samples from unknown classes. However, generative models often require more computational resources and can be challenging to train effectively.
Due to their potential to learn well-calibrated scores, deep generative models have been widely explored for out-of-distribution (OoD) tasks. However, in segmentation, their performance is often inferior to that of discriminative counterparts (Lee et al., 2018; Haldimann et al., 2019; Xia et al., 2020; Vojir et al., 2021). To benefit from the best of both worlds, GMMSeg Liang et al. (2022) presents a hybrid approach by augmenting the GMM-based generative model with discriminatively learned features. While discriminative features boost the inlier performance, GMM helps achieve an impressive OoD performance without explicitly training for it.
Nalisnick et al. (2019) test the ability of deep generative models to detect OoD. They show that a generative model trained on one dataset can assign higher likelihoods to samples from another dataset than to samples from the training dataset itself. Zhang and Wischik (2022) first explain this phenomenon by showing that the expected log-likelihood is mathematically larger for out-of-distribution data and then propose to differentiate between outlier detection and OoD detection. While the learned density function can be used to detect outliers with respect to a single distribution, OoD detection requires comparing two distributions. As initially proposed by Bishop (1993), OoD detection can be considered model selection between in-distribution and out-of-distribution data. Although the out-of-distribution is rarely modeled explicitly, Zhang and Wischik (2022) show that several existing works in OoD perform a likelihood ratio test with a proxy distribution for OoD, e.g., from auxiliary OoD datasets Hendrycks et al. (2019) or using background statistics Ren et al. (2019).
In this paper, we propose applying the likelihood ratio as a principled way of detecting OoD in semantic segmentation. To calculate the likelihood ratio, we propose to train a lightweight unknown estimation module (UEM) on top of an already trained semantic segmentation model with a fixed number of semantic classes. UEM estimates an OoD distribution using proxy outlier data and a class-agnostic inlier distribution to calculate the likelihood ratio score. We also propose an objective to optimize the likelihood ratio score and train UEM with this objective. We show that our formulation is general enough to apply to both discriminative and generative segmentation models, with an example for each in the experiments. Our proposed method achieves state-of-the-art performance on multiple benchmarks while maintaining the same inlier performance.
We summarize our contributions as follows:
- leveraging the large-scale visual foundational model DINOv2, we compensate for the lack of data in unknown segmentation,
- we propose using a likelihood formulation for unknown segmentation to utilize proxy outlier data without sacrificing known performance,
- we propose a lightweight Unknown Estimation Module incorporated into state-of-the-art generative and discriminative models, trained with a novel objective to optimize the likelihood ratio,
- averaged across several benchmarks, our Unknown Estimation Module achieves new state-of-the-art results in unknown segmentation while maintaining inlier performance.
2 Related Work
OoD without Outlier Data Earlier approaches for OoD detection rely on uncertainty estimation methods to model predictive uncertainty. The uncertainty of a model can be estimated through maximum softmax probabilities Hendrycks and Gimpel (2017), ensembles Lakshminarayanan et al. (2017), MC-dropout Gal and Ghahramani (2016), or by learning to estimate the confidence directly Kendall and Gal (2017). However, posterior probabilities in a closed-set setting may not always be well-calibrated for an open-world setting, potentially leading to overly confident predictions for unfamiliar categories (Guo et al., 2017; Jiang et al., 2018; Minderer et al., 2021).
OoD with Outlier Data Hendrycks et al. (2019) introduce outlier exposure to improve OoD detection. Outlier exposure leverages a proxy dataset of outliers to discover signals and learn heuristics for OoD samples. Chan et al. (2021) use a proxy dataset and entropy maximization to fine-tune the model to give high entropy scores to unknown samples. Similarly, RbA Nayal et al. (2023) uses a proxy dataset to fine-tune the model to produce low logit scores on unknown objects. We follow a similar approach in our work and use a proxy dataset to learn a proxy distribution of OoD. However, our proxy dataset is only used to adjust the parameters of a small discriminator model, so it does not affect the performance of the inlier model.
Deep Generative Models for OoD Generative models have been used to identify outliers based on the estimated probability density of the inlier training data distribution. Liang et al. (2022) use a mixture of Gaussians to represent the data distribution within each class and model OoD instances as low-density regions. Other methods use normalizing flow (Blum et al., 2021; Grcic et al., 2024) or an energy-based model Grcić et al. (2022) to estimate inlier data density. However, estimating a data density of inliers only does not behave as expected for OoD detection, as Nalisnick et al. (2019) show in their analysis of several deep generative models. Instead of a single density estimation, we treat OoD detection as model selection between two distributions as proposed in Zhang and Wischik (2022). We directly train the model to optimize the likelihood ratio between an in-distribution and an out-of-distribution for a better separation of outliers. To our knowledge, this is the first work to consider the likelihood ratio for segmenting outliers.
Mask-Based OoD A recent trend in OoD segmentation is to use mask-based models by predicting and classifying masks (Cheng et al., 2021, 2022; Li et al., 2023). In mask-based models such as Mask2Former Cheng et al. (2022), each query specializes in detecting a certain known class (Nayal et al., 2023; Ackermann et al., 2023). Based on this property of mask-based models, RbA Nayal et al. (2023) proposes an outlier scoring function based on the probability of not belonging to any known class. Utilizing the same property, Maskomaly Ackermann et al. (2023) selects outlier masks by thresholding the per-class mIoU on a validation set. Mask2Anomaly Rai et al. (2023) augments Mask2Former with a global masked-attention mechanism and trains it using a contrastive loss on outlier data. EAM Grcić et al. (2023) performs OoD detection via an ensemble over mask-level scores. Almost all of these methods, except for Maskomaly Ackermann et al. (2023), which is a simple inference-time post-processing technique, show the importance of utilizing OoD data during training. In this paper, we propose a better way of utilizing outlier data with the likelihood ratio, outperforming mask-based models in most metrics with pixel-based classification.
Foundational Models for OoD Foundational models trained on large datasets have shown impressive zero-shot performance on downstream tasks like classification and segmentation (Radford et al., 2021; Oquab et al., 2024; Ranzinger et al., 2024). For image-level OoD classification, Vojíř et al. (2023) leverage the generic pre-trained representation of CLIP Radford et al. (2021). Wang et al. (2023) train a negation text-encoder to equip CLIP with the ability to separate OoD samples from in-distribution samples. Recently, PixOOD Vojíř et al. (2024) utilizes DINOv2 Oquab et al. (2024) to model the in-distribution data and achieves competitive results for OoD segmentation without any outlier training. These initial works have started exploring the potential of foundational models for OoD by building on their powerful representations. In this work, we take it further and improve outlier performance by retraining with outlier supervision without affecting the representation space of the foundational model.
Fig. 1: Overview. Our proposed unknown estimation module (UEM) takes the input from the frozen encoder backbone and learns the outlier and inlier distributions \(\tilde{p}_{\text {out}}\) and \(\tilde{p}_{\text {in}}\). Then, we calculate the log-likelihood ratio by combining the outputs of UEM with the class probabilities of the inlier model \(\hat{p}_{\text {in}}\).
3 Methodology
3.1 Overview
We propose a two-stage approach: In the first stage, a semantic segmentation model is trained solely on the known data with the standard segmentation losses. In the second stage, the semantic segmentation model is fully frozen to maintain its exact inlier performance, and we train an adaptive, lightweight unknown estimation module that estimates a proxy OoD distribution \(\tilde{p}_{\text {out}}\) and a generic inlier distribution \(\tilde{p}_{\text {in}}\) after injecting the training dataset with pseudo-unknown pixels. With this setup, we propose an OoD scoring function based on the likelihood ratio, obtained by combining the output of this module with that of the inlier model, and a loss function to optimize it. In Fig 1, we provide an overview of our proposed approach.
3.2 Notation and Preliminaries
Given an input image \(\textbf{x}\in \mathbb {R}^{3 \times H \times W}\) and its corresponding label map \(\textbf{y}\in \mathcal {Y}^{H \times W}\), a closed-set semantic segmentation model learns a mapping from the input pixels to the class logits \(\textbf{F}_{\theta }(\textbf{x}): \mathbb {R}^{3 \times H \times W} \rightarrow \mathbb {R}^{K \times H \times W}\), where \(\mathcal {Y}= \{1, \dots , K\}\) is the set of known class labels during training. In OoD segmentation, we extend the label space to \(\mathcal {Y}^{'} = \mathcal {Y}\cup \{K + 1\}\), where \(K + 1\) represents semantic categories unseen during training or the OoD class. To identify pixels belonging to the class \(K + 1\), we define a scoring function \(\mathcal {S}_{\text {out}}(\textbf{x}) \in \mathbb {R}^{H \times W}\) that assigns high values to OoD pixels and low values to inlier pixels belonging to \(\mathcal {Y}\).
Likelihood Ratio Previous work in image-level OoD detection Nalisnick et al. (2019) has shown that when \(\mathcal {S}_{\text {out}}(x)\) for an image is defined using the likelihood density of the training data, it assigns high likelihood values to some OoD samples. This limitation of likelihood-based methods has been mitigated by defining \(\mathcal {S}_{\text {out}}(x)\) as the likelihood ratio (LR) between two distributions: \(p_{\text {in}}\), representing the likelihood of the sample belonging to the inlier distribution, and \(p_{\text {out}}\), representing the likelihood that a pixel x is an outlier. Formally:

\(\mathcal {S}_{\text {out}}(x) = \dfrac{p_{\text {out}}(x)}{p_{\text {in}}(x)}\)    (1)

In this formulation, the likelihood of a sample being an inlier is reinforced by the likelihood of it not being an outlier, and vice versa. While \(p_{\text {in}}\) can be defined using the inlier dataset, defining \(p_{\text {out}}\) is challenging due to the unbounded diversity of \(p_{\text {out}}\) compared to \(p_{\text {in}}\). Therefore, previous work explores approximating \(p_{\text {out}}\) under different assumptions (Ren et al., 2019; Zhang et al., 2021). In this work, we represent \(p_{\text {out}}\) by utilizing pseudo-unknown data consisting of objects that are semantically disjoint from the training data distribution (Rai et al., 2023; Nayal et al., 2023; Grcić et al., 2023). We show that this formulation applies both to segmentation models that use a standard discriminative classifier and to generative classifiers such as GMMSeg Liang et al. (2022).
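To make the score concrete, the following minimal PyTorch sketch evaluates a log-likelihood-ratio score per pixel, assuming the two per-pixel log-densities are already available; the function name and the toy tensors are illustrative only.

```python
import torch

def likelihood_ratio_score(log_p_in: torch.Tensor, log_p_out: torch.Tensor) -> torch.Tensor:
    """Per-pixel OoD score as a log-likelihood ratio.

    Both inputs have shape (H, W) and hold log-densities (or, in the
    discriminative relaxation, confidence scores). Higher outputs indicate
    pixels that are more likely to be out-of-distribution.
    """
    return log_p_out - log_p_in

# toy usage: a 4x4 "image" where one pixel is clearly an outlier
log_p_in = torch.full((4, 4), -1.0)
log_p_out = torch.full((4, 4), -5.0)
log_p_out[2, 3] = -0.5                      # hypothetical outlier pixel
score = likelihood_ratio_score(log_p_in, log_p_out)
print(score.argmax())                       # flattened index of the most anomalous pixel
```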
3.3 Learning an Inlier Segmentor
The existing pixel-level inlier segmentation models typically consist of three parts:
i. a feature extractor \(\textbf{E}: \mathbb {R}^{3 \times H \times W} \mapsto \mathbb {R}^{C_e \times \dot{H} \times \dot{W}}\), reducing the spatial dimension to \(\dot{H} \times \dot{W}\),
ii. a decoder \(\textbf{D}_{\theta _d}: \mathbb {R}^{C_e \times \dot{H} \times \dot{W}} \mapsto \mathbb {R}^{C_d \times H \times W}\), increasing it back to the original \(H \times W\), and
iii. a classification head \(\textbf{G}_{\theta _g} : \mathbb {R}^{C_d \times H \times W} \mapsto \mathbb {R}^{K \times H \times W}\), mapping features to class logit scores.
\(C_e\) and \(C_d\) denote the encoder and decoder’s hidden dimension size, respectively. Hence, the mapping \(\textbf{F}_{\theta }(\textbf{x}): \mathbb {R}^{3 \times H \times W} \mapsto \mathbb {R}^{K \times H \times W}\) is defined as \(\textbf{F}_{\theta } = \textbf{G}_{\theta _g} \circ \textbf{D}_{\theta _d} \circ \textbf{E}\), i.e., the encoder, decoder, and classification head applied in sequence. In this notation, \(\theta \) is the set of all learnable parameters, i.e., the union of \(\theta _d\) and \(\theta _g\). In some cases, features from multiple layers of the encoder \(\textbf{E}\) can be passed on to the decoder \(\textbf{D}\) to process features in a multi-scale fashion. We omit this in the notation for simplicity.
For the backbone, we use DINOv2 Oquab et al. (2024), a self-supervised ViT Dosovitskiy et al. (2020) that has been shown to produce robust and rich visual representations Ranzinger et al. (2024). To maintain its rich representation, we freeze the backbone throughout all stages of training. For the decoder, we utilize a standard Feature Pyramid Network (FPN) Lin et al. (2017) that takes features from multiple layers of the encoder and fuses them to produce an output feature map. For the classification head, we explore two types of classifiers: generative and discriminative. Although the discriminative version seems less suited to likelihood computations, since likelihood is strictly defined only within the framework of a generative model, we show that it performs exceptionally well when the notion of likelihood is relaxed to also cover the confidence of a discriminative classifier.
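As a rough sketch of this design (not the exact architecture used in our experiments), the snippet below wires a frozen DINOv2 ViT-B/14 loaded from torch.hub to a simplified single-scale fusion of intermediate features and a linear per-pixel head; the chosen blocks, decoder width, and fusion scheme are illustrative assumptions, whereas the actual model uses a full FPN Lin et al. (2017).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenDinoSegmenter(nn.Module):
    """Frozen DINOv2 backbone with a simplified multi-layer fusion decoder (sketch)."""

    def __init__(self, num_classes: int = 19, dec_dim: int = 256, layers=(2, 5, 8, 11)):
        super().__init__()
        # torch.hub entry point of the official DINOv2 repository (downloads weights)
        self.backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
        for p in self.backbone.parameters():             # keep the representation intact
            p.requires_grad_(False)
        self.layers = layers
        emb = self.backbone.embed_dim                     # 768 for ViT-B/14
        self.lateral = nn.ModuleList([nn.Conv2d(emb, dec_dim, 1) for _ in layers])
        self.head = nn.Conv2d(dec_dim, num_classes, 1)    # linear per-pixel classifier

    def forward(self, x):
        h, w = x.shape[-2:]                               # H and W must be multiples of 14
        # features from several ViT blocks, reshaped to (B, C_e, H/14, W/14)
        feats = self.backbone.get_intermediate_layers(x, n=self.layers, reshape=True)
        fused = sum(lat(f) for lat, f in zip(self.lateral, feats))
        logits = self.head(fused)                         # (B, K, H/14, W/14)
        return F.interpolate(logits, size=(h, w), mode='bilinear', align_corners=False)

# usage (requires downloading the pretrained weights):
# model = FrozenDinoSegmenter()
# logits = model(torch.randn(1, 3, 518, 518))             # (1, 19, 518, 518)
```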
Overall, we consider two families of inlier segmentation models: discriminative and generative. The main difference between the two lies in how the classification head \(\textbf{G}_{\theta _g}\) is implemented.
Generative Classifier We adopt the generative classification formulation proposed in Liang et al. (2022), which replaces the linear softmax classification head by learning the class-conditional density \(p(\textbf{x}|k)\) of each pixel with Gaussian Mixture Models (GMMs), where each class is represented by a separate GMM with a uniform prior on the component weights. Formally:

\(p(\textbf{x}\,|\,k) = \sum _{c=1}^{C} \pi _{kc}\, \mathcal {N}(\textbf{x};\, \mu _{kc}, \Sigma _{kc})\)    (2)

where C is the number of components per GMM, \(\pi _{kc}\) is the mixture weight of component c of class k, \(\mu _{kc}\) and \(\Sigma _{kc}\) are the corresponding mean and covariance matrix, and \(\mathcal {N}\) is the Gaussian distribution. The GMM parameters are learned with a variant of the Expectation-Maximization (EM) algorithm called Sinkhorn EM, which adds constraints that enforce an even assignment of features to mixture components, thereby improving training stability. For more details, please refer to Liang et al. (2022).
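The density evaluation of such per-class GMMs can be sketched as follows; this only illustrates how \(p(\textbf{x}|k)\) is computed for given parameters, while the Sinkhorn-EM fitting used in GMMSeg is not shown, and all shapes are illustrative.

```python
import torch

def gmm_class_log_density(feats, means, covs):
    """Log p(x | k) under per-class GMMs with uniform component weights.

    feats: (N, D) pixel features; means: (K, C, D) and covs: (K, C, D, D)
    for K classes with C Gaussian components each. Returns (N, K) log-densities.
    """
    K, C, D = means.shape
    comp = torch.distributions.MultivariateNormal(
        means.reshape(K * C, D), covariance_matrix=covs.reshape(K * C, D, D))
    # log N(x; mu_kc, Sigma_kc) for every (class, component) pair -> (N, K, C)
    log_probs = comp.log_prob(feats[:, None, :]).reshape(-1, K, C)
    # uniform prior pi_kc = 1/C: logsumexp over components minus log C
    return torch.logsumexp(log_probs, dim=-1) - torch.log(torch.tensor(float(C)))

# toy usage: 5 pixel features, 19 classes, 3 components, 16-dim features
log_pk = gmm_class_log_density(torch.randn(5, 16), torch.randn(19, 3, 16),
                               torch.eye(16).expand(19, 3, 16, 16))   # (5, 19)
```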
Discriminative Classifier We train a single linear layer as a discriminative classifier. In this version, the parameters of the model \(\theta \) are supervised by the cross-entropy loss:

\(\mathcal {L}_{\text {CE}}(\theta ) = - \mathbb {E}_{(x, y) \sim \mathcal {D}}\left[ \log p(y \,|\, x, \theta ) \right]\)    (3)

where \(\mathcal {D}\) is the set of image-label pairs, and \(p(k | x, \theta )\) is the softmax output for class k after the features are first mapped to class logits, \(\hat{y} = \textbf{F}_{\theta }(x)\):

\(p(k \,|\, x, \theta ) = \dfrac{\exp (\hat{y}_k)}{\sum _{j \in \mathcal {Y}} \exp (\hat{y}_j)}\)    (4)
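As a minimal example of this discriminative head and its per-pixel supervision (the feature width, number of classes, and tensor shapes are illustrative, not the values used in our experiments):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# hypothetical shapes: decoder features (B, C_d, H, W), labels (B, H, W) in {0..K-1}
decoder_feats = torch.randn(2, 256, 64, 64)
labels = torch.randint(0, 19, (2, 64, 64))

head = nn.Conv2d(256, 19, kernel_size=1)        # linear per-pixel classifier G
logits = head(decoder_feats)                     # class logits F_theta(x), shape (B, K, H, W)
loss = F.cross_entropy(logits, labels)           # -log softmax at the true class, averaged
probs = logits.softmax(dim=1)                    # p(k | x, theta)
```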
3.4 Unknown Estimation Module (UEM)
At this stage, we assume the existence of an inlier segmentation model trained as described in Section 3.3. The unknown estimation module (UEM) consists of a projection module \(\textbf{P}_{\phi _p}\), a 3-layer Multi-Layer Perceptron (MLP) that takes the output of the frozen backbone and produces a projected feature map in \(\mathbb {R}^{C_p \times H \times W}\), where \(C_p\) is the hidden dimension of the projection module, as follows:
After that, the projected features are fed to a classification head \(\textbf{G}_{\phi _g}\) with two classes: one class maps to the OoD distribution and the other to a generic inlier distribution learned directly from the backbone. Hence, the output of UEM, \(\textbf{U}\in \mathbb {R}^{2 \times H \times W}\), is defined as follows:
The classifier head \(\textbf{G}_{\phi _g}\) can be defined as either a generative or a discriminative classifier, as detailed in Section 3.3. This means that regardless of whether the inlier segmentation model is generative or discriminative, we can train either a generative or a discriminative UEM on top of it, which allows for more flexibility in the design choices. We assume that the outputs of the UEM module represent the likelihoods of a sample under both the inlier and the unknown distributions. More specifically, we denote \(\tilde{p}_{\text {out}} \in \mathbb {R}^{H \times W}\) as the likelihood of \(\textbf{x}\) being an outlier, and \(\tilde{p}_{\text {in}} \in \mathbb {R}^{H \times W}\) as the likelihood of \(\textbf{x}\) being an inlier. They are computed from the UEM module as follows:
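A compact sketch of the discriminative UEM variant is shown below; the 3-layer MLP and two-way head follow the description above, while the hidden sizes, the GELU activation, and the channels-last layout are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class UEM(nn.Module):
    """Lightweight Unknown Estimation Module (discriminative variant, a sketch)."""

    def __init__(self, backbone_dim: int = 768, proj_dim: int = 256):
        super().__init__()
        self.project = nn.Sequential(                # P_{phi_p}: 3-layer MLP on pixel features
            nn.Linear(backbone_dim, proj_dim), nn.GELU(),
            nn.Linear(proj_dim, proj_dim), nn.GELU(),
            nn.Linear(proj_dim, proj_dim),
        )
        self.head = nn.Linear(proj_dim, 2)            # G_{phi_g}: [inlier, outlier] scores

    def forward(self, backbone_feats: torch.Tensor) -> torch.Tensor:
        # backbone_feats: (B, H, W, C_e) frozen features; output U: (B, H, W, 2)
        return self.head(self.project(backbone_feats))

uem = UEM()
u = uem(torch.randn(1, 37, 37, 768))
p_in_tilde, p_out_tilde = u[..., 0], u[..., 1]        # confidences used in the LLR score
```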
3.5 Log-Likelihood Ratio Score
First, we outline the formulation assuming that both the inlier segmentation and the UEM module have a generative classification head.
Generative We propose the log-likelihood ratio as an OoD scoring function \(\mathcal {S}_{\text {out}}\), where the likelihood ratio is defined in (1). For this, we need to define \(p_{\text {out}}\) and \(p_{\text {in}}\). We simply set the outlier distribution to the one predicted by UEM, \(p_{\text {out}} = \tilde{p}_{\text {out}}\), where \(\tilde{p}_{\text {out}}(\textbf{x}) \sim \text {GMM}\) with a uniform prior \(\pi _{c}^{\text {out}} = \frac{1}{C}\):

\(\tilde{p}_{\text {out}}(\textbf{x}) = \sum _{c=1}^{C} \pi _{c}^{\text {out}}\, \mathcal {N}(\textbf{x};\, \mu _{c}^{\text {out}}, \Sigma _{c}^{\text {out}})\)    (8)

where C is the number of components and \(\mathcal {N}\) is the Gaussian distribution. As for the inlier distribution \(p_{\text {in}}\), we define it by combining \(\tilde{p}_{\text {in}}\) with the likelihood that a sample is an inlier according to the inlier segmentation model. Due to the independence of the two sources of inlier confidence, \(p_{\text {in}}\) can be defined as their product: \(p_{\text {in}} = \tilde{p}_{\text {in}}\cdot \hat{p}_{\text {in}}\). In this case, \(\tilde{p}_{\text {in}}(\textbf{x})\) takes the same form as \(\tilde{p}_{\text {out}}(\textbf{x})\) in (8). As for \(\hat{p}_{\text {in}}(\textbf{x})\), we have:
Hence, we can write more explicitly:
However, since the inlier segmentation model is a generative classifier, we have \(p(k|\textbf{x}) \sim \text {GMM}\), and following Liang et al. (2022), the log probability is used as the class logit score, i.e., \(\textbf{F}_k(\textbf{x}) = \log p(k|\textbf{x})\), which yields the full log-likelihood ratio score:
LLR unifies the confidence values of the inlier model and our proposed UEM in a single score.
Discriminative In this case, \(p_{\text {in}}\) and \(p_{\text {out}}\) are defined as \(\tilde{p}_{\text {in}}\) and \(\tilde{p}_{\text {out}}\) respectively, as defined in (7). The difference compared to the generative case is that these terms are not defined as GMMs, but rather as logits computed through a linear classifier. More formally, the main difference between the discriminative and the generative cases is the form of the classifier \(\textbf{G}_{\phi _g}\), which is mathematically defined as in (2) for the generative case, and defined as in (4) for the discriminative classifier. As for \(\hat{p}_{\text {in}}\), we define it to be the maximum of class logits defined in (4). Hence, the LLR score in this case can be written as:
As the normalization term \(\log {\sum _{k \in \mathcal {Y}}{\exp (\textbf{F}_k(\textbf{x}))}}\) does not affect the maximum, we obtain the final form of the scoring function as follows:
This shows that the generative and discriminative case formulations converge to the same equation.
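To make this final form concrete, a small sketch follows; the tensor layout, shapes, and function name are illustrative assumptions, and the per-class scores come from the frozen inlier model (log-densities in the generative case, logits in the discriminative one).

```python
import torch

def llr_score(uem_out: torch.Tensor, uem_in: torch.Tensor,
              inlier_logits: torch.Tensor) -> torch.Tensor:
    """Unified log-likelihood-ratio OoD score (sketch of the final form).

    uem_out, uem_in: (B, H, W) UEM scores for the proxy outlier and the generic
        inlier distribution (log GMM densities or linear-head logits).
    inlier_logits: (B, K, H, W) per-class scores of the frozen inlier model
        (log p(k|x) for the generative classifier, logits otherwise).
    """
    # S_out(x) = log p~_out(x) - log p~_in(x) - max_k F_k(x)
    return uem_out - uem_in - inlier_logits.max(dim=1).values
```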
3.6 Log-Likelihood Ratio Loss
The proposed unknown estimation module \(\textbf{U}_{\phi }\), for both discriminative and generative cases, is supervised by the LLR loss generically defined as follows:
where \(\tilde{\textbf{y}}\) is a binary label map denoting known and pseudo-outlier pixels, BCE is the Binary Cross-Entropy loss, \(\mathcal {L}_{\text {GMM}}\) is the loss used to train the GMM components when the generative classifier is used, as in Liang et al. (2022), and \(\alpha \) is a coefficient that weights the GMM loss; it is set to \(\alpha = 0\) when training a discriminative classifier, distinguishing the two cases. \(\mathcal {L}_{\text {GMM}}\) consists of two terms as follows:
where \(\beta \) is a weighting coefficient and \(\mathcal {L}_{\text {CE}}\) is the cross-entropy loss applied to the output logit scores of the generative classifier head \(\textbf{F}(\textbf{x})\). \(\mathcal {L}_{\text {contrast}}\) contrasts every component of each class GMM with all other components, both within the same class and across the other classes. Please refer to Liang et al. (2022) for details.
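A hedged sketch of this objective is given below; whether BCE acts on the raw LLR score or on a squashed version of it is an implementation detail, and the with-logits form used here is only one reasonable choice.

```python
from typing import Optional
import torch
import torch.nn.functional as F

def llr_loss(llr: torch.Tensor, outlier_mask: torch.Tensor,
             gmm_loss: Optional[torch.Tensor] = None, alpha: float = 0.0) -> torch.Tensor:
    """Sketch of the LLR training objective.

    llr: (B, H, W) log-likelihood-ratio scores, treated as logits for the outlier class.
    outlier_mask: (B, H, W) binary map, 1 for pseudo-outlier pixels and 0 for known pixels.
    gmm_loss: optional GMM term (cross-entropy + contrastive) used by the generative
        UEM; alpha = 0 recovers the purely discriminative case.
    """
    bce = F.binary_cross_entropy_with_logits(llr, outlier_mask.float())
    return bce if gmm_loss is None else bce + alpha * gmm_loss
```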
4 Experiments
4.1 Experimental Setup
In our experiments, we use an inlier segmentation network composed of a feature extractor, an FPN Lin et al. (2017) pixel decoder, and a generative classification head (GMMSeg Liang et al. (2022)). The feature extractor is frozen, and we train the pixel decoder and segmentation head in the first stage on random patches of size \(518 \times 1036\) taken from the Cityscapes dataset Cordts et al. (2016). In the second stage, we train our unknown estimation module using a modified version of AnomalyMix Tian et al. (2022), where we randomly cut and paste objects from the COCO dataset Lin et al. (2014) onto the training data for outlier supervision. During outlier supervision, all the trained parameters of the main segmentation network are frozen to maintain inlier performance. Finally, we maintain the training resolution during inference, using a sliding-window approach to cover the whole image.
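The outlier injection can be sketched as a simple cut-and-paste operation, shown below in a simplified form (no random scaling, rotation, or placement); the outlier label id used here is an arbitrary placeholder, not the value from our implementation.

```python
import torch

def paste_outlier(image, label, ood_image, ood_mask, ood_label: int = 254):
    """Cut-and-paste outlier mixing in the spirit of AnomalyMix (simplified sketch).

    image: (3, H, W) inlier training image, label: (H, W) inlier label map.
    ood_image / ood_mask: an object crop from COCO and its binary mask, assumed
    already resized to (3, H, W) and (H, W). Pixels under the mask become
    pseudo-outliers with the hypothetical label id `ood_label`.
    """
    mask = ood_mask.bool()
    mixed = torch.where(mask.unsqueeze(0), ood_image, image)
    mixed_label = label.clone()
    mixed_label[mask] = ood_label
    return mixed, mixed_label
```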
Evaluation Datasets and Metrics We report the performance on SMIYC Chan et al. (2021) Anomaly Track (SMIYC-AT), Obstacle Track (SMIYC-OT), RoadAnomaly Lis et al. (2019), and the validation set of Fishyscapes LostandFound (FS LaF) Blum et al. (2021). SMIYC-AT and RoadAnomaly are real-world images featuring one or several OoD objects of varying sizes and categories. SMIYC-OT and FS LaF assess the model’s capability to identify small-sized obstacles on the road. We evaluate the performance of our method using common pixel-wise anomaly segmentation metrics: Average Precision (AP) and False Positive Rate (FPR) at True Positive Rate of 95%.
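For reference, both metrics can be computed from flattened pixel scores and labels with scikit-learn, treating OoD pixels as the positive class; the official benchmark implementations may differ in details such as void-pixel handling, so the snippet below is only a sanity-check sketch.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_curve

def ood_metrics(scores: np.ndarray, labels: np.ndarray):
    """Pixel-wise AP and FPR at 95% TPR; labels are 1 for OoD pixels, 0 for inliers."""
    ap = average_precision_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr95 = fpr[np.searchsorted(tpr, 0.95)]   # FPR at the first threshold reaching 95% TPR
    return ap, fpr95

# toy usage with random scores and labels
ap, fpr95 = ood_metrics(np.random.rand(1000), np.random.randint(0, 2, 1000))
```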
4.2 Quantitative Results
Backbone Feature Extractor The proposed Unknown Estimation Module (UEM) builds on a strong backbone model as the feature extractor. The backbone plays a critical role by encoding images into a rich representation space, which helps first model the inliers and then differentiate the outliers with the UEM. We compare the performance of three different backbones for feature extraction, including a self-supervised one, DINOv2 Oquab et al. (2024); a contrastive one, CLIP Radford et al. (2021); and a supervised one, Hierarchical Swin Transformer Liu et al. (2021) for the baseline segmentation network.
Table 1 shows the mIoU performance of the inlier network using different backbones on Cityscapes and the anomaly segmentation performance of UEM trained on top of the inlier network on RoadAnomaly and Fishyscapes. While DINO and Swin show comparable performance on inlier data, DINO significantly outperforms Swin in handling outliers. CLIP shows lower inlier performance than both but surpasses Swin in outlier detection. This difference in outlier performance can be attributed to the pre-training of DINO and CLIP on larger and more diverse datasets, which results in more robust feature representations capable of effectively modeling both in-distribution and out-of-distribution data.
Fig. 2: Qualitative Results on SMIYC-AT (G-G). The second column (ID Score) shows the outlier score from using the GMM without outlier supervision. The third column (OoD Score) shows the anomaly score from the fine-tuned OoD detection head. The fourth column (LR Score) shows our proposed likelihood formulation. The likelihood formulation combines information from both and predicts more accurate OoD score maps.
Improvements from Likelihood Ratio We question whether the likelihood ratio is necessary for unknown segmentation. To investigate this, we push the performance of a generative model that requires no additional outlier training by using more powerful backbones. GMMSeg estimates class probability densities, allowing it to directly compute an anomaly score based on the likelihood of the maximum component without requiring outlier training. We use GMMSeg’s density estimate (ID) as a baseline. We also consider the density estimate of the proxy OoD distribution alone as a scoring function (OoD). Lastly, we use the likelihood ratio scoring (LR), which integrates information from both distributions. Table 2 illustrates the performance improvements of the two scoring functions compared to the density estimates from the GMMSeg. The results consistently demonstrate that the likelihood ratio formulation provides better performance over inlier density estimates or the OoD scoring alone, highlighting the advantages of our approach.
We qualitatively compare the three scoring functions in Fig 2. The in-distribution (ID) score demonstrates lower precision due to its tendency to favor known classes. In contrast, the out-of-distribution (OoD) scoring detects outliers very confidently but at the cost of increasing false positives so as not to miss any outliers. The proposed likelihood ratio (LR) balances the two, leveraging the strengths of each to achieve the best results in terms of both inliers and outliers.
Quantitative Comparison to State-of-the-Art Table 3 shows our results compared to state-of-the-art methods on four datasets, with the average performance over datasets in the last column. As each dataset has different characteristics, the existing methods behave differently across the datasets. The top-performing methods include the recently proposed mask-based models RbA Nayal et al. (2023), EAM Grcić et al. (2023), and Mask2Anomaly Rai et al. (2023). While these methods achieve impressive performance in terms of accuracy, reasoning at the mask level hurts FPR, as marking a whole mask as outlier introduces several false positives at once. Its negative effect on small objects can be seen in the high FPR on SMIYC-OT and FS LaF. Our method achieves significantly lower FPR on these two datasets while being among the top-performing methods in terms of AP. Our method also achieves impressive accuracy levels on real-world images of SMIYC-OT and Road Anomaly, increasing AP by 1.85 and 1.18, respectively, without causing high FPR. Averaged across the four datasets in the last column, our method sets a new state-of-the-art in both metrics, outperforming the previous state-of-the-art by \(3.71\%\) in AP and \(0.27\%\) in FPR.
Training data impacts performance significantly. Both RbA and EAM are trained on Mapillary and Cityscapes datasets, whereas we train our inlier model only on Cityscapes. Additionally, EAM uses ADE20K Zhou et al. (2017) for outlier supervision, which contains a broader range of classes than COCO. We only use COCO to ensure a fair comparison to other methods. We also note that outlier supervision used in most other methods negatively impacts the performance of the inlier segmentation network as reported in Tian et al. (2022), Nayal et al. (2023).
Qualitative Comparison to State-of-the-Art In Fig 3, we qualitatively compare UEM (G-G) with the previous state-of-the-art methods, mask-level RbA Nayal et al. (2023) and pixel-level PEBAL Tian et al. (2022). First, there is a difference in the range of anomaly scores produced by these methods, leading to large variations in color, especially for RbA. RbA shows a highly skewed anomaly score margin with large variations in the anomaly scores for the inlier objects. Our method, UEM, produces more calibrated anomaly maps, as its likelihood-ratio-based scoring produces a more balanced anomaly score distribution with cleaner, more interpretable masks. The highly noisy results of PEBAL highlight the challenge of achieving smooth predictions with pixel-level methods like PEBAL or ours, compared to mask-level approaches such as RbA. It is important to note that these significant visual differences are not reflected in the quantitative results shown in Table 3, as the evaluation region is limited to the road.
Fig. 4: Qualitative Comparison of Our Model Variations. The discriminative variation generally assigns higher anomaly scores, leading to better detection of certain objects. The generative variation is more conservative, producing cleaner masks but sometimes missing objects. The generative-inlier, discriminative-outlier variation balances between the two, offering intermediate results.
Is DINOv2 All You Need? To assess the backbone’s impact, we compare our approach to PEBAL Tian et al. (2022) and RbA Nayal et al. (2023). We use their scoring functions to train our segmentation network with the DINOv2 backbone and adjust the outlier supervision process to match their original implementations. As shown in Table 4, DINOv2 significantly improves PEBAL’s performance on Road Anomaly, SMIYC-AT, and SMIYC-OT across both metrics and results in a lower FPR on FS LaF. We can attribute these improvements to the more robust backbone. For RbA, AP on Road Anomaly improves, but the other metrics are better with the original model based on Mask2Former. This performance drop is likely due to the lack of multi-resolution hierarchical feature maps in the DINOv2 architecture, which are essential for the Mask2Former decoder to process multi-scale features effectively. Additionally, we note that the original Swin-L backbone in RbA had 197M parameters, while our version of DINOv2 has only 86M parameters. Finally, our outlier scoring function performs best overall without modifying any original network parameters, a critical constraint for real-world applications. Both other methods are reported to lose at least 1% mIoU during fine-tuning. We found this effect exacerbated with DINO, requiring careful adjustments to mitigate the inlier performance loss.
Discriminative vs. Generative Modeling of the Estimation Module Our unknown estimation module models two distributions during fine-tuning. Each distribution can be modeled either generatively as a data density with GMMs or explicitly with a linear mapping layer. In Table 3, we evaluate the possible combinations for each. We find that both discriminative and generative classifiers outperform the previous state-of-the-art methods, with the fully discriminative classifier for OoD modeling being slightly better. We omit the discriminative-inlier and generative-outlier (D-G) combination, as we find that the GMM takes too long to converge due to the unbounded range of values coming from the MLP.
We visualize the different combinations of our model in Fig 4. Qualitatively, discriminative heads tend to assign higher anomaly scores, which helps to capture some anomalies more effectively, increasing the accuracy. This could potentially lead to a higher false positive rate; however, because the evaluation benchmark restricts the assessment to the region of interest, i.e., the road, discriminative heads achieve better performance on the benchmark.
On the Number of Parameters in UEM The original segmentation network consists of 101M parameters. Our UEM module introduces an additional 788K parameters, representing a less than 1% increase in the overall model size. Despite this minimal parameter overhead, the UEM module significantly enhances OoD detection performance.
5 Conclusion and Future Work
In this work, we propose a novel strategy to utilize proxy outlier data for improved OoD detection without retraining the entire network. This allows us to build on the robust representation space of large foundational models, significantly enhancing the generalization capability of the proposed approach. We propose an unknown estimation module (UEM) that can be integrated into the existing segmentation networks to identify OoD objects effectively. We develop an OoD scoring function based on the likelihood ratio by combining UEM’s outputs with inlier predictions. Our method sets a new state-of-the-art in outlier segmentation across multiple datasets, without causing any drops in the inlier performance.
For future work, we aim to investigate how the choice of the proxy out-of-distribution (OoD) dataset influences the generalization performance of our method. In this study, we utilized the COCO dataset as the proxy OoD data for a fair comparison with other approaches; mining more realistic outliers from real-world OoD datasets Shoeb et al. (2024) is a promising direction we plan to explore.
References
Ackermann, J., Sakaridis, C., & Yu, F. (2023). Maskomaly: Zero-shot mask anomaly segmentation. In The British Machine Vision Conference (BMVC).
Aydemir, G., Xie, W., & Guney, F. (2023). Self-supervised object-centric learning for videos. Advances in Neural Information Processing Systems, 36, 32879–32899.
Bishop, C. M. (1993). Novelty detection and neural network validation. In: ICANN ’93: Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, The Netherlands, 13–16 September 1993. pp. 789–794. Springer.
Blum, H., Sarlin, P. E., Nieto, J., Siegwart, R., & Cadena, C. (2021). The fishyscapes benchmark: Measuring blind spots in semantic segmentation. International Journal of Computer Vision, 129(11), 3119–3135.
Blumenkamp, J., Morad, S., Gielis, J., & Prorok, A. (2024). Covis-net: A cooperative visual spatial foundation model for multi-robot applications. In: 8th Annual Conference on Robot Learning.
Chan, R., Lis, K., Uhlemeyer, S., Blum, H., Honari, S., Siegwart, R., Fua, P., Salzmann, M., & Rottmann, M. (2021). Segmentmeifyoucan: A benchmark for anomaly segmentation. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Chan, R., Rottmann, M., & Gottschalk, H. (2021). Entropy maximization and meta classification for out-of-distribution detection in semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5128–5137.
Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., & Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1290–1299.
Cheng, B., Schwing, A., & Kirillov, A. (2021). Per-pixel classification is not all you need for semantic segmentation. Advances in neural information processing systems, 34, 17864–17875.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3213–3223.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations.
Gal, Y., & Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In: International Conference on Machine Learning. pp. 1050–1059. PMLR.
Galesso, S., Argus, M., & Brox, T. (2023). Far away in the deep space: Dense nearest-neighbor-based out-of-distribution detection. In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). pp. 4479–4489.
Grcić, M., Bevandić, P., & Šegvić, S. (2022). Densehybrid: Hybrid anomaly detection for dense open-set recognition. In: European Conference on Computer Vision. pp. 500–517. Springer.
Grcić, M., Šarić, J., & Šegvić, S. (2023). On advantages of mask-level recognition for outlier-aware segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2937–2947.
Grcic, M., Bevandic, P., Kalafatic, Z., & Šegvic, S. (2024). Dense out-of-distribution detection by robust learning on synthetic negative data. Sensors, 24(4), 1248.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. In: International conference on machine learning. pp. 1321–1330. PMLR.
Haldimann, D., Blum, H., Siegwart, R., & Cadena, C. (2019). This is not what i imagined: Error detection for semantic segmentation through visual dissimilarity. arXiv preprint arXiv:1909.00676
Hendrycks, D., & Gimpel, K. (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks. In: International Conference on Learning Representations.
Hendrycks, D., Mazeika, M., & Dietterich, T. (2019). Deep anomaly detection with outlier exposure. In: International Conference on Learning Representations.
Jiang, H., Kim, B., Guan, M., & Gupta, M. (2018). To trust or not to trust a classifier. Advances in neural information processing systems 31.
Kendall, A., & Gal, Y. (2017). What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems 30.
Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems 30.
Lee, K., Lee, K., Lee, H., & Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems 31.
Li, F., Zhang, H., Xu, H., Liu, S., Zhang, L., Ni, L. M., & Shum, H. Y. (2023). Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3041–3050.
Liang, C., Wang, W., Miao, J., & Yang, Y. (2022). Gmmseg: Gaussian mixture based generative semantic segmentation models. Advances in Neural Information Processing Systems, 35, 31360–31375.
Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 2117–2125.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In: Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13. pp. 740–755. Springer.
Lis, K., Nakka, K., Fua, P., & Salzmann, M. (2019). Detecting the unexpected via image resynthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2152–2161.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10012–10022.
Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai, X., Houlsby, N., Tran, D., & Lucic, M. (2021). Revisiting the calibration of modern neural networks. Advances in neural information processing systems, 34, 15682–15694.
Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., & Lakshminarayanan, B. (2019). Do deep generative models know what they don’t know? In: International Conference on Learning Representations.
Nayal, N., Yavuz, M., Henriques, J. F., & Güney, F. (2023). Rba: Segmenting unknown regions rejected by all. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 711–722.
Nekrasov, A., Hermans, A., Kuhnert, L., & Leibe, B. (2023). Ugains: Uncertainty guided anomaly instance segmentation. In: DAGM German Conference on Pattern Recognition. pp. 50–66. Springer.
Nguyen, V.N., Groueix, T., Ponimatkin, G., Lepetit, V., & Hodan, T. (2023). Cnos: A strong baseline for cad-based novel object segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2134–2140.
Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., & Assran, M. (2024). Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., & Krueger, G. (2021). Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR.
Rai, S.N., Cermelli, F., Fontanel, D., Masone, C., & Caputo, B. (2023). Unmasking anomalies in road-scene segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4037–4046.
Ranzinger, M., Heinrich, G., Kautz, J., & Molchanov, P. (2024). Am-radio: Agglomerative visual foundation model – reduce all domains into one. In: CVPR.
Ren, J., Liu, P.J., Fertig, E., Snoek, J., Poplin, R., Depristo, M., Dillon, J., & Lakshminarayanan, B. (2019). Likelihood ratios for out-of-distribution detection. Advances in neural information processing systems 32.
Shoeb, Y., Chan, R., Schwalbe, G., Nowzad, A., Güney, F., & Gottschalk, H. (2024). Have we ever encountered this before? retrieving out-of-distribution road obstacles from driving scenes. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 7396–7406.
Tian, Y., Liu, Y., Pang, G., Liu, F., Chen, Y., & Carneiro, G. (2022). Pixel-wise energy-biased abstention learning for anomaly segmentation on complex urban driving scenes. In: European Conference on Computer Vision. pp. 246–263. Springer.
Vojir, T., Šipka, T., Aljundi, R., Chumerin, N., Reino, D.O., & Matas, J. (2021). Road anomaly detection by partial image reconstruction with segmentation coupling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15651–15660.
Vojíř, T., Šochman, J., Aljundi, R., & Matas, J. (2023). Calibrated out-of-distribution detection with a generic representation. In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). pp. 4509–4518. IEEE.
Vojíř, T., Šochman, J., & Matas, J. (2024). Pixood: Pixel-level out-of-distribution detection. In: European Conference on Computer Vision. pp. 93–109. Springer.
Wang, H., Li, Y., Yao, H., & Li, X. (2023). Clipn for zero-shot ood detection: Teaching clip to say no. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1802–1812.
Xia, Y., Zhang, Y., Liu, F., Shen, W., & Yuille, A.L. (2020). Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. pp. 145–161. Springer.
Zhang, A., & Wischik, D. (2022). Falsehoods that ml researchers believe about ood detection. In: NeurIPS ML Safety Workshop.
Zhang, H., Li, F., Qi, L., Yang, M. H., & Ahuja, N. (2024). Csl: Class-agnostic structure-constrained learning for segmentation including the unseen. Proceedings of the AAAI Conference on Artificial Intelligence., 38, 7078–7086.
Zhang, J., Herrmann, C., Hur, J., Polania Cabrera, L., Jampani, V., Sun, D., & Yang, M. H. (2023). A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence. Advances in Neural Information Processing Systems, 36, 45533–45547.
Zhang, M., Zhang, A., & McDonagh, S. (2021). On the out-of-distribution generalization of probabilistic image modelling. Advances in Neural Information Processing Systems, 34, 3811–3823.
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 633–641.
Acknowledgements
This project is co-funded by KUIS AI, Royal Society Newton Fund Advanced Fellowship (NAF\(\backslash \)R\(\backslash \)2202237), and the European Union (ERC, ENSURE, 101116486). Y. Shoeb acknowledges funding from the German Federal Ministry for Economic Affairs and Climate Action within the project “just better DATA”. We also thank Unvest R&D Center for their support. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.
Funding
Open access funding provided by the Scientific and Technological Research Council of Türkiye (TÜBİTAK).