1 Introduction

Open-set object detection (OSOD)Footnote 1 is the problem of correctly detecting known objects in images while adequately dealing with unknown objects (e.g., detecting them as unknown). Here, known objects are the class of objects that detectors have seen at training time, and unknown objects are those they have not seen before. It has attracted much attention recently (Miller et al., 2018; Dhamija et al., 2020; Joseph et al., 2021; Gupta et al., 2022; Singh et al., 2021; Han et al., 2022; Zheng et al., 2022).

Fig. 1

Illustration of OSOD-I, -II, and -III with image examples (top) and object class space (bottom). OSOD-I: Detect the known objects without being distracted by unknown objects. OSOD-II: Detect known and unknown objects as such, although ‘objectness’—what should or should not be a detection target—is ambiguous unless explicitly defined. OSOD-III: Detect known and unknown objects belonging to the same super-class as such.

Table 1 Proposed categorization of OSOD problems. “Det. target” indicates the target of detection. K and U indicate known and unknown objects.

Early studies of OSOD (Miller et al., 2018, 2019; Dhamija et al., 2020) consider how accurately detectors can detect known objects, without being distracted by unknown objects present in input images, which we will refer to as OSOD-I in what follows. Recent studies (Joseph et al., 2021; Gupta et al., 2022; Singh et al., 2021; Han et al., 2022; Zheng et al., 2022) have shifted the focus to detecting unknown objects as well. They follow the studies of open-set recognition (OSR) (Scheirer et al., 2013; Bendale & Boult, 2016; Oza & Patel, 2019; Vaze et al., 2022; Zhou et al., 2021) and aim to detect any arbitrary unknown objects while preserving detection accuracy for known-classes, which we will refer to as OSOD-II; see Fig. 1.

In this paper, we first point out a fundamental issue with the problem formulation of OSOD, which many recent studies rely on, specifically OSOD-II as defined above. OSOD-II requires detectors to detect both known-class and unknown-class objects. However, since unknown-class objects belong to an open set and can encompass any arbitrary classes, it is impossible for detectors to be fully aware of what to detect and what not to detect during inference. To address this, a potential approach is to design a detector that detects any “objects” appearing in images and classifies them as either known or unknown classes. However, this approach is not feasible due to the ambiguity in the definition of “objects.” For instance, should the tires of a car be considered as objects? It is important to note that such a difficulty does not arise in OSR since it is classification; it involves classifying a single input image as either known or unknown. In this setting, anything that is not known is automatically defined as unknown—even if it technically belongs to an open set. Additionally, the aforementioned issue makes it hard to evaluate the performance of methods. Existing studies employ metrics such as A-OSE (Miller et al., 2018) and WI (Dhamija et al., 2020), which primarily measure the accuracy of known object detection (i.e., OSOD-I) and are not suitable for evaluating unknown object detection with OSOD-II.

In light of the above, this paper next introduces a new problem formulation for OSOD, named OSOD-III. This formulation, previously neglected in existing research, is of substantial practical importance. OSOD-III uniquely focuses on unknown classes that are part of the same super-classes as the known classes, differentiating it from OSOD-II (as detailed in Table 1 and Fig. 1). An illustrative application is a traffic sign detector used in advanced driver-assistance systems (ADAS), which, having been pre-trained on existing traffic signs, is tasked with identifying newly introduced traffic signs as novel entities.

We design benchmark tests for OSOD-III using three existing datasets: Open Images (Kuznetsova et al., 2018), Caltech-UCSD Birds-200-2011 (CUB200) (Wah et al., 2011), and Mapillary Traffic Sign Dataset (MTSD) (Ertler et al., 2020). We then evaluate the performance of five recent methods (designed for OSOD-II), namely ORE (Joseph et al., 2021), Dropout Sampling (DS) (Miller et al., 2018), VOS (Du et al., 2022), OpenDet (Han et al., 2022), and OrthogonalDet (Sun et al., 2024). We also test a naive baseline method that classifies predicted boxes as known or unknown based on a simple uncertainty measure computed from predicted class scores. The results yield valuable insights. Firstly, the previous methods, known for their good performance on metrics such as A-OSE and WI, performed similarly to or even worse than our simple baseline when evaluated with average precision (AP) on unknown object detection, a more appropriate performance metric. It is worth mentioning that our baseline employs standard detectors trained conventionally, without any additional training steps or extra architectures. Secondly, and more importantly, further improvements are necessary to enable practical applications of OSOD(-III).

Our contributions are summarized as follows:

  • We highlight a fundamental issue with the problem formulation used in current OSOD studies, which renders it ill-posed and makes proper performance evaluation difficult.

  • We introduce a new variation of OSOD, named OSOD-III, which has been overlooked in the literature, yet holds practical importance.

  • We develop benchmark tests for OSOD-III using existing public datasets and present detailed analyses on the performance of existing OSOD methods.

2 Rethinking Open-set Object Detection

2.1 Formalizing Problems

We first formulate the problem of open-set object detection (OSOD). Previous studies refer to two different problems as OSOD without clarification. We use the names of OSOD-I and -II to distinguish the two, which are defined as follows.

Table 2 The class split employed in the standard benchmark test employed in recent studies of OSOD (Joseph et al., 2021; Gupta et al., 2022; Han et al., 2022; Zhao et al., 2022; Zheng et al., 2022; Wu et al., 2022). The numbers in parentheses indicate the number of categories.

OSOD-I   The goal is to detect all instances of known objects in an image without being distracted by unknown objects present in the image. We want to avoid mistakenly detecting unknown object instances as known objects.

OSOD-II  The goal is to detect all instances of known and unknown objects in an image, identifying them correctly (i.e., classifying them to known classes if known and to the “unknown” class otherwise).

OSOD-I and -II both consider applying a closed-set object detector (i.e., a detector trained on a closed set of object classes) to an open-set environment where the detector encounters objects of unknown classes. Their difference is whether or not the detector detects unknown objects. OSOD-I does not; its concern is the accuracy of detecting known objects. This problem was first studied in (Dhamija et al., 2020; Miller et al., 2018, 2019). On the other hand, an OSOD-II detector detects unknown objects as well, and thus the accuracy of detecting them also matters. OSOD-II is often considered as a part of open-world object detection (OWOD) (Joseph et al., 2021; Gupta et al., 2022; Zhao et al., 2022; Singh et al., 2021; Wu et al., 2022).

The existing studies of OSOD-II rely on OWOD (Joseph et al., 2021) for the problem formulation, which aims to generalize the concept of OSR (open-set recognition) to object detection. In OSR, unknown means “anything but known.” Its direct translation to object detection is that objects of any arbitrary class other than the known classes can be considered unknown. This formulation is reflected in the experimental settings employed as a common benchmark test in these studies. Table 2 shows the setting, which treats the 20 object classes of PASCAL VOC (Everingham et al., 2010) as known classes and the 60 non-overlapping classes among the 80 classes of COCO (Lin et al., 2014) as unknown classes. For instance, the first split comprises the 20 classes from the PASCAL VOC dataset (Everingham et al., 2010), the second split encompasses classes related to outdoor items, accessories, and home appliances, while the third split consists of classes pertaining to sports and food. This division of classes underscores the basic assumption that known and unknown objects are largely unrelated.

However, this formulation of OSOD-II has an issue that makes it ill-posed. The root cause is that the task is detection: detectors are requested to detect only the objects that should be detected, and judging whether or not something should be detected is a primary problem of object detection in the first place. What should not be detected includes the background and objects of irrelevant classes. Detectors learn to make this judgment, which is feasible for a closed set of object classes, since what to detect is specified. However, this does not apply to OSOD-II, which also aims to detect unknown objects as defined above. It is infeasible to specify in advance what to detect and what not to detect for arbitrary objects.

A naive solution to this difficulty is to detect anything as long as it is an “object.” However, this is not practical, since defining what constitutes an object is itself hard. Figure 2 provides examples from COCO images. COCO covers only 80 object classes (shown in red rectangles in the images), and many unannotated objects appear in the images (shown in blue rectangles). Is it necessary to consider every one of them? Moreover, it is sometimes subjective to determine what constitutes an individual “object.” For instance, a car consists of multiple parts, such as wheels, side mirrors, and headlights, which we may want to treat as “objects” depending on the application. This difficulty is well recognized in prior studies of open-world detection (Joseph et al., 2021; Gupta et al., 2022) and zero-shot detection (Bansal et al., 2018; Lu et al., 2016).

2.2 Metrics for Measuring OSOD Performance

The above difficulty also makes it hard to evaluate how well detectors detect unknown objects. Previous studies of OSOD employ two metrics for evaluating methods’ performance, namely absolute open-set error (A-OSE) (Miller et al., 2018) and wilderness impact (WI) (Dhamija et al., 2020). A-OSE is the number of predicted boxes that are in reality unknown objects but are wrongly classified as known classes (Miller et al., 2018). WI measures the ratio of the number of erroneous detections of unknowns as knowns (i.e., A-OSE) to the total number of detections of known instances, given by

$$\begin{aligned} \textrm{WI} = \frac{P_{K}}{P_{K \cup U}} - 1 = \frac{\text{A-OSE}}{\textrm{TP}_{known}+\textrm{FP}_{known}}, \end{aligned}$$
(1)

where \(P_{K}\) indicates the precision measured in the closed-set setting; \(P_{K \cup U}\) is that measured in the open-set setting; and \(\textrm{TP}_{known}\) and \(\textrm{FP}_{known}\) are the numbers of true positives and false positives for known classes, respectively.
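To make the relation between WI and A-OSE concrete, the following minimal sketch computes WI from raw detection counts according to Eq. (1); the function name and example numbers are ours, for illustration only.

```python
def wilderness_impact(a_ose: int, tp_known: int, fp_known: int) -> float:
    """Wilderness Impact (Eq. 1): the open-set error A-OSE normalized by the
    total number of detections assigned to known classes (TP + FP)."""
    return a_ose / (tp_known + fp_known)

# Example: 120 unknown objects detected as a known class, with 900 true positives
# and 100 false positives on known classes, gives WI = 120 / 1000 = 0.12.
print(wilderness_impact(120, 900, 100))
```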

These two metrics are originally designed for OSOD-I; they evaluate detectors’ performance in open-set environments. Precisely, they measure how frequently a detector wrongly detects and misclassifies unknown objects as known classes (lower is better).

Nevertheless, previous studies of OSOD-II have employed A-OSE and WI as primary performance metrics. We point out that these metrics are largely insufficient for evaluating OSOD-II detectors, since they cannot evaluate the accuracy of detecting unknown objects, as mentioned above. They evaluate only one type of error, i.e., detecting unknown objects as known, and ignore the other type, i.e., detecting known objects as unknown.

Fig. 2

Example images showing that “object” is an ambiguous concept. It is impractical to cover an unlimited range of object instances with a finite set of predefined categories.

Fig. 3

A-OSE (a) and WI (b) of different methods at different detector operating points. Smaller values mean better performance for both metrics. The horizontal axis indicates the confidence threshold for selecting bounding box candidates. The methods’ ranking varies with the choice of threshold.

In addition, we point out that A-OSE and WI are not flawless even as OSOD-I performance metrics. They merely measure detectors’ performance at a single operating point and cannot take into account the precision-recall tradeoff, which is fundamental to detection. Specifically, previous studies (Joseph et al., 2021) report A-OSE values for bounding boxes with confidence score \(\ge 0.05\)Footnote 2. As for WI, previous studies (Joseph et al., 2021; Gupta et al., 2022; Han et al., 2022; Singh et al., 2021) choose the operating point of recall \(= 0.8\). Thus, these metrics capture performance only partially, even though the operating point is ultimately left to end users. Figures 3(a) and (b) show the profiles of A-OSE and WI, respectively, over the possible operating points of several existing OSOD-II detectors. The ranking of the methods varies depending on the choice of confidence threshold.

In summary, A-OSE and WI are insufficient for evaluating OSOD-II performance since i) they merely measure OSOD-I performance, i.e., only one of the two error types, and ii) they are metrics at a single operating point. To precisely measure OSOD-II performance, we must use average precision (AP), the standard metric for object detection, also to evaluate unknown object detection. It should be noted that while all the previous studies of OSOD-II report APs for known object detection, only a few report APs for unknown detection, such as (Han et al., 2022; Wu et al., 2022), probably because of the mentioned difficulty of specifying what unknown objects to detect and what not.

2.3 Relation to Open-vocabulary Detection

Open-vocabulary object detection (OVD) (Zareian et al., 2021; Gu et al., 2022) is the problem of detecting specific object classes in a zero-shot manner by using text to indicate their class names and/or attributes. This problem has recently gained significant attention (Zhou et al., 2022; Liu et al., 2023; Wu et al., 2023; Li et al., 2022; Minderer et al., 2022; Yao et al., 2022, 2023). Since OVD can be said to aim at eliminating the concept of unknown classes, some might think that, if OVD were fully developed, the term “unknown” would become obsolete, calling the relevance of OSOD into question. However, this is not accurate, for several reasons.

First, not all objects targeted for detection can be effectively described in words. For example, the diverse types of damages in factory-produced products are often indescribable in language, sometimes only identifiable as ‘Damage Type No.1, No.2,’ etc. This limitation extends to newly created traffic signs, company logos, and more. Furthermore, since these items are constantly emerging worldwide, the database of object classes in OVD, which relies heavily on internet-sourced corpora (such as those used in CLIP (Radford et al., 2021)), can become outdated quickly. There are also concerns about the efficiency of OVD, particularly when implemented on resource-constrained hardware, like in automotive applications.

Conversely, the fundamental elements of both OVD and OSOD may reside in the structure of their feature spaces. This commonality suggests that integrating the two methods could potentially yield mutually beneficial solutions. However, further investigation into this prospect is a subject for future research.

2.4 Summary of Issues with OSOD-II

Our central claim in this section is: OSOD-II cannot be reliably performed unless the notion of “object” is defined explicitly. We add three important clarifications:

  • Objectness is often intrinsically ambiguous and therefore hard to specify precisely.

  • The benchmarks on which most OSOD-II studies rely (e.g., COCO-based datasets) do not resolve this ambiguity.

  • Even when objectness is explicitly defined, evaluation should employ average precision (AP), including for unknown objects, rather than alternative metrics such as A-OSE or WI.

3 OSOD-III: An Alternative Formulation

This section introduces an alternative formulation of OSOD. Although it has been overlooked in previous studies, the scenario frequently arises in practice. It is free from the fundamental issue of OSOD-II, enabling practical evaluation of methods’ performance and arguably making the problem easier to solve.

3.1 OSOD-III: Open at Class Level and Closed at Super-class Level

To motivate the problem we formalize in this section, we begin with two illustrative use cases.

Smartphone insect-detection app   A mobile application that identifies insects from camera images must contend with the vast, and still expanding, number of species. An initial release can cover only a limited subset of known insects. When a user photographs an unfamiliar specimen, the device should flag it as unknown, upload the image to a central server, and allow the model to be updated incrementally. This workflow demands a detector that is closed at the super-class level (“insect”) yet open to previously unseen species within that group.

Traffic-sign detection in advanced driver-assistance systems (ADAS)   Traffic-sign detectors are usually trained on the signs prevalent in the region where a vehicle operates. Signs from other jurisdictions, or newly introduced signs, lie outside this training set, making the detector’s response unpredictable. Labeling such signs as unknown is essential both for informing the driver of perceptual uncertainty and for collecting new examples that can be forwarded to a central server for continual learning. Because comprehensive sign inventories are rarely shared with vendors, a detector that is closed at the super-class level (“traffic sign”) but open to unseen sign classes provides a practical solution.

These problems are similar to OSOD-II in that we want to detect unknown, novel objects (e.g., unseen insect species or traffic signs). However, unlike OSOD-II, it is unnecessary to consider arbitrary objects as detection targets. In the insect example, we consider only insect classes; the detector does not need to detect any non-insect object, even one it has never seen. In other words, we consider the set of object classes closed at the super-class level (e.g., insects) and open at the individual class level under the super-class.

We call this problem OSOD-III. The differences between OSOD-I, -II, and -III are shown in Fig. 1 and Table 1. The problem is formally stated as follows:

OSOD-III   Assume we are given a closed set of object classes belonging to a single super-class. Then, we want to detect and classify objects of these known classes correctly and to detect every unknown class object belonging to the same super-class and classify it as “unknown.”

It is noted that there may be multiple super-classes instead of a single one. In that case, we need only consider the union of the super-classes. For simplicity, we consider only the case of a single super-class in what follows.
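The labeling rule implied by this definition can be summarized in a minimal sketch; the function and variable names below are ours, introduced only for illustration and not taken from any existing implementation.

```python
def osod3_target(class_name, known_classes, superclass_classes):
    """Map a ground-truth class to its OSOD-III detection target: its own label
    if known, 'unknown' if it is an unseen class of the same super-class, and
    None if it lies outside the super-class (not a detection target)."""
    if class_name in known_classes:
        return class_name
    if class_name in superclass_classes:
        return "unknown"
    return None  # e.g., a non-animal object under the "Animal" super-class

# Example with the "Animal" super-class: "Polar Bear" (known) keeps its label,
# "Lynx" (an unseen animal) maps to "unknown", and "Car" maps to None.
```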

3.2 Properties of OSOD-III

While the applicability of OSOD-III is narrower than OSOD-II by definition, OSOD-III has two good propertiesFootnote 3.

One is that OSOD-III is free from the fundamental difficulty of OSOD-II, i.e., the dilemma of determining which unknown objects to detect and which not to. Indeed, the judgment is clear with OSOD-III: unknowns belonging to the known super-class should be detected, and all other unknowns should notFootnote 4. As a result, OSOD-III no longer suffers from the evaluation difficulty. The clear identification of detection targets enables the computation of AP also for unknown objects.

The other is that detecting unknowns will arguably be easier owing to the similarity between known and unknown classes. In OSOD-II, unknown objects can be arbitrarily dissimilar from known objects. In OSOD-III, known and unknown objects share their super-class, which leads to visual similarity. It should be noted that what we regard as a super-class is arbitrary; there is no mathematical definition. However, as long as we consider a reasonable class hierarchy such as that of WordNet/ImageNet (Fellbaum, 1998; Deng et al., 2009), we may expect the sub-classes to share visual similarities.

We emphasize that OSOD-III is formulated independently of OSOD-II’s limitations; it is not intended as a remedy for OSOD-II but stands as a self-contained research problem with intrinsic merit.

4 Experimental Results

Based on the above formulation, we evaluate the performance of existing OSOD methods on the proposed OSOD-III scenario. In the following section, we first introduce our experimental settings to simulate the OSOD-III scenario and then report the evaluation results.

4.1 Experimental Settings

4.1.1 Datasets

We use the following three datasets for the experiments: Open Images Dataset v6 (Kuznetsova et al., 2018), Caltech-UCSD Birds-200-2011 (CUB200) (Wah et al., 2011)Footnote 5, and Mapillary Traffic Sign Dataset (MTSD) (Ertler et al., 2020). For each, we split classes into known/unknown and images into training/validation/testing subsets as explained below. Note that one of the compared methods, ORE (Joseph et al., 2021), needs validation images (i.e., example unknown-class instances), which may be regarded as leakage in OSOD problems. This does not apply to the other methods. Additionally, other datasets can also be used for the OSOD-III formulation, provided that superclass information is defined. Common datasets such as COCO (Lin et al., 2014) and Object365 (Shao et al., 2019) are potential candidates, as they offer some degree of superclass information.

Open Images   Open Images (Kuznetsova et al., 2018) contains 1.9M images of 601 classes of diverse objects with 15.9M bounding box annotations. It also provides a hierarchy of object classes in a tree structure, where each node represents a super-class and each leaf represents an individual object category. For instance, the leaf Polar Bear has the parent node Carnivore. We choose two super-classes, Animal and Vehicle, in our experiments because of their appropriate numbers of sub-classes, i.e., 96 and 24 sub-classes for the “Animal” and “Vehicle” super-classes, respectively. We split these sub-classes into known and unknown classes. To mitigate statistical biases, we consider four random splits and select one as the known-class set and the union of the other three as the unknown-class set.

We construct the training/validation/testing splits of images based on the original splits provided by the dataset. Specifically, we choose the images containing at least one known-class instance from the original training and validation splits. We choose the images containing either at least one known-class instance or at least one unknown-class instance from the original testing split. For the training images, we keep annotations for the known objects and eliminate all other annotations including unknown objects. It should be noted that there is a risk that those removed objects could be treated as the “background” class. For the validation and testing images, we keep the annotations for known and unknown objects and remove all other irrelevant objects.
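The filtering rules above can be sketched as follows, assuming COCO-style annotation dictionaries with 'image_id' and 'category' fields; the function names and fields are illustrative and do not reproduce our actual data-preparation code.

```python
def select_image_ids(annotations, known, unknown, subset):
    """Keep images with at least one known instance (train/val), or at least
    one known or unknown instance (test)."""
    targets = known if subset in ("train", "val") else known | unknown
    return {a["image_id"] for a in annotations if a["category"] in targets}

def filter_annotations(annotations, known, unknown, subset, image_ids):
    """Keep known-class annotations everywhere; keep unknown-class annotations,
    relabeled as 'unknown', only for val/test; drop all other annotations.
    Note that dropped objects may implicitly be treated as background."""
    kept = []
    for a in annotations:
        if a["image_id"] not in image_ids:
            continue
        if a["category"] in known:
            kept.append(a)
        elif a["category"] in unknown and subset in ("val", "test"):
            kept.append({**a, "category": "unknown"})
    return kept
```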

Table 3 Details of the employed class splits for Open Images dataset. We treat one of the four as a known set and the union of the other three as an unknown set. Thus, there are four cases of known/unknown splits, for each of which we report the detection performance in Table 6.

CUB200   Caltech-UCSD Birds-200-2011 (CUB200) (Wah et al., 2011) is a fine-grained dataset of 200 bird species. It contains 12K images, each annotated with a single bounding box. We split the 200 classes randomly into four splits, each with 50 classes. We then choose three to form the known-class set and treat the rest as the unknown-class set. We construct the training/validation/testing splits similarly to Open Images, with two notable exceptions. One is that we create the training/validation/test splits from the dataset’s original training/validation splits, because the dataset does not provide annotations for its original test split. The other is that we remove all images containing unknown objects from the training split, which makes the setting more rigorous.

MTSD   Mapillary Traffic Sign Dataset (MTSD) (Ertler et al., 2020) is a dataset of 400 diverse traffic sign classes from different regions around the world. It contains 52K street-level images with 260K manually annotated traffic sign instances. For the split of known/unknown classes, we consider a practical use case of OSOD-III, where a detector trained using data from a specific region is deployed in another region that might have unknown traffic signs. As the dataset does not provide region information for each image, we divide the 400 traffic sign classes into clusters based on their co-occurrence in the same images. Specifically, we apply normalized graph cut (Shi & Malik, 2000) to obtain three clusters, ensuring that any pair of clusters shares minimal co-occurrence. We then use the largest cluster as the known-class set (230 classes). Denoting the other two clusters by unknown1 (55 classes) and unknown2 (115 classes), we test three cases, i.e., using unknown1, unknown2, or their union (unknown1+2) as the unknown-class set, and report the results for all three. We create the training/validation/testing splits in the same way as for CUB200.
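The class clustering can be approximated by building a class co-occurrence matrix and partitioning it with spectral clustering; the sketch below uses scikit-learn's SpectralClustering as a stand-in for the normalized graph cut of Shi & Malik (2000), so it illustrates the idea rather than reproducing our exact procedure.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_sign_classes(classes_per_image, num_classes, n_clusters=3):
    """classes_per_image: list of sets of class indices present in each image.
    Returns one cluster label per traffic-sign class."""
    co = np.zeros((num_classes, num_classes))
    for present in classes_per_image:
        present = list(present)
        for i in present:               # accumulate pairwise co-occurrence counts
            for j in present:
                if i != j:
                    co[i, j] += 1.0
    return SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=0
    ).fit_predict(co)                   # co-occurrence counts serve as affinities
```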

Tables 3, 4, and 5 show the resulting class splits of each dataset, based on which known/unknown classes are selected, as well as the splits of training/validation/testing images. See the appendix for the detailed category names contained in each split.

Table 4 Details of the employed class splits for Caltech-UCSD Birds-200-2011 (CUB200) dataset. We treat the union of three of the four as known classes and the rest as unknown classes. Each split corresponds to the results shown in Table 7.

4.1.2 Evaluation

As discussed earlier, the primary metric for evaluating object detection performance is average precision (AP) (Felzenszwalb et al., 2010; Everingham et al., 2010). AP should also be used for unknown object detection, but the issue with OSOD-II makes this impractical. OSOD-III is free from this issue, so AP can be computed for unknown object detection as well. Therefore, following the standard evaluation procedure of object detection, we report AP averaged over IoU thresholds in the range [0.50, 0.95] for both known and unknown object detection.
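Concretely, \(\textrm{AP}_{unk}\) can be computed with the standard COCO evaluation tooling by registering “unknown” as an additional category in the ground truth and restricting the evaluated category IDs, as in the sketch below; the file names are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("osod3_test_gt.json")          # known classes plus an 'unknown' category
coco_dt = coco_gt.loadRes("detections.json")  # predictions, including 'unknown' boxes

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.params.catIds = coco_gt.getCatIds(catNms=["unknown"])  # restrict to 'unknown' -> AP_unk
ev.evaluate(); ev.accumulate(); ev.summarize()            # AP averaged over IoU 0.50:0.95
```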

4.2 Compared Methods

We consider five state-of-the-art OSOD methods (Joseph et al., 2021; Miller et al., 2018; Du et al., 2022; Han et al., 2022; Sun et al., 2024). While they were originally developed for OSOD-II, these methods can be applied to OSOD-III with little or no modification.

The detection pipelines, including methodology design, framework, and dataflow, remain unchanged from OSOD-II. Accordingly, superclass information is used solely to specify the known/unknown categories that users intend to detect. Neither the superclass name nor its hierarchical information is explicitly incorporated into the training process, whether as supervision or within the loss function.

Table 5 Details of the employed class splits for Mapillary Traffic Sign Dataset (MTSD). Each split corresponds to the results shown in Table 8.

We first summarize five methods and their configurations in our experiments below, followed by the introduction of a simple baseline method that detects unknown objects merely from the outputs of a standard detector.

ORE (Open World Object Detector) (Joseph et al., 2021) is initially designed for OWOD; it is capable not only of detecting unknown objects but also of incremental learning. We omit the latter capability and use the former as an open-set object detector. It employs an energy-based method to classify known/unknown; using the validation set, including unknown object annotations, it models the energy distributions for known and unknown objects. To compute AP for unknown objects, we use a detection score that ORE provides. Following the original paper (Joseph et al., 2021), we employ Faster RCNN (Ren et al., 2015) with a ResNet50 backbone (He et al., 2016) for the base detector.

DS (Dropout Sampling) (Miller et al., 2018) uses the entropy of class scores to discriminate known and unknown categories. Specifically, during the inference phase, it employs a dropout layer (Gal & Ghahramani, 2016) right before computing class logits and performs inference for n iterations. If the entropy of the averaged class logits over these iterations exceeds a threshold, the detected instance is assigned to the unknown class. The top-1 class score, calculated from the averaged class logits, is employed as the unknown score for computing unknown AP. The base detector is Faster RCNN with a ResNet50-FPN backbone (Lin et al., 2017). Following the implementation of (Han et al., 2022), we set the number of inference iterations n to 30, the entropy threshold \(\gamma _{ds}\) to 0.25, and the dropout probability p to 0.5.
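A rough sketch of this decision rule is given below; the tensor shapes and the use of softmax probabilities (rather than raw logits) for the entropy are our assumptions, not taken from the original implementation.

```python
import torch
import torch.nn.functional as F

def ds_decision(sampled_logits: torch.Tensor, gamma_ds: float = 0.25):
    """sampled_logits: (n, num_classes) class logits for one detected box over n
    dropout-enabled forward passes. Returns (is_unknown, detection_score)."""
    probs = F.softmax(sampled_logits, dim=-1).mean(dim=0)    # average over n passes
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()  # predictive entropy
    is_unknown = entropy.item() > gamma_ds                   # high entropy -> unknown
    return is_unknown, probs.max().item()                    # top-1 score as the score
```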

VOS (Virtual Outlier Synthesis) (Du et al., 2022) detects unknown objects by treating them as out-of-distribution (OOD) based on an energy-based method. Specifically, it estimates an energy value for each detected instance and judges whether it is known or unknown by comparing the energy with a threshold. We use the energy value to compute unknown AP. We choose Faster RCNN with ResNet50-FPN backbone (Lin et al., 2017), following the paper.

OpenDet (Open-set Detector) (Han et al., 2022) is the current state-of-the-art on the popular benchmark test designed using PASCAL VOC/COCO (shown in Table 2), although the methods’ performance is evaluated with inappropriate metrics of A-OSE and WI. OpenDet provides a detection score for unknown objects, which we utilize to compute AP. We use the authors’ implementation, which employs Faster RCNN based on ResNet50-FPN for the base detector.

OrthogonalDet (Sun et al., 2024) addresses low recall for unknown objects and their misclassification into known classes through two key mechanisms:

  1. It promotes decorrelation between objectness prediction and class label prediction, enforcing orthogonality in the feature space.

  2. It is based on RandBox (Wang et al., 2023), a Fast RCNN-like detector that removes the Region Proposal Network (RPN) from Faster RCNN. This modification introduces greater randomness in region proposals, allowing region selection to be independent of the known categories’ distribution, thereby achieving high unknown recall.

We use a ResNet50-FPN backbone for this method.

Simple Baselines   In addition to these existing methods, we also consider a naive baseline for comparison. It merely uses the class scores that standard detectors predict for each bounding box, relying on the expectation that unknown-class inputs should result in uncertain class predictions. Thus, we examine the prediction uncertainty to judge whether an input belongs to a known or unknown class. Specifically, we calculate the ratio of the top-1 to the top-2 class score for each candidate bounding box and compare it with a predefined threshold \(\gamma \); we regard the input as unknown if the ratio is smaller than \(\gamma \) and as known otherwise. We use the sum of the top three class scores as the detection score for unknown objects. In our experiments, we employ two detectors, FCOS (Tian et al., 2019) and Faster RCNN (Ren et al., 2015), with ResNet50-FPN as their backbone, following the above methods. For Open Images, we set \(\gamma =4.0\) for FCOS and \(\gamma =15.0\) for Faster RCNN. For CUB200 and MTSD, we set \(\gamma =1.5\) for FCOS and \(\gamma =3.0\) for Faster RCNN. Different thresholds are needed due to the difference in the number of classes and in the output layer design, i.e., logistic vs. softmax. Additionally, temperature scaling T can be applied to the softmax layer of Faster RCNN; we set \(T=1.0\) in our experiments unless otherwise specified. We report the sensitivity to the choice of \(\gamma \) and T in the appendix.
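A minimal sketch of this baseline’s decision rule follows; it assumes the per-box class scores (logistic scores for FCOS, softmax scores for Faster RCNN) are already available, and the helper name is ours.

```python
import torch

def baseline_known_unknown(class_scores: torch.Tensor, gamma: float):
    """class_scores: (num_known_classes,) scores of one candidate box.
    Returns ('unknown', unknown_score) or (known class index, class score)."""
    top_scores, top_idx = class_scores.topk(k=3)
    ratio = top_scores[0] / top_scores[1].clamp_min(1e-12)  # top-1 / top-2 score ratio
    if ratio < gamma:                                       # uncertain -> unknown
        return "unknown", top_scores.sum().item()           # sum of top-3 as unknown score
    return int(top_idx[0]), top_scores[0].item()

# e.g., gamma = 15.0 for Faster RCNN on Open Images, 3.0 on CUB200/MTSD (see above)
```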

Table 6 Detection accuracy of known (AP\(_{known}\)) and unknown objects (AP\(_{unk}\)) of different methods for Open Images dataset, “Animal” and “Vehicle” super-classes. “Split-n” indicates that the classes of Split-n are treated as known classes. “mean” is the average of all splits. Orth.Det represents OrthogonalDet.
Table 7 Detection accuracy for CUB200 (Wah et al., 2011). See Table 6 for notations.

4.3 Training Details

We train the models using the SGD optimizer with a batch size of 16 on 8 A100 GPUs. The number of epochs is 12, 80, and 60 for Open Images, CUB200, and MTSD, respectively. We use an initial learning rate of \(2.0 \times 10^{-2}\) with momentum \(=0.9\) and weight decay \(=1.0\times 10^{-4}\), and decay the learning rate by a factor of 10 at 2/3 and 11/12 of the total epochs. For Open Images and CUB200, we follow common multi-scale training practice and resize the input images such that their shorter side is between 480 and 800 while the longer side is 1333 or less. At inference time, we set the shorter side of input images to 800 and the longer side to at most 1333. For MTSD, we apply a similar scaling strategy to that of Open Images and CUB200 (i.e., multi-scale training and single-scale testing), except that the input size is doubled, e.g., the shorter side is between 960 and 1600 at training time. This aims to improve detection accuracy for the small objects that frequently appear in MTSD.
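For the Faster RCNN baseline, these hyperparameters roughly map onto detectron2’s solver and input options as sketched below; note that detectron2 counts SOLVER.STEPS in iterations, so the 2/3 and 11/12 epoch milestones must be converted using the dataset size, and the values shown are a sketch rather than our exact configuration files.

```python
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.SOLVER.IMS_PER_BATCH = 16
cfg.SOLVER.BASE_LR = 0.02
cfg.SOLVER.MOMENTUM = 0.9
cfg.SOLVER.WEIGHT_DECAY = 1e-4
cfg.SOLVER.GAMMA = 0.1                      # learning-rate decay factor (10x drop)
# cfg.SOLVER.MAX_ITER and cfg.SOLVER.STEPS (at ~2/3 and ~11/12 of training)
# depend on the number of training images in each dataset.
cfg.INPUT.MIN_SIZE_TRAIN = tuple(range(480, 801, 32))   # multi-scale training
cfg.INPUT.MAX_SIZE_TRAIN = 1333
cfg.INPUT.MIN_SIZE_TEST = 800                           # single-scale testing
cfg.INPUT.MAX_SIZE_TEST = 1333
```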

We used the publicly available source code for the implementation of OREFootnote 6 (Joseph et al., 2021), Dropout Sampling (DS)Footnote 7 (Miller et al., 2018), VOSFootnote 8 (Du et al., 2022), OpenDetFootnote 9 (Han et al., 2022), and OrthogonalDetFootnote 10. We used mmdetectionFootnote 11 (Chen et al., 2019) for FCOS (Tian et al., 2019) and detectron2Footnote 12 for Faster RCNN (Ren et al., 2015) to implement the baseline methods, respectively.

4.4 Results

4.4.1 Main Results

Tables 6, 7, and 8 present the results for Open Images, CUB200, and MTSD. They all show mAP for known-class objects (\(\textrm{AP}_{known}\)) and AP for unknown objects (\(\textrm{AP}_{unk}\)).

We can see a consistent trend regardless of the dataset. For all methods, while mAP for known classes is high, AP for the unknown class is low. Regarding the particularly low \(\textrm{AP}_{unk}\) scores on the “Vehicle” super-class in Table 6, we hypothesize two factors. First, DS and ORE, being among the earliest OSOD models, struggle to distinguish unknown objects, resulting in low unknown recall. Second, with only six known categories, the “Vehicle” split provides limited supervision, causing DS and ORE to overfit rather than generalize the superclass concept. Overall, a comparison of the methods shows that OpenDet detects the unknown class better than the other existing methods, while the unknown-detection performance of OrthogonalDet is comparable to or worse than that of the other methods across datasets. However, it is noteworthy that our naive baseline demonstrates comparable or even superior performance. This is surprising, considering that it requires no additional training or mechanism dedicated to unknown detection.

Table 8 Detection accuracy for MTSD (Ertler et al., 2020). K, U1, and U2 stand for the splits of Known, Unknown1, and Unknown2, respectively.

Nonetheless, even the best-performing methods achieve only modest accuracy in unknown detection, rendering them unsuitable for practical applications. While reasonable accuracy is achieved on the “Animal” super-class of Open Images, the evaluation on the more realistic MTSD yields significantly lower accuracy. This suggests that there is substantial room for further research on OSOD(-III), especially considering that our simple baseline achieves the highest accuracy, which indicates significant potential for improvement.

4.4.2 Superclass Specification at Different Levels

To explore the impact of superclass specification at different hierarchical levels, we conducted additional experiments using superclasses from the multi-level hierarchy of the Open Images dataset. We selected “Carnivore” and “Land Vehicle” as superclasses at lower hierarchical levels, being subcategories of “Animal” and “Vehicle,” respectively. We followed the same process used when creating the “Animal” and “Vehicle” benchmarks of the Open Images dataset. Specifically, we split the target categories into two splits (i.e., eight known and unknown categories) and created train/validation/test datasets. Table 9 presents the details of the resulting datasets. We then evaluated the detection performance on each split.

Table 9 Details of the employed class splits for “Carnivore” and “Land Vehicle” datasets.
Table 10 Detection accuracy of known (AP\(_{known}\)) and unknown objects (AP\(_{unk}\)) of “Carnivore” and “Land Vehicle”. See Table 6 for notations.
Table 11 A-OSE and WI of the compared methods in the experiment of each dataset. The same experimental settings as Table 6-8 are used. The reported values are the averages across all splits. Results for individual splits can be found in the appendix.

The evaluation results are shown in Table 10. We use OpenDet and Faster RCNN as representative methods, as they demonstrated strong performance in our main experiments on the Open Images dataset (Sect. 4.4.1). We observe that our naive baseline, Faster RCNN, achieves comparable performance on both known and unknown classes in the “Carnivore” dataset, exhibiting a similar trend to that of the “Animal” superclass. In the “Land Vehicle” dataset, Faster RCNN outperforms OpenDet in detecting known classes but underperforms in detecting unknown classes, highlighting a trade-off between the two. We speculate that as the specified superclass becomes a lower-level concept (i.e., closer to a fine-grained category), the limited domains of known and unknown objects may lead detectors to focus more on closed-set learning of the known categories rather than generalizing to the superclass, which could result in overfitting on the known classes.

4.4.3 Results of A-OSE and WI

In addition to \(\textrm{AP}_{known}\) and \(\textrm{AP}_{unk}\), we report absolute open-set error (A-OSE) and wilderness impact (WI), the metrics widely used in previous studies. Table 11 shows these values for the compared methods on the same test data. Recall that i) A-OSE and WI measure only detectors’ performance on known object detection and ii) they evaluate detectors’ performance at a single operating point. Table 11 shows the results at the operating points chosen in the previous studies, i.e., confidence score \(>0.05\) for A-OSE and recall (of known object detection) \(=0.8\) for WI. The results show that OpenDet and Faster RCNN achieve comparable performance on both metrics. FCOS performs worse, but this does not necessarily hold at different operating points, as shown in Fig. 3. We can also see from the results a clear inconsistency between A-OSE/WI and the \(\textrm{AP}\)s. For instance, Faster RCNN is inferior to ORE in both the A-OSE and WI metrics (i.e., \(6,382\pm 206\) vs. \(4,849\pm 206\) on A-OSE), whereas it achieves much better \(\textrm{AP}_{known}\) and \(\textrm{AP}_{unk}\) than ORE, as shown in Table 7. Such inconsistency demonstrates that A-OSE and WI are unsuitable performance measures for OSOD-II/III, as discussed in Sect. 2.2.

Fig. 4

Example outputs of OpenDet (Han et al., 2022) and our baseline method with Faster RCNN (Ren et al., 2015) for Open Images with Animal and Vehicle super-classes and MTSD, respectively. Red and blue boxes indicate detected unknown-class and known-class objects, respectively; “Unk” means “unknown”.

4.5 Analysis on Failure Cases

4.5.1 Overlapped Detections between Known and Unknown

To understand the reasons behind the low accuracy in unknown detection, we analyzed failure cases. This analysis revealed that, in addition to a certain degree of false negatives in unknown detection, there is significant confusion between unknown and known classes. Some examples are shown in Fig. 4, with more detailed data deferred to the appendix. As shown in Fig. 4, bounding boxes (BBs) predicted for unknown and known classes often overlap, contributing to this confusion.

The methods we evaluated in our experiments all employ non-maximum suppression (NMS). In standard object detection procedures, NMS is applied separately for each class; in our experiments, the unknown class was treated as just another class, so known and unknown predictions were subjected to NMS independently. If NMS were instead applied across overlapping unknown- and known-class BBs (i.e., eliminating the lower-scoring one), the confusion described above might be mitigated. We therefore conducted experiments applying NMS across known and unknown classes, as sketched below.
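The cross-class suppression we test can be sketched as follows, assuming per-image tensors of boxes and scores for the known and unknown predictions; it uses torchvision’s standard NMS, and the function name is ours.

```python
import torch
from torchvision.ops import nms

def cross_class_nms(known_boxes, known_scores, unk_boxes, unk_scores, iou_thr=0.7):
    """Apply NMS jointly over known and unknown predictions so that an overlapping
    known/unknown pair keeps only the higher-scoring box. Boxes are (N, 4) tensors
    in (x1, y1, x2, y2) format; returns kept indices and an is-unknown mask."""
    boxes = torch.cat([known_boxes, unk_boxes], dim=0)
    scores = torch.cat([known_scores, unk_scores], dim=0)
    is_unknown = torch.cat([torch.zeros(len(known_boxes), dtype=torch.bool),
                            torch.ones(len(unk_boxes), dtype=torch.bool)])
    keep = nms(boxes, scores, iou_thr)       # class-agnostic suppression
    return keep, is_unknown[keep]
```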

Figure 5 shows the mAP for known-category predictions and the AP for the unknown category, evaluated at varying IoU thresholds for this cross-class NMS. In Fig. 5, an IoU threshold of 1.0 corresponds to results obtained without NMS between known and unknown predictions; the other values demonstrate the effect of such NMS. It is clear that aggressive NMS reduces the APs for both categories.

This observation yields two insights: i) predicted known and unknown BBs frequently overlap, and ii) the scores of these BBs do not consistently reflect prediction accuracy. Ideally, when BBs overlap, the one with the highest score should correspond to the correct prediction. However, in our results, BBs that misclassify instances as known or unknown often receive higher scores than those that classify them correctly. In summary, while the detectors are competent at detecting unknown instances, they regularly confuse known and unknown instances. These findings offer useful insights for the development of more effective OSOD methods.

Fig. 5

Detection accuracy at various IoU thresholds for NMS between known and unknown predictions: mAP for known classes and AP for unknown. The results for (a) CUB200 and (b) MTSD.

Fig. 6

t-SNE visualization of the feature space of OpenDet trained under OSOD-II (VOC-COCO) and OSOD-III (four proposed benchmarks) scenarios. Latent features from known classes (colored circles) and unknown class (black crosses) are randomly sampled.

4.5.2 Feature Space Visualization

For further analysis of the confusion between known and unknown, we performed t-SNE visualizations of latent features for detectors trained under both the OSOD-II and OSOD-III scenarios. For the OSOD-II setting, we utilized the VOC-COCO dataset (Joseph et al., 2021; Han et al., 2022), which defines the 20 PASCAL-VOC classes as known categories and 60 non-PASCAL-VOC classes as the unknown category. For the OSOD-III setting, we employed four of our proposed benchmarks. Specifically, we first sampled RoI features from the penultimate layer before the final classification head. Only RoI features with an IoU of 0.8 or higher with any ground-truth (GT) instance were selected. From these, 30 samples were randomly chosen for t-SNE analysis. For visualization, 10 known classes and one unknown class were randomly sampled.
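A minimal sketch of this procedure is shown below, assuming the RoI features, their maximum IoU with the ground truth, and their labels have already been collected; sampling 30 features per class is our reading of the protocol, and scikit-learn’s TSNE is used.

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(features, max_ious, labels, iou_thr=0.8, per_class=30, seed=0):
    """features: (N, D) RoI features from the penultimate layer; max_ious: (N,)
    maximum IoU with any GT box; labels: (N,) class names including 'unknown'.
    Keeps well-localized RoIs, samples up to `per_class` per class, runs t-SNE."""
    rng = np.random.default_rng(seed)
    keep_mask = max_ious >= iou_thr
    features, labels = features[keep_mask], labels[keep_mask]
    sampled = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        sampled.extend(rng.choice(idx, size=min(per_class, len(idx)), replace=False))
    sampled = np.array(sampled)
    emb = TSNE(n_components=2, random_state=seed).fit_transform(features[sampled])
    return emb, labels[sampled]
```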

Figure 6 presents the results. We can observe that known features are reasonably well separated in the feature space, consistent with the strong \(\textrm{AP}_{known}\) results in the main experiments. In the VOC-COCO dataset (i.e., the OSOD-II formulation), the unknown class tends to map to regions of the feature space where various known categories intermingle, indicating relatively low similarity to any single known cluster. For the four proposed datasets (i.e., the OSOD-III formulation), unknown features are more likely to be mapped near known clusters rather than being concentrated in a distinct region of their own. This observation is reasonable, as the known and unknown classes share the same superclass in OSOD-III. The overlap between the two in the feature space supports our analysis of the failure cases: in OSOD-III, the main difficulty lies in misidentification between known and unknown classes rather than in discovering unknown objects.

5 Related Work

5.1 Open-set Recognition

For the safe deployment of neural networks, open-set recognition (OSR) has attracted considerable attention. The task of OSR is to accurately classify known objects and simultaneously detect unseen objects as unknown. Scheirer et al. (Scheirer et al., 2013) first formulated the problem of OSR, and many following studies have been conducted so far (Bendale & Boult, 2016; Ge et al., 2017; Neal et al., 2018; Sun et al., 2020; Oza & Patel, 2019; Shu et al., 2017; Vaze et al., 2022; Zhou et al., 2021).

The work of Bendale and Boult (Bendale & Boult, 2016) is the first to apply deep neural networks to OSR. They use outputs from the penultimate layer of a network to calibrate its prediction scores. Several studies (Ge et al., 2017; Neal et al., 2018; Kong & Ramanan, 2021) found that generative models are effective for OSR, where unseen-class images are synthesized and used for training. Another line of OSR studies focuses on reconstruction-based methods using latent features (Zhang & Patel, 2017; Yoshihashi et al., 2019), class-conditional auto-encoders (Oza & Patel, 2019), and conditional Gaussian distributions (Sun et al., 2020).

5.2 Open-set Object Detection

We can categorize existing open-set object detection (OSOD) problems into two scenarios, OSOD-I and -II, according to their different interest in unknown objects, as we have discussed in this paper.

OSOD-I  Early studies treat OSOD as an extension of OSR problem (Miller et al., 2018, 2019; Dhamija et al., 2020). They aim to correctly detect every known object instance and avoid misclassifying any unseen object instance into known classes. Miller et al. (Miller et al., 2018) first utilize multiple inference results through dropout layers (Gal & Ghahramani, 2016) to estimate the uncertainty of the detector’s prediction and use it to avoid erroneous detections under open-set conditions. Dhamija et al. (Dhamija et al., 2020) investigate how modern CNN detectors behave in an open-set environment and reveal that the detectors detect unseen objects as known objects with a high confidence score. For the evaluation, researchers have employed A-OSE (Miller et al., 2018) and WI (Dhamija et al., 2020) as the primary metrics to measure the accuracy of detecting known objects. They are designed to measure how frequently a detector wrongly detects and classifies unknown objects as known objects.

OSOD-II  More recent studies have moved in a more in-depth direction, aiming to correctly detect and classify every object instance not only of the known classes but also of the unknown class. This scenario is often considered a part of open-world object detection (OWOD) (Joseph et al., 2021; Gupta et al., 2022; Zhao et al., 2022; Singh et al., 2021; Wu et al., 2022). In this case, the detection of unknown objects matters, since OWOD considers updating the detectors by collecting unknown-class instances and using them for retraining. Joseph et al. (Joseph et al., 2021) first introduce the concept of OWOD and establish its benchmark test. Many subsequent works have strictly followed this benchmark and proposed methods for OSOD. OW-DETR (Gupta et al., 2022) introduces a transformer-based detector (i.e., DETR (Carion et al., 2020; Zhu et al., 2021)) for OWOD and improves the performance. Han et al. (Han et al., 2022) propose OpenDet, paying attention to the fact that unknown classes are distributed in low-density regions of the latent space; they perform contrastive learning to encourage intra-class compactness and inter-class separation of known classes, leading to performance gains. Similarly, Du et al. (Du et al., 2022) synthesize virtual unseen samples from the decision boundaries of Gaussian distributions for each known class. Wu et al. (Wu et al., 2022) propose the further challenging task of distinguishing unknown instances as multiple unknown classes.

5.3 Hierarchical Novelty Detection

Recent studies have explored hierarchical novelty detection (HND), which conceptually resembles OSOD-III.

Lee et al. (Lee et al., 2018) introduced a hierarchical classification framework for novelty detection, leveraging a taxonomy of known classes to identify the most relevant superclass for novel objects. More recently, Pyakurel and Yu (Pyakurel & Yu, 2024) proposed fine-grained evidence allocation for HND, improving detection precision by introducing virtual novel classes at each non-leaf level and enabling a structured, evidence-based multi-class classification approach.

Although HND and OSOD-III share conceptual similarities, they differ in three key aspects. 1) HND focuses on image-level classification, whereas OSOD-III operates at the instance-level for object detection. 2) HND aims to identify the closest superclass for detected unknown objects, within the hierarchical taxonomy. In contrast, OSOD-III does not require superclass inference, as the superclass is predefined by the user. This reduces task complexity, making OSOD-III feasible even for instance-level tasks. 3) To facilitate superclass inference, HND explicitly constructs a category hierarchy (e.g., WordNet (Fellbaum, 1998)) as prior knowledge, providing models with a structured taxonomy for classification. OSOD-III does not necessitate such a hierarchy; instead, detectors are supplied with a list of known categories from which they learn the corresponding superclass.

6 Conclusion

In this paper, we have studied the problem of open-set object detection (OSOD). Initially, we categorized existing problem formulations in the literature into two types: OSOD-I and OSOD-II. We then highlighted the inherent difficulties in OSOD-II, the most widely studied formulation, where the primary issue is identifying which unknown objects to detect. This ambiguity renders practical evaluation of OSOD-II problematic.

Subsequently, we introduced a novel OSOD formulation, OSOD-III, which focuses on detecting unknown objects that belong to the same super-class as known objects. This perspective, previously neglected in the field, is of significant practical relevance. We demonstrated that OSOD-III is not subject to the issues plaguing OSOD-II, enabling the effective evaluation of methodologies using the standard AP metric. We also established benchmark tests for OSOD-III and evaluated various methods, including the current state-of-the-art. Our primary finding is that existing methods achieve only modest performance, falling short of practical application in real-world scenarios. Our analyses revealed that the primary challenge lies not in detecting unknown instances, but in differentiating known from unknown instances. Anticipating further advancements, we hope our insights will contribute to future developments in this area.

Table 12 Classes contained in the employed splits for Open Images (Kuznetsova et al., 2018) with the super-classes “Animal” (first column) and “Vehicle” (second column), respectively