Abstract
Zero-shot recognition is centered around learning representations to transfer knowledge from seen to unseen classes. Where foundational approaches perform the transfer with semantic embedding spaces, e.g., from attributes or word vectors, the current state-of-the-art relies on prompting pre-trained vision-language models to obtain class embeddings. Whether zero-shot learning is performed with attributes, CLIP, or something else, current approaches de facto assume that there is a pre-defined embedding space in which seen and unseen classes can be positioned. Our work is concerned with real-world zero-shot settings where a pre-defined embedding space can no longer be assumed. This is natural in domains such as biology and medicine, where class names are not common English words, rendering vision-language models useless; or neuroscience, where class relations are only given with non-semantic human comparison scores. We find that there is one data structure enabling zero-shot learning in both standard and non-standard settings: a similarity matrix spanning the seen and unseen classes. We introduce four similarity-based zero-shot learning challenges, tackling open-ended scenarios such as learning with uncommon class names, learning from multiple partial sources, and learning with missing knowledge. As the first step for zero-shot learning beyond a pre-defined semantic embedding space, we propose \(\kappa \)-MDS, a general approach that obtains a prototype for each class on any manifold from similarities alone, even when part of the similarities are missing. Our approach can be plugged into any standard, hyperspherical, or hyperbolic zero-shot learner. Experiments on existing datasets and the new benchmarks show the promise and challenges of similarity-based zero-shot learning.
1 Introduction
Zero-shot learning (ZSL) constitutes a long-standing problem in computer vision, where we seek to recognize classes for which we do not have any training examples (Lampert et al. 2009; Larochelle et al. 2008). Different from supervised learning, the go-to route in zero-shot learning is to rely on an embedding space from a prior knowledge source that is shared amongst seen and unseen classes. By optimizing training samples of seen classes to their embeddings in this shared space, it also becomes possible to recognize unseen classes during inference through a nearest neighbor search of the unseen class embeddings (Xian et al. 2018a).
Fig. 1: We propose similarity-based zero-shot learning and introduce four challenges: (a) Challenge I: zero-shot learning without common names; (b) Challenge II: multi-source zero-shot learning; (c) Challenge III: zero-shot learning with rare missing knowledge; (d) Challenge IV: zero-shot learning with random missing knowledge
Zero-shot learning has witnessed tremendous progress over the years. Where traditional solutions commonly use embedding spaces constructed through, e.g., attributes (Chen et al. 2022a; Lampert et al. 2009; Xu et al. 2020) or word vectors (Frome et al. 2013; Liu et al. 2020; Radford et al. 2021), the state-of-the-art leverages the embedding space of vision-language models such as CLIP (Radford et al. 2021), typically by transforming a class name into a prompt (Wang et al. 2023a; Tang et al. 2024; Ali and Khan 2023). These advances have resulted in high accuracies on the benchmarks we all love in computer vision, from ImageNet (Deng et al. 2009) and CIFAR (Krizhevsky et al. 2009) to Kinetics (Kay et al. 2017), CUB-Birds (Wah et al. 2011), AWA2 (Xian et al. 2018a), SUN (Patterson et al. 2014) and more. The common assumption that drives current works is that all seen and unseen classes either have pre-defined semantic representations or that they can easily be obtained through prompting.
This assumption is however not valid in many real-world settings. What if the class names are not common English names, as happens naturally in biological and medical settings (Khan et al. 2023; Beery et al. 2022; Tschandl et al. 2018)? Any attempt at a prompt will only lead to random predictions. What if the relation between seen and unseen classes is only expressed by a score, as is the case in many neuroscience experiments (Edelman and Shahbazi 2012) and when dealing with co-occurrence statistics (Mensink et al. 2014)? What if we are dealing with different embedding spaces for different sets of classes? And what if knowledge is missing because we are dealing with rare classes (Chen et al. 2023; Beery et al. 2020; Walker and Orenstein 2021) or with noisy information from the internet (Sharma et al. 2018; Han et al. 2023)? This paper strives to enable zero-shot learning when we can no longer assume there is a pre-defined semantic embedding space.
We propose SimZSL: similarity-based zero-shot learning. We find that a single similarity score between each class pair forms the minimum required building block to enable seen to unseen generalization. A similarity matrix for seen and unseen classes as prior knowledge makes for the most foundational building block, as it allows us to become agnostic to the prior knowledge on which we rely. To encourage research beyond the scope of current zero-shot literature, we outline four challenges. The first is zero-shot without common class names, describing the scenario where CLIP is no longer the magic bullet. The second is multi-source zero-shot learning, where different sets of classes share different embedding spaces. This setting cripples standard zero-shot learning, but becomes feasible when breaking the embeddings down to similarity scores. The third is zero-shot learning with rare missing knowledge, describing the scenario where some classes are rare and hence do not have a known relation to all other classes. The fourth is zero-shot learning with random missing knowledge, where the relation between some classes is randomly missing due to a noisy source. The challenges are visualized in Fig. 1.
To obtain the first results on the new similarity-based zero-shot learning challenges, we take inspiration from classical multidimensional scaling (MDS) (Carroll and Arabie 1998; Jaworska and Chupetlovska-Anastasova 2009; Hout et al. 2013), which strives to construct a feature vector for each object given a matrix of distances between all objects, canonically for visualization purposes (Jaworska and Chupetlovska-Anastasova 2009). While MDS operates in Euclidean space, we introduce \(\kappa \)-MDS, a generalization that also operates in hyperspherical and hyperbolic spaces, enabling us to plug its prototype outputs into any existing prototype-based zero-shot learner. Depending on the manifold in which the zero-shot learner performs, \(\kappa \) can be set accordingly. We furthermore outline two extensions of \(\kappa \)-MDS for dealing with rare and random missing knowledge. Experiments on the new challenges show that our approach makes it possible to learn under any zero-shot setting, while they also indicate that zero-shot generalization in these challenging scenarios has a lot of room for improvement. We lastly verify that similarity scores are a sufficient source for zero-shot learning in general. Across multiple datasets and zero-shot learners, we show that any prior knowledge source can be compressed to a similarity matrix without hampering performance when using our embedding construction method. We conclude that similarity-based learning opens new doors in zero-shot recognition without limiting the existing direction in the field.
In summary, our contributions are as follows:
1. We propose SimZSL, similarity-based zero-shot learning, and we show that similarity scores are a sufficient minimal knowledge source for zero-shot learning;
2. We outline four new challenges: zero-shot learning without common class names, multi-source zero-shot learning, zero-shot learning with rare missing knowledge, and zero-shot learning with random missing knowledge;
3. We introduce \(\kappa \)-MDS to construct class prototypes on any manifold, even when similarities are missing.
2 Related Work
2.1 ZSL w/ Full Prior Knowledge
Zero-shot learning aims to generalize from a set of training classes to a completely separate set of unseen test classes using a prior knowledge source. Zero-shot learning in the literature has relied on many types of embedding spaces shared by seen and unseen classes. The most common examples include attributes (Chen et al. 2022b; Akata et al. 2015; Chen et al. 2018b, 2022a; Jiang et al. 2019; Huynh and Elhamifar 2020; Xie et al. 2019; Romera-Paredes and Torr 2015; Xie et al. 2020; Xu et al. 2020; Zhu et al. 2019; Chen et al. 2021b, a; Han et al. 2021; Narayan et al. 2020; Shen et al. 2020; Verma et al. 2018; Xu et al. 2022; Schonfeld et al. 2019; Xian et al. 2018b; Liu et al. 2018; Wang and Chen 2017; Shen et al. 2021; Reed et al. 2016; Yu et al. 2020; Vyas et al. 2020; Rohrbach et al. 2011; Romera-Paredes and Torr 2015), word vectors (Akata et al. 2015; Bretti and Mettes 2021; Xu et al. 2022; Socher et al. 2013; Schonfeld et al. 2019; Chen et al. 2021c; Frome et al. 2013; Liu et al. 2018; Wang and Chen 2017; Shen et al. 2021; Liu et al. 2020; Reed et al. 2016; Yu et al. 2020; Vyas et al. 2020; Chen et al. 2022a), and class hierarchies (Akata et al. 2015; Liu et al. 2020; Li et al. 2019; Li et al. 2020; Long et al. 2020; Atigh et al. 2022; Rohrbach et al. 2011). Lampert et al. (2009) were the first to explore zero-shot learning in computer vision. They introduced attributes, which are human-annotated high-level descriptions of classes, bridging the gap between seen and unseen classes. Attributes are represented as binary or continuous vectors for machine utilization, e.g., binary attributes such as “black”, “has stripes”, or “eats fish” in the AWA2 dataset (Lampert et al. 2013; Akata et al. 2015). Attributes can be used for zero-shot learning to play the role of target representations directly (Xu et al. 2020; Akata et al. 2015; Romera-Paredes and Torr 2015), to project to a common embedding space (Chen et al. 2021c; Liu et al. 2018; Wang and Chen 2017), or as the prior knowledge to generate new synthetic samples (Verma et al. 2018; Chen et al. 2021b, a; Schonfeld et al. 2019; Shen et al. 2020). Due to the strong annotation requirements of attributes, a wide range of works have investigated more scalable prior knowledge sources, such as text embeddings and hierarchical class relations. Text-based prior knowledge can be obtained for example through Word2Vec (Schonfeld et al. 2019), GloVe (Liu et al. 2020), or FastText (Xu et al. 2022) vectors extracted from the class names or a prompt including the class names, or by extracting sentences about the classes from web sources and generating TF-IDF (Vyas et al. 2020) or language model representations (Reed et al. 2016). Given a text-based representation per class, zero-shot training and inference can be performed akin to attribute-based zero-shot learning. Where text-based knowledge largely follows the setup of attributes, class hierarchies form a largely different source for enabling zero-shot learning. Class hierarchies are commonly present in visual datasets, e.g., ImageNet (Deng et al. 2009), CUB-Birds (Wah et al. 2011), Kinetics (Kay et al. 2017), and many more. Early approaches use hierarchies to transfer knowledge from seen to unseen classes through hierarchical relations to help distinguish similar classes (Rohrbach et al. 2011; Al-Halah and Stiefelhagen 2015). More recently, several works propose to embed hierarchies such that their parent–child relations are preserved with minimal distortion.
Once embedded, the nodes of the hierarchy can be used as target vectors for representing classes, following the consensus in zero-shot learning. Several works have highlighted that different knowledge sources prefer different geometries for the embedding spaces, such as hyperspherical spaces for text-based embeddings (Shen et al. 2021) and hyperbolic spaces for hierarchical embeddings (Liu et al. 2020; Long et al. 2020). All mentioned approaches assume that seen and unseen classes are a priori embedded in a shared embedding space. Our work strives to push zero-shot learning to settings where this assumption is no longer viable. While zero-shot learning approaches using fixed prior knowledge or pseudo-class centers as the prototypes require pre-defined, single-source prior knowledge to generalize from seen to unseen classes, similarity-based zero-shot learning requires only a single similarity score between each pair of classes. The similarity score can be obtained from any knowledge source, from combinations of knowledge sources, or in settings where only similarity scores are given, such as in neuroscience and when working with co-occurrence statistics.
Similarly, Mensink et al. (2014) have advocated for similarities in the form of co-occurrence statistics between classes to perform zero-shot learning. Mensink et al. (2014) propose to extract the co-occurrence statistics from the class-level annotations or web-search hit counts, removing the requirement of expensive, expert-driven annotations. While their approach focuses on co-occurrences only, we generalize to any similarity-based setting. Moreover, our \(\kappa \)-MDS approach works with any prototype-based zero-shot learner and can even deal with missing knowledge.
2.2 ZSL w/ Vision-Language Models
The state-of-the-art in zero-shot learning largely builds upon advances in large-scale vision-language models. If trained on large collections of image-text pairs, vision-language models such as CLIP (Radford et al. 2021), ALIGN (Jia et al. 2021), Flamingo (Alayrac et al. 2022), ActionCLIP (Wang et al. 2021), X-CLIP (Ma et al. 2022), MaskCLIP (Zhou et al. 2022), ReCLIP (Subramanian et al. 2022), CLIPCAM (Hsia et al. 2022), ZegCLIP (Zhou et al. 2023), MAFT (Jiao et al. 2023), MERU (Desai et al. 2023), CLIPN (Wang et al. 2023b), kNN-CLIP (Gui et al. 2024), Cascade-CLIP (Li et al. 2024) and many variants are able to generalize out-of-the-box to unseen classes with impressive performance. What is more, all we need is to convert a class name into a short description, with the go-to solution in the form of “this is a photo of [classname]” (Radford et al. 2021; Tang et al. 2024; Ali and Khan 2023). While effective on general datasets where each class is defined by a common English name, this setup will not work in domains with specialized, uncommon, and non-English names, or on classes not observed when training vision-language models. We show how to make zero-shot learning possible in all settings by condensing all knowledge down to similarities.
2.3 ZSL w/ Missing Knowledge
A few works have investigated zero-shot learning with missing knowledge. Wang et al. (2017) propose a zero-shot learning method to deal with a partial set of observed class attributes. They assume that there is a set of attributes that is missing for all unseen classes. Braytee et al. (2021) also investigate zero-shot learning with missing attributes by learning a supplementary semantic attribute matrix. In contrast, we investigate missing knowledge at the similarity level instead of the attribute level, making our approach more general and agnostic to the knowledge source. Moreover, in our case, knowledge can be both structurally and randomly missing. MDS has been used in prior zero-shot learning, e.g., by Changpinyo et al. (2017). We propose \(\kappa \)-MDS, a generalization that operates on any non-Euclidean manifold with curvature \(\kappa \), and we show how to operate with partial knowledge.
3 SimZSL
3.1 Problem Formulation
For our problem, we are given a training set \({\mathcal {T}} = \{(x_i, y_i)\}_{i=1}^{N}\) with N examples, where \(x_i \in {\mathcal {I}}\) denotes the \(i^{\text {th}}\) input image and \(y_i \in {\mathcal {Y}}_s\) denotes the corresponding category label. At test time, the goal is to assign a label to a test image from a set of unseen labels \({\mathcal {Y}}_u\), where \({\mathcal {Y}}_s \cap {\mathcal {Y}}_u = \emptyset \). The point of the paper is that for zero-shot learning in any challenging setting, all we need is a similarity matrix \(S \in \mathbb {R}^{K \times K}\), with \({\mathcal {Y}} = {\mathcal {Y}}_s \cup {\mathcal {Y}}_u\) and \(|{\mathcal {Y}}| = K\), where \(S_{i,j}\) is the similarity score of classes i and j. Given S, we want to distill class prototypes for \({\mathcal {Y}}\). In this work, we strive for an embedding algorithm that can be applied to any zero-shot learner, including recent alternatives that rely on non-Euclidean spaces (Shen et al. 2021; Liu et al. 2020). Additionally, such a method should still work when the similarities are only partially given, i.e., when S is incomplete. To summarize, SimZSL consists of two steps: (1) extracting prototypes given pairwise similarities, and (2) performing prototype-based zero-shot learning given the SimZSL prototypes.
3.2 How to Learn from Similarities
For our approach, we take inspiration from MDS (Borg and Groenen 2005). MDS deals with a dissimilarity matrix \(D \in \mathbb {R}^{K \times K}\) instead of similarities. Converting between the two can simply be done by identifying the similarities as \(-\frac{1}{2} D'\) and centering the resulting matrix:
\(S = -\frac{1}{2}\left( I_K - \tfrac{1}{K} J_K\right) D' \left( I_K - \tfrac{1}{K} J_K\right) ,\)
with \(J_K\) the all-ones matrix of dimension \(K \times K\) and \(D'\) the matrix of squared dissimilarities with \(D'_{ij} = D^2_{ij}\).
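To make the conversion concrete, the following is a minimal numpy sketch of this double-centering step; the function name and the assumption that D holds (unsquared) dissimilarities are ours.

```python
import numpy as np

def dissimilarities_to_similarities(D: np.ndarray) -> np.ndarray:
    """Classical-MDS double centering: identify similarities with the
    centered matrix of squared dissimilarities, S = -1/2 * C @ D' @ C."""
    K = D.shape[0]
    C = np.eye(K) - np.ones((K, K)) / K  # centering matrix I_K - J_K / K
    D_sq = D ** 2                        # D' with D'_ij = D_ij^2
    return -0.5 * C @ D_sq @ C
```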
We assume that each label \(y_i\) can be represented by a point \(z_i\) in a latent space \(M_\kappa ^d\) with constant curvature \(\kappa \) and dimension d; these points are the output of Algorithms 1 and 2. Moreover, the pairwise dissimilarities \(d_{ij}\) of any two classes \(y_i, y_j\) can be approximated by the distance \(d_\kappa (z_i, z_j)\) of \(z_i\) and \(z_j\) in \(M_\kappa ^d\). This allows us to recast the embedding problem on the dissimilarity matrix D as a problem of completing a distance matrix in \(M_\kappa ^d\). In the Euclidean case, the solution is well-studied and MDS itself suffices, see Borg and Groenen (2005). Below, we propose \(\kappa \)-MDS, a unified formulation for obtaining class embeddings on any non-Euclidean manifold given constant curvature \(\kappa \). First, let us introduce the following function:
\({\mathcal {C}}_\kappa (x) = \cos \left( \sqrt{\kappa }\, x\right) \) for \(\kappa > 0\), and \({\mathcal {C}}_\kappa (x) = \cosh \left( \sqrt{-\kappa }\, x\right) \) for \(\kappa < 0\).
The inverse function \({\mathcal {C}}^{-1}\) is well-defined for \(x \in \tfrac{1}{\sqrt{\kappa }} [0,2\pi )\) for \(\kappa > 0\), and for all \(x \in [0,\infty )\) when \(\kappa < 0\). Given the curvature \(\kappa \in \mathbb {R} \setminus \left\{ 0\right\} \), the inner product \(\left\langle z,z'\right\rangle _\kappa \) of \(z, z' \in \mathbb {R}^{d+1}\) is defined as
\(\left\langle z,z'\right\rangle _\kappa = \textrm{sign}\left( \kappa \right) z_1 z'_1 + \sum _{i=2}^{d+1} z_i z'_i, \) (2)
where this is the usual (Euclidean) inner product if \(\kappa > 0\), and the indefinite Lorentz product if \(\kappa < 0\). Moreover, we can write \(M_\kappa ^d\) as the connected component of
\(\left\{ z \in \mathbb {R}^{d+1}: \left\langle z,z\right\rangle _\kappa = \textrm{sign}\left( \kappa \right) \right\} \) (3)
containing \(z = (1,0,\dotsc , 0)\), with the distance on \(M_\kappa ^d\):
\(d_\kappa (z,z') = {\mathcal {C}}_\kappa ^{-1}\left( \textrm{sign}\left( \kappa \right) \left\langle z,z'\right\rangle _\kappa \right) , \) (4)
see Ratcliffe et al. (1994). Inverting this equation, we obtain
\(\left\langle z,z'\right\rangle _\kappa = \textrm{sign}\left( \kappa \right) {\mathcal {C}}_\kappa \left( d_\kappa (z,z')\right) .\)
Applied to the latent label representations \(z_1, \dotsc , z_K\), and written in matrix form, this means that the distance matrix \(D = [d_\kappa (z_i,z_j)]\) can be converted to the Gram matrix \(G = [\left\langle z_i,z_j\right\rangle _\kappa ]\) of pairwise inner products by the element-wise operation \(G = \textrm{sign}\left( \kappa \right) {\mathcal {C}}_\kappa (D)\). The coordinates \(z_1, \dotsc , z_K\) can be recovered from the Gram matrix G through eigendecomposition. We outline the unified \(\kappa \)-MDS solution in Algorithm 1, which returns a coordinate matrix Z, whose rows are the coordinates \(z_1, \dotsc , z_K\) of the desired latent label representations (Agarwal et al. 2010; Keller-Ressel and Nargang 2020; Tabaghi and Dokmanić 2020).
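For illustration, below is a minimal numpy sketch of Algorithm 1 under the cos/cosh form of \({\mathcal {C}}_\kappa \) given above; the eigenvalue clipping and the function names are our own choices rather than part of the algorithm's specification.

```python
import numpy as np

def kappa_mds(D: np.ndarray, d: int, kappa: float) -> np.ndarray:
    """Sketch of kappa-MDS: recover K prototypes on M_kappa^d from a full
    K x K distance matrix D. Rows of the returned Z are z_1, ..., z_K."""
    # Step 1: element-wise map from distances to inner products, G = sign(k) C_k(D).
    if kappa > 0:
        G = np.cos(np.sqrt(kappa) * D)
    else:
        G = -np.cosh(np.sqrt(-kappa) * D)
    # Steps 2-3: eigendecomposition; top-d eigenpairs give the spatial coordinates.
    lam, Q = np.linalg.eigh(G)                       # ascending eigenvalues
    top = np.argsort(lam)[::-1][:d]
    Z_rest = Q[:, top] * np.sqrt(np.clip(lam[top], 0.0, None))
    # Step 4: infer the first coordinate from the constraint <z, z>_kappa = sign(kappa).
    sq_norm = np.sum(Z_rest ** 2, axis=1)
    if kappa > 0:
        z0 = np.sqrt(np.clip(1.0 - sq_norm, 0.0, None))  # sphere: z0^2 + |z'|^2 = 1
    else:
        z0 = np.sqrt(1.0 + sq_norm)                      # hyperboloid: -z0^2 + |z'|^2 = -1
    return np.column_stack([z0, Z_rest])
```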
3.3 Generalization to Non-Euclidean
By performing MDS or \(\kappa \)-MDS on D, we obtain a coordinate matrix Z. Each row in this matrix denotes a vector representation of a class, which we can directly use as a class embedding in any prototype-based zero-shot learning method. In this paper, we investigate our embeddings from similarities plugged into various zero-shot learning methods. \(\kappa \)-MDS is applied to the non-Euclidean zero-shot learners, i.e., SZSL and HZSL. For Euclidean approaches such as DeViSE and VGSE, \(\kappa \)-MDS reverts to MDS. For the Euclidean case, we investigate both the canonical DeViSE algorithm (Frome et al. 2013) and the more recent VGSE approach (Xu et al. 2022). To train DeViSE, the goal is to optimize the hinge rank loss
\(\sum _{j \ne y} \max \left( 0,\, m - \theta (x)^{\top } z_{y} + \theta (x)^{\top } z_{j}\right) ,\)
where \(\theta (x)\) generates the image embeddings that are matched to the class embeddings Z (with rows \(z_j\)), \(y\) is the ground-truth label, and m denotes the margin. To train VGSE, we analogously use its original loss function with our class embeddings Z as the targets.
We also plug our prototypes on top of CLIP (Radford et al. 2021). In all cases, we simply use the seen-class embeddings as targets in the corresponding training losses and use the unseen-class embeddings for nearest neighbor search during testing. We furthermore investigate SZSL (Shen et al. 2021) and HZSL (Liu et al. 2020) for hyperspherical and hyperbolic zero-shot learning, respectively. For these methods, we set the \(\kappa \) in our \(\kappa \)-MDS to 1 and \(-1\), respectively, and plug our embeddings in as training and test targets. While \(\kappa \)-MDS works under varying values of \(\kappa \), \(\kappa \) is not a hyperparameter, but a way to allow us to operate with any non-Euclidean zero-shot learner: as different zero-shot learners prefer different curvatures when training their models, our approach can adopt the prototypes from \(\kappa \)-MDS in any method by matching \(\kappa \) to the curvature used in the zero-shot learner. To train HZSL, given a triplet \((h_I, z_{c_I}, z_{c_I}^{-})\) with \(h_I\) as the image embedding, \(z_{c_I}\) as the positive and \(z_{c_I}^{-}\) as the negative label embeddings extracted from Z, the goal is to minimize the hinge loss
\(\max \left( 0,\, m + d_{\mathbb {D}}(h_I, z_{c_I}) - d_{\mathbb {D}}(h_I, z_{c_I}^{-})\right) ,\)
where, as the representations lie in hyperbolic space, \(d_{\mathbb {D}}\) is the Poincaré distance. To train SZSL, on the other hand, the goal is to minimize the objective function
\({\mathcal {L}}_{SZSL} = {\mathcal {L}}_{KL} + R(\eta ^{*}) + \overline{H},\)
where \({\mathcal {L}}_{KL}\) minimizes the Kullback–Leibler (KL) divergence between the prediction probabilities and the one-hot vectors of the correct labels, and \(R(\eta ^{*})\) and \(\overline{H}\) denote the spherical and semantic alignment terms. As in the other methods, the prediction probabilities are generated by aligning the image embeddings with the class embeddings Z.
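As a concrete example of this plug-in usage, below is a minimal PyTorch sketch of the Euclidean (DeViSE-style) case with our similarity-based prototypes; for SZSL and HZSL, the dot-product scores would be replaced by cosine similarities and negative Poincaré distances, respectively. The function names and tensor shapes are illustrative assumptions.

```python
import torch

def hinge_rank_loss(img_emb: torch.Tensor, labels: torch.Tensor,
                    Z_seen: torch.Tensor, m: float = 1.0) -> torch.Tensor:
    """DeViSE-style hinge rank loss against kappa-MDS prototypes.
    img_emb: (B, d+1) image embeddings theta(x); Z_seen: (C_s, d+1) prototypes."""
    scores = img_emb @ Z_seen.t()               # (B, C_s) compatibility scores
    pos = scores.gather(1, labels[:, None])     # score of the ground-truth class
    hinge = (m - pos + scores).clamp(min=0)     # hinge term for every class
    mask = torch.ones_like(hinge).scatter_(1, labels[:, None], 0.0)
    return (hinge * mask).sum(dim=1).mean()     # exclude the ground-truth term

@torch.no_grad()
def zero_shot_predict(img_emb: torch.Tensor, Z_unseen: torch.Tensor) -> torch.Tensor:
    """Inference: nearest unseen-class prototype by score."""
    return (img_emb @ Z_unseen.t()).argmax(dim=1)
```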
3.4 Learn from Partial Similarities
In real-world scenarios, similarities are not always fully available, e.g., because a knowledge source is incomplete, or because a full similarity matrix would require excessive human annotation effort. For dealing with missing information, we distinguish missing knowledge due to rare classes (i.e., the structured case) and due to random noise (i.e., the unstructured case). The structured case can be formalized as follows: from the similarity matrix \(S \in \mathbb {R}^{K \times K}\), only a subset of entries described by a mask \(\Omega \in \left\{ 0,1\right\} ^{K \times K}\) is known (\(\Omega _{ij} = 1\)), and the rest is unknown (\(\Omega _{ij} = 0\)). More formally, complete rows, and by symmetry columns, of S are known and labels can be reordered to cast \(\Omega \) into the block shape
\(\Omega = \Big ({\begin{smallmatrix} \textbf{1}_{L \times L} & \textbf{1}_{L \times (K-L)} \\ \textbf{1}_{(K-L) \times L} & \textbf{0}_{(K-L) \times (K-L)} \end{smallmatrix}}\Big ), \) (10)
where \(L < K\). In this case, the first L classes take the role of landmark, or reference, classes, for which similarity information to all other classes is available. The proportion of known similarities in this case is \(p = L(2K - L)/K^2\). In the unstructured case, the entries of \(\Omega \) are chosen independently at random with \(\mathbb {P}(\Omega _{ij} = 1) = p\), i.e., similarity information is available for random pairs \((l_i, l_j)\) of class labels. Both cases can be solved by completing the partial (dis)similarity matrix, as follows:
Completing partial distances - structured case. In the structured case, the matrix D is partitioned as \(D = \Big ({\begin{smallmatrix} D_{L} & D_{NL}^\top \\ D_{NL} & D_{N} \end{smallmatrix}}\Big )\) using the mask from Eq. 10. Here, \(D_L\), the matrix of landmark-to-landmark distances, and \(D_{NL}\), the matrix of nonlandmark-to-landmark distances, are known. However, the submatrix \(D_{N}\) of nonlandmark-to-nonlandmark distances is unknown. The Gram matrix G can be partitioned in the same way, and its known blocks can be computed using the function \({\mathcal {C}}_\kappa \) as above. Inspired by De Silva and Tenenbaum (2004) and Keller-Ressel and Nargang (2022), we first recover the landmark coordinates \(z_1, \dotsc , z_L\) from their Gram matrix \(G_L\) by \(\kappa \)-MDS (or MDS). The non-landmark coordinates \(z_{L+1}, \dotsc , z_K\) are then computed from \(z_1, \dotsc , z_L\) and \(G_{NL}\) by solving a least-squares problem. The corresponding landmarked \(\kappa \)-MDS is summarized in Algorithm 2.
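A minimal numpy sketch of this landmark procedure, reusing the kappa_mds sketch from Sect. 3.2; the precomputation of G_NL via \({\mathcal {C}}_\kappa \) and the variable names are our assumptions.

```python
import numpy as np

def landmarked_kappa_mds(D_L: np.ndarray, G_NL: np.ndarray,
                         d: int, kappa: float) -> np.ndarray:
    """Sketch of landmarked kappa-MDS (Algorithm 2).
    D_L: (L, L) landmark-to-landmark distances.
    G_NL: (K-L, L) nonlandmark-to-landmark inner products, obtained
    element-wise as sign(kappa) * C_kappa(D_NL)."""
    Z_L = kappa_mds(D_L, d, kappa)              # landmark coordinates, (L, d+1)
    J = np.diag([np.sign(kappa)] + [1.0] * d)   # signature matrix J_kappa
    # Each non-landmark z solves Z_L @ J @ z ~= g (its row of G_NL), in least squares.
    Z_N, *_ = np.linalg.lstsq(Z_L @ J, G_NL.T, rcond=None)
    return np.vstack([Z_L, Z_N.T])              # all K = L + (K - L) coordinates
```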
Completing partial distances - unstructured case. In the unstructured case, the problem of completing the partial distance matrix can be phrased as a noisy low-rank matrix completion problem in the sense of Candes and Plan (2010). In other words, we would like to solve the minimization problem
\(\min _{G} \sum _{(i,j):\, \Omega _{ij} = 1} \left( G_{ij} - \textrm{sign}\left( \kappa \right) {\mathcal {C}}_\kappa (D_{ij})\right) ^2 \) (11)
under the constraint that G is a Gram matrix of \(\left\langle z,z'\right\rangle _\kappa \)-inner products with \(\textrm{rank}(G) = d + |\textrm{sign}\left( \kappa \right) |\). This is a non-convex optimization problem, which is usually solved by replacing the rank constraint with a nuclear-norm constraint or by alternating minimization and projection methods (Candes and Plan 2010; Jain et al. 2013; Jiang et al. 2017; Nguyen et al. 2019). For the non-Euclidean case \(\kappa \ne 0\), we adapt the method of Jiang et al. (2017) and perform alternating projections onto rank-constraint sets \({\mathcal {S}}_{r_i}\) and fidelity-constraint sets \({\mathcal {S}}_{\epsilon _i}\). The projection onto the rank-constraint set \({\mathcal {S}}_{r_i}\) is equivalent to \(\kappa \)-MDS with dimension parameter \(d = r_i + 1\); the projection onto the fidelity-constraint set \({\mathcal {S}}_{\epsilon _i}\) is equivalent to performing a sequence of gradient steps on (11) until an error tolerance of \(\epsilon _i^2\) is met.
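The following numpy sketch illustrates this alternating scheme; `gram_to_points` stands for the eigendecomposition steps of Algorithm 1 applied directly to a Gram matrix, and the single fidelity gradient step per iteration is a simplification of the tolerance-based inner loop described above.

```python
import numpy as np

def complete_gram_matrix(G_obs, mask, d, kappa, iters=200, step=0.5):
    """Sketch: alternate rank projections (via the kappa-MDS eigen-steps)
    with gradient steps on the observed entries of the objective in (11).
    G_obs: observed inner products sign(kappa) * C_kappa(D); mask: Omega."""
    J = np.diag([np.sign(kappa)] + [1.0] * d)
    target = np.where(mask, G_obs, 0.0)         # zero-fill the unknown entries
    G = target.copy()
    for _ in range(iters):
        Z = gram_to_points(G, d, kappa)         # rank projection onto S_r
        G = Z @ J @ Z.T                         # re-form a rank-(d+1) Gram matrix
        G -= step * mask * (G - target)         # fidelity step on known entries
    return G
```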
3.5 Performance Guarantees
We provide additional performance guarantees, showing that \(\kappa \)-MDS as given in Algorithm 1 is capable of perfectly recovering latent points from a dissimilarity matrix. We show that the same guarantee even holds for Algorithm 2 if the number of landmarks is equal to or greater than the chosen dimensionality. Our approach has no problems scaling to many classes even with the quadratic growth of the similarity matrix. For example, on ImageNet-21k (Deng et al. 2009) with 21k classes, it takes about 4 minutes and 600MB to obtain the prototypes with our \(\kappa \)-MDS, which is an offline process that only has to be run once.
For Algorithm 1, we are able to show the following performance guarantee: Let latent points \(\varvec{{z}}_1, \dotsc , \varvec{{z}}_K \in M_\kappa ^d\) with \(K > d\) be given and assume that they are not fully contained in a subspace of strictly smaller dimension \(d' < d\). Then \(\kappa \text {-MDS}(D,d,\kappa )\) can perfectly recover the latent points \(\varvec{{z}}_1, \dotsc , \varvec{{z}}_K\) up to isometry from knowledge of their full distance matrix \(D = [d_\kappa (\varvec{{z}}_i, \varvec{{z}}_j)]_{i,j=1\dotsc K}\). For Algorithm 2, we can give the same guarantee if, in addition, the number of landmark labels L satisfies \(L \ge d\). For brevity, we only provide the proof of the first assertion: Let \({\bar{Z}} \in \mathbb {R}^{K \times (d+1)}\) denote the true coordinate matrix, whose rows are the coordinates in \(M_\kappa ^d\) of the true latent points. This matrix has rank \(d+1\), since otherwise the latent points would be concentrated in a lower-dimensional subspace. By Eq. 2, the Gram matrix \({\bar{G}}\) of inner products can be written as \({\bar{G}} = {\bar{Z}} J_\kappa {\bar{Z}}^\top \), where \(J_\kappa \) is the \((d+1) \times (d+1)\) matrix \(J_\kappa = \textrm{diag}(\textrm{sign}\left( \kappa \right) , 1, \dots , 1)\). By Eq. 4, the true distance matrix of the latent points is \({\bar{D}} = {\mathcal {C}}_\kappa ^{-1}(\textrm{sign}\left( \kappa \right) {\bar{G}})\). The matrix G in the first step of Algorithm 1 is equal to
\(G = \textrm{sign}\left( \kappa \right) {\mathcal {C}}_\kappa ({\bar{D}}) = \textrm{sign}\left( \kappa \right) {\mathcal {C}}_\kappa \left( {\mathcal {C}}_\kappa ^{-1}(\textrm{sign}\left( \kappa \right) {\bar{G}})\right) = {\bar{G}},\)
i.e., the true Gram matrix is recovered. In steps two and three, the eigenvalue decomposition \(G = Q \Lambda Q^\top \) is computed and the reduced coordinate matrix \(Z'\) is computed as \(Z' = Q \,\textrm{diag}(\sqrt{\lambda _1}, \dotsc , \sqrt{\lambda _d}, 0, \dotsc , 0)\). In step four, Eq. 3 is used to infer the missing column \(\varvec{{z}}_0\), such that the complete matrix Z satisfies
\(Z J_\kappa Z^\top = G = {\bar{Z}} J_\kappa {\bar{Z}}^\top . \) (12)
Since \({\bar{Z}}\) has full column rank, its Moore–Penrose pseudoinverse \({\bar{Z}}^+\) is a left inverse, i.e., \({\bar{Z}}^+ {\bar{Z}} = I_{d+1}\). Set \(A := {\bar{Z}}^+ Z\). We have \({\bar{Z}} A = {\bar{Z}} {\bar{Z}}^+ Z = Z\), that is, A transforms the rows of \({\bar{Z}}\) into the rows of Z. Applying \({\bar{Z}}^+\) to (12) from the left and its transpose from the right, we obtain \(A J_\kappa A^\top = J_\kappa \). If \(\kappa > 0\), this implies that A is a rotation matrix, i.e., an isometry of \(M_\kappa ^d\). If \(\kappa < 0\), A is a hyperbolic isometry (Ratcliffe et al. 1994, Thms. 3.1.4 and 3.2.3). The latent coordinates \(\varvec{{z}}_1, \dotsc , \varvec{{z}}_K\) (rows of \({\bar{Z}}\)) are thus recovered up to isometry of \(M_\kappa ^d\), completing the proof.
4 SimZSL Challenges
To broaden the scope of zero-shot learning benchmarks towards learning from similarities, we have developed four challenges (CI, CII, CIII, CIV) and provide first numbers for each. Below we dive into them one by one.
4.1 CI: SimZSL w/o Common Names
Context. Vision-language models such as CLIP (Radford et al. 2021) perform zero-shot recognition by calculating the similarity between textual prompts from class names and image representations. This results in strong models when dealing with common class names, e.g., describing objects or actions. The purpose of the first challenge is to create a realistic scenario where we can no longer rely on the text encoders of vision-language models to obtain class embeddings. We take inspiration from the biological domain, where recognition tasks commonly deal with technical, esoteric, or highly fine-grained class names (Beery et al. 2022; Khan et al. 2023; Tschandl et al. 2018). In biology, class relations are often given by taxonomical similarities (Beery et al. 2022; Khan et al. 2023); troublesome for classical zero-shot learning approaches but ideal for our similarity-based zero-shot learning.
Fig. 2: Challenge II: t-SNE visualization of multi-source similarity-based embeddings on SUN and AWA2, with attributes as intra-dataset knowledge and word embeddings as inter-dataset knowledge. The temple south asia [SUN] and blue whale [AWA2] samples are correctly classified and surrounded by embeddings from their own dataset. The landing deck [AWA2] and corral [AWA2] samples are correctly classified, despite being surrounded by image embeddings from SUN. The horse [AWA2] sample is misclassified as field cultivated [SUN] due to their high semantic similarity; the same holds for the auditorium [SUN] sample misclassified as theater indoor. The analysis shows that we can learn an embedding space that smoothly interpolates within and across datasets and knowledge sources
Setup. We take a look at the recent FishNet dataset (Khan et al. 2023), where class names are given by the formal Latin names of the specific fish categories. For this experiment, we define a split into 232 seen and 231 unseen classes. In the experiments shown in Table 1, we use the CLIP model with a ViT-B/32 backbone, while the experiments in Table 7 cover CLIP models with various backbones, i.e., RN50, RN101, ViT-B/16, ViT-B/32, and ViT-L/14. For the prompt-based experiments, we follow Radford et al. (2021) and use “this is a photo of [classname].” as the prompt template. We compare three approaches. The first takes CLIP out of the box for direct zero-shot generalization by transforming the class names into prompts to be used as prototypes, following the conventional setup (Radford et al. 2021). The second additionally fine-tunes the image and text embeddings using the seen classes. The third uses our similarity-based approach on top of the same image encoder, where our class embeddings are derived from a similarity matrix based on the taxonomical graph distances between classes given by the original paper (Khan et al. 2023). For this challenge, we add a 50-dimensional linear layer on top of CLIP and fine-tune the model on FishNet. Similar to the original model, the representations are normalized and lie in spherical space. In addition, we perform experiments on the real-world Branch dataset (Yang et al. 2023), consisting of 13 classes divided into 7 seen and 6 unseen classes.
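To make the comparison explicit, a sketch of the two scoring routes with the official CLIP package is given below; `class_names`, `image`, the taxonomy-based distance matrix `D_tax`, and the added linear layer `head` are placeholders, and `kappa_mds` refers to the sketch from Sect. 3.2.

```python
import clip, torch
import numpy as np

model, preprocess = clip.load("ViT-B/32")

# Route 1: prompt-based prototypes; near-random for Latin names such as
# "Lepomis macrochirus", which carry no meaning for the text encoder.
tokens = clip.tokenize([f"this is a photo of {c}." for c in class_names])
with torch.no_grad():
    text_emb = model.encode_text(tokens).float()
    img = model.encode_image(preprocess(image).unsqueeze(0)).float()
text_emb /= text_emb.norm(dim=-1, keepdim=True)
img /= img.norm(dim=-1, keepdim=True)
pred_prompt = (img @ text_emb.t()).argmax(-1)

# Route 2 (ours): spherical prototypes (kappa = 1) from taxonomical
# similarities, used in place of the prompt embeddings; the fine-tuned
# head maps image features to the prototype dimensionality.
Z = torch.from_numpy(kappa_mds(D_tax, d=49, kappa=1.0)).float()  # (K, 50)
pred_simzsl = (head(img) @ Z.t()).argmax(-1)
```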
Results. As expected, applying CLIP directly leads to near-random performance (1.09%), since the textual encoder is unable to deal with the uncommon class names. For CLIP-based zero-shot learning, we follow Radford et al. (2021) by prompting with class names, e.g., creating a prompt in the form of “this is a photo of [classname]”. For example, given a class name such as “lion”, the prompt becomes “this is a photo of lion.” This prompt is then used to calculate the similarity with the image, and the class with the highest similarity is selected as the prediction. However, this approach relies on meaningful class names. For FishNet, this results in prompts such as “this is a photo of Lepomis macrochirus.” Since the class names in FishNet are not common names and CLIP models have not encountered similar contexts before, they fail to provide meaningful information, leading to near-random performance. Fine-tuning alone does not help much either, with an accuracy of 1.16%. The fine-tuning result highlights that the issue truly lies with the class embeddings: the model is unable to provide meaningful embeddings for the classes with uncommon names, so improving the alignment to seen classes does not improve the zero-shot alignment to unseen classes. Only when adding our prototypes from taxonomical similarities, using the prototypes instead of the prompts to calculate the similarity, are we able to boost the scores. Regardless, performing zero-shot classification on 231 unseen fish species remains a highly challenging task for future research. The FishNet dataset is large-scale and diverse, with images collected from various regions worldwide and with different sizes, resolutions, and illuminations. The dataset remains challenging across tasks due to large species diversity, fine-grained labels, diverse backgrounds, low contrast, and more (Khan et al. 2023). As shown by Veiga and Rodrigues (2024), FishNet is an imbalanced dataset with a minority-to-majority ratio of 4 to 4782. While the language encoder of the CLIP model is unable to generate a meaningful representation given Latin names, the images and their distribution are challenging for the vision encoder as well. In this challenge, however, we do not focus on improving the model itself and only want to pinpoint that using a similarity matrix instead of CLIP representations can boost performance.
To further analyze the performance of the CLIP model, we conduct experiments using different CLIP models on the real-world Branch dataset. As shown in Table 7, using our prototypes instead of prompts outperforms the standard CLIP models. While fine-tuning CLIP with the Branch dataset class names improves the results, our prototypes still outperform by a large margin for 4 out of 5 backbones.
4.2 CII: Multi-source SimZSL
Context. In standard zero-shot learning, each class is assigned an embedding vector from the same knowledge source. For example, if we rely on attributes as our source, each class should have annotated labels for each attribute. This setup makes sense if we are given a specific domain to operate in, e.g., recognizing birds, scenes, or animals. The objective of the second challenge is to go beyond a single domain for zero-shot recognition. We strive to perform zero-shot learning across multiple sets of classes simultaneously. Each set of classes comes with its own embedding space; again not directly applicable to the standard zero-shot pipeline, but feasible with similarity-based zero-shot learning, since all embeddings can be combined into a single unified similarity matrix.
Setup. We construct the second challenge by combining multiple existing zero-shot datasets, i.e., SUN-AWA2, CUB-SUN, and CUB-AWA2. To extract the class embeddings, we create a similarity matrix S that spans both datasets in a pair. For all class pairs within an individual dataset, we use their corresponding knowledge source to obtain similarities, i.e., similarities obtained from the 312-, 85-, and 102-dimensional attribute vectors of CUB-Birds, AWA2, and SUN. To obtain similarity scores of class pairs between the datasets, we use fastText (Bojanowski et al. 2017) word embeddings. We use DeViSE with 500-, 500-, and 50-dimensional class embeddings for the respective pairs. As this challenge is not feasible with standard ZSL methods, we unify the embeddings from multiple sources through PCA to match the dimensionalities of the attributes, on which standard zero-shot learning is performed as a reference baseline.
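A minimal numpy sketch of this unified matrix construction; the attribute matrices `A1`, `A2` and word embedding matrices `W1`, `W2` are assumed inputs, and cosine similarity is one illustrative choice of score.

```python
import numpy as np

def multi_source_similarities(A1, A2, W1, W2) -> np.ndarray:
    """Sketch: one similarity matrix spanning two datasets. Intra-dataset
    blocks come from attributes, the cross-dataset block from word vectors."""
    def cos(X, Y):
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
        return X @ Y.T
    S11 = cos(A1, A1)                 # e.g., CUB attribute similarities
    S22 = cos(A2, A2)                 # e.g., AWA2 attribute similarities
    S12 = cos(W1, W2)                 # fastText similarities across datasets
    return np.block([[S11, S12], [S12.T, S22]])
```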
Fig. 6: Failure analysis for challenges III and IV on the CUB, SUN, and AWA2 datasets. The first two columns show zero-shot learning with rare missing knowledge, where the model only has access to 15 attributes out of 312 and 102, respectively. Given rare missing knowledge, the model misclassifies the classes as ones with a high pairwise similarity. The third column shows zero-shot learning with random missing knowledge, in which 50% of the knowledge is missing
Results. The multi-source results are shown in Table 2. Our similarity-based approach works both in the settings with knowledge per dataset (intra-dataset knowledge) and with additional knowledge between datasets (inter-dataset knowledge). The type of knowledge does not even have to match, as everything becomes the same once turned into similarities. Extending DeViSE to multi-source zero-shot learning only works in the intra-dataset setting, and we outperform this baseline. Adding inter-dataset knowledge is effortless for us. Whether this leads to an improvement depends on the problem at hand. For example, when performing zero-shot learning on CUB and AWA2 jointly, extra word vector knowledge really helps since both datasets focus on animals. For heterogeneous settings, simply having 0 similarity between class pairs across datasets can be sufficient. We conclude that a similarity-based approach enables multi-source zero-shot learning, opening up recognition tasks spanning many datasets from different domains (see Fig. 2).
4.3 CIII: SimZSL w/ Rare Missing Knowledge
Context. We have seen how a similarity-based approach enables zero-shot generalization in difficult biological and multi-source settings. In real-world settings, the minimalist requirement of similarities can still be a high bar. For example, in neuroscience and human cognition, annotators are often asked to judge conceptual similarity between classes (Edelman and Shahbazi 2012). But naturally, some classes occur often while others are rare. The consequence is that not all knowledge is given between rare classes. Similar situations occur again in biological domains, where there is simply more known for some classes than other classes (Chen et al. 2023; Beery et al. 2020; Walker and Orenstein 2021). This challenge strives to address ZSL in such incomplete settings.
Setup. For zero-shot learning with rare missing knowledge, some classes have full knowledge while other classes have partial, non-overlapping knowledge, see Fig. 1c. To best understand the challenging nature of this problem and to understand how well our landmarked \(\kappa \)-MDS works, we create a simulated scenario, where we use attribute vectors with partially missing elements as the knowledge. The classes with partial knowledge contain only m elements, with no attributes in common between them. They do, however, share m attributes with the classes with full knowledge (i.e., the landmarks, as introduced in Sect. 3.4). Thus, pairwise similarities within these two sets can be computed. However, since the classes with partial prior knowledge have no attributes in common, their pairwise similarities are missing, resulting in a missing block in the final similarity matrix S, as formalized in Eq. 10. Despite such a gap, our approach is able to obtain class embeddings. For the experiments, we use the CUB, AWA2, and SUN datasets given 312-, 85-, and 102-dimensional attribute vectors as the full knowledge starting point.
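A numpy sketch of this simulation is given below; the disjoint-subset assignment is our illustrative reading of the setup, with NaN marking the missing nonlandmark block of Eq. 10.

```python
import numpy as np

def simulate_rare_missing(A: np.ndarray, m: int):
    """Sketch of the Challenge III setup. A: (K, n) full attribute matrix.
    The last K - L classes keep disjoint m-attribute subsets; similarities
    are computed only over the attributes a pair has in common."""
    K, n = A.shape
    n_partial = n // m                            # partial classes, disjoint subsets
    L = K - n_partial                             # remaining classes are landmarks
    subsets = [np.arange(n)] * L
    subsets += [np.arange(i * m, (i + 1) * m) for i in range(n_partial)]
    S = np.full((K, K), np.nan)                   # NaN = missing similarity
    for i in range(K):
        for j in range(K):
            shared = np.intersect1d(subsets[i], subsets[j])
            if shared.size > 0:                   # cosine over shared attributes
                a, b = A[i, shared], A[j, shared]
                S[i, j] = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return S, L
```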
Results. The results are shown in Fig. 3. The bottom x-axis shows the number of attributes available for classes with partial knowledge. The top x-axis shows the number of landmarks for each setting (i.e., the classes with full knowledge). The lower the number of attributes for the classes with partial knowledge, the fewer landmark classes we have with full knowledge, and the higher the percentage of missing knowledge. Our approach shows stable performance as the number of landmarks and attributes decreases. While the accuracy on the CUB dataset with full prior knowledge (312 attributes) is \(51.18\%\), the accuracy with 80 attributes and 197 landmarks is \(51.33\%\). For the AWA2 dataset, the full 85 attributes and partial 25 attributes yield \(60.08\%\) and \(54.14\%\), respectively. Interestingly, on SUN, missing knowledge can actually increase performance. While the performance on the SUN dataset in the full setting (102 attributes) is \(51.10\%\), using a minimum of 25 attributes and 713 landmarks gives an accuracy of \(53.82\%\). We conclude that our approach is not only able to generalize to zero-shot settings where knowledge is structurally missing, it is even possible to increase generalization in such settings. Additional experiments using newer, more advanced zero-shot learners, i.e., VGSE (Xu et al. 2022) and SZSL (Shen et al. 2021), are shown in Fig. 4. Our approach performs consistently across different numbers of landmarks, even down to only 10 landmarks.
4.4 CIV: SimZSL w/ Random Missing Knowledge
Context. Lastly, we provide a challenge where similarity scores are randomly missing. This is common when gathering similarities from the internet, for example when using co-occurrence statistics, see Fig. 1d.
Setup. We again opt for a simulated setting, because it allows us to investigate the robustness of our approach as a function of the number of missing similarities. We perform this experiment using DeViSE and VGSE as the zero-shot methods on AWA2, with attributes to generate similarities and 25-dimensional class embeddings, as described in Sect. 3.4.
Results. In Fig. 5, we show what happens when we randomly drop between 5% and 75% of the class similarities. Due to the variance of such settings, we ran each setting 6 times with different random seeds for dropping similarities. Our iterative approach for dealing with missing similarity values is robust to many missing values. Even when 50% of the similarities are missing, we obtain an accuracy of 40.33%. This result indicates that similarity-based zero-shot learning has strong potential for settings where similarities are hard to come by or lacking.
4.5 Qualitative Analysis
While SimZSL performs well given rare (i.e., challenge III) or random (i.e., challenge IV) missing knowledge, there are samples where the model struggles because of unavailable knowledge. Figure 6 shows some failure samples from the CUB, SUN, and AWA2 datasets, given rare and random missing knowledge. For CUB, the ground-truth classes are only aware of 15 out of 312 attributes; the rest is missing. In Fig. 6a, given a sample of “loggerhead shrike”, the model predicts “white breasted nuthatch”. With only 15 available attributes out of 312, “loggerhead shrike” and “white breasted nuthatch” share “white throat color” and “black primary color” as the top-2 most similar attributes, making it a difficult sample for the model. Meanwhile, “white wing color” and “solid breast pattern”, the two most discriminative attributes, are missing, making the classification difficult. In Fig. 6c, while the classes share “perching-like shape” and “grey leg color”, the most discriminative attributes, i.e., “yellow breast color” for “tropical kingbird” and “forehead yellow color” for “white-eyed vireo”, are missing, making the classes similar given the available set of attributes. Similarly, in Fig. 6d, there are only 15 out of 102 attributes available for SUN. As shown in Fig. 6d, the model assigns the “yard” label to “motel”. Given only 15 attributes as prior knowledge, both classes share “natural light” and “soothing”. In contrast, checking the missing attributes, “motel” is extremely “man-made” compared to “yard”, while “yard” has a higher value for “shrubbery”. Both attributes are missing, making the classes similar.
The last column shows samples from challenge IV with random missing knowledge. In these samples, the model only has access to 50% of the pairwise similarities. In Fig. 6e, the input to the model is an image of a “seal”, wrongly classified as a “dolphin”. While the pairwise similarity between “seal” and “dolphin” is missing from the given prior knowledge, these classes have an original similarity of 0.86, sharing attributes like “flippers”, “ocean”, and “water”. Similarly, in Fig. 6f, the model has classified a “giraffe” as a “horse”. These classes have a similarity of 0.80, sharing attributes like “long leg”, “big”, and “quadrupedal”, making it a hard sample for the model.
5 Ablation Studies
Setup. Prototypes generated by SimZSL can be plugged into any prototype-based zero-shot learning approach without changing the structure of the model. For every method used in the experiments, we take the same backbone and training procedure as proposed in the original paper; we only change the target prototypes to our similarity-based prototypes as obtained through \(\kappa \)-MDS. We employ HZSL (Liu et al. 2020) to perform zero-shot learning in hyperbolic space, utilizing hierarchies as prior knowledge. Specifically, we obtain the hierarchy for the CUB-Birds dataset from Chen et al. (2018a), while for the AWA2 dataset, we make use of the hierarchy extracted by Wu et al. (2020). We conduct experiments using the default hyperparameter settings of SZSL (Shen et al. 2021) and VGSE (Xu et al. 2022). For the HZSL (Liu et al. 2020) experiments, a learning rate of 0.01 and weight decay of 0.01 are employed, and for its hinge loss objective, we set the margin to 0.1. In all experiments with DeViSE (Frome et al. 2013), a learning rate of 0.01 and margin of 1 are utilized, with the exception of the SUN dataset, where a learning rate of 0.001 and margin of 0.25 are used. For the CLIP (Radford et al. 2021) and FishNet (Khan et al. 2023) experiments, a learning rate of 0.01 and weight decay of 0.0001 are employed. For the seen/unseen splits and attributes, we use the data provided by Xian et al. (2018a).
5.1 Are Similarities Enough for Zero-Shot Learning?
The main goal of this ablation study is to confirm our initial hypothesis: a class similarity matrix serves as a sufficient information source to perform zero-shot learning. To empirically validate this hypothesis, we start from standard knowledge sources such as attributes and word embeddings. We then reduce them to a similarity matrix S through a matrix product and feed the matrix as input to our approach to regain class embeddings. If our hypothesis is correct, zero-shot learning on the embeddings from the knowledge source directly is as effective as using our class embeddings obtained from similarity matrices. We perform this ablation study across datasets, knowledge sources, and manifolds. The results for Euclidean, hyperspherical, and hyperbolic zero-shot learners are shown in Table 3 for three knowledge sources on CUB with 200D prototypes. Table 4 also shows the results for zero-shot learners in three different manifolds on the SUN and AWA2 datasets with 500D and 50D prototypes. We find that zero-shot performance with our approach is as good as with the original semantic embeddings. In 7 out of 9 comparisons on CUB, the t-test indicates no significant difference in performance across the runs (p-value \(\gg \) 0.05). Similarly, there is no significant difference across runs in 6 out of 6 and 4 out of 6 comparisons on SUN and AWA2. We conclude from this ablation study that similarity-based learning does not hamper zero-shot learning.
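For reference, the reduction we use in this ablation is as simple as the following sketch; the cosine normalization is our assumed instantiation of the matrix product.

```python
import numpy as np

def knowledge_to_similarities(E: np.ndarray) -> np.ndarray:
    """Compress any class embedding matrix E (attributes, word vectors,
    hierarchy embeddings, ...) into a K x K similarity matrix S."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E @ E.T          # S_ij = similarity of classes i and j
```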
In addition, we extend our analysis by incorporating the SDGZSL (Chen et al. 2021a) and ICIS (Christensen et al. 2023) zero-shot learners on the CUB, SUN, and AWA2 datasets, shown in Table 5, as well as the Flowers and APY datasets, shown in Table 6. These results further strengthen the hypothesis of the adequacy of similarities for zero-shot learning, demonstrating consistent performance across both standard and previously underexplored datasets. This expanded evaluation underscores the robustness of similarity-based representations in diverse learning scenarios.
5.2 Which Manifold is Best for Which Knowledge Source?
As our proposed method provides the flexibility to generate prototypes for any manifold given a similarity matrix, we also compare the use of different knowledge sources for different manifolds, given in Table 3. The comparison between Euclidean and hyperbolic manifolds follows current literature: attributes work better in Euclidean space (51.14 vs. 50.91), while hierarchies work better in hyperbolic space (25.08 vs. 17.40). In line with Shen et al. (2021), hyperspherical spaces are preferred for word vectors. To make hierarchies work for Euclidean and hyperspherical manifolds, we first embed them in hyperbolic space and perform a logarithmic mapping (for Euclidean) or \(\ell _2\) normalization (for hyperspherical) afterwards. Interestingly, we find that with this setup, hyperspherical zero-shot learning works best for hierarchies. This is in line with Ghadimi Atigh et al. (2021) and Moreira et al. (2024), which highlight the innate normalized nature of hyperbolic prototypes.
5.3 How Many Embedding Dimensions are Enough?
Since we start from a similarity matrix, the dimensionality of the class embeddings is now a hyperparameter to choose freely. We analyze the performance of class embeddings with different dimensionalities in Fig. 7. The class embeddings are stable for a wide range of dimensions across different types of prior knowledge, datasets, and manifold choices. We conclude that our approach allows for a strong compression of the embedding space while maintaining zero-shot performance.
6 Conclusions
Empowered by ever more powerful backbones, knowledge sources, and pre-trained vision-language models, zero-shot learning continues to improve. The general zero-shot pipeline, however, also has blind spots, ranging from dealing with uncommon names to handling human judgment similarities as knowledge and missing information. We advocate for similarity matrices as the all-purpose knowledge source for zero-shot learning and introduce four challenges inspired by real-world scenarios where the standard zero-shot assumption does not hold. We furthermore propose \(\kappa \)-MDS, a general approach that obtains prototypes for seen and unseen classes on any manifold solely given similarities. Our method can be plugged into any prototype-based zero-shot learner. We show how our approach makes learning with uncommon names, multiple sources, and missing information possible, without hampering accuracy in standard settings. We hope that our similarity-based perspective opens new doors in zero-shot learning.
References
Agarwal, A., Phillips, J. M., & Venkatasubramanian, S. (2010). Universal multi-dimensional scaling. In Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1149–1158.
Akata, Z., Perronnin, F., Harchaoui, Z., & Schmid, C. (2015). Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7), 1425–1438.
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. (2022). Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35, 23716–23736.
Al-Halah, Z., & Stiefelhagen, R. (2015). How to transfer? Zero-shot object recognition via hierarchical transfer of semantic attributes. In 2015 IEEE winter conference on applications of computer vision, pp. 837–843. IEEE.
Ali, M., & Khan, S. (2023). Clip-decoder: Zeroshot multilabel classification using multimodal clip aligned representations. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4675–4679.
Atigh, M. G., Schoep, J., Acar, E., Van Noord, N., & Mettes, P. (2022). Hyperbolic image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4453–4462.
Beery, S., Liu, Y., Morris, D., Piavis, J., Kapoor, A., Joshi, N., Meister, M., & Perona, P. (2020). Synthetic examples improve generalization for rare classes. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 863–873.
Beery, S., Wu, G., Edwards, T., Pavetic, F., Majewski, B., Mukherjee, S., Chan, S., Morgan, J., Rathod, V., & Huang, J. (2022). The auto arborist dataset: A large-scale benchmark for multiview urban forest monitoring under domain shift. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 21294–21307.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Borg, I., & Groenen, P. J. (2005). Modern multidimensional scaling: Theory and applications. Springer.
Braytee, A., Naji, M., Anaissi, A., Chaturvedi, K., & Prasad, M. (2021). Zero-shot learning with missing attributes using semantic correlations. In 2021 international joint conference on neural networks (IJCNN), pp. 1–7. IEEE.
Bretti, C., & Mettes, P. (2021). Zero-shot action recognition from diverse object-scene compositions. In British machine vision conference.
Candes, E. J., & Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6), 925–936.
Carroll, J. D., & Arabie, P. (1998). Multidimensional scaling. Measurement, Judgment and Decision Making, 179–250.
Changpinyo, S., Chao, W.-L., & Sha, F. (2017). Predicting visual exemplars of unseen classes for zero-shot learning. In Proceedings of the IEEE international conference on computer vision, pp. 3476–3485.
Chen, S., Hong, Z., Liu, Y., Xie, G.-S., Sun, B., Li, H., Peng, Q., Lu, K., & You, X. (2022a). Transzero: Attribute-guided transformer for zero-shot learning. In Proceedings of the AAAI conference on artificial intelligence, vol. 36, pp. 330–338.
Chen, S., Hong, Z., Xie, G.-S., Yang, W., Peng, Q., Wang, K., Zhao, J., & You, X. (2022b). Msdn: Mutually semantic distillation network for zero-shot learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7612–7621.
Chen, Z., Luo, Y., Qiu, R., Wang, S., Huang, Z., Li, J., & Zhang, Z. (2021a). Semantics disentangling for generalized zero-shot learning. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 8712–8720.
Chen, S., Wang, W., Xia, B., Peng, Q., You, X., Zheng, F., & Shao, L. (2021b). Free: Feature refinement for generalized zero-shot learning. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 122–131.
Chen, T., Wu, W., Gao, Y., Dong, L., Luo, X., & Lin, L. (2018a). Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding. In Proceedings of the 26th ACM international conference on multimedia, pp. 2023–2031.
Chen, L., Zhang, H., Xiao, J., Liu, W., & Chang, S.-F. (2018b). Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1043–1052.
Chen, K., Lei, W., Zhao, S., Zheng, W.-S., & Wang, R. (2023). Pcct: Progressive class-center triplet loss for imbalanced medical image classification. IEEE Journal of Biomedical and Health Informatics, 27(4), 2026–2036.
Chen, S., Xie, G., Liu, Y., Peng, Q., Sun, B., Li, H., You, X., & Shao, L. (2021). Hsva: Hierarchical semantic-visual adaptation for zero-shot learning. Advances in Neural Information Processing Systems, 34, 16622–16634.
Christensen, A., Mancini, M., Koepke, A., Winther, O., & Akata, Z. (2023). Image-free classifier injection for zero-shot classification. In Proceedings of the IEEE/CVF international conference on computer vision pp. 19072– 19081.
De Silva, V., & Tenenbaum, J.B.(2004) Sparse multidimensional scaling using landmark points. Technical report, technical report, Stanford University.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L. ( 2009) Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition pp. 248– 255 . Ieee.
Desai, K., Nickel, M., Rajpurohit, T., Johnson, J., & Vedantam, S.R. ( 2023). Hyperbolic image-text representations. In International conference on machine learning pp. 7694– 7731. PMLR.
Edelman, S., & Shahbazi, R. (2012). Renewing the respect for similarity. Frontiers in Computational Neuroscience, 6, 45.
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., & Mikolov, T.(2013) Devise: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems Vol. 26.
Ghadimi Atigh, M., Keller-Ressel, M., & Mettes, P. (2021). Hyperbolic busemann learning with ideal prototypes. Advances in Neural Information Processing Systems, 34, 103–115.
Gui, Z., Sun, S., Li, R., & Yuan, J., An, Z., Roth, K., Prabhu, A., Torr, P. (2024). knn-clip: Retrieval enables training-free segmentation on continually expanding large vocabularies. arXiv preprint arXiv:2404.09447
Han, Z., Fu, Z., Chen, S., & Yang, J.( 2021). Contrastive embedding for generalized zero-shot learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 2371– 2381.
Han, H., Miao, K., Zheng, Q., Luo, & M.( 2023). Noisy correspondence learning with meta similarity correction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 7517– 7526.
Hout, M. C., Papesh, M. H., & Goldinger, S. D. (2013). Multidimensional scaling. Wiley Interdisciplinary Reviews: Cognitive Science, 4(1), 93–103.
Hsia, H.-A., Lin, C.-H., Kung, B.-H., Chen, J.-T., Tan, D.S., Chen, J.-C., & Hua, K.-L.( 2022). Clipcam: A simple baseline for zero-shot text-guided object and action localization. In ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) pp. 4453– 4457. IEEE.
Huynh, D., & Elhamifar, E. ( 2020). Fine-grained generalized zero-shot learning via dense attribute-based attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 4483– 4493.
Jain, P., Netrapalli, P., & Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on theory of computing pp. 665– 674.
Jaworska, N., & Chupetlovska-Anastasova, A. (2009). A review of multidimensional scaling (mds) and its utility in various psychological domains. Tutorials in Quantitative Methods for Psychology, 5(1), 1–10.
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning pp. 4904– 4916. PMLR.
Jiang, H., Wang, R., Shan, S., Chen, X.( 2019) Transferable contrastive network for generalized zero-shot learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9765– 9774
Jiang, X., Zhong, Z., Liu, X., & So, H. C. (2017). Robust matrix completion via alternating projection. IEEE Signal Processing Letters, 24(5), 579–583.
Jiao, S., Wei, Y., Wang, Y., Zhao, Y., & Shi, H. (2023). Learning mask-aware clip representations for zero-shot segmentation. Advances in Neural Information Processing Systems, 36, 35631–35653.
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., & Natsev, P., et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950
Keller-Ressel, M., Nargang, S.(2022) Strain-minimizing hyperbolic network embeddings with landmarks. arXiv preprint arXiv:2207.06775
Keller-Ressel, M., & Nargang, S. (2020). Hydra: A method for strain-minimizing hyperbolic embedding of network- and distance-based data. Journal of Complex Networks, 8(1), cnaa002.
Khan, F.F., Li, X., Temple, A.J., & Elhoseiny, M. (2023). FishNet: A large-scale dataset and benchmark for fish recognition, detection, and functional trait prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 20496–20506.
Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Lampert, C.H., Nickisch, H., & Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE conference on computer vision and pattern recognition, pp. 951–958. IEEE.
Lampert, C. H., Nickisch, H., & Harmeling, S. (2013). Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3), 453–465.
Larochelle, H., Erhan, D., & Bengio, Y. (2008). Zero-data learning of new tasks. In AAAI, vol. 1, p. 3.
Li, Y., Li, Z., Zeng, Q., Hou, Q., & Cheng, M.-M. (2024). Cascade-CLIP: Cascaded vision-language embeddings alignment for zero-shot semantic segmentation. arXiv preprint arXiv:2406.00670
Li, A., Luo, T., Lu, Z., Xiang, T., & Wang, L. (2019). Large-scale few-shot learning: Knowledge transfer with class hierarchy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7212–7220.
Li, A., Lu, Z., Guan, J., Xiang, T., Wang, L., & Wen, J.-R. (2020). Transferrable feature and projection learning with class hierarchy for zero-shot learning. International Journal of Computer Vision, 128, 2810–2827.
Liu, S., Chen, J., Pan, L., Ngo, C.-W., Chua, T.-S., & Jiang, Y.-G. (2020). Hyperbolic visual embedding learning for zero-shot recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9273–9281.
Liu, S., Long, M., Wang, J., & Jordan, M.I. (2018). Generalized zero-shot learning with deep calibration network. Advances in Neural Information Processing Systems, 31.
Long, T., Mettes, P., Shen, H.T., & Snoek, C.G. (2020). Searching for actions on the hyperbole. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1141–1150.
Ma, Y., Xu, G., Sun, X., Yan, M., Zhang, J., & Ji, R. (2022). X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval. In Proceedings of the 30th ACM international conference on multimedia, pp. 638–647.
Mensink, T., Gavves, E., & Snoek, C.G. (2014). COSTA: Co-occurrence statistics for zero-shot classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2441–2448.
Moreira, G., Marques, M., Costeira, J.P., & Hauptmann, A. (2024). Hyperbolic vs. Euclidean embeddings in few-shot learning: Two sides of the same coin. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 2082–2090.
Narayan, S., Gupta, A., Khan, F.S., Snoek, C.G., & Shao, L. (2020). Latent embedding feedback and discriminative features for zero-shot classification. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII, pp. 479–495. Springer.
Nguyen, L. T., Kim, J., & Shim, B. (2019). Low-rank matrix completion: A contemporary survey. IEEE Access, 7, 94215–94237.
Patterson, G., Xu, C., Su, H., & Hays, J. (2014). The SUN attribute database: Beyond categories for deeper scene understanding. International Journal of Computer Vision, 108, 59–81.
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR.
Ratcliffe, J.G., Axler, S., & Ribet, K.A. (1994). Foundations of hyperbolic manifolds, vol. 149. Springer.
Reed, S., Akata, Z., Lee, H., & Schiele, B. (2016). Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 49–58.
Rohrbach, M., Stark, M., & Schiele, B. (2011). Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR 2011, pp. 1641–1648. IEEE.
Romera-Paredes, B., & Torr, P. (2015). An embarrassingly simple approach to zero-shot learning. In International conference on machine learning, pp. 2152–2161. PMLR.
Schonfeld, E., Ebrahimi, S., Sinha, S., Darrell, T., & Akata, Z. (2019). Generalized zero- and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8247–8255.
Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018). Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th annual meeting of the association for computational linguistics (Volume 1: Long Papers), pp. 2556–2565.
Shen, Y., Qin, J., Huang, L., Liu, L., Zhu, F., & Shao, L. (2020). Invertible zero-shot recognition flows. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI, pp. 614–631. Springer.
Shen, J., Xiao, Z., Zhen, X., & Zhang, L. (2021). Spherical zero-shot learning. IEEE Transactions on Circuits and Systems for Video Technology, 32(2), 634–645.
Socher, R., Ganjoo, M., Manning, C.D., & Ng, A. (2013). Zero-shot learning through cross-modal transfer. Advances in Neural Information Processing Systems, 26.
Subramanian, S., Merrill, W., Darrell, T., Gardner, M., Singh, S., & Rohrbach, A. (2022). ReCLIP: A strong zero-shot baseline for referring expression comprehension. arXiv preprint arXiv:2204.05991
Tabaghi, P., & Dokmanić, I. (2020). Hyperbolic distance matrices. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1728–1738.
Tang, B., Zhang, J., Yan, L., Yu, Q., Sheng, L., & Xu, D. (2024). Data-free generalized zero-shot learning. Proceedings of the AAAI Conference on Artificial Intelligence, 38, 5108–5117.
Tschandl, P., Rosendahl, C., & Kittler, H. (2018). The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1), 1–9.
Veiga, R.J., & Rodrigues, J.M. (2024). Fine-grained fish classification from small to large datasets with vision transformers. IEEE Access.
Verma, V.K., Arora, G., Mishra, A., & Rai, P. (2018). Generalized zero-shot learning via synthesized examples. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4281–4289.
Vyas, M.R., Venkateswara, H., & Panchanathan, S. (2020). Leveraging seen and unseen semantic relationships for generative zero-shot learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX, pp. 70–86. Springer.
Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 dataset.
Walker, J.L., & Orenstein, E.C. (2021). Improving rare-class recognition of marine plankton with hard negative mining. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3672–3682.
Wang, Y., Kwok, J.T., Yao, Q., & Ni, L.M. (2017). Zero-shot learning with a partial set of observed attributes. In 2017 international joint conference on neural networks (IJCNN), pp. 3777–3784. IEEE.
Wang, H., Li, Y., Yao, H., & Li, X. (2023b). CLIPN for zero-shot OOD detection: Teaching CLIP to say no. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1802–1812.
Wang, Z., Liang, J., He, R., Xu, N., Wang, Z., & Tan, T. (2023a). Improving zero-shot generalization for CLIP with synthesized prompts. In Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp. 3032–3042.
Wang, M., Xing, J., & Liu, Y. (2021). ActionCLIP: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472
Wang, Q., & Chen, K. (2017). Zero-shot visual recognition via bidirectional latent embedding. International Journal of Computer Vision, 124, 356–383.
Wu, T.-Y., Morgado, P., Wang, P., Ho, C.-H., & Vasconcelos, N. (2020). Solving long-tailed recognition with deep realistic taxonomic classifier. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII, pp. 171–189. Springer.
Xian, Y., Lorenz, T., Schiele, B., & Akata, Z. (2018b). Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5542–5551.
Xian, Y., Lampert, C. H., Schiele, B., & Akata, Z. (2018a). Zero-shot learning – A comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9), 2251–2265.
Xie, G.-S., Liu, L., Jin, X., Zhu, F., Zhang, Z., Qin, J., Yao, Y., & Shao, L. (2019). Attentive region embedding network for zero-shot learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9384–9393.
Xie, G.-S., Liu, L., Zhu, F., Zhao, F., Zhang, Z., Yao, Y., Qin, J., & Shao, L. (2020). Region graph embedding network for zero-shot learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV, pp. 562–580. Springer.
Xu, W., Xian, Y., Wang, J., Schiele, B., & Akata, Z. (2022). VGSE: Visually-grounded semantic embeddings for zero-shot learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9316–9325.
Xu, W., Xian, Y., Wang, J., Schiele, B., & Akata, Z. (2020). Attribute prototype network for zero-shot learning. Advances in Neural Information Processing Systems, 33, 21969–21980.
Yang, T., Zhou, S., Huang, Z., Xu, A., Ye, J., & Yin, J. (2023). Urban street tree dataset for image classification and instance segmentation. Computers and Electronics in Agriculture, 209, 107852.
Yu, Y., Ji, Z., Han, J., & Zhang, Z. (2020). Episode-based prototype generating network for zero-shot learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14035–14044.
Zhou, Z., Lei, Y., Zhang, B., Liu, L., & Liu, Y. (2023). ZegCLIP: Towards adapting CLIP for zero-shot semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11175–11185.
Zhou, C., Loy, C.C., & Dai, B. (2022). Extract free dense labels from CLIP. In European conference on computer vision, pp. 696–712. Springer.
Zhu, P., Wang, H., & Saligrama, V. (2019). Generalized zero-shot recognition based on visually semantic embedding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2995–3003.
Additional information
Communicated by Gunhee Kim.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.