1 Introduction

Zero-shot learning (ZSL) constitutes a long-standing problem in computer vision, where we seek to recognize classes for which we do not have any training examples (Lampert et al. 2009; Larochelle et al. 2008). Different from supervised learning, the go-to route in zero-shot learning is to rely on an embedding space from a prior knowledge source that is shared amongst seen and unseen classes. By optimizing training samples of seen classes to their embeddings in this shared space, it also becomes possible to recognize unseen classes during inference through a nearest neighbor search of the unseen class embeddings (Xian et al. 2018a).

Fig. 1 We propose similarity-based zero-shot learning and introduce four challenges. (a) Challenge I: Zero-shot without common names. (b) Challenge II: Multi-source zero-shot learning. (c) Challenge III: Zero-shot learning with rare missing knowledge. (d) Challenge IV: Zero-shot learning with random missing knowledge

Zero-shot learning has witnessed tremendous progress over the years. Where traditional solutions commonly use embedding spaces constructed through, e.g., attributes (Chen et al. 2022a; Lampert et al. 2009; Xu et al. 2020) or word vectors (Frome et al. 2013; Liu et al. 2020; Radford et al. 2021), the state-of-the-art leverages the embedding space of vision-language models such as CLIP (Radford et al. 2021), typically by transforming a class name into a prompt (Wang et al. 2023a; Tang et al. 2024; Ali and Khan 2023). These advances have resulted in high accuracies on the benchmarks we all love in computer vision, from ImageNet (Deng et al. 2009) and CIFAR (Krizhevsky et al. 2009) to Kinetics (Kay et al. 2017), CUB-Birds (Wah et al. 2011), AWA2 (Xian et al. 2018a), SUN (Patterson et al. 2014) and more. The common assumption that drives current works is that all seen and unseen classes either have pre-defined semantic representations or that they can easily be obtained through prompting.

This assumption is however not valid in many real-world settings. What if the class names are not common English names, as happens naturally in biological and medical settings (Khan et al. 2023; Beery et al. 2022; Tschandl et al. 2018)? Any attempt at a prompt will only lead to random predictions. What if the relation between seen and unseen classes is only expressed by a score, as is the case in many neuroscience experiments (Edelman and Shahbazi 2012) and when dealing with co-occurrence statistics (Mensink et al. 2014)? What if we are dealing with different embedding spaces for different sets of classes? And what if knowledge is missing because we are dealing with rare classes (Chen et al. 2023; Beery et al. 2020; Walker and Orenstein 2021) or with noisy information from the internet (Sharma et al. 2018; Han et al. 2023)? This paper strives to enable zero-shot learning when we can no longer assume there is a pre-defined semantic embedding space.

We propose SimZSL: similarity-based zero-shot learning. We find that a single similarity score between each class pair forms the minimum required building block to enable seen to unseen generalization. A similarity matrix for seen and unseen classes as prior knowledge makes for the most foundational building block, as it allows us to become agnostic to the prior knowledge on which we rely. To encourage research beyond the scope of current zero-shot literature, we outline four challenges. The first is zero-shot without common class names, describing the scenario where CLIP is no longer the magic bullet. The second is multi-source zero-shot learning, where different sets of classes share different embedding spaces. This setting cripples standard zero-shot learning, but becomes feasible when breaking the embeddings down to similarity scores. The third is zero-shot learning with rare missing knowledge, describing the scenario where some classes are rare and hence do not have a known relation to all other classes. The fourth is zero-shot learning with random missing knowledge, where the relation between some classes is randomly missing due to a noisy source. The challenges are visualized in Fig. 1.

To obtain the first results on the new similarity-based zero-shot learning challenges, we take inspiration from classical multidimensional scaling (MDS) (Carroll and Arabie 1998; Jaworska and Chupetlovska-Anastasova 2009; Hout et al. 2013), which strives to construct a feature vector for each object given a matrix of distances between all objects, canonically for visualization purposes (Jaworska and Chupetlovska-Anastasova 2009). While MDS operates in Euclidean space, we introduce \(\kappa \)-MDS, a generalization that also operates in hyperspherical and hyperbolic spaces, enabling us to plug its prototype outputs into any existing prototype-based zero-shot learner. Depending on the manifold in which the zero-shot learner operates, \(\kappa \) can be set accordingly. We furthermore outline two extensions of \(\kappa \)-MDS for dealing with rare and random missing knowledge. Experiments on the new challenges show that our approach makes it possible to learn under any zero-shot setting, while they also indicate that zero-shot generalization in these challenging scenarios has a lot of room for improvement. We lastly verify that similarity scores are a sufficient source for zero-shot learning in general. Across multiple datasets and zero-shot learners, we show that any prior knowledge source can be compressed to a similarity matrix without hampering performance when using our embedding construction method. We conclude that similarity-based learning opens new doors in zero-shot recognition without limiting the existing direction in the field.

In summary, our contributions are as follows:

  1. We propose SimZSL, similarity-based zero-shot learning, and we show that similarity scores are a sufficient minimal knowledge source for zero-shot learning;

  2. We outline four new challenges: zero-shot without common class names, multi-source zero-shot learning, zero-shot with rare missing knowledge, and zero-shot learning with random missing knowledge;

  3. We introduce \(\kappa \)-MDS to construct class prototypes on any manifold, even when similarities are missing.

2 Related Work

2.1 ZSL w/ Full Prior Knowledge

Zero-shot learning aims to generalize from a set of training classes to a completely separate set of unseen test classes using a prior knowledge source. Zero-shot learning in literature has relied on many types of embedding spaces shared by seen and unseen classes. The most common examples include attributes (Chen et al. 2022b; Akata et al. 2015; Chen et al. 2018b, 2022a; Jiang et al. 2019; Huynh and Elhamifar 2020; Xie et al. 2019; Romera-Paredes and Torr 2015; Xie et al. 2020; Xu et al. 2020; Zhu et al. 2019; Chen et al. 2021b, a; Han et al. 2021; Narayan et al. 2020; Shen et al. 2020; Verma et al. 2018; Xu et al. 2022; Schonfeld et al. 2019; Xian et al. 2018b; Liu et al. 2018; Wang and Chen 2017; Shen et al. 2021; Reed et al. 2016; Yu et al. 2020; Vyas et al. 2020; Rohrbach et al. 2011), word vectors (Akata et al. 2015; Bretti and Mettes 2021; Xu et al. 2022; Socher et al. 2013; Schonfeld et al. 2019; Chen et al. 2021c; Frome et al. 2013; Liu et al. 2018; Wang and Chen 2017; Shen et al. 2021; Liu et al. 2020; Reed et al. 2016; Yu et al. 2020; Vyas et al. 2020; Chen et al. 2022a), and class hierarchies (Akata et al. 2015; Liu et al. 2020; Li et al. 2019; Li et al. 2020; Long et al. 2020; Atigh et al. 2022; Rohrbach et al. 2011). Lampert et al. (2009) were the first to explore zero-shot learning in computer vision. They introduced attributes, which are human-annotated high-level descriptions of classes, bridging the gap between seen and unseen classes. Attributes are represented as binary or continuous vectors for machine utilization, e.g., binary attributes such as “black”, “has stripes”, or “eats fish” in the AWA2 dataset (Lampert et al. 2013; Akata et al. 2015). Attributes can be used for zero-shot learning to play the role of target representations directly (Xu et al. 2020; Akata et al. 2015; Romera-Paredes and Torr 2015), to project to a common embedding space (Chen et al. 2021c; Liu et al. 2018; Wang and Chen 2017), or as the prior knowledge to generate new synthetic samples (Verma et al. 2018; Chen et al. 2021b, a; Schonfeld et al. 2019; Shen et al. 2020).

Due to the strong annotation requirements of attributes, a wide range of works have investigated more scalable prior knowledge sources, such as text embeddings and hierarchical class relations. Text-based prior knowledge can be obtained for example through Word2Vec (Schonfeld et al. 2019), GloVe (Liu et al. 2020), or FastText (Xu et al. 2022) vectors extracted from the class names or a prompt including the class names, or by extracting sentences about the classes from web sources and generating TF-IDF (Vyas et al. 2020) or language model representations (Reed et al. 2016). Given a text-based representation per class, zero-shot training and inference can be performed akin to attribute-based zero-shot learning. Where text-based knowledge largely follows the setup of attributes, class hierarchies form a largely different source for enabling zero-shot learning. Class hierarchies are commonly present in visual datasets, e.g., ImageNet (Deng et al. 2009), CUB-Birds (Wah et al. 2011), Kinetics (Kay et al. 2017), and many more. Early approaches use hierarchies to transfer knowledge from seen to unseen classes through hierarchical relations to help distinguish similar classes (Rohrbach et al. 2011; Al-Halah and Stiefelhagen 2015). More recently, several works propose to embed hierarchies such that their parent–child relations are preserved with minimal distortion. Once embedded, the nodes of the hierarchy can be used as target vectors for representing classes, following the consensus in zero-shot learning. Several works have highlighted that different knowledge sources prefer different geometries for the embedding spaces, such as hyperspherical spaces for text-based embeddings (Shen et al. 2021) and hyperbolic spaces for hierarchical embeddings (Liu et al. 2020; Long et al. 2020).

All mentioned approaches assume that seen and unseen classes are a priori embedded in a shared embedding space. Our work strives to push zero-shot learning to settings where this assumption is no longer viable. While zero-shot learning approaches using fixed prior knowledge or pseudo-class centers as the prototypes require a pre-defined, single-source prior knowledge to generalize from seen to unseen classes, similarity-based zero-shot learning requires only a single similarity score between each pair of classes. The similarity score can be obtained from any knowledge source, from combinations of knowledge sources, or in settings where only similarity scores are given, such as in neuroscience and when working with co-occurrence statistics.

Similarly, Mensink et al. (2014) have advocated for similarities in the form of co-occurrence statistics between classes to perform zero-shot learning. Mensink et al. (2014) propose to extract the co-occurrence statistics from the class-level annotations or web-search hit counts, removing the requirement of expensive, expert-driven annotations. While their approach focuses on co-occurrences only, we generalize to any similarity-based setting. Moreover, our \(\kappa \)-MDS approach works with any prototype-based zero-shot learner and can even deal with missing knowledge.

2.2 ZSL w/ Vision-Language Models

The state-of-the-art in zero-shot learning largely builds upon advances in large-scale vision-language models. When trained on large collections of image-text pairs, vision-language models such as CLIP (Radford et al. 2021), ALIGN (Jia et al. 2021), Flamingo (Alayrac et al. 2022), ActionCLIP (Wang et al. 2021), X-CLIP (Ma et al. 2022), MaskCLIP (Zhou et al. 2022), ReCLIP (Subramanian et al. 2022), CLIPCAM (Hsia et al. 2022), ZegCLIP (Zhou et al. 2023), MAFT (Jiao et al. 2023), MERU (Desai et al. 2023), CLIPN (Wang et al. 2023b), kNN-CLIP (Gui et al. 2024), Cascade-CLIP (Li et al. 2024) and many variants are able to generalize out-of-the-box to unseen classes with impressive performance. What is more, all we need is to convert a class name into a short description, with the go-to solution in the form of “this is a photo of [classname]” (Radford et al. 2021; Tang et al. 2024; Ali and Khan 2023). While effective on general datasets where each class is defined by a common English name, this setup will not work in domains with specialized, uncommon, and non-English names, or on classes not observed when training vision-language models. We show how to make zero-shot learning possible in all settings by condensing all knowledge down to similarities.

2.3 ZSL w/ Missing Knowledge

A few works have investigated zero-shot learning with missing knowledge. Wang et al. (2017) propose a zero-shot learning method to deal with a partial set of observed class attributes. They assume that there is a set of attributes that is missing for all unseen classes. Braytee et al. (2021) also investigate zero-shot learning with missing attributes by learning a supplementary semantic attribute matrix. In contrast, we investigate missing knowledge at the similarity level instead of the attribute level, making our approach more general and agnostic to the used knowledge. Moreover, in our case, knowledge can be both structurally and randomly missing. MDS has been used in prior zero-shot learning, e.g., (Changpinyo et al. 2017). We propose a generalization \(\kappa \)-MDS to operate on any non-Euclidean manifold with curvature \(\kappa \), and we show how to operate with partial knowledge.

3 SimZSL

3.1 Problem Formulation

For our problem, we are given a training set \({\mathcal {T}} = \{(x_i, y_i)\}_{i=1}^{N}\) with N examples, where \(x_i \in {\mathcal {I}}\) denotes the \(i^{\text {th}}\) input image and \(y_i \in {\mathcal {Y}}_s\) denotes the corresponding category label. At test time, the goal is to assign a label to a test image from a set of unseen labels \({\mathcal {Y}}_u\), where \({\mathcal {Y}}_s \cap {\mathcal {Y}}_u = \emptyset \). The point of the paper is that for zero-shot learning in any challenging setting, all we need is a similarity matrix \(S \in \mathbb {R}^{K \times K}\), with \({\mathcal {Y}} = {\mathcal {Y}}_s \cup {\mathcal {Y}}_u\) and \(|{\mathcal {Y}}| = K\), where \(S_{i,j}\) is the similarity score of classes i and j. Given S, we want to distill class prototypes for \({\mathcal {Y}}\). In this work, we strive for an embedding algorithm that can be applied to any zero-shot learner, including recent alternatives that rely on non-Euclidean spaces (Shen et al. 2021; Liu et al. 2020). Additionally, such a method should still work when the similarities can be partially given, i.e., when S is incomplete. To summarise, SimZSL consists of two steps: (1) extracting prototypes given pairwise similarities, (2) performing prototype-based zero-shot learning given SimZSL prototypes.

3.2 How to Learn from Similarities

For our approach, we take inspiration from MDS (Borg and Groenen 2005). MDS deals with a dissimilarity matrix \(D \in \mathbb {R}^{K \times K}\) instead of similarities. The two can be converted into each other by identifying the similarities with \(-\frac{1}{2} D'\) and centering the resulting matrix:

$$\begin{aligned} S = -\frac{1}{2} \left( I - \frac{1}{K} J_K\right) D' \left( I - \frac{1}{K} J_K\right) , \end{aligned}$$
(1)

with \(J_K\) the all-ones matrix of dimension \(K \times K\) and \(D'\) the matrix of squared dissimilarities with \(D'_{ij} = D^2_{ij}\).
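As a minimal illustration of Eq. 1, the sketch below double-centers the squared dissimilarities and then recovers Euclidean coordinates by eigendecomposition, i.e., classical MDS. The function names (`double_center`, `classical_mds`, `pairwise_dist`) and the use of NumPy are our own assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def double_center(D):
    """Eq. 1: S = -1/2 * C D' C, with C = I - J_K / K and D' the squared dissimilarities."""
    K = D.shape[0]
    C = np.eye(K) - np.ones((K, K)) / K
    return -0.5 * C @ (D ** 2) @ C

def classical_mds(D, d):
    """Euclidean coordinates (K x d) from the d largest eigenpairs of S."""
    S = double_center(D)
    evals, evecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:d]          # indices of the d largest eigenvalues
    # Clip tiny negative eigenvalues caused by round-off before taking the root.
    return evecs[:, top] * np.sqrt(np.clip(evals[top], 0.0, None))

def pairwise_dist(X):
    """Matrix of Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))
```

If D consists of exact Euclidean distances between points in d dimensions, the recovered coordinates reproduce D up to rotation, reflection, and translation.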

We assume that each label \(y_i\) can be represented by a point \(z_i\) in a latent space \(M_\kappa ^d\) with constant curvature \(\kappa \) and dimension d; these points are the output of Algorithms 1 and 2. Moreover, the pairwise dissimilarities \(d_{ij}\) of any two classes \(y_i, y_j\) can be approximated by the distance \(d_\kappa (z_i, z_j)\) of \(z_i\) and \(z_j\) in \(M_\kappa ^d\). This allows us to recast the embedding problem on the dissimilarity matrix D as a problem of completing a distance matrix in \(M_\kappa ^d\). In the Euclidean case, the solution is well-studied and MDS itself suffices, see Borg and Groenen (2005). Below, we propose \(\kappa \)-MDS, a unified formulation for obtaining class embeddings on any non-Euclidean manifold given constant curvature \(\kappa \). First, let us introduce the following function:

$$\begin{aligned}{\mathcal {C}}_\kappa (x) = {\left\{ \begin{array}{ll}\cos (\sqrt{\kappa } x), \qquad \kappa > 0,\\ \cosh (\sqrt{|\kappa |}x) \qquad \kappa < 0.\end{array}\right. } \end{aligned}$$
Algorithm 1 \(\kappa \)-MDS (D, d, \(\kappa \))

The inverse function \({\mathcal {C}}_\kappa ^{-1}\) is well-defined for \(x \in \tfrac{1}{\sqrt{\kappa }} [0,\pi ]\) when \(\kappa > 0\), and for all \(x \in [0,\infty )\) when \(\kappa < 0\). Given the curvature \(\kappa \in \mathbb {R} \setminus \left\{ 0\right\} \), the inner product \(\left\langle z,z'\right\rangle _\kappa \) of \(z, z' \in \mathbb {R}^{d+1}\) is defined as

$$\begin{aligned} \left\langle z,z'\right\rangle _\kappa = \textrm{sign}\left( \kappa \right) z_0 z'_0 + \Big (z_1 z'_1 + \cdots + z_d z'_d\Big ), \end{aligned}$$
(2)

where this is the usual (Euclidean) inner product if \(\kappa > 0\), and the indefinite Lorentz product if \(\kappa < 0\). Moreover, we can write \(M_\kappa ^d\) as the connected component of

$$\begin{aligned} {M}_\kappa ^d:= \left\{ z \in \mathbb {R}^{d+1}: \left\langle z,z\right\rangle _\kappa = \textrm{sign}\left( \kappa \right) \right\} , \end{aligned}$$
(3)

containing \(z = (1,0,\dotsc , 0)\), with the distance on \(M_\kappa ^d\):

$$\begin{aligned} d_\kappa (z,z') = {\mathcal {C}}^{-1}_\kappa (\textrm{sign}\left( \kappa \right) \left\langle z,z'\right\rangle _\kappa ), \end{aligned}$$
(4)

see Ratcliffe et al. (1994). Inverting this equation, we obtain

$$\begin{aligned} \left\langle z,z'\right\rangle _\kappa = \textrm{sign}\left( \kappa \right) {\mathcal {C}}_\kappa (d_\kappa (z,z')). \end{aligned}$$
(5)

Applied to the latent label representations \(z_1, \dotsc , z_K\), and written in matrix form, this means that the distance matrix \(D = [d_\kappa (z_i,z_j)]\) can be converted to the Gram matrix \(G = [\left\langle z_i,z_j\right\rangle _\kappa ]\) of pairwise inner products by the element-wise operation \(G = \textrm{sign}\left( \kappa \right) {\mathcal {C}}_\kappa (D)\). The coordinates \(z_1, \dotsc , z_K\) can be recovered from the Gram matrix G through eigendecomposition. We outline the unified \(\kappa \)-MDS solution in Algorithm 1, which returns a coordinate matrix Z, whose rows are the coordinates \(z_1, \dotsc , z_K\) of the desired latent label representations (Agarwal et al. 2010; Keller-Ressel and Nargang 2020; Tabaghi and Dokmanić 2020).
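The steps described above (distance matrix to Gram matrix, eigendecomposition, coordinates) can be sketched as follows. This is a reconstruction from the text under stated assumptions, not a verbatim copy of Algorithm 1: for \(\kappa > 0\) we assume the \(d+1\) leading eigenpairs are used directly, and for \(\kappa < 0\) we assume the d positive eigenpairs give the spatial coordinates while \(z_0\) is inferred from the constraint in Eq. 3. All function names are hypothetical.

```python
import numpy as np

def c_kappa(x, kappa):
    """Elementwise C_kappa: cos for kappa > 0, cosh for kappa < 0."""
    s = np.sqrt(abs(kappa))
    return np.cos(s * x) if kappa > 0 else np.cosh(s * x)

def c_kappa_inv(y, kappa):
    """Elementwise inverse of C_kappa; inputs are clipped to the valid range."""
    s = np.sqrt(abs(kappa))
    if kappa > 0:
        return np.arccos(np.clip(y, -1.0, 1.0)) / s
    return np.arccosh(np.maximum(y, 1.0)) / s

def inner_kappa(Z, kappa):
    """Gram matrix of the inner products of Eq. 2 between the rows of Z."""
    J = np.ones(Z.shape[1])
    J[0] = np.sign(kappa)
    return (Z * J) @ Z.T

def kappa_mds(D, d, kappa):
    """Sketch of kappa-MDS: returns a K x (d+1) coordinate matrix Z."""
    G = np.sign(kappa) * c_kappa(D, kappa)      # Gram matrix, as in Eq. 5
    evals, evecs = np.linalg.eigh(G)
    order = np.argsort(evals)[::-1]             # eigenvalues, largest first
    if kappa > 0:
        # Spherical case (assumption): G is PSD with rank d+1.
        idx = order[: d + 1]
        Z = evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0.0, None))
        return Z / np.linalg.norm(Z, axis=1, keepdims=True)
    # Hyperbolic case: d positive eigenpairs give the spatial coordinates,
    # z_0 follows from <z, z>_kappa = -1 on the upper sheet (Eq. 3).
    idx = order[:d]
    Z_space = evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0.0, None))
    z0 = np.sqrt(1.0 + (Z_space ** 2).sum(axis=1, keepdims=True))
    return np.hstack([z0, Z_space])
```

The distances of the returned points can be checked against the input via `c_kappa_inv(np.sign(kappa) * inner_kappa(Z, kappa), kappa)`, which follows Eq. 4.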

3.3 Generalization to Non-Euclidean

By performing MDS or \(\kappa \)-MDS on D, we obtain a coordinate matrix Z. Each row in this matrix denotes a vector representation of a class, which we can directly use as a class embedding in any prototype-based zero-shot learning method. In this paper, we investigate our similarity-based embeddings when plugged into various zero-shot learning methods. \(\kappa \)-MDS is applied to non-Euclidean zero-shot learners, i.e., SZSL and HZSL. For Euclidean approaches such as DeViSE and VGSE, \(\kappa \)-MDS reverts to MDS. For the Euclidean case, we investigate both the canonical DeViSE algorithm (Frome et al. 2013) and the more recent VGSE approach (Xu et al. 2022). To train DeViSE, the goal is to optimize the hinge rank loss

$$\begin{aligned} L_{cls} = \sum _{y \in {\mathcal {Y}}^{tr}}{[ m + \theta (x_n)^{T}W Z_y - \theta (x_n)^{T}W Z_n]_{+}}, \end{aligned}$$
(6)

where W maps the image embeddings generated by \(\theta (x)\) to the class embeddings Z, \(Z_n\) is the embedding of the class of \(x_n\), and m denotes the margin. To train VGSE, the loss function is

$$\begin{aligned} L_{cls} = [\max _{y \in {\mathcal {Y}}^{tr}}(m + \theta (x_n)^{T}W Z_y - \theta (x_n)^{T}W Z_n)]_{+}. \end{aligned}$$
(7)
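To make the difference between Eqs. 6 and 7 concrete, the sketch below evaluates both losses for a single image: the first sums the margin violations over classes, the second keeps only the largest one. The function names, the default margin of 0.1, and the choice to exclude the ground-truth class from the sum and the maximum are illustrative assumptions rather than details given in the text.

```python
import numpy as np

def compat_scores(feat, W, Z):
    """theta(x)^T W Z_y for every class y. feat: (p,), W: (p, q), Z: (C, q)."""
    return Z @ (W.T @ feat)

def devise_loss(feat, W, Z, label, m=0.1):
    """Eq. 6 style hinge rank loss: sum of margin violations over wrong classes."""
    s = compat_scores(feat, W, Z)
    viol = np.maximum(0.0, m + s - s[label])
    viol[label] = 0.0                      # assumption: skip the ground-truth class
    return float(viol.sum())

def vgse_loss(feat, W, Z, label, m=0.1):
    """Eq. 7 style loss: hinge on the single largest margin violation."""
    s = compat_scores(feat, W, Z)
    viol = m + s - s[label]
    viol[label] = -np.inf                  # assumption: skip the ground-truth class
    return float(max(0.0, viol.max()))
```

Here the rows of `Z` would be the seen-class prototypes produced by MDS, so the only change relative to a standard pipeline is where the class embeddings come from.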

We also plug our prototypes on top of CLIP (Radford et al. 2021). In all cases, we simply use the seen classes in our embeddings as targets in the corresponding training losses and use the unseen class embeddings for nearest neighbor search during testing. We furthermore investigate SZSL (Shen et al. 2021) and HZSL (Liu et al. 2020) for hyperspherical and hyperbolic zero-shot learning, respectively. For these methods, we set the \(\kappa \) in our \(\kappa \)-MDS to 1 and -1, respectively, and plug our embeddings as training and test targets. \(\kappa \)-MDS works under varying values of \(\kappa \). It is however not a hyperparameter, but a way to allow us to operate with any non-Euclidean zero-shot learner. As different zero-shot learners prefer different curvatures when training their models, our approach can supply prototypes from \(\kappa \)-MDS to any method by matching \(\kappa \) to the curvature used in the zero-shot learner. To train HZSL, given a triplet \((h_I, z_{c_I}, z_{c_I}^{-})\) with \(h_I\) as the image embedding, \(z_{c_I}\) as the positive and \(z_{c_I}^{-}\) as the negative label embeddings extracted from Z, the goal is to minimize

$$\begin{aligned} L_{cls} = \max (0, \delta + d_{\mathbb {D}}( h_I, z_{c_I}) - d_{\mathbb {D}}(h_I, z_{c_I}^{-})), \end{aligned}$$
(8)

where \(d_{\mathbb {D}}\) is the Poincaré distance, as the representations lie in hyperbolic space. To train SZSL, on the other hand, the goal is to minimize the following objective function

$$\begin{aligned} L_{cls} = {\mathcal {L}}_{KL} + \alpha R(\eta ^{*}) + \beta \overline{H}, \end{aligned}$$
(9)

where \({\mathcal {L}}_{KL}\) is minimizing the Kullback–Leibler (KL) divergence between the prediction probability and the one-hot vector of the correct labels and \(R(\eta ^{*})\) and \(\overline{H}\) are spherical and semantic alignments. Similar to others, the prediction probability is generated with the goal of aligning the image embedding with class embedding i.e., Z.

3.4 Learn from Partial Similarities

In real-world scenarios, similarities are not always fully available, e.g., because a knowledge source is incomplete, or because a full similarity matrix would require excessive human annotation effort. For dealing with missing information, we distinguish missing knowledge due to rare classes (i.e., structured case) and due to random noise (i.e., unstructured case). The structured case can be formalized as follows: From the similarity matrix \(S \in \mathbb {R}^{K \times K}\) only a subset of entries described by a mask \(\Omega \in \left\{ 0,1\right\} ^{K \times K}\) is known (\(\Omega _{ij} = 1\)), and the rest is unknown (\(\Omega _{ij} = 0\)). More formally, complete rows, and by symmetry columns, of S are known and labels can be reordered to cast \(\Omega \) into the block shape

$$\begin{aligned} \Omega = \Big (\begin{matrix}1_{L \times L} & 1_{L \times (K-L)} \\ 1_{(K-L) \times L} & 0_{(K-L) \times (K-L)}\end{matrix} \Big ), \end{aligned}$$
(10)

where \(L < K\). In this case, the first L classes take the role of landmark, or reference, classes, for which similarity information to all other classes is available. The proportion of known similarities is \(p = L(2K - L)/K^2\) in this case. In the unstructured case, the entries of \(\Omega \) are chosen independently at random with \(\mathbb {P}(\Omega _{ij} = 1) = p\), i.e., similarity information is available for random pairs \((l_i, l_j)\) of class labels. Both cases can be solved by completing the partial (dis)similarity matrix, as described next.
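Before turning to the completion methods, the two mask types can be illustrated with a short sketch. It builds the block mask of Eq. 10 and checks the stated proportion \(p = L(2K - L)/K^2\), and it draws a random mask for the unstructured case. The helper names are hypothetical, and making the random mask symmetric with a known diagonal is our own assumption; the text only states that entries are drawn independently with probability p.

```python
import numpy as np

def landmark_mask(K, L):
    """Structured mask of Eq. 10: rows and columns of the first L classes are known."""
    omega = np.zeros((K, K), dtype=int)
    omega[:L, :] = 1
    omega[:, :L] = 1
    return omega

def known_fraction(K, L):
    """Proportion of known entries, p = L(2K - L) / K^2."""
    return L * (2 * K - L) / K ** 2

def random_mask(K, p, rng):
    """Unstructured mask: each class pair is known with probability p."""
    upper = np.triu(rng.random((K, K)) < p, k=1)   # draw each unordered pair once
    omega = np.logical_or(upper, upper.T)          # assumption: symmetric mask
    np.fill_diagonal(omega, True)                  # assumption: self-similarity known
    return omega.astype(int)
```

For example, with K = 10 classes and L = 3 landmarks, 51 of the 100 entries are known, which matches p = 0.51.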

Completing partial distances - structured case. In the structured case, matrix D is partitioned accordingly as \(D = \Big ({\begin{smallmatrix} D_{L} & D_{NL}^\top \\ D_{NL} & D_{N} \end{smallmatrix}}\Big )\) using the mask from Eq. 10. Here, \(D_L\), the matrix of landmark-to-landmark distances, and \(D_{NL}\), the matrix of nonlandmark-to-landmark distances, are known. However, the submatrix \(D_{N}\) of nonlandmark-to-nonlandmark distances is unknown. The Gram matrix G can be partitioned in the same way, and its known blocks can be computed using the function \({\mathcal {C}}_\kappa \) as above. Inspired by (De Silva and Tenenbaum 2004; Keller-Ressel and Nargang 2022), we first recover the landmark coordinates \(z_1, \dotsc , z_L\) from their Gram matrix \(G_L\) by \(\kappa \)-MDS (or MDS). The non-landmark coordinates \(z_{L+1}, \dotsc , z_K\) are then computed from \(z_1, \dotsc , z_L\) and \(G_{NL}\) by solving a least-squares problem. The corresponding Landmarked \(\kappa \)-MDS is summarized in Algorithm 2.

Algorithm 2 Landmarked \(\kappa \)-MDS (\(D_L\), \(D_{NL}\), d, \(\kappa \))
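The least-squares step for the non-landmark classes can be sketched as below. The sketch assumes the landmark coordinates \(Z_L\) have already been obtained (e.g., by \(\kappa \)-MDS on \(D_L\)) and that there are more landmarks than dimensions, so that the linear system is well-posed without using the manifold constraint. The function name `place_nonlandmarks` is hypothetical, and this is an illustration of the idea rather than a transcription of Algorithm 2.

```python
import numpy as np

def place_nonlandmarks(Z_L, D_NL, kappa):
    """Least-squares coordinates of non-landmark classes.

    Z_L:  (L, d+1) landmark coordinates on M_kappa^d.
    D_NL: (K-L, L) nonlandmark-to-landmark distances.
    Returns a (K-L, d+1) coordinate matrix.
    """
    s = np.sqrt(abs(kappa))
    sign = np.sign(kappa)
    # Known Gram block: G_NL = sign(kappa) * C_kappa(D_NL), elementwise (Eq. 5).
    C = np.cos(s * D_NL) if kappa > 0 else np.cosh(s * D_NL)
    G_NL = sign * C
    # <z, z_l>_kappa = (Z_L J_kappa) z, with J_kappa = diag(sign(kappa), 1, ..., 1).
    J = np.ones(Z_L.shape[1])
    J[0] = sign
    A = Z_L * J
    # Solve A z = g for every non-landmark class at once (one column per class).
    X, *_ = np.linalg.lstsq(A, G_NL.T, rcond=None)
    return X.T
```

With noisy distances the solutions may leave the manifold slightly; projecting them back (e.g., by re-normalizing for \(\kappa > 0\)) would be a natural extra step that is not shown here.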

Completing partial distances - unstructured case. In the unstructured case, the problem of completing the partial distance matrix can be phrased as a noisy low-rank matrix completion problem in the sense of (Candes and Plan 2010). In other words, we would like to solve the minimization problem

$$\begin{aligned} \min _{G \in \mathbb {R}^{K \times K}} \sum _{\begin{array}{c} i,j \in \{1, \dotsc , K\}\\ \Omega _{ij} = 1 \end{array}} \left( {\mathcal {C}}^{-1}_\kappa (\textrm{sign}\left( \kappa \right) G)_{ij} - D_{ij}\right) ^2, \end{aligned}$$
(11)

under the constraint that G is a Gram matrix of \(\left\langle z,z'\right\rangle _\kappa \)-inner products of \(\textrm{rank}(G) = d + |\textrm{sign}\left( \kappa \right) |\). This is a non-convex optimization problem, which is usually solved by replacing the rank-constraint with a nuclear-norm constraint or by alternating minimization and projection methods (Candes and Plan 2010; Jain et al. 2013; Jiang et al. 2017; Nguyen et al. 2019). For the non-Euclidean case \(\kappa \ne 0\) we adapt the method of (Jiang et al. 2017) and perform alternating projections onto rank-constraint sets \({\mathcal {S}}_{r_i}\) and fidelity-constraint sets \({\mathcal {S}}_{\epsilon _i}\). The projection onto the rank constraint set \({\mathcal {S}}_{r_i}\) is equivalent to \(\kappa \)-MDS with dimension parameter \(d = r_i + 1\); the projection onto the fidelity-constraint set \({\mathcal {S}}_{\epsilon _i}\) is equivalent to performing a sequence of gradient steps in (11) until an error tolerance of \(\epsilon _i^2\) is met.
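The data-fidelity term of Eq. 11 can be evaluated with a few lines, which may help when monitoring the alternating projections. The sketch only computes the masked squared error for a candidate Gram matrix; it does not implement the rank projection or the gradient steps, and the function name `fidelity_loss` is our own.

```python
import numpy as np

def fidelity_loss(G, D, omega, kappa):
    """Masked squared error between the distances implied by G and the observed D.

    G:     (K, K) candidate Gram matrix of kappa-inner products.
    D:     (K, K) observed distances (entries with omega == 0 are ignored).
    omega: (K, K) 0/1 mask of known entries.
    """
    s = np.sqrt(abs(kappa))
    y = np.sign(kappa) * G
    if kappa > 0:
        D_hat = np.arccos(np.clip(y, -1.0, 1.0)) / s    # inverse of cos
    else:
        D_hat = np.arccosh(np.maximum(y, 1.0)) / s      # inverse of cosh
    return float((((D_hat - D) ** 2) * omega).sum())
```

Entries outside the mask do not influence the value, so arbitrary placeholders can be stored at unknown positions of D.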

3.5 Performance Guarantees

We provide additional performance guarantees, showing that \(\kappa \)-MDS as given in Algorithm 1 is capable of perfectly recovering latent points from a dissimilarity matrix. We show that the same guarantee even holds for Algorithm 2 if the number of landmarks is equal to or greater than the chosen dimensionality. Our approach has no problems scaling to many classes even with the quadratic growth of the similarity matrix. For example, on ImageNet-21k (Deng et al. 2009) with 21k classes, it takes about 4 minutes and 600MB to obtain the prototypes with our \(\kappa \)-MDS, which is an offline process that only has to be run once.

For Algorithm 1, we are able to show the following performance guarantee: Let latent points \(\varvec{{z}}_1, \dotsc , \varvec{{z}}_K \in M_\kappa ^d\) with \(K > d\) be given and assume that they are not fully contained in a subspace of strictly smaller dimension \(d' < d\). Then \(\kappa \text {-MDS}(D,d,\kappa )\) can perfectly recover the latent points \(\varvec{{z}}_1, \dotsc , \varvec{{z}}_K\) up to isometry from knowledge of their full distance matrix \(D = [d_\kappa (\varvec{{z}}_i, \varvec{{z}}_j)]_{i,j=1\dotsc K}\). For Algorithm 2, we can give the same guarantee if, in addition, the number of landmark labels L satisfies \(L \ge d\). For brevity, we only provide the proof for the first assertion: Let \({\bar{Z}} \in \mathbb {R}^{K \times (d+1)}\) denote the true coordinate matrix, whose rows are the coordinates in \(M_\kappa ^d\) of the true latent points. This matrix has rank \(d+1\), since otherwise the latent points would be concentrated in a lower dimensional subspace. By Eq. 2 the Gram matrix \({\bar{G}}\) of inner products can be written as \({\bar{G}} = {\bar{Z}} J_\kappa {\bar{Z}}^\top \), where \(J_\kappa \) is the \((d+1) \times (d+1)\) matrix \(J_\kappa = \textrm{diag}(\textrm{sign}\left( \kappa \right) , 1, \dots , 1)\). By Eq. 4 the true distance matrix of the latent points is \({\bar{D}} = {\mathcal {C}}_\kappa ^{-1}(\textrm{sign}\left( \kappa \right) {\bar{G}})\). Matrix G in the first step of Algorithm 1 is equal to

$$\begin{aligned} G&= \textrm{sign}\left( \kappa \right) {\mathcal {C}}_\kappa ({\bar{D}}) \\&= \textrm{sign}\left( \kappa \right) {\mathcal {C}}_\kappa \left( {\mathcal {C}}_\kappa ^{-1}(\textrm{sign}\left( \kappa \right) {\bar{G}})\right) = {\bar{G}}, \end{aligned}$$

i.e., the true Gram matrix is recovered. In steps two and three, the eigenvalue decomposition \(G = Q \Lambda Q^\top \) is computed. The reduced coordinate matrix \(Z'\) is computed as \(Z' = Q \textrm{diag}(\sqrt{\lambda _1}, \dotsc , \sqrt{\lambda _d}, 0, \dotsc , 0)\). In step four, Eq. 3 is used to infer the missing column \(\varvec{{z}}_0\), such that the complete matrix Z satisfies:

$$\begin{aligned} Z J_\kappa Z^\top = Q\Lambda Q^\top = G = {\bar{Z}} J_\kappa {\bar{Z}}^\top . \end{aligned}$$
(12)

Since \({\bar{Z}}\) has full column rank, its Moore-Penrose pseudoinverse \({\bar{Z}}^+\) is a left inverse, i.e., \({\bar{Z}}^+ {\bar{Z}} = I_{d+1}\). Set \(A:= {\bar{Z}}^+ Z\). We have \({\bar{Z}} A = {\bar{Z}} {\bar{Z}}^+ Z = Z\), that is, A transforms the rows of \({\bar{Z}}\) into the rows of Z. Applying \({\bar{Z}}^+\) to (12) from the left and its transpose from the right, we obtain \(A J_\kappa A^\top = J_\kappa \). If \(\kappa > 0\), this implies that A is an orthogonal matrix, i.e., an isometry of \(M_\kappa ^d\). If \(\kappa < 0\), A is a hyperbolic isometry by Ratcliffe et al. (1994, Thm. 3.1.4, 3.2.3). The latent coordinates \(\varvec{{z}}_1, \dotsc , \varvec{{z}}_K\) (rows of \({\bar{Z}}\)) are recovered up to isometry of \(M_\kappa ^d\), completing the proof.

4 SimZSL Challenges

To broaden the scope of zero-shot learning benchmarks towards learning from similarities, we have developed four challenges (CI, CII, CIII, CIV) and provide first numbers for each. Below we dive into them one by one.

4.1 CI: SimZSL w/o Common Names

Context. Vision-language models such as CLIP (Radford et al. 2021) perform zero-shot recognition by calculating the similarity between textual prompts from class names and image representations. This results in strong models when dealing with common class names, e.g., describing objects or actions. The purpose of the first challenge is to create a realistic scenario where we can no longer rely on the text encoders of vision-language models to obtain class embeddings. We take inspiration from the biological domain, where recognition tasks commonly deal with technical, esoteric, or highly fine-grained class names (Beery et al. 2022; Khan et al. 2023; Tschandl et al. 2018). In biology, class relations are often given by taxonomical similarities (Beery et al. 2022; Khan et al. 2023); troublesome for classical zero-shot learning approaches but ideal for our similarity-based zero-shot learning.

Table 1 Challenge I: Zero-shot without common names
Fig. 2 Challenge II: t-SNE visualization of multi-source similarity-based embeddings on SUN and AWA2, with attributes as intra-dataset knowledge and word embeddings as inter-dataset knowledge. The temple south asia and blue whale are correctly classified samples from the SUN and AWA2 datasets, surrounded by embeddings from their own dataset. The embeddings of the landing deck [AWA2] and corral [AWA2] samples are correctly classified, despite being surrounded by image embeddings from SUN. The horse [AWA2] sample is misclassified as field cultivated [SUN], due to their high semantic similarity. This also holds for the auditorium [SUN] sample, which is misclassified as theater-indoor. The analysis shows that we can learn an embedding space that smoothly interpolates within and across datasets and knowledge sources

Table 2 Challenge II: Multi-source zero-shot learning using different sources for similarity within and across datasets

Setup. We use the recent FishNet dataset (Khan et al. 2023), where class names are given by the formal Latin names of the specific fish categories. For this experiment, we define a split into 232 seen and 231 unseen classes. We compare three approaches. The first takes CLIP out of the box for direct zero-shot generalization by transforming the class names into prompts to be used as prototypes, following the conventional setup (Radford et al. 2021); we use “this is a photo of [classname].” as the prompt template. The second additionally fine-tunes the image and text embeddings using the seen classes. The third uses our similarity-based approach on top of the same image encoder, where our class embeddings are derived from a similarity matrix based on the taxonomical graph distances between classes given by the original paper (Khan et al. 2023). For this challenge, we add a linear layer on top of CLIP and fine-tune the model on FishNet with 50-dimensional representations. Similar to the original model, representations are normalized and lie in spherical space. The experiments shown in Table 1 use the CLIP model with a ViT-B/32 backbone. In addition, we perform experiments on the real-world Branch dataset (Yang et al. 2023), consisting of 13 classes divided into 7 seen and 6 unseen classes; the experiments in Table 7 cover CLIP models with various backbones, i.e., RN50, RN101, ViT-B/16, ViT-B/32, and ViT-L/14.
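To illustrate how taxonomical graph distances can be turned into a similarity matrix, the following is a minimal sketch on a toy taxonomy. The taxonomy edges, the helper names, and the linear mapping from distance to similarity (1 - d / d_max) are assumptions made for illustration; they are not taken from FishNet or from our implementation.

```python
from collections import deque

# Hypothetical toy taxonomy given as child -> parent edges (not from FishNet).
PARENT = {
    "Lepomis macrochirus": "Lepomis",
    "Lepomis gibbosus": "Lepomis",
    "Lepomis": "Centrarchidae",
    "Micropterus salmoides": "Micropterus",
    "Micropterus": "Centrarchidae",
}


def build_adjacency(parent):
    """Turn child -> parent edges into an undirected adjacency list."""
    adj = {}
    for child, par in parent.items():
        adj.setdefault(child, set()).add(par)
        adj.setdefault(par, set()).add(child)
    return adj


def graph_distances(adj, source):
    """Breadth-first search: number of edges from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def similarity_matrix(parent, classes):
    """Pairwise similarities in [0, 1]; the 1 - d / d_max mapping is an assumption."""
    adj = build_adjacency(parent)
    dists = []
    for a in classes:
        d = graph_distances(adj, a)
        dists.append([d[b] for b in classes])
    d_max = max(max(row) for row in dists)
    return [[1.0 - d / d_max for d in row] for row in dists]


classes = ["Lepomis macrochirus", "Lepomis gibbosus", "Micropterus salmoides"]
S = similarity_matrix(PARENT, classes)
```

In this toy example, the two species sharing a genus receive a higher similarity than species that only share a family, which is the kind of relation a text encoder cannot recover from Latin names alone.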

Fig. 3
figure 3

Challenge III: Zero-shot learning w/ rare missing knowledge with DeViSE. While most zero-shot approaches require full prior knowledge, our similarity-based approach generalizes to settings where knowledge is structurally missing

Results. As expected, applying CLIP directly leads to near-random performance (1.09%), since the textual encoder is unable to deal with the uncommon class names. Following Radford et al. (2021), each class name is inserted into the prompt template; for example, the class name “lion” yields the prompt “this is a photo of lion.” The similarity between each prompt and the image is then calculated, and the class with the highest similarity is selected as the prediction. This approach relies on meaningful class names. For FishNet, the prompts take the form “this is a photo of Lepomis macrochirus.” Since such class names are not common names and CLIP models have not encountered similar contexts before, the prompts fail to provide meaningful information, leading to near-random performance. Fine-tuning alone helps little, with an accuracy of 1.16%. The fine-tuning result highlights that the issue truly lies with the class embeddings: the model is unable to provide meaningful embeddings for classes with uncommon names, so improving the alignment to seen classes does not improve the zero-shot alignment to unseen classes. Only when we add our prototypes from taxonomical similarities and use them instead of the prompts to calculate the similarity are we able to boost the scores. Regardless, performing zero-shot classification on 231 unseen fish species remains a highly challenging task for future research. The FishNet dataset is large-scale and diverse, with images collected from various regions worldwide at different sizes, resolutions, and illuminations. It remains challenging across tasks due to large species diversity, fine-grained labels, diverse backgrounds, low contrast, and more (Khan et al. 2023). As shown by Veiga and Rodrigues (2024), FishNet is also imbalanced, with a minority-to-majority ratio of 4 to 4782. While the language encoder of the CLIP model is unable to generate a meaningful representation given Latin names, the images and their distribution are challenging for the vision encoder as well. However, we do not focus on improving the model in this challenge; we only want to pinpoint that using a similarity matrix instead of CLIP representations can boost performance.
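The prediction rule shared by the prompt-based and prototype-based variants can be sketched as follows. This is a minimal illustration with toy 2D vectors, assuming image and class embeddings are already available as arrays; the function names are hypothetical and no CLIP model is loaded. Swapping prompt embeddings for similarity-based prototypes only changes the `class_embeddings` argument.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit hypersphere."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def predict(image_embeddings, class_embeddings):
    """Return, per image, the index of the class with the highest cosine similarity."""
    img = l2_normalize(image_embeddings)
    cls = l2_normalize(class_embeddings)
    return np.argmax(img @ cls.T, axis=1)


# Toy class embeddings (prompt embeddings or prototypes) and image embeddings.
prototypes = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
images = np.array([[0.9, 0.1], [-0.2, 0.8]])
pred = predict(images, prototypes)
```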

Fig. 4
figure 4

Challenge III: Zero-shot learning w/ rare missing knowledge with VGSE and SZSL. Given more advanced zero-shot learning approaches e.g., VGSE and SZSL, our similarity-based approach generalizes to settings with structurally missing prior knowledge

To further analyze the performance of the CLIP model, we conduct experiments using different CLIP models on the real-world Branch dataset. As shown in Table 7, using our prototypes instead of prompts outperforms the CLIP models. While fine-tuning CLIP with Branch dataset class names improves the results, our prototypes still outperform it by a large margin in 4 out of 5 models.

4.2 CII: Multi-source SimZSL

Context. In standard zero-shot learning, each class is assigned an embedding vector from the same knowledge source. For example, if we rely on attributes as our source, each class should have annotated labels for each attribute. This setup makes sense if we are given a specific domain to operate in, e.g., recognizing birds, scenes, or animals. The objective of the second challenge is to go beyond a single domain for zero-shot recognition. We strive to perform zero-shot learning across multiple sets of classes simultaneously. Each set of classes comes with its own embedding space; again not directly applicable to the standard zero-shot pipeline, but feasible with similarity-based zero-shot learning, since all embeddings can be combined into a single unified similarity matrix.

Fig. 5
figure 5

Challenge IV: Zero-shot learning with random missing knowledge on AWA2 using attributes with DeViSE and VGSE. Even with many similarities randomly missing, we are able to perform zero-shot learning

Setup. We construct the second challenge by combining multiple existing zero-shot datasets, i.e., SUN-AWA2, CUB-SUN, and CUB-AWA2. To extract the class embeddings, we create a similarity matrix S that spans both datasets. For all class pairs within an individual dataset, we use their corresponding knowledge source to obtain similarities, i.e., similarities obtained from the 312-, 85-, and 102-dimensional attribute vectors for CUB-Birds, AWA2, and SUN. To obtain similarity scores of class pairs between the datasets, we use fastText (Bojanowski et al. 2017) word embeddings. We use DeViSE with 500-, 500-, and 50-dimensional class embeddings for the respective pairs. As this challenge is not feasible with standard ZSL methods, we unify the embeddings from multiple sources through PCA to match the dimensionalities of the attributes, on which standard zero-shot learning is performed as a reference baseline.
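The construction of a single similarity matrix spanning two datasets can be sketched as below. This is a minimal sketch under stated assumptions: cosine similarity is used for every block, the random arrays merely stand in for attribute vectors and word embeddings, and the function names are hypothetical. Without inter-dataset knowledge, the cross-dataset block is left at zero.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def multi_source_similarity(attr_a, attr_b, words_a=None, words_b=None):
    """Unified similarity matrix over the classes of two datasets.

    Diagonal blocks use each dataset's own attributes (intra-dataset knowledge).
    Off-diagonal blocks use word embeddings if given (inter-dataset knowledge),
    and are zero otherwise.
    """
    n_a, n_b = len(attr_a), len(attr_b)
    S = np.zeros((n_a + n_b, n_a + n_b))
    S[:n_a, :n_a] = cosine_similarity(attr_a, attr_a)
    S[n_a:, n_a:] = cosine_similarity(attr_b, attr_b)
    if words_a is not None and words_b is not None:
        cross = cosine_similarity(words_a, words_b)
        S[:n_a, n_a:] = cross
        S[n_a:, :n_a] = cross.T
    return S


rng = np.random.default_rng(0)
attr_a = rng.random((4, 85))               # stand-in for 85-dim attributes
attr_b = rng.random((3, 102))              # stand-in for 102-dim attributes
words_a = rng.standard_normal((4, 300))    # stand-in word embeddings (300-dim is illustrative)
words_b = rng.standard_normal((3, 300))

S_intra = multi_source_similarity(attr_a, attr_b)
S_inter = multi_source_similarity(attr_a, attr_b, words_a, words_b)
```

Note that the attribute dimensionalities of the two datasets never need to match, because each knowledge source is only used to fill its own block of S.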

Fig. 6
figure 6

Failure analysis for challenges III and IV on CUB, SUN, and AWA2 datasets. The first two columns show zero-shot learning with rare missing knowledge, where the model only has access to 15 attributes out of 312 and 102. Given rare missing knowledge, the model misclassifies samples as a class with a high pairwise similarity. The third column shows zero-shot learning with random missing knowledge, in which 50% of the knowledge is missing

Results. The multi-source results are shown in Table 2. Our similarity-based approach works both in the settings with knowledge per dataset (intra-dataset knowledge) and with additional knowledge between datasets (inter-dataset knowledge). The type of knowledge does not even have to match, as all sources become comparable once turned into similarities. Extending DeViSE to multi-source zero-shot learning only works in the intra-dataset setting and we outperform this baseline. Adding inter-dataset knowledge is effortless for us. Whether this leads to an improvement depends on the problem at hand. For example, when performing zero-shot learning on CUB and AWA2 jointly, extra word vector knowledge really helps since both datasets focus on animals. For heterogeneous settings, simply having 0 similarity between class pairs across datasets can be sufficient. We conclude that a similarity-based approach enables multi-source zero-shot learning, opening up recognition tasks spanning many datasets from different domains (see Fig. 2).

Table 3 Zero-shot learning across manifolds and knowledge sources on CUB
Table 4 Zero-shot learning across various manifolds and knowledge sources on SUN and AWA2

4.3 CIII: SimZSL w/ Rare Missing Knowledge

Context. We have seen how a similarity-based approach enables zero-shot generalization in difficult biological and multi-source settings. In real-world settings, the minimalist requirement of similarities can still be a high bar. For example, in neuroscience and human cognition, annotators are often asked to judge conceptual similarity between classes (Edelman and Shahbazi 2012). But naturally, some classes occur often while others are rare. The consequence is that not all knowledge is given between rare classes. Similar situations occur again in biological domains, where there is simply more known for some classes than other classes (Chen et al. 2023; Beery et al. 2020; Walker and Orenstein 2021). This challenge strives to address ZSL in such incomplete settings.

Setup. For zero-shot learning with rare missing knowledge, some classes have full knowledge while other classes have partial non-overlapping knowledge, see Fig. 1c. To best understand the challenging nature of this problem and how well our landmarked \(\kappa \)-MDS works, we create a simulated scenario in which attribute vectors with partially missing elements serve as the knowledge. Each class with partial knowledge contains only m attributes, and no two such classes have attributes in common. Each does, however, share its m attributes with the classes with full knowledge (i.e., the landmarks, as introduced in Sect. 3.4). Thus, pairwise similarities among the landmarks, and between each partial-knowledge class and the landmarks, can be computed. However, since the classes with partial prior knowledge have no attributes in common, their pairwise similarities are missing, resulting in a missing block in the final similarity matrix S, as formalized in Eq. 10. Despite such a gap, our approach is able to obtain class embeddings. For the experiments, we use the CUB, AWA2, and SUN datasets, with their 312-, 85-, and 102-dimensional attribute vectors as the full-knowledge starting point.
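The simulated scenario can be sketched as follows. This is a minimal illustration rather than our exact protocol: it assumes that the last d // m classes are the partial-knowledge classes, that each keeps a disjoint slice of m attributes, and that similarities are plain dot products over the attributes two classes share. Missing entries are marked with NaN, and the function name is hypothetical.

```python
import numpy as np


def rare_missing_similarity(attributes, m):
    """Build S with a missing (NaN) block between partial-knowledge classes.

    Assumption: the last d // m classes are the partial ones, each keeping a
    disjoint slice of m attributes; all remaining classes are landmarks.
    """
    n, d = attributes.shape
    n_partial = d // m
    n_landmarks = n - n_partial
    if n_landmarks <= 0:
        raise ValueError("m is too small: no landmark classes would remain")

    S = np.full((n, n), np.nan)
    full = attributes[:n_landmarks]
    # Landmark-landmark similarities use all attributes.
    S[:n_landmarks, :n_landmarks] = full @ full.T

    for k in range(n_partial):
        i = n_landmarks + k
        idx = slice(k * m, (k + 1) * m)      # disjoint attribute subset of class i
        partial = attributes[i, idx]
        sims = full[:, idx] @ partial        # computed on the shared m attributes only
        S[i, :n_landmarks] = sims
        S[:n_landmarks, i] = sims
        S[i, i] = partial @ partial
    return S, n_landmarks


rng = np.random.default_rng(0)
attributes = rng.random((6, 8))              # 6 toy classes with 8 attributes
S, n_landmarks = rare_missing_similarity(attributes, m=3)
```

With 8 attributes and m = 3, two classes have partial knowledge and four act as landmarks; the only missing entries are the two similarities between the partial-knowledge classes.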

Results. The results are shown in Fig. 3. The bottom x-axis shows the number of attributes available for classes with partial knowledge. The top x-axis shows the number of landmarks for each setting (i.e., the classes with full knowledge). The lower the number of attributes for the classes with partial knowledge, the fewer landmark classes we have with full knowledge, and the higher the percentage of missing knowledge. Our approach shows stable performance as the number of landmarks and attributes decreases. While the accuracy on the CUB dataset with full prior knowledge (312 attributes) is \(51.18\%\), the accuracy with 80 attributes and 197 landmarks is \(51.33\%\). For the AWA2 dataset, the full 85 attributes and partial 25 attributes yield \(60.08\%\) and \(54.14\%\), respectively. Interestingly, on SUN, missing knowledge can actually increase performance. While the performance on the SUN dataset with the full setting (102 attributes) is \(51.10\%\), the use of only 25 attributes and 713 landmarks gives an accuracy of \(53.82\%\). We conclude that our approach is not only able to generalize to zero-shot settings where knowledge is structurally missing; it can even increase generalization in such settings. Additional experiments using more advanced zero-shot learners, i.e., VGSE (Xu et al. 2022) and SZSL (Shen et al. 2021), are shown in Fig. 4. Our approach performs consistently across different numbers of landmarks, even down to only 10 landmarks.

4.4 CIV: SimZSL w/ Random Missing Knowledge

Context. Lastly, we provide a challenge where similarity scores are randomly missing. This is common when gathering similarities from the internet, for example when using co-occurrence statistics, see Fig. 1d.

Setup. We again strive for a simulated setting, because it allows us to investigate the robustness of our approach as a function of the number of missing similarities. We therefore perform this experiment using DeViSE and VGSE as the zero-shot methods, on AWA2 with attributes to generate similarities, generating 25 dimensional class embeddings, as described in Sect. 3.4.

Results. In Fig. 5, we show what happens when we randomly drop between 5% and 75% of the class similarities. Due to the variance of such settings, we ran each setting 6 times, with different random seeds for dropping similarities. Our iterative approach for dealing with missing similarity values is robust to many missing values. Even when 50% of the similarities are missing, we obtain an accuracy of 40.33%. This result indicates that similarity-based zero-shot learning has strong potential for settings where similarities are hard to come by or lacking.
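The random dropping of similarities can be simulated as in the sketch below. It is a minimal illustration under assumptions: missing entries are marked with NaN, pairs are dropped symmetrically, the diagonal is always kept, and the function name is hypothetical. The iterative procedure that handles the missing values is not shown.

```python
import numpy as np


def drop_similarities(S, fraction, seed=0):
    """Return a copy of S in which a fraction of the class pairs is set to NaN."""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    rows, cols = np.triu_indices(n, k=1)     # each unordered class pair once
    n_drop = int(round(fraction * len(rows)))
    chosen = rng.choice(len(rows), size=n_drop, replace=False)

    S_missing = S.astype(float).copy()
    S_missing[rows[chosen], cols[chosen]] = np.nan
    S_missing[cols[chosen], rows[chosen]] = np.nan   # keep the matrix symmetric
    return S_missing


S = np.ones((9, 9))                          # toy similarity matrix over 9 classes
S_half = drop_similarities(S, fraction=0.5, seed=0)

# Several runs with different seeds, to account for the variance of the setting.
runs = [drop_similarities(S, fraction=0.5, seed=s) for s in range(6)]
```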

Table 5 Zero-shot learning on SDGZSL (Chen et al. 2021a) and ICIS (Christensen et al. 2023) models
Table 6 Zero-shot learning on Flowers and APY datasets given SDGZSL as the zero-shot learner

4.5 Qualitative Analysis

While SimZSL performs well given rare (i.e., challenge III) or random (i.e., challenge IV) missing knowledge, there are samples where the model struggles because of unavailable knowledge. Figure 6 shows some failure samples from the CUB, SUN, and AWA2 datasets, given rare and random missing knowledge. For CUB, the ground-truth classes are only aware of 15 out of 312 attributes; the rest are missing. In Fig. 6a, given a sample of “loggerhead shrike”, the model predicts “white breasted nuthatch”. Given only 15 available attributes out of 312, “loggerhead shrike” and “white breasted nuthatch” share “white throat color” and “black primary color” as their top-2 most similar attributes, making it a difficult sample for the model. Moreover, “white wing color” and “solid breast pattern”, the top-2 most discriminative attributes, are missing, making the classification difficult. In Fig. 6c, while the classes share “perching like shape” and “grey leg color”, the most discriminative attributes, i.e., “yellow breast color” for “tropical kingbird” and “forehead yellow color” for “white-eyed vireo”, are missing, making the classes similar given the available set of attributes. Similarly, only 15 out of 102 attributes are available for SUN. As shown in Fig. 6d, the model assigns the “yard” label to “motel”. Given only 15 attributes as prior knowledge, both classes have “natural light” and “soothing” in common. In contrast, when checking the missing attributes, “motel” is extremely “man-made” compared to “yard”, while “yard” has a higher value for “shrubbery”. Both attributes are missing, making the classes similar.

The last column shows samples from challenge IV with random missing knowledge. In these samples, the model only has access to 50% of the pairwise similarities. In Fig. 6e, the input to the model is an image of “seal”, wrongly classified as “dolphin”. While the pairwise similarity between “seal” and “dolphin” is missing from the available prior knowledge, these classes have an original similarity of 0.86, sharing attributes like “flippers”, “ocean”, and “water”. Similarly, in Fig. 6f, the model has classified a “giraffe” as a “horse”. These classes have a similarity of 0.80, sharing attributes like “long leg”, “big”, and “quadrupedal”, making it a hard sample for the model.

5 Ablation Studies

Table 7 Zero-shot learning on real-world Branch dataset.

Setup. Prototypes generated by SimZSL can be plugged into any prototype-based zero-shot learning approach, without changing the structure of the model. For any method used in the experiments, we take the same backbone and training procedure as proposed in the original paper and only change the target prototypes to our similarity-based prototypes as obtained through \(\kappa \)-MDS. We employ HZSL (Liu et al. 2020) to perform zero-shot learning in hyperbolic space, utilizing hierarchies as prior knowledge. Specifically, we obtain the hierarchy for the CUB-Birds dataset from Chen et al. (2018a), while for the AWA2 dataset, we make use of the hierarchy extracted by Wu et al. (2020). We conducted experiments using the default hyperparameter settings of SZSL (Shen et al. 2021) and VGSE (Xu et al. 2022). For the HZSL (Liu et al. 2020) experiments, a learning rate of 0.01 and weight decay of 0.01 were employed. For the hinge loss objective in HZSL (Liu et al. 2020), we set the margin to 0.1. In all experiments with DeViSE (Frome et al. 2013), a learning rate of 0.01 and margin of 1 were utilized, with the exception of the SUN dataset, where a learning rate of 0.001 and margin of 0.25 were used. For CLIP (Radford et al. 2021) and FishNet (Khan et al. 2023) experiments, a learning rate of 0.01 and weight decay of 0.0001 were employed. For the seen/unseen splits and attributes, we use the data provided by Xian et al. (2018a).

Fig. 7
figure 7

The effect of the dimensionality of our prototypes across zero-shot learners and datasets, given attributes. Our approach allows for strongly compressed embeddings

5.1 Are Similarities Enough for Zero-Shot Learning?

The main goal of this ablation study is to confirm our initial hypothesis: a class similarity matrix serves as a sufficient information source to perform zero-shot learning. To empirically validate this hypothesis, we start from standard knowledge sources such as attributes and word embeddings. We then reduce them to a similarity matrix S through a matrix product and feed the matrix as input to our approach to regain class embeddings. If our hypothesis is correct, zero-shot learning on the embeddings from the knowledge source directly is as effective as using our class embeddings obtained from similarity matrices. We perform this ablation study across datasets, knowledge sources, and manifolds. The results for Euclidean, hyperspherical, and hyperbolic zero-shot learners are shown in Table 3 for three knowledge sources on CUB with 200D prototypes. Table 4 also shows the results for zero-shot learners in three different manifolds on SUN and AWA2 datasets with 500D and 50D prototypes. We find that zero-shot performance with our approach is as good as with the original semantic embeddings. In 7 out of 9 comparisons on CUB, the t-test indicates no significant difference in performance across the runs (p-value \(\gg \) 0.05). Similarly, there is no significant difference across runs in 6 out of 6 and 4 out of 6 comparisons on SUN and AWA2, respectively. We conclude from this ablation study that similarity-based learning does not hamper zero-shot learning.
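The round trip used in this ablation, from a knowledge source to a similarity matrix and back to class embeddings, can be illustrated in the Euclidean case with a classical eigendecomposition. This sketch is a stand-in for illustration only and is not \(\kappa \)-MDS itself; the random array stands in for attribute vectors and the function names are hypothetical.

```python
import numpy as np


def similarity_from_embeddings(A):
    """Reduce class embeddings (one row per class) to S via a matrix product."""
    return A @ A.T


def embeddings_from_similarity(S, dim):
    """Recover dim-dimensional Euclidean embeddings Z such that Z Z^T approximates S."""
    vals, vecs = np.linalg.eigh(S)               # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:dim]         # keep the dim largest
    vals = np.clip(vals[order], 0.0, None)       # guard against tiny negative values
    return vecs[:, order] * np.sqrt(vals)


rng = np.random.default_rng(0)
A = rng.standard_normal((10, 5))                 # stand-in for 10 classes, 5 attributes
S = similarity_from_embeddings(A)
Z = embeddings_from_similarity(S, dim=5)
```

The recovered embeddings Z generally differ from A by an orthogonal transformation, yet they reproduce all pairwise similarities, which is the information a prototype-based zero-shot learner relies on.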

In addition, we extend our analysis by incorporating the SDGZSL (Chen et al. 2021a) and ICIS (Christensen et al. 2023) zero-shot learners on the CUB, SUN, and AWA2 datasets, shown in Table 5, as well as the Flowers and APY datasets, shown in Table 6. These results further strengthen the hypothesis of the adequacy of similarities for zero-shot learning, demonstrating consistent performance across both standard and previously underexplored datasets. This expanded evaluation underscores the robustness of similarity-based representations in diverse learning scenarios.

5.2 Which Manifold is Best for Which Knowledge Source?

As our proposed method provides the flexibility to generate prototypes for any manifold given a similarity matrix, we also compare the use of different knowledge sources for different manifolds, given in Table 3. The comparison between Euclidean and hyperbolic manifolds follows current literature: attributes work better in Euclidean space (51.14 vs. 50.91), while hierarchies work better in hyperbolic space (25.08 vs. 17.40). In line with Shen et al. (2021), hyperspherical spaces are preferred for word vectors. To make hierarchies work for Euclidean and hyperspherical manifolds, we first embed them in hyperbolic space and perform a logarithmic mapping (for Euclidean) or \(\ell _2\) normalization (for hyperspherical) afterwards. Interestingly, we find that with this setup, hyperspherical zero-shot learning works best for hierarchies. This is in line with Ghadimi Atigh et al. (2021) and Moreira et al. (2024), which highlight the innate normalized nature of hyperbolic prototypes.
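The two mappings out of hyperbolic space can be sketched as follows, assuming the hierarchy embeddings live in a Poincaré ball of curvature \(-c\) and that the logarithmic map is taken at the origin, \(\log _0(x) = \operatorname{artanh}(\sqrt{c}\,\Vert x\Vert )\, x / (\sqrt{c}\,\Vert x\Vert )\). The choice of hyperbolic model, the toy points, and the function names are assumptions made for illustration.

```python
import numpy as np


def log_map_origin(x, c=1.0, eps=1e-12):
    """Logarithmic map at the origin of a Poincare ball with curvature -c."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)
    # Clip to stay strictly inside the ball, where artanh is finite.
    scaled = np.minimum(sqrt_c * norm, 1.0 - 1e-7)
    return np.arctanh(scaled) * x / (sqrt_c * norm)


def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit hypersphere."""
    norm = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)
    return x / norm


hyperbolic = np.array([[0.6, 0.0], [0.0, -0.3]])   # toy points inside the unit ball
euclidean = log_map_origin(hyperbolic)             # prototypes for a Euclidean learner
spherical = l2_normalize(hyperbolic)               # prototypes for a hyperspherical learner
```

Both mappings preserve the direction of each hyperbolic prototype; they differ only in how the distance from the origin is treated.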

5.3 How Many Embedding Dimensions are Enough?

Since we start from a similarity matrix, the dimensionality of the class embeddings is now a hyperparameter to choose freely. We analyze the performance of class embeddings with different dimensionalities in Fig. 7. The class embeddings are stable for a wide range of dimensions across different types of prior knowledge, datasets, and manifold choices. We conclude that our approach allows for a strong compression of the embedding space while maintaining zero-shot performance.

6 Conclusions

Empowered by ever more powerful backbones, knowledge, and pre-trained vision-language models, zero-shot learning continues to improve. The general zero-shot pipeline, however, also has blind spots, ranging from dealing with uncommon names to human judgment similarities as knowledge, and missing information. We advocate for similarity matrices as the all-purpose knowledge source for zero-shot learning and we introduce four challenges inspired by real-world scenarios in which the standard zero-shot assumption does not hold. We furthermore propose \(\kappa \)-MDS, a general approach that obtains prototypes for seen and unseen classes on any manifold solely given similarities. Our method can be plugged into any zero-shot learner. We show how our approach makes learning with uncommon names, multiple sources, and missing information possible, without hampering accuracy in standard settings. We hope that our similarity-based perspective opens new doors in zero-shot learning.