Towards Task Sampler Learning for Meta-Learning

Wang, Jingyao; Qiang, Wenwen; Su, Xingzhe; Zheng, Changwen; Sun, Fuchun; Xiong, Hui

doi:10.1007/s11263-024-02145-0

Towards Task Sampler Learning for Meta-Learning

Published: 17 June 2024

Volume 132, pages 5534–5564, (2024)
Cite this article

International Journal of Computer Vision Aims and scope Submit manuscript

Jingyao Wang^1,2,
Wenwen Qiang ORCID: orcid.org/0000-0002-7985-5743^1,2,
Xingzhe Su^1,2,
Changwen Zheng^1,2,
Fuchun Sun^1,3 &
…
Hui Xiong^4,5

742 Accesses
9 Citations
Explore all metrics

Abstract

Meta-learning aims to learn general knowledge with diverse training tasks conducted from limited data, and then transfer it to new tasks. It is commonly believed that increasing task diversity will enhance the generalization ability of meta-learning models. However, this paper challenges this view through empirical and theoretical analysis. We obtain three conclusions: (i) there is no universal task sampling strategy that can guarantee the optimal performance of meta-learning models; (ii) over-constraining task diversity may incur the risk of under-fitting or over-fitting during training; and (iii) the generalization performance of meta-learning models are affected by task diversity, task entropy, and task difficulty. Based on this insight, we design a novel task sampler, called Adaptive Sampler (ASr). ASr is a plug-and-play module that can be integrated into any meta-learning framework. It dynamically adjusts task weights according to task diversity, task entropy, and task difficulty, thereby obtaining the optimal probability distribution for meta-training tasks. Finally, we conduct experiments on a series of benchmark datasets across various scenarios, and the results demonstrate that ASr has clear advantages. The code is publicly available at https://github.com/WangJingyao07/Adaptive-Sampler.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+

from $39.99 /Month

Starting from 10 chapters or articles per month
Access and download chapters and articles from more than 300k books and 2,500 journals
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 2

Fig. 8

Adaptive Task Sampling for Meta-learning

A Task-Aware Attention-Based Method for Improved Meta-Learning

Learning Adaptable Risk-Sensitive Policies to Coordinate in Multi-agent General-Sum Games

Data Availability

The benchmark datasets can be downloaded from the literature cited in Sect. 8.2.1.

References

Abbas, M., Xiao, Q., Chen, L., Chen, P. Y., & Chen, T. (2022). Sharp-maml: Sharpness-aware model-agnostic meta learning. In International conference on machine learning, PMLR, pp. 10–32.
Bartler, A., Bühler, A., Wiewel, F., Döbler, M., & Yang, B. (2022). Mt3: Meta test-time training for self-supervised test-time adaption. In International conference on artificial intelligence and statistics, PMLR, pp. 3080–3090.
Bateni, P., Goyal, R., Masrani, V., Wood, F., & Sigal, L. (2020). Improved few-shot visual classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14493–14502.
Bertinetto, L., Henriques, J. F., Torr, P. H., & Vedaldi, A. (2018). Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136.
Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Book Google Scholar
Chan, K. (2022). Redunet: A white-box deep network from the principle of maximizing rate reduction. Journal of Machine Learning Research, 23(114).
Chen, H., Wang, Y., Wang, G., & Qiao, Y. (2018). Lstd: A low-shot transfer detector for object detection. In Proceedings of the AAAI conference on artificial intelligence.
Chen, W. Y., Liu, Y. C., Kira, Z., Wang, Y. C. F., & Huang, J. B. (2019). A closer look at few-shot classification. arXiv preprint arXiv:1904.04232.
Chen, X., & He, K. (2021). Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15750–15758.
Chen, Y., Zhou, K., Bian, Y., Xie, B., Ma, K., Zhang, Y., Yang, H., Han, B., & Cheng, J. (2022). Pareto invariant risk minimization. arXiv preprint arXiv:2206.07766.
Cheng, G., Lang, C., & Han, J. (2022). Holistic prototype activation for few-shot segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4), 4650–4666.
Google Scholar
Cheng, P. W., & Lu, H. (2017). Causal invariance as an essential constraint for creating representation of the world: Generalizing the invariance of causal power. The Oxford handbook of causal reasoning, pp. 65–84.
Chi, Z., Gu, L., Liu, H., Wang, Y., Yu, Y., & Tang, J. (2022). Metafscil: A meta-learning approach for few-shot class incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14166–14175.
Daskalaki, S., Kopanas, I., & Avouris, N. (2006). Evaluation of classifiers for an uneven class distribution problem. Applied Artificial Intelligence, 20(5), 381–417.
Article Google Scholar
Daw, A., & Pender, J. (2023). Matrix calculations for moments of markov processes. Advances in Applied Probability, 55(1), 126–150.
Article MathSciNet Google Scholar
DeVries, T., & Taylor, G. W. (2017). Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88, 303–338.
Article Google Scholar
Feng, Y., Chen, J., Zhang, T., He, S., Xu, E., & Zhou, Z. (2022). Semi-supervised meta-learning networks with squeeze-and-excitation attention for few-shot fault diagnosis. ISA Transactions, 120, 383–401.
Article Google Scholar
Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, PMLR, pp. 1126–1135.
Gao, C., Zheng, Y., Li, N., Li, Y., Qin, Y., Piao, J., Quan, Y., Chang, J., Jin, D., He, X., et al. (2023). A survey of graph neural networks for recommender systems: Challenges, methods, and directions. ACM Transactions on Recommender Systems, 1(1), 1–51.
Article Google Scholar
Grant, E., Finn, C., Levine, S., Darrell, T., & Griffiths, T. (2018). Recasting gradient-based meta-learning as hierarchical bayes. arXiv preprint arXiv:1801.08930.
Guo, Y., Codella, N. C., Karlinsky, L., Codella, J. V., Smith, J. R., Saenko, K., Rosing, T., & Feris, R. (2020). A broader study of cross-domain few-shot learning. In Computer vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, Springer, pp. 124–141.
Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., & Malik, J. (2011). Semantic contours from inverse detectors. In 2011 international conference on computer vision, IEEE, pp. 991–998.
Hilliard, N., Phillips, L., Howland, S., Yankov, A., Corley, C. D., & Hodas, N. O. (2018). Few-shot learning with metric-agnostic conditional embeddings. arXiv preprint arXiv:1802.04376.
Hospedales, T., Antoniou, A., Micaelli, P., & Storkey, A. (2021). Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9), 5149–5169.
Google Scholar
Hu, Z., Li, Z., Wang, X., & Zheng, S. (2022). Unsupervised descriptor selection based meta-learning networks for few-shot classification. Pattern Recognition, 122, 108304.
Article Google Scholar
Huang, G., Laradji, I., Vazquez, D., Lacoste-Julien, S., & Rodriguez, P. (2022). A survey of self-supervised and few-shot object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4), 4071–4089.
Google Scholar
Jamal, M. A., & Qi, G. J. (2019). Task agnostic meta-learning for few-shot learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11719–11727.
Jeong, T., & Kim, H. (2020). Ood-maml: Meta-learning for few-shot out-of-distribution detection and classification. Advances in Neural Information Processing Systems, 33, 3907–3916.
Google Scholar
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Koch, G., Zemel, R., Salakhutdinov, R., et al. (2015). Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Lille.
Kumar, R., Deleu, T., & Bengio, Y. (2022). The effect of diversity in meta-learning. arXiv preprint arXiv:2201.11775.
Lacoste, A., Oreshkin, B., Chung, W., Boquet, T., Rostamzadeh, N., & Krueger, D. (2018). Uncertainty in multitask transfer learning. arXiv preprint arXiv:1806.07528.
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2019). The omniglot challenge: A 3-year progress report. Current Opinion in Behavioral Sciences, 29, 97–104.
Article Google Scholar
Lang, C., Cheng, G., Tu, B., & Han, J. (2022). Learning what not to segment: A new perspective on few-shot segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8057–8067.
Lang, C., Cheng, G., Tu, B., & Han, J. (2023a). Few-shot segmentation via divide-and-conquer proxies. International Journal of Computer Vision, pp. 1–23.
Lang, C., Cheng, G., Tu, B., Li, C., & Han, J. (2023b). Base and meta: A new perspective on few-shot segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Lee, K., Maji, S., Ravichandran, A., & Soatto, S. (2019). Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10657–10665.
Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., & Liu, H. (2017). Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6), 1–45.
Article Google Scholar
Liu, C., Wang, Z., Sahoo, D., Fang, Y., Zhang, K., & Hoi, S. C. (2020a). Adaptive task sampling for meta-learning. In Computer Vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, Springer, pp. 752–769.
Liu, W., Zhang, C., Lin, G., & Liu, F. (2020b). Crnet: Cross-reference networks for few-shot segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4165–4173.
Lu, C., Feng, J., Lin, Z., Mei, T., & Yan, S. (2018). Subspace clustering by block diagonal representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 487–501.
Article Google Scholar
Luo, C., Song, C., & Zhang, Z. (2020). Generalizing person re-identification by camera-aware invariance learning and cross-domain mixup. In Computer vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16, Springer, pp. 224–241.
Ma, Y., Derksen, H., Hong, W., & Wright, J. (2007). Segmentation of multivariate mixed data via lossy data coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(9), 1546–1562.
Article Google Scholar
Mangla, P., Kumari, N., Sinha, A., Singh, M., Krishnamurthy, B., & Balasubramanian, V. N. (2020). Charting the right manifold: Manifold mixup for few-shot learning. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 2218–2227.
Martin, E. J., Polyakov, V. R., Zhu, X. W., Tian, L., Mukherjee, P., & Liu, X. (2019). All-assay-max2 pqsar: Activity predictions as accurate as four-concentration ic50s for 8558 novartis assays. Journal of Chemical Information and Modeling, 59(10), 4450–4459.
Article Google Scholar
Myers, V., & Sardana, N. (2021). Bayesian meta-learning through variational gaussian processes. arXiv preprint arXiv:2110.11044.
Nichol, A., & Schulman, J. (2018). Reptile: A scalable metalearning algorithm. 2(3), 4. arXiv preprint arXiv:1803.02999.
Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2014). Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1717–1724.
Parnami, A., & Lee, M. (2022). Learning from few examples: A summary of approaches to few-shot learning. arXiv preprint arXiv:2203.04291.
Qiao, S., Liu, C., Shen, W., & Yuille, A. L. (2018). Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7229–7238.
Raghu, A., Raghu, M., Bengio, S., & Vinyals, O. (2019). Rapid learning or feature reuse? Towards understanding the effectiveness of maml. arXiv preprint arXiv:1909.09157.
Rajeswaran, A., Finn, C., Kakade, S. M., & Levine, S. (2019). Meta-learning with implicit gradients. In Advances in neural information processing systems, 32.
Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J. B., Larochelle, H., & Zemel, R. S. (2018). Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676.
Ren, M., Liao, R., Fetaya, E., & Zemel, R. (2019). Incremental few-shot learning with attention attractor networks. In Advances in neural information processing systems, 32.
Requeima, J., Gordon, J., Bronskill, J., Nowozin, S., & Turner, R. E. (2019). Fast and flexible multi-task classification using conditional neural adaptive processes. In Advances in neural information processing systems, 32.
Rivolli, A., Garcia, L. P., Soares, C., Vanschoren, J., & de Carvalho, A. C. (2022). Meta-features for meta-learning. Knowledge-Based Systems, 240, 108101.
Article Google Scholar
Shaban, A., Bansal, S., Liu, Z., Essa, I., & Boots, B. (2017). One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410.
Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 761–769.
Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical networks for few-shot learning. In Advances in neural information processing systems, 30.
Sun, X., Wu, P., & Hoi, S. C. (2018). Face detection using deep learning: An improved faster rcnn approach. Neurocomputing, 299, 42–50.
Article Google Scholar
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., & Hospedales, T. M. (2018). Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1199–1208.
Tang, J., Wu, S., Sun, J., & Su, H. (2012). Cross-domain collaboration recommendation. In Proceedings of the 18th ACM SIGKDD international conference on nnowledge discovery and data mining, pp. 1285–1293.
Tao, X., Hong, X., Chang, X., Dong, S., Wei, X., & Gong, Y. (2020). Few-shot class-incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12183–12192.
Tian, Z., Zhao, H., Shu, M., Yang, Z., Li, R., & Jia, J. (2020). Prior guided feature enrichment network for few-shot segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(2), 1050–1065.
Article Google Scholar
Triantafillou, E., Zhu, T., Dumoulin, V., Lamblin, P., Evci, U., Xu, K., Goroshin, R., Gelada, C., Swersky, K., Manzagol, P. A., et al. (2019). Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096.
Vanschoren, J. (2018). Meta-learning: A survey. arXiv preprint arXiv:1810.03548.
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. (2016). Matching networks for one shot learning. In Advances in neural information processing systems, 29.
Wang, K., Liew, J. H., Zou, Y., Zhou, D., & Feng, J. (2019a). Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9197–9206.
Wang, Y., Chao, W. L., Weinberger, K. Q., & van der Maaten, L. (2019b). Simpleshot: Revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623.
Wang, Y., Yao, Q., Kwok, J. T., & Ni, L. M. (2020). Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (csur), 53(3), 1–34.
Article Google Scholar
Willemink, M. J., Koszek, W. A., Hardell, C., Wu, J., Fleischmann, D., Harvey, H., Folio, L. R., Summers, R. M., Rubin, D. L., & Lungren, M. P. (2020). Preparing medical imaging data for machine learning. Radiology, 295(1), 4–15.
Article Google Scholar
Wu, X., Sahoo, D., & Hoi, S. (2020). Meta-rcnn: Meta learning for few-shot object detection. In Proceedings of the 28th ACM international conference on multimedia, pp. 1679–1687.
Xiang, Y., Mottaghi, R., & Savarese, S. (2014). Beyond pascal: A benchmark for 3d object detection in the wild. In IEEE winter conference on applications of computer vision, IEEE, pp. 75–82.
Yang, Z., Luo, T., Wang, D., Hu, Z., Gao, J., & Wang, L. (2018). Learning to navigate for fine-grained classification. In V. Ferrari, M. Hebert, C. Sminchisescu Y. Weiss (Eds.) Computer vision—ECCV 2018—15th European conference, Munich, Germany, September 8–14, 2018, Proceedings, Part XIV, Springer, Lecture Notes in Computer Science, vol. 11218, pp. 438–454, https://doi.org/10.1007/978-3-030-01264-9_26.
Yao, H., Huang, L. K., Zhang, L., Wei, Y., Tian, L., Zou, J., Huang, J., et al. (2021a). Improving generalization in meta-learning via task augmentation. In International conference on machine learning, PMLR, pp. 11887–11897.
Yao, H., Wang, Y., Wei, Y., Zhao, P., Mahdavi, M., Lian, D., & Finn, C. (2021). Meta-learning with an adaptive task scheduler. Advances in Neural Information Processing Systems, 34, 7497–7509.
Google Scholar
Ying, X. (2019). An overview of overfitting and its solutions. Journal of Physics: Conference Series, 1168, 022022.
Google Scholar
Zhang, C., Song, N., Lin, G., Zheng, Y., Pan, P., & Xu, Y. (2021a). Few-shot incremental learning with continually evolved classifiers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12455–12464.
Zhang, X., Meng, D., Gouk, H., & Hospedales, T. M. (2021b). Shallow bayesian meta learning for real-world few-shot recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 651–660.
Zheng, Y. (2015). Methodologies for cross-domain data fusion: An overview. IEEE Transactions on Big Data, 1(1), 16–34.
Article Google Scholar
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1452–1464.
Article Google Scholar
Zhu, Q., Mao, Q., Jia, H., Noi, O. E. N., & Tu, J. (2022). Convolutional relation network for facial expression recognition in the wild with few-shot learning. Expert Systems with Applications, 189, 116046.
Article Google Scholar

Download references

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments. This work was supported in part by the Postdoctoral Fellowship Program of CPSF (Grant No. GZB20230790), the China Postdoctoral Science Foundation (Grant No. 2023M743639), the Special Research Assistant Fund, Chinese Academy of Sciences (Grant No. E3YD590101), the Science and Technology Planning Project of Guangdong Province (Grant No. 2023A0505050111), and the Guangzhou-HKUST (GZ) Joint Funding Program (Grant No.2023A03J0008).

Author information

Authors and Affiliations

National Key Laboratory of Space Integrated Information System, Institute of Software Chinese Academy of Sciences, Beijing, China
Jingyao Wang, Wenwen Qiang, Xingzhe Su, Changwen Zheng & Fuchun Sun
University of Chinese Academy of Sciences, Beijing, China
Jingyao Wang, Wenwen Qiang, Xingzhe Su & Changwen Zheng
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Fuchun Sun
Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology, Guangzhou, China
Hui Xiong
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China
Hui Xiong

Authors

Jingyao Wang
View author publications
Search author on:PubMed Google Scholar
Wenwen Qiang
View author publications
Search author on:PubMed Google Scholar
Xingzhe Su
View author publications
Search author on:PubMed Google Scholar
Changwen Zheng
View author publications
Search author on:PubMed Google Scholar
Fuchun Sun
View author publications
Search author on:PubMed Google Scholar
Hui Xiong
View author publications
Search author on:PubMed Google Scholar

Corresponding author

Correspondence to Wenwen Qiang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Communicated by Zhouchen Lin.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proofs

This appendix first provides the theoretical proofs of the theorems in Sect. 6. Next, we introduce the details and experimental settings of the meta-learning models.

In this section, we provide the proofs of Theorems 1, 2, and 3 in Appendix A.1, Appendix A.2, and Appendix A.3, respectively.

Notations Throughout this section, we use $Z_i$ to denote the representation of task $\mathcal {T}_i$, use ${\textbf {Z}}_i$ to denote the representation of the optimal $\mathcal {T}_i $, use $\mathbb {A}_+^n $, $\mathbb {R}_+ $, and $\mathbb {Z}_+ $ to denote the collection of $n \times n$ symmetric positive definite matrices, non-negative real numbers, and positive integers, respectively. The task $\mathcal {T}_i $ contains n samples and k classes, and class j contains $n_j$ samples. The dimension of representation $Z_i$ is d, $Z_i\in \mathbb {R}^d$.

1.1 A.1 Proof of Theorem 1

Theorem 1, also the Theorem 4 mentioned below, gives the upper bound of task diversity. The condition for the upper bound being tight is consistent with Maximally Feature Space in Corollary 1.

Theorem 4

Let $Z_i=\left[ Z_i^1,\ldots ,,Z_i^k \right] \in \mathbb {R}^{d\times n}$ be the representation of task $\mathcal {T}_i$, which has k classes and $n= {\textstyle \sum _{j=1}^{k}n_j} $ samples. For any representations $Z_i^j\in \mathbb {R}^{d \times n_j}$ of class j and any $\sigma >0$, we have:

$$\begin{aligned}&\frac{n}{2}\log \textrm{det}(\mathcal {I} +\frac{d}{n\sigma ^2} Z_i^*Z_i) \nonumber \\&\quad \le \sum _{j=1}^{k}\frac{n}{2}\log \textrm{det} \left( \mathcal {I} +\frac{d}{n\sigma ^2} {(Z_i^j)}^*{(Z_i^j)}\right) \end{aligned}$$

(14)

the equality holds if and only if:

$$\begin{aligned} {(Z_i^{j_1})}^*(Z_i^{j_2})=0,\quad s.t.\quad 1\le j_1 \le j_2 \le k \end{aligned}$$

(15)

Proof

According to Chan (2022), $\log \textrm{det} (\cdot ):\mathbb {A}_+^n\rightarrow \mathbb {R} $ is strictly concave. For any $\beta \in (0,1)$ and $\left\{ Z_{j_1},Z_{j_2} \right\} \in \mathbb {A}_+^n $:

$$\begin{aligned}{} & {} \log \textrm{det}((1-\beta )Z_{j_1}+\beta Z_{j_2}) \ge (1-\beta )\nonumber \\{} & {} \quad \log \textrm{det}(Z_{j_1})+\beta \log \textrm{det} (Z_{j_2}) \end{aligned}$$

(16)

with equality holds if and only if $Z_{j_1}=Z_{j_2}$. Then for all $\left\{ A_a, A_b \right\} \in \mathbb {A}_+^n$, we have:

$$\begin{aligned} \log (A_a)\le \log (A_b)+\left\langle \bigtriangledown \log (A_b),A_a-A_b \right\rangle \end{aligned}$$

(17)

According to Boyd and Vandenberghe (2004), let $A_b^{-1}=\bigtriangledown \log (A_b)$ and $A_b^{-1}=(A_b^{-1})^*$, we have:

$$\begin{aligned} \log (A_a)\le \log (A_b)+\textrm{tr}(A_b^{-1}A_a)-n \end{aligned}$$

(18)

we now let:

$$\begin{aligned} A_a&= \mathcal {I}+\frac{d}{n\sigma ^2}(Z_i)^*Z_i\nonumber \\&=\mathcal {I}+\frac{d}{n\sigma ^2}\begin{bmatrix} (Z_i^1)^*Z_i^1 &{} (Z_i^1)^*Z_i^2 &{} \dots &{} (Z_i^1)^*Z_i^k \\ (Z_i^2)^*Z_i^1 &{} (Z_i^2)^*Z_i^2 &{} \dots &{} (Z_i^2)^*Z_i^k \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ (Z_i^k)^*Z_i^1 &{} (Z_i^k)^*Z_i^2 &{} \dots &{} (Z_i^k)^*Z_i^k \end{bmatrix} \end{aligned}$$

(19)

From the property of determinant for block diagonal matrix (Lu et al., 2018), we let:

$$\begin{aligned} A_b&= \mathcal {I}+\frac{d}{n\sigma ^2}(Z_i^j)^*Z_i^j\nonumber \\&=\mathcal {I}+\frac{d}{n\sigma ^2}\begin{bmatrix} (Z_i^1)^*Z_i^1 &{} 0 &{} \dots &{} 0 \\ 0 &{} (Z_i^2)^*Z_i^2 &{} \dots &{} 0\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ 0 &{} 0 &{} \dots &{} (Z_i^k)^*Z_i^k \end{bmatrix} \end{aligned}$$

(20)

Then, for $\textrm{tr}(A_b^{-1}A_a)$:

$$\begin{aligned} \textrm{tr}(A_b^{-1}A_a) = \textrm{tr}\begin{bmatrix} \mathcal {I} &{} \dots &{} \chi \\ \vdots &{} \ddots &{} \vdots \\ \chi &{} \dots &{} \mathcal {I} \end{bmatrix} =n \end{aligned}$$

(21)

bring Eqs. (19), (20), and (21) back to Eq. (18), we can get:

$$\begin{aligned}&\le \frac{n}{2}\log \textrm{det}(\mathcal {I} +\frac{d}{n\sigma ^2} Z_i^*Z_i) \nonumber \\&\le \sum _{j=1}^{k}\frac{n}{2}\log \textrm{det} \left( \mathcal {I} +\frac{d}{n\sigma ^2} {(Z_i^j)}^*{(Z_i^j)}\right) \end{aligned}$$

(22)

where the equality holds if and only if $A_a=A_b$, i.e., ${(Z_i^{j_1})}^*(Z_i^{j_2})=0,\quad s.t.\quad 1\le j_1 \le j_2 \le k$. $\square $

1.2 A.2 Proof of Theorem 2

Theorem 2, also the Theorem 5 mentioned below, gives that task entropy is maximized by the representations that are maximally discriminative between different classes and tight in each class. This is consistent with Maximally Discriminability in Corollary 1, demonstrating that task entropy can well reflect intra-class compaction and inter-class separability.

Theorem 5

Let $Z_i=\left[ Z_i^1,\ldots , Z_i^k \right] $ be the representation of task $\mathcal {T}_i$, $\varsigma _j:= \left[ \varsigma _{1,j},\ldots , \varsigma _{min(n_j,d),j} \right] $ be the singular values of the representation $Z_i^j$ of class j, $\textrm{C}_i=\left[ \textrm{C}_i^1,\ldots ,\textrm{C}_i^k \right] $ is a collection of diagonal matrices, where the diagonal elements encode the n samples into the k classes. Given any $\epsilon >0$ and $d \ge d_j>0$, consider the optimization problem of task entropy:

$$\begin{aligned}&\underset{Z_i\in \mathbb {R}^{d\times n} }{\arg \max } (t_{et}^i)\nonumber \\&s.t.\quad \left\| Z_i\textrm{C}_i^j \right\| ^2=\textrm{tr}(\textrm{C}_i^j ),\nonumber \\&\textrm{rank}(Z_i)\le d_j, \forall j\in \left\{ 1,\ldots ,k \right\} \end{aligned}$$

(23)

Under the conditions where the error upper limit $\epsilon ^4< \underset{j}{\min }\left\{ \frac{n_j}{n}\frac{d ^2}{d_j^2} \right\} $, and the dimension $d\ge {\textstyle \sum _{j=1}^{k}d_j} $, the optimal solution ${\textbf {Z}}_i$ satisfies:

Between-class: The representation ${\textbf {Z}}_i^{j_1}$ and ${\textbf {Z}}_i^{j_2}$ lie in orthogonal subspaces, i.e., $({\textbf {Z}}_i^{j_1})^*({\textbf {Z}}_i^{j_2})=0$, where $1\le j_1 \le j_2 \le k$.
Within-class: each class j achieves its maximal dimension $d_j$, i.e., $\textrm{rank}({\textbf {Z}}_i^j)= d_j$, and either $\left[ \varsigma _{1,j},\ldots , \varsigma _{d_j,j} \right] $ equal to $\frac{\textrm{tr}(\textrm{C}_i^j )}{d_j} $, or $\left[ \varsigma _{1,j},\ldots , \varsigma _{d_j-1,j} \right] $ equal to and have values larger than $\frac{\textrm{tr}(\textrm{C}_i^j )}{d_j} $.

Proof

Use singular value decomposition (SVD) to decompose $Z_i$ into $Z_i=U_i\Sigma _i V_i^*$, where $U_i$ and $V_i$ are unitary matrices, $\Sigma _i$ is a diagonal matrix, and its diagonal elements are singular values of $Z_i$. Since $\textrm{rank}(Z_i)\le d_j, \forall j\in \left\{ 1,\ldots ,k \right\} $, assume the first $d_j$ diagonal elements of $\Sigma _i$ is not zero, and the subsequent diagonal elements are all zero. Therefore, $\Sigma _i$ will be:

$$\begin{aligned} \Sigma _i=\left[ \begin{array}{cc} \Sigma _{i,1} &{} 0 \\ 0 &{} 0 \end{array}\right] \end{aligned}$$

(24)

where $\Sigma _{i,1}$ is a diagonal matrix of $d_j\times d_j$, and its diagonal elements are $\varsigma _{1,j},\ldots , \varsigma _{d_j,j}$. Similarly, for $U_i$ and $V_i$:

$$\begin{aligned} U_i=\left[ \begin{array}{cc} U_{i,1} &{} U_{i,2} \\ \end{array}\right] , \quad V_i=\left[ \begin{array}{cc} V_{i,1} &{} V_{i,2} \\ \end{array}\right] \end{aligned}$$

(25)

Among them, $U_{i,1}$ and $V_{i,1}$ are both matrices of $d\times d_j$, and $U_{i,2}$ and $V_{i,2}$ are both $ The matrix of d\times (n-d_j)$. Then, we get:

$$\begin{aligned} Z_i&=U_i\Sigma _i V_i^* \nonumber \\&=\left[ \begin{array}{cc} U_{i,1} &{} U_{i,2} \\ \end{array}\right] \left[ \begin{array}{cc} \Sigma _{i,1} &{} 0 \\ 0 &{} 0 \end{array}\right] \left[ \begin{array}{c} V_{i,1}^* \\ V_{i,2}^* \end{array}\right] \nonumber \\&=U_{i,1}\Sigma _{i,1} V_{i,1}^* \end{aligned}$$

(26)

Since $\textrm{C}_i^j $ is a diagonal matrix, and only $n_j$ of its diagonal elements are 1, and the rest are 0, we have:

$$\begin{aligned} \textrm{tr}(\textrm{C}_i^j )=n_j,\quad Z_i\textrm{C}_i^j =U_{i,1}\Sigma _{i,1} V_{i,1}^ *\textrm{C}_i^j \end{aligned}$$

(27)

Therefore, the constraint $\left\| Z_i\textrm{C}_i^j \right\| ^2 =\textrm{tr}(\textrm{C}_i^j )$ can be equivalently written as:

$$\begin{aligned} \left\| U_{i,1}\Sigma _{i,1} V_{i,1}^*\textrm{C}_i^j\right\| ^2=n_j, \quad \forall j\in \left\{ 1,\ldots ,k \right\} \end{aligned}$$

(28)

Without loss of generality, let ${\textbf {Z}}_i=\left[ {\textbf {Z}}_i^1,\ldots , {\textbf {Z}}_i^k \right] $ is the feature representation of the optimal task $\mathcal {T}_i $. To show that ${\textbf {Z}}_i^j,j\in \left\{ 1,\ldots ,k \right\} $, are pairwise orthogonal, suppose for the purpose of arriving at a contradiction that $({\textbf {Z}}_i^{j_1})^*({\textbf {Z}}_i^{j_2})=0$ for some $1\le j_1 \le j_2 \le k$. That is:

$$\begin{aligned}&\underset{Z_i\in \mathbb {R}^{d\times n} }{\arg \max } (t_{et}^i)\nonumber \\&\quad \le \frac{\textrm{tr}(\textrm{C}_i^j ) }{2n})\log \det \left( \mathcal {I} +\frac{d}{\textrm{tr}(\textrm{C}_i^j ) \epsilon ^2}{{{\textbf {Z}}}_i^*} \textrm{C}_i^j{{{\textbf {Z}}}_i}\right) \nonumber \\&s.t.\quad \left\| {\textbf {Z}}_i\textrm{C}_i^j \right\| ^2 =\textrm{tr}(\textrm{C}_i^j ),\nonumber \\&\textrm{rank}({\textbf {Z}}_i)= d_j, \forall j\in \left\{ 1,\ldots ,k \right\} \end{aligned}$$

(29)

According to the proof of Theorem 4, the strict inequality in Between-class of Eq. (23) holds for the optimal solution ${\textbf {Z}}_i$. On the other hand, since $\sum _{j=1}^{k} d_j\le n$, there exists $\left\{ Q_i^j\in \mathbb {R}^{d\times d_j} \right\} _{j=1}^k$ such that the columns of the matrix $\mathcal {Q} $ are orthonormal.

Since $Z_i=U_i\Sigma _i V_i^*$, $\Sigma _i\Sigma _i^*$ is a diagonal matrix and its diagonal element is $\varsigma _{l,j}^2 $, $\Sigma _{i,2}$ is a diagonal matrix of $d\times d$, and its diagonal element is $\varsigma _{l,j}^2 $, we have:

$$\begin{aligned} Z_iZ_i^*&=U_i\Sigma _i\Sigma _i^* U_i^*\nonumber \\&=\left[ \begin{array}{cc} U_{i,1} &{} U_{i,2} \\ \end{array}\right] \left[ \begin{array}{cc} \Sigma _{i,2} &{} 0 \\ 0 &{} 0 \end{array}\right] \left[ \begin{array}{c} U_{i,1}^* \\ U_{i,2}^* \end{array}\right] \nonumber \\&=U_{i,1}\Sigma _{i,2} U_{i,1}^* \end{aligned}$$

(30)

where the rank of $Z_iZ_i^*$ is equal to the rank of $\Sigma _{i,2}$, that is, $\textrm{rank}(Z_iZ_i^*)=\textrm{rank}(\Sigma _{i,2})=d_j $. This means that only $d_j$ of the eigenvalues of $Z_iZ_i^*$ are non-zero, and the rest are zero. Since $Z_iZ_i^*$ is a symmetric matrix, we can diagonalize it as:

$$\begin{aligned} Z_iZ_i^*=Q_i\Lambda _i Q_i^* \end{aligned}$$

(31)

Since $(Z_i^{j_1})^* Z_i^{j_2}=V_{i,1}^{j_1*}\Sigma _{i,1}^{j_1*} U_{i,1}^{j_1* } U_{i,1}^{j_2}\Sigma _{i,1}^{j_2} V_{i,1}^{j_2}$, then:

$$\begin{aligned}&Z_i^{j_1}Z_i^{j_2*}\nonumber \\&\quad =Q_{i,1}^{j_1}\Lambda _{i,1}^{j_1} Q_{i,1}^{j_1*} Q_{i,1}^{j_2}\Lambda _{i,1}^{j_2} Q_{i,1}^{j_2*}\nonumber \\&\quad =Q_{i,1}^{j_1}\Lambda _{i,1}^{j_1}(Q_{i,1}^{j_1*} Q_{i,1}^{j_2})\Lambda _{i,1}^{j_2} Q_{i,1}^{j_2*}\nonumber \\&\quad =0 \end{aligned}$$

(32)

That is, the matrices are pairwise orthogonal, i.e., $({\textbf {Z}}_i^{j_1})^*({\textbf {Z}}_i^{j_2})$ $=0$, where $1\le j_1 \le j_2 \le k$.

Since $\det (Z_iZ_i^*)=\det (\Sigma _i\Sigma _i^*)=\det (\Sigma _{i,1}\Sigma _{i,1}^*)$ $=\prod _{l=1} ^{d_j}\varsigma _{l,j}^2$, we have:

$$\begin{aligned} t_{et}^i=\sum _{j=1}^{k}\frac{n_j}{n}\log _2\left( \frac{n}{n_j}\right) -\frac{n}{d}\log _2\left( \frac{n}{d}\right) +\frac{2n}{d}\log _2\left( \prod _{l=1}^{d_j}\varsigma _{l,j}\right) \end{aligned}$$

(33)

In order to maximize $t_{et}^i$, we need to maximize $\prod _{l=1}^{d_j}\varsigma _{l,j}$, subject to satisfying the constraints. Since $\left\| U_{i,1}\Sigma _{i,1} V_{i,1}^*\textrm{C}_i^j \right\| ^2=n_j$, we have:

$$\begin{aligned}&\left\| U_{i,1}\Sigma _{i,1} V_{i,1}^*\textrm{C}_i^j \right\| ^2\nonumber \\&\quad =\textrm{tr}(U_{i,1}\Sigma _{i,1} V_{i,1}^*\textrm{C}_i^j V_{i,1} \Sigma _{i,1}^* U_{i,1}^*)\nonumber \\&\quad =\textrm{tr}(\Sigma _{i,1} V_{i,1}^*\textrm{C}_i^j V_{i,1} \Sigma _{i,1}^*)\nonumber \\&\quad =\sum _{l=1}^{d_j}\varsigma _{l,j}^2\left\| V_{i,l}^* \textrm{C}_i^j V_{i,l} \right\| ^2\nonumber \\&\quad =n_j \end{aligned}$$

(34)

where $V_{i,l}$ represents the lth column of $V_{i,1}$. Since $\left\| V_{i,l}^*\textrm{C}_i^j V_{i,l} \right\| ^2\le 1$, we can get:

$$\begin{aligned} \sum _{l=1}^{d_j}\varsigma _{l,j}^2\le n_j \end{aligned}$$

(35)

the equal sign holds true if and only if $\left\| V_{i,l}^*\textrm{C}_i^j V_{i,l} \right\| ^2= 1$. This means that $V_{i,l}$ must be an eigenvector of $\textrm{C}_i^j $, and the corresponding eigenvalue is 1. Since only $n_j$ eigenvalues of $\textrm{C}_i^j $ are 1 and the rest are 0. To maximize $\prod _{l=1}^{d_j}\varsigma _{l,j}$, we need to make $\varsigma _{l,j}$ as equal as possible, that is:

$$\begin{aligned} \varsigma _{l,j}=n_j/d_j,\quad \forall l\in \left\{ 1,\ldots ,d_j \right\} \end{aligned}$$

(36)

Then, according to Chan (2022), the optimization problem in Eq. (23) depends on $Z_i^j$ only through its singular values. We have:

$$\begin{aligned}&\left\| Z_i^j \right\| _{F}^2 =\sum _{l=1}^{\min \left\{ n_j,d \right\} } \varsigma _{l,j}^2\nonumber \\&\textrm{rank}(Z_i^j)=\left\| \varsigma _j \right\| _0 \end{aligned}$$

(37)

Let $\varsigma _j^*:= \left[ \varsigma _{1,j}^*,\ldots , \varsigma _{min(n_j,d),j}^* \right] $ be an optimal solution to Eq. (23). Without loss of generality we assume that the entries of $\varsigma _j^*$ are sorted in descending order. We define:

$$\begin{aligned} f(z;d,\epsilon ,n_j,n)=\log \left( \frac{1 +\frac{d}{n \epsilon ^2}{\varsigma _i}}{1 +\frac{d}{n_j \epsilon ^2}{\varsigma _i^j}}\right) \end{aligned}$$

(38)

Then apply the Lemma 13 in Chan (2022) and conclude that the unique optimal solution to Eq. (23), we get:

$\varsigma _i^*=\left[ \frac{n_j}{d_j},\ldots ,\frac{n_j}{d_j} \right] $ or
$\varsigma _i^*=\left[ \frac{n_j}{d_j},\ldots ,\frac{n_j}{{d_j}-1}, \varsigma _i^L \right] , \varsigma _i^L>0$

$\square $

1.3 A.3 Proof of Theorem 3

Theorem 3, also the Theorem 6 mentioned below, gives the lower bound of task difficulty. The condition for the lower bound being tight is consistent with Minimally Effect Gap in Corollary 1. This shows that task difficulty can well reflect causal invariance.

Theorem 6

The support set $\mathcal {D}_i^s$ and query set $\mathcal {D}_i^q$ are two different datasets of $\mathcal {T}_i$. For any $\mathcal {T}_i$ and f, we have:

$$\begin{aligned} \sum \limits _{i,j} {\Vert {{\nabla _{x_{i,j}^s}}\mathcal {L} (\mathcal {D}_{i}^s,f) - {\nabla _{x_{i,j}^q}}\mathcal {L} (\mathcal {D}_{i}^q,f}) \Vert _2^2}\ge 0 \end{aligned}$$

(39)

the equality holds if and only if the gradients of the support set $\mathcal {D}_i^s$ and query set $\mathcal {D}_i^q$ are consistent:

$$\begin{aligned} \sum \limits _{j}{{\nabla _{x_{i,j}^s}}\mathcal {L}(\mathcal {D}_{i}^s,f) = \sum \limits _{j}{\nabla _{x_{i,j}^q}}\mathcal {L}(\mathcal {D}_{i}^q,f}) \end{aligned}$$

(40)

Proof

we can deduce it directly from Eq. (39):

$$\begin{aligned}&\sum \limits _{i,j} {\Vert {{\nabla _{x_{i,j}^s}}\mathcal {L} (\mathcal {D}_{i}^s,f) - {\nabla _{x_{i,j}^q}}\mathcal {L} (\mathcal {D}_{i}^q,f}) \Vert _2^2}\nonumber \\&\quad =\sum \limits _{i} {\Vert \sum \limits _{j}{{\nabla _{x_{i,j}^s}} \mathcal {L}(\mathcal {D}_{i}^s,f) -\sum \limits _{j}{\nabla _{x_{i,j}^q}}\mathcal {L} (\mathcal {D}_{i}^q,f)} \Vert _2^2} \end{aligned}$$

(41)

where $\left\| \cdot \right\| $ represents absolute value, always greater than zero, then:

$$\begin{aligned} \sum \limits _{i} {\Vert \sum \limits _{j}{{\nabla _{x_{i,j}^s}} \mathcal {L}(\mathcal {D}_{i}^s,f) -\sum \limits _{j}{\nabla _{x_{i,j}^q}} \mathcal {L}(\mathcal {D}_{i}^q,f}) \Vert _2^2}\ge 0 \end{aligned}$$

(42)

where the equality holds if and only if:

$$\begin{aligned} \sum \limits _{j}{{\nabla _{x_{i,j}^s}}\mathcal {L} (\mathcal {D}_{i}^s,f) = \sum \limits _{j}{\nabla _{x_{i,j}^q}} \mathcal {L}(\mathcal {D}_{i}^q,f)} \end{aligned}$$

(43)

In summary, the measurements we proposed can well evaluate the quality of meta-learning tasks.

Appendix B: Meta-learning Models

In this section, we describe the details and experimental settings of the three types of meta-learning models mentioned in Sect. 8.2.

1.1 B.1 Overview

To conduct a more comprehensive analysis of task diversity, we incorporate various meta-learning models. According to Hospedales et al. (2021), we classify them into three categories: 1) Optimization-based (i.e., MAML (Finn et al., 2017), Reptile (Nichol and Schulman, 2018), and MetaOptNet (Lee et al., 2019)). These methods aim to learn a set of optimal initialization parameters that guide the model to quickly converge when learning new tasks. 2) Metric-based (i.e., ProtoNet (Snell et al., 2017), MatchingNet (Vinyals et al., 2016), and RelationNet (Sung et al., 2018)). These non-parametric methods are based on metric learning and are similar to nearest neighbor algorithms and kernel density estimation. 3) Bayesian-based models (i.e., CNAPs (Requeima et al., 2019), SCNAP (Bateni et al., 2020)). These methods use conditional probability as the core of meta-learning computations and modify the classifier to pick up new classes using pre-trained networks.

With the development of meta-learning, many novel models and variants have emerged in recent years. Besides the above three types of frameworks, there are methods that particularly meta-learn features designed for few-shot learning and/or update features during meta-testing (i.e., SimpleShot (Wang et al., 2019b), SUR (Mangla et al., 2020), and PPA (Qiao et al., 2018)). We consider these models to be non-direct frameworks for the study of task sampling. For cross-domain analysis, we use Baseline++ (Chen et al., 2019) and S2M2 (Mangla et al., 2020) that use linear classifiers, and MetaQDA (Zhang et al., 2021b) which is a Bayesian meta-learning generalization of the classic quadratic discriminant analysis as base frameworks.

1.2 B.2 Optimization-based

1.2.1 B.2.1 MAML

MAML (Finn et al., 2017) is a meta-learning approach that is agnostic to specific models, making it compatible with any model trained using gradient descent. It can be applied to a variety of learning problems, with the explicit goal of training parameters that will generalize well to new tasks with only a small amount of training data and a few gradient steps.

For meta-learning with MAML, the method first initializes meta-parameters $\theta $ and samples tasks $\mathcal {T}_{i}$ from a task distribution $p(\mathcal {T})$. In the inner loop, the adaptive parameters are calculated for each task $\mathcal {T}_{i}$ using gradient descent, as follows:

$$\begin{aligned} \theta _{i}^{'}=\theta -\alpha \bigtriangledown _{\theta } \mathcal {L}_{\mathcal {T}_{i} }(f_{\theta }) \end{aligned}$$

(44)

In the outer loop, the meta-parameter $\theta $ is updated based on the accumulated gradient, as follows:

$$\begin{aligned} \theta \leftarrow \theta -\beta \bigtriangledown _{\theta } {\textstyle \sum _{\mathcal {T}_{i}\sim p(\mathcal {T})}} \mathcal {L}_{\mathcal {T}_{i} }(f_{\theta _{i}^{'}}) \end{aligned}$$

(45)

where $\alpha $ and $\beta $ are the step size hyperparameters of the inner loop and outer loop, respectively.

In the experiments, we set the parameters as follows: the size of the running epoch is set to 150; batch size is 32 or 16; the meta-learning rate of Adam optimizer (Kingma and Ba, 2014) is 0.001; and the internal adaptation number is 1 with a step size of 0.4.

1.2.2 B.2.2 Reptile

Reptile (Nichol and Schulman, 2018) is a meta-learning approach that extends MAML by learning the initialization of neural network model parameters. It operates by repeatedly sampling a task, training it, and moving the initialization of the model parameters toward the trained weights for that task.

For meta-learning with Reptile, the method first initializes the meta-parameter $\theta $. For each iteration, a task $\mathcal {T}$ is sampled, corresponding to loss $\mathcal {L}_{\mathcal {T}}$ and a set of trained weights $\tilde{\theta }$. For a specific task, Reptile computes $\tilde{\theta }=U_{\mathcal {T}^{k} }(\theta ) $ denoting k steps of SGD or Adam. The meta-parameter $\theta $ is then updated using the following equation:

$$\begin{aligned} \theta \leftarrow \theta +\epsilon (\tilde{\theta }-\theta ) \end{aligned}$$

(46)

In the last step, instead of simply updating $\theta $, Reptile treats $\tilde{\theta }-\theta $ as a gradient and plugs it into an adaptive algorithm, such as Adam (Kingma and Ba, 2014).

In the experiments, we set the parameters as follows: For miniImageNet, the running epoch size is set to 150, the batch size is 32, the learning rate is 0.01, the meta-learning rate is 0.001, and the internal adaptation number is 5. The inner loop uses the SGD optimizer, and the outer loop uses the Adam optimizer. For tieredImageNet, we increase the number of internal adaptations to 10. For Omniglot, the meta-learning rate is set to 0.0005, and the number of internal adaptations is the same as that on tieredImageNet, while only running for 100 epochs.

1.2.3 B.2.3 MetaOptNet

MetaOptNet (Lee et al., 2019) is a meta-learning model proposed for few-shot learning, which aims to learn an embedding model that generalizes well for novel categories under a linear classification rule. To achieve this, it utilizes the implicit differentiation of the optimality conditions of the convex problem and the dual formulation of the optimization problem.

The learning objective of MetaOptNet is to minimize the generalization error across tasks given a base learner $\mathcal {A}$ and an embedding model $\phi $. The generalization error is estimated on a set of held-out tasks. The choice of the base learner has a significant impact on the objective as it has to be efficient since the expectation has to be computed over a distribution of tasks.

Formally, the learning objective is:

$$\begin{aligned} \underset{\phi }{\min }\ \mathbb {E}_{\mathcal {T}} \left[ \mathcal {L}^{meta}(\mathcal {D}^{test};\theta ;\phi )\right] \end{aligned}$$

(47)

Once the embedding model $f_{\phi }$ is learned, its generalization is estimated on a set of held-out tasks (often referred to as a meta-test set):

$$\begin{aligned} \mathbb {E}_{S}[\mathcal {L}^{meta}(\mathcal {D}^{test};\theta ,\phi ) ], \quad where \quad \theta =\mathcal {A}(\mathcal {D}^{train};\theta ) \end{aligned}$$

(48)

The above equation is greatly affected by the selection of the base learner $\mathcal {A}$. The chosen base learner must be efficient as the expectation is calculated across a task distribution. In this study, we explore base learners that rely on multi-class linear classifiers, which can be expressed in a simplified form as follows:

$$\begin{aligned}&\theta = \mathcal {A}(\mathcal {D}^{train};\phi )=\arg \underset{\left\{ w_{k} \right\} }{\min } \underset{\left\{ \zeta _{i} \right\} }{\min } \frac{1}{2}\sum _{k}\left\| w_{k} \right\| _{2}^{2} +C\sum _{n}\zeta _{n} \end{aligned}$$

(49)

$$\begin{aligned}&w_{y_{n}} \cdot f_{\phi }(x_{n})-w_{k}\cdot f_{\phi }(x_{n}) \ge 1-\delta _{y_{n,k}}-\zeta _{n},\forall n,k \end{aligned}$$

(50)

where C is the regularization parameter, and $\delta _{.,.}$ denotes the Kronecker delta function. The official repository trains the model using a 5-way 15-shot approach and evaluates it using a 5-way 1-shot approach. However, to ensure a fair and accurate comparison with other models as outlined in Kumar et al. (2022), we train and test the model using a 5-way 1-shot approach in this study. It is worth noting that our focus is on comparing the performance of different samplers for a given model, and the aforementioned difference in training and testing approaches would not affect our examination of task diversity in any way.

In the experiments, we set the parameters as follows: the size of the running epoch is set to 60, the batch size is 32 or 16, the learning rate is 0.01, and the meta-learning rate is 0.001. We use an SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001 to make gradient steps.

1.3 B.3 Metric-based Models

1.3.1 B.3.1 ProtoNet

Prototypical networks (ProtoNet) (Snell et al., 2017) is a method proposed for few-shot classification tasks. This involves a classifier that must generalize to new classes not seen in the training set, with only a few examples available for each new class. ProtoNet addresses this problem by learning a metric space, where classification is performed by calculating distances to prototype representations of each class. Compared to other few-shot learning approaches, ProtoNet’s simpler inductive bias is advantageous in the limited-data regime and achieves outstanding results.

To generate M-dimensional prototype representations $c_{k} \in \mathbb {R}^{M}$ for each class, ProtoNet employs an embedding function $f_{\phi }:\mathbb {R}^{D}\rightarrow \mathbb {R}^{M} $ with learnable parameters $\phi $. Each prototype is computed as the mean vector of the embedded support points belonging to its corresponding class:

$$\begin{aligned} c_k=\frac{1}{\left| S \right| }\sum _{(x_i,y_i)\in S_{k}}f_{\phi }(x_i) \end{aligned}$$

(51)

Once a prototype is constructed for each class, ProtoNet classifies query examples by determining the nearest prototype to them in the metric space using Euclidean distance. Specifically, the probability that a query example $x^*$ belongs to class k is calculated as follows:

$$\begin{aligned} p(y^*=k|x^*,S)=\frac{\textrm{exp}(-\left\| g(x^*)-c_{k} \right\| _2^2 ) }{ {\textstyle \sum _{k^{'}\in \left\{ 1,\ldots ,N \right\} }\textrm{exp}(-\left\| g(x^*) -c_{k^{'}} \right\| _2^2 )}} \end{aligned}$$

(52)

In our experiments, we set the parameters as follows: For miniImageNet, Omniglot, and tieredImageNet, we use a batch size of 32 and run for 100 epochs in a 5-way-1-shot setting. However, we use a batch size of 16 rather than 32 in a 20-way-1-shot setting to accommodate the longer training time and memory constraints. We set the meta-learning rate to 0.001, use an Adam optimizer for gradient steps, and set the step size of the StepLR scheduler to 0.4 with a gamma value of 0.5.

1.3.2 B.3.2 MatchingNet

Matching Networks (MatchingNet), as described in Vinyals et al. (2016), leverages the concepts of metric learning based on deep neural features and the latest developments that enhance neural networks with external memories. This approach trains a network that maps a small labeled support set and an unlabelled example to its label, eliminating the need for fine-tuning to adapt to new class types.

The crucial point is that once trained, MatchingNet can generate sensible test labels for unseen classes without modifying the network. More precisely, MatchingNet aims to map a support set of k image-label pairs, denoted as $S=\left\{ (x_i,y_i) \right\} ^k_{i=1}$, to a classifier $c_{S}(x^* )$. Given a test example $x^*$, the classifier produces a probability distribution over possible outputs $y^*$. The parametric neural network defined by p is used to predict the appropriate label $\hat{y}$ for each test example $x^*$. MatchingNet assigns labels to each query example based on a cosine distance-weighted linear combination of the support labels:

$$\begin{aligned} p(y^*=k|x^*,S)=\sum _{i=1}^{\left| S \right| }a(x^*,x_i) \Psi _{y_{i}=k} \end{aligned}$$

(53)

where $a(\cdot ,\cdot )$ denotes cosine similarity, $\Psi $ is the indicator function, and the output is softmax normalized over all support examples $x_i$.

In the experiments, we set the model parameters as follows: For standard few-shot learning under 5-way-1-shot settings, we run the epoch for 100 times with a batch size of 32. To make gradient steps, we use an Adam optimizer with a meta-learning rate of 0.001 and a weight decay of 0.0001. For training on CUB and Meta-Dataset under a 5-way 1-shot setting, we use the same parameters as the miniImageNet, except for the batch size and learning rate, which were set to 16 and 0.005, respectively.

1.3.3 B.3.3 RelationNet

Relation Network (RelationNet), presented in Sung et al. (2018), is a flexible and general framework for few-shot learning that is conceptually simple. The framework involves learning a deep distance metric to compare a small number of images within episodes, which is trained end-to-end from scratch.

The RelationNet framework comprises two modules: an embedding module, $f_{\varphi }$, and a relation module, $g_{\phi }$. The embedding module produces feature maps, $f_{\varphi }(x_{i})$ and $f_{\varphi }(x_{j})$, where $x_{i}$ and $x_{j}$ are samples in the support set S, and query set Q. These feature maps are combined using the operator $C(f_{\varphi }(x_{i}),f_{\varphi }(x_{j}))$ and fed into the relation module, $g_{\phi }$, for the next stage. The relation module produces a scalar value between 0 and 1 that represents the similarity between $x_{i}$ and $x_{j}$. The relation scores $r_{i,j}$ in C-way-1-shot settings (C relation scores) are generated using the following equation:

$$\begin{aligned} r_{i,j}=g_{\phi }(C(f_{\varphi }(x_i),f_{\varphi }(x_j))),i=1,2,\ldots ,C \end{aligned}$$

(54)

For K-shot settings, where $K>1$, we sum the embedding module outputs of all samples from each training class element-wise to form the feature map for that class. The model is trained using Mean Square Error (MSE) loss:

$$\begin{aligned} \varphi ,\phi =\underset{\varphi ,\phi }{\arg \min }\sum _{i=1}^{m} \sum _{j=1}^{n}(r_{i,j}-1(y_i==y_j))^2 \end{aligned}$$

(55)

Conceptually, this framework predicts relation scores, which can be considered a regression problem.

In the experiments, we set the model parameters as follows: For Omniglot, miniImageNet, and tieredImageNet under a 5-way-1-shot setting, the method is run for 100 epochs with a batch size of 32. An Adam optimizer is used to make gradient steps with a meta-learning rate of 0.001 and a weight decay of 0.0005. The same hyperparameters are used for training the model on Omniglot under a 20-way-1-shot setting.

In our experiments on Omniglot and miniImageNet, and tieredImageNet under a 5-way-1-shot setting, we run the epoch 100 times with a batch size of 32. We use an Adam optimizer to make gradient steps with a meta-learning rate of 0.001 and a weight decay of 0.0005. The same hyperparameters are used for training our model on Omniglot under a 20-way 1-shot setting.

1.4 B.4 Bayesian-based Models

1.4.1 B.4.1 CNAPs

The Conditional Neural Adaptive Processes (CNAPs) (Requeima et al., 2019) approach is designed to handle multi-task classification problems. It is based on a conditional neural process that employs an adaptation network to modulate the classifier’s parameters based on the current task’s dataset, without requiring additional tuning. This feature enables the model to handle a variety of input distributions.

The data for task $\tau $ includes a context set $D^\tau =\left\{ (x_{n}^\tau ,y_{n}^\tau ) \right\} _{n=1}^{N_{\tau }}$ and a target set $\left\{ (x_{m}^\tau ,y_{m}^\tau ) \right\} _{n=1}^{M_{\tau }}$. The former is with inputs and outputs observed while the latter is used to make predictions ($y^{\tau *}$ are only observed during training). CNAPs construct predictive distributions given $x^*$ as:

$$\begin{aligned} p(y^*|x^*,\theta , D^\tau )=p(y^*|x^*,\theta ,\psi ^\tau =\psi _\phi (D^\tau )) \end{aligned}$$

(56)

where $\theta $ are global classifier parameters shared across tasks, $\phi $ are adaptation network parameters used in the function $\psi _\phi (\cdot )$ that acts on $D^\tau $, and $\psi ^\tau $ are local task-specific parameters produced by $\psi _\phi (\cdot )$.

In the experiments, we set the model parameters as follows: In the standard few-shot learning setting, we run the epoch ten times with a batch size of 16 and a meta-learning rate of 0.005. In multi-domain few-shot learning, the meta-learning rate is set to 0.01.

1.4.2 B.4.2 SCNAP

Simple CNAPS (SCNAP) (Bateni et al., 2020)) is an architecture that performs better than CNAPs with up to 9.2% fewer trainable parameters. It hypothesizes that a class-covariance-based distance metric, specifically the Mahalanobis distance, can be adopted into CNAPs. In contrast to CNAPs, SCNAP directly computes the conditional probability $p(\cdot )$ of a sample belonging to a class using a deterministic, fixed distance metric $d_k$, as follows:

$$\begin{aligned}&p(y_i^*=k|f_{\theta }^{\tau }(x_i^*),S^\tau ) =\textrm{softmax}(-d_k(f_{\theta }^{\tau }(x_i^*)),\mu _k) \nonumber \\&d_k(x,y)=\frac{1}{2}(x-y)^T(Q_k^\tau )^{-1}(x-y) \qquad \quad \end{aligned}$$

(57)

where $Q_k^\tau $ is a covariance matrix specific to the task and class.

The parameters of this model are consistent with CNAPs.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Wang, J., Qiang, W., Su, X. et al. Towards Task Sampler Learning for Meta-Learning. Int J Comput Vis 132, 5534–5564 (2024). https://doi.org/10.1007/s11263-024-02145-0

Download citation

Received: 03 October 2023
Accepted: 31 May 2024
Published: 17 June 2024
Version of record: 17 June 2024
Issue date: December 2024
DOI: https://doi.org/10.1007/s11263-024-02145-0

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+

from $39.99 /Month

Starting from 10 chapters or articles per month
Access and download chapters and articles from more than 300k books and 2,500 journals
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Towards Task Sampler Learning for Meta-Learning

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Adaptive Task Sampling for Meta-learning

A Task-Aware Attention-Based Method for Improved Meta-Learning

Learning Adaptable Risk-Sensitive Policies to Coordinate in Multi-agent General-Sum Games

Explore related subjects

Data Availability

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendices

Appendix A: Proofs

1.1 A.1 Proof of Theorem 1

Theorem 4

Proof

1.2 A.2 Proof of Theorem 2

Theorem 5

Proof

1.3 A.3 Proof of Theorem 3

Theorem 6

Proof

Appendix B: Meta-learning Models

1.1 B.1 Overview

1.2 B.2 Optimization-based

1.2.1 B.2.1 MAML

1.2.2 B.2.2 Reptile

1.2.3 B.2.3 MetaOptNet

1.3 B.3 Metric-based Models

1.3.1 B.3.1 ProtoNet

1.3.2 B.3.2 MatchingNet

1.3.3 B.3.3 RelationNet

1.4 B.4 Bayesian-based Models

1.4.1 B.4.1 CNAPs

1.4.2 B.4.2 SCNAP

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now