Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need

Published in: International Journal of Computer Vision

Abstract

Class-incremental learning (CIL) aims to adapt to emerging new classes without forgetting old ones. Traditional CIL models are trained from scratch to continually acquire knowledge as data evolves. Recently, pre-training has achieved substantial progress, making vast pre-trained models (PTMs) accessible for CIL. Contrary to traditional methods, PTMs possess generalizable embeddings, which can be easily transferred for CIL. In this work, we revisit CIL with PTMs and argue that the core factors in CIL are adaptivity for model updating and generalizability for knowledge transferring. (1) We first reveal that frozen PTM can already provide generalizable embeddings for CIL. Surprisingly, a simple baseline (SimpleCIL) which continually sets the classifiers of PTM to prototype features can beat state-of-the-art even without training on the downstream task. (2) Due to the distribution gap between pre-trained and downstream datasets, PTM can be further cultivated with adaptivity via model adaptation. We propose AdaPt and mERge (Aper), which aggregates the embeddings of PTM and adapted models for classifier construction. Aper is a general framework that can be orthogonally combined with any parameter-efficient tuning method, which holds the advantages of PTM’s generalizability and adapted model’s adaptivity. (3) Additionally, considering previous ImageNet-based benchmarks are unsuitable in the era of PTM due to data overlapping, we propose four new benchmarks for assessment, namely ImageNet-A, ObjectNet, OmniBenchmark, and VTAB. Extensive experiments validate the effectiveness of Aper with a unified and concise framework. Code is available at https://github.com/zhoudw-zdw/RevisitingCIL.

Code Availability

Code is available at https://github.com/zhoudw-zdw/RevisitingCIL.

References

  • Akbari, H., Kondratyuk, D., Cui, Y., Hornung, R., Wang, H., & Adam, H. (2023). Alternating gradient descent and mixture-of-experts for integrated multimodal perception. arXiv preprint arXiv:2305.06324.

  • Alfassy, A., Arbelle, A., Halimi, O., Harary, S., Herzig, R., Schwartz, E., Panda, R., Dolfi, M., Auer, C., Staar, P., et al. (2022). Feta: Towards specializing foundational models for expert task applications. NeurIPS, 35, 29873–29888.

  • Aljundi, R., Lin, M., Goujaud, B., & Bengio, Y. (2019). Gradient based sample selection for online continual learning. In NeurIPS, 32, 11816–11825.

  • Bahng, H., Jahanian, A., Sankaranarayanan, S., & Isola, P. (2022). Visual prompting: Modifying pixel space to adapt pre-trained models. arXiv preprint arXiv:2203.17274.

  • Barbu, A., Mayo, D., Alverio, J., Luo, W., Wang, C., Gutfreund, D., Tenenbaum, J. & Katz, B. (2019). Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. NeurIPS, 32.

  • Belouadah, E., & Popescu, A. (2019). IL2M: Class incremental learning with dual memory. In ICCV, pp. 583–592.

  • Cai, X., Lou, J., Bu, J., Dong, J., Wang, H., & Yu, H. (2024). Single depth image 3D face reconstruction via domain adaptive learning. Frontiers of Computer Science, 18(1).

  • Cermelli, F., Fontanel, D., Tavera, A., Ciccone, M., & Caputo, B. (2022). Incremental learning in semantic segmentation from image labels. In CVPR, pp. 4371–4381.

  • Chaudhry, A., Dokania, P. K., Ajanthan, T., & Torr, P. H. (2018). Riemannian walk for incremental learning: Understanding forgetting and intransigence. In ECCV, pp. 532–547.

  • Chen, S., Ge, C., Tong, Z., Wang, J., Song, Y., Wang, J., & Luo, P. (2022). Adaptformer: Adapting vision transformers for scalable visual recognition. NeurIPS, pp. 16664–16678.

  • Chen, X., Hsieh, C.-J., & Gong, B. (2022). When vision transformers outperform ResNets without pre-training or strong data augmentations. In ICLR.

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In CVPR, pp. 248–255.

  • Dhar, P., Singh, R. V., Peng, K.-C., Ziyan, W., & Chellappa, R. (2019). Learning without memorizing. In CVPR, pp. 5138–5146.

  • Ding, Z., Xie, H., Li, P., & Xin, X. (2023). A structural developmental neural network with information saturation for continual unsupervised learning. CAAI Transactions on Intelligence Technology, 8(3), 780–795.

  • Dong, J., Cong, Y., Sun, G., Ma, B., & Wang, L. (2021). I3dol: Incremental 3d object learning without catastrophic forgetting. In AAAI, pp. 6066–6074.

  • Dong, S., Hong, X., Tao, X., Chang, X., Wei, X., & Gong, Y. (2021). Few-shot class-incremental learning via relation knowledge distillation. In AAAI, pp. 1255–1263.

  • Dong, J., Liang, W., Cong, Y., & Sun, G. (2023). Heterogeneous forgetting compensation for class-incremental learning. In ICCV, pp. 11742–11751.

  • Dong, J., Wang, L., Fang, Z., Sun, G., Shichao, X., Wang, X., & Zhu, Q. (2022). Federated class-incremental learning. In CVPR, pp. 10164–10173.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR.

  • Douillard, A., Cord, M., Ollion, C., Robert, T., & Valle, E. (2020). Podnet: Pooled outputs distillation for small-tasks incremental learning. In ECCV, pp. 86–102.

  • Douillard, A., Ramé, A., Couairon, G., & Cord, M. (2022). Dytox: Transformers for continual learning with dynamic token expansion. In CVPR, pp. 9285–9295.

  • French, R. M. (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4), 128–135.

  • Gao, R., & Liu, W. (2023). Ddgr: Continual learning with deep diffusion-based generative replay. In ICML, pp. 10744–10763.

  • Gao, Q., Zhao, C., Ghanem, B., & Zhang, J. (2022). R-DFCIL: relation-guided representation learning for data-free class incremental learning. In ECCV, pp. 423–439.

  • Gao, Q., Zhao, C., Sun, Y., Xi, T., Zhang, G., Ghanem, B., & Zhang, J. (2023). A unified continual learning framework with general parameter-efficient tuning. In ICCV, pp. 11483–11493.

  • Gomes, H. M., Barddal, J. P., Enembreck, F., & Bifet, A. (2017). A survey on ensemble learning for data stream classification. CSUR, 50(2), 1–36.

  • Grossberg, S. .T. . (2012). Studies of mind and brain: Neural principles of learning, perception, development, cognition, and motor control (Vol. 70). Springer Science & Business Media.

  • Han, X., Zhang, Z., Ding, N., Yuxian, G., Liu, X., Huo, Y., Qiu, J., Yao, Y., Zhang, A., Zhang, L., et al. (2021). Pre-trained models: Past, present and future. AI Open, 2, 225–250.

  • He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In CVPR, pp. 16000–16009.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR, pp. 770–778.

  • He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., & Neubig, G. (2022). Towards a unified view of parameter-efficient transfer learning. In ICLR.

  • Hendrycks, D., Basart, S., Norman, M., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., et al. (2021). The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, pp. 8340–8349.

  • Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., & Song, D. (2021). Natural adversarial examples. In CVPR, pp. 15262–15271.

  • Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

  • Hou, S., Pan, X., Loy, C. C., Wang, Z., & Lin, D. (2019). Learning a unified classifier incrementally via rebalancing. In CVPR, pp. 831–839.

  • Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, Sylvain. (2019). Parameter-efficient transfer learning for nlp. In ICML, pp. 2790–2799.

  • Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. In ICLR.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456.

  • Iscen, A., Zhang, J., Lazebnik, S., & Schmid, C. (2020). Memory-efficient incremental learning through feature adaptation. In ECCV, pp. 699–715.

  • Jaimes, A., & Sebe, N. (2007). Multimodal human-computer interaction: A survey. Computer Vision and Image Understanding, 108(1–2), 116–134.

  • Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S. J., Hariharan, B., & Lim, S.-N. (2022). Visual prompt tuning. In ECCV, pp. 709–727.

  • Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., & Wu, Y. (2022). CoCa: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research.

  • Jiang, J., Cetin, E., & Celiktutan, O. (2021). Ib-drr-incremental learning with information-back discrete representation replay. In CVPRW, pp. 3533–3542.

  • Jie, S., & Deng, Z.-H. (2023). Fact: Factor-tuning for lightweight adaptation on vision transformer. In AAAI, pp. 1060–1068.

  • Ju, X., & Zhu, Z. (2018). Reinforced continual learning. In NeurIPS, pp. 899–908.

  • Jung, D., Han, D., Bang, J., & Song, H. (2023). Generating instance-level prompts for rehearsal-free continual learning. In ICCV, pp. 11847–11857.

  • Kang, M., Park, J., & Han, B. (2022). Class-incremental learning by knowledge distillation with adaptive feature consolidation. In CVPR, pp. 16071–16080.

  • Khattak, M. U., Rasheed, H., Maaz, M., Khan, S., & Khan, F. S. (2023). MaPLe: Multi-modal prompt learning. In CVPR, pp. 19113–19122.

  • Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 3521–3526.

  • Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images. Technical Report.

  • Kumar, A., Raghunathan, A., Jones, R., Ma, T., & Liang, P. (2022). Fine-tuning can distort pretrained features and underperform out-of-distribution. In ICLR.

  • Li, X.L., & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. In ACL, 4582–4597.

  • Li, L., Peng, J., Chen, H., Gao, C., & Yang, X. (2023). How to configure good in-context sequence for visual question answering. arXiv preprint arXiv:2312.01571.

  • Li, Z., Zhao, L., Zhang, Z., Zhang, H., Liu, D., Liu, T., & Metaxas, D.N. (2024). Steering prototypes with prompt-tuning for rehearsal-free continual learning. In WACV, 2523–2533.

  • Lian, D., Zhou, D., Feng, J., & Wang, X. (2022). Scaling & shifting your features: A new baseline for efficient model tuning. NeurIPS, 35, 109–123.

  • Li, Z., & Hoiem, D. (2017). Learning without forgetting. IEEE Transactions on pattern analysis and machine intelligence, 40(12), 2935–2947.

  • Li, L., Huang, J., Chang, L., Weng, J., Chen, J., & Li, J. (2023). Dpps: A novel dual privacy-preserving scheme for enhancing query privacy in continuous location-based services. Frontiers of Computer Science, 17(5), 175814.

  • Liu, H., Ji, R., Yongjian, W., Huang, F., & Zhang, B. (2017). Cross-modality binary code learning via fusion similarity hashing. In CVPR, pp. 7380–7388.

  • Liu, M., Roy, S., Zhong, Z., Sebe, N., & Ricci, E. (2023). Large-scale pre-trained models are surprisingly strong in incremental novel class discovery. arXiv:2303.15975

  • Liu, Y., Yuting, S., Liu, A.-A., Schiele, B., & Sun, Q. (2020). Mnemonics training: Multi-class incremental learning without forgetting. In CVPR, pp. 12245–12254.

  • Lu, Y., Twardowski, B., Liu, X., Herranz, L., Wang, K., Cheng, Y., Jui, S., & van de Weijer, J. (2020). Semantic drift compensation for class-incremental learning. In CVPR, pp. 6982–6991.

  • Masana, M., Liu, X., Twardowski, B., Menta, M., Bagdanov, A. D., & Van De Weijer, J. (2022). Class-incremental learning: Survey and performance evaluation on image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5), 5513–5533.

  • Mermillod, M., Bugaiska, A., & Bonin, P. (2013). The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in psychology, 4, 54654.

  • Mirzadeh, S. I., Farajtabar, M., Pascanu, R., & Ghasemzadeh, H. (2020). Understanding the role of training regimes in continual learning. NeurIPS, 33, 7308–7320.

  • Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. (2023). DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.

  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, pp. 8026–8037.

  • Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572.

  • Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., & Wang, B. (2019). Moment matching for multi-source domain adaptation. In ICCV, pp. 1406–1415.

  • Perez-Rua, J.-M., Zhu, X., Hospedales, T. M., & Xiang, T. (2020). Incremental few-shot object detection. In CVPR, pp. 13846–13855.

  • Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K., & Gurevych, I. (2021). Adapterfusion: Non-destructive task composition for transfer learning. In ACL, pp. 487–503.

  • Pham, Q., Liu, C., & Hoi, S. (2022). Continual normalization: Rethinking batch normalization for online continual learning. In ICLR.

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763.

  • Rebuffi, S.-A., Bilen, H., & Vedaldi, A. (2017). Learning multiple visual domains with residual adapters. NIPS, pp. 506–516.

  • Rebuffi, S.-A., Kolesnikov, A., Sperl, G., & Lampert, C. H. (2017). iCaRL: Incremental classifier and representation learning. In CVPR, pp. 2001–2010.

  • Ridnik, T., Ben-Baruch, E., Noy, A., & Zelnik-Manor, L. (2021). ImageNet-21K pretraining for the masses. arXiv preprint arXiv:2104.10972.

  • Roy, S., Liu, M., Zhong, Z., Sebe, N., & Ricci, E. (2022). Class-incremental novel class discovery. In ECCV, pp. 317–333.

  • Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pp.618–626.

  • Shin, H., Lee, J. K., Kim, J., & Kim, J. (2017). Continual learning with deep generative replay. In NIPS, pp. 2990–2999.

  • Shi, Y., Zhou, K., Liang, J., Jiang, Z., Feng, J., Torr, P. H., Bai, S., & Tan, V. Y. (2022). Mimicking the oracle: An initial phase decorrelation approach for class incremental learning. In CVPR, pp. 16722–16731.

  • Simon, C., Koniusz, P., & Harandi, M. (2021). On learning the geodesic path for incremental learning. In CVPR, pp. 1591–1600.

  • Smith, J., Hsu, Y.-C., Balloch, J., Shen, Y., Jin, H., & Kira, Z. (2021). Always be dreaming: A new approach for data-free class-incremental learning. In ICCV, pp. 9374–9384.

  • Smith, J. S., Karlinsky, L., Gutta, V., Cascante-Bonilla, P., Kim, D., Arbelle, A., Panda, R., Feris, R., & Kira, Z. (2023). CODA-Prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In CVPR, pp. 11909–11919.

  • Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical networks for few-shot learning. In NIPS, pp. 4080–4090.

  • Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., & Beyer, L. (2022). How to train your ViT? Data, augmentation, and regularization in vision transformers. Transactions on Machine Learning Research.

  • Sun, G., Liang, W., Dong, J., Li, J., Ding, Z., & Cong, Y. (2024). Create your world: Lifelong text-to-image diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Sun, H.-L., Zhou, D.-W., Ye, H.-J., & Zhan, D.-C. (2023). Pilot: A pre-trained model-based continual learning toolbox. arXiv preprint arXiv:2309.07117.

  • Tao, X., Hong, X., Chang, X., Dong, S., Wei, X., & Gong, Y. (2020). Few-shot class-incremental learning. In CVPR, pp. 12183–12192.

  • Tschannen, M., Mustafa, B., & Houlsby, N. (2022). Image-and-language understanding from pixels only. arXiv preprint arXiv:2212.08045.

  • Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).

  • Venkateswara, H., Eusebio, J., Chakraborty, S., & Panchanathan, S. (2017). Deep hashing network for unsupervised domain adaptation. In CVPR, pp. 5018–5027.

  • Villa, A., Alcázar, J. L., Alfarra, M., Alhamoud, K., Hurtado, J., Heilbron, F. C., Soto, A., & Ghanem, B. (2023). PIVOT: Prompting for video continual learning. In CVPR, pp. 24214–24223.

  • Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.

  • Wang, Y., Ma, Z., Huang, Z., Wang, Y., Su, Z., & Hong, X. (2023). Isolation and impartial aggregation: A paradigm of incremental learning without interference. In AAAI, pp. 10209–10217.

  • Wang, Z., Wang, Z., Zheng, Y., Chuang, Y.-Y., & Satoh, S. (2019). Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In CVPR, pp. 618–626.

  • Wang, Z., Zhang, Z., Ebrahimi, S., Sun, R., Zhang, H., Lee, C.-Y., Ren, X., Guolong, S., Perot, V., Dy, J., et al. (2022). Dualprompt: Complementary prompting for rehearsal-free continual learning. In ECCV, pp. 631–648.

  • Wang, Z., Zhang, Z., Lee, C.-Y., Zhang, H., Sun, R., Ren, X., Guolong, S., Perot, V., Dy, J., & Pfister, T. (2022). Learning to prompt for continual learning. In CVPR, pp. 139–149.

  • Wang, L., Zhang, X., Su, H., & Zhu, J. (2023). A comprehensive survey of continual learning: Theory, method and application. arXiv preprint arXiv:2302.00487.

  • Wang, F.-Y., Zhou, D.-W., Liu, L., Ye, H.-J., Bian, Y., Zhan, D.-C., & Zhao, P. (2023). BEEF: Bi-compatible class-incremental learning via energy-based expansion and fusion. In ICLR.

  • Wang, F.-Y., Zhou, D.-W., Ye, H.-J., & Zhan, D.-C. (2022). Foster: Feature boosting and compression for class-incremental learning. In ECCV, pp. 398–414.

  • Wang, Q.-W., Zhou, D.-W., Zhang, Y.-K., Zhan, D.-C., & Ye, H.-J. (2023). Few-shot class-incremental learning via training-free prototype calibration. NeurIPS, 36.

  • Wang, Y., Huang, Z., & Hong, X. (2022). S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental learning. NeurIPS, 35, 5682–5695.

  • Wightman, R. (2019). Pytorch image models. https://github.com/rwightman/pytorch-image-models.

  • Xinting, H., Tang, K., Miao, C., Hua, X.-S., & Zhang, H. (2021). Distilling causal effect of data in class-incremental learning. In CVPR, pp. 3957–3966.

  • Yan, S., Xie, J., & He, X. (2021). Der: Dynamically expandable representation for class incremental learning. In CVPR, pp. 3014–3023.

  • Yang, X., Yongliang, W., Yang, M., Chen, H., & Geng, X. (2024). Exploring diverse in-context configurations for image captioning. NeurIPS,36.

  • Yang, M.-H. (2002). Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1), 34–58.

  • Yoon, J., Yang, E., Lee, J., & Hwang, S. J. (2018). Lifelong learning with dynamically expandable networks. In ICLR.

  • You, K., Kou, Z., Long, M., & Wang, J. (2020). Co-tuning for transfer learning. NeurIPS, pp. 17236–17246.

  • Yuan, L., Chen, D., Chen, Y.-L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., et al. (2021). Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432.

  • Yue, W., Chen, Y., Wang, L., Ye, Y., Liu, Z., Guo, Y., & Yun, F. (2019). Large scale incremental learning. In CVPR, pp. 374–382.

  • Zaken, E. B., Ravfogel, S., & Goldberg, Y. (2022). Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In ACL, pp. 1–9.

  • Zhai, X., Puigcerver, J., Kolesnikov, A., Ruyssen, P., Riquelme, C., Lucic, M., Djolonga, J., Pinto, A., Neumann, M., Dosovitskiy, A., et al. (2019). A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867.

  • Zhang, G., Wang, L., Kang, G., Chen, L., & Wei, Y. (2023). Slca: Slow learner with classifier alignment for continual learning on a pre-trained model. In ICCV, pp. 19148–19158.

  • Zhang, Y., Yin, Z., Shao, J., & Liu, Z. (2022). Benchmarking omni-vision representation through the lens of visual realms. In ECCV, pp. 594–611. Springer

  • Zhang, J., Zhang, J., Ghosh, S., Li, D., Tasci, S., Heck, L., Zhang, H., & Kuo J. C.-C. (2020). Class-incremental learning via deep model consolidation. In WACV, pp. 1131–1140.

  • Zhang, Y., Zhou, K., & Liu, Z. (2022). Neural prompt search. arXiv preprint arXiv:2206.04673.

  • Zhang, T., Zhang, H., & Li, X. (2023). Vision-audio fusion slam in dynamic environments. CAAI Transactions on Intelligence Technology, 8(4), 1364–1373.

  • Zhao, H., Qin, X., Shihao, S., Yongjian, F., Lin, Z., & Li, X. (2021). When video classification meets incremental classes. In ACM MM, pp. 880–889.

  • Zhao, B., Xiao, X., Gan, G., Zhang, B., & Xia, S.-T. (2020). Maintaining discrimination and fairness in class incremental learning. In CVPR, pp. 13208–13217.

  • Zhao, H., Wang, H., Yongjian, F., Fei, W., & Li, X. (2021). Memory-efficient class-incremental learning for image classification. IEEE Transactions on Neural Networks and Learning Systems, 33(10), 5966–5977.

  • Zhong, Z., Fini, E., Roy, S., Luo, Z., Ricci, E., & Sebe, N. (2021). Neighborhood contrastive learning for novel class discovery. In CVPR, pp. 10867–10875.

  • Zhou, D.-W., Wang, Q.-W., Qi, Z.-H., Ye, H.-J., Zhan, D.-C., & Liu, Z. (2023). Class-incremental learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Zhou, D.-W., Sun, H.-L., Ning, J., Ye, H.-J., & Zhan, D.-C. (2024). Continual learning with pre-trained models: A survey. In IJCAI.

  • Zhou, D. W., Wang, Q. W., Ye, H. J., & Zhan, D. C. (2023). A model or 603 exemplars: Towards memory-efficient class-incremental learning. In ICLR.

  • Zhou, D.-W., Wang, F.-Y., Ye, H.-J., Ma, L., Shiliang, P., & Zhan, D.-C. (2022). Forward compatible few-shot class-incremental learning. In CVPR, pp. 9046–9056.

  • Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Conditional prompt learning for vision-language models. In CVPR, pp. 16816–16825.

  • Zhou, K., Liu, Z., Qiao, Y., Xiang, T., & Loy, C. C. (2022). Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4), 4396–4415.

  • Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2337–2348.

  • Zhu, F., Zhang, X.-Y., Wang, C., Yin, F., & Liu, C.-L. (2021). Prototype augmentation and self-supervision for incremental learning. In CVPR, pp. 5871–5880.

Download references

Funding

This work is partially supported by National Science and Technology Major Project (2022ZD0114805), Fundamental Research Funds for the Central Universities (2024300373), NSFC (62376118, 62006112, 62250069, 61921006), Collaborative Innovation Center of Novel Software Technology and Industrialization, China Scholarship Council, Ministry of Education, Singapore, under its MOE AcRF Tier 2 (MOET2EP20221-0012), NTU NAP, and under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative.

Author information

Corresponding authors

Correspondence to Han-Jia Ye or Ziwei Liu.

Additional information

Communicated by ZHUN ZHONG.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

More Results

Table 4 Experiments on ImageNet100

1.1 Experiments on ImageNet

Traditional class-incremental learning mainly considers three benchmark datasets, i.e., CIFAR100 (Krizhevsky et al., 2009), ImageNet100, and ImageNet1000 (Deng et al., 2009). Among them, ImageNet100 is a subset of ImageNet1000 containing 100 classes. However, current Vision Transformers are widely pre-trained on the ImageNet21K dataset (Steiner et al., 2022), a superset of ImageNet1000. Consequently, incremental learning on its subsets becomes less meaningful, since the pre-trained model has already seen these classes.

To verify this claim, we experiment with the ImageNet100 B0 Inc10 setting and report the results in Table 4. We draw two major conclusions from these results. First, all methods achieve competitive performance of around 90%, indicating that ImageNet100 is quite easy for the pre-trained model to tackle. Second, SimpleCIL achieves an accuracy of 92%, indicating that the pre-trained model can handle the ImageNet100 dataset even without tuning. These experiments verify that ImageNet and its subsets are unsuitable for incremental evaluation due to the overlap between pre-training data and downstream data.
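
For concreteness, the sketch below illustrates a SimpleCIL-style prototype classifier: class prototypes are averaged from frozen PTM embeddings and used as classifier weights, with cosine matching at inference. This is a minimal illustration under common assumptions, not the released implementation; `encoder` and `loader` are placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_prototype_classifier(encoder, loader, num_classes, feat_dim, device="cuda"):
    """SimpleCIL-style classifier: each class weight is the mean (prototype)
    of the frozen pre-trained features of that class."""
    encoder.eval().to(device)
    prototypes = torch.zeros(num_classes, feat_dim, device=device)
    counts = torch.zeros(num_classes, device=device)
    for images, labels in loader:            # loader yields (image batch, integer labels)
        feats = encoder(images.to(device))   # [B, feat_dim] embeddings from the frozen PTM
        labels = labels.to(device)
        prototypes.index_add_(0, labels, feats)
        counts.index_add_(0, labels, torch.ones(len(labels), device=device))
    return prototypes / counts.clamp(min=1).unsqueeze(1)  # [num_classes, feat_dim]

@torch.no_grad()
def predict(encoder, prototypes, images):
    """Cosine-similarity matching between query embeddings and class prototypes."""
    feats = F.normalize(encoder(images), dim=1)
    protos = F.normalize(prototypes, dim=1)
    return (feats @ protos.t()).argmax(dim=1)
```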

Table 5 Number of trainable parameters in different methods
Table 6 Experiments on the generalizability-adaptivity trade-off

1.2 Trainable Parameters

In this section, we further report the number of learnable parameters of various parameter-efficient tuning methods in Table 5. As the table shows, Aper and its variations require only a small number of trainable parameters. Moreover, this cost is incurred only once, since the model is finetuned with parameter-efficient tuning modules only in the first incremental stage. By contrast, L2P and DualPrompt require more parameters when facing longer learning sequences, since a small prompt pool may not be diverse enough to capture multiple tasks.
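
As a rough illustration of how such costs can be measured, the helper below counts trainable versus total parameters after the backbone has been frozen and a parameter-efficient module inserted. It is a generic sketch, not tied to any particular method in Table 5.

```python
import torch.nn as nn

def count_parameters(model: nn.Module):
    """Report trainable vs. total parameters, e.g., after freezing the PTM
    backbone and inserting a parameter-efficient tuning module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable / 1e6:.2f}M / total: {total / 1e6:.2f}M "
          f"({100 * trainable / total:.2f}%)")
    return trainable, total
```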

1.3 Module Trade-Off

In the main paper, we mainly consider the exemplar-free scenario, where the model cannot save any exemplars. In this section, we instead treat the pre-trained model and the adapted model as two separate modules and ensemble them with the help of exemplars. We denote the output of the pre-trained model as \(f(\textbf{x})=W^\top \phi (\textbf{x})\) and the output of the adapted model as \(f^*(\textbf{x})=W^{*\top } \phi ^*(\textbf{x})\). The linear classifiers W and \(W^*\) are obtained via prototype extraction. In this way, \(f(\textbf{x})\) corresponds to generalizability, while \(f^*(\textbf{x})\) stands for adaptivity. Instead of concatenating the features, we consider a weighted ensemble of these modules:

$$\begin{aligned} \alpha f(\textbf{x})+ f^*(\textbf{x}) \,, \end{aligned}$$
(10)

where \(\alpha \) is a scalar trade-off parameter that adjusts the importance of the two modules. To obtain \(\alpha \) in a data-driven way, we use auxiliary information from the exemplar set, which contains a small portion of previous instances. We freeze \(f(\textbf{x})\) and \(f^*(\textbf{x})\) and optimize \(\alpha \) on the exemplar set. This yields a suitable trade-off parameter that captures the capabilities of the pre-trained and adapted models. In the implementation, we find that saving a small portion of exemplars is enough for this learning process, e.g., 5 per class. We denote inference with Eq. 10 as Aper†, and report the results in Table 6. As the table shows, the weighted ensemble strikes a balance between adaptivity and generalizability, which further improves the performance of Aper by 3\(\sim \)5%. However, it must be noted that Aper† requires extra exemplars for the parameter optimization. Without such extra exemplars, the average ensemble (i.e., \(\alpha =1\)) even shows inferior performance to the vanilla Aper on VTAB.
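
A minimal sketch of this procedure is given below, assuming both branches already expose logit functions and that a small exemplar loader is available (both names are placeholders); only the scalar \(\alpha \) receives gradients, while the two modules stay frozen.

```python
import torch
import torch.nn.functional as F

def learn_alpha(ptm_logits_fn, adapted_logits_fn, exemplar_loader,
                steps=100, lr=0.01, device="cuda"):
    """Optimize the scalar trade-off alpha in  alpha * f(x) + f*(x)  (Eq. 10)
    on the exemplar set, keeping both frozen modules fixed."""
    alpha = torch.ones(1, device=device, requires_grad=True)
    optimizer = torch.optim.SGD([alpha], lr=lr)
    for _ in range(steps):
        for images, labels in exemplar_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():                      # both modules stay frozen
                f = ptm_logits_fn(images)              # generalizable branch f(x)
                f_star = adapted_logits_fn(images)     # adapted branch f*(x)
            loss = F.cross_entropy(alpha * f + f_star, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return alpha.detach()
```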

1.4 Large Base Classes

In this section, we report the detailed performance of different methods with large base classes in Table 7. As we can infer from the table, Aper consistently achieves the best performance in the current setting.

Table 7 Experiments with vast base classes on seven datasets with ViT-B/16-IN21K as the backbone
Table 8 Experiments on pre-trained CLIP under CIFAR100 Base0 Inc10 setting. Aper can also be combined with CoOp to improve CLIP’s incremental learning ability
Table 9 Experiments on pre-trained CLIP under ImageNet-R Base0 Inc20 setting. Aper can also be combined with CoOp to improve CLIP’s incremental learning ability

1.5 Influence of Hyper-Parameters

In this section, we explore the influence of hyper-parameters in Aper. Since Aper is optimized with parameter-efficient tuning, the only hyper-parameters come from these tuning methods, i.e., the prompt length p in VPT and the projection dimension r in the adapter. We train the model on CIFAR100 B50 Inc5 with pre-trained ViT-B/16-IN1K.

First, we investigate the influence of the prompt length in Fig. 8. The figure shows that Aper's performance is robust to changes in the prompt length, so we use \(p=5\) as the default for all settings. We observe a similar phenomenon in Fig. 9, where the performance is robust to changes in the projection dimension. As a result, we set \(r=16\) as the default for all settings.
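
For reference, minimal sketches of the two modules whose hyper-parameters are varied here are shown below: a bottleneck adapter with projection dimension r, and VPT-style shallow prompts of length p prepended to the token sequence. These follow the common designs of the cited tuning methods and are illustrative only, not the exact Aper implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project to r dims, nonlinearity, up-project, residual."""
    def __init__(self, dim=768, r=16):
        super().__init__()
        self.down = nn.Linear(dim, r)
        self.act = nn.ReLU()
        self.up = nn.Linear(r, dim)

    def forward(self, x):                               # x: [B, N, dim] token features
        return x + self.up(self.act(self.down(x)))      # residual connection

class ShallowPrompt(nn.Module):
    """VPT-style shallow prompting: prepend p learnable tokens to the patch tokens."""
    def __init__(self, dim=768, p=5):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(1, p, dim) * 0.02)

    def forward(self, tokens):                          # tokens: [B, N, dim]
        return torch.cat([self.prompt.expand(tokens.size(0), -1, -1), tokens], dim=1)
```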

1.6 Combining with CLIP

Apart from the standard Vision Transformer backbone adopted in the main paper, in this section we also combine Aper with CLIP (Radford et al., 2021) to show its compatibility and its ability to enhance CLIP's incremental learning.

We treat CoOp (Zhou et al., 2022d) as the baseline and combine Aper w/ CoOp under the CIFAR100 B0 Inc10 and ImageNet-R B0 Inc20 settings. Specifically, the CLIP model possesses two encoders, i.e., a visual encoder and a textual encoder, and predicts the class whose textual embedding has the maximum similarity to the visual embedding. CoOp, the most popular tuning technique for CLIP, finetunes a set of prompts while freezing the other parameters to adapt the model to downstream tasks. We learn the prompts of CoOp only in the first stage and keep them fixed for inference on the following incremental tasks. We implement these methods using CLIP with OpenAI weights (Radford et al., 2021) as initialization. For Aper w/ CoOp, we use CoOp to finetune the model in the first incremental stage and adopt Eq. 10 for inference.
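
The sketch below illustrates the assumed inference procedure: CLIP-style logits are cosine similarities between image and class-text embeddings, and Eq. 10 combines the frozen CLIP branch with the CoOp-tuned branch. The feature tensors are assumed to be pre-computed placeholders; this is not the official CLIP or CoOp API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def clip_ensemble_logits(image_feats, frozen_text_feats, tuned_text_feats,
                         alpha=1.0, temperature=100.0):
    """CLIP-style logits are cosine similarities between image and class-text
    embeddings; Eq. 10 combines the frozen branch f(x) with the CoOp-tuned
    branch f*(x). Inputs are pre-computed feature tensors of shape [B, d] / [C, d]."""
    img = F.normalize(image_feats, dim=-1)
    f = temperature * img @ F.normalize(frozen_text_feats, dim=-1).t()     # frozen CLIP branch
    f_star = temperature * img @ F.normalize(tuned_text_feats, dim=-1).t() # CoOp-tuned branch
    return alpha * f + f_star                                              # weighted ensemble (Eq. 10)
```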

Fig. 8

Influence of prompt length. The performance of Aper is robust to changes in the prompt length, and we set \(p=5\) as the default setting

Fig. 9

Influence of projected dimension in the adapter. The performance of Aper is robust to changes in the projected dimension, and we set \(r=16\) as the default setting

As shown in Tables 8 and 9, CoOp shows poor performance in the incremental learning process when the prompts are tuned only in the first stage. However, when combined with Aper, the incremental learning performance is further improved, indicating that Aper can handle various backbones and improve their incremental learning ability.

1.7 ImageNet-R B0 Inc20

Table 10 Experiments on ImageNet-R Base0 Inc20

In this section, we compare Aper to CPP under the ImageNet-R Base0 Inc20 setting. As shown in Table 10, Aper still shows substantial improvement over CPP (3% in last accuracy and 7% in average accuracy), indicating its strong performance against the state of the art.

1.8 Ablations on ImageNet-A

In the main paper, we show the improvement of the best Aper variation over SimpleCIL with different backbones on ImageNet-R. In this section, we report the results on ImageNet-A in Fig. 10. As shown in the figure, Aper still consistently improves the performance of SimpleCIL on this new benchmark. We also find that the improvement for large models like ViT-L is also 6%, indicating that Aper yields stable improvements for larger models as well.

Fig. 10

CIL with different kinds of PTMs on ImageNet-A Base0 Inc10. Aper consistently improves the performance of different PTMs. We report the best variation of Aper and its relative improvement over SimpleCIL in the figure

1.9 ObjectNet300

In the main paper, we sample 200 out of the 313 classes in ObjectNet for class-incremental learning. In this section, we also sample 300 classes from it, denoted as ObjectNet300. We conduct experiments against other state-of-the-art methods and report the results in Table 11. The results indicate that Aper still performs better than its competitors in this larger-scale setting.

Table 11 Average and last performance comparison on ObjectNet300 with ViT-B/16-IN21K as the backbone

1.10 Grad-CAM

In this section, we show more Grad-CAM results. As shown in Fig. 11, Aper concentrates more on task-specific features than the vanilla PTM, confirming the effectiveness of model adaptation in capturing task-specific features.
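
For completeness, a minimal hook-based Grad-CAM sketch (Selvaraju et al., 2017) is given below for a convolutional feature map. Applying it to a ViT backbone additionally requires reshaping the patch tokens into a spatial grid, which is omitted here; `model`, `target_layer`, and `image` are placeholders.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx=None):
    """Minimal Grad-CAM: weight the target layer's activations by the
    spatially averaged gradient of the class score, then apply ReLU."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    logits = model(image)                                   # image: [1, 3, H, W]
    score = logits[0, class_idx if class_idx is not None else logits.argmax()]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)     # GAP of gradients over spatial dims
    cam = F.relu((weights * acts["a"]).sum(dim=1))          # [1, h, w] saliency map
    return cam / cam.max().clamp(min=1e-8)                  # normalize to [0, 1]
```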

Fig. 11

Additional Grad-CAM results. Important regions are highlighted with warm colors. Compared to the PTM, Aper concentrates more on task-specific features (Color figure online)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhou, DW., Cai, ZW., Ye, HJ. et al. Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need. Int J Comput Vis 133, 1012–1032 (2025). https://doi.org/10.1007/s11263-024-02218-0
