Abstract
Creating a multi-task model by merging models for distinct tasks has proven to be an economical and scalable approach. Recent research, such as task arithmetic, demonstrates that a static solution for multi-task model fusion can be located within the vector space spanned by task vectors. However, the static nature of these methods limits their ability to adapt to the intricacies of individual instances, thereby hindering their performance in complex scenarios. To overcome this limitation, we propose a data-adaptive weight-ensembling approach that generates model weights on the fly. Specifically, we first feed the input samples into a hypernetwork to generate instance-specific weights for the primary model. Subsequently, we perform a functional call on the primary large model with the instance-specific weights. By generating model weights on the fly, the unified model gains increased flexibility and can resolve potential weight conflicts between tasks. Building upon this adaptability, our method requires only the model checkpoints and unlabeled test samples, which are used for test-time adaptation training. We conduct extensive experiments primarily on vision Transformers and Flan-T5 models, demonstrating superior performance and satisfactory zero-shot transferability.
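To make this mechanism concrete, the following is a minimal PyTorch sketch of the data-adaptive idea: a toy hypernetwork maps an input embedding to a flat weight vector, which is reshaped and passed to torch.func.functional_call to run the primary model with instance-specific weights. The module names, sizes, and architectures are illustrative assumptions rather than the models used in this work.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

# Toy stand-ins: sizes and architectures are illustrative assumptions,
# not the models used in the paper.
primary = nn.Linear(512, 10)                            # stands in for the primary (large) model
n_params = sum(p.numel() for p in primary.parameters())
hypernet = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, n_params))

def adaptive_forward(x: torch.Tensor) -> torch.Tensor:
    """Run the primary model with weights generated for this particular instance."""
    flat = hypernet(x)                                  # instance-specific flat weight vector
    params, offset = {}, 0
    for name, p in primary.named_parameters():          # unflatten into the primary model's shapes
        params[name] = flat[offset:offset + p.numel()].view_as(p)
        offset += p.numel()
    return functional_call(primary, params, (x.unsqueeze(0),))  # stateless forward pass

logits = adaptive_forward(torch.randn(512))             # -> shape (1, 10)
```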
References
Ainsworth, S. K., Hayase, J., & Srinivasa, S. (2023). Git Re-Basin: Merging Models modulo Permutation Symmetries (No. arXiv:2209.04836). arXiv.
Alaluf, Y., Tov, O., Mokady, R., Gal, R., & Bermano, A. (2022). HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing. 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 18490–18500). New Orleans, LA, USA: IEEE.
Benton, G. W., Maddox, W. J., Lotfi, S., & Wilson, A. G. (2021). Loss Surface Simplexes for Mode Connecting Volumes and Fast Ensembling (No. arXiv:2102.13042). arXiv.
Chauhan, V. K., Zhou, J., Lu, P., Molaei, S., & Clifton, D. A. (2024). A brief review of hypernetworks in deep learning. Artificial Intelligence Review, 57(9), 250. https://doi.org/10.1007/s10462-024-10862-8
Cheng, G., Han, J., & Lu, X. (2017). Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proceedings of the IEEE, 105(10), 1865–1883. https://doi.org/10.1109/JPROC.2017.2675998
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., & Wei, J. (2022). Scaling Instruction-Finetuned Language Models (No. arXiv:2210.11416). arXiv.
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing Textures in the Wild. 2014 IEEE conference on computer vision and pattern recognition (pp. 3606–3613). Columbus, OH, USA: IEEE.
Freeman, C. D., & Bruna, J. (2017). Topology and geometry of half-rectified network optimization. 5th international conference on learning representations (ICLR 2017).
Do, T., Khiem, L., Pham, Q., Nguyen, T., Doan, T.- N., Nguyen, B., & Hoi, S. (2023). HyperRouter: Towards efficient training and inference of sparse mixture of experts. H. Bouamor, J. Pino, and K. Bali (Eds.), Proceedings of the 2023 conference on empirical methods in natural language processing (pp. 5754–5765). Singapore: Association for Computational Linguistics.
Draxler, F., Veschgini, K., Salmhofer, M., & Hamprecht, F.A. (2019). Essentially No Barriers in Neural Network Energy Landscape (No. arXiv:1803.00885). arXiv.
Entezari, R., Sedghi, H., Saukh, O., & Neyshabur, B. (2022). The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks (No. arXiv:2110.06296). arXiv.
Frankle, J., Dziugaite, G. K., Roy, D. M., & Carbin, M. (2020). Linear Mode Connectivity and the Lottery Ticket Hypothesis (No. arXiv:1912.05671). arXiv.
Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D., & Wilson, A. G. (2018). Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs (No. arXiv:1802.10026). arXiv.
Gulati, A., Qin, J., Chiu, C.- C., Parmar, N., Zhang, Y., Yu, J., & Pang, R. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition (No. arXiv:2005.08100). arXiv.
Ha, D., Dai, A. M., & Le, Q. V. (2022). Hypernetworks. International conference on learning representations.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked Autoencoders Are Scalable Vision Learners (No. arXiv:2111.06377). arXiv.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). https://doi.org/10.1109/CVPR.2016.90
Helber, P., Bischke, B., Dengel, A., & Borth, D. (2018). Introducing eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. Igarss 2018-2018 IEEE international geoscience and remote sensing symposium (pp. 204–207).
Hendrycks, D., & Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations (No. arXiv:1903.12261). arXiv.
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models (No. arXiv:2106.09685). arXiv.
Huang, C., Liu, Q., Lin, B.Y., Pang, T., Du, C., & Lin, M. (2023). LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition (No. arXiv:2307.13269). arXiv.
Ilharco, G., Ribeiro, M.T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., & Farhadi, A. (2023). Editing models with task arithmetic (No. arXiv:2212.04089). arXiv.
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A.G. (2019). Averaging weights leads to wider optima and better generalization (No. arXiv:1803.05407). arXiv.
Jin, X., Ren, X., Preotiuc-Pietro, D., & Cheng, P. (2023). Dataless knowledge fusion by merging weights of language models (No. arXiv:2212.09849). arXiv.
Kaddour, J. (2022). Stop wasting my time! saving days of imagenet and BERT training with latest weight averaging (No. arXiv:2209.14981). arXiv.
Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3D Object Representations for Fine-Grained Categorization. 2013 IEEE international conference on computer vision workshops (pp. 554–561).
Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. https://doi.org/10.1109/5.726791
Li, W., Peng, Y., Zhang, M., Ding, L., Hu, H., & Shen, L. (2023). Deep Model Fusion: A Survey (No. arXiv:2309.15698). arXiv.
Li, Y., Yosinski, J., Clune, J., Lipson, H., & Hopcroft, J. (2016). Convergent Learning: Do different neural networks learn the same representations? (No. arXiv:1511.07543). arXiv.
Liang, J., He, R., & Tan, T. (2023). A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts (No. arXiv:2303.15361). arXiv.
Liu, C., Lou, C., Wang, R., Xi, A.Y., Shen, L., & Yan, J. (2022). Deep Neural Network Fusion via Graph Matching with Applications to Model Ensemble and Federated Learning. Proceedings of the 39th International conference on machine learning (pp. 13857–13869). PMLR.
Lu, Z., Fan, C., Wei, W., Qu, X., Chen, D., & Cheng, Y. (2024). Twin-merging: dynamic integration of modular expertise in Model Merging. arXiv.
Matena, M., & Raffel, C. (2022). Merging Models with Fisher-Weighted Averaging (No. arXiv:2111.09832). arXiv.
Mounsaveng, S., Chiaroni, F., Boudiaf, M., Pedersoli, M., & Ayed, I.B. (2023). Bag of Tricks for Fully Test-Time Adaptation (No. arXiv:2310.02416). arXiv.
Nagarajan, V., & Kolter, J.Z. (2019). Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems (Vol. 32). Curran Associates, Inc.
Navon, A., Shamsian, A., Chechik, G., & Fetaya, E. (2021). Learning the Pareto Front with Hypernetworks (No. arXiv:2010.04104). arXiv.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A.Y. (2021). Reading Digits in Natural Images with Unsupervised Feature Learning.
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision (No. arXiv:2103.00020). arXiv.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Shamsian, A., Navon, A., Fetaya, E., & Chechik, G. (2021). Personalized Federated Learning using Hypernetworks (No. arXiv:2103.04628). arXiv.
Shen, L., Tang, A., Yang, E., Guo, G., Luo, Y., Zhang, L., & Tao, D. (2024). Efficient and effective weight-ensembling mixture of experts for multi-task model merging. arXiv preprint arXiv:2410.21804.
Stallkamp, J., Schlipsing, M., Salmen, J., & Igel, C. (2012). Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32, 323–332. https://doi.org/10.1016/j.neunet.2012.02.016
Stoica, G., Bolya, D., Bjorner, J., Hearn, T., & Hoffman, J. (2023). ZipIt! Merging Models from Different Tasks without Training (No. arXiv:2305.03053). arXiv.
Sun, Z., Ozay, M., & Okatani, T. (2017). HyperNetworks with statistical filtering for defending adversarial examples (No. arXiv:1711.01791). arXiv.
Tam, D., Bansal, M., & Raffel, C. (2023). Merging by Matching Models in Task Subspaces (No. arXiv:2312.04339). arXiv.
Tang, A., Shen, L., Luo, Y., Liu, S., Hu, H., & Du, B. (2024). Towards efficient pareto set approximation via mixture of experts based model fusion. arXiv preprint arXiv:2406.09770.
Tang, A., Shen, L., Luo, Y., Xie, S., Hu, H., Zhang, L., & Tao, D. (2024). SMILE: Zero-shot sparse mixture of low-rank experts construction from pre-trained foundation models. arXiv.
Tatro, N., Chen, P.- Y., Das, P., Melnyk, I., Sattigeri, P., & Lai, R. (2020). Optimizing Mode Connectivity via Neuron Alignment. Advances in Neural Information Processing Systems (Vol. 33, pp. 15300–15311). Curran Associates, Inc.
von Oswald, J., Henning, C., Grewe, B.F., & Sacramento, J. (2022). Continual learning with hypernetworks (No. arXiv:1906.00695). arXiv.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (pp. 353–355). Brussels, Belgium: Association for Computational Linguistics.
Wortsman, M., Ilharco, G., Gadre, S.Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A.S., & Schmidt, L. (2022). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time (No. arXiv:2203.05482). arXiv.
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. 2010 IEEE computer society conference on computer vision and pattern recognition (pp. 3485–3492). San Francisco, CA, USA: IEEE.
Yadav, P., Raffel, C., Muqeeth, M., Caccia, L., Liu, H., Chen, T., & Sordoni, A. (2024). A Survey on Model MoErging: recycling and routing among specialized experts for collaborative learning. arXiv.
Yadav, P., Tam, D., Choshen, L., Raffel, C., & Bansal, M. (2023). Resolving Interference When Merging Models (No. arXiv:2306.01708). arXiv.
Yang, E., Shen, L., Guo, G., Wang, X., Cao, X., Zhang, J., & Tao, D. (2024). Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities. arXiv preprint arXiv:2408.07666.
Yang, G., Simon, J.B., & Bernstein, J. (2023). A Spectral Condition for Feature Learning (No. arXiv:2310.17813). arXiv.
Yang, E., Wang, Z., Shen, L., Liu, S., Guo, G., Wang, X., & Tao, D. (2023). AdaMerging: Adaptive Model Merging for Multi-Task Learning (No. arXiv:2310.02575). arXiv.
Yu, L., Yu, B., Yu, H., Huang, F., & Li, Y. (2023). Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (No. arXiv:2311.03099). arXiv.
Yunis, D., Patel, K.K., Savarese, P., Vardi, G., Livescu, K., Walter, M., & Maire, M. (2022). On Convexity and Linear Mode Connectivity in Neural Networks. OPT2022: 14th annual workshop on optimization for machine learning
Zheng, H., Shen, L., Tang, A., Luo, Y., Hu, H., Du, B., & Tao, D. (2023). Learn From Model Beyond Fine-Tuning: A Survey (No. arXiv:2310.08184). arXiv.
Acknowledgements
This work is supported in part by STI 2030-Major Projects (Grant No. 2021ZD0201405), the National Natural Science Foundation of China (Grant No. 62276195, 62225113, U23A20318 and U2336211), the Science and Technology Major Project of Hubei Province (Grant No. 2024BAB046), the Innovative Research Group Project of Hubei Province (Grant No. 2024AFA017), and the Shenzhen Basic Research Project (Natural Science Foundation) Basic Research Key Project (Grant No. JCYJ20241202124430041). The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of Wuhan University. Dr Tao's research is partially supported by NTU RSR and Start Up Grants.
Appendices
Model Fine-Tuning Details
The experiments in our study were conducted on a consistent hardware setup, utilizing eight NVIDIA RTX 3090 GPUs, each equipped with 24 GB of video memory. We implemented our experiments with PyTorch 2.1 and Python 3.10. The source code, model checkpoints, and logs from our experiments are available at https://github.com/tanganke/dict_fusion.
1.1 Experiment Setup
Our experimental scope encompasses a diverse range of tasks spanning both the vision and natural language domains. The model selection, evaluation datasets, and metric configurations are detailed as follows:
Vision tasks: For vision tasks, we leverage the pre-trained models from CLIP (Radford et al. 2021). The downstream tasks encompass a variety of challenges, including SUN397 (Xiao et al. 2010), Stanford Cars (Krause et al. 2013), RESISC45 (Cheng et al. 2017), EuroSAT (Helber et al. 2018), SVHN (Netzer et al. 2021), GTSRB (Stallkamp et al. 2012), MNIST (Lecun et al. 1998), and DTD (Cimpoi et al. 2014). We report top-1 accuracy as the performance metric.
NLP tasks: In the realm of natural language processing (NLP) tasks, our pre-trained language model is Flan-T5 (Chung et al. 2022). For fine-tuning, we deploy the Flan-T5 models on eight tasks derived from the GLUE benchmark (Wang et al. 2018): CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST2, and STSB. For consistency and reproducibility, we initialize the parameter-efficient models using an identical random seed of 42. The Flan-T5 models undergo LoRA fine-tuning, utilizing hyperparameters \(r=16\) and \(\alpha =32\) as specified by Hu et al. (2021). In this process, we set a constant learning rate of \(4e-5\) and maintain a uniform batch size of 16 throughout all tasks. We fine-tune the Flan-T5 models for 2000 steps on each downstream task.
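For reference, the LoRA configuration above can be expressed with the Hugging Face peft library. The snippet below is a sketch under stated assumptions (the google/flan-t5-base checkpoint and query/value projections as target modules) and omits the training loop.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

# Sketch of the LoRA setup described above; target modules are an assumption
# for T5-style attention projections and may differ from our actual configuration.
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q", "v"],   # T5 attention query/value projections (assumed)
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
# Training loop (omitted): constant learning rate 4e-5, batch size 16, 2000 steps per task.
```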
Flan-T5 models are encoder-decoder Transformer models that operate within a text-to-text framework. Consequently, we restructure the original inputs to adhere to this text-to-text paradigm. We report exact-match accuracy for all tasks except STSB, for which we report Spearman's \(\rho \). The prompt templates for each task are available in our code repository.
Figure 5 presents detailed performance metrics for the CLIP-ViT-B/32 and CLIP-ViT-L/14 models. These representations dissect the impact that pre-training and subsequent fine-tuning have on the models' task execution, spanning a variety of downstream tasks and offering a comprehensive view of the models' capabilities. Similarly, for the natural language processing models, Fig. 5 provides insight into the individual performance changes of the Flan-T5-base and Flan-T5-large models after LoRA fine-tuning. As evidenced by the highlighted diagonal cells, the fine-tuning process significantly improves each model's ability on its target task, in both the vision and language domains.
Fig. 5 Individual performance of fine-tuned models. Here, we provide a visual comparison of the performance of various fine-tuned models. The upper figure contrasts the performance of CLIP-ViT-B/32 and CLIP-ViT-L/14 models across image classification tasks, while the lower figure compares the performance of LoRA fine-tuned Flan-T5-base and Flan-T5-large models on tasks from the GLUE benchmark.
Multi-task Model Fusion Methods
This section provides a brief introduction to the baseline methods compared in our experiments, including weight averaging, Task Arithmetic, Ties-Merging, and AdaMerging. Table 1 summarizes the requirements of the different model merging methods.
Simple weight averaging: Weight averaging is the most straightforward method, wherein the weights of models fine-tuned on different tasks are averaged; it is commonly known in the literature as ModelSoups (Wortsman et al. 2022). Subsequently, the averaged model undergoes evaluation on the validation set of each task. In the case of fully fine-tuned models, the weights are directly averaged, denoted as \(\theta = \frac{1}{n}\sum _{i=1}^{n} \theta _i\), where \(\theta _i\) denotes the weights fine-tuned on the i-th task and n is the number of tasks.
When the models are LoRA fine-tuned, we average the weights of the merged models, expressed as \(\theta = \frac{1}{n}\sum _{i=1}^{n} \left( \theta _0 + B_i A_i \right) = \theta _0 + \frac{1}{n}\sum _{i=1}^{n} B_i A_i\), where \(\theta _0\) denotes the pre-trained weights and \(B_i A_i\) is the LoRA update learned on the i-th task.
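For illustration, simple weight averaging in the fully fine-tuned case amounts to averaging the checkpoints' state dictionaries key by key. The sketch below assumes that all checkpoints share the same architecture and parameter names; the checkpoint paths are placeholders.

```python
import torch

def average_state_dicts(state_dicts):
    """Simple weight averaging: theta = (1/n) * sum_i theta_i, computed key by key."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Usage sketch (checkpoint_paths is a placeholder list of fine-tuned checkpoints):
# state_dicts = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
# merged_model.load_state_dict(average_state_dicts(state_dicts))
```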
Task Arithmetic (Ilharco et al. 2023): Task Arithmetic involves the computation of a task vector for each individual downstream task, followed by the summation of these task vectors to construct a multi-task vector. This multi-task vector is then element-wise multiplied by a scaling coefficient denoted as \(\lambda \) and added to the initial parameters of the pre-trained model. In mathematical terms, for full fine-tuning, this process is represented as \(\theta = \theta _0 + \lambda \sum _{i=1}^{n} \tau _i\), where \(\tau _i = \theta _i - \theta _0\) is the task vector of the i-th task.
For LoRA fine-tuning, it is expressed as \(\theta = \theta _0 + \lambda \sum _{i=1}^{n} B_i A_i\),
where \(\lambda \) serves as a hyperparameter determined by selecting the best-performing model on the validation set.
In the context of our study, we set \(\lambda \) to a value of 0.3, a choice made after evaluating the model’s performance on the validation set.
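A compact sketch of this procedure on PyTorch state dictionaries is given below; it assumes the pre-trained and fine-tuned checkpoints share parameter names and simply copies non-floating-point buffers unchanged.

```python
import torch

def task_arithmetic(pretrained_sd, finetuned_sds, lam=0.3):
    """Task Arithmetic: theta = theta_0 + lam * sum_i (theta_i - theta_0)."""
    merged = {}
    for key, theta_0 in pretrained_sd.items():
        if not torch.is_floating_point(theta_0):
            merged[key] = theta_0                      # leave integer buffers untouched
            continue
        multi_task_vector = sum(sd[key] - theta_0 for sd in finetuned_sds)
        merged[key] = theta_0 + lam * multi_task_vector
    return merged
```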
Ties-Merging (Yadav et al. 2023): Similar to Task Arithmetic, Ties-Merging multiplies a unified task vector element-wise by a scaling coefficient denoted as \(\lambda \) and adds it to the initial parameters of the pre-trained model.
However, before doing so, Ties-Merging performs three key steps, namely trimming, electing the sign of parameters, and disjoint merging, ultimately resulting in a merged task vector denoted as \(\tau \). In other words, the final model parameters are given by \(\theta = \theta _0 + \lambda \tau \), where \(\lambda \) represents a hyperparameter chosen based on the best-performing model identified through validation set evaluation.
For our specific study, we set the hyperparameter \(\lambda \) to a value of 0.3, a decision made after thorough assessment of model performance on the validation set.
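The sketch below illustrates the three steps on flattened task vectors. The 20% trimming ratio and the handling of ties are our simplifications and may differ in detail from the reference implementation of Yadav et al. (2023); the result is the merged task vector \(\tau \) to be scaled by \(\lambda \) and added to \(\theta _0\).

```python
import torch

def ties_merge(task_vectors, k=0.2):
    """Simplified Ties-Merging over a list of flattened (1-D) task vectors."""
    trimmed = []
    for tv in task_vectors:
        # 1) Trim: keep the top-k fraction of entries by magnitude, zero the rest.
        n_keep = max(1, int(k * tv.numel()))
        threshold = tv.abs().kthvalue(tv.numel() - n_keep + 1).values
        trimmed.append(torch.where(tv.abs() >= threshold, tv, torch.zeros_like(tv)))
    stacked = torch.stack(trimmed)                      # (num_tasks, dim)
    # 2) Elect sign: per coordinate, the sign carrying the larger total magnitude.
    elected_sign = torch.sign(stacked.sum(dim=0))
    # 3) Disjoint merge: average only entries whose sign agrees with the elected sign.
    agree = (torch.sign(stacked) == elected_sign) & (stacked != 0)
    return (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
```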
AdaMerging (Yang et al. 2023): AdaMerging represents an adaptive model merging approach that autonomously learns coefficients for merging, operating either on a task-wise or layer-wise basis. It achieves this by employing entropy minimization on unlabeled test samples as a surrogate objective function to refine the merging coefficients. The task-wise AdaMerging is expressed as \(\theta = \theta _0 + \sum _{k=1}^{K} \lambda _k \tau _k\),
where \(\lambda _k\) denotes the merging coefficient for the k-th task and \(\tau _k\) represents the corresponding task vector. The layer-wise AdaMerging formulation is given by \(\theta ^l = \theta _0^l + \sum _{k=1}^{K} \lambda _k^l \tau _k^l\), where \(l\) indexes the layers of the model.
In our study, we initialize the merging coefficients \(\lambda \) to be 0.3 for all tasks and layers as in (Yang et al. 2023). This ensures a standardized starting point for the adaptive merging process.
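A schematic of the layer-wise variant is sketched below: the merging coefficients are treated as learnable parameters and updated by minimizing the Shannon entropy of predictions on unlabeled test batches. The merge_and_forward helper and the optimizer settings are illustrative placeholders rather than the configuration of Yang et al. (2023).

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits):
    """Mean Shannon entropy of the softmax predictions (surrogate objective)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1).mean()

# Illustrative setup: one learnable coefficient per (task, layer). `merge_and_forward`
# is a hypothetical helper that builds theta^l = theta_0^l + sum_k lam[k, l] * tau_k^l
# and runs the merged model on a batch of unlabeled test inputs.
num_tasks, num_layers = 8, 12
lam = torch.full((num_tasks, num_layers), 0.3, requires_grad=True)
optimizer = torch.optim.Adam([lam], lr=1e-3)            # learning rate is an assumption

def adamerging_step(merge_and_forward, unlabeled_batch):
    optimizer.zero_grad()
    logits = merge_and_forward(lam, unlabeled_batch)    # forward pass of the merged model
    loss = entropy_loss(logits)
    loss.backward()                                     # gradients flow only into lam
    optimizer.step()
    return loss.item()
```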
Ours. We set the learning rate to 1e-5 and employ the Adam optimizer for test-time adaptation on the test dataset, following the details outlined in Sect. 3.5 of the main paper. The model undergoes training for 1000 steps on each task, and we present the performance results across all tasks in Tables 3 and 4.
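For clarity, the test-time adaptation loop can be summarized as follows. Here merged_forward is a placeholder that generates instance-specific weights with the hypernetwork and runs the merged model, and the entropy objective is written in the spirit of the test-time adaptation literature; the exact objective and components are those described in Sect. 3.5 of the main paper.

```python
import torch
import torch.nn.functional as F

def test_time_adapt(hypernet, merged_forward, test_loader, steps=1000, lr=1e-5):
    """Schematic test-time adaptation loop; only the hypernetwork is updated."""
    optimizer = torch.optim.Adam(hypernet.parameters(), lr=lr)
    data_iter = iter(test_loader)
    for _ in range(steps):
        try:
            batch = next(data_iter)
        except StopIteration:                           # cycle through the unlabeled test set
            data_iter = iter(test_loader)
            batch = next(data_iter)
        optimizer.zero_grad()
        log_probs = F.log_softmax(merged_forward(hypernet, batch), dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        entropy.backward()
        optimizer.step()
```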