Abstract
Creating a multi-task model by merging models for distinct tasks has proven to be an economical and scalable approach. Recent research, such as task arithmetic, demonstrates that a static solution for multi-task model fusion can be located within the vector space spanned by task vectors. However, the static nature of these methods limits their ability to adapt to the intricacies of individual instances, thereby hindering their performance in complex scenarios. To overcome this limitation, we propose a data-adaptive weight-ensembling approach that generates model weights on the fly. Specifically, we first feed the input samples into a hypernetwork to generate instance-specific weights for the primary model. Subsequently, we perform a functional call on the primary large model with the instance-specific weights. By generating model weights on the fly, the unified model gains increased flexibility and can resolve potential weight conflicts between tasks. Building upon this adaptability, our method requires only the model checkpoints and unlabeled test samples, which are used for test-time adaptation training. We conduct extensive experiments primarily on vision Transformers and Flan-T5 models, demonstrating superior performance and satisfactory zero-shot transferability.
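To make this mechanism concrete, the following is a minimal PyTorch sketch of the data-adaptive idea: a toy hypernetwork maps an input embedding to a flat weight vector, which is reshaped and passed to torch.func.functional_call to run the primary model with instance-specific weights. The module names, sizes, and architectures are illustrative assumptions rather than the models used in this work.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

# Toy stand-ins: sizes and architectures are illustrative assumptions,
# not the models used in the paper.
primary = nn.Linear(512, 10)                            # stands in for the primary (large) model
n_params = sum(p.numel() for p in primary.parameters())
hypernet = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, n_params))

def adaptive_forward(x: torch.Tensor) -> torch.Tensor:
    """Run the primary model with weights generated for this particular instance."""
    flat = hypernet(x)                                  # instance-specific flat weight vector
    params, offset = {}, 0
    for name, p in primary.named_parameters():          # unflatten into the primary model's shapes
        params[name] = flat[offset:offset + p.numel()].view_as(p)
        offset += p.numel()
    return functional_call(primary, params, (x.unsqueeze(0),))  # stateless forward pass

logits = adaptive_forward(torch.randn(512))             # -> shape (1, 10)
```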
References
Ainsworth, S. K., Hayase, J., & Srinivasa, S. (2023). Git Re-Basin: Merging Models modulo Permutation Symmetries (No. arXiv:2209.04836). arXiv.
Alaluf, Y., Tov, O., Mokady, R., Gal, R., & Bermano, A. (2022). HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing. 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 18490–18500). New Orleans, LA, USA: IEEE.
Benton, G. W., Maddox, W. J., Lotfi, S., & Wilson, A. G. (2021). Loss Surface Simplexes for Mode Connecting Volumes and Fast Ensembling (No. arXiv:2102.13042). arXiv.
Chauhan, V. K., Zhou, J., Lu, P., Molaei, S., & Clifton, D. A. (2024). A brief review of hypernetworks in deep learning. Artificial Intelligence Review, 57(9), 250. https://doi.org/10.1007/s10462-024-10862-8
Cheng, G., Han, J., & Lu, X. (2017). Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proceedings of the IEEE, 105(10), 1865–1883. https://doi.org/10.1109/JPROC.2017.2675998
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., & Wei, J. (2022). Scaling Instruction-Finetuned Language Models (No. arXiv:2210.11416). arXiv.
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing Textures in the Wild. 2014 IEEE conference on computer vision and pattern recognition (pp. 3606–3613). Columbus, OH, USA: IEEE.
Freeman, C. D., & Bruna, J. (2017). Topology and geometry of half-rectified network optimization. 5th international conference on learning representations (ICLR 2017).
Do, T., Khiem, L., Pham, Q., Nguyen, T., Doan, T.- N., Nguyen, B., & Hoi, S. (2023). HyperRouter: Towards efficient training and inference of sparse mixture of experts. H. Bouamor, J. Pino, and K. Bali (Eds.), Proceedings of the 2023 conference on empirical methods in natural language processing (pp. 5754–5765). Singapore: Association for Computational Linguistics.
Draxler, F., Veschgini, K., Salmhofer, M., & Hamprecht, F.A. (2019). Essentially No Barriers in Neural Network Energy Landscape (No. arXiv:1803.00885). arXiv.
Entezari, R., Sedghi, H., Saukh, O., & Neyshabur, B. (2022). The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks (No. arXiv:2110.06296). arXiv.
Frankle, J., Dziugaite, G. K., Roy, D. M., & Carbin, M. (2020). Linear Mode Connectivity and the Lottery Ticket Hypothesis (No. arXiv:1912.05671). arXiv.
Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D., & Wilson, A. G. (2018). Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs (No. arXiv:1802.10026). arXiv.
Gulati, A., Qin, J., Chiu, C.- C., Parmar, N., Zhang, Y., Yu, J., & Pang, R. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition (No. arXiv:2005.08100). arXiv.
Ha, D., Dai, A. M., & Le, Q. V. (2022). Hypernetworks. International conference on learning representations.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked Autoencoders Are Scalable Vision Learners (No. arXiv:2111.06377). arXiv.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). https://doi.org/10.1109/CVPR.2016.90
Helber, P., Bischke, B., Dengel, A., & Borth, D. (2018). Introducing eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. Igarss 2018-2018 IEEE international geoscience and remote sensing symposium (pp. 204–207).
Hendrycks, D., & Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations (No. arXiv:1903.12261). arXiv.
Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models (No. arXiv:2106.09685). arXiv.
Huang, C., Liu, Q., Lin, B.Y., Pang, T., Du, C., & Lin, M. (2023). LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition (No. arXiv:2307.13269). arXiv.
Ilharco, G., Ribeiro, M.T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., & Farhadi, A. (2023). Editing models with task arithmetic (No. arXiv:2212.04089). arXiv.
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A.G. (2019). Averaging weights leads to wider optima and better generalization (No. arXiv:1803.05407). arXiv.
Jin, X., Ren, X., Preotiuc-Pietro, D., & Cheng, P. (2023). Dataless knowledge fusion by merging weights of language models (No. arXiv:2212.09849). arXiv.
Kaddour, J. (2022). Stop wasting my time! saving days of imagenet and BERT training with latest weight averaging (No. arXiv:2209.14981). arXiv.
Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3D Object Representations for Fine-Grained Categorization. 2013 IEEE international conference on computer vision workshops (pp. 554–561).
Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. https://doi.org/10.1109/5.726791
Li, W., Peng, Y., Zhang, M., Ding, L., Hu, H., & Shen, L. (2023). Deep Model Fusion: A Survey (No. arXiv:2309.15698). arXiv.
Li, Y., Yosinski, J., Clune, J., Lipson, H., & Hopcroft, J. (2016). Convergent Learning: Do different neural networks learn the same representations? (No. arXiv:1511.07543). arXiv.
Liang, J., He, R., & Tan, T. (2023). A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts (No. arXiv:2303.15361). arXiv.
Liu, C., Lou, C., Wang, R., Xi, A.Y., Shen, L., & Yan, J. (2022). Deep Neural Network Fusion via Graph Matching with Applications to Model Ensemble and Federated Learning. Proceedings of the 39th International conference on machine learning (pp. 13857–13869). PMLR.
Lu, Z., Fan, C., Wei, W., Qu, X., Chen, D., & Cheng, Y. (2024). Twin-merging: dynamic integration of modular expertise in Model Merging. arXiv.
Matena, M., & Raffel, C. (2022). Merging Models with Fisher-Weighted Averaging (No. arXiv:2111.09832). arXiv.
Mounsaveng, S., Chiaroni, F., Boudiaf, M., Pedersoli, M., & Ayed, I.B. (2023). Bag of Tricks for Fully Test-Time Adaptation (No. arXiv:2310.02416). arXiv.
Nagarajan, V., & Kolter, J.Z. (2019). Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems (Vol. 32). Curran Associates, Inc.
Navon, A., Shamsian, A., Chechik, G., & Fetaya, E. (2021). Learning the Pareto Front with Hypernetworks (No. arXiv:2010.04104). arXiv.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A.Y. (2021). Reading Digits in Natural Images with Unsupervised Feature Learning.
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision (No. arXiv:2103.00020). arXiv.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Shamsian, A., Navon, A., Fetaya, E., & Chechik, G. (2021). Personalized Federated Learning using Hypernetworks (No. arXiv:2103.04628). arXiv.
Shen, L., Tang, A., Yang, E., Guo, G., Luo, Y., Zhang, L., & Tao, D. (2024). Efficient and effective weight-ensembling mixture of experts for multi-task model merging. arXiv preprint arXiv:2410.21804.
Stallkamp, J., Schlipsing, M., Salmen, J., & Igel, C. (2012). Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32, 323–332. https://doi.org/10.1016/j.neunet.2012.02.016
Stoica, G., Bolya, D., Bjorner, J., Hearn, T., & Hoffman, J. (2023). ZipIt! Merging Models from Different Tasks without Training (No. arXiv:2305.03053). arXiv.
Sun, Z., Ozay, M., & Okatani, T. (2017). HyperNetworks with statistical filtering for defending adversarial examples (No. arXiv:1711.01791). arXiv.
Tam, D., Bansal, M., & Raffel, C. (2023). Merging by Matching Models in Task Subspaces (No. arXiv:2312.04339). arXiv.
Tang, A., Shen, L., Luo, Y., Liu, S., Hu, H., & Du, B. (2024). Towards efficient pareto set approximation via mixture of experts based model fusion. arXiv preprint arXiv:2406.09770.
Tang, A., Shen, L., Luo, Y., Xie, S., Hu, H., Zhang, L., & Tao, D. (2024). SMILE: Zero-shot sparse mixture of low-rank experts construction from pre-trained foundation models. arXiv.
Tatro, N., Chen, P.- Y., Das, P., Melnyk, I., Sattigeri, P., & Lai, R. (2020). Optimizing Mode Connectivity via Neuron Alignment. Advances in Neural Information Processing Systems (Vol. 33, pp. 15300–15311). Curran Associates, Inc.
von Oswald, J., Henning, C., Grewe, B.F., & Sacramento, J. (2022). Continual learning with hypernetworks (No. arXiv:1906.00695). arXiv.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (pp. 353–355). Brussels, Belgium: Association for Computational Linguistics.
Wortsman, M., Ilharco, G., Gadre, S.Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A.S., & Schmidt, L. (2022). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time (No. arXiv:2203.05482). arXiv.
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. 2010 IEEE computer society conference on computer vision and pattern recognition (pp. 3485–3492). San Francisco, CA, USA: IEEE.
Yadav, P., Raffel, C., Muqeeth, M., Caccia, L., Liu, H., Chen, T., & Sordoni, A. (2024). A Survey on Model MoErging: recycling and routing among specialized experts for collaborative learning. arXiv.
Yadav, P., Tam, D., Choshen, L., Raffel, C., & Bansal, M. (2023). Resolving Interference When Merging Models (No. arXiv:2306.01708). arXiv.
Yang, E., Shen, L., Guo, G., Wang, X., Cao, X., Zhang, J., & Tao, D. (2024). Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities. arXiv preprint arXiv:2408.07666.
Yang, G., Simon, J.B., & Bernstein, J. (2023). A Spectral Condition for Feature Learning (No. arXiv:2310.17813). arXiv.
Yang, E., Wang, Z., Shen, L., Liu, S., Guo, G., Wang, X., & Tao, D. (2023). AdaMerging: Adaptive Model Merging for Multi-Task Learning (No. arXiv:2310.02575). arXiv.
Yu, L., Yu, B., Yu, H., Huang, F., & Li, Y. (2023). Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (No. arXiv:2311.03099). arXiv.
Yunis, D., Patel, K.K., Savarese, P., Vardi, G., Livescu, K., Walter, M., & Maire, M. (2022). On Convexity and Linear Mode Connectivity in Neural Networks. OPT2022: 14th annual workshop on optimization for machine learning
Zheng, H., Shen, L., Tang, A., Luo, Y., Hu, H., Du, B., & Tao, D. (2023). Learn From Model Beyond Fine-Tuning: A Survey (No. arXiv:2310.08184). arXiv.
Acknowledgements
This work is supported in part by STI 2030-Major Projects (Grant No. 2021ZD0201405), the National Natural Science Foundation of China (Grant No. 62276195, 62225113, U23A20318 and U2336211), the Science and Technology Major Project of Hubei Province (Grant No. 2024BAB046), the Innovative Research Group Project of Hubei Province (Grant No. 2024AFA017), and the Shenzhen Basic Research Project (Natural Science Foundation) Basic Research Key Project (Grant No. JCYJ20241202124430041). The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of Wuhan University. Dr Tao's research is partially supported by NTU RSR and Start Up Grants.
Appendices
Model Fine-Tuning Details
The experiments in our study were conducted on a consistent hardware setup, utilizing eight NVIDIA RTX 3090 GPUs, each equipped with 24 GB of video memory. We implemented our experiments with PyTorch 2.1 and Python 3.10. The source code, model checkpoints, and logs from our experiments are available at https://github.com/tanganke/dict_fusion.
1.1 Experiment Setup
Our experimental scope encompasses a diverse range of tasks spanning both the vision and natural language domains. The model selection, evaluation datasets, and metric configurations are detailed as follows:
Vision tasks: For vision tasks, we leverage the pre-trained models from CLIP (Radford et al. 2021). The downstream tasks encompass a variety of challenges, including SUN397 (Xiao et al. 2010), Stanford Cars (Krause et al. 2013), RESISC45 (Cheng et al. 2017), EuroSAT (Helber et al. 2018), SVHN (Netzer et al. 2021), GTSRB (Stallkamp et al. 2012), MNIST (Lecun et al. 1998), and DTD (Cimpoi et al. 2014). We report top-1 accuracy as the performance metric.
NLP tasks: In the realm of natural language processing (NLP) tasks, our pre-trained language model is Flan-T5 (Chung et al. 2022). For fine-tuning, we deploy the Flan-T5 models on eight tasks derived from the GLUE benchmark (Wang et al. 2018): CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST2, and STSB. For consistency and reproducibility, we initialize the parameter-efficient models using an identical random seed of 42. The Flan-T5 models undergo LoRA fine-tuning, utilizing hyperparameters \(r=16\) and \(\alpha =32\) as specified by Hu et al. (2021). In this process, we set a constant learning rate of \(4e-5\) and maintain a uniform batch size of 16 throughout all tasks. We fine-tune the Flan-T5 models for 2000 steps on each downstream task.
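For reference, the LoRA configuration above can be expressed with the Hugging Face peft library. The snippet below is a sketch under stated assumptions (the google/flan-t5-base checkpoint and query/value projections as target modules) and omits the training loop.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

# Sketch of the LoRA setup described above; target modules are an assumption
# for T5-style attention projections and may differ from our actual configuration.
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q", "v"],   # T5 attention query/value projections (assumed)
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
# Training loop (omitted): constant learning rate 4e-5, batch size 16, 2000 steps per task.
```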
Flan-T5 models are encoder-decoder Transformer models that operate within a text-to-text framework. Consequently, we restructure the original inputs to adhere to this text-to-text paradigm. We report exact-match accuracy for all tasks except STSB, for which we report Spearman's \(\rho \). The prompt templates for each task are available in our code repository.
Figure 5 presents detailed performance metrics for the CLIP-ViT-B/32 and CLIP-ViT-L/14 models. These representations dissect the impact that pre-training and subsequent fine-tuning have on the models' task execution, spanning a variety of downstream tasks and offering a comprehensive view of the models' capabilities. Similarly, for the natural language processing models, Fig. 5 provides insight into the individual performance changes of the Flan-T5-base and Flan-T5-large models after LoRA fine-tuning. As evidenced by the highlighted diagonal cells, the fine-tuning process significantly improves each model's ability on its target task, in both the vision and language domains.
Fig. 5 Individual performance of fine-tuned models. Here, we provide a visual comparison of the performance of various fine-tuned models. The upper figure contrasts the performance of CLIP-ViT-B/32 and CLIP-ViT-L/14 models across image classification tasks, while the lower figure compares the performance of LoRA fine-tuned Flan-T5-base and Flan-T5-large models on tasks from the GLUE benchmark.
Multi-task Model Fusion Methods
This section provides a brief introduction to the baseline methods compared in our experiments, including weight averaging, Task Arithmetic, Ties-Merging, and AdaMerging. Table 1 summarizes the requirements of the different model merging methods.
Simple weight averaging: Weight averaging is the most straightforward method, wherein the weights of models fine-tuned on different tasks are averaged; it is commonly known in the literature as ModelSoups (Wortsman et al. 2022). Subsequently, the averaged model undergoes evaluation on the validation set of each task. In the case of fully fine-tuned models, the weights are directly averaged, denoted as \(\theta = \frac{1}{n}\sum _{i=1}^{n} \theta _i\), where \(\theta _i\) denotes the weights fine-tuned on the i-th task and n is the number of tasks.
When the models are LoRA fine-tuned, we average the weights of the merged models, expressed as \(\theta = \frac{1}{n}\sum _{i=1}^{n} \left( \theta _0 + B_i A_i \right) = \theta _0 + \frac{1}{n}\sum _{i=1}^{n} B_i A_i\), where \(\theta _0\) denotes the pre-trained weights and \(B_i A_i\) is the LoRA update learned on the i-th task.
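For illustration, simple weight averaging in the fully fine-tuned case amounts to averaging the checkpoints' state dictionaries key by key. The sketch below assumes that all checkpoints share the same architecture and parameter names; the checkpoint paths are placeholders.

```python
import torch

def average_state_dicts(state_dicts):
    """Simple weight averaging: theta = (1/n) * sum_i theta_i, computed key by key."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Usage sketch (checkpoint_paths is a placeholder list of fine-tuned checkpoints):
# state_dicts = [torch.load(p, map_location="cpu") for p in checkpoint_paths]
# merged_model.load_state_dict(average_state_dicts(state_dicts))
```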
Task Arithmetic (Ilharco et al. 2023): Task Arithmetic involves the computation of a task vector for each individual downstream task, followed by the summation of these task vectors to construct a multi-task vector. This multi-task vector is then element-wise multiplied by a scaling coefficient denoted as \(\lambda \) and added to the initial parameters of the pre-trained model. In mathematical terms, for full fine-tuning, this process is represented as \(\theta = \theta _0 + \lambda \sum _{i=1}^{n} \tau _i\), where \(\tau _i = \theta _i - \theta _0\) is the task vector of the i-th task.
For LoRA fine-tuning, it is expressed as \(\theta = \theta _0 + \lambda \sum _{i=1}^{n} B_i A_i\),
where \(\lambda \) serves as a hyperparameter determined by selecting the best-performing model on the validation set.
In the context of our study, we set \(\lambda \) to a value of 0.3, a choice made after evaluating the model’s performance on the validation set.
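A compact sketch of this procedure on PyTorch state dictionaries is given below; it assumes the pre-trained and fine-tuned checkpoints share parameter names and simply copies non-floating-point buffers unchanged.

```python
import torch

def task_arithmetic(pretrained_sd, finetuned_sds, lam=0.3):
    """Task Arithmetic: theta = theta_0 + lam * sum_i (theta_i - theta_0)."""
    merged = {}
    for key, theta_0 in pretrained_sd.items():
        if not torch.is_floating_point(theta_0):
            merged[key] = theta_0                      # leave integer buffers untouched
            continue
        multi_task_vector = sum(sd[key] - theta_0 for sd in finetuned_sds)
        merged[key] = theta_0 + lam * multi_task_vector
    return merged
```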
Ties-Merging (Yadav et al. 2023): Similar to Task Arithmetic, Ties-Merging multiplies a unified task vector element-wise by a scaling coefficient denoted as \(\lambda \) and adds it to the initial parameters of the pre-trained model.
However, before doing so, Ties-Merging performs three key steps, namely trimming, electing the sign of parameters, and disjoint merging, ultimately resulting in a merged task vector denoted as \(\tau \). In other words, the final model parameters are given by \(\theta = \theta _0 + \lambda \tau \), where \(\lambda \) represents a hyperparameter chosen based on the best-performing model identified through validation set evaluation.
For our specific study, we set the hyperparameter \(\lambda \) to a value of 0.3, a decision made after thorough assessment of model performance on the validation set.
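The sketch below illustrates the three steps on flattened task vectors. The 20% trimming ratio and the handling of ties are our simplifications and may differ in detail from the reference implementation of Yadav et al. (2023); the result is the merged task vector \(\tau \) to be scaled by \(\lambda \) and added to \(\theta _0\).

```python
import torch

def ties_merge(task_vectors, k=0.2):
    """Simplified Ties-Merging over a list of flattened (1-D) task vectors."""
    trimmed = []
    for tv in task_vectors:
        # 1) Trim: keep the top-k fraction of entries by magnitude, zero the rest.
        n_keep = max(1, int(k * tv.numel()))
        threshold = tv.abs().kthvalue(tv.numel() - n_keep + 1).values
        trimmed.append(torch.where(tv.abs() >= threshold, tv, torch.zeros_like(tv)))
    stacked = torch.stack(trimmed)                      # (num_tasks, dim)
    # 2) Elect sign: per coordinate, the sign carrying the larger total magnitude.
    elected_sign = torch.sign(stacked.sum(dim=0))
    # 3) Disjoint merge: average only entries whose sign agrees with the elected sign.
    agree = (torch.sign(stacked) == elected_sign) & (stacked != 0)
    return (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
```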
AdaMerging (Yang et al. 2023): AdaMerging represents an adaptive model merging approach that autonomously learns coefficients for merging, operating either on a task-wise or layer-wise basis. It achieves this by employing entropy minimization on unlabeled test samples as a surrogate objective function to refine the merging coefficients. The task-wise AdaMerging is expressed as \(\theta = \theta _0 + \sum _{k=1}^{K} \lambda _k \tau _k\),
where \(\lambda _k\) denotes the merging coefficient for the k-th task and \(\tau _k\) represents the corresponding task vector. The layer-wise AdaMerging formulation is given by \(\theta ^l = \theta _0^l + \sum _{k=1}^{K} \lambda _k^l \tau _k^l\), where \(l\) indexes the layers of the model.
In our study, we initialize the merging coefficients \(\lambda \) to be 0.3 for all tasks and layers as in (Yang et al. 2023). This ensures a standardized starting point for the adaptive merging process.
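A schematic of the layer-wise variant is sketched below: the merging coefficients are treated as learnable parameters and updated by minimizing the Shannon entropy of predictions on unlabeled test batches. The merge_and_forward helper and the optimizer settings are illustrative placeholders rather than the configuration of Yang et al. (2023).

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits):
    """Mean Shannon entropy of the softmax predictions (surrogate objective)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1).mean()

# Illustrative setup: one learnable coefficient per (task, layer). `merge_and_forward`
# is a hypothetical helper that builds theta^l = theta_0^l + sum_k lam[k, l] * tau_k^l
# and runs the merged model on a batch of unlabeled test inputs.
num_tasks, num_layers = 8, 12
lam = torch.full((num_tasks, num_layers), 0.3, requires_grad=True)
optimizer = torch.optim.Adam([lam], lr=1e-3)            # learning rate is an assumption

def adamerging_step(merge_and_forward, unlabeled_batch):
    optimizer.zero_grad()
    logits = merge_and_forward(lam, unlabeled_batch)    # forward pass of the merged model
    loss = entropy_loss(logits)
    loss.backward()                                     # gradients flow only into lam
    optimizer.step()
    return loss.item()
```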
Ours. We set the learning rate to 1e-5 and employ the Adam optimizer for test-time adaptation on the test dataset, following the details outlined in Sect. 3.5 of the main paper. The model undergoes training for 1000 steps on each task, and we present the performance results across all tasks in Tables 3 and 4.
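For clarity, the test-time adaptation loop can be summarized as follows. Here merged_forward is a placeholder that generates instance-specific weights with the hypernetwork and runs the merged model, and the entropy objective is written in the spirit of the test-time adaptation literature; the exact objective and components are those described in Sect. 3.5 of the main paper.

```python
import torch
import torch.nn.functional as F

def test_time_adapt(hypernet, merged_forward, test_loader, steps=1000, lr=1e-5):
    """Schematic test-time adaptation loop; only the hypernetwork is updated."""
    optimizer = torch.optim.Adam(hypernet.parameters(), lr=lr)
    data_iter = iter(test_loader)
    for _ in range(steps):
        try:
            batch = next(data_iter)
        except StopIteration:                           # cycle through the unlabeled test set
            data_iter = iter(test_loader)
            batch = next(data_iter)
        optimizer.zero_grad()
        log_probs = F.log_softmax(merged_forward(hypernet, batch), dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        entropy.backward()
        optimizer.step()
```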