1 Introduction

Multimodal-to-image diffusion models have drawn considerable research interest owing to their remarkable generalisation abilities. Notably, advanced vision-language models, such as Stable Diffusion (Rombach et al., 2022), Imagen (Saharia et al., 2022), and LLaVA (Liu et al., 2024), can generate high-fidelity images by incorporating various modalities. By integrating additional rich modalities such as segmentation maps, Canny edges, and landmarks, these models leverage diffusion processes to achieve state-of-the-art generative performance with enhanced control (Zhang et al., 2023). For context-specific downstream tasks, full fine-tuning of these pre-trained models is frequently employed to attain satisfactory performance while preserving their diverse generative capabilities. However, such approaches necessitate retraining all model parameters (Hu et al., 2022; Xia et al., 2024), resulting in substantial memory demands, a large volume of fine-tuning data, and long training times.

Parameter-efficient fine-tuning (PEFT) provides a more resource-effective alternative by optimising only a minimal number of parameters. Approaches such as adapter modules (Houlsby et al., 2019; Dettmers et al., 2024), low-rank factorisation (Liu et al., 2024; Hayou et al., 2024; Valipour et al., 2023; Hu et al., 2022), and prompt tuning (Li and Liang, 2021; Liu et al., 2021; Lester et al., 2021) selectively adjust only a subset of the model’s parameters, thus largely reducing the computational burden and resource requirements. Among PEFT methods, Low-Rank Adaptation (LoRA) (Hu et al., 2022) is notably popular for its efficacy and simplicity. LoRA introduces a limited number of learnable parameters without modifying the model architecture: it freezes the original model weights and injects trainable low-rank matrices into the weight updates, thereby enabling models to be trained over a small number of epochs while maintaining high-fidelity and diverse generative performance comparable to the pre-trained models. Moreover, several weight decomposition methods have been proposed to quantify the preservation of pre-trained generative capabilities. One such method reparameterises model weights into magnitude and directional components to capture the distinct patterns of updates during fine-tuning (Liu et al., 2024).

Another representative approach applies orthogonal transformations to neurons to maintain pairwise angular relationships between fine-tuned and pre-trained models (Qiu et al., 2023). Empirical evidence supports the concept of overall hyperspherical similarity, known as hyperspherical energy (Liu et al., 2017), which is often characterised by the pairwise relational structure, such as the cosine similarity among neurons. By maintaining the angle between neuron pairs, orthogonal fine-tuning (OFT) effectively limits the degrees of freedom in the orientations of paired neurons. This constraint guarantees the preservation of angular correlations between neuron pairs, thus retaining the knowledge of the pre-trained model in the fine-tuned model. OFT nevertheless remains parameter-inefficient, which motivated researchers to impose a block-diagonal structure on the orthogonal matrix. This approach reduces the number of trainable parameters while still permitting a subset of orthogonal transformations (Qiu et al., 2023). By balancing the preservation of pre-trained capabilities with the integration of new task-specific features, these methods push the boundaries of what generative models can achieve, ensuring high-quality and diverse outputs while addressing computational and practical limitations.

In this paper, we introduce a novel PEFT approach inspired by Möbius geometry to enhance the flexibility and expressiveness of fine-tuning large multimodal generative models. LoRA, while reducing the computational load by constraining updates to a low-dimensional subspace, may not fully capture the complexity of high-dimensional data. OFT preserves the geometric structure of pre-trained weights, but is restricted by the constraints of the Stiefel manifold and the computational overhead of maintaining orthogonality during training. These limitations hinder the ability of both methods to represent highly non-linear and complex data patterns. Our method is motivated by Möbius geometry, which operates on the Riemann sphere and can benefit from hyperbolic models such as the Poincaré ball (Nickel and Kiela, 2017). This allows for more flexible and powerful non-linear mappings, enabling the model to capture complex patterns and hierarchical structures effectively. The integration of Möbius-style transformations into model fine-tuning is particularly beneficial for tasks involving intricate data distributions. To summarise our contributions: the proposed method is first supported by our analysis highlighting the advantages of Möbius geometry; inspired by this, an illustrative validation further confirms that the Möbius-guided transformation outperforms traditional OFT fine-tuning with the Cayley transformation, demonstrating superior generalisation capabilities. Most notably, we introduce a Möbius-inspired transformation for fine-tuning large multimodal generative models, enabling the generation of high-quality visual signals from diverse inputs. Our method captures more complex relationships within the data, leading to improved performance. The proposed method is evaluated on primary tasks such as subject-driven generation, controllable generation, and human motion generation. Our extensive experiments demonstrate improvements over previous methods with respect to generation quality.

2 Related Work

This section reviews advancements in multi-modal generative models, focusing on diffusion models for precise visual editing and efficient fine-tuning methods that boost performance with minimal computational cost.

Multi-modal Generative Model  Multimodal synthesis has emerged as a preeminent field of research, showcasing capabilities in generating human-like natural language (Brown et al., 2020), high-quality images (Karras et al., 2020; Rombach et al., 2022), videos (Blattmann et al., 2023; Ho et al., 2022), 3D models (Lin et al., 2023; Liu et al., 2023, 2024), speech (Tan et al., 2024; Van Den Oord et al., 2016; Shen et al., 2018) and music (Dhariwal et al., 2020; Copet et al., 2024). Within these, the inherently high-dimensional nature of images introduces considerable complexities to the domain of generative modelling. Generative Adversarial Networks (GANs) (Goodfellow et al., 2020) prevail in producing high-resolution images with impressive perceptual quality (Brock et al., 2018; Zhang et al., 2017; Karras et al., 2020), yet they struggle with optimization challenges (Gulrajani et al., 2017; Karras et al., 2017, 2020) and often fail to capture the full data distribution (Metz et al., 2016). In response, likelihood-based models like Variational Autoencoders (VAEs) (Kingma and Welling, 2013) and flow-based models (Dinh et al., 2014, 2016; Papamakarios et al., 2017) emphasise accurate density estimation, facilitated by more stable optimization trajectories, though their output quality typically lags behind that of GANs (Kingma and Dhariwal, 2018; Vahdat and Kautz, 2020). Autoregressive models (ARMs) (Chen et al., 2020; Child et al., 2019; Oord et al., 2016; Van Den Oord et al., 2016), while achieving strong density-estimation performance, suffer from computational intensity (Vaswani et al., 2017; Jouppi et al., 2017) and slow processing due to their reliance on sequential sampling and detailed, pixel-based representations, which prolong training and demand substantial computational resources, limiting them to low-resolution multimedia data synthesis. The intrinsic difficulties with pixel-based image representations include the modelling of minute, high-frequency details (Salimans et al., 2017) that are often imperceptible, leading maximum-likelihood training methods to disproportionately allocate computational resources to these elements, thereby prolonging training duration. To circumvent these limitations and effectively scale to higher resolutions, innovative methodologies have been developed. Notably, several two-stage approaches (Esser et al., 2021; Razavi et al., 2019) have been proposed, where ARMs are employed not to model raw pixel data but rather a compressed latent image space. This strategic shift not only reduces the computational costs but also enhances the efficiency and scalability of generating high-resolution images, promising more practical application in advanced generative tasks. Lately, Diffusion Probabilistic Models (DPMs) (Sohl-Dickstein et al., 2015) have established new benchmarks in both density estimation (Kingma et al., 2021) and sample quality (Dhariwal and Nichol, 2021). Their effectiveness in generating high-quality images is largely attributed to the use of a UNet-based architecture and the diffusion process, which leverages the inductive biases inherent in image data. Typically, these models are evaluated and refined within the pixel domain, where DPMs function as generative models that iteratively denoise samples. Optimal synthesis outcomes are often achieved by employing a reweighted training objective. However, this approach has the trade-offs of slow inference speeds and significant computational costs during training.
While advanced sampling techniques and hierarchical methods can partially improve inference speed, training models on high-resolution images still requires the computation of costly gradients. Nevertheless, DPMs offer a compelling alternative to traditional ARMs, GANs, and VAEs. They produce high-quality, diverse outputs and show robustness against common issues such as mode collapse and blurriness seen in other generative approaches. Despite their computational intensity, these trade-offs are offset by the significant improvements in output quality and stability, making them a valuable choice in fields where fidelity and variety are crucial. In this work, we mainly focus on diffusion-based models.

Controllable Generation Diffusion Models Controllable text-to-image diffusion models (T2I DMs) (Saharia et al., 2022; Ramesh et al., 2022; Rombach et al., 2022) have gained popularity due to their impressive ability to generate high-quality visual content, attributed to their versatility, expressiveness, and user-friendly interface. Traditionally, GANs have been at the forefront of controllable image generation, offering high-quality results and an ordered, interpretable latent space. Despite their successes, the main challenge with GANs is the difficulty in controlling fine-grained visual details. Diffusion-based models have emerged as a promising alternative to GANs, offering greater flexibility and controllability. However, unlike GANs, diffusion models do not inherently provide an ordered and interpretable latent space, which has sparked research interest in enhancing their controllability.

The recent advancements in technology facilitate a nuanced level of control over image manipulation, enabling operations ranging from concept interpolation (Brack et al., 2024; Gandikota et al., 2023; Hertz et al., 2023; Kawar et al., 2023) and instance penalisation (Ruiz et al., 2023) to controllable motion generation (Dai et al., 2024) and image editing (Brooks et al., 2023; Yang et al., 2024). The expectation in image editing is that modifications should be localised, affecting only selected aspects of the image. Prior research has underscored the challenge of adequately disentangling multiple concepts within a sample to avert global alterations, thus allowing for focused, localised edits. Notable works such as SDEdit (Meng et al., 2022), which adds intermediate noise to an image and denoises it based on the desired edit, and DDIB (Su et al., 2023), which utilizes DDIM inversion for encoding and decoding images, have paved the way for more sophisticated editing techniques. DiffusionCLIP (Kim et al., 2022) leverages language-vision model gradients, DDIM inversion, and model fine-tuning for domain-specific editing, while Liu et al. (2023) guide the diffusion process using text and image inputs to synthesise similar images aligned with the given text. Hertz et al. (2023) take a different approach by manipulating cross-attention layers in text-to-image diffusion models for fine-grained control.

Instance penalisation (subject-driven generation) has also drawn significant attention in controllable image editing, with models like DreamBooth (Ruiz et al., 2023) adapting the visual backbone using a prior preservation loss for specific instances, and Textual Inversion (Gal et al., 2023) optimizing an added embedding vector to represent specific instances or concepts. CustomDiffusion (Kumari et al., 2023) proposes efficient training of multiple concepts by fine-tuning only the cross-attention layers. The CLIP embedding space has also been extensively utilised to steer the generation process.

Text-based image editing has advanced significantly with models like Textual Inversion (Gal et al., 2023) and DreamBooth (Ruiz et al., 2023) synthesising novel views of given subjects using a few images and a target text. Imagic (Kawar et al., 2023) maintains high fidelity to the input image while applying non-rigid edits based on a single natural language prompt. P2P (Yang et al., 2024) and its extension, InstructPix2Pix (Brooks et al., 2023), perform structure-preserving editing using Stable Diffusion models and allow for human-like instructions. Structure preservation has been a key focus in editing techniques, with pix2pix-zero (Brooks et al., 2023) proposing noise regularization and cross-attention guidance to retain image structure, and StyleDiffusion (Li et al., 2023) introducing a mapping network to modify cross-attention computation for regularisation. Automatic mask generation has also been explored, with DiffEdit (Couairon et al., 2023) generating masks by contrasting predictions conditioned on different text prompts and GSAM-inpaint (Yu et al., 2023) detecting masks using pretrained segmentation models. Manipulating cross-attention layers or spatial features to preserve image structure has also been demonstrated in models like P2P (Yang et al., 2024), PnP (Tumanyan et al., 2023), and MnM (Patashnik et al., 2023).

Parameter-Efficient Fine-Tuning Fine-tuning multimodal generative models remains indispensable for optimal task-specific performance. Given the computational constraints inherent in large-scale models, PEFT methodologies have gained traction, enabling selective parameter adjustment while preserving model integrity and efficacy. LoRA (Hu et al., 2022) has emerged as a seminal work in PEFT that introduces low-rank matrices to approximate the updates to the pre-trained model weights, achieving a good balance between efficiency and effectiveness. Many variants of LoRA have been proposed, such as AdaLoRA (Zhang et al., 2023), which prunes the singular values of less important updates; this approach is crucial for reducing the parameter budget while avoiding computationally intensive exact SVD calculations. Other variants include IncreLoRA (Zhang et al., 2023) and DyLoRA (Valipour et al., 2023), which dynamically adjust the LoRA rank distribution to improve tuning efficiency; QLoRA (Dettmers et al., 2024), which combines LoRA with model quantization to further save computing resources; LoRA+ (Hayou et al., 2024) and PrecLoRA (Zhang and Pilanci, 2024), which study the optimization landscape of LoRA training; and the more recent variant DoRA (Liu et al., 2024), which decomposes pre-trained weights into magnitude and direction components and applies LoRA for direction tuning. To address the theoretical limitations of LoRA concerning the preservation of pre-training knowledge, OFT (Qiu et al., 2023; Liu et al., 2024) has been proposed. OFT transforms neuron vectors within the same layer using a set of orthogonal matrices, preserving the pairwise angles between neuron vectors. Recent hypotheses postulate that an optimal fine-tuned model should exhibit minimal deviations in hyperspherical energy compared to its pre-trained counterpart. Drawing inspiration from the empirical observation that hyperspherical similarity encodes semantic information well (Liu et al., 2017, 2018) and that angular feature difference characterises the semantic gap (Chen et al., 2020), OFT underscores the crucial role of neuron angles (directions) in encoding semantic information, proposing methods such as Constrained Orthogonal Fine-tuning (COFT) to preserve the pairwise angles between neurons while constraining the fine-tuned model within a fixed radius of the pre-trained model. However, the number of trainable parameters in OFT can be quite large due to the high dimensionality of orthogonal matrices. To address this issue, BOFT (Orthogonal Butterfly) (Liu et al., 2024) is introduced as an extension of OFT that generates a dense orthogonal matrix using butterfly factorization, achieving better parameter efficiency. While existing PEFT methods for multimodal generative models have achieved significant success in reducing fine-tuning costs, their ability to guide the generative model across diverse data and feature patterns still has room for improvement; addressing this limitation is the focus of our work and is discussed in subsequent sections.

3 Method

To overcome the expressiveness limitations of existing PEFT methods (e.g., OFT and LoRA), we propose leveraging Möbius geometry to enhance the model’s ability to fine-tune weights while capturing complex data relationships. This section formalises the LoRA and OFT frameworks, outlining their advantages and limitations. We examine the constraints of orthogonal transformations within the Stiefel space (Atiyah and Todd, 1960), introduce a Möbius-inspired transformation as a promising alternative, and analyse its properties and benefits using synthesised data. Finally, we propose a Möbius-inspired transformation as a PEFT method for fine-tuning multimodal generative vision models to generate novel visual signals from input prompts and references.

3.1 Preliminaries

Given a large pre-trained multi-modal generative model, let \({\textbf{W}} \in {\mathbb {R}}^{d \times n}\) denote the pre-trained weight matrix. Recent research has extensively explored PEFT techniques for large-scale multimodal generative models. These techniques aim to reduce the number of trainable parameters while preserving or enhancing model performance. By minimising the number of parameters that need to be updated during fine-tuning, PEFT methods make the adaptation of large-scale models more computationally feasible and memory efficient. This is crucial for deploying models in resource-constrained environments and for scaling to even larger models without incurring prohibitive computational costs.

The LoRA framework defines the updated weight matrix \(\textbf{W}'\) as:

$$\begin{aligned} \textbf{W}' = \textbf{W} + \Delta = \textbf{W} + \textbf{BA}, \end{aligned}$$
(1)

where \(\textbf{W}\) is the pre-trained and frozen weight matrix, and \(\Delta = \textbf{BA}\) represents the low-rank update. Here, \(\textbf{B} \in \mathbb {R}^{d \times r}\) and \(\textbf{A} \in \mathbb {R}^{r \times n}\) are the low-rank matrices with \(r \ll \min (d, n)\). This decomposition constrains the updates to lie in a low-dimensional subspace, significantly reducing the number of parameters that need to be trained. By focusing on the low-rank adaptation, LoRA achieves substantial computational savings and maintains or improves model performance across various tasks. LoRA has been effectively applied in domains such as natural language processing and generative computer vision, where large pre-trained models are common.
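For concreteness, the following minimal PyTorch sketch wraps a frozen linear layer with a LoRA-style low-rank update as in Eq. (1); the class name, rank \(r\), scaling factor, and initialisation are illustrative assumptions rather than details of the original LoRA implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W' = W + BA (Eq. 1)."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pre-trained weights W
            p.requires_grad = False
        d, n = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, n) * 0.01)  # A in R^{r x n}
        self.B = nn.Parameter(torch.zeros(d, r))          # B in R^{d x r}; zero init keeps W' = W at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the low-rank correction x A^T B^T, scaled by alpha / r.
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```

Only \(\textbf{A}\) and \(\textbf{B}\) receive gradients, so the number of trainable parameters per layer drops from \(dn\) to \(r(d+n)\).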

However, while LoRA excels in reducing computational complexity, it may inadvertently introduce uncertainty into the feature learning process and destabilise neuron dynamics, potentially impacting the model’s generalisation ability. To counteract this issue, OFT enforces orthogonality constraints on the weight updates, promoting diversity and independence among the learned features. This orthogonalisation helps to preserve a rich and decorrelated learning space, enhancing the robustness and generalisation capabilities of the model. OFT uses orthogonal transformations to fine-tune neural networks while preserving pairwise neuron similarities. The fine-tuning is defined by:

$$\begin{aligned} \textbf{W}' = \textbf{R}\textbf{W}, \quad \text{s.t. } \textbf{R}^{\top } \textbf{R}=\textbf{R} \textbf{R}^{\top }=\textbf{I}, \end{aligned}$$
(2)

where \(\textbf{R}\) is an orthogonal matrix and \(\textbf{I}\) is the identity matrix. The objective of OFT is to ensure that the transformation matrix \(\textbf{R}\) maintains the orthogonality constraints, thereby preserving the geometric structure of the pre-trained model’s weights. This approach has been shown to enhance the stability of the fine-tuning process and improve the model’s performance by maintaining the hyperspherical energy of the neural network. Specifically, the Cayley parameterization is utilized to construct the orthogonal matrix efficiently, with \(\textbf{R} = (\textbf{I} + \textbf{Q})(\textbf{I} - \textbf{Q})^{-1}\), where \(\textbf{Q}\) is a skew-symmetric matrix, ensuring the orthogonality of \(\textbf{R}\). Despite its advantages, OFT faces challenges due to the constraints on the learning capacity imposed by the Stiefel space. As a result, the hypothesis space is limited to rotations and reflections, which might not be expressive enough to capture the full complexity of data, especially in highly non-linear or high-dimensional settings. The optimisation often involves methods like the Gram-Schmidt process or other orthogonalisation techniques such as Givens rotations or Householder reflections, which can be more complex and may slow down convergence.
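The sketch below illustrates the Cayley parameterisation described above in PyTorch; building the skew-symmetric \(\textbf{Q}\) as \(\textbf{S} - \textbf{S}^{\top}\) from an unconstrained parameter, initialising it at zero, and ignoring the block-diagonal speed-up are our own simplifying assumptions.

```python
import torch
import torch.nn as nn


class CayleyOrthogonalLinear(nn.Module):
    """Orthogonal fine-tuning of a frozen linear layer: W' = R W (Eq. 2),
    with R = (I + Q)(I - Q)^{-1} and Q skew-symmetric (Cayley parameterisation)."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # the pre-trained weights stay frozen
            p.requires_grad = False
        d = base.out_features
        self.S = nn.Parameter(torch.zeros(d, d))  # unconstrained; Q = 0 gives R = I at the start

    def orthogonal_matrix(self) -> torch.Tensor:
        Q = self.S - self.S.T                      # skew-symmetric by construction (Q^T = -Q)
        I = torch.eye(Q.shape[0], device=Q.device, dtype=Q.dtype)
        return (I + Q) @ torch.linalg.inv(I - Q)   # Cayley transform, orthogonal for any skew Q

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.orthogonal_matrix() @ self.base.weight  # rotate the frozen weight matrix
        return nn.functional.linear(x, W, self.base.bias)
```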

Fig. 1 Comparison between the Cayley and Möbius-inspired transformations in handling both Euclidean and hyperbolic data, showcasing the better generalisation and adaptability of the Möbius-inspired transformation for robust modelling in different distribution settings. The training and testing data for each column are denoted as E/E, H/E, E/H, H/H, where E stands for Euclidean data and H for hyperbolic data (Poincaré ball). When the training and testing data distributions are aligned (E/E and H/H), both the Cayley and Möbius-inspired transformations are able to fit the data, with the Möbius-inspired transformation performing slightly better than the Cayley transformation. However, when there is a discrepancy between the training and testing data distributions (E/H and H/E), the Cayley transformation struggles to capture the hierarchical structure of the hyperbolic data

3.2 Möbius-Inspired Transformation

To address the above limitations, we propose improving the fine-tuning framework by leveraging Möbius geometry, which offers the following advantages over traditional orthogonal transformations: 1) Möbius transformations originally operate in the complex plane, represented as the Riemann sphere, allowing for more flexible and powerful transformations; Möbius geometry is therefore inherently suited to non-linear mappings, enabling the model to capture complex patterns and relationships in the data more effectively. 2) These transformations can be extended to hyperbolic models, such as the Poincaré ball, which are well suited to representing hierarchical and structured data. By extending the hypothesis space beyond the constraints of the Stiefel manifold, Möbius geometry provides a richer set of non-linear transformations, enhancing the model’s expressiveness.

Definitions and Properties The Möbius transformation, a unifying framework for analysing planar geometries and originally fundamental in complex analysis, is a function of the form:

$$\begin{aligned} f(z) = \frac{az + b}{cz + d}, \end{aligned}$$
(3)

where \(a\), \(b\), \(c\), and \(d\) are complex numbers and \(ad - bc \ne 0\). Geometrically, Möbius transformations can be interpreted as transformations of the Riemann sphere, which is the complex plane augmented by a point at infinity. This sphere provides a compact representation of the extended complex plane, enabling Möbius transformations to be interpreted as compositions of rotations, translations, and inversions. Möbius transformations include rotations around the origin and reflections, corresponding to orthogonal transformations in higher-dimensional spaces. They also include translations and inversions, providing greater flexibility than orthogonal transformations. Moreover, Möbius transformations are bijective conformal maps that preserve angles and the structure of the complex plane, forming a group under composition known as the Möbius group or the projective special linear group. They are conformal, meaning they maintain angles between curves, and they map circles and lines to other circles and lines, preserving geometric structures such as curvature and straightness. In the context of fine-tuning generative models, in which all operations take place in the weight space, we want to utilise this property to ensure that key geometric features, such as boundaries or relations, are maintained during transformations, contributing to the stability and consistency of generated outputs. In this paper, we believe the rationale behind the Möbius-inspired transformation provides the theoretical motivation for enhanced non-linearity and has the potential to aid in fine-tuning model weights by coordinating diverse data distributions and feature patterns. The Möbius transformation is well known for its ability to capture complex relationships in the complex plane due to its non-linear structure, making it adept at handling intricate mappings and transformations; this suggests that, even outside the context of complex numbers, the structured form of the Möbius transformation may still be effective in managing complex, non-linear patterns in data. Therefore, in the context of deep learning models, such a formulation can help preserve the geometric structure of the pre-trained model’s weights. We first use an example to evaluate this assumption, and then propose a method based on the Möbius-inspired formulation for PEFT of a multimodal generative vision model.
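To make Eq. (3) concrete, the short sketch below applies a Möbius transformation to points on the unit circle and checks the non-degeneracy condition \(ad - bc \ne 0\); the coefficients are arbitrary examples chosen for illustration, not values used in our method.

```python
import numpy as np


def mobius(z: np.ndarray, a: complex, b: complex, c: complex, d: complex) -> np.ndarray:
    """Möbius transformation f(z) = (a z + b) / (c z + d) on the complex plane (Eq. 3)."""
    if a * d - b * c == 0:
        raise ValueError("Degenerate map: ad - bc must be non-zero.")
    return (a * z + b) / (c * z + d)


# Example: points on the unit circle are mapped to another circle (or line),
# illustrating the circle-preserving property discussed above.
theta = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
circle = np.exp(1j * theta)
print(mobius(circle, a=1 + 0j, b=1j, c=0.5j, d=1 + 0j))
```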

Fig. 2 Overview of our method in comparison to existing approaches: a LoRA fine-tunes large models by factorizing the weight update matrix into two smaller matrices, \(A\) and \(B\), reducing the number of trainable parameters while preserving performance. b Orthogonal fine-tuning uses the Cayley transformation to update model weights within an orthogonal subspace. c Our proposed method applies a Möbius geometry-inspired transformation, offering non-linearity that enables more flexible and efficient parameter adjustments

Analysis with Toy Examples To evaluate the effectiveness of Möbius-inspired transformations in handling complex distributions, we synthesised toy datasets and fitted models using both Cayley and Möbius-inspired transformations. We visualised the data distributions to compare the theoretical properties and practical applications of these transformations. As shown in Fig. 1, the upper and lower sections represent the Cayley and Möbius-inspired transformations, respectively. The training and testing data for each column are denoted as E/E, H/E, E/H, H/H, where E stands for Euclidean data and H for hyperbolic data (Poincaré ball). When the training and testing data distributions are aligned (E/E and H/H), both Cayley and Möbius-inspired transformations are able to fit the data, with the Möbius-inspired transformation performing slightly better than the Cayley transformation. However, when there is a discrepancy between the training and testing distributions (E/H and H/E), the Cayley transformation struggles to capture the hierarchical structure of the hyperbolic data. This limitation is due to its linear nature and the constraints imposed by orthogonal transformations, which are not well suited to the inherent curvature and hierarchy of the feature space. In comparison, the Möbius-inspired transformation fits the complex data distribution closely. Its ability to perform complex, non-linear transformations allows it to capture the intricate relationships and the curvature of the Poincaré ball effectively, thereby demonstrating the greater flexibility and expressiveness of Möbius-inspired transformations in representing hierarchical structures. This example demonstrates that models based on Möbius-inspired transformations perform better at fitting complex and non-linear data distributions, illustrating that such transformations can effectively capture and organise intricate data distributions and features. Consequently, employing Möbius-inspired transformations to manipulate model parameters enables the model to better adapt and fine-tune its representation of the data, leading to improved generation of high-quality, diverse samples. By leveraging the flexibility and robustness of Möbius-inspired transformations, we can enhance the model’s ability to handle complex and multimodal data structures, achieving more accurate and realistic generation.
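As a rough sketch of how such a toy comparison can be set up (under our own simplifying assumptions about the sampling profile, loss, and optimiser, which are not the exact protocol behind Fig. 1), one can sample points either in Euclidean space or inside a Poincaré ball and fit any parametric map, e.g., a Cayley- or Möbius-parameterised transformation like those sketched above, by minimising a regression loss:

```python
import torch
import torch.nn as nn


def sample_euclidean(n: int, dim: int = 2) -> torch.Tensor:
    """Synthetic Euclidean data: standard Gaussian points."""
    return torch.randn(n, dim)


def sample_poincare_ball(n: int, dim: int = 2, radius: float = 0.95) -> torch.Tensor:
    """Synthetic hyperbolic-style data: points inside a Poincaré ball, biased toward the boundary."""
    direction = nn.functional.normalize(torch.randn(n, dim), dim=1)
    r = radius * torch.rand(n, 1) ** 0.25      # crude boundary-heavy radial profile
    return direction * r


def fit_map(transform: nn.Module, x: torch.Tensor, y: torch.Tensor,
            steps: int = 500, lr: float = 1e-2) -> float:
    """Fit a parametric map to (x, y) pairs by minimising mean-squared error."""
    opt = torch.optim.Adam(transform.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(transform(x), y)
        loss.backward()
        opt.step()
    return float(loss)
```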

3.3 Möbius-Inspired Transformation in Generative Fine-tuning

The goal of this work is to leverage the rationale behind Möbius geometry transformations to improve weight transformations (e.g., Cayley transformations) during the fine-tuning of generative models. This extension to a higher-dimensional geometric interpretation allows for capturing more complex relationships within data, making it particularly suitable for applications requiring advanced modelling capabilities. Although Möbius transformations are traditionally applied in the complex plane, their inherent non-linearity and rich mathematical structure offer valuable inspiration for broader applications. Instead of directly using the complex form, we adapt their key properties, such as handling complex mappings and preserving geometric relationships, to manipulate neural network weights in a structured manner. The transformations preserve key geometric structures, such as curvature and straightness, ensuring that critical neuron relationships and weight distributions remain intact during fine-tuning. The non-linearity of the transformations allows for flexible adjustments, capturing complex patterns while maintaining the model’s foundational behaviour. Additionally, their ability to handle translations, rotations, and inversions enables precise control over neuron modifications, enhancing the model’s adaptability. We integrate our proposed Möbius-inspired transformation within the PEFT framework and illustrate how this transformation can be adapted for fine-tuning a Stable Diffusion model to generate high-quality content from multi-modal references. Figure 2 summarises the similarities and differences between representative PEFT methods and ours. We focus on transforming the linear layers of the attention modules in the generative model (e.g., the Stable Diffusion model). Like other PEFT frameworks, we fine-tune only the proposed transformation parameters while keeping the rest of the pre-trained model parameters fixed. To handle matrix operations and make the formulation suitable for neural networks, we extend Eq. (3) to:

$$\begin{aligned} \textbf{W}' = (\textbf{A} \textbf{R} + \textbf{B}) \left( \textbf{C} \textbf{R} + \textbf{D} \right) ^{-1}\textbf{W} \end{aligned}$$
(4)

where \({\textbf{A}} \in \mathbb {R}^{d \times d}\) is a weight matrix fine-tuning the pre-trained model; \(\textbf{B} \in \mathbb {R}^{d \times n}\) and \(\textbf{D} \in \mathbb {{R}}^{d \times n}\) are bias-like matrices (with \(d=n\)); \(\textbf{C} \in \mathbb {{R}}^{d \times d}\) is a weight matrix affecting the transformation denominator; and \(\textbf{R}\) is initialised as the identity matrix but treated as a learnable parameter. The goal is to optimise the transformation matrices while the pre-trained model remains frozen. Moreover, motivated by previous work (Qiu et al., 2023), we leverage a block-diagonal design to further improve efficiency.
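A minimal sketch of how Eq. (4) could wrap a frozen linear layer is given below; initialising \(\textbf{A}\) and \(\textbf{D}\) to the identity and \(\textbf{B}\), \(\textbf{C}\) to zero (so that \(\textbf{W}' = \textbf{W}\) before training), treating all five matrices as square and trainable, and omitting the block-diagonal design are our illustrative assumptions rather than the exact training configuration.

```python
import torch
import torch.nn as nn


class MobiusInspiredLinear(nn.Module):
    """Frozen linear layer with a Möbius-inspired weight transformation:
    W' = (A R + B)(C R + D)^{-1} W  (Eq. 4), with d = n assumed."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():    # keep the pre-trained weights frozen
            p.requires_grad = False
        d = base.out_features
        eye, zero = torch.eye(d), torch.zeros(d, d)
        self.A = nn.Parameter(eye.clone())   # identity init
        self.B = nn.Parameter(zero.clone())  # zero init
        self.C = nn.Parameter(zero.clone())  # zero init
        self.D = nn.Parameter(eye.clone())   # identity init
        self.R = nn.Parameter(eye.clone())   # identity init, learnable (as in Eq. 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        numerator = self.A @ self.R + self.B
        denominator = self.C @ self.R + self.D
        M = numerator @ torch.linalg.inv(denominator)   # (A R + B)(C R + D)^{-1}
        W = M @ self.base.weight                        # transform the frozen weight matrix
        return nn.functional.linear(x, W, self.base.bias)
```

With this initialisation the transformation equals the identity, so fine-tuning starts exactly from the pre-trained model.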

Table 1 Comparison of different methods based on various quantitative metrics
Fig. 3 Qualitative comparison of our method with existing subject-driven text-to-image generation methods: DreamBooth (Ruiz et al., 2023), LoRA (Hu et al., 2021) and OFT (Qiu et al., 2023)

4 Experimental Setup

Our method is evaluated in different settings to show its effectiveness and robustness. For text-to-image generation tasks, we use Stable Diffusion v1.5, with most settings aligned with those outlined in the original OFT paper (Qiu et al., 2023). More specifically, for subject-driven generation, we use DreamBooth (Ruiz et al., 2023), LoRA (Hu et al., 2021), and OFT (Qiu et al., 2023) as the baselines; all of these methods, including ours, follow the training settings (e.g., loss function and hyper-parameters) of the original DreamBooth paper. We utilise the official DreamBooth dataset, which comprises 30 subjects and 15 unique classes. Each subject is depicted by several images, supplemented by 25 distinct text prompts. For controllable generation, ControlNet (Zhang et al., 2023), T2I-Adapter (Mou et al., 2024) and LoRA (Hu et al., 2021) are used as the baselines. Three challenging tasks are considered in this paper: segmentation-to-image, edge-to-image and landmark-to-face generation. The Stable Diffusion v1.5 model is fine-tuned for 20 epochs on the task-specific datasets, namely ADE20K (Zhou et al., 2017), COCO (Lin et al., 2014) and CelebA-HQ (Karras et al., 2017); more details can be found in the original OFT paper. We also evaluate the proposed method on text-to-motion tasks, following the baseline from MotionGPT (Zhang et al., 2024); LoRA is replaced by our proposed method, and the HumanML3D (Guo et al., 2022) dataset is used for training. Moreover, we also evaluate the proposed PEFT method on conventional mask generation tasks (i.e., image segmentation) to further demonstrate its generalisability. We use DINOv2 (Oquab et al., 2023) as the pre-trained backbone and directly use ADE20K (Zhou et al., 2017) to fine-tune the encoder along with a task-specific decoder; the encoder is fine-tuned with the help of PEFT, and we compare the effectiveness of the proposed method and LoRA.

5 Results and Discussion

We evaluated our method across several key multimodal generative computer vision tasks.

5.1 Subject-Driven Generation

Subject-driven generation focuses on creating images of a specific subject, using just a few reference images, and placing that subject in different contexts guided by a text prompt. This task is challenging because it requires the model to accurately capture the subject’s unique features and then recreate them consistently across various scenarios. We follow previous work (Qiu et al., 2023) in computing the DINO, CLIP-I, CLIP-T and LPIPS metrics. Specifically, CLIP-I evaluates the average cosine similarity between the CLIP embeddings of generated and real images, while DINO performs a similar assessment using ViT-S/16 DINO embeddings; both are used to evaluate subject fidelity. Meanwhile, CLIP-T measures the average cosine similarity between the embeddings of the text prompts and the generated images, which is used to evaluate text prompt fidelity. Additionally, LPIPS evaluates the average similarity between generated images of the same subject under identical text prompts, which is used to assess sample diversity. As detailed in Table 1, our proposed method outperforms OFT, DreamBooth and LoRA on the DINO, CLIP-I and LPIPS metrics, and also achieves comparable results in terms of text prompt fidelity and sample diversity. To better illustrate the benefits of our method, we showcase several randomly selected examples of subject-driven generation. For a fair comparison, all examples were generated using the same fine-tuned model for each method with the same text prompt. From the results in Fig. 3, our approach exhibits a clear advantage in maintaining semantic subject consistency. Through our observations, we noted that both LoRA and DreamBooth encounter significant challenges in preserving subject identity. DreamBooth, in particular, struggles to maintain both identity retention and alignment with the corresponding text prompts. LoRA shows a modest capacity to retain the original subject’s identity; however, this is limited and insufficient for comprehensive subject preservation. OFT provides enhanced precision in controlling outputs through text prompts; however, it falls short in preserving subtle details and lacks a deeper modelling of the intrinsic nature of objects. Most existing methods, including these, cannot achieve a balance between accuracy and diversity. Our method addresses this issue by enhancing both aspects during the generation process. This qualitative comparison underscores the high-quality generation and subject preservation achieved by our approach.
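For illustration, the image-fidelity metrics above reduce to an average cosine similarity between two sets of embeddings; the sketch below shows this computation with random tensors standing in for CLIP or DINO features, which is an assumption made purely for demonstration.

```python
import torch
import torch.nn.functional as F


def mean_pairwise_cosine(gen_emb: torch.Tensor, real_emb: torch.Tensor) -> float:
    """Average cosine similarity over all generated/real embedding pairs,
    as used for CLIP-I (CLIP image embeddings) and DINO (ViT-S/16 embeddings)."""
    gen = F.normalize(gen_emb, dim=-1)      # (N_gen, D)
    real = F.normalize(real_emb, dim=-1)    # (N_real, D)
    return (gen @ real.T).mean().item()     # mean over all N_gen x N_real pairs


# Stand-in embeddings; a real evaluation would use CLIP/DINO image features.
print(mean_pairwise_cosine(torch.randn(4, 512), torch.randn(5, 512)))
```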

Fig. 4 Qualitative comparison of the proposed method with existing controllable text-to-image generation methods: ControlNet (Zhang et al., 2023), T2I-Adapter (Mou et al., 2024), LoRA (Hu et al., 2021) and OFT (Qiu et al., 2023)

Fig. 5 Comparison of segmentation results using different methods. From top to bottom: original input images, ground truth (GT), ground truth segmentation (GT-Seg), results using LoRA for parameter-efficient fine-tuning, and results using our proposed PEFT method. Our method shows improved segmentation in complex scenes, demonstrating its advantage over LoRA in capturing finer details while maintaining efficiency

Fig. 6 Randomly selected text-to-motion samples generated using the Möbius-inspired transformation for fine-tuning LLaMA-7B on the HumanML3D dataset, presented in comparison to ground truth results

5.2 Controllable Generation

Controllable generation entails producing images that adhere to an additional control signal, such as Canny edges or segmentation maps, in conjunction with a textual prompt. This task requires exceptional precision, as the model must comply with the specified control signal while producing high-quality images. Specifically, we follow the previous work (Qiu et al., 2023) and show the model generations in Fig. 4. We compare our method to existing state-of-the-art methods, including OFT (Qiu et al., 2023), ControlNet (Zhang et al., 2023), LoRA (Hu et al., 2021) and T2I-Adapter (Mou et al., 2024). The landmark-to-face, Canny-edge-to-image and segmentation-to-image tasks are denoted as L2F, C2I and S2I, respectively. In the L2F task, both OFT and our method produce accurate view control for the generated faces under challenging references. In the C2I task, our method is able to hallucinate semantically similar images from rough Canny edges, while the other methods also reach competitive performance. In the S2I task, LoRA fails to generate images that adhere to the input segmentation map, whereas the other methods effectively control the generated images. Our approach not only offers precise controllability but also produces high-fidelity images with greater detail.

5.3 Mask Generation

We compare the proposed Möbius-inspired transformation PEFT method with LoRA to evaluate the effectiveness of fine-tuning in Fig. 5. Visually comparing LoRA and our proposed PEFT method shows that both methods are able to approximate the ground truth fairly well. However, there are noticeable improvements in finer details when using our method, particularly in complex scenes with multiple objects. This indicates that our PEFT approach may be more effective in retaining important fine details, potentially because it better guides the model to handle the relationships between object features. It is also worth mentioning that, across different types of images, our PEFT method maintains relatively high consistency in performance. For example, in images with cluttered environments or challenging lighting conditions, the model demonstrates robust segmentation capabilities.

5.4 Text-to-Motion Generation

Following the settings in Zhang et al. (2024), we also conducted a fine-tuning text-to-motion generation experiment. We replaced LoRA with our proposed Möbius-inspired transformation to fine-tune the pre-trained Llama 2 7B model on the HumanML3D (Guo et al., 2022) dataset, enabling it to generate motion tokens that can be decoded by a pre-trained motion decoder. In Fig. 6, the visualised results demonstrate that the fine-tuned model’s motions align adequately with the provided poses and remain consistent with the textual descriptions. Applying the proposed transformation to the fine-tuned Llama 2 7B model improves text consistency and motion completeness, leveraging the model’s capabilities to generate more detailed and comprehensive motion sequences, making it effective for controlled motion generation. Although the overall motion consistency appears natural, the proposed adapter occasionally produces motion sequences that do not exactly correspond to the text descriptions.

6 Conclusion

This paper introduces a novel fine-tuning approach for multi-modal generative models using Möbius-inspired transformations. By replacing traditional orthogonal transformations, our method effectively captures complex and hierarchical data patterns. The experiments show notable improvements in generation quality for tasks such as subject-driven generation, controllable generation, mask generation and text-to-motion generation. This approach addresses the computational and practical limitations of existing fine-tuning methods, offering a promising direction for future advancements in the efficient fine-tuning of large multimodal generative models. While the Möbius-inspired fine-tuning method has demonstrated notable improvements in capturing complex data patterns, several avenues for future research remain. One direction is to explore how this transformation framework can be extended to other generative architectures, such as autoregressive models and GANs, to assess its adaptability across different multimodal frameworks. Additionally, integrating this method with retrieval-augmented generation or graph-augmented generation approaches could further enhance the ability of the models to reason over structured and unstructured data sources. Further research could also investigate the scalability of such transformations in distributed or federated learning settings. By focusing on minimising communication overhead while maintaining model precision, this could broaden the application of fine-tuning strategies in resource-constrained environments. Furthermore, investigating the method’s performance on tasks beyond content generation, such as multimodal perception or predictive analytics via fine-tuning pre-trained large models, could offer insights into its robustness across various domains. Moreover, future work could focus on improving the interpretability of Möbius-inspired fine-tuning strategies. By analysing how this transformation manipulates parameter spaces and affects the learning process, researchers could develop more transparent and explainable generative models.