1 Introduction

Multimodal-to-image diffusion models have drawn considerable research interest owing to their remarkable generalisation abilities. Notably, advanced vision-language models, such as Stable Diffusion (Rombach et al., 2022), Imagen (Saharia et al., 2022), and LLaVA (Liu et al., 2024), can generate high-fidelity images by incorporating various modalities. By integrating additional rich modalities such as segmentation maps, Canny edges, and landmarks, these models leverage diffusion processes to achieve state-of-the-art generative performance with enhanced control (Zhang et al., 2023). For context-specific downstream tasks, full fine-tuning of these pre-trained models is frequently employed to attain satisfactory performance while preserving their diverse generative capabilities. However, such approaches necessitate retraining all model parameters (Hu et al., 2022; Xia et al., 2024), resulting in substantial memory demands, a large volume of fine-tuning data, and long training times.

Parameter-efficient fine-tuning (PEFT) provides a more resource-effective alternative by optimising only a minimal number of parameters. Approaches such as adapter modules (Houlsby et al., 2019; Dettmers et al., 2024), low-rank factorisation (Liu et al., 2024; Hayou et al., 2024; Valipour et al., 2023; Hu et al., 2022), and prompt tuning (Li and Liang, 2021; Liu et al., 2021; Lester et al., 2021) selectively adjust only a subset of the model’s parameters, thus largely reducing the computational burden and resource requirements. Among PEFT methods, Low-Rank Adaptation (LoRA) (Hu et al., 2022) is notably popular for its efficacy and simplicity. LoRA introduces a limited number of learnable parameters without modifying the model architecture: it freezes the original model weights and injects trainable low-rank matrices into the weight updates, thereby enabling models to be trained over a small number of epochs while maintaining high-fidelity and diverse generative performance comparable to the pre-trained models. Moreover, several weight decomposition methods have been proposed to quantify the preservation of pre-trained generative capabilities. One such method reparameterises model weights into magnitude and directional components to capture the distinct patterns of updates during fine-tuning (Liu et al., 2024).

Another representative approach applies orthogonal transformations to neurons to maintain pairwise angular relationships between fine-tuned and pre-trained models (Qiu et al., 2023). Empirical evidence supports the concept of overall hyperspherical similarity, known as hyperspherical energy (Liu et al., 2017), which is often characterised by the pairwise relational structure, such as the cosine similarity among neurons. By maintaining the angle between neuron pairs, orthogonal fine-tuning (OFT) effectively limits the degrees of freedom in the orientations of paired neurons. This constraint guarantees the preservation of angular correlations between neuron pairs, thus retaining the knowledge of the pre-trained model in the fine-tuned model. OFT nevertheless remains parameter-inefficient, which motivated researchers to impose a block-diagonal structure on the orthogonal matrix. This approach reduces the number of trainable parameters while still permitting a subset of orthogonal transformations (Qiu et al., 2023). By balancing the preservation of pre-trained capabilities with the integration of new task-specific features, these methods push the boundaries of what generative models can achieve, ensuring high-quality and diverse outputs while addressing computational and practical limitations.

In this paper, we introduce a novel PEFT approach inspired by Möbius geometry to enhance the flexibility and expressiveness of fine-tuning large multimodal generative models. LoRA, while reducing the computational load by constraining updates to a low-dimensional subspace, may not fully capture the complexity of high-dimensional data. OFT preserves the geometric structure of pre-trained weights, but is restricted by the constraints of the Stiefel manifold and the computational overhead of maintaining orthogonality during training. These limitations hinder the ability of both methods to represent highly non-linear and complex data patterns. Our method is motivated by Möbius geometry, which operates on the Riemann sphere and can benefit from hyperbolic models such as the Poincaré ball (Nickel and Kiela, 2017). This allows for more flexible and powerful non-linear mappings, enabling the model to capture complex patterns and hierarchical structures effectively. The integration of Möbius-style transformations into model fine-tuning is particularly beneficial for tasks involving intricate data distributions. To summarise our contributions: the proposed method is first supported by our analysis highlighting the advantages of Möbius geometry; inspired by this, an illustrative validation further confirms that the Möbius-guided transformation outperforms traditional OFT fine-tuning with the Cayley transformation, demonstrating superior generalisation capabilities. Most notably, we introduce a Möbius-inspired transformation for fine-tuning large multimodal generative models, enabling the generation of high-quality visual signals from diverse inputs. Our method captures more complex relationships within the data, leading to improved performance. The proposed method is evaluated on primary tasks such as subject-driven generation, controllable generation, and human motion generation. Our extensive experiments demonstrate improvements over previous methods with respect to generation quality.

2 Related Work

This section reviews advancements in multi-modal generative models, focusing on diffusion models for precise visual editing and efficient fine-tuning methods that boost performance with minimal computational cost.

Multi-modal Generative Model  Multimodal synthesis has emerged as a preeminent field of research, showcasing capabilities in generating human-like natural language (Brown et al., 2020), high-quality images (Karras et al., 2020; Rombach et al., 2022), videos (Blattmann et al., 2023; Ho et al., 2022), 3D models (Lin et al., 2023; Liu et al., 2023, 2024), speech (Tan et al., 2024; Van Den Oord et al., 2016; Shen et al., 2018) and music (Dhariwal et al., 2020; Copet et al., 2024). Within these, the inherently high-dimensional nature of images introduces considerable complexities to the domain of generative modelling. Generative Adversarial Networks (GANs) (Goodfellow et al., 2020) prevail in producing high-resolution images with impressive perceptual quality (Brock et al., 2018; Zhang et al., 2017; Karras et al., 2020), yet they struggle with optimization challenges (Gulrajani et al., 2017; Karras et al., 2017, 2020) and often fail to capture the full data distribution (Metz et al., 2016). In response, likelihood-based models like Variational Autoencoders (VAEs) (Kingma and Welling, 2013) and flow-based models (Dinh et al., 2014, 2016; Papamakarios et al., 2017) emphasise accurate density estimation, facilitated by more stable optimization trajectories, though their output quality typically lags behind that of GANs (Kingma and Dhariwal, 2018; Vahdat and Kautz, 2020). Autoregressive models (ARMs) (Chen et al., 2020; Child et al., 2019; Oord et al., 2016; Van Den Oord et al., 2016), while achieving strong density-estimation performance, suffer from computational intensity (Vaswani et al., 2017; Jouppi et al., 2017) and slow processing due to their reliance on sequential sampling and detailed, pixel-based representations, which prolong training and demand substantial computational resources, limiting them to low-resolution multimedia data synthesis. The intrinsic difficulties with pixel-based image representations include the modelling of minute, high-frequency details (Salimans et al., 2017) that are often imperceptible, leading maximum-likelihood training methods to disproportionately allocate computational resources to these elements, thereby prolonging training duration. To circumvent these limitations and effectively scale to higher resolutions, innovative methodologies have been developed. Notably, several two-stage approaches (Esser et al., 2021; Razavi et al., 2019) have been proposed, where ARMs are employed not to model raw pixel data but rather a compressed latent image space. This strategic shift not only reduces the computational costs but also enhances the efficiency and scalability of generating high-resolution images, promising more practical application in advanced generative tasks. Lately, Diffusion Probabilistic Models (DPMs) (Sohl-Dickstein et al., 2015) have established new benchmarks in both density estimation (Kingma et al., 2021) and sample quality (Dhariwal and Nichol, 2021). Their effectiveness in generating high-quality images is largely attributed to the use of a UNet-based architecture and the diffusion process, which leverages the inductive biases inherent in image data. Typically, these models are evaluated and refined within the pixel domain, where DPMs function as generative models that iteratively denoise samples. Optimal synthesis outcomes are often achieved by employing a reweighted training objective. However, this approach has the trade-offs of slow inference speeds and significant computational costs during training.
While advanced sampling techniques and hierarchical methods can partially improve inference speed, training models on high-resolution images still requires the computation of costly gradients. Nevertheless, DPMs offer a compelling alternative to traditional ARMs, GANs, and VAEs. They produce high-quality, diverse outputs and show robustness against common issues such as mode collapse and blurriness seen in other generative approaches. Despite their computational intensity, these trade-offs are offset by the significant improvements in output quality and stability, making them a valuable choice in fields where fidelity and variety are crucial. In this work, we mainly focus on diffusion-based models.

Controllable Generation Diffusion Models Controllable text-to-image diffusion models (T2I DMs) (Saharia et al., 2022; Ramesh et al., 2022; Rombach et al., 2022) have gained popularity due to their impressive ability to generate high-quality visual content, attributed to their versatility, expressiveness, and user-friendly interface. Traditionally, GANs have been at the forefront of controllable image generation, offering high-quality results and an ordered, interpretable latent space. Despite their successes, the main challenge with GANs is the difficulty in controlling fine-grained visual details. Diffusion-based models have emerged as a promising alternative to GANs, offering greater flexibility and controllability. However, unlike GANs, diffusion models do not inherently provide an ordered and interpretable latent space, which has sparked research interest in enhancing their controllability.

The recent advancements in technology facilitate a nuanced level of control over image manipulation, enabling operations ranging from concept interpolation (Brack et al., 2024; Gandikota et al., 2023; Hertz et al., 2023; Kawar et al., 2023) and instance penalisation (Ruiz et al., 2023) to controllable motion generation (Dai et al., 2024) and image editing (Brooks et al., 2023; Yang et al., 2024). The expectation in image editing is that modifications should be localised, affecting only selected aspects of the image. Prior research has underscored the challenge of adequately disentangling multiple concepts within a sample to avert global alterations, thus allowing for focused, localised edits. Notable works such as SDEdit (Meng et al., 2022), which adds intermediate noise to an image and denoises it based on the desired edit, and DDIB (Su et al., 2023), which utilizes DDIM inversion for encoding and decoding images, have paved the way for more sophisticated editing techniques. DiffusionCLIP (Kim et al., 2022) leverages language-vision model gradients, DDIM inversion, and model fine-tuning for domain-specific editing, while Liu et al. (2023) guide the diffusion process using text and image inputs to synthesise similar images aligned with the given text. Hertz et al. (2023) take a different approach by manipulating cross-attention layers in text-to-image diffusion models for fine-grained control.

Instance penalisation (subject-driven generation) has also drawn significant attention in controllable image editing, with models like DreamBooth (Ruiz et al., 2023) adapting the visual backbone using a prior preservation loss for specific instances, and Textual Inversion (Gal et al., 2023) optimizing an added embedding vector to represent specific instances or concepts. CustomDiffusion (Kumari et al., 2023) proposes efficient training of multiple concepts by fine-tuning only the cross-attention layers. The CLIP embedding space has also been extensively utilised to steer the generation process.

Text-based image editing has advanced significantly with models like Textual Inversion (Gal et al., 2023) and DreamBooth (Ruiz et al., 2023) synthesising novel views of given subjects using a few images and a target text. Imagic (Kawar et al., 2023) maintains high fidelity to the input image while applying non-rigid edits based on a single natural language prompt. P2P (Yang et al., 2024) and its extension, InstructPix2Pix (Brooks et al., 2023), perform structure-preserving editing using Stable Diffusion models and allow for human-like instructions. Structure preservation has been a key focus in editing techniques, with pix2pix-zero (Brooks et al., 2023) proposing noise regularization and cross-attention guidance to retain image structure, and StyleDiffusion (Li et al., 2023) introducing a mapping network to modify cross-attention computation for regularisation. Automatic mask generation has also been explored, with DiffEdit (Couairon et al., 2023) generating masks by contrasting predictions conditioned on different text prompts and GSAM-inpaint (Yu et al., 2023) detecting masks using pretrained segmentation models. Manipulating cross-attention layers or spatial features to preserve image structure has also been demonstrated in models like P2P (Yang et al., 2024), PnP (Tumanyan et al., 2023), and MnM (Patashnik et al., 2023).

Parameter-Efficient Fine-Tuning Fine-tuning multimodal generative models remains indispensable for optimal task-specific performance. Given the computational constraints inherent in large-scale models, PEFT methodologies have gained traction, enabling selective parameter adjustment while preserving model integrity and efficacy. LoRA (Hu et al., 2022) has emerged as a seminal work in PEFT that introduces low-rank matrices to approximate the updates to the pre-trained model weights, achieving a good balance between efficiency and effectiveness. Many variants of LoRA have been proposed, such as AdaLoRA (Zhang et al., 2023), which prunes the singular values of less important updates; this approach is crucial for reducing the parameter budget while avoiding computationally intensive exact SVD calculations. Other variants include IncreLoRA (Zhang et al., 2023) and DyLoRA (Valipour et al., 2023), which dynamically adjust the LoRA rank distribution to improve tuning efficiency; QLoRA (Dettmers et al., 2024), which combines LoRA with model quantization to further save computing resources; LoRA+ (Hayou et al., 2024) and PrecLoRA (Zhang and Pilanci, 2024), which study the optimization landscape of LoRA training; and the more recent variant DoRA (Liu et al., 2024), which decomposes pre-trained weights into magnitude and direction components and applies LoRA for direction tuning. To address the theoretical limitations of LoRA concerning the preservation of pre-training knowledge, OFT (Qiu et al., 2023; Liu et al., 2024) has been proposed. OFT transforms neuron vectors within the same layer using a set of orthogonal matrices, preserving the pairwise angles between neuron vectors. Recent hypotheses postulate that an optimal fine-tuned model should exhibit minimal deviations in hyperspherical energy compared to its pre-trained counterpart. Drawing inspiration from the empirical observation that hyperspherical similarity encodes semantic information well (Liu et al., 2017, 2018) and that angular feature difference characterises the semantic gap (Chen et al., 2020), OFT underscores the crucial role of neuron angles (directions) in encoding semantic information, proposing methods such as Constrained Orthogonal Fine-tuning (COFT) to preserve the pairwise angles between neurons while constraining the fine-tuned model within a fixed radius of the pre-trained model. However, the number of trainable parameters in OFT can be quite large due to the high dimensionality of orthogonal matrices. To address this issue, BOFT (Orthogonal Butterfly) (Liu et al., 2024) is introduced as an extension of OFT that generates a dense orthogonal matrix using butterfly factorization, achieving better parameter efficiency. While existing PEFT methods for multimodal generative models have achieved significant success in reducing fine-tuning costs, their ability to guide the generative model across diverse data and feature patterns still has room for improvement; addressing this limitation is the focus of our work and is discussed in subsequent sections.

3 Method

To overcome the expressiveness limitations of existing PEFT methods (e.g., OFT and LoRA), we propose leveraging Möbius geometry to enhance the model’s ability to fine-tune weights while capturing complex data relationships. This section formalises the LoRA and OFT frameworks, outlining their advantages and limitations. We examine the constraints of orthogonal transformations within the Stiefel space (Atiyah and Todd, 1960), introduce a Möbius-inspired transformation as a promising alternative, and analyse its properties and benefits using synthesised data. Finally, we propose a Möbius-inspired transformation as a PEFT method for fine-tuning multimodal generative vision models to generate novel visual signals from input prompts and references.

3.1 Preliminaries

Given a large pre-trained multi-modal generative model, let \({\textbf{W}} \in {\mathbb {R}}^{d \times n}\) denote the pre-trained weight matrix. Recent research has extensively explored PEFT techniques for large-scale multimodal generative models. These techniques aim to reduce the number of trainable parameters while preserving or enhancing model performance. By minimising the number of parameters that need to be updated during fine-tuning, PEFT methods make the adaptation of large-scale models more computationally feasible and memory efficient. This is crucial for deploying models in resource-constrained environments and for scaling to even larger models without incurring prohibitive computational costs.

The LoRA framework defines the updated weight matrix \(\textbf{W}'\) as:

$$\begin{aligned} \textbf{W}' = \textbf{W} + \Delta = \textbf{W} + \textbf{BA}, \end{aligned}$$
(1)

where \(\textbf{W}\) is the pre-trained and frozen weight matrix, and \(\Delta = \textbf{BA}\) represents the low-rank update. Here, \(\textbf{B} \in \mathbb {R}^{d \times r}\) and \(\textbf{A} \in \mathbb {R}^{r \times n}\) are the low-rank matrices with \(r \ll \min (d, n)\). This decomposition constrains the updates to lie in a low-dimensional subspace, significantly reducing the number of parameters that need to be trained. By focusing on the low-rank adaptation, LoRA achieves substantial computational savings and maintains or improves model performance across various tasks. LoRA has been effectively applied in domains such as natural language processing and generative computer vision, where large pre-trained models are common.
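For concreteness, the following minimal PyTorch sketch wraps a frozen linear layer with a LoRA-style low-rank update as in Eq. (1); the class name, rank \(r\), scaling factor, and initialisation are illustrative assumptions rather than details of the original LoRA implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W' = W + BA (Eq. 1)."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pre-trained weights W
            p.requires_grad = False
        d, n = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, n) * 0.01)  # A in R^{r x n}
        self.B = nn.Parameter(torch.zeros(d, r))          # B in R^{d x r}; zero init keeps W' = W at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the low-rank correction x A^T B^T, scaled by alpha / r.
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```

Only \(\textbf{A}\) and \(\textbf{B}\) receive gradients, so the number of trainable parameters per layer drops from \(dn\) to \(r(d+n)\).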

However, while LoRA excels in reducing computational complexity, it may inadvertently introduce uncertainty into the feature learning process and destabilise neuron dynamics, potentially impacting the model’s generalisation ability. To counteract this issue, OFT enforces orthogonality constraints on the weight updates, promoting diversity and independence among the learned features. This orthogonalisation helps to preserve a rich and decorrelated learning space, enhancing the robustness and generalisation capabilities of the model. OFT uses orthogonal transformations to fine-tune neural networks while preserving pairwise neuron similarities. The fine-tuning is defined by:

$$\begin{aligned} \textbf{W}' = \textbf{R}\textbf{W}, \quad \text{s.t. } \textbf{R}^{\top } \textbf{R}=\textbf{R} \textbf{R}^{\top }=\textbf{I}, \end{aligned}$$
(2)

where \(\textbf{R}\) is an orthogonal matrix and \(\textbf{I}\) is the identity matrix. The objective of OFT is to ensure that the transformation matrix \(\textbf{R}\) maintains the orthogonality constraints, thereby preserving the geometric structure of the pre-trained model’s weights. This approach has been shown to enhance the stability of the fine-tuning process and improve the model’s performance by maintaining the hyperspherical energy of the neural network. Specifically, the Cayley parameterization is utilized to construct the orthogonal matrix efficiently, with \(\textbf{R} = (\textbf{I} + \textbf{Q})(\textbf{I} - \textbf{Q})^{-1}\), where \(\textbf{Q}\) is a skew-symmetric matrix, ensuring the orthogonality of \(\textbf{R}\). Despite its advantages, OFT faces challenges due to the constraints on the learning capacity imposed by the Stiefel space. As a result, the hypothesis space is limited to rotations and reflections, which might not be expressive enough to capture the full complexity of data, especially in highly non-linear or high-dimensional settings. The optimisation often involves methods like the Gram-Schmidt process or other orthogonalisation techniques such as Givens rotations or Householder reflections, which can be more complex and may slow down convergence.
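The sketch below illustrates the Cayley parameterisation described above in PyTorch; building the skew-symmetric \(\textbf{Q}\) as \(\textbf{S} - \textbf{S}^{\top}\) from an unconstrained parameter, initialising it at zero, and ignoring the block-diagonal speed-up are our own simplifying assumptions.

```python
import torch
import torch.nn as nn


class CayleyOrthogonalLinear(nn.Module):
    """Orthogonal fine-tuning of a frozen linear layer: W' = R W (Eq. 2),
    with R = (I + Q)(I - Q)^{-1} and Q skew-symmetric (Cayley parameterisation)."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # the pre-trained weights stay frozen
            p.requires_grad = False
        d = base.out_features
        self.S = nn.Parameter(torch.zeros(d, d))  # unconstrained; Q = 0 gives R = I at the start

    def orthogonal_matrix(self) -> torch.Tensor:
        Q = self.S - self.S.T                      # skew-symmetric by construction (Q^T = -Q)
        I = torch.eye(Q.shape[0], device=Q.device, dtype=Q.dtype)
        return (I + Q) @ torch.linalg.inv(I - Q)   # Cayley transform, orthogonal for any skew Q

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.orthogonal_matrix() @ self.base.weight  # rotate the frozen weight matrix
        return nn.functional.linear(x, W, self.base.bias)
```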

Fig. 1 Comparison between the Cayley and Möbius-inspired transformations in handling both Euclidean and hyperbolic data, showcasing the better generalisation and adaptability of the Möbius-inspired transformation for robust modelling in different distribution settings. The training and testing data for each column are denoted as E/E, H/E, E/H, H/H, where E stands for Euclidean data and H for hyperbolic data (Poincaré ball). When the training and testing data distributions are aligned (E/E and H/H), both the Cayley and Möbius-inspired transformations are able to fit the data, with the Möbius-inspired transformation performing slightly better than the Cayley transformation. However, when there is a discrepancy between the training and testing data distributions (E/H and H/E), the Cayley transformation struggles to capture the hierarchical structure of the hyperbolic data

3.2 Möbius-Inspired Transformation

To address the above limitations, we propose improving the fine-tuning framework by leveraging Möbius geometry, which offers the following advantages over traditional orthogonal transformations: 1) Möbius transformations originally operate in the complex plane, represented as the Riemann sphere, allowing for more flexible and powerful transformations; Möbius geometry is therefore inherently suited to non-linear mappings, enabling the model to capture complex patterns and relationships in the data more effectively. 2) These transformations can be extended to hyperbolic models, such as the Poincaré ball, which are well suited to representing hierarchical and structured data. By extending the hypothesis space beyond the constraints of the Stiefel manifold, Möbius geometry provides a richer set of non-linear transformations, enhancing the model’s expressiveness.

Definitions and Properties The Möbius transformation, a unifying framework for analysing planar geometries and originally fundamental in complex analysis, is a function of the form:

$$\begin{aligned} f(z) = \frac{az + b}{cz + d}, \end{aligned}$$
(3)

where \(a\), \(b\), \(c\), and \(d\) are complex numbers and \(ad - bc \ne 0\). Geometrically, Möbius transformations can be interpreted as transformations of the Riemann sphere, which is the complex plane augmented by a point at infinity. This sphere provides a compact representation of the extended complex plane, enabling Möbius transformations to be interpreted as compositions of rotations, translations, and inversions. Möbius transformations include rotations around the origin and reflections, corresponding to orthogonal transformations in higher-dimensional spaces. They also include translations and inversions, providing greater flexibility than orthogonal transformations. Moreover, Möbius transformations are bijective conformal maps that preserve angles and the structure of the complex plane, forming a group under composition known as the Möbius group or the projective special linear group. They are conformal, meaning they maintain angles between curves, and they map circles and lines to other circles and lines, preserving geometric structures such as curvature and straightness. In the context of fine-tuning generative models, in which all operations take place in the weight space, we want to utilise this property to ensure that key geometric features, such as boundaries or relations, are maintained during transformations, contributing to the stability and consistency of generated outputs. In this paper, we believe the rationale behind the Möbius-inspired transformation provides the theoretical motivation for enhanced non-linearity and has the potential to aid in fine-tuning model weights by coordinating diverse data distributions and feature patterns. The Möbius transformation is well known for its ability to capture complex relationships in the complex plane due to its non-linear structure, making it adept at handling intricate mappings and transformations; this suggests that, even outside the context of complex numbers, the structured form of the Möbius transformation may still be effective in managing complex, non-linear patterns in data. Therefore, in the context of deep learning models, such a formulation can help preserve the geometric structure of the pre-trained model’s weights. We first use an example to evaluate this assumption, and then propose a method based on the Möbius-inspired formulation for PEFT of a multimodal generative vision model.
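To make Eq. (3) concrete, the short sketch below applies a Möbius transformation to points on the unit circle and checks the non-degeneracy condition \(ad - bc \ne 0\); the coefficients are arbitrary examples chosen for illustration, not values used in our method.

```python
import numpy as np


def mobius(z: np.ndarray, a: complex, b: complex, c: complex, d: complex) -> np.ndarray:
    """Möbius transformation f(z) = (a z + b) / (c z + d) on the complex plane (Eq. 3)."""
    if a * d - b * c == 0:
        raise ValueError("Degenerate map: ad - bc must be non-zero.")
    return (a * z + b) / (c * z + d)


# Example: points on the unit circle are mapped to another circle (or line),
# illustrating the circle-preserving property discussed above.
theta = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
circle = np.exp(1j * theta)
print(mobius(circle, a=1 + 0j, b=1j, c=0.5j, d=1 + 0j))
```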

Fig. 2 Overview of our method in comparison to existing approaches: a LoRA fine-tunes large models by factorizing the weight update matrix into two smaller matrices, \(A\) and \(B\), reducing the number of trainable parameters while preserving performance. b Orthogonal fine-tuning uses the Cayley transformation to update model weights within an orthogonal subspace. c Our proposed method applies a Möbius geometry-inspired transformation, offering non-linearity that enables more flexible and efficient parameter adjustments

Analysis with Toy Examples To evaluate the effectiveness of Möbius-inspired transformations in handling complex distributions, we synthesised toy datasets and fitted models using both Cayley and Möbius-inspired transformations. We visualised the data distributions to compare the theoretical properties and practical applications of these transformations. As shown in Fig. 1, the upper and lower sections represent the Cayley and Möbius-inspired transformations, respectively. The training and testing data for each column are denoted as E/E, H/E, E/H, H/H, where E stands for Euclidean data and H for hyperbolic data (Poincaré ball). When the training and testing data distributions are aligned (E/E and H/H), both Cayley and Möbius-inspired transformations are able to fit the data, with the Möbius-inspired transformation performing slightly better than the Cayley transformation. However, when there is a discrepancy between the training and testing distributions (E/H and H/E), the Cayley transformation struggles to capture the hierarchical structure of the hyperbolic data. This limitation is due to its linear nature and the constraints imposed by orthogonal transformations, which are not well suited to the inherent curvature and hierarchy of the feature space. In comparison, the Möbius-inspired transformation fits the complex data distribution closely. Its ability to perform complex, non-linear transformations allows it to capture the intricate relationships and the curvature of the Poincaré ball effectively, thereby demonstrating the greater flexibility and expressiveness of Möbius-inspired transformations in representing hierarchical structures. This example demonstrates that models based on Möbius-inspired transformations perform better at fitting complex and non-linear data distributions, illustrating that such transformations can effectively capture and organise intricate data distributions and features. Consequently, employing Möbius-inspired transformations to manipulate model parameters enables the model to better adapt and fine-tune its representation of the data, leading to improved generation of high-quality, diverse samples. By leveraging the flexibility and robustness of Möbius-inspired transformations, we can enhance the model’s ability to handle complex and multimodal data structures, achieving more accurate and realistic generation.
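As a rough sketch of how such a toy comparison can be set up (under our own simplifying assumptions about the sampling profile, loss, and optimiser, which are not the exact protocol behind Fig. 1), one can sample points either in Euclidean space or inside a Poincaré ball and fit any parametric map, e.g., a Cayley- or Möbius-parameterised transformation like those sketched above, by minimising a regression loss:

```python
import torch
import torch.nn as nn


def sample_euclidean(n: int, dim: int = 2) -> torch.Tensor:
    """Synthetic Euclidean data: standard Gaussian points."""
    return torch.randn(n, dim)


def sample_poincare_ball(n: int, dim: int = 2, radius: float = 0.95) -> torch.Tensor:
    """Synthetic hyperbolic-style data: points inside a Poincaré ball, biased toward the boundary."""
    direction = nn.functional.normalize(torch.randn(n, dim), dim=1)
    r = radius * torch.rand(n, 1) ** 0.25      # crude boundary-heavy radial profile
    return direction * r


def fit_map(transform: nn.Module, x: torch.Tensor, y: torch.Tensor,
            steps: int = 500, lr: float = 1e-2) -> float:
    """Fit a parametric map to (x, y) pairs by minimising mean-squared error."""
    opt = torch.optim.Adam(transform.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(transform(x), y)
        loss.backward()
        opt.step()
    return float(loss)
```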

3.3 Möbius-Inspired Transformation in Generative Fine-tuning

The goal of this work is to leverage the rationale behind Möbius geometry transformations to improve weight transformations (e.g., Cayley transformations) during the fine-tuning of generative models. This extension to a higher-dimensional geometric interpretation allows for capturing more complex relationships within data, making it particularly suitable for applications requiring advanced modelling capabilities. Although Möbius transformations are traditionally applied in the complex plane, their inherent non-linearity and rich mathematical structure offer valuable inspiration for broader applications. Instead of directly using the complex form, we adapt their key properties, such as handling complex mappings and preserving geometric relationships, to manipulate neural network weights in a structured manner. The transformations preserve key geometric structures, such as curvature and straightness, ensuring that critical neuron relationships and weight distributions remain intact during fine-tuning. The non-linearity of the transformations allows for flexible adjustments, capturing complex patterns while maintaining the model’s foundational behaviour. Additionally, their ability to handle translations, rotations, and inversions enables precise control over neuron modifications, enhancing the model’s adaptability. We integrate our proposed Möbius-inspired transformation within the PEFT framework and illustrate how this transformation can be adapted for fine-tuning a Stable Diffusion model to generate high-quality content from multi-modal references. Figure 2 summarises the similarities and differences between representative PEFT methods and ours. We focus on transforming the linear layers of the attention modules in the generative model (e.g., the Stable Diffusion model). Like other PEFT frameworks, we fine-tune only the proposed transformation parameters while keeping the rest of the pre-trained model parameters fixed. To handle matrix operations and make the formulation suitable for neural networks, we extend Eq. (3) to:

$$\begin{aligned} \textbf{W}' = (\textbf{A} \textbf{R} + \textbf{B}) \left( \textbf{C} \textbf{R} + \textbf{D} \right) ^{-1}\textbf{W} \end{aligned}$$
(4)

where \({\textbf{A}} \in \mathbb {R}^{d \times d}\) is a weight matrix fine-tuning the pre-trained model; \(\textbf{B} \in \mathbb {R}^{d \times n}\) and \(\textbf{D} \in \mathbb {{R}}^{d \times n}\) are bias-like matrices (with \(d=n\)); \(\textbf{C} \in \mathbb {{R}}^{d \times d}\) is a weight matrix affecting the transformation denominator; and \(\textbf{R}\) is initialised as the identity matrix but treated as a learnable parameter. The goal is to optimise the transformation matrices while the pre-trained model remains frozen. Moreover, motivated by previous work (Qiu et al., 2023), we leverage a block-diagonal design to further improve efficiency.
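A minimal sketch of how Eq. (4) could wrap a frozen linear layer is given below; initialising \(\textbf{A}\) and \(\textbf{D}\) to the identity and \(\textbf{B}\), \(\textbf{C}\) to zero (so that \(\textbf{W}' = \textbf{W}\) before training), treating all five matrices as square and trainable, and omitting the block-diagonal design are our illustrative assumptions rather than the exact training configuration.

```python
import torch
import torch.nn as nn


class MobiusInspiredLinear(nn.Module):
    """Frozen linear layer with a Möbius-inspired weight transformation:
    W' = (A R + B)(C R + D)^{-1} W  (Eq. 4), with d = n assumed."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():    # keep the pre-trained weights frozen
            p.requires_grad = False
        d = base.out_features
        eye, zero = torch.eye(d), torch.zeros(d, d)
        self.A = nn.Parameter(eye.clone())   # identity init
        self.B = nn.Parameter(zero.clone())  # zero init
        self.C = nn.Parameter(zero.clone())  # zero init
        self.D = nn.Parameter(eye.clone())   # identity init
        self.R = nn.Parameter(eye.clone())   # identity init, learnable (as in Eq. 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        numerator = self.A @ self.R + self.B
        denominator = self.C @ self.R + self.D
        M = numerator @ torch.linalg.inv(denominator)   # (A R + B)(C R + D)^{-1}
        W = M @ self.base.weight                        # transform the frozen weight matrix
        return nn.functional.linear(x, W, self.base.bias)
```

With this initialisation the transformation equals the identity, so fine-tuning starts exactly from the pre-trained model.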

Table 1 Comparison of different methods based on various quantitative metrics
Fig. 3 Qualitative comparison of our method with existing subject-driven text-to-image generation methods: DreamBooth (Ruiz et al., 2023), LoRA (Hu et al., 2021) and OFT (Qiu et al., 2023)

4 Experimental Setup

Our method is evaluated in different settings to show its effectiveness and robustness. For text-to-image generation tasks, we use Stable Diffusion v1.5, with most settings aligned with those outlined in the original OFT paper (Qiu et al., 2023). More specifically, for subject-driven generation, we use DreamBooth (Ruiz et al., 2023), LoRA (Hu et al., 2021), and OFT (Qiu et al., 2023) as the baselines; all of these methods, including ours, follow the training settings (e.g., loss function and hyper-parameters) of the original DreamBooth paper. We utilise the official DreamBooth dataset, which comprises 30 subjects and 15 unique classes. Each subject is depicted by several images, supplemented by 25 distinct text prompts. For controllable generation, ControlNet (Zhang et al., 2023), T2I-Adapter (Mou et al., 2024) and LoRA (Hu et al., 2021) are used as the baselines. Three challenging tasks are considered in this paper: segmentation-to-image, edge-to-image and landmark-to-face generation. The Stable Diffusion v1.5 model is fine-tuned for 20 epochs on the task-specific datasets, namely ADE20K (Zhou et al., 2017), COCO (Lin et al., 2014) and CelebA-HQ (Karras et al., 2017); more details can be found in the original OFT paper. We also evaluate the proposed method on text-to-motion tasks, following the baseline from MotionGPT (Zhang et al., 2024); LoRA is replaced by our proposed method, and the HumanML3D (Guo et al., 2022) dataset is used for training. Moreover, we also evaluate the proposed PEFT method on conventional mask generation tasks (i.e., image segmentation) to further demonstrate its generalisability. We use DINOv2 (Oquab et al., 2023) as the pre-trained backbone and directly use ADE20K (Zhou et al., 2017) to fine-tune the encoder along with a task-specific decoder; the encoder is fine-tuned with the help of PEFT, and we compare the effectiveness of the proposed method and LoRA.

5 Results and Discussion

We evaluated our method across several key multimodal generative computer vision tasks.

5.1 Subject-Driven Generation

Subject-driven generation focuses on creating images of a specific subject, using just a few reference images, and placing that subject in different contexts guided by a text prompt. This task is challenging because it requires the model to accurately capture the subject’s unique features and then recreate them consistently across various scenarios. We follow previous work (Qiu et al., 2023) in computing the DINO, CLIP-I, CLIP-T and LPIPS metrics. Specifically, CLIP-I evaluates the average cosine similarity between the CLIP embeddings of generated and real images, while DINO performs a similar assessment using ViT-S/16 DINO embeddings; both are used to evaluate subject fidelity. Meanwhile, CLIP-T measures the average cosine similarity between the embeddings of the text prompts and the generated images, which is used to evaluate text prompt fidelity. Additionally, LPIPS evaluates the average similarity between generated images of the same subject under identical text prompts, which is used to assess sample diversity. As detailed in Table 1, our proposed method outperforms OFT, DreamBooth and LoRA on the DINO, CLIP-I and LPIPS metrics, and also achieves comparable results in terms of text prompt fidelity and sample diversity. To better illustrate the benefits of our method, we showcase several randomly selected examples of subject-driven generation. For a fair comparison, all examples were generated using the same fine-tuned model for each method with the same text prompt. From the results in Fig. 3, our approach exhibits a clear advantage in maintaining semantic subject consistency. Through our observations, we noted that both LoRA and DreamBooth encounter significant challenges in preserving subject identity. DreamBooth, in particular, struggles to maintain both identity retention and alignment with the corresponding text prompts. LoRA shows a modest capacity to retain the original subject’s identity; however, this is limited and insufficient for comprehensive subject preservation. OFT provides enhanced precision in controlling outputs through text prompts; however, it falls short in preserving subtle details and lacks a deeper modelling of the intrinsic nature of objects. Most existing methods, including these, cannot achieve a balance between accuracy and diversity. Our method addresses this issue by enhancing both aspects during the generation process. This qualitative comparison underscores the high-quality generation and subject preservation achieved by our approach.
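For illustration, the image-fidelity metrics above reduce to an average cosine similarity between two sets of embeddings; the sketch below shows this computation with random tensors standing in for CLIP or DINO features, which is an assumption made purely for demonstration.

```python
import torch
import torch.nn.functional as F


def mean_pairwise_cosine(gen_emb: torch.Tensor, real_emb: torch.Tensor) -> float:
    """Average cosine similarity over all generated/real embedding pairs,
    as used for CLIP-I (CLIP image embeddings) and DINO (ViT-S/16 embeddings)."""
    gen = F.normalize(gen_emb, dim=-1)      # (N_gen, D)
    real = F.normalize(real_emb, dim=-1)    # (N_real, D)
    return (gen @ real.T).mean().item()     # mean over all N_gen x N_real pairs


# Stand-in embeddings; a real evaluation would use CLIP/DINO image features.
print(mean_pairwise_cosine(torch.randn(4, 512), torch.randn(5, 512)))
```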

Fig. 4 Qualitative comparison of the proposed method with existing controllable text-to-image generation methods: ControlNet (Zhang et al., 2023), T2I-Adapter (Mou et al., 2024), LoRA (Hu et al., 2021) and OFT (Qiu et al., 2023)

Fig. 5 Comparison of segmentation results using different methods. From top to bottom: original input images, ground truth (GT), ground truth segmentation (GT-Seg), results using LoRA for parameter-efficient fine-tuning, and results using our proposed PEFT method. Our method shows improved segmentation in complex scenes, demonstrating its advantage over LoRA in capturing finer details while maintaining efficiency

Fig. 6 Randomly selected text-to-motion samples generated using the Möbius-inspired transformation for fine-tuning LLaMA-7B on the HumanML3D dataset, presented in comparison to ground truth results

5.2 Controllable Generation

Controllable generation entails producing images that adhere to an additional control signal, such as Canny edges or segmentation maps, in conjunction with a textual prompt. This task requires exceptional precision, as the model must comply with the specified control signal while producing high-quality images. Specifically, we follow the previous work (Qiu et al., 2023) and show the model generations in Fig. 4. We compare our method to existing state-of-the-art methods, including OFT (Qiu et al., 2023), ControlNet (Zhang et al., 2023), LoRA (Hu et al., 2021) and T2I-Adapter (Mou et al., 2024). The landmark-to-face, Canny-edge-to-image and segmentation-to-image tasks are denoted as L2F, C2I and S2I, respectively. In the L2F task, both OFT and our method produce accurate view control for the generated faces under challenging references. In the C2I task, our method is able to hallucinate semantically similar images from rough Canny edges, while the other methods also reach competitive performance. In the S2I task, LoRA fails to generate images that adhere to the input segmentation map, whereas the other methods effectively control the generated images. Our approach not only offers precise controllability but also produces high-fidelity images with greater detail.

5.3 Mask Generation

We compare the proposed Möbius-inspired transformation PEFT method with LoRA to evaluate the effectiveness of fine-tuning in Fig. 5. Visually comparing LoRA and our proposed PEFT method shows that both methods are able to approximate the ground truth fairly well. However, there are noticeable improvements in finer details when using our method, particularly in complex scenes with multiple objects. This indicates that our PEFT approach may be more effective in retaining important fine details, potentially because it better guides the model to handle the relationships between object features. It is also worth mentioning that, across different types of images, our PEFT method maintains relatively high consistency in performance. For example, in images with cluttered environments or challenging lighting conditions, the model demonstrates robust segmentation capabilities.

5.4 Text-to-Motion Generation

Following the settings in Zhang et al. (2024), we also conducted a fine-tuning text-to-motion generation experiment. We replaced LoRA with our proposed Möbius-inspired transformation to fine-tune the pre-trained Llama 2 7B model on the HumanML3D (Guo et al., 2022) dataset, enabling it to generate motion tokens that can be decoded by a pre-trained motion decoder. In Fig. 6, the visualised results demonstrate that the fine-tuned model’s motions align adequately with the provided poses and remain consistent with the textual descriptions. Applying the proposed transformation to the fine-tuned Llama 2 7B model improves text consistency and motion completeness, leveraging the model’s capabilities to generate more detailed and comprehensive motion sequences, making it effective for controlled motion generation. Although the overall motion consistency appears natural, the proposed adapter occasionally produces motion sequences that do not exactly correspond to the text descriptions.

6 Conclusion

This paper introduces a novel fine-tuning approach for multi-modal generative models using Möbius-inspired transformations. By replacing traditional orthogonal transformations, our method effectively captures complex and hierarchical data patterns. The experiments show notable improvements in generation quality for tasks such as subject-driven generation, controllable generation, mask generation and text-to-motion generation. This approach addresses the computational and practical limitations of existing fine-tuning methods, offering a promising direction for future advancements in the efficient fine-tuning of large multimodal generative models. While the Möbius-inspired fine-tuning method has demonstrated notable improvements in capturing complex data patterns, several avenues for future research remain. One direction is to explore how this transformation framework can be extended to other generative architectures, such as autoregressive models and GANs, to assess its adaptability across different multimodal frameworks. Additionally, integrating this method with retrieval-augmented generation or graph-augmented generation approaches could further enhance the ability of the models to reason over structured and unstructured data sources. Further research could also investigate the scalability of such transformations in distributed or federated learning settings. By focusing on minimising communication overhead while maintaining model precision, this could broaden the application of fine-tuning strategies in resource-constrained environments. Furthermore, investigating the method’s performance on tasks beyond content generation, such as multimodal perception or predictive analytics via fine-tuning pre-trained large models, could offer insights into its robustness across various domains. Moreover, future work could focus on improving the interpretability of Möbius-inspired fine-tuning strategies. By analysing how this transformation manipulates parameter spaces and affects the learning process, researchers could develop more transparent and explainable generative models.