
🤗 PEFT

State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods

Fine-tuning large pretrained models is often prohibitively costly due to their scale. Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to various downstream applications by only fine-tuning a small number of (extra) model parameters instead of all the model's parameters. This significantly decreases the computational and storage costs. Recent state-of-the-art PEFT techniques achieve performance comparable to fully fine-tuned models.

PEFT is integrated with Transformers for easy model training and inference, Diffusers for conveniently managing different adapters, and Accelerate for distributed training and inference for really big models.

Tip

Visit the PEFT organization to read about the PEFT methods implemented in the library and to see notebooks demonstrating how to apply these methods to a variety of downstream tasks. Click the "Watch repos" button on the organization page to be notified of newly implemented methods and notebooks!

Check the PEFT Adapters API Reference section for a list of supported PEFT methods, and read the Adapters, Soft prompts, and IA3 conceptual guides to learn more about how these methods work.

Quickstart

Install PEFT from pip:

pip install peft

Prepare a model for training with a PEFT method such as LoRA by wrapping the base model and PEFT configuration with get_peft_model. For the Qwen/Qwen2.5-3B-Instruct model below, you're only training about 0.12% of the parameters!

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
model_id = "Qwen/Qwen2.5-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type=TaskType.CAUSAL_LM,
    # target_modules=["q_proj", "v_proj", ...]  # optionally indicate target modules
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# prints: trainable params: 3,686,400 || all params: 3,089,625,088 || trainable%: 0.1193

# now perform training on your dataset, e.g. using transformers Trainer, then save the model
model.save_pretrained("qwen2.5-3b-lora")

To load a PEFT model for inference:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

device = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
model_id = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
model = PeftModel.from_pretrained(model, "qwen2.5-3b-lora")

inputs = tokenizer("Preheat the oven to 350 degrees and place the cookie dough", return_tensors="pt")
outputs = model.generate(**inputs.to(device), max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# prints something like: Preheat the oven to 350 degrees and place the cookie dough in a baking dish [...]

Mixture-of-Experts (MoE) Support

PEFT now supports efficient Mixture-of-Experts training for LoRA models. This enables you to train multiple specialized expert adapters that share a single frozen base model, with routing weights determining how to combine expert outputs.

Why MoE with PEFT?

  • Memory Efficient: All experts share the same frozen base model weights. Only the tiny LoRA adapter matrices (A and B) are unique per expert.
  • Computationally Efficient: A single forward pass through the shared base model with a weighted combination of expert outputs, instead of N separate forward passes.
  • Specialized Learning: Each expert can specialize in different aspects of the task while maintaining parameter efficiency.

MoE Configuration

Use MoELoraConfig to configure multiple LoRA experts with routing:

from transformers import AutoModelForCausalLM
from peft import MoELoraConfig, TaskType, get_peft_model

model_id = "google/gemma-2-2b-it"
model = AutoModelForCausalLM.from_pretrained(model_id)

# Configure MoE with 4 experts
moe_config = MoELoraConfig(
    r=16,
    lora_alpha=32,
    task_type=TaskType.CAUSAL_LM,
    num_experts=4,
    top_k_experts=2,  # Use top-2 experts during inference
    router_aux_loss_coef=0.01,  # Load balancing coefficient
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(model, moe_config)
model.print_trainable_parameters()
# With 4 experts, you're still only training a small fraction of parameters!

Training with MoE

The router learns to select and weight experts during training. You need to provide routing weights during the forward pass:

# `router` is a user-defined module that maps hidden states (e.g. the base
# model's pooled embeddings) to per-expert weights
routing_weights = router(hidden_states)  # Shape: [batch_size, num_experts]

# Attach the router to enable weighted expert combination
model.attach_router(router)

# The forward pass automatically uses the routing weights
outputs = model(input_ids, attention_mask=attention_mask)

# Add a load balancing loss (user-defined) to prevent expert collapse
balance_loss = compute_balance_loss(routing_weights)
total_loss = outputs.loss + moe_config.router_aux_loss_coef * balance_loss

Key Features

  • Dynamic Routing: Dense routing (softmax over all experts) during training for gradient flow; sparse routing (top-k selection) during inference for efficiency.
  • Load Balancing: An auxiliary loss prevents the router from collapsing to a single expert (a minimal sketch of such a loss is shown after this list).
  • Gradient Flow: Router receives gradients from the balance loss, while LoRA adapters receive gradients from the main task loss.
  • Memory Overhead: Each expert adds only ~2-3% memory overhead compared to single-adapter LoRA.
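
The router and the load-balancing loss are supplied by your training code rather than by the library. The sketch below is a generic, illustrative implementation (SimpleRouter and compute_balance_loss are assumptions, not part of the PEFT API): a linear gate produces dense softmax weights during training, optionally sparsified to top-k at inference, plus a Switch-Transformer-style penalty that encourages uniform expert usage.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleRouter(nn.Module):
    # Hypothetical router: a single linear gate mapping pooled hidden states
    # to per-sample expert weights.
    def __init__(self, hidden_size, num_experts, top_k=2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, hidden_states, sparse=False):
        # hidden_states: [batch, seq_len, hidden_size] -> mean-pool over tokens
        logits = self.gate(hidden_states.mean(dim=1))  # [batch, num_experts]
        weights = F.softmax(logits, dim=-1)            # dense routing for training
        if sparse:
            # sparse top-k routing for inference: keep only the k largest weights
            _, topk_idx = weights.topk(self.top_k, dim=-1)
            mask = torch.zeros_like(weights)
            mask.scatter_(-1, topk_idx, 1.0)
            weights = weights * mask
            weights = weights / weights.sum(dim=-1, keepdim=True)
        return weights

def compute_balance_loss(routing_weights):
    # Penalty is minimized (equals 0) when average expert usage is uniform
    num_experts = routing_weights.size(-1)
    mean_usage = routing_weights.mean(dim=0)           # [num_experts]
    return num_experts * (mean_usage ** 2).sum() - 1.0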

Performance

For a 4-expert MoE configuration:

  • 60% reduction in forward passes: 2 passes instead of 5 (N+1 with N = 4 experts)
  • Excellent scaling: 1.93× scaling efficiency when increasing the batch size
  • Stable training: Validated gradient flow, no NaN/Inf issues
  • Checkpoint compatible: Save and load work seamlessly with exact output reproduction

Why you should use PEFT

There are many benefits of using PEFT but the main one is the huge savings in compute and storage, making PEFT applicable to many different use cases.

High performance on consumer hardware

Consider the memory requirements for training the following models on the ought/raft/twitter_complaints dataset, using an A100 80GB GPU with more than 64GB of CPU RAM.

Model                             | Full Finetuning          | PEFT-LoRA PyTorch        | PEFT-LoRA DeepSpeed with CPU Offloading
bigscience/T0_3B (3B params)      | 47.14GB GPU / 2.96GB CPU | 14.4GB GPU / 2.96GB CPU  | 9.8GB GPU / 17.8GB CPU
bigscience/mt0-xxl (12B params)   | OOM GPU                  | 56GB GPU / 3GB CPU       | 22GB GPU / 52GB CPU
bigscience/bloomz-7b1 (7B params) | OOM GPU                  | 32GB GPU / 3.8GB CPU     | 18.1GB GPU / 35GB CPU

With LoRA you can finetune a 12B parameter model that would've otherwise run out of memory on the 80GB GPU, and comfortably fit and train a 3B parameter model. When you look at the 3B parameter model's performance, it is comparable to a fully finetuned model at a fraction of the GPU memory.

Submission Name               | Accuracy
Human baseline (crowdsourced) | 0.897
Flan-T5                       | 0.892
lora-t0-3b                    | 0.863

Tip

The bigscience/T0_3B model performance isn't optimized in the table above. You can squeeze even more performance out of it by playing around with the input instruction templates, LoRA hyperparameters, and other training related hyperparameters. The final checkpoint size of this model is just 19MB compared to 11GB of the full bigscience/T0_3B model. Learn more about the advantages of finetuning with PEFT in this blog post.

Quantization

Quantization is another method for reducing the memory requirements of a model by representing the data in a lower precision. It can be combined with PEFT methods to make it even easier to train and load LLMs for inference.
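
As a minimal sketch of how the two combine (the 4-bit settings here are typical defaults, not prescribed by this README), a base model can be loaded in 4-bit with bitsandbytes, prepared for k-bit training, and then wrapped with a LoRA adapter:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit precision (requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct", quantization_config=bnb_config, device_map="auto"
)

# Prepare the quantized model for training and attach a LoRA adapter
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(r=16, lora_alpha=32, task_type=TaskType.CAUSAL_LM)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()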

Save compute and storage

PEFT can help you save storage by avoiding full finetuning of models on each downstream task or dataset. In many cases, you're only finetuning a very small fraction of a model's parameters and each checkpoint is only a few MBs in size (instead of GBs). These smaller PEFT adapters demonstrate performance comparable to a fully finetuned model. If you have many datasets, you can save a lot of storage with a PEFT model and not have to worry about catastrophic forgetting or overfitting the backbone or base model.
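
For example (a sketch; the adapter names and paths are illustrative), several task-specific adapters can sit alongside a single base model on disk and be swapped at runtime:

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Load one adapter per dataset/task; each checkpoint is only a few MB on disk
model = PeftModel.from_pretrained(base, "path/to/summarization-lora", adapter_name="summarization")
model.load_adapter("path/to/classification-lora", adapter_name="classification")

# Switch adapters without touching the frozen base weights
model.set_adapter("classification")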

PEFT integrations

PEFT is widely supported across the Hugging Face ecosystem because of the massive efficiency it brings to training and inference.

Diffusers

The iterative diffusion process consumes a lot of memory which can make it difficult to train. PEFT can help reduce the memory requirements and reduce the storage size of the final model checkpoint. For example, consider the memory required for training a Stable Diffusion model with LoRA on an A100 80GB GPU with more than 64GB of CPU RAM. The final model checkpoint size is only 8.8MB!

Model                         | Full Finetuning         | PEFT-LoRA               | PEFT-LoRA with Gradient Checkpointing
CompVis/stable-diffusion-v1-4 | 27.5GB GPU / 3.97GB CPU | 15.5GB GPU / 3.84GB CPU | 8.12GB GPU / 3.77GB CPU
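
Once trained, the LoRA weights can be loaded back into a Diffusers pipeline; the snippet below is a hedged sketch (the adapter path and weight file name are assumptions):

import torch
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights trained with PEFT into the pipeline
pipeline.load_lora_weights("path/to/lora", weight_name="pytorch_lora_weights.safetensors")
image = pipeline("a photo of a sks dog in a bucket").images[0]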

Tip

Take a look at the examples/lora_dreambooth/train_dreambooth.py training script to try training your own Stable Diffusion model with LoRA, and play around with the smangrul/peft-lora-sd-dreambooth Space which is running on a T4 instance. Learn more about the PEFT integration in Diffusers in this tutorial.

Transformers

PEFT is directly integrated with Transformers. After loading a model, call add_adapter to add a new PEFT adapter to the model:

from peft import LoraConfig

model = ...  # transformers model
peft_config = LoraConfig(...)
model.add_adapter(peft_config, adapter_name="lora_1")

To load a trained PEFT adapter, call load_adapter:

model = ...  # transformers model
model.load_adapter(<path-to-adapter>, adapter_name="lora_1")

And to switch between different adapters, call set_adapter:

model.set_adapter("lora_2")

The Transformers integration doesn't include all the functionalities offered in PEFT, such as methods for merging the adapter into the base model.
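
Merging is available through PEFT's own model class instead; a minimal sketch (the adapter path is illustrative):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# Fold the LoRA weights into the base model and drop the adapter wrappers
merged = model.merge_and_unload()
merged.save_pretrained("qwen2.5-3b-merged")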

Accelerate

Accelerate is a library for distributed training and inference on various training setups and hardware (GPUs, TPUs, Apple Silicon, etc.). PEFT models work with Accelerate out of the box, making it really convenient to train really large models or use them for inference on consumer hardware with limited resources.
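
For example (a sketch; the adapter path reuses the one saved in the quickstart), passing device_map="auto" lets Accelerate place a large model across the available devices before the adapter is loaded on top:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# device_map="auto" uses Accelerate's big model inference under the hood to
# shard the model's layers across the available GPUs and CPU memory
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", device_map="auto")
model = PeftModel.from_pretrained(model, "qwen2.5-3b-lora")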

TRL

PEFT can also be applied to training LLMs with RLHF components such as the ranker and policy. Get started with the TRL library, which integrates with PEFT out of the box.

Model support

Use this Space or check out the docs to find which models officially support a PEFT method out of the box. Even if a model isn't listed there, you can manually configure the model config to enable PEFT for it. Read the New transformers architecture guide to learn how.
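
A minimal sketch of that manual configuration (the module names below are typical for decoder-style transformers and are an assumption; inspect your model with print(model) to find the right ones):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Any causal LM works here; the model id is just an example
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    # When PEFT has no built-in mapping for an architecture, name the modules
    # to adapt explicitly
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()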

MoE LoRA Support

MoE (Mixture-of-Experts) support for LoRA is available for any model that supports standard LoRA. The MoE implementation works at the PEFT layer level and is model-agnostic. Key features:

  • Universal Compatibility: Works with any transformer architecture that supports LoRA (GPT, LLaMA, Gemma, T5, etc.)
  • Efficient Weight Sharing: All experts share the frozen base model via PEFT's adapter management
  • Router Management: Built-in methods for attaching routers and managing routing weights
  • Validated Models: Tested with CodeGemma-2B, Gemma-2-2B, and Gemma-3-270M

To use MoE with your model, simply use MoELoraConfig instead of LoraConfig and ensure your training loop provides routing weights during forward passes.

Contribute

If you would like to contribute to PEFT, please check out our contribution guide.

Citing 🤗 PEFT

To use 🤗 PEFT in your publication, please cite it by using the following BibTeX entry.

@Misc{peft,
  title =        {{PEFT}: State-of-the-art Parameter-Efficient Fine-Tuning methods},
  author =       {Sourab Mangrulkar and Sylvain Gugger and Lysandre Debut and Younes Belkada and Sayak Paul and Benjamin Bossan},
  howpublished = {\url{https://github.com/huggingface/peft}},
  year =         {2022}
}
