Support for RoAd: 2D Rotary Adaptation #2678
Conversation
BenjaminBossan
left a comment
Thank you a lot for this PR to add RoAd to PEFT. I have to say, this is one of the most mature PRs to add a new PEFT method so far, really great work on understanding the PEFT code base and following the conventions.
I did a first review pass and didn't find any big flaws. There are a few smaller issues that I found, but I think they should be easily fixed. Please check my comments.
Something I'd be interested in is whether you have checked the efficiency of mixed-batch inference. If not, I think it would be great to have a small script to check this. What I would imagine is to load a base model, add multiple RoAd/LoRA adapters (they need not be trained), perform single-adapter batch inference and mixed-adapter batch inference, then compare the relative overhead. If you don't have such a script, I'd be happy to help work on this.
Once you finish all your changes, don't forget to run make style before pushing and then ping me.
> # RoAd
>
> [RoAd](https://arxiv.org/pdf/2409.00119) is a parameter‑efficient fine‑tuning technique that adapts large language models by learning a small set of 2×2 rotation matrices (and optional scaling factors) applied to pairs of hidden dimensions, achieving competitive or superior performance with under 0.1% trainable parameters. Unlike LoRA’s batched low‑rank updates, RoAd’s sparse rotations reformulate to simple element‑wise operations, yielding significantly higher serving throughput when handling heterogeneous requests in the same batch. Moreover, RoAd integrates seamlessly into a distributed interchange intervention framework, enabling interpretable, composable task‑specific adaptations by combining orthogonal subspaces learned for different tasks.
> superior performance

Superior to what?

> heterogeneous requests in the same batch

> distributed interchange intervention framework

I would elaborate on that, as it might not be clear to some users what is meant by it.

> significantly higher serving throughput when handling heterogeneous requests in the same batch

It would be great if you could add a toy example of how to achieve this. For reference, check this section for LoRA: https://huggingface.co/docs/peft/developer_guides/lora#inference-with-different-lora-adapters-in-the-same-batch
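To sketch what such a toy example could look like (adapter paths and prompts are placeholders, and this assumes RoAd hooks into the same `adapter_names` mechanism as LoRA, which the benchmark further down also uses):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_id = "meta-llama/Llama-3.2-3B"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(0)
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")  # left padding for decoder-only generation
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default

# Hypothetical RoAd adapter checkpoints, loaded under different names
model = PeftModel.from_pretrained(model, "path/to/road-adapter-a", adapter_name="adapter_a")
model.load_adapter("path/to/road-adapter-b", adapter_name="adapter_b")

prompts = ["Translate to French: Hello", "Summarize: PEFT is great", "What is 2 + 2?"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(0)

# One adapter name per sample; "__base__" runs that sample through the unmodified base model
adapter_names = ["adapter_a", "adapter_b", "__base__"]
outputs = model.generate(**inputs, adapter_names=adapter_names, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```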
> Finetuning with RoAd typically requires higher learning rate compared to LoRA or similar methods, around 1e-3.
Thanks for documenting that.
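As an illustration of that note, a minimal sketch of a training setup using the higher learning rate (the 1e-3 value comes from the quoted docs; the model ID and RoadConfig arguments are placeholders following the usual PEFT conventions, not taken from the PR's example script):

```python
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import RoadConfig, get_peft_model

# Placeholder base model; any causal LM works the same way
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

# Argument names assumed to follow the usual PEFT config conventions
config = RoadConfig(target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base_model, config)

training_args = TrainingArguments(
    output_dir="road-finetune",
    learning_rate=1e-3,  # RoAd typically wants a higher LR than LoRA-style methods
    per_device_train_batch_size=4,
    num_train_epochs=1,
)
```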
examples/road_finetuning/README.md (outdated)
> performance with < 0.1% trainable parameters; efficient
> in serving requests requiring different adapters within a batch, with an overhead
> comparable to element-wise multiplication instead of batch matrix multiplication;
> enhances LLM’s interpretability.
This paragraph could use a bit of rework to make it more readable.
Let's rename the directory to road/llama-3.2-3B-lr_0.001
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@BenjaminBossan thank you for the thorough review! I addressed some of the comments and will look at the remaining things and get back to you. One comment regarding benchmarking mixed adapters: my focus was mainly on the vLLM implementation, as this is how we want to use it. To do this efficiently, one needs something like a grouped GEMM to multiply hidden outputs with different adapter weights based on the adapter mapping. I implemented this in vLLM with a Triton kernel, somewhat like what Punica does, but much simpler. Doing this in plain torch would be much slower, or at least I don't know of a good way to do it. So benchmarking the torch version of mixed RoAd inference in PEFT doesn't make much sense to me. That said, RoAd itself is still much faster than LoRA, whether mixed or not. FYI, I'm attaching a plot I created with my fork of vLLM comparing mixed inference with LoRA vs. RoAd vs. the base model.
BenjaminBossan
left a comment
Regarding benchmarks: It makes sense that users searching for the best performance would look into vLLM etc. Still, I think it's good to check if the advantages also manifest in PEFT.
I created a small toy benchmark:
```python
import time
import torch
from transformers import AutoModelForCausalLM
from peft import get_peft_model, RoadConfig, LoraConfig

num_adapters = 10
num_iters = 50
model_id = "meta-llama/Llama-3.2-3B"
config0 = LoraConfig(init_lora_weights=False, r=4)  # matching params of RoadConfig
config1 = RoadConfig(init_weights=False)
inputs = [torch.randint(0, 100, (num_adapters, 50)).to(0) for _ in range(num_iters)]

for config in [config0, config1]:
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(0)
    model = get_peft_model(model, config, adapter_name="adapter0")
    for i in range(1, num_adapters):
        model.add_adapter(f"adapter{i}", config)
    model.eval()
    adapter_names = [f"adapter{i}" for i in range(num_adapters)]

    # warmup
    model(inputs[0])

    # Run the model w/o mixed adapter batch
    durations = {"no mixed batch": [], " mixed batch": []}
    for i in range(num_iters):
        x = inputs[i]
        tic = time.perf_counter()
        model(x)
        toc = time.perf_counter()
        durations["no mixed batch"].append(toc - tic)

    # Run the model with mixed adapter batch
    for i in range(num_iters):
        x = inputs[i]
        tic = time.perf_counter()
        model(x, adapter_names=adapter_names)
        toc = time.perf_counter()
        durations[" mixed batch"].append(toc - tic)

    print(f"{config.peft_type.value} summary (over 10 runs)")
    for name, dur in durations.items():
        print(f"{name}: avg {sum(dur)/num_iters:.4f} sec, min {min(dur):.4f} sec, max {max(dur):.4f} sec, total: {sum(dur):.4f} sec")

    del model
    torch.cuda.empty_cache()
```

The results look quite promising so far:
```
LORA summary (over 10 runs)
no mixed batch: avg 0.0262 sec, min 0.0159 sec, max 0.0272 sec, total: 1.3117 sec
   mixed batch: avg 0.0878 sec, min 0.0848 sec, max 0.1182 sec, total: 4.3920 sec
ROAD summary (over 10 runs)
no mixed batch: avg 0.0263 sec, min 0.0179 sec, max 0.0269 sec, total: 1.3160 sec
   mixed batch: avg 0.0266 sec, min 0.0263 sec, max 0.0269 sec, total: 1.3297 sec
```
src/peft/tuners/road/layer.py (outdated)
> self.road_theta[adapter_name] = nn.Parameter(torch.rand(size))
> self.road_alpha[adapter_name] = nn.Parameter(torch.rand(size))
When passing init_weights=False, the values should be randomly initialized in reset_road_parameters (which can be renamed to reset_parameters) to avoid this.
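Not from the PR, just a rough sketch of how that could look, assuming reset_parameters receives the adapter name and the init_weights flag, and that the "identity" initialization for a rotation-plus-scaling parameterization is θ = 0 and α = 1:

```python
import torch.nn as nn

def reset_parameters(self, adapter_name: str, init_weights: bool) -> None:
    # Parameters are allocated elsewhere; this method only sets their values.
    if init_weights:
        # Identity adapter: zero rotation angle and unit scaling, so the adapted
        # layer initially reproduces the base layer's output exactly.
        nn.init.zeros_(self.road_theta[adapter_name])
        nn.init.ones_(self.road_alpha[adapter_name])
    else:
        # init_weights=False: random values (handy for tests), as suggested above.
        nn.init.uniform_(self.road_theta[adapter_name])
        nn.init.uniform_(self.road_alpha[adapter_name])
```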
@BenjaminBossan I think I addressed all your comments so far, please have another look. Thanks!
Thanks for addressing my comments, there is not much missing. There is an error in the included example, please check my comments. The rest of my comments are just smaller changes.
Before we can merge, since we have bnb support, let's also add some GPU tests for this. You can copy the RandLoRA examples and adjust them for RoAd:
peft/tests/test_gpu_examples.py
Lines 1492 to 1718 in e0b2ca7
```python
@pytest.mark.single_gpu_tests
def test_causal_lm_training_8bit_randlora(self):
    r"""
    Same as test_causal_lm_training but with RandLora
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )
        tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
        model = prepare_model_for_kbit_training(model)
        config = RandLoraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            randlora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("ybelkada/english_quotes_copy")
        data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()

        model.cpu().save_pretrained(tmp_dir)

        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)

        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.single_gpu_tests
def test_causal_lm_training_4bit_randlora(self):
    r"""
    Same as test_causal_lm_training_4bit but with RandLora
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            quantization_config=BitsAndBytesConfig(load_in_4bit=True),
            device_map="auto",
        )
        tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
        model = prepare_model_for_kbit_training(model)
        config = RandLoraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            randlora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("ybelkada/english_quotes_copy")
        data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()

        model.cpu().save_pretrained(tmp_dir)

        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)

        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.multi_gpu_tests
def test_causal_lm_training_multi_gpu_8bit_randlora(self):
    r"""
    Same as test_causal_lm_training_multi_gpu but with RandLoRA
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            device_map=DEVICE_MAP_MAP[self.causal_lm_model_id],
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        )
        assert set(model.hf_device_map.values()) == set(range(device_count))
        assert {p.device.index for p in model.parameters()} == set(range(device_count))

        model = prepare_model_for_kbit_training(model)

        setattr(model, "model_parallel", True)
        setattr(model, "is_parallelizable", True)

        config = RandLoraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            randlora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("Abirate/english_quotes")
        data = data.map(lambda samples: self.tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(self.tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()

        model.cpu().save_pretrained(tmp_dir)

        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)

        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.multi_gpu_tests
def test_causal_lm_training_multi_gpu_4bit_randlora(self):
    r"""
    Same as test_causal_lm_training_multi_gpu_4bit but with RandLora
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            device_map=DEVICE_MAP_MAP[self.causal_lm_model_id],
            quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        )
        assert set(model.hf_device_map.values()) == set(range(device_count))
        assert {p.device.index for p in model.parameters()} == set(range(device_count))

        model = prepare_model_for_kbit_training(model)

        setattr(model, "model_parallel", True)
        setattr(model, "is_parallelizable", True)

        config = RandLoraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            randlora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("Abirate/english_quotes")
        data = data.map(lambda samples: self.tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(self.tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()

        model.cpu().save_pretrained(tmp_dir)

        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)

        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None
```
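For reference, a rough sketch of how the first of those tests might be adjusted for RoAd; the RoadConfig arguments (e.g. the variant name) are assumptions based on the rest of this thread, not the PR's final API:

```python
@pytest.mark.single_gpu_tests
def test_causal_lm_training_8bit_road(self):
    r"""
    Same as test_causal_lm_training but with RoAd
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )
        tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
        model = prepare_model_for_kbit_training(model)
        # RoadConfig arguments are illustrative; actual names/defaults may differ in the PR
        config = RoadConfig(
            variant="road_1",
            target_modules=["q_proj", "v_proj"],
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("ybelkada/english_quotes_copy")
        data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=1e-3,  # RoAd typically uses a higher LR, see the docs note above
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()

        model.cpu().save_pretrained(tmp_dir)

        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)

        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None
```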
PS: I also ran the MetaMathQA benchmark and although the numbers generally look good (memory similar to LoRA rank 32, runtime a bit slower), the test score was significantly lower, only reaching 39.4% accuracy. My first guess is that the learning rate could be a bit too high, but I'm not sure. For the PR, it's not required to improve the score, but maybe you have some ideas.
> parser.add_argument("--eval_step", type=int, default=10, help="Evaluation step interval")
> parser.add_argument("--save_step", type=int, default=100, help="Save step interval")
> parser.add_argument("--device", type=str, default="cuda:0", help="Device to use for training")
> parser.add_argument("--variant", type=str, default="1", choices=["1", "2", "4"], help="RoAD variant")
Variants should be "road_1", etc. or you have to allow "1" etc. too.
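For example, a sketch of the suggested change, assuming the variant names follow the road_1/road_2/road_4 pattern implied by the original choices:

```python
parser.add_argument(
    "--variant",
    type=str,
    default="road_1",
    choices=["road_1", "road_2", "road_4"],
    help="RoAd variant",
)
```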
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@BenjaminBossan done. Regarding MetaMath, I tried a bit, but didn't improve much. Including all linear layers helped most. I think some tasks that require larger changes and more parameters are harder for RoAd.
BenjaminBossan
left a comment
Thanks for all the changes, this PR looks pretty much ready to me. Please run make style to ensure that the linter is happy.
@ppetrushkov We're missing an entry in the toc tree, which is why the doc builder complains. Regarding the failing CI tests, this is just flakiness and can be ignored. When you're finished with your changes, please ping me.
@BenjaminBossan added it to the toc, let me know if anything else is missing.
Thanks for updating the toc. A recent commit created a merge conflict with your PR; could you please merge with/rebase on the latest main and fix the conflict? For the resolution, just take what's on main, it'll work with RoAd.
BenjaminBossan
left a comment
Thanks a lot for the final effort, the PR now LGTM and can be merged. Great work.
Discussed in issue #2672
RoAd is a parameter-efficient fine-tuning technique that is especially well suited for efficient inference with mixed adapters in a batch, while still providing high output quality with a very small parameter count.
RoAd learns 2D rotation matrices that are applied to the layer output and can be written using only element-wise multiplication (rather than matrix multiplication), enabling very fast inference with adapters in an unmerged state. It is somewhat related to the Orthogonal Finetuning (OFT) method.
Paper: https://arxiv.org/pdf/2409.00119
I also have a fork of vLLM where I implemented and tested the efficiency of this method, which shows significantly better performance than LoRA: https://github.com/ppetrushkov/vllm/tree/v0.9.1-road
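To make the element-wise reformulation above concrete, here is a minimal sketch of a 2D rotation applied to paired hidden dimensions (illustrative only, not the PR's implementation; in particular, how dimensions are paired and where the scaling enters are assumptions):

```python
import torch

def road_rotate(x: torch.Tensor, theta: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Apply per-pair 2x2 rotations (and scaling) to the last dimension of x.

    x:     (..., d) layer output, d even
    theta: (d // 2,) rotation angle per dimension pair
    alpha: (d // 2,) scaling factor per dimension pair
    """
    x1, x2 = x[..., 0::2], x[..., 1::2]           # split into dimension pairs
    cos, sin = torch.cos(theta), torch.sin(theta)
    # Each pair (x1, x2) is rotated by its own 2x2 matrix, written element-wise:
    y1 = alpha * (cos * x1 - sin * x2)
    y2 = alpha * (sin * x1 + cos * x2)
    out = torch.empty_like(x)
    out[..., 0::2], out[..., 1::2] = y1, y2
    return out

# Because everything is element-wise, selecting a different adapter's theta/alpha per
# batch element only requires a gather, not a batched matrix multiply, which is what
# keeps mixed-adapter batches cheap.
x = torch.randn(4, 16)
theta = torch.zeros(8)   # zero angle -> identity rotation
alpha = torch.ones(8)    # unit scaling
assert torch.allclose(road_rotate(x, theta, alpha), x)
```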