
Conversation


@ppetrushkov ppetrushkov commented Jul 28, 2025

Discussed in issue #2672

RoAd is a parameter-efficient fine-tuning (PEFT) technique that is especially well suited for efficient inference with mixed adapters in a batch, while still providing high output quality with a very small parameter count.

RoAd learns 2D rotation matrices that are applied to the layer output. This can be written using only element-wise multiplications (rather than matrix multiplications), enabling very fast inference with adapters in the unmerged state. It is somewhat related to the Orthogonal Finetuning (OFT) method.
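
To make the reformulation concrete, here is a minimal sketch (the interleaved pairing, the single scale per 2D pair, and all names are illustrative simplifications; the actual implementation may parameterize this differently):

import torch

def rotate_half_pairs(x):
    # For interleaved pairs (x1, x2), return (-x2, x1), as in the RoPE trick.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

def road_forward(h, theta, alpha):
    # h:     (..., d)  output of the frozen linear layer
    # theta: (d // 2,) learned rotation angle per 2D pair
    # alpha: (d // 2,) learned scale per 2D pair (simplified here)
    cos = (alpha * torch.cos(theta)).repeat_interleave(2)
    sin = (alpha * torch.sin(theta)).repeat_interleave(2)
    # The block-diagonal 2x2 rotations reduce to two element-wise multiplies.
    return cos * h + sin * rotate_half_pairs(h)

h = torch.randn(2, 8)
out = road_forward(h, theta=torch.zeros(4), alpha=torch.ones(4))
assert torch.allclose(out, h)  # theta=0, alpha=1 is the identity transform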

Paper: https://arxiv.org/pdf/2409.00119

I also have a fork of vLLM where I implemented and tested the efficiency of this method, which shows significantly better performance than LoRA: https://github.com/ppetrushkov/vllm/tree/v0.9.1-road

@ppetrushkov ppetrushkov changed the title Support for RoAD: 2D Rotary Adaptation Support for RoAd: 2D Rotary Adaptation Jul 28, 2025

@BenjaminBossan BenjaminBossan left a comment


Thank you a lot for this PR to add RoAd to PEFT. I have to say, this is one of the most mature PRs to add a new PEFT method so far, really great work on understanding the PEFT code base and following the conventions.

I did a first review pass and didn't find any big flaws. There are a few smaller issues that I found, but I think they should be easily fixed. Please check my comments.

Something I'd be interested in is whether you have checked the efficiency of mixed batch inference. If not, I think it would be great to have a small script to check this. What I would imagine is to load a base model, add multiple RoAd/LoRA adapters (they need not be trained), perform single-adapter batch inference and multi-adapter batch inference, then compare the relative overhead. If you don't have such a script, I'd be happy to help work on this.

Once you finish all your changes, don't forget to run make style before pushing and then ping me.


# RoAd

[RoAd](https://arxiv.org/pdf/2409.00119) is a parameter‑efficient fine‑tuning technique that adapts large language models by learning a small set of 2×2 rotation matrices (and optional scaling factors) applied to pairs of hidden dimensions, achieving competitive or superior performance with under 0.1% trainable parameters. Unlike LoRA’s batched low‑rank updates, RoAd’s sparse rotations reformulate to simple element‑wise operations, yielding significantly higher serving throughput when handling heterogeneous requests in the same batch. Moreover, RoAd integrates seamlessly into a distributed interchange intervention framework, enabling interpretable, composable task‑specific adaptations by combining orthogonal subspaces learned for different tasks.

Member

superior performance

Superior to what?

heterogeneous requests in the same batch
distributed interchange intervention framework

I would elaborate on that, as it might not be clear what is meant by that to some users.

significantly higher serving throughput when handling heterogeneous requests in the same batch

It would be great if you could add a toy example on how to achieve this. For reference, check this section for LoRA: https://huggingface.co/docs/peft/developer_guides/lora#inference-with-different-lora-adapters-in-the-same-batch
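
Such a toy example might look roughly like the following sketch (the adapter paths and names are made up, and it assumes RoAd supports the same per-sample adapter_names mechanism that the LoRA docs describe):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_id = "meta-llama/Llama-3.2-3B"
base = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load two (hypothetical) trained RoAd adapters onto the same base model.
model = PeftModel.from_pretrained(base, "path/to/road_adapter_a", adapter_name="adapter_a")
model.load_adapter("path/to/road_adapter_b", adapter_name="adapter_b")

# One adapter name per sample in the batch; "__base__" means "no adapter".
prompts = ["Hello, my name is", "Bonjour, je m'appelle", "The capital of France is"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
adapter_names = ["adapter_a", "adapter_b", "__base__"]
outputs = model.generate(**inputs, adapter_names=adapter_names, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))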


[RoAd](https://arxiv.org/pdf/2409.00119) is a parameter‑efficient fine‑tuning technique that adapts large language models by learning a small set of 2×2 rotation matrices (and optional scaling factors) applied to pairs of hidden dimensions, achieving competitive or superior performance with under 0.1% trainable parameters. Unlike LoRA’s batched low‑rank updates, RoAd’s sparse rotations reformulate to simple element‑wise operations, yielding significantly higher serving throughput when handling heterogeneous requests in the same batch. Moreover, RoAd integrates seamlessly into a distributed interchange intervention framework, enabling interpretable, composable task‑specific adaptations by combining orthogonal subspaces learned for different tasks.

Finetuning with RoAd typically requires a higher learning rate compared to LoRA or similar methods, around 1e-3.

Member

Thanks for documenting that.
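
For illustration, a minimal training setup following this advice might look like the sketch below (the model, target_modules, and optimizer choice are placeholder assumptions; only RoadConfig and the ~1e-3 learning rate come from this thread):

import torch
from transformers import AutoModelForCausalLM
from peft import RoadConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B", torch_dtype=torch.bfloat16)
config = RoadConfig(target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, config)

# RoAd typically wants a higher learning rate than LoRA, e.g. around 1e-3.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)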

performance with < 0.1% trainable parameters; efficient
in serving requests requiring different adapters within a batch, with an overhead
comparable to element-wise multiplication instead of batch matrix multiplication;
enhances LLM’s interpretability.

Member

This paragraph could use a bit of rework to make it more readable.

Member

Let's rename the directory to road/llama-3.2-3B-lr_0.001

ppetrushkov and others added 2 commits July 31, 2025 02:00
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@ppetrushkov
Contributor Author

@BenjaminBossan thank you for the thorough review! I addressed some of the comments, will look at the remaining things and get back to you.

One comment regarding benchmarking mixed adapters: my focus was mainly on the vLLM implementation, as this is how we want to use it. To do this efficiently, one needs something like a grouped GEMM to multiply hidden outputs with different adapter weights based on the adapter mapping. I implemented this in vLLM with a Triton kernel, somewhat like what Punica does, but much simpler. Doing this in plain torch would be much slower, or at least I don't know of a good way to do it, so benchmarking the torch version of mixed RoAd inference in PEFT doesn't make much sense to me. That said, RoAd itself is still much faster than LoRA, whether mixed or not.
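
For illustration, the naive plain-torch version of this per-token adapter mapping might look roughly like the following sketch (shapes and names are made up; a fused Triton kernel avoids materializing the gathered weights):

import torch

num_adapters, hidden = 4, 64

# Stacked element-wise RoAd weights, one row per adapter (illustrative shapes).
cos_all = torch.randn(num_adapters, hidden)
sin_all = torch.randn(num_adapters, hidden)

def rotate_half_pairs(x):
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

def mixed_road_naive(h, adapter_ids):
    # h: (num_tokens, hidden), adapter_ids: (num_tokens,) adapter index per token.
    # Naive version: gather the per-token weights, then apply them element-wise.
    cos = cos_all[adapter_ids]
    sin = sin_all[adapter_ids]
    return cos * h + sin * rotate_half_pairs(h)

out = mixed_road_naive(torch.randn(8, hidden), torch.randint(0, num_adapters, (8,)))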

FYI, attaching one plot I created with my fork of vLLM comparing mixed inference with LoRA vs RoAd vs just the base model.
[Figure: lora_vs_road_tpot_512_128]


@BenjaminBossan BenjaminBossan left a comment


Regarding benchmarks: It makes sense that users searching for the best performance would look into vLLM etc. Still, I think it's good to check whether the advantages also manifest in PEFT.

I created a small toy benchmark:

import time

import torch
from transformers import AutoModelForCausalLM
from peft import get_peft_model, RoadConfig, LoraConfig

num_adapters = 10
num_iters = 50
model_id = "meta-llama/Llama-3.2-3B"

config0 = LoraConfig(init_lora_weights=False, r=4)  # matching params of RoadConfig
config1 = RoadConfig(init_weights=False)
inputs = [torch.randint(0, 100, (num_adapters, 50)).to(0) for _ in range(num_iters)]

for config in [config0, config1]:
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(0)
    model = get_peft_model(model, config, adapter_name="adapter0")
    for i in range(1, num_adapters):
        model.add_adapter(f"adapter{i}", config)
    model.eval()

    adapter_names = [f"adapter{i}" for i in range(num_adapters)]

    # warmup
    model(inputs[0])

    # Run the model w/o mixed adapter batch
    durations = {"no mixed batch": [], "   mixed batch": []}

    for i in range(num_iters):
        x = inputs[i]
        tic = time.perf_counter()
        model(x)
        toc = time.perf_counter()
        durations["no mixed batch"].append(toc - tic)

    # Run the model with mixed adapter batch
    for i in range(num_iters):
        x = inputs[i]
        tic = time.perf_counter()
        model(x, adapter_names=adapter_names)
        toc = time.perf_counter()
        durations["   mixed batch"].append(toc - tic)

    print(f"{config.peft_type.value} summary (over 10 runs)")
    for name, dur in durations.items():
        print(f"{name}: avg {sum(dur)/num_iters:.4f} sec, min {min(dur):.4f} sec, max {max(dur):.4f} sec, total: {sum(dur):.4f} sec")

    del model
    torch.cuda.empty_cache()

The results look quite promising so far:

LORA summary (over 50 runs)
no mixed batch: avg 0.0262 sec, min 0.0159 sec, max 0.0272 sec, total: 1.3117 sec
   mixed batch: avg 0.0878 sec, min 0.0848 sec, max 0.1182 sec, total: 4.3920 sec
ROAD summary (over 50 runs)
no mixed batch: avg 0.0263 sec, min 0.0179 sec, max 0.0269 sec, total: 1.3160 sec
   mixed batch: avg 0.0266 sec, min 0.0263 sec, max 0.0269 sec, total: 1.3297 sec

Comment on lines 103 to 104
self.road_theta[adapter_name] = nn.Parameter(torch.rand(size))
self.road_alpha[adapter_name] = nn.Parameter(torch.rand(size))

Member

When passing init_weights=False, the values should be randomly initialized in reset_road_parameters (which can be renamed to reset_parameters) to avoid this.
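
A possible shape for that change, sketched outside the actual PR code (the identity values theta=0 / alpha=1 are an assumption about the parameterization, not taken from the PR):

import torch
import torch.nn as nn

class RoadLayerSketch(nn.Module):
    # Hypothetical sketch of the suggested flow: parameters are created empty
    # and reset_parameters decides between identity-like and random init.
    def __init__(self):
        super().__init__()
        self.road_theta = nn.ParameterDict()
        self.road_alpha = nn.ParameterDict()

    def update_layer(self, adapter_name, size, init_weights=True):
        self.road_theta[adapter_name] = nn.Parameter(torch.empty(size))
        self.road_alpha[adapter_name] = nn.Parameter(torch.empty(size))
        self.reset_parameters(adapter_name, init_weights)

    def reset_parameters(self, adapter_name, init_weights):
        if init_weights:
            nn.init.zeros_(self.road_theta[adapter_name])  # identity rotation (assumed)
            nn.init.ones_(self.road_alpha[adapter_name])   # no scaling (assumed)
        else:
            # random init, only meant for tests (mirrors init_lora_weights=False)
            nn.init.uniform_(self.road_theta[adapter_name])
            nn.init.uniform_(self.road_alpha[adapter_name])

layer = RoadLayerSketch()
layer.update_layer("default", 16, init_weights=False)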

@ppetrushkov
Contributor Author

@BenjaminBossan I think I addressed all your comments so far, please have another look. Thanks!


@BenjaminBossan BenjaminBossan left a comment


Thanks for addressing my comments, there is not much missing. There is an error in the included example, please check my comments. The rest of my comments are just smaller changes.

Before we can merge, since we have bnb support, let's also add some GPU tests for this. You can copy the RandLoRA examples and adjust them for RoAd; a rough sketch of an adjusted version follows after them:

@pytest.mark.single_gpu_tests
def test_causal_lm_training_8bit_randlora(self):
    r"""
    Same as test_causal_lm_training but with RandLora
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )
        tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
        model = prepare_model_for_kbit_training(model)
        config = RandLoraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            randlora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("ybelkada/english_quotes_copy")
        data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()
        model.cpu().save_pretrained(tmp_dir)
        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)
        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.single_gpu_tests
def test_causal_lm_training_4bit_randlora(self):
    r"""
    Same as test_causal_lm_training_4bit but with RandLora
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            quantization_config=BitsAndBytesConfig(load_in_4bit=True),
            device_map="auto",
        )
        tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
        model = prepare_model_for_kbit_training(model)
        config = RandLoraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            randlora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("ybelkada/english_quotes_copy")
        data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()
        model.cpu().save_pretrained(tmp_dir)
        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)
        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.multi_gpu_tests
def test_causal_lm_training_multi_gpu_8bit_randlora(self):
    r"""
    Same as test_causal_lm_training_multi_gpu but with RandLoRA
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            device_map=DEVICE_MAP_MAP[self.causal_lm_model_id],
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        )
        assert set(model.hf_device_map.values()) == set(range(device_count))
        assert {p.device.index for p in model.parameters()} == set(range(device_count))
        model = prepare_model_for_kbit_training(model)
        setattr(model, "model_parallel", True)
        setattr(model, "is_parallelizable", True)
        config = RandLoraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            randlora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("Abirate/english_quotes")
        data = data.map(lambda samples: self.tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(self.tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()
        model.cpu().save_pretrained(tmp_dir)
        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)
        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.multi_gpu_tests
def test_causal_lm_training_multi_gpu_4bit_randlora(self):
    r"""
    Same as test_causal_lm_training_multi_gpu_4bit but with RandLora
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            device_map=DEVICE_MAP_MAP[self.causal_lm_model_id],
            quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        )
        assert set(model.hf_device_map.values()) == set(range(device_count))
        assert {p.device.index for p in model.parameters()} == set(range(device_count))
        model = prepare_model_for_kbit_training(model)
        setattr(model, "model_parallel", True)
        setattr(model, "is_parallelizable", True)
        config = RandLoraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            randlora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("Abirate/english_quotes")
        data = data.map(lambda samples: self.tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(self.tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()
        model.cpu().save_pretrained(tmp_dir)
        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)
        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None
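
For illustration, the first of these adjusted for RoAd could look roughly like this (a sketch: the RoadConfig arguments and the 1e-3 learning rate are assumptions rather than the final PR code):

@pytest.mark.single_gpu_tests
def test_causal_lm_training_8bit_road(self):
    r"""
    Same as test_causal_lm_training but with RoAd (sketch)
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )
        tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
        model = prepare_model_for_kbit_training(model)
        config = RoadConfig(
            target_modules=["q_proj", "v_proj"],
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("ybelkada/english_quotes_copy")
        data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=1e-3,  # RoAd typically uses a higher LR than LoRA
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()
        model.cpu().save_pretrained(tmp_dir)
        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)
        assert trainer.state.log_history[-1]["train_loss"] is not None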

PS: I also ran the MetaMathQA benchmark and although the numbers generally look good (memory similar to LoRA rank 32, runtime a bit slower), the test score was significantly lower, only reaching 39.4% accuracy. My first guess is that the learning rate could be a bit too high, but I'm not sure. For the PR, it's not required to improve the score, but maybe you have some ideas.

parser.add_argument("--eval_step", type=int, default=10, help="Evaluation step interval")
parser.add_argument("--save_step", type=int, default=100, help="Save step interval")
parser.add_argument("--device", type=str, default="cuda:0", help="Device to use for training")
parser.add_argument("--variant", type=str, default="1", choices=["1", "2", "4"], help="RoAD variant")

Member

Variants should be "road_1", etc. or you have to allow "1" etc. too.
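
For instance, one way to resolve this (a sketch, assuming the config ultimately expects the "road_1"-style names):

parser.add_argument("--variant", type=str, default="road_1", choices=["road_1", "road_2", "road_4"], help="RoAd variant")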

ppetrushkov and others added 2 commits August 6, 2025 01:49
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@ppetrushkov
Contributor Author

@BenjaminBossan done.

Regarding MetaMath, I tried a bit but didn't manage to improve it much. Including all linear layers helped the most. I think some tasks that require larger changes and more parameters are harder for RoAd.


@BenjaminBossan BenjaminBossan left a comment


Thanks for all the changes, this PR looks pretty much ready for me. Please run make style to ensure that the linter is happy.

@BenjaminBossan
Member

@ppetrushkov We're missing an entry in the toc tree, which is why the doc builder complains. Regarding the failing CI tests, this is just flakiness and can be ignored.

When you're finished with your changes, please ping me.

@ppetrushkov
Contributor Author

@BenjaminBossan added it to the toc, let me know if anything else is missing.

@BenjaminBossan
Member

Thanks for updating the toc. A recent commit created a merge conflict with your PR, could you please merge with/rebase on the latest main and fix the conflict? For the resolution, just take what's on main, it'll work with RoAd.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


@BenjaminBossan BenjaminBossan left a comment


Thanks a lot for the final effort, the PR now LGTM and can be merged. Great work.
