Support for RoAd: 2D Rotary Adaptation #2678
Conversation
BenjaminBossan
left a comment
Thank you a lot for this PR to add RoAd to PEFT. I have to say, this is one of the most mature PRs to add a new PEFT method so far, really great work on understanding the PEFT code base and following the conventions.
I did a first review pass and didn't find any big flaws. There are a few smaller issues that I found, but I think they should be easily fixed. Please check my comments.
Something I'd be interested in is whether you have checked the efficiency of mixed-batch inference. If not, I think it would be great to have a small script to check this. What I would imagine is to load a base model, add multiple RoAd/LoRA adapters (they need not be trained), perform single-adapter batch inference and mixed-adapter batch inference, then compare the relative overhead. If you don't have such a script, I'd be happy to help work on this.
Once you finish all your changes, don't forget to run make style before pushing and then ping me.
> # RoAd
>
> [RoAd](https://arxiv.org/pdf/2409.00119) is a parameter‑efficient fine‑tuning technique that adapts large language models by learning a small set of 2×2 rotation matrices (and optional scaling factors) applied to pairs of hidden dimensions, achieving competitive or superior performance with under 0.1% trainable parameters. Unlike LoRA’s batched low‑rank updates, RoAd’s sparse rotations reformulate to simple element‑wise operations, yielding significantly higher serving throughput when handling heterogeneous requests in the same batch. Moreover, RoAd integrates seamlessly into a distributed interchange intervention framework, enabling interpretable, composable task‑specific adaptations by combining orthogonal subspaces learned for different tasks.
> superior performance

Superior to what?

> heterogeneous requests in the same batch

> distributed interchange intervention framework

I would elaborate on that, as it might not be clear to some users what is meant by it.

> significantly higher serving throughput when handling heterogeneous requests in the same batch

It would be great if you could add a toy example of how to achieve this. For reference, check this section for LoRA: https://huggingface.co/docs/peft/developer_guides/lora#inference-with-different-lora-adapters-in-the-same-batch
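To sketch what such a toy example could look like (adapter paths and prompts are placeholders, and this assumes RoAd hooks into the same `adapter_names` mechanism as LoRA, which the benchmark further down also uses):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_id = "meta-llama/Llama-3.2-3B"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(0)
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")  # left padding for decoder-only generation
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default

# Hypothetical RoAd adapter checkpoints, loaded under different names
model = PeftModel.from_pretrained(model, "path/to/road-adapter-a", adapter_name="adapter_a")
model.load_adapter("path/to/road-adapter-b", adapter_name="adapter_b")

prompts = ["Translate to French: Hello", "Summarize: PEFT is great", "What is 2 + 2?"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(0)

# One adapter name per sample; "__base__" runs that sample through the unmodified base model
adapter_names = ["adapter_a", "adapter_b", "__base__"]
outputs = model.generate(**inputs, adapter_names=adapter_names, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```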
> Finetuning with RoAd typically requires higher learning rate compared to LoRA or similar methods, around 1e-3.
Thanks for documenting that.
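As an illustration of that note, a minimal sketch of a training setup using the higher learning rate (the 1e-3 value comes from the quoted docs; the model ID and RoadConfig arguments are placeholders following the usual PEFT conventions, not taken from the PR's example script):

```python
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import RoadConfig, get_peft_model

# Placeholder base model; any causal LM works the same way
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

# Argument names assumed to follow the usual PEFT config conventions
config = RoadConfig(target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base_model, config)

training_args = TrainingArguments(
    output_dir="road-finetune",
    learning_rate=1e-3,  # RoAd typically wants a higher LR than LoRA-style methods
    per_device_train_batch_size=4,
    num_train_epochs=1,
)
```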
examples/road_finetuning/README.md (outdated)
> performance with < 0.1% trainable parameters; efficient
> in serving requests requiring different adapters within a batch, with an overhead
> comparable to element-wise multiplication instead of batch matrix multiplication;
> enhances LLM’s interpretability.
This paragraph could use a bit of rework to make it more readable.
Let's rename the directory to road/llama-3.2-3B-lr_0.001
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@BenjaminBossan thank you for the thorough review! I addressed some of the comments and will look at the remaining things and get back to you. One comment regarding benchmarking mixed adapters: my focus was mainly on the vLLM implementation, as this is how we want to use it. To do this efficiently, one needs something like a grouped GEMM to multiply hidden outputs with different adapter weights based on the adapter mapping. I implemented this in vLLM with a Triton kernel, somewhat like what Punica does, but much simpler. Doing this in plain torch would be much slower, or at least I don't know of a good way to do it. So benchmarking the torch version of mixed RoAd inference in PEFT doesn't make much sense to me. That said, RoAd itself is still much faster than LoRA, whether mixed or not. FYI, I'm attaching a plot I created with my fork of vLLM comparing mixed inference with LoRA vs. RoAd vs. the base model.
BenjaminBossan
left a comment
Regarding benchmarks: It makes sense that users searching for the best performance would look into vLLM etc. Still, I think it's good to check if the advantages also manifest in PEFT.
I created a small toy benchmark:
```python
import time
import torch
from transformers import AutoModelForCausalLM
from peft import get_peft_model, RoadConfig, LoraConfig

num_adapters = 10
num_iters = 50
model_id = "meta-llama/Llama-3.2-3B"
config0 = LoraConfig(init_lora_weights=False, r=4)  # matching params of RoadConfig
config1 = RoadConfig(init_weights=False)
inputs = [torch.randint(0, 100, (num_adapters, 50)).to(0) for _ in range(num_iters)]

for config in [config0, config1]:
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(0)
    model = get_peft_model(model, config, adapter_name="adapter0")
    for i in range(1, num_adapters):
        model.add_adapter(f"adapter{i}", config)
    model.eval()
    adapter_names = [f"adapter{i}" for i in range(num_adapters)]

    # warmup
    model(inputs[0])

    # Run the model w/o mixed adapter batch
    durations = {"no mixed batch": [], " mixed batch": []}
    for i in range(num_iters):
        x = inputs[i]
        tic = time.perf_counter()
        model(x)
        toc = time.perf_counter()
        durations["no mixed batch"].append(toc - tic)

    # Run the model with mixed adapter batch
    for i in range(num_iters):
        x = inputs[i]
        tic = time.perf_counter()
        model(x, adapter_names=adapter_names)
        toc = time.perf_counter()
        durations[" mixed batch"].append(toc - tic)

    print(f"{config.peft_type.value} summary (over 10 runs)")
    for name, dur in durations.items():
        print(f"{name}: avg {sum(dur)/num_iters:.4f} sec, min {min(dur):.4f} sec, max {max(dur):.4f} sec, total: {sum(dur):.4f} sec")

    del model
    torch.cuda.empty_cache()
```

The results look quite promising so far:
```
LORA summary (over 10 runs)
no mixed batch: avg 0.0262 sec, min 0.0159 sec, max 0.0272 sec, total: 1.3117 sec
   mixed batch: avg 0.0878 sec, min 0.0848 sec, max 0.1182 sec, total: 4.3920 sec
ROAD summary (over 10 runs)
no mixed batch: avg 0.0263 sec, min 0.0179 sec, max 0.0269 sec, total: 1.3160 sec
   mixed batch: avg 0.0266 sec, min 0.0263 sec, max 0.0269 sec, total: 1.3297 sec
```
src/peft/tuners/road/layer.py (outdated)
> self.road_theta[adapter_name] = nn.Parameter(torch.rand(size))
> self.road_alpha[adapter_name] = nn.Parameter(torch.rand(size))
When passing init_weights=False, the values should be randomly initialized in reset_road_parameters (which can be renamed to reset_parameters) to avoid this.
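Not from the PR, just a rough sketch of how that could look, assuming reset_parameters receives the adapter name and the init_weights flag, and that the "identity" initialization for a rotation-plus-scaling parameterization is θ = 0 and α = 1:

```python
import torch.nn as nn

def reset_parameters(self, adapter_name: str, init_weights: bool) -> None:
    # Parameters are allocated elsewhere; this method only sets their values.
    if init_weights:
        # Identity adapter: zero rotation angle and unit scaling, so the adapted
        # layer initially reproduces the base layer's output exactly.
        nn.init.zeros_(self.road_theta[adapter_name])
        nn.init.ones_(self.road_alpha[adapter_name])
    else:
        # init_weights=False: random values (handy for tests), as suggested above.
        nn.init.uniform_(self.road_theta[adapter_name])
        nn.init.uniform_(self.road_alpha[adapter_name])
```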
@BenjaminBossan I think I addressed all your comments so far, please have another look. Thanks!
Thanks for addressing my comments, there is not much missing. There is an error in the included example, please check my comments. The rest of my comments are just smaller changes.
Before we can merge, since we have bnb support, let's also add some GPU tests for this. You can copy the RandLoRA examples and adjust them for RoAd:
peft/tests/test_gpu_examples.py
Lines 1492 to 1718 in e0b2ca7
```python
@pytest.mark.single_gpu_tests
def test_causal_lm_training_8bit_randlora(self):
    r"""
    Same as test_causal_lm_training but with RandLora
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )
        tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
        model = prepare_model_for_kbit_training(model)
        config = RandLoraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            randlora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("ybelkada/english_quotes_copy")
        data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()

        model.cpu().save_pretrained(tmp_dir)

        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)

        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.single_gpu_tests
def test_causal_lm_training_4bit_randlora(self):
    r"""
    Same as test_causal_lm_training_4bit but with RandLora
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            quantization_config=BitsAndBytesConfig(load_in_4bit=True),
            device_map="auto",
        )
        tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
        model = prepare_model_for_kbit_training(model)
        config = RandLoraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            randlora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("ybelkada/english_quotes_copy")
        data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()

        model.cpu().save_pretrained(tmp_dir)

        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)

        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.multi_gpu_tests
def test_causal_lm_training_multi_gpu_8bit_randlora(self):
    r"""
    Same as test_causal_lm_training_multi_gpu but with RandLoRA
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            device_map=DEVICE_MAP_MAP[self.causal_lm_model_id],
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        )
        assert set(model.hf_device_map.values()) == set(range(device_count))
        assert {p.device.index for p in model.parameters()} == set(range(device_count))

        model = prepare_model_for_kbit_training(model)

        setattr(model, "model_parallel", True)
        setattr(model, "is_parallelizable", True)

        config = RandLoraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            randlora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("Abirate/english_quotes")
        data = data.map(lambda samples: self.tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(self.tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()

        model.cpu().save_pretrained(tmp_dir)

        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)

        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None

@pytest.mark.multi_gpu_tests
def test_causal_lm_training_multi_gpu_4bit_randlora(self):
    r"""
    Same as test_causal_lm_training_multi_gpu_4bit but with RandLora
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            device_map=DEVICE_MAP_MAP[self.causal_lm_model_id],
            quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        )
        assert set(model.hf_device_map.values()) == set(range(device_count))
        assert {p.device.index for p in model.parameters()} == set(range(device_count))

        model = prepare_model_for_kbit_training(model)

        setattr(model, "model_parallel", True)
        setattr(model, "is_parallelizable", True)

        config = RandLoraConfig(
            r=16,
            target_modules=["q_proj", "v_proj"],
            randlora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("Abirate/english_quotes")
        data = data.map(lambda samples: self.tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=2e-4,
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(self.tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()

        model.cpu().save_pretrained(tmp_dir)

        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)

        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None
```
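For reference, a rough sketch of how the first of those tests might be adjusted for RoAd; the RoadConfig arguments (e.g. the variant name) are assumptions based on the rest of this thread, not the PR's final API:

```python
@pytest.mark.single_gpu_tests
def test_causal_lm_training_8bit_road(self):
    r"""
    Same as test_causal_lm_training but with RoAd
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto",
        )
        tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
        model = prepare_model_for_kbit_training(model)
        # RoadConfig arguments are illustrative; actual names/defaults may differ in the PR
        config = RoadConfig(
            variant="road_1",
            target_modules=["q_proj", "v_proj"],
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, config)
        data = load_dataset("ybelkada/english_quotes_copy")
        data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
        trainer = Trainer(
            model=model,
            train_dataset=data["train"],
            args=TrainingArguments(
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                warmup_steps=2,
                max_steps=3,
                learning_rate=1e-3,  # RoAd typically uses a higher LR, see the docs note above
                fp16=True,
                logging_steps=1,
                output_dir=tmp_dir,
            ),
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
        )
        model.config.use_cache = False
        trainer.train()

        model.cpu().save_pretrained(tmp_dir)

        assert "adapter_config.json" in os.listdir(tmp_dir)
        assert SAFETENSORS_WEIGHTS_NAME in os.listdir(tmp_dir)

        # assert loss is not None
        assert trainer.state.log_history[-1]["train_loss"] is not None
```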
PS: I also ran the MetaMathQA benchmark and although the numbers generally look good (memory similar to LoRA rank 32, runtime a bit slower), the test score was significantly lower, only reaching 39.4% accuracy. My first guess is that the learning rate could be a bit too high, but I'm not sure. For the PR, it's not required to improve the score, but maybe you have some ideas.
> parser.add_argument("--eval_step", type=int, default=10, help="Evaluation step interval")
> parser.add_argument("--save_step", type=int, default=100, help="Save step interval")
> parser.add_argument("--device", type=str, default="cuda:0", help="Device to use for training")
> parser.add_argument("--variant", type=str, default="1", choices=["1", "2", "4"], help="RoAD variant")
Variants should be "road_1", etc. or you have to allow "1" etc. too.
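For example, a sketch of the suggested change, assuming the variant names follow the road_1/road_2/road_4 pattern implied by the original choices:

```python
parser.add_argument(
    "--variant",
    type=str,
    default="road_1",
    choices=["road_1", "road_2", "road_4"],
    help="RoAd variant",
)
```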
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@BenjaminBossan done. Regarding MetaMath, I tried a bit, but didn't improve much. Including all linear layers helped most. I think some tasks that require larger changes and more parameters are harder for RoAd.
BenjaminBossan
left a comment
Thanks for all the changes, this PR looks pretty much ready to me. Please run make style to ensure that the linter is happy.
@ppetrushkov We're missing an entry in the toc tree, which is why the doc builder complains. Regarding the failing CI tests, this is just flakiness and can be ignored. When you're finished with your changes, please ping me.
@BenjaminBossan added it to the toc, let me know if anything else is missing.
Thanks for updating the toc. A recent commit created a merge conflict with your PR; could you please merge with/rebase on the latest main and fix the conflict? For the resolution, just take what's on main, it'll work with RoAd.
BenjaminBossan
left a comment
Thanks a lot for the final effort, the PR now LGTM and can be merged. Great work.
Discussed in issue #2672
RoAd is a parameter-efficient fine-tuning technique that is especially well suited for efficient inference with mixed adapters in a batch, while still providing high output quality with a very small parameter count.
RoAd learns 2D rotation matrices that are applied to the layer output and can be written using only element-wise multiplication (rather than matrix multiplication), enabling very fast inference with adapters in an unmerged state. It is somewhat related to the Orthogonal Finetuning (OFT) method.
Paper: https://arxiv.org/pdf/2409.00119
I also have a fork of vLLM where I implemented and tested the efficiency of this method, which shows significantly better performance than LoRA: https://github.com/ppetrushkov/vllm/tree/v0.9.1-road
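To make the element-wise reformulation above concrete, here is a minimal sketch of a 2D rotation applied to paired hidden dimensions (illustrative only, not the PR's implementation; in particular, how dimensions are paired and where the scaling enters are assumptions):

```python
import torch

def road_rotate(x: torch.Tensor, theta: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Apply per-pair 2x2 rotations (and scaling) to the last dimension of x.

    x:     (..., d) layer output, d even
    theta: (d // 2,) rotation angle per dimension pair
    alpha: (d // 2,) scaling factor per dimension pair
    """
    x1, x2 = x[..., 0::2], x[..., 1::2]           # split into dimension pairs
    cos, sin = torch.cos(theta), torch.sin(theta)
    # Each pair (x1, x2) is rotated by its own 2x2 matrix, written element-wise:
    y1 = alpha * (cos * x1 - sin * x2)
    y2 = alpha * (sin * x1 + cos * x2)
    out = torch.empty_like(x)
    out[..., 0::2], out[..., 1::2] = y1, y2
    return out

# Because everything is element-wise, selecting a different adapter's theta/alpha per
# batch element only requires a gather, not a batched matrix multiply, which is what
# keeps mixed-adapter batches cheap.
x = torch.randn(4, 16)
theta = torch.zeros(8)   # zero angle -> identity rotation
alpha = torch.ones(8)    # unit scaling
assert torch.allclose(road_rotate(x, theta, alpha), x)
```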