Method comparison: LoRA that targets MLP modules #2845

BenjaminBossan · 2025-10-16T13:29:34Z

The "LoRA Without Regret" blog post (https://thinkingmachines.ai/blog/lora/) mentions that targeting the MLP part of the transformer is more effective than targeting the attention modules. This experiment tests this by targeting:

["gate_proj", "up_proj", "down_proj"]

instead of the default layers (["q_proj", "v_proj"]).

I chose rank 10 so that the parameter count roughly matches what we would get when targeting the attention modules with rank 32. Testing on my machine, there is indeed a nice improvement in the test score:

| metric               | target attention | target MLP |
|----------------------|------------------|------------|
| test accuracy        | 48.2%            | 51.3%      |
| # trainable params   | 9175040          | 9461760    |
| peak memory reserved | 20.74 GB         | 23.02 GB   |

There is, however, also a marked increase in memory usage, despite matching parameter count. Since the operations are different, this may not be a surprise, but let's wait for the final verdict once this experiment runs on our AWS instance.
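The rank matching can be checked with a few lines of arithmetic. The sketch below assumes a Llama-style decoder with hidden size 3072, MLP size 8192, key/value projection size 1024, and 28 layers. None of these dimensions are stated in this PR; they are an assumption chosen because they reproduce the parameter counts in the table above.

```python
# Assumed model shape (not stated in the PR): these dimensions reproduce
# the trainable parameter counts reported in the table.
HIDDEN = 3072        # model width
INTERMEDIATE = 8192  # MLP width
KV_DIM = 1024        # output width of v_proj under grouped-query attention
NUM_LAYERS = 28

# (in_features, out_features) of each targeted linear layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, KV_DIM),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(target_modules, rank):
    """LoRA adds A (rank x in) and B (out x rank) to every targeted layer."""
    per_block = sum(rank * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_block * NUM_LAYERS


attention = lora_params(["q_proj", "v_proj"], rank=32)

mlp_targets = ["gate_proj", "up_proj", "down_proj"]
# Integer rank whose parameter count is closest to the attention baseline
best_rank = min(
    range(1, 65),
    key=lambda r: abs(lora_params(mlp_targets, r) - attention),
)
mlp = lora_params(mlp_targets, best_rank)

print(attention)  # 9175040
print(best_rank)  # 10
print(mlp)        # 9461760
```

Under these assumptions, rank 10 is the closest integer match, leaving the MLP run with about 3% more trainable parameters than the attention run.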

Note: I also tested higher and lower ranks when targeting the MLP. The effect on memory usage was negligible, but the score improved as the rank increased:

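| metric             | rank 8  | rank 10 | rank 12  | rank 32  |
|--------------------|---------|---------|----------|----------|
| test accuracy      | 50.3%   | 51.3%   | 52.2%    | 54.8%    |
| # trainable params | 7569408 | 9461760 | 11354112 | 30277632 |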

In the end, I chose to add only the rank 10 experiment, since it matches the number of trainable parameters.
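As a quick sanity check on the sweep, the snippet below uses only the numbers reported in the two tables of this PR. It confirms that the trainable parameter count grows linearly with rank and relates each run to the rank 32 attention baseline. The variable names are only illustrative.

```python
# Trainable parameter counts and test accuracy (%) from the rank sweep table
sweep = {
    8: (7569408, 50.3),
    10: (9461760, 51.3),
    12: (11354112, 52.2),
    32: (30277632, 54.8),
}
attention_baseline = 9175040  # attention modules at rank 32, from the first table

# Parameters added per unit of rank; identical for every run if scaling is linear
params_per_rank = {rank: params // rank for rank, (params, _) in sweep.items()}

for rank, (params, accuracy) in sweep.items():
    ratio = params / attention_baseline
    print(f"rank {rank:>2}: {params:>9} params ({ratio:.2f}x baseline), {accuracy}% accuracy")
```

Every run works out to the same 946176 parameters per unit of rank, so rank 32 on the MLP uses roughly 3.3 times the parameters of the attention baseline.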

HuggingFaceDocBuilderDev · 2025-10-16T13:34:38Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

imcoza · 2025-10-23T09:32:00Z

Hey @BenjaminBossan, have you tried merging the MLP and attention layers and then checking how it affects the accuracy? Curious if combining them gives any noticeable difference.

BenjaminBossan · 2025-10-23T10:09:09Z

@imcoza You mean targeting both attention and MLP layers (target_modules="all-linear")? I haven't explicitly tested that in this context. The reasons are simply:

- The blog post mentions that it doesn't improve much compared to just MLP.
- I wanted to match the parameter count; if we target all linear layers, we'd have to choose a very small rank.

If you want, you can still give it a try.
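To get a rough idea of how small that rank would be, here is a minimal sketch. It assumes the same Llama-style shape as above (hidden size 3072, MLP size 8192, key/value projection size 1024, 28 layers), which is not stated in this PR, and it assumes that "all-linear" covers the seven projections of each decoder block but not the output head.

```python
# Assumed model shape (not stated in the PR)
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 3072, 8192, 1024, 28

# (in_features, out_features) of every linear layer in one decoder block
ALL_LINEAR = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA parameters added per unit of rank when all of these layers are targeted
all_linear_per_rank = NUM_LAYERS * sum(i + o for i, o in ALL_LINEAR.values())

budget = 9175040  # trainable params of the attention run at rank 32
matched_rank = round(budget / all_linear_per_rank)
matched_params = matched_rank * all_linear_per_rank

print(all_linear_per_rank)  # 1519616
print(matched_rank)         # 6
print(matched_params)       # 9117696
```

Under these assumptions, targeting all linear layers would need a rank of about 6 to stay within the same parameter budget.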

BenjaminBossan requested a review from githubnemo October 16, 2025 13:49

githubnemo approved these changes Oct 16, 2025

BenjaminBossan merged commit 8d8aa0b into huggingface:main Oct 16, 2025
2 of 13 checks passed

BenjaminBossan deleted the method-comparison-lora-experiment-target-mlp branch October 16, 2025 15:37

BenjaminBossan mentioned this pull request Oct 23, 2025

Comparison of Different Fine-Tuning Techniques for Conversational AI #2310

Open

4 participants