Conversation

@BenjaminBossan
Member

The "LoRA Without Regret" blog mentions that targeting the MLP part of the transformer is more effective than targeting the attention modules. This experiment tests this by targeting:

["gate_proj", "up_proj", "down_proj"]

instead of the default layers (["q_proj", "v_proj"]).
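
For illustration, this is roughly what the corresponding PEFT config looks like (a minimal sketch, not the actual benchmark script; the model name here is an assumption):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed base model for illustration; the benchmark script defines the actual model.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

config = LoraConfig(
    r=10,  # chosen to roughly match the parameter count of rank 32 on attention, see below
    target_modules=["gate_proj", "up_proj", "down_proj"],  # MLP instead of the default q_proj/v_proj
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # prints the trainable parameter count
```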

I chose the rank so that the parameter count matches what we would get when targeting the attention modules with rank 32, which works out to rank 10. Testing on my machine, there is indeed a nice improvement in the test score:

| metric               | target attention | target MLP |
|----------------------|------------------|------------|
| test accuracy        | 48.2%            | 51.3%      |
| # trainable params   | 9175040          | 9461760    |
| peak memory reserved | 20.74 GB         | 23.02 GB   |

There is, however, also a marked increase in memory usage, despite matching parameter count. Since the operations are different, this may not be a surprise, but let's wait for the final verdict once this experiment runs on our AWS instance.
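
As a back-of-the-envelope check on why rank 10 is the matching rank: each adapted weight of shape (d_out, d_in) adds r * (d_in + d_out) trainable parameters. Plugging in Llama-3.2-3B shapes (hidden size 3072, intermediate size 8192, KV projection width 1024, 28 layers; these dimensions are an assumption, not quoted from the benchmark) reproduces the counts above:

```python
# Assumed Llama-3.2-3B shapes; each LoRA-adapted weight adds r * (d_in + d_out) parameters.
hidden, intermediate, kv, layers = 3072, 8192, 1024, 28

attn_rank32 = 32 * ((hidden + hidden) + (hidden + kv)) * layers  # q_proj + v_proj
mlp_rank10 = 10 * 3 * (hidden + intermediate) * layers           # gate_proj, up_proj, down_proj

print(attn_rank32)  # 9175040
print(mlp_rank10)   # 9461760
```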

Note: I also tested higher and lower ranks when targeting the MLP. Raising the rank had a negligible effect on memory usage, but it did improve the score:

| metric             | rank 8  | rank 10 | rank 12  | rank 32  |
|--------------------|---------|---------|----------|----------|
| test accuracy      | 50.3%   | 51.3%   | 52.2%    | 54.8%    |
| # trainable params | 7569408 | 9461760 | 11354112 | 30277632 |

In the end, I chose to add only the rank 10 experiment, to match the number of trainable parameters.

The "LoRA Without Regret" blog
post (https://thinkingmachines.ai/blog/lora/) mentions that targeting
the MLP part of the transformer is more effective than targeting the
attention modules. This experiment tests this by targeting:

["gate_proj", "up_proj", "down_proj"]

instead of the default layers (["q_proj", "v_proj"]).

I chose a rank to match the parameter count we would get when targeting
the attention modules with rank 32, which is rank 10. Testing on my
machine, there is indeed a nice improvement in the test score:

| metric               | target attention | target MLP |
|----------------------|------------------|------------|
| test accuracy        | 48.2%            | 51.3%      |
| # trainable params   | 9175040          | 9461760    |
| peak memory reserved | 20.74 GB         | 23.02 GB   |

There is, however, also a marked increase in memory usage, despite
matching parameter count. Since the operations are different, this may
not be a surprise, but let's wait for the final verdict once this
experiment runs on our AWS instance.

Note: I also tested higher and lower ranks when targeting the MLP. The
effect on memory usage was negligible, but it did improve the score:

| metric             | rank 8  | rank 10 | rank 12  | rank 32  |
|--------------------|---------|---------|----------|----------|
| test accuracy      | 50.3%   | 51.3%   | 52.2%    | 54.8%    |
| # trainable params | 7569408 | 9461760 | 11354112 | 30277632 |

In the end, I chose only to add the rank 10 experiment to match the
number of trainable parameters.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan BenjaminBossan merged commit 8d8aa0b into huggingface:main Oct 16, 2025
2 of 13 checks passed
@BenjaminBossan BenjaminBossan deleted the method-comparison-lora-experiment-target-mlp branch October 16, 2025 15:37
@imcoza

imcoza commented Oct 23, 2025

Hey @BenjaminBossan, have you tried targeting the MLP and attention layers together and checking how that affects the accuracy? Curious if combining them gives any noticeable difference.

@BenjaminBossan
Member Author

@imcoza You mean targeting both attention and MLP layers (target_modules="all-linear")? I haven't explicitly tested that in this context. The reasons are simply:

  1. The blog post mentions that it doesn't improve much compared to just MLP.
  2. I wanted to match the parameter count; if we target all linear layers, we'd have to choose a very small rank.

If you want, you can still give it a try.
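
For reference, a minimal sketch of what such a config could look like (the rank here is only an illustrative value, not something that was benchmarked):

```python
from peft import LoraConfig

# "all-linear" targets every linear layer (attention + MLP) except the output head;
# with this many targets, a small rank keeps the trainable parameter count comparable.
config = LoraConfig(r=8, target_modules="all-linear")
```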
