DoRA embed_scale Support #2838 #2839
Merged
Summary
This PR adds embed_scale support to DoRA (Weight-Decomposed LoRA), ensuring DoRA correctly handles models with scaled embeddings (e.g., Gemma3TextScaledWordEmbedding). This is a companion PR to the LoRA (#2825) and X-LoRA (#2831) embed_scale fixes, following the suggestion to extend the same support to DoRA's embedding variant.
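For context, a "scaled embedding" multiplies the looked-up vectors by a constant factor in the output space. A minimal sketch of such a layer, modeled loosely on Gemma3TextScaledWordEmbedding (not its exact implementation):

```python
import torch
import torch.nn as nn


class ScaledWordEmbedding(nn.Embedding):
    """Embedding that scales its output, in the spirit of Gemma3TextScaledWordEmbedding."""

    def __init__(self, num_embeddings: int, embedding_dim: int, embed_scale: float = 1.0):
        super().__init__(num_embeddings, embedding_dim)
        # Gemma3 uses sqrt(hidden_size) here; keeping it on the module lets adapters read it back
        self.register_buffer("embed_scale", torch.tensor(embed_scale))

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # the scaling happens in the output space, not in the weights
        return super().forward(input_ids) * self.embed_scale
```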
Changes
Code
- Updated `DoraEmbeddingVariant.forward()` to apply `embed_scale` to the DoRA contribution (see the sketch below)
- `embed_scale` is retrieved via `module._get_embed_scale()` (the same method used by LoRA)
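A hedged sketch of the ordering these changes implement. Names, the function signature, tensor shapes, and the norm axis are illustrative assumptions; PEFT's actual `DoraEmbeddingVariant.forward()` looks different and obtains the scale via `module._get_embed_scale()`:

```python
import torch
import torch.nn.functional as F


def dora_embedding_forward(base_weight, magnitude, lora_A, lora_B, scaling, embed_scale, input_ids):
    # Low-rank weight update; assumed shapes: lora_A (num_embeddings, r), lora_B (r, embedding_dim)
    delta_w = (lora_A @ lora_B) * scaling

    # 1) Weight decomposition: the norm is a property of the weights alone
    weight_norm = (base_weight + delta_w).norm(p=2, dim=0, keepdim=True)
    mag_norm_scale = magnitude / weight_norm  # magnitude assumed shaped (1, embedding_dim)

    # 2) Embedding lookup through the decomposed weights
    out = F.embedding(input_ids, mag_norm_scale * (base_weight + delta_w))

    # 3) embed_scale is an output-space transform, applied only after the decomposition
    return out * embed_scale
```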
Tests

- Added `@pytest.mark.parametrize("use_dora", [False, True])` to `test_lora_embed_scale_is_applied` (a sketch of the resulting test shape follows this list)
- Extended the test's `LoraConfig` to include the `use_dora` parameter
- Added `atol=1e-5, rtol=1e-5` to handle small numerical differences on the MPS backend
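A rough sketch of the parametrized test shape, not the actual test from PEFT's suite; `build_model_with_scaled_embedding()` is a hypothetical helper standing in for the real model setup, and the targeted module name is only an example:

```python
import pytest
import torch
from peft import LoraConfig, get_peft_model


@pytest.mark.parametrize("use_dora", [False, True])
def test_lora_embed_scale_is_applied(use_dora):
    # build_model_with_scaled_embedding() is a hypothetical stand-in for the
    # real fixture; the actual test has more setup.
    model = build_model_with_scaled_embedding()
    input_ids = torch.tensor([[1, 2, 3]])
    out_base = model.get_input_embeddings()(input_ids)

    config = LoraConfig(target_modules=["embed_tokens"], use_dora=use_dora)
    peft_model = get_peft_model(model, config)
    out_peft = peft_model.get_input_embeddings()(input_ids)

    # Freshly initialized adapters are a no-op, so the adapter's embed_scale
    # path must reproduce the base layer's scaled output.
    torch.testing.assert_close(out_peft, out_base, atol=1e-5, rtol=1e-5)
```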
Key Design Decision

Why apply embed_scale AFTER the weight norm calculation?
DoRA decomposes weight updates as:
```
W_new = m * (W_base + ΔW) / ||W_base + ΔW||
```

The weight norm calculation (`||W_base + ΔW||`) is a geometric property of the weight matrix itself and should remain independent of output scaling. The `embed_scale` is an output-space transformation applied by specific embedding layers (such as Gemma3's sqrt(hidden_size) scaling), so it is applied after DoRA's weight decomposition completes.

This preserves DoRA's weight geometry semantics while ensuring output consistency with the base layer.
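Written out with the scaling included (same notation as above, with E_W(x) denoting the embedding lookup for token ids x), the adapted output is:

```math
W_\mathrm{new} = m \cdot \frac{W_\mathrm{base} + \Delta W}{\lVert W_\mathrm{base} + \Delta W \rVert},
\qquad
y = \mathrm{embed\_scale} \cdot E_{W_\mathrm{new}}(x)
```

The norm in the first expression never sees `embed_scale`; the scaling enters only through the second.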
Test Results
- `test_lora_embed_scale_is_applied[False]` (vanilla LoRA) - 1.67s
- `test_lora_embed_scale_is_applied[True]` (DoRA) - 0.06s
- `test_lora_embed_scale_is_applied_mixed_batch` - 0.04s
- `make style` passed

Fixes: #2838
cc: @BenjaminBossan