
Conversation

@BenjaminBossan
Member

Resolves #2603

Trainable tokens error out when using DeepSpeed ZeRO-3 (Z3) because the embedding weights are not available on all ranks. This fix handles the problem efficiently: it gathers the weights on a single rank, initializes them there, and then broadcasts only the affected slice to the other ranks.
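The gather-initialize-broadcast-slice idea can be illustrated with a minimal single-process sketch. This is a simulation, not the actual PEFT/DeepSpeed code: under ZeRO-3 the full embedding only materializes inside a parameter gather (e.g. `deepspeed.zero.GatheredParameters`), and the real broadcast goes through `torch.distributed`. All names here (`gather_full_weight`, `per_rank_slices`, the zero-initialization) are illustrative assumptions.

```python
import random

# Simulated setup: a small vocabulary and a few "ranks".
WORLD_SIZE = 4
VOCAB, DIM = 10, 3
token_indices = [7, 9]  # rows that become trainable tokens

def gather_full_weight(seed=0):
    """Stand-in for gathering the sharded embedding on rank 0.

    In real ZeRO-3 this full matrix only exists inside the gather
    context on the rank that performs the initialization.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]

# Rank 0: gather the full weight, then initialize only the affected rows.
full_weight = gather_full_weight()
init_slice = {i: [0.0] * DIM for i in token_indices}  # illustrative zero-init

# "Broadcast" only the initialized slice: every rank receives just these
# rows instead of the full VOCAB x DIM matrix.
per_rank_slices = [dict(init_slice) for _ in range(WORLD_SIZE)]

# The communication volume is |token_indices| x DIM, not VOCAB x DIM.
slice_numel = len(token_indices) * DIM
full_numel = VOCAB * DIM
```

The point of broadcasting only the slice is that the payload scales with the number of trainable tokens rather than with the full vocabulary, which is what makes the approach efficient for large embedding matrices.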

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan BenjaminBossan marked this pull request as draft June 24, 2025 08:41
Collaborator

@githubnemo githubnemo left a comment


Thanks for taking this on. The changes so far look reasonable, but I see that there are still some open points in #2603.

@BenjaminBossan
Member Author

@githubnemo The new solution works for me locally (at first I had repro issues because of an incorrect accelerate config) and the user also confirmed that their original issue was solved. Please review again.

@BenjaminBossan BenjaminBossan marked this pull request as ready for review June 25, 2025 09:23
@BenjaminBossan BenjaminBossan merged commit 5af0cbe into huggingface:main Jun 26, 2025
25 of 40 checks passed
@BenjaminBossan BenjaminBossan deleted the fix-trainable-tokens-deepspeed-compatibility branch June 26, 2025 14:48
efraimdahl pushed a commit to efraimdahl/peft that referenced this pull request Jul 12, 2025


Development

Successfully merging this pull request may close these issues.

lora with trainable_token_indices do NOT support zero3?

3 participants