FIX VeRA failure on multiple GPUs #2163
Conversation
The shared buffers vera_A and vera_B could be on the wrong device when using multiple GPUs, resulting in an error. This PR moves them to the correct device to fix the error.

Example of a failing run: https://github.com/huggingface/peft/actions/runs/11396317278/job/31709933958

Since these buffers are shared, I chose *not* to move the whole buffer to the device. Instead, when we create the slices from those buffers during forward, I move only the slices to the correct device. This could be inefficient in terms of runtime, but IIUC, the alternative would be to create new copies of these buffers per device, using more memory.

The failing tests were introduced in huggingface#2076, but the error was already there beforehand. I did not discover these failing tests earlier because we had a concurrent error caused by a transformers issue which looked very similar, and I wrongly assumed that the VeRA error was caused by the same issue. But now that that issue has been fixed, the error still persists, prompting me to investigate.
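For illustration, here is a minimal sketch of the slicing approach described above. `VeraLinearSketch` and its attribute names are simplified stand-ins and not PEFT's actual `Linear` implementation; the only point it demonstrates is moving the *slice* of the shared buffer to the input's device inside `forward`, while the shared buffer itself stays put.

```python
import torch
import torch.nn as nn


class VeraLinearSketch(nn.Module):
    """Minimal sketch of the device-handling idea, not PEFT's actual VeRA layer."""

    def __init__(self, base_layer: nn.Linear, vera_A: torch.Tensor, vera_B: torch.Tensor,
                 lambda_d: nn.Parameter, lambda_b: nn.Parameter, r: int):
        super().__init__()
        self.base_layer = base_layer
        # vera_A (r, max_in_features) and vera_B (max_out_features, r) are shared
        # across all adapted layers; when the model is split across multiple GPUs
        # they can live on a different device than this layer's weights and inputs.
        self.vera_A = vera_A
        self.vera_B = vera_B
        self.lambda_d = lambda_d  # per-layer scaling vector of shape (r,)
        self.lambda_b = lambda_b  # per-layer scaling vector of shape (out_features,)
        self.r = r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        in_f = self.base_layer.in_features
        out_f = self.base_layer.out_features
        # Slice the shared buffers down to this layer's shape, then move only the
        # slices to the input's device; the shared buffers themselves stay where they are.
        sliced_A = self.vera_A[: self.r, :in_f].to(x.device)
        sliced_B = self.vera_B[:out_f, : self.r].to(x.device)
        result = self.base_layer(x)
        # VeRA-style update: scale by lambda_d after A, by lambda_b after B.
        delta = ((x @ sliced_A.T) * self.lambda_d) @ sliced_B.T * self.lambda_b
        return result + delta
```

The trade-off is a small per-forward device copy of the sliced tensors, in exchange for keeping a single shared copy of vera_A and vera_B in memory.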
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
SunMarc
left a comment
LGTM ! Thanks for fixing !
Sorry for the (too) late reply, I was really busy when I received the notification, and forgot about the PR later. In case of other parallelisms, I'm not sure what would be the best solution, but the current one should be ok too.
Thanks for your comment @dkopi. I agree that moving the whole buffer once should in theory be faster. However, I was unsure how robust that solution would be across different parallelization strategies, so I went with this solution, which should be more robust.
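For context, a rough sketch of the alternative discussed here, i.e. moving the shared buffers once up front instead of moving the slice on every forward pass. `move_shared_buffers_once` is a hypothetical helper for illustration, not part of PEFT's API:

```python
import torch


def move_shared_buffers_once(vera_layer, device: torch.device) -> None:
    # Hypothetical helper: copy the shared projections to this layer's device once,
    # so forward() no longer needs a per-call .to(). With a single target device this
    # is a one-time cost, but under model parallelism each device ends up holding its
    # own full copy of vera_A and vera_B, which is the extra memory the PR avoids.
    vera_layer.vera_A = vera_layer.vera_A.to(device)
    vera_layer.vera_B = vera_layer.vera_B.to(device)
```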