FIX VeRA failure on multiple GPUs #2163
Conversation
The shared buffers vera_A and vera_B could be on the wrong device when using multiple GPUs, resulting in an error. This PR moves them to the correct device to fix the error.

Example of a failing run: https://github.com/huggingface/peft/actions/runs/11396317278/job/31709933958

Since these buffers are shared, I chose *not* to move the whole buffer to the device. Instead, when we create the slices from those buffers during forward, I move only the slices to the correct device. This could be inefficient in terms of runtime, but IIUC, the alternative would be to create new copies of these buffers per device, using more memory.

The failing tests were introduced in huggingface#2076, but the error was already there beforehand. I did not discover these failing tests earlier because we had a concurrent error caused by a transformers issue which looked very similar, and I wrongly assumed that the VeRA error was caused by the same issue. But now that that issue has been fixed, the error still persists, prompting me to investigate.
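For illustration, here is a minimal sketch of the slicing approach described above. `VeraLinearSketch` and its attribute names are simplified stand-ins and not PEFT's actual `Linear` implementation; the only point it demonstrates is moving the *slice* of the shared buffer to the input's device inside `forward`, while the shared buffer itself stays put.

```python
import torch
import torch.nn as nn


class VeraLinearSketch(nn.Module):
    """Minimal sketch of the device-handling idea, not PEFT's actual VeRA layer."""

    def __init__(self, base_layer: nn.Linear, vera_A: torch.Tensor, vera_B: torch.Tensor,
                 lambda_d: nn.Parameter, lambda_b: nn.Parameter, r: int):
        super().__init__()
        self.base_layer = base_layer
        # vera_A (r, max_in_features) and vera_B (max_out_features, r) are shared
        # across all adapted layers; when the model is split across multiple GPUs
        # they can live on a different device than this layer's weights and inputs.
        self.vera_A = vera_A
        self.vera_B = vera_B
        self.lambda_d = lambda_d  # per-layer scaling vector of shape (r,)
        self.lambda_b = lambda_b  # per-layer scaling vector of shape (out_features,)
        self.r = r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        in_f = self.base_layer.in_features
        out_f = self.base_layer.out_features
        # Slice the shared buffers down to this layer's shape, then move only the
        # slices to the input's device; the shared buffers themselves stay where they are.
        sliced_A = self.vera_A[: self.r, :in_f].to(x.device)
        sliced_B = self.vera_B[:out_f, : self.r].to(x.device)
        result = self.base_layer(x)
        # VeRA-style update: scale by lambda_d after A, by lambda_b after B.
        delta = ((x @ sliced_A.T) * self.lambda_d) @ sliced_B.T * self.lambda_b
        return result + delta
```

The trade-off is a small per-forward device copy of the sliced tensors, in exchange for keeping a single shared copy of vera_A and vera_B in memory.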
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
SunMarc
left a comment
LGTM ! Thanks for fixing !
Sorry for the (too) late reply, I was really busy when I received the notification, and forgot about the PR later. In case of other parallelisms, I'm not sure what would be the best solution, but the current one should be ok too.
Thanks for your comment @dkopi. I agree that moving the whole buffer once should in theory be faster. However, I was unsure how robust that solution would be across different parallelization strategies, so I went with this solution, which should be more robust.
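For context, a rough sketch of the alternative discussed here, i.e. moving the shared buffers once up front instead of moving the slice on every forward pass. `move_shared_buffers_once` is a hypothetical helper for illustration, not part of PEFT's API:

```python
import torch


def move_shared_buffers_once(vera_layer, device: torch.device) -> None:
    # Hypothetical helper: copy the shared projections to this layer's device once,
    # so forward() no longer needs a per-call .to(). With a single target device this
    # is a one-time cost, but under model parallelism each device ends up holding its
    # own full copy of vera_A and vera_B, which is the extra memory the PR avoids.
    vera_layer.vera_A = vera_layer.vera_A.to(device)
    vera_layer.vera_B = vera_layer.vera_B.to(device)
```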