
Conversation

@kkb-code
Contributor

@kkb-code kkb-code commented Jun 11, 2025

We would like to add code for Sparse High Rank Adapters (SHiRA), which was published at NeurIPS 2024 (PAPER LINK).

This is an alternative type of adapter that has been found to have significant advantages over low-rank adapters. Specifically, SHiRA achieves better accuracy than LoRA on a variety of vision and language tasks. It also offers simpler and higher-quality multi-adapter fusion by significantly reducing concept loss, a common problem with low-rank adapters. Please see the paper for more details.
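For context, here is a minimal usage sketch of what this PR enables, assuming the ShiraConfig API added by this PR; the base model and the target module names are illustrative examples, not prescriptions from the PR itself.

# Minimal sketch, assuming the ShiraConfig API from this PR; the base model
# and target module names are just examples.
from transformers import AutoModelForCausalLM
from peft import ShiraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
config = ShiraConfig(target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # only the sparse, masked entries are trainable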

@kkb-code
Contributor Author

Issue: #2585

@kkb-code kkb-code marked this pull request as draft June 12, 2025 22:50
@kkb-code kkb-code marked this pull request as ready for review June 12, 2025 22:52
Member

@BenjaminBossan BenjaminBossan left a comment

Thanks for this PR to integrate SHiRA into PEFT. It's good to see a paper that not only deals with academic improvements but also considers practical applications.

I went through the code, and I think it generally already looks quite good. Still, I found a couple of issues; please check.

Regarding testing: I saw that you added test_shira.py. This is fine for SHiRA-specific tests, but for the general PEFT integration, we need to update the existing PEFT tests to include SHiRA. As a first step, let's add one or a few examples using SHiRA to the custom model tests here:

("Vanilla MLP 1 LoRA", "MLP", LoraConfig, {"target_modules": "lin0"}),

After that, please run pytest tests/test_custom_models.py -k "shira" to see if these tests pass.

Ideally, before we merge, we should also:

  1. add docs
  2. add an example
  3. add a benchmark

Those steps are not strictly necessary, since they can be added later, but it's often a good idea to include them because they can reveal new areas for improvement. It's fine to focus on the proper implementation for now, though.

Finally, please always run make style before pushing your changes to ensure that the linter is happy.

@X1AOX1A

X1AOX1A commented Jun 13, 2025

It seems like only the random mask is implemented; what about SHiRA-WM, SHiRA-Grad, and SHiRA-SNIP?

@kkb-code
Contributor Author

> Thanks for this PR to integrate SHiRA into PEFT. […] Finally, please always run make style before pushing your changes to ensure that the linter is happy.

Hi @BenjaminBossan, we have now included SHiRA in test_custom_models.py, and all the pytest runs pass. We have also responded to and/or incorporated your requested changes in our new commits. Hopefully, this addresses your concerns.

We have fixed the lint issues except for UP045, which already exists in many other modules throughout the PEFT repo. Please let us know if you would like us to fix those for our modules.

Member

@BenjaminBossan BenjaminBossan left a comment

Thanks a lot for the updates, this is shaping up quite well.

I still have a couple of comments, but those should not be too hard to address.

Moreover, let's work on the following to finalize the PR:

  1. Please check your ruff version. It doesn't look like make style worked as expected. The CI uses ruff-0.9.10.
  2. Let's add an entry to the docs, check how the other PEFT methods do it.
  3. Let's add an example to the examples/ folder; what exactly you add is up to you. You can take an existing one and modify it for SHiRA.
  4. Let's extend the test coverage by adding entries to test_decoder_models.py, test_encoder_decoder_models.py, test_feature_extraction_models.py and test_seq_classifier.py. This should be straightforward, as you only need to add a config to the list of test cases.

@BenjaminBossan
Member

Ah, forgot to reply to this:

> We believe that future users may want to experiment with their own custom mask functions without requiring PEFT library changes. This is why we used a function reference in this field, to allow for maximum flexibility (and we suggest that keeping this functionality would be beneficial to the community).

IMO, it makes little sense to have a config parameter (mask_fn) that isn't persisted. It means that the user needs to run custom code after loading the model to apply the correct mask function. If they do that anyway, we might as well remove the argument. I think something like this would work:

@dataclass
class ShiraConfig(PeftConfig):
    ...
    mask_type: Literal["random", <more-options>] = field(default="random", metadata={"help": "..."})

    def __post_init__(self):
        if self.mask_type == "random":
            self.mask_fn = random_mask
        elif ...
        else:
            warnings.warn(
                f"Argument {self.mask_type=} not recognized, please supply your own masking function "
                "by calling `config.mask_fn = my_mask_fn`."
            )
            self.mask_fn = None

Then, if users want to provide a custom mask function, they can still do:

shira_config = ShiraConfig(...)
shira_config.mask_fn = my_custom_mask_function

The same would be needed after loading the model, but that's necessary anyway, even with the current implementation. It is important to document this well though.
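For illustration, a custom mask function under this proposed interface could look like the sketch below. The signature (weight tensor plus an r budget) and the function name are assumptions for illustration, not the final API; it builds a weight-magnitude mask in the spirit of SHiRA-WM from the paper:

import torch

# Hypothetical sketch: given the frozen base weight and a budget r, return a
# {0, 1} mask with r * (rows + cols) ones -- the same number of trainable
# parameters a rank-r LoRA would use on that layer.
def my_custom_mask_function(weight: torch.Tensor, r: int) -> torch.Tensor:
    num_ones = r * (weight.shape[0] + weight.shape[1])
    flat = weight.abs().flatten()
    top = torch.topk(flat, k=num_ones).indices  # largest-magnitude entries
    mask = torch.zeros_like(flat)
    mask[top] = 1.0
    return mask.view_as(weight)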

Moreover, I think it would be really great if all the mask functions from the paper could be added.

@kkb-code
Contributor Author

kkb-code commented Jul 8, 2025

> I investigated a bit more how we can avoid the spurious warning about mask_type and I believe my solution should help with that. Please check my suggestion.

Added.

> Please also merge with/rebase on the latest main, as there was an issue with the latest transformers release. Then, CI should pass and we should be good to merge.

We have now merged with the latest main. We also ran make style, and everything passed. This should be good to go from our side. Thanks a lot.

@BenjaminBossan
Member

Hmm, there is a strange error occurring on the Windows CI. It seems that each SHiRA test that involves persistence is affected. At first, I thought it might just be flakiness but I re-ran the CI today and it happened again. The error occurs when we have code like this in the tests:

with tempfile.TemporaryDirectory() as tmp_dirname:
    # do stuff

When the context manager is left, the temp directory is cleaned up, but here the cleanup fails with:

    def _rmtree_unsafe(path, onerror):
        try:
>           with os.scandir(path) as scandir_it:
E           NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmptrnulxqo\\adapter_model.safetensors'

We have had some occasional issues in the past with Windows not being able to clean up temp directories, to wit:

# note: not using the context manager here because it fails on Windows CI for some reason
tmp_dirname = tempfile.mkdtemp()
try:
    model.save_pretrained(tmp_dirname)
    model = self.transformers_class.from_pretrained(model_id).to(self.torch_device)
    model = PeftModel.from_pretrained(
        model, tmp_dirname, torch_device=self.torch_device, low_cpu_mem_usage=True
    )
    assert {p.device.type for p in model.parameters()} == {self.torch_device}
    model.load_adapter(tmp_dirname, adapter_name="other", low_cpu_mem_usage=True)
    assert {p.device.type for p in model.parameters()} == {self.torch_device}
finally:
    try:
        shutil.rmtree(tmp_dirname)
    except PermissionError:
        # windows error
        pass

However, we have never had a case where so many tests were failing and all of them were tied to a single PEFT method. This makes me think that something about SHiRA aggravates the problem, perhaps related to the shira_indices? Unfortunately, I don't have a Windows machine to test on; do you happen to have one?

@kkb-code
Contributor Author

kkb-code commented Jul 9, 2025

> Hmm, there is a strange error occurring on the Windows CI. It seems that each SHiRA test that involves persistence is affected. […] This makes me think that something about SHiRA aggravates the problem, perhaps related to the shira_indices? Unfortunately, I don't have a Windows machine to test on; do you happen to have one?

This is strange. I do not have a Windows machine to test things on; we usually work with Linux-based systems.

Also, regarding shira_indices: the only new code we added to save/load them works entirely at the state-dict level (e.g., to_return here and reading from peft_model_state_dict here). So I am not sure how or why the directory names could become invalid for SHiRA, since we did not change any paths, file names, etc. It is strange, though, that only SHiRA tests are affected.

I am also seeing some Access Denied errors on Windows; maybe that is the culprit? See here. Also, adapter_model.safetensors is a file, not a directory.

@BenjaminBossan
Member

This is indeed a very strange situation and not something I've come across before. Today, I re-ran the CI and the same errors still occur. Too bad that we don't have Windows machines to test on, so we need to resort to the GH CI instead.

Perhaps we can try to eliminate potential sources of error one by one. The first thing I'd test is to comment out all the additions to save_and_load.py. Of course, this will cause some tests to fail, but I'm curious whether that's enough to fix the Windows permission error. Could you please try that? In addition, to avoid overexerting the CI, let's only test on a single Python version:

python-version: ["3.9", "3.10", "3.11", "3.12"]

-       python-version: ["3.9", "3.10", "3.11", "3.12"]
+       python-version: ["3.12"]

Of course, we will revert these changes before merging.

@kkb-code
Contributor Author

kkb-code commented Jul 10, 2025

I have pushed a new commit. I also limited the tests to Windows only to make sure they run fast. All SHiRA-related code was removed from save_and_load.py. UPDATE: I added back all OSes (Linux, macOS, Windows), since I thought that removing the other two might have stopped the automatic tests from running (I'm not sure whether there are automatic tests on the GH CI or whether you run them manually, but I just wanted to try this).

@kkb-code
Contributor Author

kkb-code commented Jul 10, 2025

Interestingly, when I remove all SHiRA-related lines from save_and_load.py, Windows does not throw the PermissionError. I'm not sure what is going on, but I added another commit reverting save_and_load.py to see if it still fails. The Windows and Ubuntu tests on commit 34af1fd failed only one test, which is expected (the custom-mask save/load test fails if we do not save and load the shira_indices for the custom mask).

UPDATE: Commit ee9d1d5 fails with the same PermissionError as before. I'm not sure how adding something to a state_dict can cause that, and we aren't really sure how to fix this. Nonetheless, I have added another commit, f4ff99d, to try to debug it. Here, I have kept the SHiRA weight-saving elif block but removed everything else that saves or loads the shira_indices.

Any help would be appreciated. Thanks.

@BenjaminBossan
Member

> Any help would be appreciated. Thanks.

I'm as stumped as you are; this really doesn't make a lot of sense. I wonder if you could test converting the ints to floats before saving and then back to ints when loading. This is not a final solution, just a way to test whether it could somehow be related to the dtype.

@kkb-code
Contributor Author

Just pushed a commit that changes the SHiRA indices dtype to torch.float32 while saving and back to torch.int while loading. Let us see if this helps.

Can we think of any other solutions that might help us merge this? For example, can we skip certain tests on Windows only? This clearly seems like a permission issue that happens only on Windows; if there were anything wrong with our code, we would see these issues on other OSes too.

@BenjaminBossan
Member

> Just pushed a commit that changes the SHiRA indices dtype to torch.float32 while saving and back to torch.int while loading. Let us see if this helps.

This does seem to help; the only remaining Windows error is a 429 from the Hub, so that's fine.

> Can we skip certain tests on Windows only? This clearly seems like a permission issue that happens only on Windows; if there were anything wrong with our code, we would see these issues on other OSes too.

While I generally agree, I think there is a possibility that something more fundamental is broken here, so as long as we don't fully understand what is going on, merging is risky. I can't imagine that saving integers in the state_dict on Windows is generally broken, but maybe something more subtle is going on. Unfortunately, I currently don't have time to investigate and don't have a Windows machine, but perhaps you have an idea based on the latest finding.

@kkb-code
Contributor Author

kkb-code commented Jul 11, 2025

> This does seem to help; the only remaining Windows error is a 429 from the Hub, so that's fine.

> While I generally agree, I think there is a possibility that something more fundamental is broken here […]

Okay, this is good, at least. It seems there is something going on with integers and safetensors that is specific to Windows. Someone reported a related issue on the safetensors GitHub; please see here. It potentially looks like some kind of overflow problem on Windows.

> The reported array length is exactly 2 * (test_tensor.numel() - 2**32), leading me to suspect this is an integer overflow issue. The issue only seems to manifest on Windows, with the code above performing as expected on Linux.

It seems the above issue was closed as "not planned", so I don't think any fix was provided.

I am okay with casting the indices to float32 while saving and back to int while loading as a solution to this (the file size should also stay the same). It does not seem to affect any Linux tests.

@BenjaminBossan
Member

> I am okay with casting the indices to float32 while saving and back to int while loading as a solution to this. It does not seem to affect any Linux tests.

Okay, so how about only doing that when the platform is Windows? Then we can be sure that Linux and macOS are unaffected. For reasonably small values this should work fine, but at a certain size the conversion will fail. I wonder if this point is known in advance.

@kkb-code
Contributor Author

I agree, precision issues can occur for very large integer values (float32 represents integers exactly only up to 2**24). So let me limit this to Windows systems. I will use something like platform.system() == 'Windows'; similar platform.system() checks are used in other places inside PEFT, so this should work. Will push shortly.
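For illustration, a minimal sketch of that workaround is below; the helper names are hypothetical, and the actual changes live in the save/load paths of save_and_load.py:

import platform

import torch

# Hypothetical sketch: cast the integer shira_indices to float32 before they
# go into the safetensors state_dict on Windows, and cast them back to int
# after loading (regardless of platform, since a checkpoint saved on Windows
# may be loaded elsewhere).
def indices_for_save(indices: torch.Tensor) -> torch.Tensor:
    if platform.system() == "Windows":
        return indices.to(torch.float32)
    return indices

def indices_after_load(indices: torch.Tensor) -> torch.Tensor:
    if indices.is_floating_point():
        return indices.to(torch.int64)
    return indices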

@kkb-code
Contributor Author

I just pushed cc47b6a. It brings back all tests (all Python versions) and casts the shira_indices to float32 before saving only on Windows. Hence, macOS and Linux should be completely unaffected now. Thanks. Let us see if this works.

BTW, I am not sure what to do about the Hub errors; I thought those were not happening before.

@BenjaminBossan
Member

BenjaminBossan commented Jul 11, 2025

Thanks @kkb-code, let's go with this approach. Could you please ensure the 120-char line limit is taken care of?

> BTW, I am not sure what to do about the Hub errors; I thought those were not happening before.

Just ignore those; they are just rate limits.

@kkb-code
Contributor Author

kkb-code commented Jul 11, 2025

Yep, thanks @BenjaminBossan. Just pushed another commit, 2014c43. This should take care of the 120-char limit. I also fixed a couple of minor typos in the warning texts.

Member

@BenjaminBossan BenjaminBossan left a comment

Thanks for taking care of the Windows issue and for this contribution in general. The failing macOS test is just flakiness and can be ignored. The PR is ready to be merged.

@BenjaminBossan BenjaminBossan merged commit a4f9334 into huggingface:main Jul 14, 2025
32 of 40 checks passed
@kkb-code
Contributor Author

Thank you so much @BenjaminBossan for the guidance throughout this process. We are happy to see our SHiRA adapter merged into PEFT. Please let us know if we should close Issue #2585 or if you will close it at a later time.

Also, is there an ETA for the next PEFT release that will include SHiRA (e.g., v0.16.1)? That would allow users to install it directly with pip install ... and use SHiRA.

Thanks again for the help. It was great working with you on this.

@BenjaminBossan
Member

> Also, is there an ETA for the next PEFT release that will include SHiRA (e.g., v0.16.1)? That would allow users to install it directly with pip install ... and use SHiRA.

Even though we had a release just recently, we might have another one soon, but we have external dependencies for that, so I can't give a date. FYI, for new features such as SHiRA, we always do minor releases (i.e., it would be 0.17.0); patch releases are for hotfixes only. In the meantime, users can install directly from GitHub, e.g. pip install git+https://github.com/huggingface/peft (plus an optional hash to pin the commit).

> Thanks again for the help. It was great working with you on this.

Thanks, same from my side.

BenjaminBossan pushed a commit to BenjaminBossan/peft that referenced this pull request Jul 28, 2025