
Conversation

@BenjaminBossan
Member

Add support for torchao.

The current status is:

  • only LoRA explicitly supported
  • only linear layers supported
  • int8_weight_only works fully (see the sketch after this list)
  • int8_dynamic_activation_int8_weight only works partly (as dequantize is not supported, merging and DoRA won't work)
  • int4_weight_only not supported, as some ops needed for the forward call are missing
  • nf4 not supported on the transformers side
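
A minimal usage sketch for the int8_weight_only case (the checkpoint name and LoRA settings below are just placeholders, not part of this PR):

import torch
from transformers import AutoModelForCausalLM, TorchAoConfig
from peft import LoraConfig, get_peft_model

# quantize the base model with torchao int8_weight_only via transformers
quant_config = TorchAoConfig("int8_weight_only")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # placeholder checkpoint
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# attach a LoRA adapter; only linear layers are supported
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()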

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan
Member Author

With huggingface/transformers#33361 being merged (which marks torchao as trainable), once the next transformers version is released (>4.44.2), the GPU tests on this PR should pass (I tested locally). This PR should not be merged before that.


@SunMarc SunMarc left a comment


Thanks for making torchao compatible @BenjaminBossan! LGTM! Just a few nits.

cc @msaroufim

Comment on lines 117 to 119
# TODO
rep = super().__repr__()
return rep.replace("lora.Linear", f"lora.{self.__class__.__name__}")

TODO left

raise ValueError(f"{type(self).__name__} only supports int8 weights for now.")

def merge(self, safe_merge: bool = False, adapter_names: Optional[list[str]] = None) -> None:
    from torchao import quantize_

quantize_ is only available from torchao 0.4.0. Maybe we should modify is_torchao_available a bit to take that into account?
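
A version-aware check could look roughly like this (just a sketch, not the actual implementation in this PR):

import importlib.metadata
import importlib.util

from packaging import version

def is_torchao_available():
    if importlib.util.find_spec("torchao") is None:
        return False
    # quantize_ was only added in torchao 0.4.0
    torchao_version = version.parse(importlib.metadata.version("torchao"))
    return torchao_version >= version.parse("0.4.0")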

BenjaminBossan and others added 2 commits September 13, 2024 18:16
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



@BenjaminBossan BenjaminBossan merged commit 9918977 into huggingface:main Oct 8, 2024
16 checks passed
@BenjaminBossan BenjaminBossan deleted the feat-support-torchao branch October 8, 2024 16:10
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Oct 22, 2024
Supports torch AO quantization. Currently supported:

- int8_weight_only
- int8_dynamic_activation_int8_weight

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Guy-Bilitski pushed a commit to Guy-Bilitski/peft that referenced this pull request May 13, 2025
Supports torch AO quantization. Currently supported:

- int8_weight_only
- int8_dynamic_activation_int8_weight

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
@hieubnt235

Hi @BenjaminBossan, is there any plan to support NF4Tensor? I would like to train QLoRA with PyTorch models (custom ones, HF, diffusers, ...) and I think it's a good feature to have.
Currently I just wrote a custom LoRA layer (simply changing the merge and unmerge methods from the current TorchaoLoraLinear, and working around get_apply_tensor_subclass for custom models not from Hugging Face) like you mentioned. Is that the correct way?
But after all, I think a simple change in the official dispatch_torchao and TorchaoLoraLinear would still be better.

@BenjaminBossan
Member Author

is there any plan to support NF4Tensor

AFAICT, torchao NF4 is not supported in transformers (which may change in the future). Therefore, I don't have plans to support it in PEFT. However, if you already have a working implementation, feel free to create a (draft) PR with your layers and I can take a look.

@hieubnt235

hieubnt235 commented Aug 2, 2025

is there any plan to support NF4Tensor

AFAICT, torchao NF4 is not supported in transformers (which may change in the future). Therefore, I don't have plans to support it in PEFT. However, if you already have a working implementation, feel free to create a (draft) PR with your layers and I can take a look.

I think transformers may already support it by passing an AOBaseConfig instead of a raw string (link), but diffusers doesn't, because its custom config is too strict.

Back to PEFT, the main problem that restricts flexibility is that register_peft_method is not 100% self-contained. But I think it's enough to have a custom layer. Also, the implementation of dynamic dispatch in LoraConfig._custom_modules is currently very restrictive. Why don't you just allow a custom "dispatch function" instead of a mapping?

I implemented a custom layer for torchao NF4Tensor like this (not tested much, please tell me if I'm wrong):

import warnings
from dataclasses import dataclass
from typing import Optional, cast

import torch
from torch import nn
from torchao.core.config import AOBaseConfig
from torchao.dtypes import NF4Tensor, to_nf4

from peft.import_utils import is_torchao_available
from peft.tuners.lora.layer import Linear
from peft.tuners.tuners_utils import BaseTunerLayer, check_adapters_to_merge


@dataclass
class NF4Config(AOBaseConfig):
    # same custom config as in the transformers snippet further down in this thread
    block_size: int = 64
    scaler_block_size: int = 256


class TorchAOLoraNF4Linear(Linear):
    def __init__(self, target: nn.Module, adapter_name: str, nf4_config: NF4Config, **kwargs):
        super().__init__(target, adapter_name, **kwargs)
        self.config = nf4_config

    def _get_base_layer_and_weight_with_checking(self) -> tuple[nn.Linear, NF4Tensor]:
        base_layer = self.get_base_layer()
        assert isinstance(base_layer, nn.Linear)
        nf4_weight = base_layer.weight
        assert isinstance(nf4_weight, NF4Tensor)
        return base_layer, nf4_weight

    def _accumulate_adapter_weights(
        self,
        base_weight: torch.Tensor,
        adapter_names: list[str],
        *,
        merge: bool = True,
        safe_merge: bool = False,
    ) -> torch.Tensor:
        for active_adapter in adapter_names:
            if merge:
                base_weight += self.get_delta_weight(active_adapter)
                if safe_merge and not torch.isfinite(base_weight).all():
                    raise ValueError(
                        f"NaNs detected in the merged weights. The adapter {active_adapter} seems to be broken"
                    )
            else:
                # unmerge
                if active_adapter not in self.lora_A.keys():
                    continue
                base_weight -= self.get_delta_weight(active_adapter)
        return base_weight

    def _make_nf4_weight_param(self, weight_tensor: torch.Tensor) -> nn.Parameter:
        nf4_tensor = to_nf4(weight_tensor, self.config.block_size, self.config.scaler_block_size)
        return nn.Parameter(nf4_tensor)

    def merge(self, safe_merge: bool = False, adapter_names: Optional[list[str]] = None) -> None:
        if not (adapter_names := check_adapters_to_merge(self, adapter_names)):
            return

        # I actually wonder why `TorchaoLoraLinear` does this inside a loop; why not update everything
        # and re-quantize once for all adapters?
        base_layer, nf4_weight = self._get_base_layer_and_weight_with_checking()
        weight = self._accumulate_adapter_weights(
            nf4_weight.get_original_weight(), adapter_names, merge=True, safe_merge=safe_merge
        )
        base_layer.weight = self._make_nf4_weight_param(weight)

        self.merged_adapters.extend(adapter_names)

    def unmerge(self) -> None:
        if not self.merged:
            warnings.warn("Already unmerged. Nothing to do.")
            return

        base_layer, nf4_weight = self._get_base_layer_and_weight_with_checking()
        weight = self._accumulate_adapter_weights(
            nf4_weight.get_original_weight(), self.merged_adapters, merge=False
        )
        base_layer.weight = self._make_nf4_weight_param(weight)

        self.merged_adapters.clear()

def dispatch_torchao_linear(
    target: nn.Module | BaseTunerLayer,
    adapter_name: str,
    aobase_config: AOBaseConfig | None = None,
    **kwargs,
):
    if isinstance(target, BaseTunerLayer):
        target = target.get_base_layer()
    assert isinstance(target, nn.Module)

    # torchao only supports linear layers for now, AFAIK.
    # If quantized weights ever support conv modules, let users define their own dispatcher.
    if not isinstance(target, nn.Linear):
        return None

    if not is_torchao_available():
        return None

    if isinstance(target.weight, NF4Tensor):
        if not isinstance(aobase_config, NF4Config):
            raise ValueError("The weight is quantized as NF4Tensor, so an NF4Config is required.")
        nf4config = cast(NF4Config, aobase_config)
        return TorchAOLoraNF4Linear(target, adapter_name, nf4config, **kwargs)

    from torchao.dtypes import AffineQuantizedTensor
    from torchao.quantization import LinearActivationQuantizedTensor

    if isinstance(target.weight, (AffineQuantizedTensor, LinearActivationQuantizedTensor)):
        # TorchAOLoraAQLinear is my other custom layer for affine-quantized weights (not shown here)
        return TorchAOLoraAQLinear(target, adapter_name, aobase_config=aobase_config, **kwargs)

    return None

As you can see, I want to do something like checking the module before adding the layer, e.g. if isinstance(target.weight, NF4Tensor), and if not, return None. It's useful for the case where I want to quantize my model with different methods for different modules, and only some of them need an adapter.

Of course I can customize my Model and Config, but I feel the register process is too complex and error prone.
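
For comparison, the officially documented route (as far as I understand the PEFT docs on dynamic dispatch) is to register a mapping from module class to LoRA layer class on the config; the custom layer name below is hypothetical:

import torch.nn as nn
from peft import LoraConfig
from peft.tuners.lora.layer import Linear as LoraLinear

class MyQuantizedLoraLinear(LoraLinear):
    # hypothetical custom LoRA layer; merge/unmerge would be overridden here
    ...

config = LoraConfig(target_modules=["q_proj", "v_proj"])
# keyed on the module class only; there is no hook to inspect the weight type (e.g. NF4Tensor)
config._register_custom_module({nn.Linear: MyQuantizedLoraLinear})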

@BenjaminBossan
Member Author

I think transformers may already support it by passing an AOBaseConfig

I looked this up but didn't see any config for NF4. Could you please paste a snippet that illustrates how to load a transformers model with torchao NF4?

The main problem that restricts flexibility is that register_peft_method is not 100% self-contained. But I think it's enough to have a custom layer.

I'm not quite sure what you mean by "not 100% self-contained", but yes, we would need to go with a custom layer.

Why don't you just allow a custom "dispatch function" instead of a mapping?

We can think about that. When I worked on this, I wanted to keep it simple, as I was not sure if anyone would use it at all.

I implemented a custom layer for torchao NF4Tensor like this (not tested much, please tell me if I'm wrong):

Once we have a small example, we can do some testing; it's tough to say in the abstract. From there, we can proceed with a draft PR with your implementation and I can guide you through the missing steps.

the register process is too complex and error prone.

LMK what exactly you're missing there.

@hieubnt235

hieubnt235 commented Aug 5, 2025

I looked this up but didn't see any config for NF4. Could you please paste a snippet that illustrates how to load a transformers model with torchao NF4?

Here is the transformers code for NF4 torchao:

import types
from dataclasses import dataclass

import torch
from torch import nn
from torchao.core.config import AOBaseConfig
from torchao.dtypes import to_nf4
from torchao.quantization import register_quantize_module_handler
from torchao.utils import get_model_size_in_bytes
from transformers import Qwen2ForCausalLM, TorchAoConfig


@dataclass
class NF4Config(AOBaseConfig):
    block_size: int = 64
    scaler_block_size: int = 256

def linear_module_repr(module: nn.Linear):
    return f"in_features={module.weight.shape[1]}, out_features={module.weight.shape[0]}, weight={module.weight}, dtype={module.weight.dtype}"

@register_quantize_module_handler(NF4Config)
def _nf4_weight_only_transform(
    module: torch.nn.Module,
    config: NF4Config,
) -> torch.nn.Module:
    new_weight = to_nf4(module.weight, config.block_size, config.scaler_block_size)
    module.weight = nn.Parameter(new_weight, requires_grad=False) # Freeze
    module.extra_repr = types.MethodType(
        linear_module_repr,
        module
    )
    return module

config = TorchAoConfig(NF4Config())

model = Qwen2ForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
)

quantized_model = Qwen2ForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    quantization_config=config,
)
print(get_model_size_in_bytes(model)) # 2520669824
print(get_model_size_in_bytes(quantized_model)) # 1273966688

I'm not quite sure what you mean by "not 100% self-contained", but yes, we would need to go with a custom layer.

No, those are your words, I just copied them :v

The main problem with torchao is that it doesn't use "module swapping" but quantizes the weight directly instead, and also allows any custom quantized tensor. So I think it would be good to not just check the module, like the current custom map does, but to have a custom dispatch function that checks the weight. Or just have some way to integrate that.

LMK what exactly you're missing there.

Yeah, I used to, but after reviewing all the source code carefully, I think I have a way to use it confidently.

Anyway, my code now just works (QLoRA with diffusers, PEFT and torchao), but it's messy with a lot of hacks. I have no idea how stable it is, so I think it needs more testing. But I think official support is still better. And I'm so sorry I cannot make the PR now because I am quite busy right now. Thank you for the support. Sorry again for my bad English :v

@BenjaminBossan
Member Author

Here is the transformers code for NF4 torchao:

I see, thanks. It's not really straightforward to use; I hope this will be simplified in the future.

I implemented a custom layer for torchao NF4Tensor like this (not tested much, please tell me if I'm wrong):

At first glance, this doesn't look bad; I think we could work based on this implementation.

No, those are your words, I just copied them :v

Haha, okay, but my quote is from a different context, namely adding completely new PEFT methods. Here, the job is much easier, just adding a new layer for an existing PEFT method.

but to have a custom dispatch function that checks the weight

Yes, for that we need to make changes directly in PEFT; the dynamic dispatch can't handle that.

And I'm so sorry I cannot make the PR now because I am quite busy right now.

No worries, but if you have some time in the future, I'd be happy to see a PR. Don't worry about making it perfect on the first try; we can iterate on it.
