FEAT: Support torchao #2062
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
With huggingface/transformers#33361 being merged (which marks torchao as trainable), once the next transformers version is released (> 4.44.2), the GPU tests on this PR should pass (I tested locally). This PR should not be merged before that.
SunMarc left a comment
Thanks for making torchao compatible @BenjaminBossan ! LGTM ! Just a few nits.
cc @msaroufim
src/peft/tuners/lora/torchao.py (outdated)

# TODO
rep = super().__repr__()
return rep.replace("lora.Linear", f"lora.{self.__class__.__name__}")
TODO left
raise ValueError(f"{type(self).__name__} only supports int8 weights for now.")

def merge(self, safe_merge: bool = False, adapter_names: Optional[list[str]] = None) -> None:
    from torchao import quantize_
quantize_ is only available from torchao 0.4.0. Maybe we should modify is_torchao_available a bit to take that into account?
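For illustration, a version-aware check could look roughly like the sketch below (this is an assumption about the eventual fix, not necessarily the code that was merged):

from __future__ import annotations

import importlib.metadata
import importlib.util

from packaging import version


def is_torchao_available() -> bool:
    # quantize_ only exists from torchao 0.4.0 onward, so gate on that version.
    if importlib.util.find_spec("torchao") is None:
        return False
    torchao_version = version.parse(importlib.metadata.version("torchao"))
    return torchao_version >= version.parse("0.4.0")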
- min torchao version
- remove TODO
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Supports torch AO quantization. Currently supported:
- int8_weight_only
- int8_dynamic_activation_int8_weight

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
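For context, a rough usage sketch of what this enables (the model name and LoRA hyperparameters are illustrative, not taken from the PR): quantize a transformers model with torchao, then attach a LoRA adapter as usual.

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TorchAoConfig

# Quantize the base weights with torchao's int8_weight_only scheme via transformers.
quant_config = TorchAoConfig("int8_weight_only")
base_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # illustrative model
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# Attach a LoRA adapter; training then proceeds as with any other PEFT model.
lora_config = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()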
Hi @BenjaminBossan, is there any plan to support NF4?
AFAICT, torchao NF4 is not supported in transformers (which may change in the future). Therefore, I don't have plans to support it in PEFT. However, if you already have a working implementation, feel free to create a (draft) PR with your layers and I can take a look.
I think maybe it is "not 100% self-contained". Back to the topic: I implemented a custom layer for NF4:

from __future__ import annotations

import warnings
from typing import Optional, cast

import torch
from torch import nn
from torchao.core.config import AOBaseConfig
from torchao.dtypes import NF4Tensor, to_nf4

# PEFT-internal helpers (import paths assumed)
from peft.import_utils import is_torchao_available
from peft.tuners.lora.layer import Linear
from peft.tuners.tuners_utils import BaseTunerLayer, check_adapters_to_merge

# NF4Config is the custom AOBaseConfig subclass defined in my registration snippet further down the thread.


class TorchAOLoraNF4Linear(Linear):
    def __init__(self, target: nn.Module, adapter_name: str, nf4_config: NF4Config, **kwargs):
        super().__init__(target, adapter_name, **kwargs)
        self.config = nf4_config

    def _get_base_layer_and_weight_with_checking(self) -> tuple[nn.Linear, NF4Tensor]:
        base_layer = self.get_base_layer()
        assert isinstance(base_layer, nn.Linear)
        nf4_weight = base_layer.weight
        assert isinstance(nf4_weight, NF4Tensor)
        return base_layer, nf4_weight

    def _accumulate_adapter_weights(
        self,
        base_weight: torch.Tensor,
        adapter_names: list[str],
        *,
        merge: bool = True,
        safe_merge: bool = False,
    ) -> torch.Tensor:
        for active_adapter in adapter_names:
            if merge:
                base_weight += self.get_delta_weight(active_adapter)
                if safe_merge and not torch.isfinite(base_weight).all():
                    raise ValueError(
                        f"NaNs detected in the merged weights. The adapter {active_adapter} seems to be broken"
                    )
            else:
                # unmerge
                if active_adapter not in self.lora_A.keys():
                    continue
                base_weight -= self.get_delta_weight(active_adapter)
        return base_weight

    def _make_nf4_weight_param(self, weight_tensor: torch.Tensor) -> nn.Parameter:
        nf4_tensor = to_nf4(weight_tensor, self.config.block_size, self.config.scaler_block_size)
        return nn.Parameter(nf4_tensor)

    def merge(self, safe_merge: bool = False, adapter_names: Optional[list[str]] = None) -> None:
        if not (adapter_names := check_adapters_to_merge(self, adapter_names)):
            return
        # I actually wonder why TorchaoLoraLinear does this inside a loop -- why not update everything
        # and merge once for all adapters?
        base_layer, nf4_weight = self._get_base_layer_and_weight_with_checking()
        weight = self._accumulate_adapter_weights(
            nf4_weight.get_original_weight(), adapter_names, merge=True, safe_merge=safe_merge
        )
        base_layer.weight = self._make_nf4_weight_param(weight)
        self.merged_adapters.extend(adapter_names)

    def unmerge(self) -> None:
        if not self.merged:
            warnings.warn("Already unmerged. Nothing to do.")
            return
        base_layer, nf4_weight = self._get_base_layer_and_weight_with_checking()
        weight = self._accumulate_adapter_weights(nf4_weight.get_original_weight(), self.merged_adapters, merge=False)
        base_layer.weight = self._make_nf4_weight_param(weight)
        self.merged_adapters.clear()


def dispatch_torchao_linear(
    target: nn.Module | BaseTunerLayer, adapter_name: str, aobase_config: AOBaseConfig | None = None, **kwargs
):
    if isinstance(target, BaseTunerLayer):
        target = target.get_base_layer()
    assert isinstance(target, nn.Module)
    # torchao only supports Linear for now, AFAIK.
    # If quantized weights ever support conv modules, let users define a dispatcher.
    if not isinstance(target, nn.Linear):
        return None
    if not is_torchao_available():
        return None
    if isinstance(target.weight, NF4Tensor):
        if not isinstance(aobase_config, NF4Config):
            raise ValueError("The weight is an NF4Tensor, so an NF4Config is required.")
        nf4config = cast(NF4Config, aobase_config)
        return TorchAOLoraNF4Linear(target, adapter_name, nf4config, **kwargs)
    from torchao.dtypes import AffineQuantizedTensor
    from torchao.quantization import LinearActivationQuantizedTensor

    # TorchAOLoraAQLinear is my layer for affine-quantized weights (definition not shown here).
    if isinstance(target.weight, (AffineQuantizedTensor, LinearActivationQuantizedTensor)):
        return TorchAOLoraAQLinear(target, adapter_name, aobase_config=aobase_config, **kwargs)
    return None

As you can see, I want to do something like checking the module (and its weight type) before adding the layer. Of course I could write a custom Model and Config, but I feel the register process is too complex and error-prone.
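For context, the module-based mapping discussed in this thread (PEFT's experimental custom-module dispatch for LoRA) looks roughly like the sketch below, as far as I understand it. The exact API (`_register_custom_module`) and the class names here are assumptions on my part; the relevant limitation is that dispatch is keyed on the module class, not on the weight type.

from torch import nn

from peft import LoraConfig
from peft.tuners import lora


# Hypothetical pairing: a custom Linear subclass mapped to a custom LoRA layer class.
class MyQuantLinear(nn.Linear):
    pass


class MyLoraQuantLinear(lora.Linear):
    pass


config = LoraConfig(target_modules=["linear"])
# Experimental: dispatch is keyed on the module type, so a weight-type check
# (e.g. isinstance(module.weight, NF4Tensor)) cannot be expressed this way.
config._register_custom_module({MyQuantLinear: MyLoraQuantLinear})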
I looked this up but didn't see any config for NF4. Could you please paste a snippet that illustrates how to load a transformers model with torchao NF4?
I'm not quite sure what you mean by "not 100% self-contained", but yes, we would need to go with a custom layer.
We can think about that. When I worked on this, I wanted to keep it simple, as I was not sure if anyone would use it at all.
Once we have a small example, we can do some testing; it's tough to say in the abstract. From there, we can proceed with a draft PR with your implementation and I can guide you through the missing steps.
LMK what exactly you're missing there.
Here is the snippet:

import types
from dataclasses import dataclass

import torch
from torch import nn
from torchao.core.config import AOBaseConfig
from torchao.dtypes import NF4Tensor, to_nf4
from torchao.quantization import register_quantize_module_handler, Float8WeightOnlyConfig, ModuleFqnToConfig
from torchao.utils import get_model_size_in_bytes
from transformers import Qwen2ForCausalLM, TorchAoConfig


@dataclass
class NF4Config(AOBaseConfig):
    block_size: int = 64
    scaler_block_size: int = 256


def linear_module_repr(module: nn.Linear):
    return f"in_features={module.weight.shape[1]}, out_features={module.weight.shape[0]}, weight={module.weight}, dtype={module.weight.dtype}"


@register_quantize_module_handler(NF4Config)
def _nf4_weight_only_transform(
    module: torch.nn.Module,
    config: NF4Config,
) -> torch.nn.Module:
    new_weight = to_nf4(module.weight, config.block_size, config.scaler_block_size)
    module.weight = nn.Parameter(new_weight, requires_grad=False)  # Freeze
    module.extra_repr = types.MethodType(linear_module_repr, module)
    return module


config = TorchAoConfig(NF4Config())
model = Qwen2ForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
)
quantized_model = Qwen2ForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    quantization_config=config,
)
print(get_model_size_in_bytes(model))  # 2520669824
print(get_model_size_in_bytes(quantized_model))  # 1273966688
No, those are your words, I just copied them :v The main problem with torchao is that it doesn't use the "swap module" approach, but quantizes the weights directly instead, and it also allows arbitrary custom quantized tensors. So I think it would be good to not just check the module, like the current custom map does, but to have a custom dispatch function that checks the weight. Or just have some way to integrate that.
Yeah, I used to, but after reviewing all the source code carefully, I think I have a way to use it confidently. Anyway, my code now just works (QLoRA with NF4).
I see, thanks. It's not really straightforward to use; I hope this will be simplified in the future.
At a first glance, this doesn't look bad, I think we could work based on this implementation.
Haha, okay, but my quote is from a different context, namely adding completely new PEFT methods. Here, the job is much easier, just adding a new layer for an existing PEFT method.
Yes, for that we need to make changes directly in PEFT, the dynamic dispatch can't handle that.
No worries, but if you have some time in the future, I'd be happy to see a PR. Don't worry about making it perfect on the first try, we can iterate on it.
Add support for torchao.
The current status is:
- int8_weight_only works fully
- int8_dynamic_activation_int8_weight only works partly (as dequantize is not supported, merging and DoRA won't work; see the merge sketch below)
- int4_weight_only not supported, as some ops needed for the forward call are missing
- nf4 not supported on the transformers side
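To illustrate the first two bullets, a minimal continuation of a typical PEFT setup (peft_model is assumed to wrap a torchao-quantized base model, as in the usage sketch earlier on this page):

# Merging dequantizes the base weight, adds the LoRA delta, and re-quantizes it.
# This is expected to work for int8_weight_only.
merged_model = peft_model.merge_and_unload()

# For int8_dynamic_activation_int8_weight, the same call cannot work, since
# dequantize is not supported for that scheme (see the status list above).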