Fixed 4bit compare UT on XPU #2843
Conversation
Hi @BenjaminBossan, please help review. Thanks!
Hi, thanks for the PR. I assume you are from the same team as @yao-matrix?
Do you mean the latest release from bitsandbytes makes the tests pass? Why is that version not used for CI then? Or do you mean the tests require bitsandbytes installed from source? In that case, I would change the condition such that:
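(The concrete condition suggested here was elided in this capture. A minimal sketch of what such a version-gated skip could look like, with a placeholder version threshold and hypothetical marker name:)

```python
# Hedged sketch only: the marker name and version threshold are placeholders,
# not the values from the elided suggestion.
import bitsandbytes as bnb
import pytest
from packaging import version

# Gate on the installed bitsandbytes version instead of skipping on XPU wholesale.
bnb_supports_xpu = version.parse(bnb.__version__) >= version.parse("0.45.0")  # placeholder

requires_bnb_with_xpu_support = pytest.mark.skipif(
    not bnb_supports_xpu,
    reason="installed bitsandbytes is too old for XPU support",
)
```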
Echoing Benjamin: our target is to cover all cases on XPU, so as a general principle, let's be very cautious about platform-specific skips. Please follow Benjamin's suggestion and figure out which bnb version is needed to pass. @YangKai0616
Sorry for my unclear expression. What I meant is that the current XPU can pass all of bitsandbytes' own tests. I also noticed the point you raised about the failing test case. Additionally, I tested the FP32/4bit matmul, Linear outputs, and quantization precision on both CUDA (A100) and XPU, and performed a cross-platform comparison. The results show that the 4-bit quantization precision on XPU and CUDA (A100) aligns perfectly, and the differences in the Linear, MLP, and matmul computations are reasonable; the increased differences in large-size matmuls demonstrate the effect of accumulated error. Considering that the ground truth for these tests was generated on CUDA, some deviation on XPU is expected. Thanks!
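(A rough sketch of the kind of quantize/dequantize round-trip comparison described above; the shapes, quant type, and reference artifact name are illustrative assumptions, not the actual test setup:)

```python
import torch
import bitsandbytes.functional as bnb_F

# Pick the local accelerator (XPU if present, otherwise CUDA).
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cuda"

# Quantize a weight matrix to NF4, then dequantize it back to float.
w = torch.randn(1024, 1024, dtype=torch.float32, device=device)
packed, quant_state = bnb_F.quantize_4bit(w, quant_type="nf4")
w_roundtrip = bnb_F.dequantize_4bit(packed, quant_state)

# Cross-platform check: compare against the same round-trip saved on CUDA (A100).
# w_ref = torch.load("roundtrip_a100.pt")  # hypothetical reference artifact
# torch.testing.assert_close(w_roundtrip.cpu(), w_ref, rtol=0.0, atol=0.0)
```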
I see, thanks for explaining. Honestly, I'm considering removing these tests completely, as they are actually tests for bitsandbytes and not PEFT. We added them to PEFT because bnb didn't have its own CI for this at the time, but the picture has since changed. Therefore, I would like to avoid putting too much extra work into this, like providing separate artifacts per architecture. I started a discussion with the bnb maintainers about removing the bnb tests in PEFT and will reply here once we have reached a conclusion.
@YangKai0616 After discussing with bnb colleagues, we decided to remove `test_bnb_regression.py`.
Thank you for your prompt response! Yes, you can see that the 4-bit BNB model is also used in the test case `test_regression.py::TestOpt4bitBnb::test_lora_4bit`. The ground truth based on CUDA GPU is located at link. Should we maintain a similar ground truth for XPU?
@YangKai0616 Okay, got it. So here is what could be done: let's add a comment to let users know that if they run with XPU, the regression artifacts are loaded from a repo outside of Hugging Face's control, so they run this at their own risk.
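(A hedged sketch of how loading the XPU-specific artifacts might look; the repo id and local dir below are placeholders, not the actual Intel repo used in this PR:)

```python
# Sketch: fetch XPU-specific regression artifacts over the default ones.
from huggingface_hub import snapshot_download

# NOTE: when running with XPU, these artifacts come from a repo outside of
# Hugging Face's control; you run this at your own risk.
snapshot_download(
    repo_id="intel/peft-xpu-regression-artifacts",  # placeholder
    local_dir="regression_testing_artifacts",       # placeholder
)
```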
@BenjaminBossan Done. Please help review!
BenjaminBossan
left a comment
Thanks for adding the Intel repo with the necessary regression artifact. In principle, this looks good. I would like to see a small change, namely that users have to explicitly opt in to using XPU via an environment variable, e.g. PEFT_USE_XPU. This way, we can ensure that users don't use it by accident.
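(A minimal sketch of the requested opt-in gate, assuming the environment variable is named `PEFT_USE_XPU` as suggested; the exact mechanics in the merged PR may differ:)

```python
import os

import pytest
import torch

def xpu_available() -> bool:
    # Simplified stand-in for the infer_device() helper used in the test file.
    return hasattr(torch, "xpu") and torch.xpu.is_available()

# Require an explicit opt-in so the external XPU artifacts are never used by accident.
if xpu_available() and os.environ.get("PEFT_USE_XPU", "0") != "1":
    pytest.skip("Set PEFT_USE_XPU=1 to run the XPU regression tests", allow_module_level=True)
```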
tests/regression/test_regression.py (outdated diff)
```python
is_xpu = infer_device() == "xpu"
if is_xpu:
    lora_4bit_folder_path = os.path.join(REGRESSION_DIR, LORA_4BIT_FOLDER)
    if os.path.isdir(lora_4bit_folder_path):
```
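(The diff is truncated here. Judging from the author's explanation below, the branch presumably continues by deleting the stale folder before re-downloading the XPU files; a sketch, not the exact merged code:)

```python
import shutil

# The complete (CUDA) artifacts were downloaded first, so the stale 4-bit folder
# is deleted here to let the XPU-specific files replace it; force_download=True
# reportedly had no effect.
shutil.rmtree(lora_4bit_folder_path)
```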
Since the REGRESSION_DIR is a temp folder created on script start, this shouldn't be necessary, right?
The setup_tearndown process first downloads the complete artifact files, and then downloads the XPU files to overwrite them. Therefore, the `LORA_4BIT_FOLDER` already exists at this point. The reason for detecting, deleting, and then overwriting like this is that I tried `force_download=True`, but it seemed to have no effect.
Done!
BenjaminBossan
left a comment
Thanks for making the regression tests pass on XPU.
Similar to #2473. Our internal CI testing for XPU found that the current results for test cases like `test_bnb_regression.py::test_opt_350m_4bit_quant_storage` and `test_regression.py::TestOpt4bitBnb::test_lora_4bit`, similar to `test_bnb_regression.py::test_opt_350m_4bit`, vary due to the `bitsandbytes` version and hardware platform. XPU can currently pass all UT files with the latest `bitsandbytes` version. Therefore, can we temporarily disable these specific tests on XPU to prevent CI errors?
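(For illustration, what the originally proposed platform-specific skip could have looked like; a sketch assuming `torch.xpu` is available for device detection, not the change that was ultimately merged:)

```python
import pytest
import torch

on_xpu = hasattr(torch, "xpu") and torch.xpu.is_available()

# Temporarily disable the 4-bit comparison test on XPU, since the ground truth
# was generated on CUDA and results vary by bnb version and platform.
@pytest.mark.skipif(on_xpu, reason="4-bit ground truth generated on CUDA; differs on XPU")
def test_opt_350m_4bit():
    ...
```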