Method comparison: Add MiSS result #2740
Conversation
Adds MiSS results for the following modes:
- default
- mini
- bat

Results are pretty close to the corresponding experiments with Bone, which is what we expected.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
FYI @JL-er
Thanks, I noticed that the Bat mode doesn't seem very stable. The test results this time were even worse than with the default mode.
Hmm, right, the test accuracy decreased from 51.7% to 50.5%, even though the train loss is pretty much identical (0.5763 vs 0.5761). I'd say that using the default is the more attractive setting anyway, as it's much more memory efficient, but it could still be worth investigating why the results changed so much.
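For a sense of scale (a back-of-the-envelope sketch, not a claim from the runs themselves): even if the model were unchanged, pure sampling noise on a 1319-example test set (the size noted later in the thread) is non-trivial at ~50% accuracy.

```python
# Standard error of a measured accuracy under a binomial model:
# se = sqrt(p * (1 - p) / n). With n = 1319 and p ~ 0.5 this is
# roughly 1.4 percentage points per run, before any training variance.
import math

n = 1319  # GSM8K test set size
p = 0.50  # accuracy in the region observed here
se = math.sqrt(p * (1 - p) / n)
print(f"std. error of accuracy: {se:.4f}")  # ~0.0138
```

So a 51.7% vs 50.5% gap is within one standard error of measurement noise alone; the later multi-run comparison is the more meaningful signal.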
Could you run the Bat experiment once?
I re-ran the Bone example on the same machine, now getting an even lower accuracy (49.5%). I'll probably re-run it a few more times to see how high the variance is in general.

Result for Bone-Bat:

{
"run_info": {
"created_at": "2025-08-15T13:38:21+00:00",
"total_time": 2758.8984308090003,
"experiment_name": "bone/llama-3.2-3B-bat",
"peft_branch": "main",
"train_config": {
"model_id": "meta-llama/Llama-3.2-3B",
"dtype": "bfloat16",
"max_seq_length": 768,
"batch_size": 4,
"batch_size_eval": 50,
"max_steps": 5000,
"eval_steps": 250,
"compile": false,
"query_template": "Question: {query} Think step by step.\nAnswer:",
"seed": 0,
"grad_norm_clip": 1.0,
"optimizer_type": "AdamW",
"optimizer_kwargs": {
"lr": 0.0001,
"weight_decay": 0.1
},
"lr_scheduler": "cosine",
"use_amp": false,
"autocast_adapter_dtype": true,
"generation_kwargs": {
"max_length": 800,
"max_new_tokens": 300
},
"attn_implementation": null
},
"peft_config": {
"task_type": null,
"peft_type": "BONE",
"auto_mapping": null,
"base_model_name_or_path": "meta-llama/Llama-3.2-3B",
"revision": null,
"inference_mode": false,
"r": 64,
"target_modules": [
"q_proj",
"v_proj"
],
"exclude_modules": null,
"init_weights": "bat",
"layers_to_transform": null,
"layers_pattern": null,
"bias": "none",
"modules_to_save": null
},
"error_msg": ""
},
"train_info": {
"accelerator_memory_reserved_avg": 14713894417,
"accelerator_memory_max": 25251807232,
"accelerator_memory_reserved_99th": 20472733368,
"train_time": 2467.8785469740014,
"file_size": 29367552,
"num_trainable_params": 7340032,
"num_total_params": 3220089856,
"status": "success",
"metrics": [
{
"step": 250,
"valid accuracy": 0.32,
"train loss": 0.8741402707099915,
"train samples": 1000,
"train time": 44.84663501100022,
"eval time": 16.530845782000142,
"tokens / sec": 4720.956208822991,
"mem allocated avg": 6898546569.216,
"mem reserved avg": 14772112195.584,
"elapsed time": 125.1565625950002
},
{
"step": 500,
"valid accuracy": 0.42,
"train loss": 0.6949697629213333,
"train samples": 2000,
"train time": 44.66738984100175,
"eval time": 12.175719579000088,
"tokens / sec": 4656.529086216588,
"mem allocated avg": 6890138988.544,
"mem reserved avg": 14663949484.032,
"elapsed time": 240.0960283049999
},
{
"step": 750,
"valid accuracy": 0.38,
"train loss": 0.667268633723259,
"train samples": 3000,
"train time": 45.62526284499927,
"eval time": 8.235976585000117,
"tokens / sec": 4699.172928129208,
"mem allocated avg": 6901011800.064,
"mem reserved avg": 14819080011.776,
"elapsed time": 352.3821910219999
},
{
"step": 1000,
"valid accuracy": 0.44,
"train loss": 0.6479405733346939,
"train samples": 4000,
"train time": 44.807461878997856,
"eval time": 9.97781685100017,
"tokens / sec": 4649.582709295373,
"mem allocated avg": 6892128219.136,
"mem reserved avg": 14679913005.056,
"elapsed time": 465.37876664800024
},
{
"step": 1250,
"valid accuracy": 0.34,
"train loss": 0.643578136086464,
"train samples": 5000,
"train time": 45.07155244600017,
"eval time": 8.857318488000146,
"tokens / sec": 4626.8208810834185,
"mem allocated avg": 6892222337.024,
"mem reserved avg": 14675131498.496,
"elapsed time": 577.4979845480002
},
{
"step": 1500,
"valid accuracy": 0.48,
"train loss": 0.6369394363164902,
"train samples": 6000,
"train time": 45.09846532499796,
"eval time": 16.40352508900014,
"tokens / sec": 4641.643534685168,
"mem allocated avg": 6893671811.072,
"mem reserved avg": 14706127405.056,
"elapsed time": 697.4003612199999
},
{
"step": 1750,
"valid accuracy": 0.46,
"train loss": 0.6277884117364884,
"train samples": 7000,
"train time": 45.44054208400212,
"eval time": 16.52979276899987,
"tokens / sec": 4607.2293682804875,
"mem allocated avg": 6895174580.224,
"mem reserved avg": 14716906766.336,
"elapsed time": 817.9448886139999
},
{
"step": 2000,
"valid accuracy": 0.38,
"train loss": 0.6284448710680008,
"train samples": 8000,
"train time": 44.66441460200076,
"eval time": 16.455440011000064,
"tokens / sec": 4650.144905082808,
"mem allocated avg": 6891904557.056,
"mem reserved avg": 14653740548.096,
"elapsed time": 937.2175861730002
},
{
"step": 2250,
"valid accuracy": 0.38,
"train loss": 0.6159043073654175,
"train samples": 9000,
"train time": 46.132000129995504,
"eval time": 10.414071448999948,
"tokens / sec": 4659.412108607851,
"mem allocated avg": 6903182145.536,
"mem reserved avg": 14849203503.104,
"elapsed time": 1052.6482256069999
},
{
"step": 2500,
"valid accuracy": 0.46,
"train loss": 0.6123742452859878,
"train samples": 10000,
"train time": 44.48052799800462,
"eval time": 9.077246988999832,
"tokens / sec": 4630.498091417431,
"mem allocated avg": 6888177211.392,
"mem reserved avg": 14597721423.872,
"elapsed time": 1164.3923993470003
},
{
"step": 2750,
"valid accuracy": 0.46,
"train loss": 0.6012357432842255,
"train samples": 11000,
"train time": 45.48063802700244,
"eval time": 16.48930690899988,
"tokens / sec": 4658.7077312812435,
"mem allocated avg": 6898979477.504,
"mem reserved avg": 14782279188.48,
"elapsed time": 1285.076055947
},
{
"step": 3000,
"valid accuracy": 0.46,
"train loss": 0.590102802991867,
"train samples": 12000,
"train time": 44.94007473199872,
"eval time": 16.46088983999971,
"tokens / sec": 4644.651822338361,
"mem allocated avg": 6894117320.704,
"mem reserved avg": 14690650423.296,
"elapsed time": 1404.875696617
},
{
"step": 3250,
"valid accuracy": 0.5,
"train loss": 0.5990242173671723,
"train samples": 13000,
"train time": 45.31071554800019,
"eval time": 16.52300792999995,
"tokens / sec": 4654.55019743797,
"mem allocated avg": 6895772430.336,
"mem reserved avg": 14729800056.832,
"elapsed time": 1525.008872587
},
{
"step": 3500,
"valid accuracy": 0.5,
"train loss": 0.5803046126365662,
"train samples": 14000,
"train time": 44.94620334099545,
"eval time": 14.057770697999786,
"tokens / sec": 4666.690051853321,
"mem allocated avg": 6893924593.664,
"mem reserved avg": 14704818782.208,
"elapsed time": 1642.582958221
},
{
"step": 3750,
"valid accuracy": 0.5,
"train loss": 0.5769718471765518,
"train samples": 15000,
"train time": 46.15345219799838,
"eval time": 16.54434743999991,
"tokens / sec": 4695.271744144811,
"mem allocated avg": 6905346478.08,
"mem reserved avg": 14888957116.416,
"elapsed time": 1764.270111406
},
{
"step": 4000,
"valid accuracy": 0.5,
"train loss": 0.5857474536895751,
"train samples": 16000,
"train time": 44.38817892599582,
"eval time": 16.396790597999825,
"tokens / sec": 4604.22132524814,
"mem allocated avg": 6886660577.28,
"mem reserved avg": 14582965862.4,
"elapsed time": 1883.048193569
},
{
"step": 4250,
"valid accuracy": 0.52,
"train loss": 0.5724298695325851,
"train samples": 17000,
"train time": 46.189890748002654,
"eval time": 16.550546742999813,
"tokens / sec": 4576.520891839106,
"mem allocated avg": 6897394636.8,
"mem reserved avg": 14742080978.944,
"elapsed time": 2004.408629256
},
{
"step": 4500,
"valid accuracy": 0.5,
"train loss": 0.5789464256763458,
"train samples": 18000,
"train time": 45.51842483500286,
"eval time": 16.440088976999505,
"tokens / sec": 4565.579779030307,
"mem allocated avg": 6892786214.912,
"mem reserved avg": 14656919830.528,
"elapsed time": 2124.6471290850004
},
{
"step": 4750,
"valid accuracy": 0.5,
"train loss": 0.567945005774498,
"train samples": 19000,
"train time": 44.95984372499879,
"eval time": 16.491939075999653,
"tokens / sec": 4669.47797425881,
"mem allocated avg": 6893964591.104,
"mem reserved avg": 14709189246.976,
"elapsed time": 2244.6083886899996
},
{
"step": 5000,
"valid accuracy": 0.48,
"train loss": 0.5767219476699829,
"train samples": 20000,
"train time": 45.47602556899801,
"eval time": 16.543005558000004,
"tokens / sec": 4579.995665715981,
"mem allocated avg": 6891249879.04,
"mem reserved avg": 14656341016.576,
"elapsed time": 2364.9427617600004
},
{
"step": 5000,
"test accuracy": 0.49507202426080366,
"train loss": 0.5767219476699829,
"train samples": 20000,
"train total tokens": 4198051
}
]
},
"meta_info": {
"model_info": {
"sha": "13afe5124825b4f3751f836b40dafda64c1ed062",
"created_at": "2024-09-18T15:23:48+00:00"
},
"dataset_info": {
"metamath": {
"sha": "aa4f34d3d2d3231299b5b03d9b3e5a20da45aa18",
"created_at": "2023-09-21T17:22:46+00:00"
},
"gsm8k": {
"sha": "e53f048856ff4f594e959d75785d2c2d37b678ee",
"created_at": "2022-04-12T10:22:10+00:00"
}
},
"package_info": {
"transformers-version": "4.52.4",
"transformers-commit-hash": null,
"peft-version": "0.17.1.dev0",
"peft-commit-hash": "04d41cbcd061bf1ab1185e111054bae012cb1894",
"datasets-version": "3.6.0",
"datasets-commit-hash": null,
"bitsandbytes-version": "0.46.0",
"bitsandbytes-commit-hash": null,
"torch-version": "2.7.1+cu126",
"torch-commit-hash": null
},
"system_info": {
"system": "Linux",
"release": "6.14.0-1010-aws",
"version": "#10~24.04.1-Ubuntu SMP Fri Jul 18 20:44:30 UTC 2025",
"machine": "x86_64",
"processor": "x86_64",
"accelerator": "NVIDIA L40S"
},
"pytorch_info": "PyTorch built with:\n - GCC 11.2\n - C++ Version: 201703\n - Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 12.6\n - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n - CuDNN 90.7.1 (built against CUDA 12.8)\n - Built with CuDNN 90.5.1\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, COMMIT_SHA=e2d141dbde55c2a4370fac5165b0561b6af4798b, CUDA_VERSION=12.6, CUDNN_VERSION=9.5.1, CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.7.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, \n"
}
}
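To make the long metrics list above easier to scan, here is a small sketch that summarizes the validation-accuracy trajectory (values copied from the `metrics` list; note that with `batch_size_eval` at 50 and accuracies all in steps of 0.02, each validation point appears to cover just 50 examples, so they are very noisy):

```python
# (step, valid accuracy) pairs copied from the Bone-Bat run above.
valid_acc = [
    (250, 0.32), (500, 0.42), (750, 0.38), (1000, 0.44), (1250, 0.34),
    (1500, 0.48), (1750, 0.46), (2000, 0.38), (2250, 0.38), (2500, 0.46),
    (2750, 0.46), (3000, 0.46), (3250, 0.50), (3500, 0.50), (3750, 0.50),
    (4000, 0.50), (4250, 0.52), (4500, 0.50), (4750, 0.50), (5000, 0.48),
]

best_step, best_acc = max(valid_acc, key=lambda t: t[1])
final_acc = valid_acc[-1][1]
print(f"best: {best_acc:.2f} at step {best_step}, final: {final_acc:.2f}")
# prints: best: 0.52 at step 4250, final: 0.48
```

The trajectory plateaus around 0.50 from step 3250 on, so the final test accuracy of 49.5% is a drop from the validation plateau rather than a continuation of it.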
I will also check it.
I ran a few Bone-bat experiments, variance of test accuracy is quite high:
I'll try a few other PEFT methods like LoRA next.
Yes, we should indeed verify whether other PEFT methods also produce such large variances.
I ran a few more experiments with LoRA rank 32, the rest being the same, and collected the test accuracies. Just from eyeballing, and from running a Levene/Brown–Forsythe test, it looks like {MiSS,Bone}-Bat has a higher variance than LoRA:

```python
from scipy import stats

# note: test set size is 1319
acc_lora = [0.48218347232752085, 0.47536012130401817, 0.4829416224412434,
            0.4836997725549659, 0.47763457164518575, 0.4783927217589083,
            0.4829416224412434, 0.4715693707354056]
acc_bat = [0.5200909780136467, 0.49052312357846856, 0.49583017437452614,
           0.5056861258529188, 0.5170583775587566, 0.5049279757391963]

stat, p = stats.levene(acc_lora, acc_bat, center='mean')
print(f"Levene W={stat:.3f}, p={p:.4f}")
# prints: Levene W=4.087, p=0.0661

stat, p = stats.levene(acc_lora, acc_bat, center='median')
print(f"Brown–Forsythe W={stat:.3f}, p={p:.4f}")
# prints: Brown–Forsythe W=3.960, p=0.0699
```

Of course, this is not definitive proof, but it strongly suggests that Bat has higher variance.
@JL-er Any ideas how to proceed? Is it possible that Bat is especially sensitive to small changes? The training loss is pretty consistent between the runs, so the differences there are not big, but the test accuracy still differs a lot (as generation can exacerbate small differences).
I haven’t had time to run tests recently, but I suspect that the instability in evaluation may be caused by the interaction between bat and the original weights. |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
@JL-er Did you have the time to investigate this further?
Bat should be fine, but it is too sensitive to the learning rate.
Okay, so we can just merge this PR, right? Maybe in a separate PR, it could be documented that Bat init can be unstable. |
Yes, we can merge the PR. We will propose new methods later. |