AI-powered Quantitative Investment Research Platform.
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
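As a quick illustration of what low-bit quantization means in any of these toolkits, here is a minimal, library-agnostic sketch of symmetric per-tensor INT8 quantization in plain PyTorch (the helper name and tensor shapes are illustrative, not from any particular library):

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor INT8: x ≈ scale * q, with q in [-127, 127]."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(256, 256)
q, scale = quantize_int8(w)
w_hat = q.float() * scale               # dequantize
print((w - w_hat).abs().max().item())   # worst-case rounding error is about scale / 2
```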
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
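A hedged sketch of AIMET's quantization-simulation flow: the `QuantizationSimModel` and `compute_encodings` names follow older aimet_torch releases and may differ in current ones, and the calibration callback below is a stand-in that uses random data instead of a real calibration set.

```python
import torch
from torchvision.models import resnet18
from aimet_torch.quantsim import QuantizationSimModel

model = resnet18(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Wrap the model with fake-quantization ops to simulate low-bit inference
sim = QuantizationSimModel(model, dummy_input=dummy_input)

def calibrate(m, _args):
    # Stand-in for a pass over real calibration data
    with torch.no_grad():
        for _ in range(8):
            m(torch.randn(1, 3, 224, 224))

sim.compute_encodings(calibrate, None)   # collect weight/activation ranges
out = sim.model(dummy_input)             # quantization-simulated forward pass
```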
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Accessible large language models via k-bit quantization for PyTorch.
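bitsandbytes integrates with transformers, so a checkpoint can be loaded directly in 4-bit NF4; a minimal sketch (the model ID is illustrative, any causal LM works):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```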
PyTorch native quantization and sparsity for training and inference
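A minimal sketch of torchao's one-line post-training quantization; `int8_weight_only` follows recent torchao releases, and the API has shifted across versions, so treat this as an outline:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
quantize_(model, int8_weight_only())  # swaps Linear weights to int8 in place
out = model(torch.randn(8, 1024, dtype=torch.bfloat16, device="cuda"))
```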
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
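A hedged sketch of llm-compressor's one-shot GPTQ flow for producing a vLLM-ready W4A16 checkpoint; import paths and the `oneshot` signature vary across releases, and the checkpoint and dataset names are illustrative:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative checkpoint
    dataset="open_platypus",                      # calibration data
    recipe=recipe,
    output_dir="llama3-w4a16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```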
Advanced quantization algorithms for LLMs and VLMs, with support for CPU, Intel GPU, CUDA, and HPU. Seamlessly integrated with torchao, Transformers, and vLLM; export models effortlessly to autogptq, autoawq, gguf, and autoround formats, retaining higher accuracy even at extremely low bit precision.
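A hedged sketch of the AutoRound API as described in its README; argument names may differ between releases, and the checkpoint is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./qwen-4bit", format="auto_round")
```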
A mixed-precision GEMM with a quantize-and-reorder kernel.
The Hailo Model Zoo includes pre-trained models and a full building and evaluation environment
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM, and Sentence Transformers with easy-to-use hardware optimization tools
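A minimal sketch of Optimum's ONNX Runtime path, exporting a Transformers checkpoint to ONNX on the fly and running it through the usual pipeline API (the model ID is illustrative):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(clf("Quantization made this model faster."))
```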
Official repository of "Task Vector Quantization for Memory-Efficient Model Merging" [ICCV 2025]
Phase Transitions in Large Language Model Compression: A Perspective
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
Bench360 is a modular benchmarking suite for local LLM inference. It offers a full-stack, extensible pipeline to evaluate the latency, throughput, quality, and cost of LLM inference on consumer and enterprise GPUs. Bench360 supports flexible backends, tasks and scenarios, enabling fair and reproducible comparisons for researchers and practitioners.
🤗 Optimum Intel: Accelerate inference with Intel optimization tools
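A hedged sketch of the OpenVINO path in Optimum Intel, exporting a causal LM with 8-bit weight compression; the `load_in_8bit` flag follows recent releases, and the checkpoint is illustrative:

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"  # illustrative checkpoint
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Quantization is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```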
Quantized attention that achieves 2-5x speedups over FlashAttention and 3-11x over xformers, without degrading end-to-end metrics across language, image, and video models.
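A hedged sketch of dropping SageAttention in where scaled-dot-product attention would otherwise be called; the `sageattn` signature follows the project's README:

```python
import torch
from sageattention import sageattn

# q, k, v in (batch, heads, seq_len, head_dim) layout
q = torch.randn(1, 16, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
```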
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
A friendly CLI tool for converting Transformers models to CTranslate2 format and uploading them.
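For context, the conversion such a tool wraps can also be done with CTranslate2's own bundled converter and then loaded through the ctranslate2 Python API; a sketch following the CTranslate2 documentation (this is not the wrapper tool's own CLI):

```python
# Shell step (CTranslate2's bundled converter, not this wrapper's CLI):
#   ct2-transformers-converter --model Helsinki-NLP/opus-mt-en-de \
#       --output_dir opus-en-de-ct2 --quantization int8
import ctranslate2
import transformers

translator = ctranslate2.Translator("opus-en-de-ct2", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Quantized models load fast."))
result = translator.translate_batch([tokens])[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(result.hypotheses[0])))
```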
Palette quantization library that powers pngquant and other PNG optimizers