AI-powered Quantitative Investment Research Platform.
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
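As a quick illustration of what low-bit quantization means in any of these toolkits, here is a minimal, library-agnostic sketch of symmetric per-tensor INT8 quantization in plain PyTorch (the helper name and tensor shapes are illustrative, not from any particular library):

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor INT8: x ≈ scale * q, with q in [-127, 127]."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(256, 256)
q, scale = quantize_int8(w)
w_hat = q.float() * scale               # dequantize
print((w - w_hat).abs().max().item())   # worst-case rounding error is about scale / 2
```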
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
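A hedged sketch of AIMET's quantization-simulation flow: the `QuantizationSimModel` and `compute_encodings` names follow older aimet_torch releases and may differ in current ones, and the calibration callback below is a stand-in that uses random data instead of a real calibration set.

```python
import torch
from torchvision.models import resnet18
from aimet_torch.quantsim import QuantizationSimModel

model = resnet18(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Wrap the model with fake-quantization ops to simulate low-bit inference
sim = QuantizationSimModel(model, dummy_input=dummy_input)

def calibrate(m, _args):
    # Stand-in for a pass over real calibration data
    with torch.no_grad():
        for _ in range(8):
            m(torch.randn(1, 3, 224, 224))

sim.compute_encodings(calibrate, None)   # collect weight/activation ranges
out = sim.model(dummy_input)             # quantization-simulated forward pass
```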
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Accessible large language models via k-bit quantization for PyTorch.
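bitsandbytes integrates with transformers, so a checkpoint can be loaded directly in 4-bit NF4; a minimal sketch (the model ID is illustrative, any causal LM works):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```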
PyTorch native quantization and sparsity for training and inference
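A minimal sketch of torchao's one-line post-training quantization; `int8_weight_only` follows recent torchao releases, and the API has shifted across versions, so treat this as an outline:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
quantize_(model, int8_weight_only())  # swaps Linear weights to int8 in place
out = model(torch.randn(8, 1024, dtype=torch.bfloat16, device="cuda"))
```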
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
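A hedged sketch of llm-compressor's one-shot GPTQ flow for producing a vLLM-ready W4A16 checkpoint; import paths and the `oneshot` signature vary across releases, and the checkpoint and dataset names are illustrative:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative checkpoint
    dataset="open_platypus",                      # calibration data
    recipe=recipe,
    output_dir="llama3-w4a16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```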
Advanced quantization algorithms for LLMs and VLMs, with support for CPU, Intel GPU, CUDA, and HPU. Seamlessly integrated with torchao, Transformers, and vLLM; export models effortlessly to autogptq, autoawq, gguf, and autoround formats, retaining higher accuracy even at extremely low bit precision.
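A hedged sketch of the AutoRound API as described in its README; argument names may differ between releases, and the checkpoint is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./qwen-4bit", format="auto_round")
```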
A mixed-precision GEMM with a quantize-and-reorder kernel.
The Hailo Model Zoo includes pre-trained models and a full building and evaluation environment
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM, and Sentence Transformers with easy-to-use hardware optimization tools
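A minimal sketch of Optimum's ONNX Runtime path, exporting a Transformers checkpoint to ONNX on the fly and running it through the usual pipeline API (the model ID is illustrative):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(clf("Quantization made this model faster."))
```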
Official repository of "Task Vector Quantization for Memory-Efficient Model Merging" [ICCV 2025]
Phase Transitions in Large Language Model Compression: A Perspective
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
Bench360 is a modular benchmarking suite for local LLM inference. It offers a full-stack, extensible pipeline to evaluate the latency, throughput, quality, and cost of LLM inference on consumer and enterprise GPUs. Bench360 supports flexible backends, tasks and scenarios, enabling fair and reproducible comparisons for researchers and practitioners.
🤗 Optimum Intel: Accelerate inference with Intel optimization tools
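A hedged sketch of the OpenVINO path in Optimum Intel, exporting a causal LM with 8-bit weight compression; the `load_in_8bit` flag follows recent releases, and the checkpoint is illustrative:

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"  # illustrative checkpoint
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Quantization is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```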
Quantized attention that achieves 2-5x speedups over FlashAttention and 3-11x over xformers, without degrading end-to-end metrics across language, image, and video models.
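A hedged sketch of dropping SageAttention in where scaled-dot-product attention would otherwise be called; the `sageattn` signature follows the project's README:

```python
import torch
from sageattention import sageattn

# q, k, v in (batch, heads, seq_len, head_dim) layout
q = torch.randn(1, 16, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
```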
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
A friendly CLI tool for converting Transformers models to CTranslate2 format and uploading them.
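For context, the conversion such a tool wraps can also be done with CTranslate2's own bundled converter and then loaded through the ctranslate2 Python API; a sketch following the CTranslate2 documentation (this is not the wrapper tool's own CLI):

```python
# Shell step (CTranslate2's bundled converter, not this wrapper's CLI):
#   ct2-transformers-converter --model Helsinki-NLP/opus-mt-en-de \
#       --output_dir opus-en-de-ct2 --quantization int8
import ctranslate2
import transformers

translator = ctranslate2.Translator("opus-en-de-ct2", device="cpu")
tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Quantized models load fast."))
result = translator.translate_batch([tokens])[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(result.hypotheses[0])))
```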
Palette quantization library that powers pngquant and other PNG optimizers