Stars
MSCCL++: A GPU-driven communication stack for scalable AI applications
Simple, safe way to store and distribute tensors
nanobind: tiny and efficient C++/Python bindings
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support
Minimalistic 4D-parallelism distributed training framework for educational purposes
Minimalistic large language model 3D-parallelism training
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
Zero Bubble Pipeline Parallelism
Efficient Triton Kernels for LLM Training
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
slime is an LLM post-training framework for RL scaling.
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
FlashInfer: Kernel Library for LLM Serving
A multi-task real-time/scheduled monitoring and intelligent analysis tool for Xianyu (闲鱼), built on Playwright with AI-based filtering and a full-featured admin dashboard. Helps users skip manual filtering of Xianyu listings and find desired items quickly.
[ICML2025, NeurIPS2025 Spotlight] Sparse VideoGen 1 & 2: Accelerating Video Diffusion Transformers with Sparse Attention
📚 A curated list of awesome diffusion inference papers with code: sampling, caching, quantization, parallelism, etc. 🎉
Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core.
Use PEFT or full-parameter training for CPT/SFT/DPO/GRPO on 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, GLM4.5, InternLM3, DeepSeek-R1, ...) and 200+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, Llava, GLM4v, Ph…
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek-R1, Qwen3, Gemma 3, TTS 2x faster with 70% less VRAM.
A lightweight design for computation-communication overlap.
SGLang is a fast serving framework for large language models and vision language models.
CUDA Python: Performance meets Productivity
Distributed Compiler based on Triton for Parallel Systems
Lightweight coding agent that runs in your terminal
Patch-wise convolution to avoid the large GPU memory usage of Conv2D