Stars
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
An extremely fast Python package and project manager, written in Rust.
VS Code extension for syntax highlighting C++/CUDA/HIP code in PyTorch load_inline() strings; see the load_inline() sketch after this list.
RFC document, tooling and other content related to the array API standard; see the array API sketch after this list.
AGENTS.md — a simple, open format for guiding coding agents
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
🎡 Build Python wheels for all platforms with minimal configuration.
A next generation Python CMake adaptor and Python API for plugins
Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.
A minimal example for deploying Apache TVM's Relax IR using the C++ API
JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training
Distributed Compiler based on Triton for Parallel Systems
A Datacenter Scale Distributed Inference Serving Framework
A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
FlashMLA: Efficient Multi-head Latent Attention Kernels
verl: Volcano Engine Reinforcement Learning for LLMs
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
TL2cgen (TreeLite 2 C GENerator) is a model compiler for decision tree models
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
A PyTorch native platform for training generative AI models
Modeling, training, eval, and inference code for OLMo
Fast, Flexible and Portable Structured Generation
CUDA Python: Performance meets Productivity (see the cuda-python sketch below)
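For the load_inline() highlighter entry above, here is a minimal sketch of the kind of inline C++ string that extension targets. The module name, function name, and source body are illustrative, not taken from the extension itself; only torch.utils.cpp_extension.load_inline and its cpp_sources/functions parameters are the real PyTorch API.

```python
import torch
from torch.utils.cpp_extension import load_inline

# Inline C++ source embedded in a Python string; the VS Code extension
# highlights strings like this one.
cpp_source = r"""
#include <torch/extension.h>

torch::Tensor add_one(torch::Tensor x) {
    return x + 1;
}
"""

# Compiles the snippet on first use (requires a working C++ toolchain).
ext = load_inline(
    name="add_one_ext",        # illustrative module name
    cpp_sources=cpp_source,
    functions=["add_one"],     # functions to expose to Python
)

print(ext.add_one(torch.zeros(3)))  # tensor([1., 1., 1.])
```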
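For the array API standard entry, a sketch of what array-library-agnostic code looks like under the standard, assuming an array object that implements __array_namespace__ (NumPy >= 2.0 does); the standardize function itself is just an illustration.

```python
import numpy as np  # NumPy >= 2.0 implements much of the array API standard


def standardize(x):
    # Fetch the array API namespace from the array itself, so the same
    # function can work across conforming libraries without importing any
    # of them directly.
    xp = x.__array_namespace__()
    return (x - xp.mean(x)) / xp.std(x)


print(standardize(np.asarray([1.0, 2.0, 3.0])))
```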
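For the CUDA Python entry, a minimal sketch using the low-level driver bindings from the cuda-python package, which return a status code alongside any outputs; the exact module layout has shifted between releases, so treat the import path as an assumption.

```python
# Low-level driver bindings from cuda-python (import path assumed; newer
# releases also expose these under cuda.bindings).
from cuda import cuda

# Initialize the driver API; calls return (status, *outputs) tuples.
(err,) = cuda.cuInit(0)
assert err == cuda.CUresult.CUDA_SUCCESS

err, count = cuda.cuDeviceGetCount()
assert err == cuda.CUresult.CUDA_SUCCESS
print(f"CUDA devices visible to the driver: {count}")
```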