Stars
Tensors and Dynamic neural networks in Python with strong GPU acceleration
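A minimal sketch of what "dynamic" means here: the autograd graph is built as ordinary Python executes, so even a data-dependent branch stays differentiable (shapes and device choice below are just illustrative):

```python
import torch

# Define-by-run: the graph is recorded while Python runs, so control flow
# may depend on tensor values and autograd tracks whichever path executed.
device = "cuda" if torch.cuda.is_available() else "cpu"
w = torch.randn(3, requires_grad=True, device=device)
x = torch.randn(8, 3, device=device)

h = x @ w
y = torch.relu(h) if h.mean() > 0 else torch.tanh(h)  # data-dependent branch
y.sum().backward()
print(w.grad)  # gradient flows through the branch that actually ran
```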
Efficient GPU kernels for block-sparse matrix multiplication and convolution
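To show what block-sparse matmul computes (not this repo's fused GPU kernels), here is a naive PyTorch reference that stores only the nonzero blocks of the weight matrix and skips zero blocks entirely; all names, shapes, and the block layout are illustrative:

```python
import torch

def block_sparse_matmul(x, blocks, layout, block_size):
    """Naive block-sparse matmul reference.
    x: (batch, in_features); layout: (in_blocks, out_blocks) bool mask of nonzero blocks;
    blocks: (nnz, block_size, block_size), ordered as layout.nonzero() yields."""
    in_blocks, out_blocks = layout.shape
    out = x.new_zeros(x.shape[0], out_blocks * block_size)
    for b, (i, j) in enumerate(layout.nonzero().tolist()):
        xi = x[:, i * block_size:(i + 1) * block_size]
        out[:, j * block_size:(j + 1) * block_size] += xi @ blocks[b]
    return out

layout = torch.rand(4, 4) < 0.25                  # ~25% of blocks are nonzero
blocks = torch.randn(int(layout.sum()), 32, 32)
x = torch.randn(8, 4 * 32)
y = block_sparse_matmul(x, blocks, layout, 32)    # (8, 128)
```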
PyTorch Extension Library of Optimized Autograd Sparse Matrix Operations
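A small sketch of the kind of operation such an extension optimizes, written with PyTorch's built-in sparse support rather than the library's own API: a COO sparse matrix times a dense matrix, with gradients flowing to the dense input.

```python
import torch

# A 3x3 COO sparse matrix with three nonzeros, multiplied by a dense matrix
# that requires gradients; torch.sparse.mm is differentiable w.r.t. the dense input.
indices = torch.tensor([[0, 1, 2],    # row indices
                        [2, 0, 1]])   # column indices
values = torch.tensor([1.0, 2.0, 3.0])
sparse = torch.sparse_coo_tensor(indices, values, size=(3, 3))

dense = torch.randn(3, 4, requires_grad=True)
out = torch.sparse.mm(sparse, dense)  # (3, 4)
out.sum().backward()
print(dense.grad.shape)               # torch.Size([3, 4])
```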
MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
Qwen3-Omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
GitHub mirror of the triton-lang/triton repo.
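For reference, a minimal Triton kernel in its Python DSL: a masked vector add launched over a 1D grid (block size chosen arbitrarily; requires a CUDA-capable device).

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                         # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                         # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```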
Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.
📚 LeetCUDA: modern CUDA learning notes with PyTorch for beginners 🐑, covering 200+ CUDA kernels, Tensor Cores, HGEMM, and FA-2 MMA. 🎉
The Orchestration Engine To Deliver Self-Service Infrastructure ⚡️
AMD RAD's Triton-based framework for seamless multi-GPU programming
NanoGPT speedrun in JAX. Originally at https://nor-git.pages.dev/modded-nanogpt-jax/
KANditioned: A very fast implementation of Kolmogorov-Arnold Networks
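As a reminder of what a KAN layer computes — a learnable univariate function on every input-output edge, summed over inputs — here is a minimal, unoptimized PyTorch sketch using a fixed Gaussian RBF basis; KANditioned's actual parameterization and speed tricks will differ.

```python
import torch
import torch.nn as nn

class NaiveKANLayer(nn.Module):
    """One KAN layer: a learnable univariate function phi_{i,j} on every edge,
    parameterized as a weighted sum of fixed Gaussian radial basis functions."""
    def __init__(self, in_dim, out_dim, num_basis=8, grid_min=-2.0, grid_max=2.0):
        super().__init__()
        self.register_buffer("grid", torch.linspace(grid_min, grid_max, num_basis))
        self.inv_width = num_basis / (grid_max - grid_min)   # fixed RBF width
        # one coefficient per (input, output, basis) triple
        self.coef = nn.Parameter(torch.randn(in_dim, out_dim, num_basis) * 0.1)

    def forward(self, x):                                     # x: (batch, in_dim)
        # evaluate every basis function at every input coordinate
        basis = torch.exp(-((x.unsqueeze(-1) - self.grid) * self.inv_width) ** 2)
        # sum_i phi_{i,j}(x_i): contract input and basis dimensions
        return torch.einsum("bik,iok->bo", basis, self.coef)  # (batch, out_dim)

model = nn.Sequential(NaiveKANLayer(4, 16), NaiveKANLayer(16, 1))
y = model(torch.randn(32, 4))   # (32, 1)
```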
An NCCL communication API layer and transport layer, built from first principles.
A comprehensive collection of KAN (Kolmogorov-Arnold Network) resources, including libraries, projects, tutorials, papers, and more, for researchers and developers in the Kolmogorov-Arnold Network field.
Official Implementation of "ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate"
Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X
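To unpack "blockwise FP8": each tile of a matrix gets its own scale so that its largest magnitude maps onto FP8 E4M3's representable range. A hedged sketch of that quantization step only — block size and layout are assumptions, not this repo's exact scheme, and it requires a PyTorch build with float8 dtypes:

```python
import torch

def quantize_blockwise_fp8(w: torch.Tensor, block: int = 128):
    """Quantize w tile-by-tile to FP8 E4M3 with one scale per (block x block) tile.
    Assumes w's dimensions are multiples of `block` to keep the sketch short."""
    FP8_MAX = 448.0  # largest finite value of float8_e4m3fn
    m, n = w.shape
    tiles = w.reshape(m // block, block, n // block, block).permute(0, 2, 1, 3)
    scales = tiles.abs().amax(dim=(-2, -1), keepdim=True).clamp_min(1e-12) / FP8_MAX
    q = (tiles / scales).to(torch.float8_e4m3fn)
    return q, scales.squeeze(-1).squeeze(-1)   # quantized tiles and per-tile scales

q, s = quantize_blockwise_fp8(torch.randn(256, 256))
print(q.shape, q.dtype, s.shape)  # (2, 2, 128, 128) float8 tiles, (2, 2) scales
```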
Efficient Triton Kernels for LLM Training