HazyResearch / ThunderKittens
Tile primitives for speedy kernels
CUDA-accelerated rasterization of Gaussian splatting
LLM training in simple, raw C/CUDA
cuVS - a library for vector search and clustering on the GPU
FlashInfer: Kernel Library for LLM Serving
Causal depthwise conv1d in CUDA, with a PyTorch interface
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Fast CUDA matrix multiplication from scratch (a minimal starting-point kernel is sketched after this list)
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized Attention: achieves a 2-5x speedup over FlashAttention without losing end-to-end accuracy across language, image, and video models
GPU-accelerated decision optimization
CUDA Kernel Benchmarking Library
CUDA Library Samples
Sample codes for my CUDA programming book
RCCL Performance Benchmark Tests
NCCL Tests
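As a frame of reference for entries like the from-scratch CUDA matrix multiplication tutorial above, here is a minimal sketch of the naive SGEMM baseline such tutorials typically start from before adding tiling, shared memory, and vectorized loads. The kernel name `sgemm_naive` and the 256x256 problem size are illustrative assumptions, not code from any repository listed here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Naive SGEMM: C = A * B for row-major MxK and KxN matrices.
// One thread computes one element of C; this is the unoptimized
// baseline that optimized GEMM kernels are measured against.
__global__ void sgemm_naive(int M, int N, int K,
                            const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];  // dot product of row and column
        C[row * N + col] = acc;
    }
}

int main() {
    const int M = 256, N = 256, K = 256;  // illustrative sizes

    float *A, *B, *C;  // unified memory keeps the example short
    cudaMallocManaged(&A, M * K * sizeof(float));
    cudaMallocManaged(&B, K * N * sizeof(float));
    cudaMallocManaged(&C, M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) A[i] = 1.0f;
    for (int i = 0; i < K * N; ++i) B[i] = 2.0f;

    dim3 block(16, 16);  // one thread per output element
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    sgemm_naive<<<grid, block>>>(M, N, K, A, B, C);
    cudaDeviceSynchronize();

    // Every element should equal 2 * K.
    printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * K);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Each global-memory load here is reused by only one thread, which is exactly the inefficiency that tile-based libraries and tutorials in this list (ThunderKittens, DeepGEMM, the from-scratch GEMM walkthrough) exist to eliminate.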