Stars
[ICLR 2025] When Attention Sink Emerges in Language Models: An Empirical View (Spotlight)
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
An efficient implementation of the NSA (Native Sparse Attention) kernel
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Puzzles for learning Triton — play them with minimal environment configuration!
CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…
DeepEP: an efficient expert-parallel communication library
Modified version of PyTorch able to work with changes to GPGPU-Sim
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
A course on LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Distributed Compiler based on Triton for Parallel Systems
heyppen / AirPosture
Forked from allenv0/AirPosture
Turn your AirPods into a posture coach on macOS
Latest Advances on System-2 Reasoning
Paper list for Efficient Reasoning.
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
A lightweight design for computation-communication overlap.
Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport
Fast Hadamard transform in CUDA, with a PyTorch interface
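The repository above provides a CUDA kernel, but the underlying algorithm is the standard fast Walsh-Hadamard transform, which replaces the O(n²) matrix multiply with O(n log n) butterfly passes. As a hedged illustration (plain Python, not the repo's actual API), a minimal in-place sketch:

```python
def fwht(x):
    """Fast Walsh-Hadamard transform (unnormalized, natural order).

    len(x) must be a power of two. Each pass combines elements
    h apart with a butterfly (a + b, a - b), doubling h each time.
    """
    x = list(x)
    n = len(x)
    assert n and (n & (n - 1)) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

print(fwht([1, 0, 1, 0]))  # → [2, 2, 0, 0]
```

The CUDA version in the repository parallelizes these butterfly passes across threads; a normalized transform would additionally divide by sqrt(n).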
An open-source GPU based on the AMD Southern Islands ISA.