Stars
A lightweight design for computation-communication overlap.
A Datacenter Scale Distributed Inference Serving Framework
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
FlashInfer: Kernel Library for LLM Serving
FlashMLA: Efficient Multi-head Latent Attention Kernels
GitHub mirror of the triton-lang/triton repo.
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
Multi-Faceted AI Agent and Workflow Autotuning. Automatically optimizes LangChain, LangGraph, and DSPy programs for better quality, lower execution latency, and lower execution cost. Also has a simple …
Translation of C++ Core Guidelines [https://github.com/isocpp/CppCoreGuidelines] into Simplified Chinese.
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
Make a personal website using Notion and GitHub Pages
CUDA Templates and Python DSLs for High-Performance Linear Algebra
A collection of papers on retrieval-based (augmented) language models.
Universal cross-platform tokenizer bindings to HF tokenizers and SentencePiece
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
A quick, visual, principled introduction to PyTorch code through five Colab notebooks.
We want to create a repo to illustrate the usage of Transformers in Chinese.
C++-based high-performance parallel environment execution engine (vectorized env) for general RL environments.