A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
A Virtualized Multi-LoRA Framework for Unified LLM Fine-tuning and Serving
A std::execution-style runtime context and high-performance RPC transport built on OpenUCX, including support for CUDA/ROCm/... devices with RDMA.
SCORPIO is a system-algorithm co-designed LLM serving engine that prioritizes heterogeneous Service Level Objectives (SLOs) like TTFT and TPOT across all scheduling stages.
Venus Collective Communication Library, supported by SII and Infrawaves.
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores >70% on SWE-bench verified!
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
A high-performance inference engine for LLMs, optimized for diverse AI accelerators.
Awesome list for LLM quantization
PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [NeurIPS '25]
Unleashing the Power of Reinforcement Learning for Math and Code Reasoners
[NeurIPS 2025] R-KV: Redundancy-aware KV Cache Compression for Reasoning Models
calflops is designed to calculate FLOPs, MACs, and parameters for a wide range of neural networks, such as Linear, CNN, RNN, GCN, and Transformer models (BERT, LLaMA, and other large language models).
A Unified Cache Acceleration Framework for 🤗 Diffusers: Qwen-Image-Lightning, Qwen-Image, HunyuanImage, FLUX, Wan, etc.
[ICLR 2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
[NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
The official implementation of flow Q-learning (FQL)
Tongyi Deep Research, the Leading Open-source Deep Research Agent
Open-Sora: Democratizing Efficient Video Production for All
[ASPLOS'25] Towards End-to-End Optimization of LLM-based Applications with Ayo
Parallel Scaling Law for Language Models — Beyond Parameter and Inference Time Scaling
Ring attention implementation with flash attention