Stars
Quantized Attention achieves 2-5x and 3-11x speedups over FlashAttention and xformers, respectively, without losing end-to-end metrics across language, image, and video models.
verl: Volcano Engine Reinforcement Learning for LLMs
Allow torch tensor memory to be released and resumed later
Examples of CUDA implementations using CUTLASS CuTe
Implementation of Flash Attention using CuTe.
Distributed Compiler based on Triton for Parallel Systems
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
This is the homepage of a new book entitled "Mathematical Foundations of Reinforcement Learning."
Tensors and Dynamic neural networks in Python with strong GPU acceleration
A minimal GPU design in Verilog to learn how GPUs work from the ground up
This project aims to reproduce Sora (OpenAI's T2V model); we welcome contributions from the open-source community.
Modeling, training, eval, and inference code for OLMo
Development repository for the Triton language and compiler
High-speed Large Language Model Serving for Local Deployment
Data preparation code for CrystalCoder 7B LLM
Pre-training code for CrystalCoder 7B LLM
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal domains, for both inference and training.
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
DLRover: An Automatic Distributed Deep Learning System
DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foundation.