SageAttention Public
Forked from thu-ml/SageAttention
Quantized Attention achieves speedups of 2-5x and 3-11x compared to FlashAttention and xformers, without losing end-to-end metrics across language, image, and video models.
Cuda Apache License 2.0 Updated Jul 16, 2025
quack Public
Forked from Dao-AILab/quack
A Quirky Assortment of CuTe Kernels
Python Apache License 2.0 Updated Jul 10, 2025
FP8_RL Public
Forked from volcengine/verl
verl: Volcano Engine Reinforcement Learning for LLMs
Python Apache License 2.0 Updated Jul 6, 2025
DeepGEMM Public
Forked from deepseek-ai/DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Python MIT License Updated May 7, 2025
cute-flash-attention Public
Forked from luliyucoordinate/cute-flash-attention
Implement Flash Attention using CuTe.
Cuda Updated May 2, 2025
Triton-distributed Public
Forked from ByteDance-Seed/Triton-distributed
Distributed Triton for Parallel Systems
MLIR MIT License Updated Apr 8, 2025
sglang Public
Forked from sgl-project/sglang
SGLang is a fast serving framework for large language models and vision language models.
Python Apache License 2.0 Updated Apr 1, 2025
DeepEP Public
Forked from deepseek-ai/DeepEP
DeepEP: an efficient expert-parallel communication library
Cuda MIT License Updated Feb 28, 2025
tiny-flash-attention Public
Forked from 66RING/tiny-flash-attention
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
Cuda MIT License Updated Jan 18, 2025
Cute-Learning Public
Forked from DD-DuDa/Cute-Learning
Examples of CUDA implementations using CUTLASS CuTe
Makefile MIT License Updated Nov 24, 2024
Megatron-LM Public
Forked from NVIDIA/Megatron-LM
Ongoing research training transformer models at scale
Python Other Updated Nov 12, 2024
Book-Mathematical-Foundation-of-Reinforcement-Learning Public
Forked from MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning
This is the homepage of a new book entitled "Mathematical Foundations of Reinforcement Learning."
MATLAB Updated Nov 6, 2024
O1-Journey Public
Forked from GAIR-NLP/O1-Journey
O1 Replication Journey: A Strategic Progress Report – Part I
Updated Oct 9, 2024
cutlass Public
Forked from NVIDIA/cutlass
CUDA Templates for Linear Algebra Subroutines
C++ Other Updated Aug 29, 2024
anole Public
Forked from GAIR-NLP/anole
Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation
Python Updated Jul 9, 2024
alpha-zero-general Public
Forked from suragnair/alpha-zero-general
A clean implementation based on AlphaZero for any game in any framework + tutorial + Othello/Gobang/TicTacToe/Connect4 and more
Jupyter Notebook MIT License Updated Jun 6, 2024
pytorch_examples Public
Forked from pytorch/examples
A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc.
Python BSD 3-Clause "New" or "Revised" License Updated Apr 22, 2024
OLMo Public
Forked from allenai/OLMo
Modeling, training, eval, and inference code for OLMo
Python Apache License 2.0 Updated Mar 12, 2024
DiT Public
Forked from facebookresearch/DiT
Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"
Python Other Updated Feb 23, 2024
triton Public
Forked from triton-lang/triton
Development repository for the Triton language and compiler
C++ MIT License Updated Jan 26, 2024
dlrover Public
Forked from intelligent-machine-learning/dlrover
DLRover: An Automatic Distributed Deep Learning System
crystalcoder-train Public
Forked from LLM360/crystalcoder-train
Pre-training code for CrystalCoder 7B LLM
Updated Dec 11, 2023
DeepSpeed Public
Forked from deepspeedai/DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Python Apache License 2.0 Updated Oct 26, 2023
how-to-optim-algorithm-in-cuda Public
Forked from BBuf/how-to-optim-algorithm-in-cuda
How to optimize some algorithms in CUDA.
Cuda Updated Oct 16, 2023
easydl Public
Forked from samplise/easydl
EasyDL: A Kubernetes-native Deep Learning Training Service