

CPU Plugin for Triton

C++ 5 Updated Oct 10, 2025

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Python 93,818 25,516 Updated Oct 10, 2025

Efficient GPU kernels for block-sparse matrix multiplication and convolution

Cuda 1,061 199 Updated Jun 8, 2023

PyTorch Extension Library of Optimized Autograd Sparse Matrix Operations

Python 1,088 158 Updated Aug 12, 2025

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.

Python 400 74 Updated Oct 10, 2025

Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.

Jupyter Notebook 2,559 134 Updated Oct 9, 2025

Tokamax: A GPU and TPU kernel library.

Python 89 Updated Oct 10, 2025

GitHub mirror of the triton-lang/triton repo.

MLIR 85 20 Updated Oct 10, 2025
Python 1,126 108 Updated Oct 9, 2025

Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.

Python 376 9 Updated Oct 9, 2025

Low-bit LLM inference on CPU/NPU with lookup table

C++ 866 72 Updated Jun 5, 2025

CUDA Embedding Lookup Kernel Library

Cuda 28 4 Updated Jul 25, 2025

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 7,936 791 Updated Sep 19, 2025

The Orchestration Engine To Deliver Self-Service Infrastructure ⚡️

Rust 2,399 77 Updated Sep 1, 2025

AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming

Python 83 19 Updated Oct 10, 2025

NanoGPT speedrun in JAX. Originally at https://nor-git.pages.dev/modded-nanogpt-jax/

Python 2 1 Updated Aug 28, 2025

KANditioned: A very fast implementation of Kolmogorov-Arnold Networks

Python 12 Updated Oct 4, 2025

Fast low-bit matmul kernels in Triton

Python 379 28 Updated Sep 28, 2025

NCCL communication API layer, and transport layer created from first principles.

C++ 11 Updated Aug 20, 2025
Python 740 62 Updated May 24, 2024

A comprehensive collection of KAN (Kolmogorov-Arnold Network)-related resources, including libraries, projects, tutorials, papers, and more, for researchers and developers in the Kolmogorov-Arnold N…

3,068 292 Updated Jun 30, 2025

Official Implementation of "ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate"

Jupyter Notebook 425 20 Updated Dec 12, 2024
Python 67 4 Updated Nov 15, 2024

Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X

C++ 70 5 Updated Aug 2, 2025

Efficient optimizers

Python 269 25 Updated Oct 9, 2025

Efficient Triton Kernels for LLM Training

Python 5,733 414 Updated Oct 10, 2025
Python 529 49 Updated Sep 23, 2025

Puzzles for learning Triton

Jupyter Notebook 2,030 168 Updated Nov 18, 2024