gty111
🎯 Focusing is all you need


Starred repositories

- Official PyTorch implementation for "Large Language Diffusion Models" (Python · 2,585 stars · 173 forks · updated Jun 17, 2025)
- Standardized Serverless ML Inference Platform on Kubernetes (Python · 4,368 stars · 1,214 forks · updated Jul 16, 2025)
- llm-d is a Kubernetes-native, high-performance distributed LLM inference framework (Makefile · 1,394 stars · 113 forks · updated Jul 16, 2025)
- Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers" (Python · 2,144 stars · 178 forks · updated Mar 27, 2024)
- Kimi K2 is the large language model series developed by the Moonshot AI team (5,652 stars · 317 forks · updated Jul 16, 2025)
- verl: Volcano Engine Reinforcement Learning for LLMs (Python · 11,159 stars · 1,855 forks · updated Jul 18, 2025)
- Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA (C++ · 1,585 stars · 99 forks · updated Jul 18, 2025)
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (Python · 476 stars · 33 forks · updated Feb 10, 2025)
- The official repo of Qwen (通义千问), the chat and pretrained large language model proposed by Alibaba Cloud (Python · 18,734 stars · 1,542 forks · updated Jun 16, 2025)
- Supercharge Your LLM with the Fastest KV Cache Layer (Python · 3,204 stars · 354 forks · updated Jul 18, 2025)
- Distributed Compiler based on Triton for Parallel Systems (Python · 891 stars · 75 forks · updated Jul 15, 2025)
- The official code for the paper "LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs" (Python · 91 stars · updated Jul 1, 2025)
- Quantized attention achieves speedups of 2-5x and 3-11x over FlashAttention and xformers, respectively, without losing end-to-end metrics across language, image, and video models (Cuda · 2,014 stars · 157 forks · updated Jul 16, 2025)
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving (C++ · 48 stars · 5 forks · updated May 8, 2025)
- MMaDA: Open-Sourced Multimodal Large Diffusion Language Models (Python · 1,204 stars · 55 forks · updated Jun 13, 2025)
- A bidirectional pipeline-parallelism algorithm for computation-communication overlap in V3/R1 training (Python · 2,827 stars · 300 forks · updated Mar 10, 2025)
- Analyze computation-communication overlap in V3/R1 (1,078 stars · 144 forks · updated Mar 21, 2025)
- Expert Parallelism Load Balancer (Python · 1,236 stars · 194 forks · updated Mar 24, 2025)
- DeepEP: an efficient expert-parallel communication library (Cuda · 8,285 stars · 860 forks · updated Jul 18, 2025)
- A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations (Python · 14,658 stars · 1,041 forks · updated Jul 12, 2025)
- Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation (7,870 stars · 280 forks · updated May 15, 2025)
- My learning notes and code for ML systems (Python · 2,939 stars · 181 forks · updated Jul 17, 2025)
- A markdown-version emoji cheat sheet (TypeScript · 13,216 stars · 4,545 forks · updated Jul 18, 2025)
- Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and other large language models (Go · 146,829 stars · 12,431 forks · updated Jul 17, 2025)
- Materials for learning SGLang (492 stars · 37 forks · updated Jul 8, 2025)
- High-performance Transformer implementation in C++ (C++ · 126 stars · 16 forks · updated Jan 18, 2025)
- Disaggregated serving system for Large Language Models (LLMs) (Jupyter Notebook · 643 stars · 70 forks · updated Apr 6, 2025)
- Graph Neural Network Library for PyTorch (Python · 22,620 stars · 3,851 forks · updated Jul 16, 2025)
- Best Practices on Recommendation Systems (Python · 20,486 stars · 3,226 forks · updated Jul 18, 2025)