TikTok · Sunnyvale, CA · https://xsxszab.github.io/ · in/yifei--wang
Stars
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
verl: Volcano Engine Reinforcement Learning for LLMs
A unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment…
Tensors and Dynamic neural networks in Python with strong GPU acceleration
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.
The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications.
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
MambaOut: Do We Really Need Mamba for Vision? (CVPR 2025)
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorR…
Hands-on exercises with real-life examples to study and practice Go concurrency patterns. Test cases are provided to verify your answers.
Virtual whiteboard for sketching hand-drawn-like diagrams
Ongoing research training transformer models at scale
SGLang is a fast serving framework for large language models and vision language models.
Seamless operability between C++11 and Python
Development repository for the Triton language and compiler
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
This project aims to share the technical principles behind large language models along with practical experience (LLM engineering and real-world LLM application deployment).
Fully open reproduction of DeepSeek-R1
A high-throughput and memory-efficient inference and serving engine for LLMs
This repository showcases various advanced techniques for Retrieval-Augmented Generation (RAG) systems. RAG systems combine information retrieval with generative models to provide accurate and cont…
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
CUDA Templates and Python DSLs for High-Performance Linear Algebra
A fast, clean, responsive Hugo theme.
The world’s fastest framework for building websites.
RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs