Starred repositories
MiMo-Audio: Audio Language Models are Few-Shot Learners
Long-form streaming TTS system for multi-speaker dialogue generation
[ICML 2025] Official PyTorch Implementation of "History-Guided Video Diffusion"
Legacy-Mess Detector – assess the “legacy-mess level” of your code and output a beautiful report
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
[ACL 2024] Generative Pre-Trained Speech Language Model with Efficient Hierarchical Transformer
Tongyi Deep Research, the Leading Open-source Deep Research Agent
ComfyUI custom node for FunAudioLLM, including CosyVoice and SenseVoice
Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
Must-read papers on speech separation based on neural networks
OpenAI Baselines: high-quality implementations of reinforcement learning algorithms
Scalable toolkit for efficient model reinforcement
An official implementation of "SIM-CoT: Supervised Implicit Chain-of-Thought"
Interview notes compiled by Datawhale members, covering machine learning, CV, NLP, recommendation systems, development, and more; stars are welcome
PyTorch Implementation of StyleSinger (AAAI 2024): Style Transfer for Out-of-Domain Singing Voice Synthesis
MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
Efficient audio understanding with general audio captions
[ACMMM'2024] Generative Expressive Conversational Speech Synthesis
[ICCV 2025] Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
Wan: Open and Advanced Large-Scale Video Generative Models
DonArtkins / MetaGPT
Forked from FoundationAgents/MetaGPT
🌟 The Multi-Agent Framework: First AI Software Company, Towards Natural Language Programming
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
Text-audio foundation model from Boson AI
Repo for counting stars and contributing. Press F to pay respect to glorious developers.