Stars
🎓 Automatically update recommendation papers daily using GitHub Actions (updated every 12 hours)
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
This repository contains the official implementation of the research paper, "MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training" CVPR 2024
Q-Insight: Understanding Image Quality via Visual Reinforcement Learning
Qlib is an AI-oriented Quant investment platform that aims to use AI tech to empower Quant Research, from exploring ideas to implementing productions. Qlib supports diverse ML modeling paradigms, i…
🎓 Path to a free self-taught education in Computer Science!
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025
The official repo for "DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models".
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Q-Insight is open-sourced at https://github.com/bytedance/Q-Insight. This repository will not receive further updates.
A comprehensive collection of IQA papers
Evaluation and Tracking for LLM Experiments and AI Agents
1 minute of voice data can also be used to train a good TTS model! (few-shot voice cloning)
Official inference framework for 1-bit LLMs
State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
Fast, accurate, lightweight Python library for state-of-the-art embeddings
Port of OpenAI's Whisper model in C/C++
An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engines
Development repository for the Triton language and compiler
Enforce the output format (JSON Schema, regex, etc.) of a language model
Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, and video, and of performing real-time speech generation.
Large-scale LLM inference engine
Learn how to design systems at scale and prepare for system design interviews
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
🔥🔥🔥 Latest papers, code, and datasets on Vid-LLMs.
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions