Stars
[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents
Fully open reproduction of DeepSeek-R1
[ICLR 2025] LAPA: Latent Action Pretraining from Videos
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
[CVPR 2024 Highlight] Official PyTorch implementation of SpatialTracker: Tracking Any 2D Pixels in 3D Space
[NeurIPS 2024] Official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs"
[COLM 2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
Reaching LLaMA2 Performance with 0.1M Dollars
Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.
[CVPR 2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale
Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL
[CVPR 2024] Official implementation of the paper "Visual In-context Learning"
Must-have resource for anyone who wants to experiment with and build on the OpenAI vision API 🔥
AI agent using GPT-4V(ision) capable of using a mouse and keyboard to interact with a web UI
[arXiv 2023] Set-of-Mark Prompting for GPT-4V and LMMs
A high-throughput and memory-efficient inference and serving engine for LLMs
[CVPR 2023] Official Implementation of X-Decoder for generalized decoding for pixel, image and language
[ICCV 2023] Official repository for "Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition"
[ECCV 2024] Official implementation of the paper "Semantic-SAM: Segment and Recognize Anything at Any Granularity"
Official PyTorch implementation of the paper "In-Context Learning Unlocked for Diffusion Models"
[NeurIPS 2023] Official implementation of the paper "Segment Everything Everywhere All at Once"
arXiv LaTeX Cleaner: Easily clean the LaTeX code of your paper to submit to arXiv
[ICCV 2023] Official implementation of the paper "A Simple Framework for Open-Vocabulary Segmentation and Detection"
[ICLR 2023 Spotlight 🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch implementation of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"