Nankai University - Hangzhou, China - https://zhengli97.github.io/
Stars
Automatic Video Generation from Scientific Papers
LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports the understanding of images, high-resolution images, and videos.
Long-RL: Scaling RL to Long Sequences (NeurIPS 2025)
Learning audio concepts from natural language supervision
🔥🔥🔥 [IEEE TCSVT] Latest papers, code, and datasets on Vid-LLMs.
🔥 🔥 🔥 Awesome MLLMs/Benchmarks for Short/Long/Streaming Video Understanding 📹
Reference PyTorch implementation and models for DINOv3
Official release of the ICCV 2025 paper "DiscretizedSDF".
Official implementation of Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.
Collection of Composed Image Retrieval (CIR) papers.
Code for the CVPR 2024 paper "Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters".
[CVPR 2025] LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant
[IJCV 2025] Smaller But Better: Unifying Layout Generation with Smaller Large Language Models
🔍 Search-o1: Agentic Search-Enhanced Large Reasoning Models [EMNLP 2025]
[arXiv 25] Aesthetics is Cheap, Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR
[NeurIPS 2025 Oral] Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
[ICCV 2025] Official PyTorch Code for "Advancing Textual Prompt Learning with Anchored Attributes"
[ICML 2024] The official PyTorch implementation of A2PR, a simple way to achieve SOTA in offline reinforcement learning with an adaptive advantage-guided policy regularization method.
Official code for the paper "Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation".
MedSeg-R: Medical Image Segmentation with Clinical Reasoning
A paper list of recent works on token compression for ViTs and VLMs.
[TMLR] Public code repo for the paper "A Single Transformer for Scalable Vision-Language Modeling".
[CVPR 2024 Highlight] Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
This repository provides a valuable reference for researchers in the field of multimodality; start your exploration of RL-based reasoning MLLMs here!
When do we not need larger vision models?
A curated list of state-of-the-art research in embodied AI, focusing on vision-language-action (VLA) models, vision-language navigation (VLN), and related multimodal learning approaches.
[CVPR 2025] Official PyTorch Code for "DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models"