- The University of Hong Kong
- Hong Kong
- ttengwang.com
Stars
Code release for the paper "Progress-Aware Video Frame Captioning" (CVPR 2025)
Latest Advances on System-2 Reasoning
This is the official implementation of ICCV 2025 "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams"
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.
[ICML 2025 Oral] This is the official repository of the paper "What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities"
😎 Awesome list of Retrieval-Augmented Generation (RAG) applications in Generative AI.
StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation
A collection of papers on discrete diffusion models
A curated list of state-of-the-art research in embodied AI, focusing on vision-language-action (VLA) models, vision-language navigation (VLN), and related multimodal learning approaches.
Latest Advances on Vision-Language-Action Models.
A curated list for vision-and-language navigation. ACL 2022 paper "Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions"
[Lumina Embodied AI Community] Embodied AI Technical Guide (Embodied-AI-Guide)
[ICLR 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
[ECCV2024] Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models
[arXiv] Discrete Diffusion in Large Language and Multimodal Models: A Survey
The development and future prospects of multimodal reasoning models.
The official repo for "D-AR: Diffusion via Autoregressive Models"
🔥🔥🔥 Latest papers and code on uncertainty-based RL
Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos. (CVPR 2025)