The University of Queensland, Brisbane
Stars
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
Astro template to help you build an interactive project page for your research paper
Video Summarization Datasets, Papers, and Code
TVSum: Title-based Video Summarization dataset (CVPR 2015)
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
verl: Volcano Engine Reinforcement Learning for LLMs
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
Replication package for paper: Representation-Based Fairness Evaluation and Bias Correction Robustness Assessment in Neural Networks
[ICLR 2025] VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning
🔥 Comprehensive survey on Context Engineering: from prompt engineering to production-grade AI systems. Hundreds of papers, frameworks, and implementation guides for LLMs and AI agents.
Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities
Official inference repo for FLUX.1 models
🚀 Cross attention map tools for huggingface/diffusers
[TMLR 2025] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer
[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electro…
[NeurIPS 2025 DB] OneIG-Bench is a meticulously designed comprehensive benchmark framework for fine-grained evaluation of T2I models across multiple dimensions, including subject-element alignment,…
[ICLR'25] Official code for the paper 'MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs'
[CVPR2025 Highlight] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Official Jax Implementation of MaskGIT