University of California, Santa Barbara
Stars
Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
[NeurIPS 2025] More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
[EMNLP 2025] Official code for the paper "SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning"
Official code for NeurIPS 2025 paper "GRIT: Teaching MLLMs to Think with Images"
Official implementation of the NeurIPS 2025 paper "Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space"
Agent S: an open agentic framework that uses computers like a human
Universal memory layer for AI agents; announcing OpenMemory MCP for local, secure memory management.
[ICLR 2025] EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing
[ACL 2025 Findings] "Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models"
Official repo for the paper "Mojito: Motion Trajectory and Intensity Control for Video Generation"
Qwen3-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Large Concept Models: Language modeling in a sentence representation space
New repo collection for NVIDIA Cosmos: https://github.com/nvidia-cosmos
A simple screen parsing tool towards pure vision based GUI agent
Educational framework exploring ergonomic, lightweight multi-agent orchestration. Managed by the OpenAI Solution team.
[ICLR 2025] Official codebase for the ICLR 2025 paper "Multimodal Situational Safety"
[ECCV 2024] Official implementation of NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
This is the implementation of the ACL 2024 Findings paper "ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models"
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Code repo for "Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding"
Official repo of the ICLR 2025 paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos"
[ACL 2025 Findings] "Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA"
Letta is the platform for building stateful agents: open AI with advanced memory that can learn and self-improve over time.
Official repo for the TMLR paper "Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners"