Stars
Qwen3-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
VideoSys: An easy and efficient system for video generation
The official repo for the paper "VeCLIP: Improving CLIP Training via Visual-enriched Captions"
On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench)
Qwen-VL-Plus & Qwen-VL-Max in ComfyUI
A Python implementation of John Gruber’s Markdown with Extension support.
Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.
Robust Speech Recognition via Large-Scale Weak Supervision
This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles and does not require an API key nor a headless browser.
A framework to enable multimodal models to operate a computer.
The official repo of Qwen-Audio (通义千问-Audio) chat & pretrained large audio language model proposed by Alibaba Cloud.
Fast and memory-efficient exact attention
✨✨Latest Advances on Multimodal Large Language Models
Touchstone: Evaluating Vision-Language Models by Language Models
(CVPR 2024) A benchmark for evaluating Multimodal LLMs using multiple-choice questions.
The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.
The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Generate text images for training deep learning OCR models
Data and code for NeurIPS 2022 Paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering".
Implementation of Toolformer, Language Models That Can Use Tools, by MetaAI
FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.