Hello Qwen-VL-Chat Team!
I am currently exploring multi-image understanding tasks with Qwen-VL-Chat. I understand that the model already supports two-image comparison questions, for example:
```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Chongqing.jpeg'},
    {'image': 'assets/mm_tutorial/Beijing.jpeg'},
    {'text': '上面两张图片分别是哪两个城市?请对它们进行对比。'},  # "Which two cities are shown in the two images above? Please compare them."
])
```
This functionality is extremely useful.
I have tried adding more images (up to 10) in a similar format (see the first sketch after the questions below), but I observed that the model's answers became repetitive and out of order. This led me to wonder:
- What is the maximum number of images that the model can accept in a single input?
- If more images are provided than this limit, how does the model behave (e.g., error, truncation, unexpected output)?
- Are there recommended formatting conventions for multi-image inputs that ensure reliable responses, especially when more than two images are involved?
- If multi-round calls to model.chat are necessary to handle many images (my current workaround is sketched after this list), do you have any suggestions or best practices for reducing the time these calls take?
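For reference, here is a minimal sketch of how I currently build the ten-image input, extrapolated from the two-image example above. The image paths and the Chinese prompt are placeholders for my own data, and the loading code follows the standard usage shown in the Qwen-VL README:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True
).eval()

# Placeholder paths; in my experiments these point to my own image files.
image_paths = [f"assets/my_images/img_{i}.jpg" for i in range(10)]

elements = [{"image": p} for p in image_paths]
elements.append({"text": "以上每张图片分别展示了什么内容?请逐一描述。"})  # "What does each of the images above show? Please describe them one by one."

query = tokenizer.from_list_format(elements)
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```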
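And this is the multi-round workaround I mentioned in the last question, sketched under the assumption that carrying `history` between `model.chat` calls is an acceptable way to accumulate images. The batch size and the prompts are again placeholders; I am not sure this is the intended pattern, which is exactly why I am asking about best practices:

```python
def chat_over_many_images(model, tokenizer, image_paths, batch_size=2):
    """Feed images to Qwen-VL-Chat in small batches, carrying history forward."""
    history = None
    for start in range(0, len(image_paths), batch_size):
        batch = image_paths[start:start + batch_size]
        elements = [{"image": p} for p in batch]
        elements.append({"text": "请先记住这些图片,稍后我会提问。"})  # "Please remember these images; I will ask about them later."
        query = tokenizer.from_list_format(elements)
        _, history = model.chat(tokenizer, query=query, history=history)

    # One final question over everything shown so far.
    final_query = tokenizer.from_list_format(
        [{"text": "请对比以上所有图片,总结它们的异同。"}]  # "Compare all the images above and summarize their similarities and differences."
    )
    response, _ = model.chat(tokenizer, query=final_query, history=history)
    return response
```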
Your guidance would be greatly appreciated, as it would help me design experiments effectively and ensure that I use the model in the best way possible.
Thank you very much for your time and support!
Best regards~