Hello Qwen-VL-Chat Team!
I am currently exploring multi-image understanding tasks with Qwen-VL-Chat. I understand that the model already supports two-image comparison questions, for example:
```python
query = tokenizer.from_list_format([
    {'image': 'assets/mm_tutorial/Chongqing.jpeg'},
    {'image': 'assets/mm_tutorial/Beijing.jpeg'},
    {'text': '上面两张图片分别是哪两个城市?请对它们进行对比。'},  # "Which two cities are shown in the two images above? Please compare them."
])
```
This functionality is extremely useful.
I have tried adding more images (up to 10) in a similar format (see the first sketch after the questions below), but I observed that the model's answers became repetitive and out of order. This led me to wonder:
- What is the maximum number of images that the model can accept in a single input?
- If more images are provided than this limit, how does the model behave (e.g., error, truncation, unexpected output)?
- Are there recommended formatting conventions for multi-image inputs that ensure reliable responses, especially when more than two images are involved?
- If multi-round calls to model.chat are necessary to handle many images (my current workaround is sketched after this list), do you have any suggestions or best practices for reducing the time these calls take?
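For reference, here is a minimal sketch of how I currently build the ten-image input, extrapolated from the two-image example above. The image paths and the Chinese prompt are placeholders for my own data, and the loading code follows the standard usage shown in the Qwen-VL README:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True
).eval()

# Placeholder paths; in my experiments these point to my own image files.
image_paths = [f"assets/my_images/img_{i}.jpg" for i in range(10)]

elements = [{"image": p} for p in image_paths]
elements.append({"text": "以上每张图片分别展示了什么内容?请逐一描述。"})  # "What does each of the images above show? Please describe them one by one."

query = tokenizer.from_list_format(elements)
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```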
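And this is the multi-round workaround I mentioned in the last question, sketched under the assumption that carrying `history` between `model.chat` calls is an acceptable way to accumulate images. The batch size and the prompts are again placeholders; I am not sure this is the intended pattern, which is exactly why I am asking about best practices:

```python
def chat_over_many_images(model, tokenizer, image_paths, batch_size=2):
    """Feed images to Qwen-VL-Chat in small batches, carrying history forward."""
    history = None
    for start in range(0, len(image_paths), batch_size):
        batch = image_paths[start:start + batch_size]
        elements = [{"image": p} for p in batch]
        elements.append({"text": "请先记住这些图片,稍后我会提问。"})  # "Please remember these images; I will ask about them later."
        query = tokenizer.from_list_format(elements)
        _, history = model.chat(tokenizer, query=query, history=history)

    # One final question over everything shown so far.
    final_query = tokenizer.from_list_format(
        [{"text": "请对比以上所有图片,总结它们的异同。"}]  # "Compare all the images above and summarize their similarities and differences."
    )
    response, _ = model.chat(tokenizer, query=final_query, history=history)
    return response
```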
Your guidance would be greatly appreciated, as it would help me design experiments effectively and ensure that I use the model in the best way possible.
Thank you very much for your time and support!
Best regards~