💡 [REQUEST] - Release of Jointly-Trained Text Encoder (like CLIP)

### Start Date

_No response_

### Implementation PR

_No response_

### Reference Issues

_No response_

### Summary

Hi Qwen-VL Team,

Thank you for your amazing work on Qwen-VL! It’s a powerful and much-appreciated contribution to the community.

I’d like to kindly request the release of the text encoder trained jointly with the image encoder. This would enable broader use in tasks like multi-modal retrieval, alignment-based applications, and research on cross-modal embeddings.

It would be a valuable addition to an already excellent project. Thank you for considering this!

Best regards,
Michael

### Basic Example

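Usage would be like other image-text retrieval models such as CLIP: embed a text query and a set of images, then rank the images by similarity to the query.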
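To make the intended use concrete, here is a minimal sketch of CLIP-style retrieval, assuming the text and image encoders produce vectors in a shared embedding space. The encoder itself is not available, so the embeddings below are made-up 3-dimensional stand-ins, and the helper names (`l2_normalize`, `cosine_similarity`, `rank_images`) are hypothetical, not part of any Qwen-VL API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_images(text_embedding, image_embeddings):
    """Return (image_name, score) pairs sorted from best to worst match."""
    scores = [
        (name, cosine_similarity(text_embedding, emb))
        for name, emb in image_embeddings.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# ASSUMPTION: placeholder vectors standing in for image-encoder output.
image_embeddings = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}

# ASSUMPTION: placeholder vector standing in for the requested text encoder's
# output for a query such as "a photo of a cat".
text_embedding = [0.8, 0.2, 0.1]

ranking = rank_images(text_embedding, image_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With a released text encoder, the placeholder vectors would be replaced by real encoder outputs, and the ranking step would stay the same.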

### Drawbacks

None expected. Releasing the encoder should take little effort.

### Unresolved questions

_No response_

💡 [REQUEST] - Release of Jointly-Trained Text Encoder (like CLIP) #510
