Support for combined STT + LLM with models like gpt4o-audio-preview

Would love to see support of natively multi modal models like `gpt-4o-audio-preview-2025-06-03` and `gpt-4o-mini-audio-preview-2024-12-17` that can understand speech and directly generate text. I tried using them with the open ai realtime plugin but that was unsuccessful. 

I understand that this is possible with agents 1.2 release but only with realtime api which is not good for instruction following / tool calling + does not reliably produce audio markup tags like [angry], [sigh], etc which affect the downstram TTS models. 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support for combined STT + LLM with models like gpt4o-audio-preview #2955

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Support for combined STT + LLM with models like gpt4o-audio-preview #2955

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions