Overview
LiveKit Agents supports text inputs and outputs in addition to audio, based on the text streams feature of the LiveKit SDKs. This guide explains what's possible and how to use it in your app.
Transcriptions
When an agent performs STT as part of its processing pipeline, the transcriptions are published to the frontend in realtime. Additionally, a text representation of the agent's speech is published in sync with audio playback when the agent speaks. Both features are enabled by default when using `AgentSession`.
Transcriptions use the `lk.transcription` text stream topic. They include a `lk.transcribed_track_id` attribute, and the sender identity is that of the transcribed participant.
To disable transcription output, set `transcription_enabled=False` in `RoomOutputOptions`.
Synchronized transcription forwarding
When both voice and transcription are enabled, the agent's speech is synchronized with its transcriptions, displaying text word by word as it speaks. If the agent is interrupted, the transcription stops and is truncated to match the spoken output.
Disabling synchronization
To send transcriptions to the client as soon as they become available, without synchronizing them to the original speech, set `sync_transcription` to `False` in `RoomOutputOptions`.
```python
await session.start(
    agent=MyAgent(),
    room=ctx.room,
    room_output_options=RoomOutputOptions(sync_transcription=False),
)
```
Accessing from AgentSession
You can be notified within your agent whenever text input or output is committed to the chat history by listening to the `conversation_item_added` event.
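For example, a minimal sketch of an event handler, assuming the event payload exposes the committed chat item via `event.item` as in recent SDK versions:

```python
from livekit.agents import ConversationItemAddedEvent

@session.on("conversation_item_added")
def on_conversation_item_added(event: ConversationItemAddedEvent):
    # fires once for each user or agent item committed to the chat history
    print(f"{event.item.role}: {event.item.text_content}")
```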
TTS-aligned transcriptions
If your TTS provider supports it, you can enable TTS-aligned transcription forwarding to improve transcription synchronization with your frontend. This feature synchronizes the transcription output with the actual speech timing, enabling word-level synchronization. When using this feature, certain formatting may be lost from the original text, depending on the TTS provider.
Currently, only Cartesia and ElevenLabs support word-level transcription timing. For other providers, the alignment is applied at the sentence level and still improves synchronization reliability for multi-sentence turns.
To enable this feature, set `use_tts_aligned_transcript=True` in your `AgentSession` configuration:
```python
session = AgentSession(
    # ... stt, llm, tts, vad, etc...
    use_tts_aligned_transcript=True,
)
```
To access timing information in your code, implement a `transcription_node` method in your agent. The iterator yields `TimedString` values, which include a `start_time` and `end_time` for each word, in seconds relative to the start of the agent's current turn.
The `transcription_node` and `TimedString` implementations are experimental and may change in a future version of the SDK.
```python
async def transcription_node(
    self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
) -> AsyncGenerator[str | TimedString, None]:
    async for chunk in text:
        if isinstance(chunk, TimedString):
            logger.info(f"TimedString: '{chunk}' ({chunk.start_time} - {chunk.end_time})")
        yield chunk
```
Text input
Your agent also monitors the `lk.chat` text stream topic for incoming text messages from its linked participant. The agent interrupts its current speech, if any, to process the message and generate a new response.
To disable text input, set `text_enabled=False` in `RoomInputOptions`.
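For example, a sketch following the same `session.start` pattern shown earlier:

```python
await session.start(
    agent=MyAgent(),
    room=ctx.room,
    room_input_options=RoomInputOptions(text_enabled=False),
)
```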
Text-only session
Disable audio for the entire session
To disable audio input or output for the entire session, set `audio_enabled=False` in `RoomInputOptions` or `RoomOutputOptions`. When audio output is disabled, the agent does not publish audio tracks to the room. Text responses are sent without the `lk.transcribed_track_id` attribute and without speech synchronization.
```python
session = AgentSession(...)

await session.start(
    ...,
    room_input_options=RoomInputOptions(audio_enabled=False),
    room_output_options=RoomOutputOptions(audio_enabled=False),
)
```
Toggle audio input/output
For hybrid sessions where audio input and output may be used, such as when a user toggles an audio switch, the audio track should remain published to the room. The agent can toggle audio input and output dynamically using `session.input.set_audio_enabled()` and `session.output.set_audio_enabled()`.
```python
session = AgentSession(...)

# start with audio disabled
session.input.set_audio_enabled(False)
session.output.set_audio_enabled(False)

await session.start(...)

# user toggles audio switch
@room.local_participant.register_rpc_method("toggle_audio")
async def on_toggle_audio(data: rtc.RpcInvocationData) -> None:
    session.input.set_audio_enabled(not session.input.audio_enabled)
    session.output.set_audio_enabled(not session.output.audio_enabled)
```
Frontend integration
LiveKit client SDKs have native support for text streams. For more information, see the text streams documentation.
Receiving text streams
Use the `registerTextStreamHandler` method to receive incoming transcriptions or text:
```typescript
room.registerTextStreamHandler('lk.transcription', async (reader, participantInfo) => {
  const message = await reader.readAll();
  if (reader.info.attributes['lk.transcribed_track_id']) {
    console.log(`New transcription from ${participantInfo.identity}: ${message}`);
  } else {
    console.log(`New message from ${participantInfo.identity}: ${message}`);
  }
});
```
Sending text input
Use the `sendText` method to send text messages:
```typescript
const text = 'Hello how are you today?';
const info = await room.localParticipant.sendText(text, {
  topic: 'lk.chat',
});
```
Manual text input
To insert text input and generate a response, use the `generate_reply` method of `AgentSession`: `session.generate_reply(user_input="...")`.
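For example, a sketch that feeds text received over a hypothetical custom RPC method (the `user_text` name is illustrative) into the session, following the RPC pattern shown above:

```python
@room.local_participant.register_rpc_method("user_text")
async def on_user_text(data: rtc.RpcInvocationData) -> None:
    # insert the received text as user input and generate a response
    session.generate_reply(user_input=data.payload)
```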
Transcription events
Frontend SDKs can also receive transcription events via `RoomEvent.TranscriptionReceived`.
Transcription events will be removed in a future version. Use text streams on the `lk.transcription` topic instead.
```kotlin
room.events.collect { event ->
    if (event is RoomEvent.TranscriptionReceived) {
        event.transcriptionSegments.forEach { segment ->
            println("New transcription from ${segment.senderIdentity}: ${segment.text}")
        }
    }
}
```