Overview
Speech capabilities are a core feature of LiveKit agents, enabling them to interact with users through voice. This guide covers the speech features and capabilities available to agents.
LiveKit Agents provides a unified interface for controlling agent speech, whether you use an STT-LLM-TTS pipeline or a realtime model.
To learn more and see usage examples, see the following topics:
Text-to-speech (TTS)
TTS is a synthesis process that converts text into audio, giving AI agents a "voice."
Speech-to-speech
Multimodal, realtime APIs can understand speech input and generate speech output directly.
Preemptive speech generation
Preemptive generation allows the agent to begin generating a response before the user's end of turn is committed. The response is based on partial transcription or early signals from user input, helping reduce perceived response delay and improving conversational flow.
When enabled, the agent starts generating a response as soon as the final transcript is available. If the chat context or tools change in the `on_user_turn_completed` node, the preemptive response is canceled and replaced with a new one based on the final transcript.
This feature reduces latency when the following are true:
- The STT node returns the final transcript faster than VAD emits the `end_of_speech` event.
- A turn detection model is enabled.
You can enable this feature for STT-LLM-TTS pipeline agents using the `preemptive_generation` parameter for `AgentSession`:

```python
session = AgentSession(
    preemptive_generation=True,
    # ... STT, LLM, TTS, etc.
)
```
Preemptive generation doesn't guarantee reduced latency. Use logging, metrics, and telemetry to validate and fine-tune agent performance.
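For example, you can log collected metrics from the session and compare latency with and without preemptive generation. The following is a minimal sketch; it assumes the `metrics_collected` event and the `metrics.log_metrics` helper available in recent versions of the framework:

```python
from livekit.agents import metrics, MetricsCollectedEvent

# Log each batch of collected metrics (LLM, TTS, STT, and end-of-utterance
# timings) so you can compare runs with and without preemptive generation.
# Assumes the metrics_collected event and log_metrics helper exist in your
# installed SDK version.
@session.on("metrics_collected")
def _on_metrics_collected(ev: MetricsCollectedEvent):
    metrics.log_metrics(ev.metrics)
```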
Example
Preemptive generation example
An example of an agent using preemptive generation.
Initiating speech
By default, the agent waits for user input before responding—the Agents framework automatically handles response generation.
In some cases, though, the agent might need to initiate the conversation. For example, it might greet the user at the start of a session or check in after a period of silence.
session.say
To have the agent speak a predefined message, use `session.say()`. This triggers the configured TTS to synthesize speech and play it back to the user.
You can also optionally provide pre-synthesized audio for playback. This skips the TTS step and reduces response time.
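The following is a minimal sketch of this pattern. It assumes `say()` accepts an `audio` argument of audio frames (check your SDK version for the exact signature); here a short run of silence stands in for real pre-recorded audio:

```python
import numpy as np
from livekit import rtc

# Minimal sketch: pre-synthesized audio passed to say(), skipping TTS.
# The `audio` parameter name is an assumption; check the say() signature
# in your SDK version. Real code would yield frames decoded from a
# pre-recorded clip instead of silence.
async def prerecorded_audio():
    sample_rate = 24000
    samples = np.zeros(sample_rate // 2, dtype=np.int16)  # 0.5s of silence
    yield rtc.AudioFrame(
        data=samples.tobytes(),
        sample_rate=sample_rate,
        num_channels=1,
        samples_per_channel=len(samples),
    )

await session.say(
    "Hello. How can I help you today?",
    audio=prerecorded_audio(),
)
```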
The `say` method requires a TTS plugin. If you're using a realtime model, you need to add a TTS plugin to your session or use the `generate_reply()` method instead.
await session.say("Hello. How can I help you today?",allow_interruptions=False,)
Parameters
- `allow_interruptions`: If `True`, allow the user to interrupt the agent while speaking. (default `True`)
- `add_to_chat_ctx`: If `True`, add the text to the agent's chat context after playback. (default `True`)

Returns

Returns a `SpeechHandle` object.
Events
This method triggers a `speech_created` event.
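If you need to react when speech is scheduled, you can register a handler for this event. The following is a minimal sketch; see the speech-related events documentation for the exact event payload:

```python
# Minimal sketch: log each speech_created event emitted by the session.
# The event object's exact fields are documented with the speech events.
@session.on("speech_created")
def _on_speech_created(ev):
    print("agent speech created:", ev)
```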
generate_reply
To make conversations more dynamic, use `session.generate_reply()` to prompt the LLM to generate a response.
There are two ways to use `generate_reply`:

Give the agent instructions to generate a response:

```python
session.generate_reply(
    instructions="greet the user and ask where they are from",
)
```

Provide the user's input via text:

```python
session.generate_reply(
    user_input="how is the weather today?",
)
```
When using `generate_reply` with `instructions`, the agent uses the instructions to generate a response, which is added to the chat history. The instructions themselves are not recorded in the history. In contrast, `user_input` is added directly to the chat history.
Parameters
- `allow_interruptions`: If `True`, allow the user to interrupt the agent while speaking. (default `True`)

Returns

Returns a `SpeechHandle` object.
Events
This method triggers a `speech_created` event.
Controlling agent speech
You can control agent speech using the `SpeechHandle` object returned by the `say()` and `generate_reply()` methods, and by allowing or disallowing user interruptions.
SpeechHandle
The `say()` and `generate_reply()` methods return a `SpeechHandle` object, which lets you track the state of the agent's speech. This can be useful for coordinating follow-up actions, such as notifying the user before ending the call.
await session.say("Goodbye for now.", allow_interruptions=False)# the above is a shortcut for# handle = session.say("Goodbye for now.", allow_interruptions=False)# await handle.wait_for_playout()
You can wait for the agent to finish speaking before continuing:
```python
handle = session.generate_reply(
    instructions="Tell the user we're about to run some slow operations."
)

# perform an operation that takes time...

await handle  # finally, wait for the speech
```
The following example makes a web request for the user, and cancels the request when the user interrupts:
```python
async with aiohttp.ClientSession() as client_session:
    web_request = client_session.get('https://api.example.com/data')
    handle = await session.generate_reply(
        instructions="Tell the user we're processing their request."
    )
    if handle.interrupted:
        # if the user interrupts, cancel the web_request too
        web_request.cancel()
```
`SpeechHandle` has an API similar to `asyncio.Future`, allowing you to add a callback:
handle = session.say("Hello world")handle.add_done_callback(lambda _: print("speech done"))
Getting the current speech handle
The agent session's active speech handle, if any, is available with the `current_speech` property. If no speech is active, this property returns `None`. Otherwise, it returns the active `SpeechHandle`.
Use the active speech handle to coordinate with the speaking state. For instance, you can ensure that a hang up occurs only after the current speech has finished, rather than mid-speech:
```python
# to hang up the call as part of a function call
@function_tool
async def end_call(self, ctx: RunContext):
    """Use this tool when the user has signaled they wish to end the current call. The session ends automatically after invoking this tool."""
    # let the agent finish speaking
    current_speech = ctx.session.current_speech
    if current_speech:
        await current_speech.wait_for_playout()

    # call API to delete_room...
```
Interruptions
By default, the agent stops speaking when it detects that the user has started speaking. This behavior can be disabled by setting `allow_interruptions=False` when scheduling speech.

To explicitly interrupt the agent, call the `interrupt()` method on the handle or session at any time. This can be performed even when `allow_interruptions` is set to `False`.
handle = session.say("Hello world")handle.interrupt()# or from the sessionsession.interrupt()
Customizing pronunciation
Most TTS providers allow you to customize pronunciation of words using Speech Synthesis Markup Language (SSML). The following example uses the `tts_node` to add custom pronunciation rules:
```python
async def tts_node(
    self,
    text: AsyncIterable[str],
    model_settings: ModelSettings
) -> AsyncIterable[rtc.AudioFrame]:
    # Pronunciation replacements for common technical terms and abbreviations.
    # Support for custom pronunciations depends on the TTS provider.
    pronunciations = {
        "API": "A P I",
        "REST": "rest",
        "SQL": "sequel",
        "kubectl": "kube control",
        "AWS": "A W S",
        "UI": "U I",
        "URL": "U R L",
        "npm": "N P M",
        "LiveKit": "Live Kit",
        "async": "a sink",
        "nginx": "engine x",
    }

    async def adjust_pronunciation(input_text: AsyncIterable[str]) -> AsyncIterable[str]:
        async for chunk in input_text:
            modified_chunk = chunk

            # Apply pronunciation rules
            for term, pronunciation in pronunciations.items():
                # Use word boundaries to avoid partial replacements
                modified_chunk = re.sub(
                    rf'\b{term}\b',
                    pronunciation,
                    modified_chunk,
                    flags=re.IGNORECASE
                )

            yield modified_chunk

    # Process with modified text through base TTS implementation
    async for frame in Agent.default.tts_node(
        self,
        adjust_pronunciation(text),
        model_settings
    ):
        yield frame
```
The following table lists the SSML tags supported by most TTS providers:
| SSML Tag | Description |
| --- | --- |
| `phoneme` | Provides a phonetic pronunciation for the enclosed text using a standard phonetic alphabet. |
| `say-as` | Specifies how to interpret the enclosed text. For example, use `characters` to speak each character individually, or `date` to speak a calendar date. |
| `lexicon` | A custom dictionary that defines the pronunciation of certain words using phonetic notation or text-to-pronunciation mappings. |
| `emphasis` | Speaks the enclosed text with emphasis. |
| `break` | Adds a manual pause. |
| `prosody` | Controls the pitch, speaking rate, and volume of speech output. |
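For reference, SSML markup that combines several of these tags can look like the following. Tag names and attributes vary by provider, so treat this string as a generic illustration rather than provider-specific syntax:

```python
# Generic SSML example; check your TTS provider's documentation for the
# tags and attribute values it actually supports.
ssml_text = (
    "<speak>"
    'Your confirmation code is <say-as interpret-as="characters">AB123</say-as>.'
    '<break time="500ms"/>'
    '<prosody rate="slow">Please keep it somewhere safe.</prosody>'
    "</speak>"
)
```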
Adjusting speech volume
To adjust the volume of the agent's speech, add a processor to the `tts_node` or the `realtime_audio_output_node`. Alternatively, you can adjust the playback volume in the frontend SDK.
The following example agent has an adjustable volume between 0 and 100, and offers a tool call to change it.
```python
class Assistant(Agent):
    def __init__(self) -> None:
        self.volume: int = 50
        super().__init__(
            instructions=f"You are a helpful voice AI assistant. Your starting volume level is {self.volume}."
        )

    @function_tool()
    async def set_volume(self, volume: int):
        """Set the volume of the audio output.

        Args:
            volume (int): The volume level to set. Must be between 0 and 100.
        """
        self.volume = volume

    # Audio node used by STT-LLM-TTS pipeline models
    async def tts_node(self, text: AsyncIterable[str], model_settings: ModelSettings):
        return self._adjust_volume_in_stream(
            Agent.default.tts_node(self, text, model_settings)
        )

    # Audio node used by realtime models
    async def realtime_audio_output_node(
        self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
    ) -> AsyncIterable[rtc.AudioFrame]:
        return self._adjust_volume_in_stream(
            Agent.default.realtime_audio_output_node(self, audio, model_settings)
        )

    async def _adjust_volume_in_stream(
        self, audio: AsyncIterable[rtc.AudioFrame]
    ) -> AsyncIterable[rtc.AudioFrame]:
        stream: utils.audio.AudioByteStream | None = None
        async for frame in audio:
            if stream is None:
                stream = utils.audio.AudioByteStream(
                    sample_rate=frame.sample_rate,
                    num_channels=frame.num_channels,
                    samples_per_channel=frame.sample_rate // 10,  # 100ms
                )
            for f in stream.push(frame.data):
                yield self._adjust_volume_in_frame(f)

        if stream is not None:
            for f in stream.flush():
                yield self._adjust_volume_in_frame(f)

    def _adjust_volume_in_frame(self, frame: rtc.AudioFrame) -> rtc.AudioFrame:
        audio_data = np.frombuffer(frame.data, dtype=np.int16)
        audio_float = audio_data.astype(np.float32) / np.iinfo(np.int16).max
        audio_float = audio_float * max(0, min(self.volume, 100)) / 100.0
        processed = (audio_float * np.iinfo(np.int16).max).astype(np.int16)

        return rtc.AudioFrame(
            data=processed.tobytes(),
            sample_rate=frame.sample_rate,
            num_channels=frame.num_channels,
            samples_per_channel=len(processed) // frame.num_channels,
        )
```
Adding background audio
To add more realism to your agent, or to add additional sound effects, publish background audio. This audio is played on a separate audio track. The `BackgroundAudioPlayer` class supports on-demand playback of custom audio, as well as automatic ambient and thinking sounds synchronized to the agent lifecycle.
For a complete example, see the following recipe:
Background Audio
Create the player
The `BackgroundAudioPlayer` class manages audio playback to a room. It can also play ambient and thinking sounds automatically during the lifecycle of the agent session, if desired:

- `ambient_sound`: Plays on a loop in the background during the agent session. See Supported audio sources and Multiple audio clips for more details.
- `thinking_sound`: Plays while the agent is in the "thinking" state. See Supported audio sources and Multiple audio clips for more details.
Create the player within your `entrypoint` function:
```python
from livekit.agents import BackgroundAudioPlayer, AudioConfig, BuiltinAudioClip

# An audio player with automated ambient and thinking sounds
background_audio = BackgroundAudioPlayer(
    ambient_sound=AudioConfig(BuiltinAudioClip.OFFICE_AMBIENCE, volume=0.8),
    thinking_sound=[
        AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING, volume=0.8),
        AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING2, volume=0.7),
    ],
)

# An audio player with a custom ambient sound played on a loop
background_audio = BackgroundAudioPlayer(
    ambient_sound="/path/to/my-custom-sound.mp3",
)

# An audio player for on-demand playback only
background_audio = BackgroundAudioPlayer()
```
Start and stop the player
Call the `start` method after room connection and after starting the agent session. Ambient sounds, if any, begin playback immediately. The method takes the following parameters:

- `room`: The room to publish the audio to.
- `agent_session`: The agent session to publish the audio to.
```python
await background_audio.start(room=ctx.room, agent_session=session)
```
To stop and dispose of the player, call the `aclose` method. You must create a new player instance if you want to start again.
```python
await background_audio.aclose()
```
Play audio on-demand
You can play audio at any time, after starting the player, with the `play` method. It takes the following parameters:

- `audio`: The audio source, or a probabilistic list of sources, to play. To learn more, see Supported audio sources and Multiple audio clips.
- `loop`: Set to `True` to continuously loop playback.
For example, if you created `background_audio` as in the previous example, you can play an audio file like this:
background_audio.play("/path/to/my-custom-sound.mp3")
The `play` method returns a `PlayHandle`, which you can use to await or cancel the playback.
The following example uses the handle to await playback completion:
```python
# Wait for playback to complete
await background_audio.play("/path/to/my-custom-sound.mp3")
```
The next example shows the handle's `stop` method, which stops playback early:
```python
handle = background_audio.play("/path/to/my-custom-sound.mp3")
await asyncio.sleep(1)
handle.stop()  # Stop playback early
```
Multiple audio clips
You can pass a list of audio sources to any of `play`, `ambient_sound`, or `thinking_sound`. The player selects a single entry from the list based on the `probability` parameter. This is useful to avoid repetitive sound effects. To allow for the possibility of no audio at all, ensure the sum of the probabilities is less than 1.
`AudioConfig` has the following properties:

- The audio source to play (the first positional argument). See Supported audio sources for more details.
- `volume`: The volume at which to play the given audio.
- `probability`: The relative probability of selecting this audio source from the list.
```python
# Play the KEYBOARD_TYPING sound with an 80% probability and the
# KEYBOARD_TYPING2 sound with a 20% probability
background_audio.play([
    AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING, volume=0.8, probability=0.8),
    AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING2, volume=0.7, probability=0.2),
])
```
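To leave a chance of playing nothing at all, use probabilities that sum to less than 1. In the following variation, there is a 30% chance that no clip plays:

```python
# 40% KEYBOARD_TYPING, 30% KEYBOARD_TYPING2, and a 30% chance of silence
background_audio.play([
    AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING, volume=0.8, probability=0.4),
    AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING2, volume=0.7, probability=0.3),
])
```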
Supported audio sources
The following audio sources are supported:
Local audio file
Pass a string path to any local audio file. The player decodes files with FFmpeg via PyAV and supports all common audio formats including MP3, WAV, AAC, FLAC, OGG, Opus, WebM, and MP4.
The player uses an optimized custom decoder to load WAV data directly into audio frames, avoiding the overhead of FFmpeg. For small files, WAV is the most efficient option.
Built-in audio clips
The following built-in audio clips are available by default for common sound effects:
- `BuiltinAudioClip.OFFICE_AMBIENCE`: Chatter and general background noise of a busy office.
- `BuiltinAudioClip.KEYBOARD_TYPING`: The sound of an operator typing on a keyboard, close to their microphone.
- `BuiltinAudioClip.KEYBOARD_TYPING2`: A shorter version of `KEYBOARD_TYPING`.
Raw audio frames
Pass an `AsyncIterator[rtc.AudioFrame]` to play raw audio frames from any source.
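As a hedged sketch, the following generator yields a one-second 440 Hz tone as raw frames and plays it through the player created earlier. The tone generation is purely illustrative; any `AsyncIterator[rtc.AudioFrame]` works:

```python
import numpy as np
from livekit import rtc

# Illustrative generator of raw audio frames: a 1-second 440 Hz tone,
# chunked into 100 ms frames of 16-bit mono PCM.
async def tone_frames(sample_rate: int = 24000, freq: float = 440.0):
    samples_per_frame = sample_rate // 10  # 100 ms per frame
    t = 0
    for _ in range(10):  # 10 frames = 1 second
        idx = np.arange(t, t + samples_per_frame)
        wave = (0.2 * np.sin(2 * np.pi * freq * idx / sample_rate) * 32767).astype(np.int16)
        t += samples_per_frame
        yield rtc.AudioFrame(
            data=wave.tobytes(),
            sample_rate=sample_rate,
            num_channels=1,
            samples_per_channel=samples_per_frame,
        )

# Play the raw frames through the background audio player created earlier
background_audio.play(tone_frames())
```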
Additional resources
To learn more, see the following resources.
Voice AI quickstart
Use the quickstart as a starting point for adding audio code.
Speech-related events
Learn more about the `speech_created` event, triggered when new agent speech is created.
LiveKit SDK
Learn how to use the LiveKit SDK to play audio tracks.
Background audio example
An example of using the `BackgroundAudioPlayer` class to play ambient office noise and thinking sounds.
Text-to-speech (TTS)
TTS usage and examples for pipeline agents.
Speech-to-speech
Multimodal, realtime APIs understand speech input and generate speech output directly.