Overview
Writing effective tests and evaluations are a key part of developing a reliable and production-ready AI agent. LiveKit Agents includes helpers that work with any Python testing framework, such as pytest, to write behavioral tests and evaluations alongside your existing unit and integration tests.
Use these tools to fine-tune your agent's behavior, work around tricky edge cases, and iterate on your agent's capabilities without breaking previously existing functionality.
What to test
You should plan to test your agent's behavior in the following areas:
- Expected behavior: Does your agent respond with the right intent and tone for typical use cases?
- Tool usage: Are functions called with correct arguments and proper context?
- Error handling: How does your agent respond to invalid inputs or tool failures?
- Grounding: Does your agent stay factual and avoid hallucinating information?
- Misuse resistance: How does your agent handle intentional attempts to misuse or manipulate it?
The built-in testing helpers are designed to work with text input and output, using an LLM plugin or realtime model in text-only mode. This is the most cost-effective and intuitive way to write comprehensive tests of your agent's behavior.
For testing options that exercise the entire audio pipeline, see the external tools section at the end of this guide.
Example test
Here is a simple behavioral test for the agent created in the voice AI quickstart, using pytest. It ensures that the agent responds with a friendly greeting and offers assistance.
from livekit.agents import AgentSessionfrom livekit.plugins import openaifrom my_agent import Assistant@pytest.mark.asyncioasync def test_assistant_greeting() -> None:async with (openai.LLM(model="gpt-4o-mini") as llm,AgentSession(llm=llm) as session,):await session.start(Assistant())result = await session.run(user_input="Hello")await result.expect.next_event().is_message(role="assistant").judge(llm, intent="Makes a friendly introduction and offers assistance.")result.expect.no_more_events()
Writing tests
This guide assumes the use of pytest, but is adaptable to other testing frameworks.
You must install both the pytest
and pytest-asyncio
packages to write tests for your agent.
pip install pytest pytest-asyncio
Test setup
Each test typically follows the same pattern:
@pytest.mark.asyncio # Or your async testing framework of choiceasync def test_your_agent() -> None:async with (# You must create an LLM instance for the `judge` methodopenai.LLM(model="gpt-4o-mini") as llm,# Create a session for the life of this test.# LLM is not required - it will use the agent's LLM if you don't provide one hereAgentSession(llm=llm) as session,):# Start the agent in the sessionawait session.start(Assistant())# Run a single conversation turn based on the given user inputresult = await session.run(user_input="Hello")# ...your assertions go here...
Result structure
The run
method executes a single conversation turn and returns a RunResult
, which contains each of the events that occurred during the turn, in order, and offers a fluent assertion API.
Simple turns where the agent responds with a single message and no tool calls can be straightforward, with only a single entry:
Loading diagram…
However, a more complex turn may contain tool calls, tool outputs, handoffs, and one or more messages.
Loading diagram…
To validate these multi-part turns, you can use either of the following approaches.
Sequential navigation
- Cursor through the events with
next_event()
. - Validate individual events with
is_*
assertions such asis_message()
. - Use
no_more_events()
to assert that you have reached the end of the list and no more events remain.
For example, to validate that the agent responds with a friendly greeting, you can use the following code:
result.expect.next_event().is_message(role="assistant")
Skipping events
You can also skip events without validation:
- Use
skip_next()
to skip one event, or pass a number to skip multiple events. - Use
skip_next_event_if()
to skip events conditionally if it matches the given type ("message"
,"function_call"
,"function_call_output"
, or"agent_handoff"
), plus optional other arguments of the same format as theis_*
assertions. - Use
next_event()
with a type and other arguments in the same format as theis_*
assertions to skip non-matching events implicitly.
Example:
result.expect.skip_next() # skips one eventresult.expect.skip_next(2) # skips two eventsresult.expect.skip_next_event_if(type="message", role="assistant") # Skips the next assistant messageresult.expect.next_event(type="message", role="assistant") # Advances to the next assistant message, skipping anything else. If no matching event is found, an assertion error is raised.
Indexed access
Access single events by index, without advancing the cursor, using the []
operator.
result.expect[0].is_message(role="assistant")
Search
Look for the presence of individual events in an order-agnostic way with the contains_*
methods such as contains_message()
. This can be combined with slices using the [:]
operator to search within a range.
result.expect.contains_message(role="assistant")result.expect[0:2].contains_message(role="assistant")
Assertions
The framework includes a number of assertion helpers to validate the content and types of events within each result.
Message assertions
Use is_message()
and contains_message()
to test individual messages. These methods accept an optional role
argument to match the message role.
result.expect.next_event().is_message(role="assistant")result.expect[0:2].contains_message(role="assistant")
Access additional properties with the event()
method:
event().item.content
- Message contentevent().item.role
- Message role
LLM-based judgment
Use judge()
to perform a qualitative evaluation of the message content using your LLM of choice. Specify the intended content, structure, or style of the message as a string, and include an LLM instance to evaluate it. The LLM receives the message string and the intent string, without surrounding context.
Here's an example:
result = await session.run(user_input="Hello")await (result.expect.next_event().is_message(role="assistant").judge(llm, intent="Offers a friendly introduction and offer of assistance."))
The llm
argument can be any LLM instance and does not need to be the same one used in the agent itself. Ensure you have setup the plugin correctly with the appropriate API keys and any other needed setup.
Tool call assertions
You can test three aspects of your agent's use of tools in these ways:
- Function calls: Verify that the agent calls the correct tool with the correct arguments.
- Function call outputs: Verify that the tool returns the expected output.
- Agent response: Verify that the agent performs the appropriate next step based on the tool output.
This example tests all three aspects in order:
result = await session.run(user_input="What's the weather in Tokyo?")# Test that the agent's first conversation item is a function callfnc_call = result.expect.next_event().is_function_call(name="lookup_weather", arguments={"location": "Tokyo"})# Test that the tool returned the expected output to the agentresult.expect.next_event().is_function_call_output(output="sunny with a temperature of 70 degrees.")# Test that the agent's response is appropriate based on the tool outputawait (result.expect.next_event().is_message(role="assistant").judge(llm,intent="Informs the user that the weather in Tokyo is sunny with a temperature of 70 degrees.",))# Verify the agent's turn is complete, with no additional messages or function callsresult.expect.no_more_events()
Access individual properties with the event()
method:
is_function_call().event().item.name
- Function nameis_function_call().event().item.arguments
- Function argumentsis_function_call_output().event().item.output
- Raw function outputis_function_call_output().event().item.is_error
- Whether the output is an erroris_function_call_output().event().item.call_id
- The function call ID
Agent handoff assertions
Use is_agent_handoff()
and contains_agent_handoff()
to test that the agent performs a handoff to a new agent.
# The next event must be an agent handoff to the specified agentresult.expect.next_event().is_agent_handoff(new_agent_type=MyAgent)# A handoff must occur somewhere in the turnresult.expect.contains_agent_handoff(new_agent_type=MyAgent)
Mocking tools
In many cases, you should mock your tools for testing. This is useful to easily test edge cases, such as errors or other unexpected behavior, or when the tool has a dependency on an external service that you don't need to test against.
Use the mock_tools
helper in a with
block to mock one or more tools for a specific Agent. For instance, to mock a tool to raise an error, use the following code:
from livekit.agents.testing import mock_tools# Mock a tool errorwith mock_tools(Assistant,{"lookup_weather": lambda: RuntimeError("Weather service is unavailable")},):result = await session.run(user_input="What's the weather in Tokyo?")await result.expect.next_event(type="message").judge(llm, intent="Should inform the user that an error occurred while looking up the weather.")
If you need a more complex mock, pass a function instead of a lambda:
def _mock_weather_tool(location: str) -> str:if location == "Tokyo":return "sunny with a temperature of 70 degrees."else:return "UNSUPPORTED_LOCATION"# Mock a specific tool responsewith mock_tools(Assistant, {"lookup_weather": _mock_weather_tool}):result = await session.run(user_input="What's the weather in Tokyo?")await result.expect.next_event(type="message").judge(llm,intent="Should indicate the weather in Tokyo is sunny with a temperature of 70 degrees.",)result = await session.run(user_input="What's the weather in Paris?")await result.expect.next_event(type="message").judge(llm,intent="Should indicate that weather lookups in Paris are not supported.",)
Testing multiple turns
You can test multiple turns of a conversation by executing the run
method multiple times. The conversation history builds automatically across turns.
# First turnresult1 = await session.run(user_input="Hello")await result1.expect.next_event().is_message(role="assistant").judge(llm, intent="Friendly greeting")# Second turn builds on conversation historyresult2 = await session.run(user_input="What's the weather like?")result2.expect.next_event().is_function_call(name="lookup_weather")result2.expect.next_event().is_function_call_output()await result2.expect.next_event().is_message(role="assistant").judge(llm, intent="Provides weather information")
Loading conversation history
To load conversation history manually, use the ChatContext
class just as in your agent code:
from livekit.agents import ChatContextagent = Assistant()await session.start(agent)chat_ctx = ChatContext()chat_ctx.add_message(role="user", content="My name is Alice")chat_ctx.add_message(role="assistant", content="Nice to meet you, Alice!")await agent.update_chat_ctx(chat_ctx)# Test that the agent remembers the contextresult = await session.run(user_input="What's my name?")await result.expect.next_event().is_message(role="assistant").judge(llm, intent="Should remember and mention the user's name is Alice")
Verbose output
The LIVEKIT_EVALS_VERBOSE
environment variable turns on detailed output for each agent execution. To use it with pytest, you must also set the -s
flag to disable pytest's automatic capture of stdout:
LIVEKIT_EVALS_VERBOSE=1 pytest -s -o log_cli=true <your-test-file>
Sample verbose output:
evals/test_agent.py::test_offers_assistance+ RunResult(user_input=`Hello`events:[0] ChatMessageEvent(item={'role': 'assistant', 'content': ['Hi there! How can I assist you today?']}))- Judgment succeeded for `Hi there! How can I assist...`: `The message provides a friendly greeting and explicitly offers assistance, fulfilling the intent.`PASSED
Integrating with CI
As the testing helpers work live against your LLM provider to test real agent behavior, you need to set up your CI system to include any necessary LLM API keys in order to work. Testing does not require LiveKit API keys as it does not make a LiveKit connection.
For GitHub Actions, see the guide on using secrets in GitHub Actions.
Never commit API keys to your repository. Use environment variables and CI secrets instead.
Third-party testing tools
To perform end-to-end testing of deployed agents, including the audio pipeline, consider these third-party services:
Bluejay
End-to-end testing for voice agents powered by real-world simulations.
Cekura
Testing and monitoring for voice AI agents.
Coval
Manage your AI conversational agents. Simulation & evaluations for voice and chat agents.
Hamming
At-scale testing & production monitoring for AI voice agents.
Additional resources
These examples and resources provide more help with testing and evaluation.