Testing and evaluation | LiveKit Docs

Overview

Writing effective tests and evaluations are a key part of developing a reliable and production-ready AI agent. LiveKit Agents includes helpers that work with any Python testing framework, such as pytest, to write behavioral tests and evaluations alongside your existing unit and integration tests.

Use these tools to fine-tune your agent's behavior, work around tricky edge cases, and iterate on your agent's capabilities without breaking previously existing functionality.

What to test

You should plan to test your agent's behavior in the following areas:

Expected behavior: Does your agent respond with the right intent and tone for typical use cases?
Tool usage: Are functions called with correct arguments and proper context?
Error handling: How does your agent respond to invalid inputs or tool failures?
Grounding: Does your agent stay factual and avoid hallucinating information?
Misuse resistance: How does your agent handle intentional attempts to misuse or manipulate it?

Text-only testing

The built-in testing helpers are designed to work with text input and output, using an LLM plugin or realtime model in text-only mode. This is the most cost-effective and intuitive way to write comprehensive tests of your agent's behavior.

For testing options that exercise the entire audio pipeline, see the external tools section at the end of this guide.

Example test

Here is a simple behavioral test for the agent created in the voice AI quickstart, using pytest. It ensures that the agent responds with a friendly greeting and offers assistance.

from livekit.agents import AgentSession
from livekit.plugins import openai

from my_agent import Assistant

@pytest.mark.asyncio
async def test_assistant_greeting() -> None:
    async with (
        openai.LLM(model="gpt-4o-mini") as llm,
        AgentSession(llm=llm) as session,
    ):
        await session.start(Assistant())

        result = await session.run(user_input="Hello")

        await result.expect.next_event().is_message(role="assistant").judge(
            llm, intent="Makes a friendly introduction and offers assistance."
        )
        
        result.expect.no_more_events()

Writing tests

Testing frameworks

This guide assumes the use of pytest, but is adaptable to other testing frameworks.

You must install both the pytest and pytest-asyncio packages to write tests for your agent.

pip install pytest pytest-asyncio

Test setup

Each test typically follows the same pattern:

@pytest.mark.asyncio # Or your async testing framework of choice
async def test_your_agent() -> None:
    async with (
        # You must create an LLM instance for the `judge` method
        openai.LLM(model="gpt-4o-mini") as llm,

        # Create a session for the life of this test. 
        # LLM is not required - it will use the agent's LLM if you don't provide one here
        AgentSession(llm=llm) as session,
    ):
        # Start the agent in the session
        await session.start(Assistant())
        
        # Run a single conversation turn based on the given user input
        result = await session.run(user_input="Hello")
        
        # ...your assertions go here...

Result structure

The run method executes a single conversation turn and returns a RunResult, which contains each of the events that occurred during the turn, in order, and offers a fluent assertion API.

Simple turns where the agent responds with a single message and no tool calls can be straightforward, with only a single entry:

Loading diagram…

However, a more complex turn may contain tool calls, tool outputs, handoffs, and one or more messages.

Loading diagram…

To validate these multi-part turns, you can use either of the following approaches.

Cursor through the events with next_event().
Validate individual events with is_* assertions such as is_message().
Use no_more_events() to assert that you have reached the end of the list and no more events remain.

For example, to validate that the agent responds with a friendly greeting, you can use the following code:

result.expect.next_event().is_message(role="assistant")

Skipping events

You can also skip events without validation:

Use skip_next() to skip one event, or pass a number to skip multiple events.
Use skip_next_event_if() to skip events conditionally if it matches the given type ("message", "function_call", "function_call_output", or "agent_handoff"), plus optional other arguments of the same format as the is_* assertions.
Use next_event() with a type and other arguments in the same format as the is_* assertions to skip non-matching events implicitly.

Example:

result.expect.skip_next() # skips one event
result.expect.skip_next(2) # skips two events
result.expect.skip_next_event_if(type="message", role="assistant") # Skips the next assistant message

result.expect.next_event(type="message", role="assistant") # Advances to the next assistant message, skipping anything else. If no matching event is found, an assertion error is raised.

Indexed access

Access single events by index, without advancing the cursor, using the [] operator.

result.expect[0].is_message(role="assistant")

Search

Look for the presence of individual events in an order-agnostic way with the contains_* methods such as contains_message(). This can be combined with slices using the [:] operator to search within a range.

result.expect.contains_message(role="assistant")
result.expect[0:2].contains_message(role="assistant")

Assertions

The framework includes a number of assertion helpers to validate the content and types of events within each result.

Message assertions

Use is_message() and contains_message() to test individual messages. These methods accept an optional role argument to match the message role.

result.expect.next_event().is_message(role="assistant")
result.expect[0:2].contains_message(role="assistant")

Access additional properties with the event() method:

event().item.content - Message content
event().item.role - Message role

LLM-based judgment

Use judge() to perform a qualitative evaluation of the message content using your LLM of choice. Specify the intended content, structure, or style of the message as a string, and include an LLM instance to evaluate it. The LLM receives the message string and the intent string, without surrounding context.

Here's an example:

result = await session.run(user_input="Hello")

await (
    result.expect.next_event().is_message(role="assistant")
    .judge(
        llm, intent="Offers a friendly introduction and offer of assistance."
    )
)

The llm argument can be any LLM instance and does not need to be the same one used in the agent itself. Ensure you have setup the plugin correctly with the appropriate API keys and any other needed setup.

Tool call assertions

You can test three aspects of your agent's use of tools in these ways:

Function calls: Verify that the agent calls the correct tool with the correct arguments.
Function call outputs: Verify that the tool returns the expected output.
Agent response: Verify that the agent performs the appropriate next step based on the tool output.

This example tests all three aspects in order:

result = await session.run(user_input="What's the weather in Tokyo?")

# Test that the agent's first conversation item is a function call
fnc_call = result.expect.next_event().is_function_call(name="lookup_weather", arguments={"location": "Tokyo"})

# Test that the tool returned the expected output to the agent
result.expect.next_event().is_function_call_output(output="sunny with a temperature of 70 degrees.")

# Test that the agent's response is appropriate based on the tool output
await (
    result.expect.next_event()
    .is_message(role="assistant")
    .judge(
        llm,
        intent="Informs the user that the weather in Tokyo is sunny with a temperature of 70 degrees.",
    )
)

# Verify the agent's turn is complete, with no additional messages or function calls
result.expect.no_more_events()

Access individual properties with the event() method:

is_function_call().event().item.name - Function name
is_function_call().event().item.arguments - Function arguments
is_function_call_output().event().item.output - Raw function output
is_function_call_output().event().item.is_error - Whether the output is an error
is_function_call_output().event().item.call_id - The function call ID

Agent handoff assertions

Use is_agent_handoff() and contains_agent_handoff() to test that the agent performs a handoff to a new agent.

# The next event must be an agent handoff to the specified agent
result.expect.next_event().is_agent_handoff(new_agent_type=MyAgent)

# A handoff must occur somewhere in the turn
result.expect.contains_agent_handoff(new_agent_type=MyAgent)

Mocking tools

In many cases, you should mock your tools for testing. This is useful to easily test edge cases, such as errors or other unexpected behavior, or when the tool has a dependency on an external service that you don't need to test against.

Use the mock_tools helper in a with block to mock one or more tools for a specific Agent. For instance, to mock a tool to raise an error, use the following code:

from livekit.agents.testing import mock_tools

# Mock a tool error
with mock_tools(
    Assistant,
    {"lookup_weather": lambda: RuntimeError("Weather service is unavailable")},
):
    result = await session.run(user_input="What's the weather in Tokyo?")
    
    await result.expect.next_event(type="message").judge(
        llm, intent="Should inform the user that an error occurred while looking up the weather."
    )

If you need a more complex mock, pass a function instead of a lambda:

def _mock_weather_tool(location: str) -> str:
    if location == "Tokyo":
        return "sunny with a temperature of 70 degrees."
    else:
        return "UNSUPPORTED_LOCATION"

# Mock a specific tool response
with mock_tools(Assistant, {"lookup_weather": _mock_weather_tool}):
    result = await session.run(user_input="What's the weather in Tokyo?")

    await result.expect.next_event(type="message").judge(
        llm,
        intent="Should indicate the weather in Tokyo is sunny with a temperature of 70 degrees.",
    )

    result = await session.run(user_input="What's the weather in Paris?")

    await result.expect.next_event(type="message").judge(
        llm,
        intent="Should indicate that weather lookups in Paris are not supported.",
    )

Testing multiple turns

You can test multiple turns of a conversation by executing the run method multiple times. The conversation history builds automatically across turns.

# First turn
result1 = await session.run(user_input="Hello")
await result1.expect.next_event().is_message(role="assistant").judge(
    llm, intent="Friendly greeting"
)

# Second turn builds on conversation history
result2 = await session.run(user_input="What's the weather like?")
result2.expect.next_event().is_function_call(name="lookup_weather")
result2.expect.next_event().is_function_call_output()
await result2.expect.next_event().is_message(role="assistant").judge(
    llm, intent="Provides weather information"
)

Loading conversation history

To load conversation history manually, use the ChatContext class just as in your agent code:

from livekit.agents import ChatContext

agent = Assistant()
await session.start(agent)

chat_ctx = ChatContext()
chat_ctx.add_message(role="user", content="My name is Alice")
chat_ctx.add_message(role="assistant", content="Nice to meet you, Alice!")
await agent.update_chat_ctx(chat_ctx)

# Test that the agent remembers the context
result = await session.run(user_input="What's my name?")
await result.expect.next_event().is_message(role="assistant").judge(
    llm, intent="Should remember and mention the user's name is Alice"
)

Verbose output

The LIVEKIT_EVALS_VERBOSE environment variable turns on detailed output for each agent execution. To use it with pytest, you must also set the -s flag to disable pytest's automatic capture of stdout:

LIVEKIT_EVALS_VERBOSE=1 pytest -s -o log_cli=true <your-test-file>

Sample verbose output:

evals/test_agent.py::test_offers_assistance 
+ RunResult(
   user_input=`Hello`
   events:
     [0] ChatMessageEvent(item={'role': 'assistant', 'content': ['Hi there! How can I assist you today?']})
)
- Judgment succeeded for `Hi there! How can I assist...`: `The message provides a friendly greeting and explicitly offers assistance, fulfilling the intent.`
PASSED

Integrating with CI

As the testing helpers work live against your LLM provider to test real agent behavior, you need to set up your CI system to include any necessary LLM API keys in order to work. Testing does not require LiveKit API keys as it does not make a LiveKit connection.

For GitHub Actions, see the guide on using secrets in GitHub Actions.

Warning

Never commit API keys to your repository. Use environment variables and CI secrets instead.

Third-party testing tools

To perform end-to-end testing of deployed agents, including the audio pipeline, consider these third-party services:

Bluejay

End-to-end testing for voice agents powered by real-world simulations.

Cekura

Testing and monitoring for voice AI agents.

Coval

Manage your AI conversational agents. Simulation & evaluations for voice and chat agents.

Hamming

At-scale testing & production monitoring for AI voice agents.

Additional resources

These examples and resources provide more help with testing and evaluation.

Drive-thru agent evals

Complete evaluation suite for a complex food ordering agent.

Front-desk agent evals

Complete evaluation suite for a calendar booking agent.

Agent starter project

Starter project with a complete testing integration.

RunResult API reference

API reference for the RunResult class.