
introduce test & eval primitives #2662


Merged

theomonnom merged 2 commits into theo/agents1.2 from theo/evals on Jun 20, 2025

Conversation

theomonnom (Member)

No description provided.

theomonnom merged commit 92109ee into theo/agents1.2 on Jun 20, 2025
1 check passed
theomonnom deleted the theo/evals branch on June 20, 2025 at 13:28
theomonnom mentioned this pull request on Jun 20, 2025
davidzhao (Member) left a comment:

really great stuff!

# add big mac
await sess.start(DriveThruAgent(userdata=userdata))
result = await sess.run(user_input="Can I get a Big Mac, no meal?")
result.expect.nth(0).is_function_call(
Member:
nth feels a bit strange; [0] or get(0) seems more familiar.

Also, does it make sense to have a bit more tolerance? i.e. sometimes the model generates a response before a function call. It feels like we should make it easy to test that case, where I just care that a function was called, not necessarily that no other output has been generated.
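For the naming question, a throwaway sketch of the spellings side by side; EventList and its methods are hypothetical stand-ins, just to make the comparison concrete:

```python
# Hypothetical stand-in, not the actual agents API: same lookup exposed three ways.
class EventList:
    def __init__(self, events):
        self._events = events

    def nth(self, i):  # spelling used in this PR
        return self._events[i]

    def get(self, i):  # suggested alternative
        return self._events[i]

    def __getitem__(self, i):  # or plain indexing: events[0]
        return self._events[i]


events = EventList(["function_call", "message"])
assert events.nth(0) == events.get(0) == events[0]
```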

Contributor:

This looks great! I agree that we may need more tolerance around ordering and event counts, for cases where the LLM generates a text response alongside the tool calls, or makes parallel tool calls whose order is not deterministic.

Can we have something like result.expect.has_function_calls(names=["fnc1", "fnc2"], arguments=[{}, {}])?

Or result.expect[0:2].has_function_calls(...), where result.expect[0:2] returns a copy of RunAssert with a subset of events. Just thinking out loud...
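To make that concrete, a small self-contained sketch of what an order-insensitive has_function_calls plus a slice-as-subset could look like; all of the class and method names below are stand-ins, not the actual API in this PR:

```python
# Stand-in types, not the real agents API: RunAssert here only demonstrates
# the order-tolerant semantics being suggested.
from dataclasses import dataclass, field


@dataclass
class FunctionCall:
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass
class Message:
    role: str
    content: str = ""


class RunAssert:
    def __init__(self, events: list):
        self._events = list(events)

    def __getitem__(self, key) -> "RunAssert":
        # slicing returns a copy holding a subset of events, as suggested above
        subset = self._events[key] if isinstance(key, slice) else [self._events[key]]
        return RunAssert(subset)

    def has_function_calls(self, names: list[str]) -> None:
        # order-insensitive: only require that every named call appears somewhere
        called = {e.name for e in self._events if isinstance(e, FunctionCall)}
        missing = [n for n in names if n not in called]
        assert not missing, f"missing function calls: {missing}"


expect = RunAssert([
    Message(role="assistant", content="Sure, adding a Big Mac."),
    FunctionCall(name="order_item", arguments={"item": "Big Mac"}),
])
expect.has_function_calls(names=["order_item"])       # passes despite the leading message
expect[0:2].has_function_calls(names=["order_item"])  # slice-then-assert style
```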

result.expect.nth(3).is_function_call_output()
result.expect.nth(4).is_message(role="assistant")
except AssertionError:
result.expect.nth(0).is_function_call(name="remove_order_item")
Member:

This is the reason why I suggest that we shouldn't be too explicit about the ordering of calls and/or optional function calls like list_order_items.

user_input="Can I get a large Combo McCrispy Original with mayonnaise?"
)
msg_assert = result.expect.message(role="assistant")
await msg_assert.judge(llm, intent="should prompt the user to choose a drink")
Member:

whoa judge 😍

The message(role=... is basically doing what I said above then? i.e. making sure there is an assistant response there?
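For context on the judge call in the snippet above, one way such an assertion can work is to hand the assistant message and the expected intent to an LLM and fail the test on its verdict. A minimal sketch of that idea; judge_message, llm_complete, and the PASS/FAIL convention are assumptions here, not the implementation in this PR:

```python
# LLM-as-judge sketch; everything here is an assumption, not the PR's API.
async def judge_message(llm_complete, message: str, intent: str) -> None:
    """llm_complete: any async callable that takes a prompt string and returns text."""
    prompt = (
        "You are grading an assistant reply.\n"
        f"Reply: {message!r}\n"
        f"Expected intent: {intent!r}\n"
        "Answer PASS or FAIL, followed by a one-line reason."
    )
    verdict = await llm_complete(prompt)
    assert verdict.strip().upper().startswith("PASS"), f"judge failed: {verdict}"
```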
