
introduce test & eval primitives #2662


Merged

theomonnom merged 2 commits into theo/agents1.2 from theo/evals on Jun 20, 2025

Conversation

theomonnom (Member)

No description provided.

theomonnom merged commit 92109ee into theo/agents1.2 on Jun 20, 2025
1 check passed
theomonnom deleted the theo/evals branch on June 20, 2025 at 13:28
theomonnom mentioned this pull request on Jun 20, 2025
davidzhao (Member) left a comment:

really great stuff!

# add big mac
await sess.start(DriveThruAgent(userdata=userdata))
result = await sess.run(user_input="Can I get a Big Mac, no meal?")
result.expect.nth(0).is_function_call(
Member:
nth feels a bit strange; [0] or get(0) seems more familiar.

Also, does it make sense to have a bit more tolerance? i.e. sometimes the model generates a response before a function call. It feels like we should make it easy to test that case, where I just care that a function was called, not necessarily that no other output has been generated.
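For the naming question, a throwaway sketch of the spellings side by side; EventList and its methods are hypothetical stand-ins, just to make the comparison concrete:

```python
# Hypothetical stand-in, not the actual agents API: same lookup exposed three ways.
class EventList:
    def __init__(self, events):
        self._events = events

    def nth(self, i):  # spelling used in this PR
        return self._events[i]

    def get(self, i):  # suggested alternative
        return self._events[i]

    def __getitem__(self, i):  # or plain indexing: events[0]
        return self._events[i]


events = EventList(["function_call", "message"])
assert events.nth(0) == events.get(0) == events[0]
```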

Contributor:

This looks great! I agree that we may need more tolerance around ordering and event counts, for cases where the LLM generates a text response alongside the tool calls, or makes parallel tool calls whose order is not deterministic.

Can we have something like result.expect.has_function_calls(names=["fnc1", "fnc2"], arguments=[{}, {}])?

Or result.expect[0:2].has_function_calls(...), where result.expect[0:2] returns a copy of RunAssert with a subset of events. Just thinking out loud...
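To make that concrete, a small self-contained sketch of what an order-insensitive has_function_calls plus a slice-as-subset could look like; all of the class and method names below are stand-ins, not the actual API in this PR:

```python
# Stand-in types, not the real agents API: RunAssert here only demonstrates
# the order-tolerant semantics being suggested.
from dataclasses import dataclass, field


@dataclass
class FunctionCall:
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass
class Message:
    role: str
    content: str = ""


class RunAssert:
    def __init__(self, events: list):
        self._events = list(events)

    def __getitem__(self, key) -> "RunAssert":
        # slicing returns a copy holding a subset of events, as suggested above
        subset = self._events[key] if isinstance(key, slice) else [self._events[key]]
        return RunAssert(subset)

    def has_function_calls(self, names: list[str]) -> None:
        # order-insensitive: only require that every named call appears somewhere
        called = {e.name for e in self._events if isinstance(e, FunctionCall)}
        missing = [n for n in names if n not in called]
        assert not missing, f"missing function calls: {missing}"


expect = RunAssert([
    Message(role="assistant", content="Sure, adding a Big Mac."),
    FunctionCall(name="order_item", arguments={"item": "Big Mac"}),
])
expect.has_function_calls(names=["order_item"])       # passes despite the leading message
expect[0:2].has_function_calls(names=["order_item"])  # slice-then-assert style
```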

result.expect.nth(3).is_function_call_output()
result.expect.nth(4).is_message(role="assistant")
except AssertionError:
result.expect.nth(0).is_function_call(name="remove_order_item")
Member:

This is the reason why I suggest that we shouldn't be too explicit about the ordering of calls and/or optional function calls like list_order_items.

user_input="Can I get a large Combo McCrispy Original with mayonnaise?"
)
msg_assert = result.expect.message(role="assistant")
await msg_assert.judge(llm, intent="should prompt the user to choose a drink")
Member:

whoa judge 😍

The message(role=... is basically doing what I said above then? i.e. making sure there is an assistant response there?
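For context on the judge call in the snippet above, one way such an assertion can work is to hand the assistant message and the expected intent to an LLM and fail the test on its verdict. A minimal sketch of that idea; judge_message, llm_complete, and the PASS/FAIL convention are assumptions here, not the implementation in this PR:

```python
# LLM-as-judge sketch; everything here is an assumption, not the PR's API.
async def judge_message(llm_complete, message: str, intent: str) -> None:
    """llm_complete: any async callable that takes a prompt string and returns text."""
    prompt = (
        "You are grading an assistant reply.\n"
        f"Reply: {message!r}\n"
        f"Expected intent: {intent!r}\n"
        "Answer PASS or FAIL, followed by a one-line reason."
    )
    verdict = await llm_complete(prompt)
    assert verdict.strip().upper().startswith("PASS"), f"judge failed: {verdict}"
```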
