-
Notifications
You must be signed in to change notification settings - Fork 1.1k
introduce test & eval primitives #2662
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really great stuff!
# add big mac | ||
await sess.start(DriveThruAgent(userdata=userdata)) | ||
result = await sess.run(user_input="Can I get a Big Mac, no meal?") | ||
result.expect.nth(0).is_function_call( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nth
feels a bit strange.. [0]
or get(0)
seems more familiar.
also. does it make sense to have a bit more tolerance? i.e. sometimes the model generates a response before a function call. it feels like we should make it easy to test that case.. where I just care if a function was called, not necessarily that no other output has been generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! I agree that we may need more tolerance for the order or event counts, for the case LLM generate a text response alongside the tool calls, or parallel tool calls which the order is not determined.
Can we have something like result.expect.has_function_calls(names=["fnc1", "fnc2"], arguments=[{}, {}])
.
Or result.expect[0:2].has_function_calls(...)
which result.expect[0:2]
return a copy of RunAssert
with a subset of events, just thinking loudly...
result.expect.nth(3).is_function_call_output() | ||
result.expect.nth(4).is_message(role="assistant") | ||
except AssertionError: | ||
result.expect.nth(0).is_function_call(name="remove_order_item") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the reason why I suggest that we shouldn't be too explicit about the ordering of calls and/or optional function calls like list_order_items
user_input="Can I get a large Combo McCrispy Original with mayonnaise?" | ||
) | ||
msg_assert = result.expect.message(role="assistant") | ||
await msg_assert.judge(llm, intent="should prompt the user to choose a drink") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoa judge 😍
the message(role=..
is basically doing what I said above then? i.e. making sure there is a assistant response there?
No description provided.