Stagehand Evaluator & first agent evals #668
Conversation
🦋 Changeset detected. Latest commit: f53fed6. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
PR Summary
This PR introduces a new Evaluator class for LLM-based task assessment and adds agent-focused evaluation tasks, enabling more scalable and flexible testing of model capabilities.
- Added Evaluator class in evals/evaluator.ts with single/batch evaluation methods using LLM screenshot analysis
- Added 5 new agent evaluation tasks in /evals/tasks/agent/ for testing form filling and flight search capabilities
- Added dynamic model selection in taskConfig.ts with support for multiple providers (Google, Anthropic, OpenAI, Together, Groq, Cerebras)
- Added viewport configuration in initStagehand.ts for consistent browserbase session rendering
- Added optional requestId in LLMClient.ts for more flexible API usage
13 file(s) reviewed, 12 comment(s)
My only comment would be that, in general, for LLM-as-a-judge you should enforce JSON output with a schema rather than freeform text. In my experience this drives more determinism, even if it's just {"passed": true}. In Gemini's case you can also then pass response_mime_type: "application/json", which does token filtering; I believe this is one of the largest drivers of Gemini's data extraction capabilities. Other than that, great idea.
EDIT: Just realized you are essentially doing that already by providing YES | NO as options inside a JSON object. Apologies.
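For reference, a minimal sketch of what that suggestion might look like with the Gemini Node SDK (@google/generative-ai). The model name, field names, and prompt here are illustrative, not the ones Stagehand uses:

import { GoogleGenerativeAI, SchemaType } from "@google/generative-ai";

// Assumes GOOGLE_API_KEY is set in the environment; the model name is illustrative.
const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);
const model = genAI.getGenerativeModel({
  model: "gemini-1.5-pro",
  generationConfig: {
    // Constrain the judge to a JSON object instead of freeform text.
    responseMimeType: "application/json",
    responseSchema: {
      type: SchemaType.OBJECT,
      properties: {
        evaluation: { type: SchemaType.STRING, description: "YES or NO" },
        reasoning: { type: SchemaType.STRING },
      },
      required: ["evaluation"],
    },
  },
});

// In practice the screenshot would be attached as an inline image part;
// a text-only prompt is shown to keep the sketch short.
const result = await model.generateContent(
  "Did the agent reach the flight results page? Answer using the JSON schema.",
);
console.log(JSON.parse(result.response.text()));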
Thanks for the pointer on the mime type @peytoncasper! I was hoping to integrate that next into the gemini client abstraction.
@@ -322,6 +388,7 @@ const generateFilteredTestcases = (): Testcase[] => {
        logger,
        llmClient,
        useTextExtract: shouldUseTextExtract,
+       modelName: input.modelName,
Why do we need to pass in the modelName? Is this because agent won't work with external client providers like aiSDK, etc.?
It's because the modelName is defined inside the agent constructor, whereas LLM models are set in the Stagehand constructor. For example, in google_flights.ts:
export const google_flights: EvalFunction = async ({
  debugUrl,
  sessionUrl,
  stagehand,
  logger,
  modelName, // <-- This is needed here
}) => {
  await stagehand.page.goto("https://google.com/travel/flights");
  const agent = stagehand.agent({
    model: modelName, // <-- so that we can pass it here
    ...
nit: should probably put these prompts & prompt-building functions in prompts.ts; could do this in a fast follow
sg made a note for this
why
Deterministic evals require high maintenance. A more balanced approach, and the one most frontier labs use for evaluating models, is LLMs as judges. Although not as deterministic as our current evals, this new category scales better while randomizing the potential judging error. Introducing:
Stagehand Evaluator
what changed
Created a new class Evaluator which reuses a Stagehand object to leverage the llmClient functionality, sending a screenshot (currently; possibly text-only in the future) and asking a question or a set of questions (batch). Example usage:
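A minimal sketch of what this might look like, assuming methods named evaluate and batchEvaluate on the Evaluator (the exact API and option names may differ):

import { Evaluator } from "../../evaluator"; // path is illustrative

// `stagehand` is an already-initialized Stagehand instance from the eval harness.
const evaluator = new Evaluator(stagehand);

// Single question: takes a screenshot and asks the LLM to judge it.
const single = await evaluator.evaluate({
  question: "Is a list of flight results visible on the page?",
});

// Batch: several questions judged against one screenshot.
const batch = await evaluator.batchEvaluate({
  questions: [
    "Was the form submitted successfully?",
    "Is a confirmation message visible?",
  ],
});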
Evaluator will then return a response in the following format:
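Roughly along these lines, based on the YES | NO discussion above (field names are illustrative):

{
  evaluation: "YES", // or "NO"
  reasoning: "The flight results table is visible and populated.",
}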
Also added a few evals for agent that use Evaluator as the judge; a rough sketch of one is below.
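The sketch combines the agent snippet above with the Evaluator as judge. Import paths, the URL, the Evaluator method name, and the exact return shape are assumptions following the existing eval conventions, not the actual task code:

import { EvalFunction } from "../../../types/evals"; // path is illustrative
import { Evaluator } from "../../evaluator"; // path is illustrative

export const fill_contact_form: EvalFunction = async ({
  debugUrl,
  sessionUrl,
  stagehand,
  logger,
  modelName,
}) => {
  // Drive the task with the agent, using the modelName passed into the eval.
  await stagehand.page.goto("https://example.com/contact"); // illustrative URL
  const agent = stagehand.agent({ model: modelName });
  await agent.execute("Fill out the contact form with test data and submit it");

  // Ask the judge whether the task visibly succeeded.
  const evaluator = new Evaluator(stagehand);
  const { evaluation, reasoning } = await evaluator.evaluate({
    question: "Was the contact form submitted successfully?",
  });

  await stagehand.close();
  return {
    _success: evaluation === "YES",
    reasoning,
    debugUrl,
    sessionUrl,
    logs: logger.getLogs(),
  };
};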
as the judgetest plan