Stagehand Evaluator & first agent evals #668
Conversation
🦋 Changeset detected. Latest commit: f53fed6. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
PR Summary
This PR introduces a new Evaluator class for LLM-based task assessment and adds agent-focused evaluation tasks, enabling more scalable and flexible testing of model capabilities.
- Added Evaluator class in evals/evaluator.ts with single/batch evaluation methods using LLM screenshot analysis
- Added 5 new agent evaluation tasks in /evals/tasks/agent/ for testing form filling and flight search capabilities
- Added dynamic model selection in taskConfig.ts with support for multiple providers (Google, Anthropic, OpenAI, Together, Groq, Cerebras)
- Added viewport configuration in initStagehand.ts for consistent browserbase session rendering
- Added optional requestId in LLMClient.ts for more flexible API usage
13 file(s) reviewed, 12 comment(s)
My only comment would be that, in general, for LLM-as-a-judge you should enforce JSON output with a schema rather than freeform text. In my experience this drives more determinism, even if it's just {"passed": true}. In Gemini's case you can also then pass response_mime_type: "application/json", which does token filtering; I believe this is one of the largest drivers of Gemini's data extraction capabilities. Other than that, great idea.
EDIT: Just realized you are essentially doing that already by providing YES | NO as options inside a JSON object. Apologies.
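For reference, a minimal sketch of what that suggestion might look like with the Gemini Node SDK (@google/generative-ai). The model name, field names, and prompt here are illustrative, not the ones Stagehand uses:

import { GoogleGenerativeAI, SchemaType } from "@google/generative-ai";

// Assumes GOOGLE_API_KEY is set in the environment; the model name is illustrative.
const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);
const model = genAI.getGenerativeModel({
  model: "gemini-1.5-pro",
  generationConfig: {
    // Constrain the judge to a JSON object instead of freeform text.
    responseMimeType: "application/json",
    responseSchema: {
      type: SchemaType.OBJECT,
      properties: {
        evaluation: { type: SchemaType.STRING, description: "YES or NO" },
        reasoning: { type: SchemaType.STRING },
      },
      required: ["evaluation"],
    },
  },
});

// In practice the screenshot would be attached as an inline image part;
// a text-only prompt is shown to keep the sketch short.
const result = await model.generateContent(
  "Did the agent reach the flight results page? Answer using the JSON schema.",
);
console.log(JSON.parse(result.response.text()));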
Thanks for the pointer on the mime type @peytoncasper! I was hoping to integrate that next into the gemini client abstraction.
@@ -322,6 +388,7 @@ const generateFilteredTestcases = (): Testcase[] => {
        logger,
        llmClient,
        useTextExtract: shouldUseTextExtract,
+       modelName: input.modelName,
Why do we need to pass in the modelName? Is this because agent won't work with external client providers like aiSDK, etc.?
It's because the modelName is defined inside the agent constructor, whereas LLM models are set in the Stagehand constructor. For example, in google_flights.ts:
export const google_flights: EvalFunction = async ({
  debugUrl,
  sessionUrl,
  stagehand,
  logger,
  modelName, // <-- This is needed here
}) => {
  await stagehand.page.goto("https://google.com/travel/flights");
  const agent = stagehand.agent({
    model: modelName, // <-- so that we can pass it here
    ...
nit: should probably put these prompts & prompt-building functions in prompts.ts; could do this in a fast follow
sg made a note for this
why
Deterministic evals require high maintenance. A more balanced approach, and the one most frontier labs use for evaluating models, is LLMs as judges. Although not as deterministic as our current evals, this new category scales better while randomizing the potential judging error. Introducing:
Stagehand Evaluator
what changed
Created a new class Evaluator which reuses a Stagehand object to leverage the llmClient functionality, sending a screenshot (currently; possibly text-only in the future) and asking a question or a set of questions (batch). Example usage:
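A minimal sketch of what this might look like, assuming methods named evaluate and batchEvaluate on the Evaluator (the exact API and option names may differ):

import { Evaluator } from "../../evaluator"; // path is illustrative

// `stagehand` is an already-initialized Stagehand instance from the eval harness.
const evaluator = new Evaluator(stagehand);

// Single question: takes a screenshot and asks the LLM to judge it.
const single = await evaluator.evaluate({
  question: "Is a list of flight results visible on the page?",
});

// Batch: several questions judged against one screenshot.
const batch = await evaluator.batchEvaluate({
  questions: [
    "Was the form submitted successfully?",
    "Is a confirmation message visible?",
  ],
});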
Evaluator will then return a response in the following format:
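Roughly along these lines, based on the YES | NO discussion above (field names are illustrative):

{
  evaluation: "YES", // or "NO"
  reasoning: "The flight results table is visible and populated.",
}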
Also added a few evals for agent that use Evaluator as the judge; a rough sketch of one is below.
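The sketch combines the agent snippet above with the Evaluator as judge. Import paths, the URL, the Evaluator method name, and the exact return shape are assumptions following the existing eval conventions, not the actual task code:

import { EvalFunction } from "../../../types/evals"; // path is illustrative
import { Evaluator } from "../../evaluator"; // path is illustrative

export const fill_contact_form: EvalFunction = async ({
  debugUrl,
  sessionUrl,
  stagehand,
  logger,
  modelName,
}) => {
  // Drive the task with the agent, using the modelName passed into the eval.
  await stagehand.page.goto("https://example.com/contact"); // illustrative URL
  const agent = stagehand.agent({ model: modelName });
  await agent.execute("Fill out the contact form with test data and submit it");

  // Ask the judge whether the task visibly succeeded.
  const evaluator = new Evaluator(stagehand);
  const { evaluation, reasoning } = await evaluator.evaluate({
    question: "Was the contact form submitted successfully?",
  });

  await stagehand.close();
  return {
    _success: evaluation === "YES",
    reasoning,
    debugUrl,
    sessionUrl,
    logs: logger.getLogs(),
  };
};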
as the judgetest plan