Skip to content

Conversation

@ahau-square
Copy link
Contributor

@ahau-square ahau-square commented Mar 7, 2025

This PR:

  • Adds a suite of goosebench evals "vibes" including
    • summarize a blog
    • implement flappy bird
    • make a wikipedia-style web page
    • do some web research for restaurants
    • data analysis on a csv file
  • Implements some common metrics including
    • time to run
    • token usage
    • tool counts
  • Utility functions to
    • Copy the eval's session.jsonl file into the output directory
    • Copy the agent's last message into a txt file in the output directory

Includes some changes dependent on #1448 (updating the run-benchmarks.sh script to allow passing args for the toolshim).

@ahau-square ahau-square requested review from laanak08 and zakiali March 7, 2025 17:17
@alicehau alicehau force-pushed the ahau/evals branch 2 times, most recently from e74abdb to 22eb00e Compare March 10, 2025 00:24
@alicehau alicehau force-pushed the ahau/evals branch 3 times, most recently from ffd9e5f to 68f45fc Compare March 10, 2025 16:53
@github-actions
Copy link
Contributor

PR Preview Action v1.6.0

🚀 View preview at
https://block.github.io/goose/pr-preview/pr-1571/

Built to branch gh-pages at 2025-03-10 16:55 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@alicehau alicehau force-pushed the ahau/evals branch 2 times, most recently from 68f45fc to 9761806 Compare March 10, 2025 16:58
Copy link
Collaborator

@zakiali zakiali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is 🔥 !

@ahau-square ahau-square merged commit bb4feac into main Mar 10, 2025
6 checks passed
@ahau-square ahau-square deleted the ahau/evals branch March 10, 2025 19:11
michaelneale added a commit that referenced this pull request Mar 11, 2025
* main:
  feat: enable smart approve for user by default (#1599)
  ui: fix modal state (#1598)
  ui: setting configuration (#1597)
  fix: merge error logging in goose bench  (#1545)
  feat: add additional goosebench evals (#1571)
  chore: update types and imports (#1594)
  Retain session through view changes (#1580)
  docs: Add steps for desktop tutorial (#1590)
  remove env vars from bottom menu model setting (#1584)
  Fix Goosehints modal UI (#1581)
  docs: typo fix (#1593)
  feat: update config endpoints for use with providers (#1563)
  fix: update anthropic provider headers (#1592)
  feat: Build Goose in a Docker Container (#1551)
  docs: voyp blog post (#1588)
sheagcraig added a commit to sheagcraig/goose that referenced this pull request Mar 11, 2025
* upstream/main: (48 commits)
  feat: enable smart approve for user by default (block#1599)
  ui: fix modal state (block#1598)
  ui: setting configuration (block#1597)
  fix: merge error logging in goose bench  (block#1545)
  feat: add additional goosebench evals (block#1571)
  chore: update types and imports (block#1594)
  Retain session through view changes (block#1580)
  docs: Add steps for desktop tutorial (block#1590)
  remove env vars from bottom menu model setting (block#1584)
  Fix Goosehints modal UI (block#1581)
  docs: typo fix (block#1593)
  feat: update config endpoints for use with providers (block#1563)
  fix: update anthropic provider headers (block#1592)
  feat: Build Goose in a Docker Container (block#1551)
  docs: voyp blog post (block#1588)
  fix: included files was panicing because dir didnt exist (block#1583)
  feat: work with docs/xls and simple html (block#1526)
  feat: parallel processing in approve mode (block#1575)
  Feat: support auto-including dirs in binary/bench-work-dir (block#1576)
  refactor models component (block#1535)
  ...
ahau-square added a commit that referenced this pull request May 2, 2025
Co-authored-by: Alice Hau <alice.a.hau@gmail.com>
cbruyndoncx pushed a commit to cbruyndoncx/goose that referenced this pull request Jul 20, 2025
Co-authored-by: Alice Hau <alice.a.hau@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants