@laanak08 laanak08 commented Mar 16, 2025

Changes

For Bench Users

  • The CLI flag -s now refers to --selectors, where a selector is
    • a colon-delimited string of suite, sub-suite(s), and eval filename, e.g.
    • bench -s core:developer:web_scrape -s "core:memory, vibes"
  • Top-level suite-result reports are produced for the lowest level of suite, e.g.
    • for the selector core:developer,
    • the results report will be for core:developer, and not just core,
    • because here, developer is the lowest level of suite grouping.
  • If multiple selectors are supplied and one selector is a child of another, the more general selector is chosen, e.g.
    • -s core -s core:developer
    • here, everything in core will be run.
  • The --list flag now returns a list of every valid selector that can be passed to -s, along with the number of evals each one will run.
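
The "more general selector wins" rule above can be sketched as follows. This is a minimal illustration rather than the code in this PR: the function names split_selectors, covers, and collapse_selectors are hypothetical, splitting a quoted value on commas is inferred from the "core:memory, vibes" example, and comparing component by component (rather than by raw string prefix) is an assumption.

```rust
/// Split raw `-s` values on commas and trim whitespace, so that
/// `"core:memory, vibes"` yields `core:memory` and `vibes`.
/// (Comma splitting is an assumption based on the example above.)
fn split_selectors(raw: &[&str]) -> Vec<String> {
    raw.iter()
        .flat_map(|value| value.split(','))
        .map(|part| part.trim().to_string())
        .filter(|part| !part.is_empty())
        .collect()
}

/// True when `general` equals `specific` or is one of its ancestors.
/// Components are compared one by one, so `core` covers `core:developer`
/// but does not cover `core_extra`.
fn covers(general: &str, specific: &str) -> bool {
    let mut g = general.split(':');
    let mut s = specific.split(':');
    loop {
        match (g.next(), s.next()) {
            (None, _) => return true,        // `general` is exhausted: it is a prefix
            (Some(_), None) => return false, // `general` is longer than `specific`
            (Some(a), Some(b)) if a != b => return false,
            _ => {}                          // components match, keep going
        }
    }
}

/// Keep only the most general selectors, preserving first-seen order.
fn collapse_selectors(selectors: &[String]) -> Vec<String> {
    let mut kept: Vec<String> = Vec::new();
    for candidate in selectors {
        if kept.iter().any(|k| covers(k, candidate)) {
            continue; // an equal or more general selector is already kept
        }
        // Drop any previously kept selector that the candidate subsumes.
        kept.retain(|k| !covers(candidate, k));
        kept.push(candidate.clone());
    }
    kept
}
```

With this sketch, -s core -s core:developer collapses to the single selector core, regardless of the order in which the two flags are given.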

For Eval Authors

Adding an Eval

  • If a suitable suite doesn't yet exist under eval_suites/, create a Rust module (and any desired sub-modules) for it and place the eval file there.
  • Within the eval file, be sure to end the implementation with a call to register_evaluation!(MyNewEval);
  • It will now be selectable as
  • your_new_suite_name:eval_filename, or, if nested deeper,
  • your_new_suite_name:your_also_new_subsuite_name:eval_filename

Design

Ingestion/Pre-Processing

  • Each eval registers itself with the register_evaluation macro defined in factory.rs.
  • This macro updates the registry, which is a map from the path of the eval to the eval's constructor.
  • In the registry, the path to the eval is converted into a "selector" by replacing every path separator with a colon.
  • Once ingestion is complete, the registry holds every eval path (with colon-separated components) and its respective constructor.
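
The ingestion steps above might look roughly like the sketch below. It is not the contents of factory.rs: the Evaluation trait, Registry type, selector_from_path helper, WebScrape struct, and the example path are all hypothetical stand-ins, and a plain register method takes the place of the register_evaluation macro.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Hypothetical stand-in for the real evaluation trait.
trait Evaluation {
    fn name(&self) -> &str;
}

/// A constructor is just a function that builds a boxed eval.
type Constructor = fn() -> Box<dyn Evaluation>;

/// Turn a source path such as `eval_suites/core/developer/web_scrape.rs`
/// into the selector `core:developer:web_scrape`: drop the suite root,
/// drop the file extension, and join the remaining components with colons.
fn selector_from_path(path: &str, root: &str) -> String {
    let full = Path::new(path);
    let relative = full.strip_prefix(root).unwrap_or(full);
    let without_ext = relative.with_extension("");
    without_ext
        .components()
        .map(|c| c.as_os_str().to_string_lossy().into_owned())
        .collect::<Vec<_>>()
        .join(":")
}

/// Map from selector to constructor, as described in the bullets above.
struct Registry {
    evals: BTreeMap<String, Constructor>,
}

impl Registry {
    fn new() -> Self {
        Registry { evals: BTreeMap::new() }
    }

    /// What the registration macro would do for one eval file (assumed).
    fn register(&mut self, path: &str, constructor: Constructor) {
        let selector = selector_from_path(path, "eval_suites");
        self.evals.insert(selector, constructor);
    }

    /// Build the eval registered under an exact selector, if any.
    fn construct(&self, selector: &str) -> Option<Box<dyn Evaluation>> {
        self.evals.get(selector).map(|ctor| ctor())
    }
}

/// Hypothetical eval used only to exercise the registry.
struct WebScrape;

impl Evaluation for WebScrape {
    fn name(&self) -> &str {
        "web_scrape"
    }
}

fn make_web_scrape() -> Box<dyn Evaluation> {
    Box::new(WebScrape)
}
```

Keying the map by selector rather than by raw path means the lookup side never has to know how files are laid out on disk.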

Lookup

  • The CLI expects to be supplied with one or more selectors of varying granularity.
  • Since the only knowledge of where evals live, and of which suite they belong to, is in the registry keys (the paths to evals), these keys are matched against the user-supplied selectors by prefix-matching the user string against the registry key. It's from here that the idea of a suite emerges; it's not actually tracked in any other way.
  • Naturally, this also applies to the implementation of --list and any related functionality: the suite hierarchies and their constituent evals are extracted from the registry keys.
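
A rough sketch of the lookup and of a --list style summary, both derived purely from registry keys, follows. The function names matching_keys and list_selectors are hypothetical. The lookup uses plain string prefix matching as described above; note that a plain prefix such as core:dev would also match core:developer keys, and whether the real implementation guards against that is not stated here.

```rust
use std::collections::BTreeMap;

/// Return every registry key selected by `selector`, using plain
/// string prefix matching of the user string against each key.
fn matching_keys<'a>(keys: &[&'a str], selector: &str) -> Vec<&'a str> {
    keys.iter()
        .copied()
        .filter(|key| key.starts_with(selector))
        .collect()
}

/// Derive every valid selector from the keys alone, together with the
/// number of evals each selector would run. Each key contributes one
/// count to itself and to each of its colon-bounded ancestors.
fn list_selectors(keys: &[&str]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for key in keys {
        let parts: Vec<&str> = key.split(':').collect();
        for depth in 1..=parts.len() {
            *counts.entry(parts[..depth].join(":")).or_insert(0) += 1;
        }
    }
    counts
}
```

For example, with the keys core:developer:web_scrape, core:developer:create_file, and core:memory:save_fact (the last two names are made up), list_selectors reports core with 3 evals, core:developer with 2, and core:memory with 1, in addition to one entry per individual eval.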

* main: (31 commits)
  feat: add default metrics for core evals (#1602)
  feat(google_drive): use oauth2 crate for PKCE support, make token storage generic over Serializable (#1645)
  ui: reorganize extensions settings (#1702)
  feat: google_drive write tools and read comment tool (#1650)
  fix: developer builtin name (#1699)
  chore: update extensions section to work with new endpoints (#1696)
  chore: move things around (#1662)
  ui: extensions state updates (#1674)
  docs: goose ollama blog, updated (#1691)
  ui: load builtins (#1679)
  chore(release): release version 1.0.14 (#1676)
  Revert "feat: handling larger more complex PDF docs (and fix) (#1663)" (#1675)
  fix: uvshim default to existing uv configuration (#1670)
  fix: handle interruptions during tool responses (#1651)
  feat: Copy error message button in toast (#1658)
  feat: handling larger more complex PDF docs (and fix) (#1663)
  Add Filesystem Tutorial (#1666)
  docs: figma blog post (#1647)
  docs: updating goose modes doc (#1665)
  docs: Add running tasks guide (#1626)
  ...
@laanak08 laanak08 requested review from ahau-square and zakiali March 16, 2025 16:57
@laanak08 laanak08 changed the title feat: refactor register eval proposal feat: refactor register eval Mar 16, 2025

@zakiali zakiali left a comment

Nice! This is working well for me.

@laanak08 laanak08 merged commit 4c03b34 into main Mar 18, 2025
6 checks passed
@laanak08 laanak08 deleted the marcelle/refactor-register-eval-proposal branch March 18, 2025 19:18
michaelneale added a commit that referenced this pull request Mar 18, 2025
* main:
  chore(release): release version 1.0.15 (#1749)
  docs: goosing around: langfuse blog (#1746)
  feat: update the deny call response (#1741)
  feat: refactor register eval (#1713)
  fix: Goose UI fix typos (#1744)
  feat(google_drive): comment read (#1732)
  feat: build cli workflow  (#1697)
  fix: fix initial model configuration in cli when using toolshim (#1720)
  feat: add basic support for aws bedrock to desktop app (#1271)
  feat(google_drive): add image resizing logic from developer, and use Content::Image (#1735)
  Standardize Radio Button input (#1701)
  ui: tweaks to settings v2 (#1731)
  feat(google_drive): set read/write scope on all commands to use the same token (#1707)
  refactor: clean up log usage (#1704)
  docs: fix docusaurus sidebar limit (#1722)
  docs: Add Session List To CLI Commands Guide (#1729)
  ui: start extensions on add (#1714)
  ui: new extensions modal (#1711)
  docs: Add Filesystem Short Video to Tutorial (#1723)
  fix: update the mcp client protocol version to 2024-11-05 (#1690)
salman1993 added a commit that referenced this pull request Mar 20, 2025
* origin/main: (74 commits)
  config: add optional extension description (#1743)
  docs: add deployment for install link generator (#1737)
  ui: new configure provider flow (#1736)
  Revert "Standardize Radio Button input" (#1758)
  Settings v2 Add Model (#1708)
  fix: use lowercase names for builtin external extensions (#1756)
  chore(release): release version 1.0.15 (#1749)
  docs: goosing around: langfuse blog (#1746)
  feat: update the deny call response (#1741)
  feat: refactor register eval (#1713)
  fix: Goose UI fix typos (#1744)
  feat(google_drive): comment read (#1732)
  feat: build cli workflow  (#1697)
  fix: fix initial model configuration in cli when using toolshim (#1720)
  feat: add basic support for aws bedrock to desktop app (#1271)
  feat(google_drive): add image resizing logic from developer, and use Content::Image (#1735)
  Standardize Radio Button input (#1701)
  ui: tweaks to settings v2 (#1731)
  feat(google_drive): set read/write scope on all commands to use the same token (#1707)
  refactor: clean up log usage (#1704)
  ...
ahau-square pushed a commit that referenced this pull request May 2, 2025
cbruyndoncx pushed a commit to cbruyndoncx/goose that referenced this pull request Jul 20, 2025