Add self-test recipe for goose validation #5111

tlongwell-block · 2025-10-10T16:44:58Z

This PR introduces goose-self-test.yaml, a meta-testing recipe that enables goose to validate its own capabilities through first-person integration testing.

What is First-Person Integration Testing?

Traditional testing approaches rely on external test harnesses, unit tests, or integration suites that examine a system from the outside. This recipe takes a different approach: it has a running goose instance test itself using its own tools and capabilities.

This is meta-testing - the system under test is also the tester, examining its own behavior from within an active session. For an AI agent like goose, this approach offers unique insights into behavioral consistency and tool reliability that external testing cannot provide.

Primary Use Case: Goose Testing Goose

The most powerful application of this recipe is when goose itself is developing new goose features. A goose instance working on the codebase can:

Make changes to the goose source code
Build a new goose binary with cargo build --release
Shell out to run the self-test recipe using the newly built binary:
```
./target/release/goose run --recipe goose-self-test.yaml
```
Analyze the results to verify the new feature works correctly

This creates a recursive development loop where goose can autonomously develop, test, and validate improvements to itself. The goose doing the development can examine test outputs, debug failures, and iterate on fixes - all while using the self-test recipe to validate each iteration.
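The loop above can be sketched as a small Python driver. This is a minimal sketch, not part of the PR: `build_and_self_test` and its `run` parameter are hypothetical names, the binary path and recipe name are taken from the command above, and treating exit code 0 as a passing run is an assumption (the generated report remains the authoritative result).

```python
import subprocess

GOOSE_BIN = "./target/release/goose"  # path used in the PR description
RECIPE = "goose-self-test.yaml"       # recipe added by this PR


def build_and_self_test(repo_dir=".", run=subprocess.run):
    """Rebuild goose, then run the self-test recipe with the fresh binary.

    Returns the exit code of the goose run. Reading exit code 0 as
    "all tests passed" is an assumption, not documented behavior.
    `run` is injectable so the function can be exercised without a toolchain.
    """
    # check=True raises CalledProcessError if the build itself fails,
    # so a broken build never reaches the self-test step.
    run(["cargo", "build", "--release"], cwd=repo_dir, check=True)
    result = run([GOOSE_BIN, "run", "--recipe", RECIPE], cwd=repo_dir)
    return result.returncode
```

A developing goose instance (or a human) could call `build_and_self_test()` after each change and read the generated report before deciding whether to iterate again.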

How It Works

The self-test recipe guides goose through a structured validation process:

Environment Setup - Creates a workspace and establishes testing infrastructure
Tool Validation - Tests file operations, shell commands, and code analysis
Extension Testing - Validates dynamic extension management
Subagent Orchestration - Tests recursive self-creation and parallel execution
Advanced Scenarios - Explores error boundaries and security controls
Report Generation - Produces comprehensive documentation of results

The recipe uses goose's own capabilities to create test scenarios, execute them, and validate the outcomes. Each test phase builds on the previous, creating a comprehensive assessment of functionality.

Design Principles

What Can Be Tested

From within a running session, goose can test:

Tool execution and error handling
File and shell operations
Code analysis accuracy
Extension management
Subagent creation and coordination
Observable behaviors and consistency

What Cannot Be Tested

Certain aspects require external observation:

Provider switching mid-session
Session persistence across restarts
Internal token counting mechanisms
Network transport layers
Security boundaries from an outside perspective

The recipe focuses on what's testable from within, providing meaningful validation of user-facing functionality.

Key Features

Flexible Execution

The recipe supports parameterized testing:

test_phases: Select specific test categories or run all
test_depth: Choose between quick, standard, or exhaustive testing
parallel_tests: Enable/disable parallel test execution
workspace_dir: Specify test artifact location
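As a rough illustration of how these four parameters map onto the command line, the sketch below turns them into repeated `--params key=value` arguments, the form used in the validation command in this PR. The helper name `recipe_params` is hypothetical, the `true`/`false` spelling for `parallel_tests` is an assumption, and parameters left unset are simply omitted so that the recipe's own defaults apply.

```python
VALID_DEPTHS = ("quick", "standard", "exhaustive")  # values listed in the PR description


def recipe_params(test_phases=None, test_depth=None,
                  parallel_tests=None, workspace_dir=None):
    """Build the `--params key=value` arguments for a self-test run.

    Only parameters that were explicitly given are emitted.
    """
    if test_depth is not None and test_depth not in VALID_DEPTHS:
        raise ValueError(f"test_depth must be one of {VALID_DEPTHS}, got {test_depth!r}")

    chosen = {
        "test_phases": test_phases,
        "test_depth": test_depth,
        # Assumption: the recipe accepts "true"/"false" for this flag.
        "parallel_tests": None if parallel_tests is None else str(bool(parallel_tests)).lower(),
        "workspace_dir": workspace_dir,
    }

    args = []
    for key, value in chosen.items():
        if value is not None:
            args += ["--params", f"{key}={value}"]
    return args
```

Appending the result of `recipe_params(test_phases="basic", test_depth="quick")` to `goose run --recipe goose-self-test.yaml` yields the argument list of the validation command shown later in this description.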

Self-Documenting

The test generates comprehensive reports:

Detailed logs for each phase
Summary statistics
Pass/fail status for each capability
Terminal-displayed executive summary
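The summary statistics could be derived along the following lines. This is only a sketch: in the recipe, goose writes the report itself, and the `CapabilityResult` record and `summarize` function here are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class CapabilityResult:
    phase: str        # e.g. a test phase name (hypothetical values)
    capability: str   # what was exercised
    passed: bool


def summarize(results):
    """Aggregate per-capability results into overall and per-phase counts."""
    total = len(results)
    passed = sum(1 for r in results if r.passed)

    by_phase = {}  # phase -> (passed, total)
    for r in results:
        ok, seen = by_phase.get(r.phase, (0, 0))
        by_phase[r.phase] = (ok + int(r.passed), seen + 1)

    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        # Guard against division by zero when no tests ran.
        "pass_rate": round(100.0 * passed / total, 1) if total else 0.0,
        "by_phase": by_phase,
    }
```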

Clean Artifacts

Test artifacts are organized in a single gooseselftest directory, which is automatically added to .gitignore to keep the repository clean.
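One way to make the `.gitignore` update safe to repeat across runs is to append the entry only when it is missing. The sketch below assumes the directory name exactly as written above; `ensure_ignored` is a hypothetical helper, not code from the recipe.

```python
from pathlib import Path


def ensure_ignored(repo_dir, entry="gooseselftest/"):
    """Append `entry` to <repo_dir>/.gitignore unless it is already listed.

    Returns True if the file was modified, False if the entry was present.
    """
    gitignore = Path(repo_dir) / ".gitignore"
    existing = gitignore.read_text() if gitignore.exists() else ""

    # Treat "name" and "name/" as the same entry.
    wanted = entry.rstrip("/")
    if any(line.strip().rstrip("/") == wanted for line in existing.splitlines()):
        return False

    # Keep the new entry on its own line even if the file lacks a trailing newline.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    gitignore.write_text(existing + separator + entry + "\n")
    return True
```

Because the function checks before writing, running the self-test many times leaves a single entry in the file.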

Why This Matters

Continuous Validation

Provides a standardized method to verify goose functionality across different:

Environments
Configurations
Model providers
Operating systems

Behavioral Testing

Unlike unit tests that verify code correctness, this tests actual agent behavior - crucial for AI systems where behavior can vary with context and model.

Meta-Cognitive Assessment

Successfully completing the self-test demonstrates goose's ability to:

Understand complex instructions
Coordinate multiple tools
Maintain context across operations
Reason about its own capabilities

Quality Assurance

Enables rapid validation after:

Code changes
Dependency updates
Configuration modifications
New feature additions

Initial Validation

The recipe has been successfully tested with:

./target/release/goose run --recipe goose-self-test.yaml --params test_phases=basic --params test_depth=quick

Results from initial testing:

21 tests executed
100% pass rate
~3 minute execution time
All core developer tools validated

zanesq

Nice! Assuming this is cli only right?

tlongwell-block · 2025-10-10T18:48:10Z

> Nice! Assuming this is cli only right?

Yes, this one is. But @DOsinga and I were talking about using Playwright to test the desktop app. Will try to explore that in a subsequent PR.

tlongwell-block · 2025-10-12T15:03:37Z

cc @angiejones you might think this new feature is fun

DOsinga · 2025-10-12T16:22:58Z

can you add it to the checklist @tlongwell-block that we run on a new release?

Add self-test recipe for goose validatio

a223531

tlongwell-block requested a review from michaelneale October 10, 2025 16:50

todo

8f51604

zanesq approved these changes Oct 10, 2025

Update goose-self-test.yaml

d4548a3

tlongwell-block merged commit ff3d4e9 into main Oct 12, 2025
11 checks passed

tlongwell-block deleted the self_testing branch October 12, 2025 15:03

alexhancock mentioned this pull request Oct 17, 2025

Release/1.11.0 #5224

Merged
