@tlongwell-block

This PR introduces goose-self-test.yaml, a meta-testing recipe that enables goose to validate its own capabilities through first-person integration testing.

What is First-Person Integration Testing?

Traditional testing approaches rely on external test harnesses, unit tests, or integration suites that examine a system from the outside. This recipe takes a different approach: it has a running goose instance test itself using its own tools and capabilities.

This is meta-testing: the system under test is also the tester, examining its own behavior from within an active session. For an AI agent like goose, this approach gives insight into behavioral consistency and tool reliability that is hard to get from external testing alone.

Primary Use Case: Goose Testing Goose

The most powerful application of this recipe is when goose itself is developing new goose features. A goose instance working on the codebase can:

  1. Make changes to the goose source code
  2. Build a new goose binary with cargo build --release
  3. Shell out to run the self-test recipe using the newly built binary:
    ./target/release/goose run --recipe goose-self-test.yaml
  4. Analyze the results to verify the new feature works correctly

This creates a recursive development loop where goose can autonomously develop, test, and validate improvements to itself. The goose doing the development can examine test outputs, debug failures, and iterate on fixes - all while using the self-test recipe to validate each iteration.
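
The build-and-test steps above can be sketched as a small shell function. This is only an illustration, not part of the PR: the function name and the `CARGO` and `GOOSE_BIN` override variables are hypothetical, and they default to the two commands quoted in the description.

```shell
#!/usr/bin/env bash
# Sketch of the build-then-self-test loop described above.
# CARGO and GOOSE_BIN are hypothetical override variables (not part of goose);
# they default to the commands quoted in the PR description.

build_and_self_test() {
  local cargo="${CARGO:-cargo}"
  local goose_bin="${GOOSE_BIN:-./target/release/goose}"
  local recipe="${1:-goose-self-test.yaml}"

  # Step 2: build the new binary; stop early if the build fails.
  "$cargo" build --release || { echo "build failed" >&2; return 1; }

  # Step 3: run the self-test recipe with the freshly built binary.
  "$goose_bin" run --recipe "$recipe" || { echo "self-test failed" >&2; return 2; }

  # Step 4 (analysis of the results) is left to the caller.
  echo "self-test passed"
}
```

Calling `build_and_self_test` from the repository root runs one iteration. The distinct return codes (1 for a build failure, 2 for a self-test failure) let the caller decide whether to fix the build or debug the feature.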

How It Works

The self-test recipe guides goose through a structured validation process:

  1. Environment Setup - Creates a workspace and establishes testing infrastructure
  2. Tool Validation - Tests file operations, shell commands, and code analysis
  3. Extension Testing - Validates dynamic extension management
  4. Subagent Orchestration - Tests recursive self-creation and parallel execution
  5. Advanced Scenarios - Explores error boundaries and security controls
  6. Report Generation - Produces comprehensive documentation of results

The recipe uses goose's own capabilities to create test scenarios, execute them, and validate the outcomes. Each test phase builds on the previous, creating a comprehensive assessment of functionality.

Design Principles

What Can Be Tested

From within a running session, goose can test:

  • Tool execution and error handling
  • File and shell operations
  • Code analysis accuracy
  • Extension management
  • Subagent creation and coordination
  • Observable behaviors and consistency

What Cannot Be Tested

Certain aspects require external observation:

  • Provider switching mid-session
  • Session persistence across restarts
  • Internal token counting mechanisms
  • Network transport layers
  • Security boundaries from an outside perspective

The recipe focuses on what's testable from within, providing meaningful validation of user-facing functionality.

Key Features

Flexible Execution

The recipe supports parameterized testing:

  • test_phases: Select specific test categories or run all
  • test_depth: Choose between quick, standard, or exhaustive testing
  • parallel_tests: Enable/disable parallel test execution
  • workspace_dir: Specify test artifact location
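
As a rough sketch of how these parameters combine on the command line, the helper below prints (without running) a goose invocation. The `--params key=value` form and the `quick`, `standard`, and `exhaustive` depths come from this description; the function name and the `true`/`false` values for `parallel_tests` are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Dry run: print the goose command for a given set of recipe parameters.
# self_test_command is a hypothetical helper, and the true/false values
# for parallel_tests are an assumption; check the recipe for the real ones.

self_test_command() {
  local phases="${1:-basic}"   # test_phases
  local depth="${2:-quick}"    # test_depth: quick | standard | exhaustive
  local parallel="${3:-}"      # parallel_tests (optional)
  local workspace="${4:-}"     # workspace_dir (optional)

  # Reject depths that the description does not list.
  case "$depth" in
    quick|standard|exhaustive) ;;
    *) echo "unknown test_depth: $depth" >&2; return 1 ;;
  esac

  local cmd="./target/release/goose run --recipe goose-self-test.yaml"
  cmd="$cmd --params test_phases=$phases --params test_depth=$depth"
  if [ -n "$parallel" ]; then
    cmd="$cmd --params parallel_tests=$parallel"
  fi
  if [ -n "$workspace" ]; then
    cmd="$cmd --params workspace_dir=$workspace"
  fi
  echo "$cmd"
}
```

For example, `self_test_command basic quick` prints the same command shown under Initial Validation. The output is meant for inspection only; values containing spaces would need quoting before being executed.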

Self-Documenting

The test generates comprehensive reports:

  • Detailed logs for each phase
  • Summary statistics
  • Pass/fail status for each capability
  • Terminal-displayed executive summary

Clean Artifacts

Test artifacts are organized in a single gooseselftest directory, which is automatically added to .gitignore to keep the repository clean.
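
One way such an ignore rule could be added without creating duplicates is sketched below. This is not the recipe's actual mechanism, which is not shown here; `ensure_gitignored` is a hypothetical helper, and the trailing slash on the directory entry is an assumption.

```shell
#!/usr/bin/env bash
# Idempotently add an entry to a .gitignore file.
# ensure_gitignored is a hypothetical helper; the recipe may do this differently.

ensure_gitignored() {
  local entry="$1"
  local file="${2:-.gitignore}"

  # Create the file if it does not exist yet.
  touch "$file"

  # -x matches the whole line, -F treats the entry as a literal string.
  if ! grep -qxF -- "$entry" "$file"; then
    # If the file does not end with a newline, add one so the new
    # entry does not get glued onto the last existing line.
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
      printf '\n' >> "$file"
    fi
    printf '%s\n' "$entry" >> "$file"
  fi
}
```

Running `ensure_gitignored "gooseselftest/"` repeatedly leaves exactly one matching line in `.gitignore`, so the self-test can be rerun without growing the file.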

Why This Matters

Continuous Validation

Provides a standardized method to verify goose functionality across different:

  • Environments
  • Configurations
  • Model providers
  • Operating systems

Behavioral Testing

Unlike unit tests, which verify code correctness, this recipe tests actual agent behavior. That matters for AI systems, where behavior can vary with context and model.

Meta-Cognitive Assessment

The successful completion of self-testing demonstrates goose's ability to:

  • Understand complex instructions
  • Coordinate multiple tools
  • Maintain context across operations
  • Reason about its own capabilities

Quality Assurance

Enables rapid validation after:

  • Code changes
  • Dependency updates
  • Configuration modifications
  • New feature additions

Initial Validation

The recipe has been successfully tested with:

./target/release/goose run --recipe goose-self-test.yaml --params test_phases=basic --params test_depth=quick

Results from initial testing:

  • 21 tests executed
  • 100% pass rate
  • ~3 minute execution time
  • All core developer tools validated


@zanesq zanesq left a comment

Nice! Assuming this is cli only right?

@tlongwell-block

Nice! Assuming this is cli only right?

Yes, this one is. But @DOsinga and I were talking about using Playwright to test the desktop app. I'll try to explore that in a subsequent PR.

@tlongwell-block tlongwell-block merged commit ff3d4e9 into main Oct 12, 2025
11 checks passed
@tlongwell-block tlongwell-block deleted the self_testing branch October 12, 2025 15:03
@tlongwell-block

cc @angiejones you might think this new feature is fun

@DOsinga

DOsinga commented Oct 12, 2025

can you add it to the checklist @tlongwell-block that we run on a new release?

zanesq added a commit that referenced this pull request Oct 13, 2025
…sion-streaming

* 'main' of github.com:block/goose: (37 commits)
  Clear deeplinks after use (#5128)
  Revert "Fix gpt-5 input context limit (#4619)" (#5135)
  fix: missing cmake and protobuf for windows build, deduplicate sh/pws… (#5028)
  Fix bedrock tool input schema (#5064)
  Add self-test recipe for goose validation (#5111)
  fix: modifies openai request logic for reasoning models (#4221) (#4294)
  Fix race condition threat when set_param and set_secret of c… (#5109)
  Clean room implementation of the chat process (#5079)
  Bump rmcp (#5096)
  set version in an env variable for testing (#5100)
  fix : enhance fuzzy file search in goose desktop (#5071)
  Make async (#5126)
  docs: unlist tutorials for extensions with archived or moved servers (#5116)
  Add API Documentation Generator prompt (#5001)
  Add flag for enabling eleven labs voice dictation (#5095)
  force re-render fields to pick up custom params usage in instructions (#5112)
  Remove isUserInputDisabled (#5115)
  Improve Rust analysis output for `analyze` tool (#5072)
  Remove duplicate prepare_reply_context call (#5063)
  install react dev tools in development (#4979)
  ...

# Conflicts:
#	ui/desktop/src/components/BaseChat2.tsx
#	ui/desktop/src/hooks/useChatStream.ts
@alexhancock alexhancock mentioned this pull request Oct 17, 2025