Merged
Changes from 2 commits
5 changes: 4 additions & 1 deletion .gitignore
@@ -68,4 +68,7 @@ crates/goose/tests/mcp_replays/*errors.txt

.gromastate
# Nix build output
result

# Goose self-test artifacts
gooseselftest/
1 change: 1 addition & 0 deletions AGENTS.md
@@ -70,6 +70,7 @@ ui/desktop/ # Electron app
## Rules

Test: Prefer tests/ folder, e.g. crates/goose/tests/
Test: When adding features, update goose-self-test.yaml, rebuild, then run `goose run --recipe goose-self-test.yaml` to validate
Error: Use anyhow::Result
Provider: Implement Provider trait see providers/base.rs
MCP: Extensions in crates/goose-mcp/
310 changes: 310 additions & 0 deletions goose-self-test.yaml
@@ -0,0 +1,310 @@
version: 1.0.0
title: Goose Self-Testing Integration Suite
description: A comprehensive meta-testing recipe where goose tests its own capabilities using its own tools - true first-person integration testing
author:
  contact: goose-self-test

activities:
  - Initialize test workspace and logging infrastructure
  - Test file operations (create, read, update, delete, undo)
  - Validate shell command execution and error handling
  - Analyze code structure and parsing capabilities
  - Test extension discovery and management
  - Create and orchestrate subagents for meta-testing
  - Generate and execute test recipes
  - Validate scheduling system operations
  - Test error boundaries and security controls
  - Measure performance and resource usage
  - Generate comprehensive test report

parameters:
  - key: test_phases
    input_type: string
    requirement: optional
    default: "all"
    description: "Which test phases to run: all, basic, extensions, subagents, recipes, advanced, scheduling"

  - key: test_depth
    input_type: string
    requirement: optional
    default: "standard"
    description: "Testing depth: quick (smoke tests), standard (normal coverage), deep (exhaustive)"

  - key: workspace_dir
    input_type: string
    requirement: optional
    default: "./gooseselftest"
    description: "Directory for test artifacts and results"

  - key: parallel_tests
    input_type: string
    requirement: optional
    default: "true"
    description: "Run independent tests in parallel where possible"

  - key: cleanup_after
    input_type: string
    requirement: optional
    default: "true"
    description: "Clean up test artifacts after completion"

instructions: |
  You are testing yourself - a running goose instance validating its own capabilities through meta-testing.
  This is true first-person integration testing, where you use your own tools to test your own functionality.

  ## Understanding First-Person Integration Testing
  This is a crucial distinction - as a running goose instance, you are testing yourself using your own capabilities.
  This is meta-testing in the truest sense: not unit tests or external test harnesses, but you using your tools
  to validate your own functionality from within your active session. You can only test what you can observe and
  control from inside your running instance - your tools, your behaviors, your error handling, your consistency.

  ## Core Testing Philosophy
  - You ARE the system under test AND the tester
  - Use your tools to create test scenarios, then validate the results
  - Test both success and failure paths
  - Document everything meticulously
  - Handle errors gracefully - a test failure shouldn't stop the suite

  ## Test Execution Framework

  ### Phase 1: Environment Setup & Basic Tool Validation
  Create a structured test workspace and validate core developer tools:
  - File operations (CRUD + undo)
  - Shell command execution
  - Code analysis capabilities
  - Error handling and recovery

  ### Phase 2: Extension System Testing
  Test dynamic extension management:
  - Discover available extensions
  - Enable/disable extensions
  - Test extension interactions
  - Verify isolation between extensions

  ### Phase 3: Subagent Testing (Meta-Recursion)
  Create subagents to test yourself recursively:
  - Basic subagent creation and execution
  - Parallel subagent orchestration
  - Sequential subagent chains
  - Recursive depth testing (a subagent creating a subagent)
  - Test the return_last_only optimization

  ### Phase 4: Advanced Self-Testing
  Push boundaries and test limits:
  - Intentionally trigger errors
  - Test timeout scenarios
  - Validate security controls
  - Measure performance metrics
  - Test resource constraints

  ### Phase 5: Report Generation
  Compile comprehensive test results:
  - Aggregate all test outcomes
  - Calculate success metrics
  - Document failures and issues
  - Generate recommendations

  ## Success Criteria
  - Phase success: ≥80% of tests pass
  - Suite success: all phases complete and critical features work
  - Each test logs: setup → execute → validate → result → cleanup

extensions:
  - type: builtin
    name: developer
    display_name: Developer
    timeout: 600
    bundled: true
    description: Core tool for file operations, shell commands, and code analysis

prompt: |
  Execute the Goose Self-Testing Integration Suite in {{ workspace_dir }}.
  Test phases: {{ test_phases }}, Depth: {{ test_depth }}, Parallel: {{ parallel_tests }}

  ## 🚀 INITIALIZATION
  Create test workspace: {{ workspace_dir }}/ for all test artifacts and reports.

  Track your progress using the todo extension. Start with:
  - [ ] Initialize test workspace
  - [ ] Set up logging infrastructure
  - [ ] Begin Phase 1 testing

  {% if test_phases == "all" or "basic" in test_phases %}
  ## 📝 PHASE 1: Basic Tool Validation

  ### File Operations Testing
  1. Create test files with various content types (.txt, .py, .md, .json)
  2. Test str_replace on each file type
  3. Test insert operations at different line positions
  4. Test undo functionality
  5. Verify file deletion and recreation
  6. Test with special characters and Unicode

  ### Shell Command Testing
  Test a comprehensive shell workflow: command chaining (mkdir test && cd test && echo "test" > file.txt),
  error handling (false || echo "handled"), and environment variables (export VAR=test && echo $VAR).
  Verify that both success and failure paths work correctly.
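  The chained-workflow checks above can be sketched as one runnable script (the throwaway sandbox directory is illustrative, not part of the recipe):

  ```shell
  #!/bin/sh
  # Minimal sketch of the Phase 1 shell checks, run inside a temporary directory.
  work=$(mktemp -d) && cd "$work"

  # Command chaining: each step runs only if the previous one succeeded
  mkdir test && cd test && echo "test" > file.txt

  # Failure path: `false` fails, so the fallback branch runs
  false || echo "handled"

  # Environment variables persist within the same shell session
  export VAR=test && echo "$VAR"

  cat file.txt   # confirms the chained write landed
  ```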

  ### Code Analysis Testing
  1. Create sample code files in Python, JavaScript, and Go
  2. Analyze each file for structure
  3. Test directory-wide analysis
  4. Test symbol focus and call graphs
  5. Verify LOC, function, and class counting

  Log results to: {{ workspace_dir }}/phase1_basic_tools.md
  {% endif %}

  {% if test_phases == "all" or "extensions" in test_phases %}
  ## 🔧 PHASE 2: Extension System Testing

  ### Todo Extension Testing (Built-in)
  1. Create initial todos and verify they persist
  2. Update todos and confirm changes are retained
  3. Clear todos and verify clean state

  ### Dynamic Extension Management
  1. Use platform__search_available_extensions to discover available extensions
  2. Document all available extensions
  3. Test enabling and disabling dynamic extensions (if any are available)
  4. Verify extension isolation between enabled extensions

  Log results to: {{ workspace_dir }}/phase2_extensions.md
  {% endif %}

  {% if test_phases == "all" or "subagents" in test_phases %}
  ## 🤖 PHASE 3: Subagent Meta-Testing

  ### Basic Subagent Test
  Create a simple subagent task:
  ```
  Task: "Create a file called subagent_test.txt with 'Hello from subagent'"
  ```

  ### Parallel Subagent Test
  {% if parallel_tests == "true" %}
  Create 3 parallel subagents:
  1. Count files in current directory
  2. Get current timestamp
  3. Create a test file

  Use execution_mode: "parallel" and verify all complete.
  {% endif %}

  ### Sequential Chain Test
  Create dependent subagents:
  1. First: Create a Python file
  2. Second: Analyze the created file
  3. Third: Run the Python file

  ### Recursive Depth Test (if test_depth == "deep")
  {% if test_depth == "deep" %}
  Create a subagent that creates another subagent (test depth limit).
  Monitor for resource constraints and context window limits.
  {% endif %}

  ### Return Last Only Test
  Create subagents with return_last_only=true and verify condensed output.

  Log results to: {{ workspace_dir }}/phase3_subagents.md
  {% endif %}

  {% if test_phases == "all" or "advanced" in test_phases %}
  ## 🔬 PHASE 4: Advanced Testing

  ### Error Boundary Testing
  1. Create a file with an invalid path (should fail gracefully)
  2. Run a non-existent shell command
  3. Try to analyze a binary file
  4. Test with extremely long filenames
  5. Test with nested directory creation beyond limits
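  A probe harness along these lines (a sketch, not goose's internal error-handling mechanism) keeps expected failures from aborting the rest of the suite:

  ```shell
  #!/bin/sh
  # Each probe is expected to fail; record the outcome instead of aborting.
  run_probe() {
    desc=$1; shift
    if "$@" 2>/dev/null; then
      echo "PASS: $desc"
    else
      echo "FAIL(expected): $desc"
    fi
  }

  run_probe "nonexistent command"   no_such_command_xyz
  run_probe "write to invalid path" sh -c 'echo hi > /no_such_dir_xyz/file'
  ```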

  ### Performance Measurement
  {% if test_depth == "deep" %}
  1. Create and analyze a large file (>1MB)
  2. Run multiple parallel operations
  3. Track execution times for each operation
  4. Monitor token usage if accessible
  {% endif %}

  ### Security Validation
  1. Test input with special shell characters: $(echo test)
  2. Attempt directory traversal: ../../../etc/passwd
  3. Test with harmful Unicode characters
  4. Verify command injection prevention
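  The injection checks above hinge on quoting semantics. A minimal illustration (this is not goose's actual sanitization logic, just the underlying shell behavior being tested):

  ```shell
  #!/bin/sh
  payload='$(echo injected)'

  # Quoted expansion passes the text through literally - no command substitution runs
  printf '%s\n' "$payload"    # prints: $(echo injected)

  # eval re-parses the string, which is exactly what command injection exploits
  eval "echo $payload"        # prints: injected
  ```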

  Log results to: {{ workspace_dir }}/phase4_advanced.md
  {% endif %}

  ## 📊 PHASE 5: Final Report Generation

  Create TWO reports:

  ### 1. Detailed Report at {{ workspace_dir }}/detailed_report.md
  Include all test details, logs, and technical information.

  ### 2. Executive Summary (REQUIRED - Display in Terminal)

  **IMPORTANT**: At the very end, generate and display a concise summary directly in the terminal:

  ```
  ========================================
  GOOSE SELF-TEST SUMMARY
  ========================================

  ✅ OVERALL RESULT: [PASS/FAIL]

  📊 Quick Stats:
  • Tests Run: [X]
  • Passed: [X] ([%])
  • Failed: [X] ([%])
  • Duration: [X minutes]

  ✅ Working Features:
  • File operations: [✓/✗]
  • Shell commands: [✓/✗]
  • Code analysis: [✓/✗]
  • Extensions: [✓/✗]
  • Subagents: [✓/✗]

  ⚠️ Issues Found:
  • [Issue 1 - brief description]
  • [Issue 2 - brief description]

  💡 Key Insights:
  • [Most important finding]
  • [Performance observation]
  • [Recommendation]

  📁 Full report: {{ workspace_dir }}/detailed_report.md
  ========================================
  ```

  This summary should be:
  - **Concise**: Under 30 lines
  - **Visual**: Use emojis and formatting for clarity
  - **Actionable**: Clear pass/fail status
  - **Informative**: Key findings at a glance

  Always end with this summary so users immediately see the results without digging through files.

  {% if cleanup_after == "true" %}
  ## 🧹 CLEANUP
  After report generation:
  1. Archive results to {{ workspace_dir }}/archive/
  2. Remove temporary test artifacts
  3. Keep only the final report and logs
  {% endif %}

  ## 🎯 META-TESTING NOTES
  Remember: You are testing yourself. This is recursive validation where:
  - Success means your tools work as expected
  - Failure reveals areas needing attention
  - The ability to complete this test IS itself a test
  - Document everything - your future self (or another goose) will thank you

  Use your todo extension to track progress throughout.
  Handle errors gracefully - a failed test shouldn't crash the suite.
  Be thorough but efficient based on the test_depth parameter.

  This is true first-person integration testing. Execute with precision and document with clarity.