t1085.7: AI supervisor testing + validation - dry-run, mock context, token budget, cost reporting #1635

Merged
marcusquinn merged 1 commit into main from feature/t1085.7 on Feb 18, 2026

@marcusquinn
Owner

Summary

Comprehensive end-to-end test suite for the AI Supervisor pipeline (t1085.7), covering all testing and validation requirements:

  • Dry-run mode (4 tests): Context builder with mock DB, reasoning engine dry-run, full pipeline dry-run, action executor dry-run with 6 action types
  • Token budget tracking (3 tests): Context size measurement in logs, quick scope under 50K tokens, full scope under 50K tokens
  • Cost reporting (3 tests): Reasoning log captures timestamp/mode/bytes, action log captures execution counts, DB state_log records AI events for audit trail
  • JSON parser (4 tests): Pure JSON, markdown code blocks, empty arrays, unparseable responses
  • Concurrency safety (2 tests): Lock file prevents concurrent sessions, stale lock cleanup
  • Mailbox/memory/pattern integration (4 tests): Pattern tracker section in full context, memory section in full context, quick scope skips expensive sections, self-improvement prompt analysis
  • Integration tests (3 tests, --live flag): Real GitHub data, has_actionable_work(), full dry-run pipeline against live repo
  • CLI interface (4 tests): --help for all 3 scripts, required arg validation
  • Supervisor integration (2 tests): Module sourcing, ai-status command
  • Error handling (2 tests): Missing repo slug, missing gh CLI
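
As a rough illustration of the token budget checks listed above, here is a minimal sketch. It assumes a heuristic of about 4 bytes per token, which this PR does not state; `estimate_tokens`, `within_budget`, and `BYTES_PER_TOKEN` are hypothetical names, and only the 50K budget comes from the description.

```shell
# Hypothetical helpers, not taken from the PR's test script.
TOKEN_BUDGET=50000   # the 50K budget named in the PR description
BYTES_PER_TOKEN=4    # assumed ratio; a rough heuristic only

# Print an estimated token count for the given context string.
estimate_tokens() {
    local context="$1"
    local bytes
    bytes=$(printf '%s' "$context" | wc -c | tr -d ' ')
    # Round up so a partial token still counts against the budget.
    echo $(( (bytes + BYTES_PER_TOKEN - 1) / BYTES_PER_TOKEN ))
    return 0
}

# Succeed only if the estimated size fits the budget.
within_budget() {
    local context="$1"
    local tokens
    tokens=$(estimate_tokens "$context")
    [ "$tokens" -le "$TOKEN_BUDGET" ]
    return $?
}
```

A test could then assert `within_budget "$context"` for both the quick and the full scope.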

35 tests total: 32 pass, 3 skip (live mode requires --live flag)

Uses mock gh CLI and mock helper scripts to avoid slow API calls — test suite runs in ~10 seconds.
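
One common way to mock a CLI such as `gh` is to put a stub script first on `PATH`. The sketch below shows that technique only; the canned responses and the `mock_bin` variable are assumptions, and the mock shipped in this PR may be structured differently.

```shell
# Sketch: shadow the real gh with a stub placed first on PATH.
mock_bin=$(mktemp -d)
trap 'rm -rf "$mock_bin"' EXIT

cat >"$mock_bin/gh" <<'EOF'
#!/bin/sh
# Hypothetical canned responses keyed on the first two arguments.
case "$1 $2" in
"issue list") echo '[{"number":1,"title":"mock issue"}]' ;;
"pr list") echo '[]' ;;
*) echo "mock gh: unhandled: $*" >&2; exit 1 ;;
esac
EOF
chmod +x "$mock_bin/gh"

# Prepend the stub directory so it wins over any installed gh.
PATH="$mock_bin:$PATH"
export PATH

issues=$(gh issue list --json number,title)
echo "$issues"
```

Because the stub answers locally, no network call is made, which is consistent with the short runtime reported above.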

Ref #1606

Tests cover: dry-run mode with mock context, token budget tracking
(context size measurement + 50K token budget enforcement), cost
reporting (log audit trail + DB state_log events), JSON action plan
parser (4 extraction strategies), concurrency safety (lock files),
mailbox/memory/pattern integration verification, CLI interface tests,
error handling (missing repo slug, missing gh CLI), and integration
tests against live repo (--live flag).

35 tests total: 32 pass, 3 skip (live mode). Uses mock gh CLI and
mock helper scripts to avoid slow API calls in unit tests.
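
The four extraction strategies are not spelled out in this conversation. As a hedged illustration of the general idea, the sketch below tries two plausible ones in order: the whole response as a JSON array, then the contents of the first fenced code block. It requires `jq`; `as_json_array`, `extract_action_plan`, and `FENCE` are hypothetical names and not the PR's parser.

```shell
# Hypothetical sketch, not the parser from this PR. Requires jq.
FENCE=$(printf '\140\140\140')   # three backticks, built via octal escape

# Print the text compactly if it is a JSON array; fail otherwise.
as_json_array() {
    local text="$1"
    local parsed
    parsed=$(printf '%s' "$text" |
        jq -ce 'if type == "array" then . else error("not an array") end' 2>/dev/null) || return 1
    printf '%s\n' "$parsed"
    return 0
}

extract_action_plan() {
    local response="$1"
    local fenced

    # Strategy 1: the whole response is already a JSON array.
    if as_json_array "$response"; then
        return 0
    fi

    # Strategy 2: the array sits inside the first fenced code block.
    fenced=$(printf '%s\n' "$response" |
        awk -v f="$FENCE" 'index($0, f) == 1 { if (inside) exit; inside = 1; next } inside { print }')
    if [ -n "$fenced" ] && as_json_array "$fenced"; then
        return 0
    fi

    # Unparseable responses fail loudly instead of yielding a partial plan.
    echo "extract_action_plan: no JSON array found" >&2
    return 1
}
```

An empty array is a valid plan here, so `extract_action_plan '[]'` succeeds and prints `[]`, while free text with no array returns 1.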
@marcusquinn marcusquinn marked this pull request as ready for review February 18, 2026 02:55
@coderabbitai
Contributor

coderabbitai bot commented Feb 18, 2026

Warning

Rate limit exceeded

@marcusquinn has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 26 minutes and 39 seconds before requesting another review.

@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive end-to-end test suite for the AI Supervisor pipeline. The tests cover critical aspects such as dry-run modes, token budget adherence, cost reporting, JSON parsing robustness, concurrency safety, and integration with various supervisor components. The suite utilizes mock data and a mock GitHub CLI to ensure fast execution, providing a solid foundation for validating the AI supervisor's behavior and reliability.

Highlights

  • Comprehensive End-to-End Test Suite Added: Introduced a new, extensive test suite for the AI Supervisor pipeline (t1085.7), encompassing 35 tests to validate various functionalities.
  • Dry-Run Mode Validation: Added 4 tests to verify the dry-run functionality of the context builder, reasoning engine, full pipeline, and action executor with mock data.
  • Token Budget Tracking Verified: Implemented 3 tests to ensure accurate context size measurement in logs and confirm that both quick and full scope contexts remain within a 50K token budget.
  • Cost Reporting Mechanisms Tested: Included 3 tests to validate that reasoning logs capture timestamp, mode, and context bytes, action logs record execution counts, and the DB state_log tracks AI events for audit purposes.
  • Robust JSON Parser Checks: Added 4 tests to confirm the JSON parser's ability to handle pure JSON arrays, markdown code blocks, empty arrays, and gracefully manage unparseable responses.
  • Concurrency Safety Ensured: Developed 2 tests to verify that lock files prevent concurrent reasoning sessions and that stale lock files are properly cleaned up.
  • Integration with Mailbox, Memory, and Patterns: Included 4 tests to confirm that the full context integrates pattern tracker and memory sections, while the quick scope correctly skips these expensive sections, and the reasoning prompt includes self-improvement analysis.
  • CLI Interface and Error Handling Tests: Added tests for CLI --help functionality across supervisor scripts and validated graceful error handling for missing repository slugs and the gh CLI.

Changelog

  • tests/test-ai-supervisor-e2e.sh
    • Added a new shell script for end-to-end testing of the AI Supervisor pipeline.
    • Implemented tests for dry-run mode, token budget tracking, cost reporting, JSON parsing, concurrency, and integration with other supervisor components.
    • Included mock GitHub CLI and helper scripts to enable fast, isolated testing without external API calls.
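
The concurrency tests summarized above cover two behaviours: a lock file that blocks a second session, and cleanup of a stale lock. A minimal sketch of one way to get both follows. The lock path, the PID-based staleness rule, and the names `acquire_lock` and `release_lock` are assumptions, not the supervisor's actual implementation.

```shell
# Hypothetical sketch; the lock path and staleness rule are assumptions.
LOCK_FILE="${LOCK_FILE:-${TMPDIR:-/tmp}/ai-supervisor-demo.lock}"

acquire_lock() {
    local holder

    # set -C (noclobber) makes the redirect fail if the file exists,
    # so creating the file doubles as an atomic test-and-set.
    if (set -C; echo "$$" >"$LOCK_FILE") 2>/dev/null; then
        return 0
    fi

    # Lock exists: it is live if the recorded PID is still running.
    holder=$(cat "$LOCK_FILE" 2>/dev/null)
    if [ -n "$holder" ] && kill -0 "$holder" 2>/dev/null; then
        echo "lock held by PID $holder" >&2
        return 1
    fi

    # Stale lock: remove it and retry once. If another process wins
    # the race, noclobber makes this attempt fail rather than overwrite.
    rm -f "$LOCK_FILE"
    (set -C; echo "$$" >"$LOCK_FILE") 2>/dev/null
    return $?
}

release_lock() {
    rm -f "$LOCK_FILE"
    return $?
}
```

One limitation of this sketch: `kill -0` also fails for a live process owned by another user, so it only suits locks shared by a single account.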

@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 22 code smells

[INFO] Recent monitoring activity:
Wed Feb 18 02:55:35 UTC 2026: Code review monitoring started
Wed Feb 18 02:55:35 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 22

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 22
  • VULNERABILITIES: 0

Generated on: Wed Feb 18 02:55:38 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request introduces a comprehensive end-to-end test suite for the AI Supervisor pipeline, covering dry-run modes, token budget tracking, cost reporting, and integration with various modules. The tests use mock environments and a mock gh CLI to ensure fast execution. While the test coverage is excellent, the script violates several rules in the Repository Style Guide, particularly regarding explicit return statements, the local variable pattern for function arguments, and SQLite initialization pragmas. Additionally, some functions wrapping commands should propagate their exit codes using return $? instead of a hardcoded return 0 to avoid masking potential errors. Adhering to these standards will ensure consistency with the rest of the framework's shell scripts.

Comment on lines +57 to +61
pass() {
    PASS=$((PASS + 1))
    TOTAL=$((TOTAL + 1))
    echo " PASS: $1"
}

medium

This function violates the Repository Style Guide in two ways: it does not use the local var="$1" pattern for its argument, and it lacks an explicit return statement.

Suggested change
- pass() {
-     PASS=$((PASS + 1))
-     TOTAL=$((TOTAL + 1))
-     echo " PASS: $1"
- }
+ pass() {
+     local msg="$1"
+     PASS=$((PASS + 1))
+     TOTAL=$((TOTAL + 1))
+     echo " PASS: $msg"
+     return 0
+ }

References
  1. All functions must have explicit return statements. (link)
  2. Use local var="$1" pattern in functions. (link)

Comment on lines +63 to +67
fail() {
    FAIL=$((FAIL + 1))
    TOTAL=$((TOTAL + 1))
    echo " FAIL: $1"
}

medium

This function is missing an explicit return statement and does not follow the local var="$1" pattern for arguments.

Suggested change
- fail() {
-     FAIL=$((FAIL + 1))
-     TOTAL=$((TOTAL + 1))
-     echo " FAIL: $1"
- }
+ fail() {
+     local msg="$1"
+     FAIL=$((FAIL + 1))
+     TOTAL=$((TOTAL + 1))
+     echo " FAIL: $msg"
+     return 0
+ }

References
  1. All functions must have explicit return statements. (link)
  2. Use local var="$1" pattern in functions. (link)

Comment on lines +69 to +73
skip() {
    SKIP=$((SKIP + 1))
    TOTAL=$((TOTAL + 1))
    echo " SKIP: $1"
}

medium

This function is missing an explicit return statement and does not follow the local var="$1" pattern for arguments.

Suggested change
- skip() {
-     SKIP=$((SKIP + 1))
-     TOTAL=$((TOTAL + 1))
-     echo " SKIP: $1"
- }
+ skip() {
+     local msg="$1"
+     SKIP=$((SKIP + 1))
+     TOTAL=$((TOTAL + 1))
+     echo " SKIP: $msg"
+     return 0
+ }

References
  1. All functions must have explicit return statements. (link)
  2. Use local var="$1" pattern in functions. (link)

Comment on lines +77 to +80
setup_test_env() {
    TEST_TMP=$(mktemp -d)
    mkdir -p "$TEST_TMP/logs" "$TEST_TMP/db"
}

medium

This function is missing an explicit return statement. Additionally, functions that wrap commands like mktemp and mkdir should propagate their exit codes to the caller instead of masking potential errors with a hardcoded return 0.

Suggested change
- setup_test_env() {
-     TEST_TMP=$(mktemp -d)
-     mkdir -p "$TEST_TMP/logs" "$TEST_TMP/db"
- }
+ setup_test_env() {
+     TEST_TMP=$(mktemp -d)
+     mkdir -p "$TEST_TMP/logs" "$TEST_TMP/db"
+     return $?
+ }

References
  1. All functions must have explicit return statements. (link)
  2. In shell scripts, functions that wrap a command should propagate its exit code to the caller. Avoid masking potential errors with a hardcoded return 0. Instead, use return $? or capture the exit code and return it.
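
A trailing `return $?` reports only the last command's status, here `mkdir`, so a failed `mktemp` would not be reported on its own. One possible variant that checks each step is sketched below; it illustrates the exit-code rule quoted above and is not code from the PR.

```shell
setup_test_env() {
    # Stop at the first failure so a failed mktemp is not hidden
    # by the exit status of a later command.
    TEST_TMP=$(mktemp -d) || return $?
    mkdir -p "$TEST_TMP/logs" "$TEST_TMP/db" || return $?
    return 0
}
```

The final `return 0` is reached only after both commands have succeeded, so it does not mask an error.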

Comment on lines +82 to +84
cleanup_test_env() {
    [[ -n "$TEST_TMP" && -d "$TEST_TMP" ]] && rm -rf "$TEST_TMP"
}

medium

This function is missing an explicit return statement. Functions that wrap commands like rm -rf should propagate their exit codes to the caller instead of masking potential errors with a hardcoded return 0.

Suggested change
- cleanup_test_env() {
-     [[ -n "$TEST_TMP" && -d "$TEST_TMP" ]] && rm -rf "$TEST_TMP"
- }
+ cleanup_test_env() {
+     [[ -n "$TEST_TMP" && -d "$TEST_TMP" ]] && rm -rf "$TEST_TMP"
+     return $?
+ }

References
  1. All functions must have explicit return statements. (link)
  2. In shell scripts, functions that wrap a command should propagate its exit code to the caller. Avoid masking potential errors with a hardcoded return 0. Instead, use return $? or capture the exit code and return it.
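
One detail about `return $?` after a `test && command` chain: when the test is false, `$?` is 1, so the function reports failure even though there was nothing to clean up. A sketch of an `if` form that separates the two cases is shown below. It is one option and not the author's code.

```shell
cleanup_test_env() {
    # Propagate rm's status only when there is something to remove.
    if [ -n "${TEST_TMP:-}" ] && [ -d "$TEST_TMP" ]; then
        rm -rf "$TEST_TMP"
        return $?
    fi
    # Nothing to remove is a success, not an error.
    return 0
}
```

Whether the distinction matters depends on how the caller uses the status, for example under `set -e` or in an exit trap.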

AI_ACTIONS_LOG_DIR="$TEST_TMP/logs"

# Create a minimal DB with test data
sqlite3 "$SUPERVISOR_DB" "

medium

The Repository Style Guide requires all SQLite databases to use WAL mode and a busy timeout. While this is a test setup, it should still follow the established pattern for consistency and to avoid potential locking issues during concurrent test execution.

Suggested change
- sqlite3 "$SUPERVISOR_DB" "
+ sqlite3 "$SUPERVISOR_DB" "
+ PRAGMA journal_mode=WAL;
+ PRAGMA busy_timeout=5000;
+ CREATE TABLE IF NOT EXISTS tasks (

References
  1. All SQLite databases use WAL mode + busy_timeout=5000. (link)
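
For reference, a self-contained sketch of the WAL plus busy-timeout pattern is shown below. The two-column `tasks` schema and the `init_test_db` name are hypothetical stand-ins, since the real schema is not visible in this conversation.

```shell
# Sketch only: the schema is a hypothetical stand-in for the test DB.
init_test_db() {
    local db_path="$1"
    # Both PRAGMAs print their resulting value, so stdout is discarded.
    sqlite3 "$db_path" "
        PRAGMA journal_mode=WAL;
        PRAGMA busy_timeout=5000;
        CREATE TABLE IF NOT EXISTS tasks (
            id TEXT PRIMARY KEY,
            status TEXT NOT NULL
        );
    " >/dev/null
    return $?
}
```

WAL mode is stored in the database file and persists for later connections, whereas `busy_timeout` applies only to the connection that sets it, so each later `sqlite3` invocation that needs the timeout has to set it again.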

Comment on lines +234 to +236
        exit 0
    )
}

medium

The function _test_mock_context is missing an explicit return statement. Although it ends with a subshell that exits, the style guide requires the function itself to have a return. Functions that wrap commands should propagate their exit code to the caller, which is correctly handled by return $? in the suggestion.

Suggested change
-         exit 0
-     )
- }
+         exit 0
+     )
+     return $?
+ }

References
  1. All functions must have explicit return statements. (link)
  2. In shell scripts, functions that wrap a command should propagate its exit code to the caller. Avoid masking potential errors with a hardcoded return 0. Instead, use return $? or capture the exit code and return it.
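
To make the subshell behaviour concrete, here is a small standalone sketch; `run_isolated` is a hypothetical name. The subshell's `exit` ends only the subshell, and its status becomes `$?` in the enclosing function.

```shell
run_isolated() {
    local outcome="$1"
    (
        # cd and exit here affect only the subshell, not the caller.
        cd / || exit 1
        [ "$outcome" = "ok" ] || exit 3
        exit 0
    )
    # $? holds the subshell's exit status; pass it on unchanged.
    return $?
}
```

A function whose last command is the subshell already returns that status implicitly, so the explicit `return $?` mainly satisfies the style rule and makes the intent visible.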

local no_gh_dir="$TEST_TMP/no-gh-bin"
mkdir -p "$no_gh_dir"
# Only provide essential commands (sqlite3, jq, etc.)
for cmd in sqlite3 jq date wc tr sed grep head cut mktemp mv tail printf bash; do

medium

The loop variable cmd should be declared as local to avoid leaking into the subshell's scope, following the naming convention for locals.

Suggested change
- for cmd in sqlite3 jq date wc tr sed grep head cut mktemp mv tail printf bash; do
+ local cmd
+ for cmd in sqlite3 jq date wc tr sed grep head cut mktemp mv tail printf bash; do

References
  1. Variables: lower_snake for locals. (link)
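
The effect of declaring a loop variable `local` can be seen in a short standalone sketch; `leaky` and `scoped` are hypothetical functions written for this illustration.

```shell
# Without local, the loop variable outlives the function call.
leaky() {
    for cmd in sqlite3 jq; do :; done
    return 0
}

# With local, the variable is confined to the function.
scoped() {
    local cmd
    for cmd in sqlite3 jq; do :; done
    return 0
}

unset cmd
scoped
echo "after scoped: ${cmd:-<unset>}"
leaky
echo "after leaky: ${cmd:-<unset>}"
```

When the loop already runs inside a subshell, as in the reviewed function, the variable cannot reach the parent shell either way, so in that case `local cmd` is mostly about consistency with the style guide.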

@marcusquinn marcusquinn merged commit d1ad813 into main Feb 18, 2026
19 checks passed
@marcusquinn marcusquinn deleted the feature/t1085.7 branch February 18, 2026 03:02