Skip to content

t2838: Fix Phase 0.9 sanity check that resets completed tasks to queued#2845

Merged
alex-solovyev merged 1 commit intomainfrom
bugfix/fix-sanity-check-downgrade
Mar 4, 2026
Merged

t2838: Fix Phase 0.9 sanity check that resets completed tasks to queued#2845
alex-solovyev merged 1 commit intomainfrom
bugfix/fix-sanity-check-downgrade

Conversation

@alex-solovyev
Copy link
Collaborator

Summary

  • Prevent completed-task downgrade: Add deterministic guard in _execute_sanity_action() that blocks unclaim/reset actions when DB status is complete/verified/deployed/merged — the hard safety net
  • Add trigger_update_todo action: New sanity check action that syncs TODO.md to match DB (marks [x]) instead of resetting DB to match TODO.md, with fallback force-mark when deliverable verification fails
  • Accept verified_complete deliverables: verify_task_deliverables() now accepts verified_complete as a valid pr_url for tasks that don't produce PRs (audit, documentation, research)

Root Cause

Workers completed tasks (DB status = complete) but update_todo_on_complete failed because verify_task_deliverables rejected non-PR deliverables like verified_complete. TODO.md stayed [ ]. Phase 0.9 AI sanity check saw the contradiction (DB=complete, TODO=open) and proposed reset to queued, causing re-dispatch of already-completed work. Observed on t025.3, t025.4, t025.7.

Three-Layer Fix

  1. Harness guard (deterministic, in _execute_sanity_action): Even if the AI proposes a downgrade, the shell code refuses to execute it for advanced states
  2. AI prompt update (in _build_sanity_prompt): Directional authority rule + new trigger_update_todo action + state snapshot section showing completed-but-stale tasks
  3. Deliverable verification (in verify_task_deliverables): Accept verified_complete signal so update_todo_on_complete succeeds for non-PR tasks

Testing

  • Structural tests added to test-supervisor-state-machine.sh verifying all fix components are present
  • ShellCheck passes on all modified files

Closes #2838

… queued (t2838)

Three-layer fix for the completed-task downgrade loop:

1. sanity-check.sh: Add deterministic guard in _execute_sanity_action() that
   blocks unclaim/reset actions when DB status is complete/verified/deployed/merged.
   This is the hard safety net — even if the AI proposes a downgrade, the harness
   refuses to execute it.

2. sanity-check.sh: Add 'trigger_update_todo' action type that syncs TODO.md to
   match the DB (marks [x]) instead of resetting DB to match TODO.md. Update the
   AI prompt with directional authority rule and new state snapshot section showing
   completed tasks with stale TODO.md entries.

3. deploy.sh: Accept 'verified_complete' as a valid pr_url in
   verify_task_deliverables(), allowing tasks without PRs (audit, documentation)
   to pass deliverable verification and get their TODO.md updated normally.

Root cause: Workers completed tasks (DB=complete) but update_todo_on_complete
failed because verify_task_deliverables rejected non-PR deliverables. TODO.md
stayed [ ]. Phase 0.9 AI saw the contradiction and proposed 'reset' to queued,
causing re-dispatch of already-completed work.

Closes #2838
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 4, 2026

Warning

Rate limit exceeded

@alex-solovyev has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 25 minutes and 55 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 29d54823-560e-41d9-bf91-f84e49d473a4

📥 Commits

Reviewing files that changed from the base of the PR and between 55b249c and dc021b6.

📒 Files selected for processing (3)
  • .agents/scripts/supervisor-archived/deploy.sh
  • .agents/scripts/supervisor-archived/sanity-check.sh
  • tests/test-supervisor-state-machine.sh
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch bugfix/fix-sanity-check-downgrade

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added the bug Auto-created from TODO.md tag label Mar 4, 2026
@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue (t2838) where the Phase 0.9 AI sanity check was erroneously resetting completed tasks to a 'queued' state, causing redundant work. The problem stemmed from update_todo_on_complete failing for tasks without traditional PRs, leading to a discrepancy between the database's 'complete' status and TODO.md's 'open' status. The solution introduces a robust, three-pronged fix: a hardcoded safety mechanism to prevent downgrades of advanced-state tasks, an intelligent new action (trigger_update_todo) to synchronize TODO.md, and an update to the deliverable verification process to correctly handle non-PR task completions.

Highlights

  • Prevent completed-task downgrade: Added a deterministic guard in _execute_sanity_action() to block unclaim/reset actions for tasks in advanced DB states (complete, verified, deployed, merged).
  • Introduce trigger_update_todo action: Implemented a new sanity check action to synchronize TODO.md with the database for completed tasks, including a fallback to force-mark completion if standard update fails.
  • Accept verified_complete deliverables: Modified verify_task_deliverables() to recognize verified_complete as a valid PR URL for non-PR tasks (e.g., audit, documentation).
  • Updated AI prompt guidance: Enhanced the AI prompt with a "Directional Authority Rule" and a "Completed Tasks with Stale TODO.md" section to improve decision-making and prevent incorrect task resets.
Changelog
  • .agents/scripts/supervisor-archived/deploy.sh
    • verify_task_deliverables function was updated to accept "verified_complete" as a valid pr_url for tasks that do not produce pull requests, ensuring proper completion tracking for non-PR deliverables.
  • .agents/scripts/supervisor-archived/sanity-check.sh
    • The _build_sanity_state_snapshot function was enhanced to include a new "Completed Tasks with Stale TODO.md" section, providing the AI with crucial context about tasks completed in the DB but not yet marked in TODO.md.
    • The AI prompt was augmented with a "CRITICAL: Directional Authority Rule" and updated guidance to explicitly prevent reset or unclaim actions for tasks in advanced DB states, instead recommending the new trigger_update_todo action.
    • The _execute_sanity_action function was modified to include a deterministic guard that blocks unclaim or reset actions if the task's database status is complete, verified, deployed, or merged.
    • A new trigger_update_todo action was added to _execute_sanity_action, which attempts to update TODO.md for completed tasks and includes a fallback mechanism to force-mark the task as complete if the initial update fails.
  • tests/test-supervisor-state-machine.sh
    • A new test section was added to comprehensively verify the t2838 fixes, including tests for the verified_complete acceptance in deploy.sh, the downgrade prevention guard in sanity-check.sh, the presence and functionality of the trigger_update_todo action, and the updated AI prompt rules and state snapshot.
Activity
  • The pull request addresses and resolves a critical bug (t2838) related to incorrect task state management.
  • Comprehensive structural tests have been added to test-supervisor-state-machine.sh to ensure the robustness and correctness of the implemented fixes.
  • All modified shell scripts have passed ShellCheck, confirming adherence to scripting best practices.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions
Copy link

github-actions bot commented Mar 4, 2026

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 107 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Wed Mar 4 17:48:31 UTC 2026: Code review monitoring started
Wed Mar 4 17:48:32 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 107

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 107
  • VULNERABILITIES: 0

Generated on: Wed Mar 4 17:48:34 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@sonarqubecloud
Copy link

sonarqubecloud bot commented Mar 4, 2026

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a critical bug causing completed tasks to reset, implementing a robust multi-layered fix with a deterministic guard, AI prompt improvements, and enhanced deliverable verification. It also introduces security enhancements like a directional authority rule and a new trigger_update_todo action. However, two critical security vulnerabilities were identified in sanity-check.sh: a regex/sed injection due to incomplete sanitization of task IDs, and a potential prompt injection where untrusted database content is directly included in the AI prompt. These require immediate attention. Further improvements include enhancing error visibility by removing stderr suppression and strengthening the new tests for better functionality.

Comment on lines +481 to +487
completed_stale=$(db -separator '|' "$SUPERVISOR_DB" "
SELECT id, status, COALESCE(pr_url,'')
FROM tasks
WHERE repo = '$(sql_escape "$repo_path")'
AND status IN ('complete', 'verified', 'deployed', 'merged')
ORDER BY id;
" 2>/dev/null || echo "")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Suppressing stderr with 2>/dev/null for the db command can mask critical issues like database connection failures, SQL syntax errors, or permission problems. Removing it will ensure these errors are visible for easier debugging. The || echo "" will still handle command failure gracefully.

Suggested change
completed_stale=$(db -separator '|' "$SUPERVISOR_DB" "
SELECT id, status, COALESCE(pr_url,'')
FROM tasks
WHERE repo = '$(sql_escape "$repo_path")'
AND status IN ('complete', 'verified', 'deployed', 'merged')
ORDER BY id;
" 2>/dev/null || echo "")
completed_stale=$(db -separator '|' "$SUPERVISOR_DB" "
SELECT id, status, COALESCE(pr_url,'')
FROM tasks
WHERE repo = '$(sql_escape "$repo_path")'
AND status IN ('complete', 'verified', 'deployed', 'merged')
ORDER BY id;
" || echo "")
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

# than TODO.md. Completed tasks showing [ ] in TODO.md need their TODO.md
# updated (via trigger_update_todo), NOT their DB state reset to queued.
local _current_db_status=""
_current_db_status=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$(sql_escape "$task_id")';" 2>/dev/null || echo "")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Suppressing stderr with 2>/dev/null for the db command can mask critical issues like database connection failures or permission problems. Removing it will ensure these errors are visible for easier debugging. The || echo "" will still handle command failure gracefully.

Suggested change
_current_db_status=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$(sql_escape "$task_id")';" 2>/dev/null || echo "")
_current_db_status=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$(sql_escape "$task_id")';" || echo "")
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

Comment on lines +880 to +882
escaped_task_id=$(printf '%s' "$task_id" | sed 's/\./\\./g')
if grep -qE "^[[:space:]]*- \[ \] ${escaped_task_id}( |$)" "$todo_file" 2>/dev/null; then
sed_inplace -E "s/^([[:space:]]*- )\[ \] (${escaped_task_id} .*)$/\1[x] \2 verified:${today} completed:${today}/" "$todo_file"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The task_id variable is used to construct a regular expression for grep and a command for sed without sufficient sanitization. While dots are escaped, other special characters such as /, [, ], *, (, ), and & are not. An attacker who can control the task ID (e.g., by creating a task with a malicious name) could inject regex patterns to match unintended lines or inject sed commands to corrupt the TODO.md file. In some environments, this could potentially lead to arbitrary command execution if the sed implementation supports the e flag.

@@ -501,6 +535,18 @@ You are a supervisor sanity checker for an automated task dispatch system. The q

$state_snapshot

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Untrusted data from the database (task IDs, statuses, PR URLs) is concatenated into the AI prompt via the $state_snapshot variable. An attacker could craft malicious metadata for a task that, when included in the snapshot, performs a prompt injection attack. This could trick the AI into proposing unauthorized or harmful actions, such as cancelling legitimate tasks or bypassing security rules.

Comment on lines +2580 to +2583
write_proof_log --task "$task_id" --event "deliverable_verified" --stage "complete" \
--decision "verified:no_pr:verified_complete" \
--evidence "pr_url=verified_complete,signal=FULL_LOOP_COMPLETE" \
--maker "verify_task_deliverables" 2>/dev/null || true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Suppressing stderr with 2>/dev/null can hide important errors from the write_proof_log command, such as issues with the log file path or permissions. It's better to let stderr be visible for debugging. The || true already prevents the script from exiting on failure.

Suggested change
write_proof_log --task "$task_id" --event "deliverable_verified" --stage "complete" \
--decision "verified:no_pr:verified_complete" \
--evidence "pr_url=verified_complete,signal=FULL_LOOP_COMPLETE" \
--maker "verify_task_deliverables" 2>/dev/null || true
write_proof_log --task "$task_id" --event "deliverable_verified" --stage "complete" \
--decision "verified:no_pr:verified_complete" \
--evidence "pr_url=verified_complete,signal=FULL_LOOP_COMPLETE" \
--maker "verify_task_deliverables" || true
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

[[ -z "$csid" ]] && continue
local escaped_csid
escaped_csid=$(printf '%s' "$csid" | sed 's/\./\\./g')
if grep -qE "^[[:space:]]*- \[ \] ${escaped_csid}( |$)" "$todo_file" 2>/dev/null; then

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Suppressing stderr with 2>/dev/null can hide errors, for instance if $todo_file doesn't exist. It's better to let potential errors be visible for debugging.

Suggested change
if grep -qE "^[[:space:]]*- \[ \] ${escaped_csid}( |$)" "$todo_file" 2>/dev/null; then
if grep -qE "^[[:space:]]*- \[ \] ${escaped_csid}( |$)" "$todo_file"; then
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

Comment on lines +1960 to +1965
if grep -q 'complete.*verified.*deployed.*merged' "$SANITY_CHECK_SCRIPT" &&
grep -q 'BLOCKED downgrade' "$SANITY_CHECK_SCRIPT"; then
pass "t2838: sanity-check.sh has downgrade prevention guard"
else
fail "t2838: sanity-check.sh missing downgrade prevention guard"
fi

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This test is structural, verifying the presence of keywords in the script using grep. It would be more robust to create a functional test that actually attempts to perform a downgrade and asserts that it is blocked by the guard. This would validate the behavior of the code, not just its text. For example, you could try to execute a reset action on the complete task t2838a and verify that the action is blocked and the task's status remains complete.

@alex-solovyev alex-solovyev merged commit b1015c0 into main Mar 4, 2026
27 of 28 checks passed
@alex-solovyev alex-solovyev deleted the bugfix/fix-sanity-check-downgrade branch March 4, 2026 19:21
alex-solovyev added a commit that referenced this pull request Mar 5, 2026
…eview (GH#2866)

- Remove 2>/dev/null from db() calls to surface database errors (critical)
- Replace dot-only regex escaping with full metacharacter _escape_regex() helper
- Add task_id format validation to reject malformed IDs before regex/sed use
- Replace grep 2>/dev/null with explicit file-existence checks
- Add prompt injection mitigation: control char stripping, DATA boundary markers,
  and anti-injection instruction for AI prompt containing DB data

Closes #2866
alex-solovyev added a commit that referenced this pull request Mar 5, 2026
* fix: address critical quality-debt in sanity-check.sh from PR #2845 review (GH#2866)

- Remove 2>/dev/null from db() calls to surface database errors (critical)
- Replace dot-only regex escaping with full metacharacter _escape_regex() helper
- Add task_id format validation to reject malformed IDs before regex/sed use
- Replace grep 2>/dev/null with explicit file-existence checks
- Add prompt injection mitigation: control char stripping, DATA boundary markers,
  and anti-injection instruction for AI prompt containing DB data

Closes #2866

* fix: replace remaining grep 2>/dev/null on $todo_file with file-existence guards

Address Gemini Code Assist review feedback on PR #2870: the grep at
line 342 (orphan detection) still suppressed stderr. Apply the same
[[ -f "$todo_file" ]] guard pattern consistently to all remaining
grep calls on $todo_file (lines 260, 342, 996-999) so file-not-found
and permission errors are visible.

---------

Co-authored-by: marcusquinn <6428977+marcusquinn@users.noreply.github.com>
alex-solovyev added a commit that referenced this pull request Mar 5, 2026
Remove 2>/dev/null from all 6 write_proof_log calls in deploy.sh.
The || true already prevents script exit on failure, so suppressing
stderr just hides debugging info (path errors, permission issues).

Addresses review feedback from Gemini on PR #2845.
alex-solovyev added a commit that referenced this pull request Mar 5, 2026
)

Remove 2>/dev/null from all 6 write_proof_log calls in deploy.sh.
The || true already prevents script exit on failure, so suppressing
stderr just hides debugging info (path errors, permission issues).

Addresses review feedback from Gemini on PR #2845.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Auto-created from TODO.md tag

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Phase 0.9 sanity check resets completed tasks to queued when TODO.md shows [ ]

1 participant