Skip to content

t1273: Add supervisor sanity-check for self-healing queue stalls#2015

Merged
marcusquinn merged 4 commits intomainfrom
feature/supervisor-sanity-check
Feb 20, 2026
Merged

t1273: Add supervisor sanity-check for self-healing queue stalls#2015
marcusquinn merged 4 commits intomainfrom
feature/supervisor-sanity-check

Conversation

@marcusquinn
Copy link
Owner

@marcusquinn marcusquinn commented Feb 20, 2026

Summary

  • Adds Phase 0.9 sanity check to the supervisor pulse cycle — runs when zero tasks are dispatchable but open tasks exist in TODO.md
  • Cross-references DB state, TODO.md state, and system state to find contradictions that cause silent queue stalls
  • Fixes Phase 0.5d to treat permanently failed blockers as resolved, preventing infinite dependency chain stalls

Problem

The supervisor pulse was running every 2 minutes, logging "No new tasks to pick up" hundreds of times, while 13 open tasks sat idle. Three interlocking failures caused the stall:

  1. DB-failed tasks with TODO.md claims (double-lock): Tasks failed in DB but still had assignee:/started: in TODO.md. Phase 0.5e has a 24h threshold, but the DB already knows the task failed — waiting 24h is unnecessary
  2. Failed dependency chains never cascade: When a root task fails permanently, its dependents wait forever because Phase 0.5d only checks for [x] or [-] status
  3. Missing #auto-dispatch tag: Tasks with clear specs, model assignments, and estimates were invisible to auto-pickup because they lacked the tag

Changes

New: .agents/scripts/supervisor/sanity-check.sh

Phase 0.9 module with 5 checks:

  • Check 1: Strip stale claims on DB-failed tasks immediately (DB state is authoritative)
  • Check 2: Unblock tasks whose blockers permanently failed (retries exhausted)
  • Check 3: Auto-tag dispatchable tasks missing #auto-dispatch
  • Check 4: Cancel DB orphans with no TODO.md entry
  • Check 5: Log structured stall breakdown when nothing is fixable

Modified: .agents/scripts/supervisor/pulse.sh

  • Integrates Phase 0.9 between Phase 1d (post-completion unblock) and Phase 2 (dispatch)
  • Only runs when zero tasks are queued but open tasks exist
  • Re-runs auto-pickup after fixes so newly dispatchable tasks enter the queue immediately

Modified: .agents/scripts/supervisor/todo-sync.sh

  • Phase 0.5d (auto_unblock_resolved_tasks) now checks if a blocker is permanently failed in DB (retries >= max_retries)
  • Permanently failed blockers are treated as resolved, unblocking dependents

Modified: .agents/scripts/supervisor-helper.sh

  • Sources the new sanity-check.sh module

Design Principle

"Don't assume, verify." Deterministic state machines make assumptions that cause silent failures. When state is contradictory, cross-reference all available data sources to resolve it rather than waiting for timeouts.

Testing

  • ShellCheck: zero violations on all 4 files
  • Designed to be triggered by the current stalled queue (0 queued, 0 running, 13 open tasks)

Summary by CodeRabbit

  • Chores
    • Supervisor now runs an automated sanity-check phase when the queue is idle.
    • Automatically detects and remediates stale, blocked, or inconsistent task states across repos.
    • Treats permanently failed blockers as resolved to prevent indefinite blocking.
    • Adds automatic tagging for dispatchable-looking tasks missing auto-dispatch.
    • Logs structured stall summaries to improve visibility into why the queue is blocked.

Phase 0.9 runs when zero tasks are dispatchable but open tasks exist:
- Check 1: Strip stale claims on DB-failed tasks (don't wait 24h)
- Check 2: Unblock tasks whose blockers permanently failed
- Check 3: Auto-tag dispatchable tasks missing #auto-dispatch
- Check 4: Cancel DB orphans with no TODO.md entry
- Log structured stall breakdown when nothing is fixable

Also fixes Phase 0.5d to treat permanently failed blockers as resolved,
preventing infinite dependency chain stalls.
@gemini-code-assist
Copy link

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust self-healing mechanism to the supervisor system, specifically designed to prevent and resolve silent queue stalls. By cross-referencing various state sources (database, TODO.md, system state), the supervisor can now detect and automatically rectify inconsistencies that previously led to tasks remaining idle. This significantly improves the reliability and autonomy of the task dispatching process, ensuring that tasks are processed efficiently even when underlying states diverge.

Highlights

  • New Sanity Check Module: Introduced sanity-check.sh (Phase 0.9) to proactively identify and resolve common causes of supervisor queue stalls by cross-referencing DB, TODO.md, and system states.
  • Proactive Stall Resolution: Implemented five distinct checks within the new module to address issues such as stale claims on DB-failed tasks, permanently failed blockers, missing #auto-dispatch tags, and DB orphans.
  • Enhanced Dependency Management: Updated todo-sync.sh to recognize permanently failed blockers in the database as 'resolved,' preventing dependent tasks from being indefinitely blocked.
  • Dynamic Auto-Pickup Integration: Integrated the new sanity check into pulse.sh to run specifically when the queue is empty but open tasks exist, re-running auto-pickup if any issues are fixed to immediately dispatch newly available tasks.
Changelog
  • .agents/scripts/supervisor-helper.sh
    • Added sourcing of the new sanity-check.sh module to integrate it into the supervisor's helper scripts.
  • .agents/scripts/supervisor/pulse.sh
    • Integrated a new Phase 0.9 sanity check, which executes when no tasks are queued but open tasks exist in TODO.md.
    • Implemented logic to re-run auto-pickup if the sanity check fixes any issues, ensuring newly dispatchable tasks are processed immediately.
  • .agents/scripts/supervisor/sanity-check.sh
    • Added a new script implementing Phase 0.9, which performs five distinct checks to identify and fix common queue stalls.
    • Included checks for DB-failed tasks with TODO.md claims, permanently failed blockers, tasks missing #auto-dispatch tags, and DB orphans.
    • Added functionality to log structured stall breakdowns when no fixable issues are found, providing visibility into persistent stalls.
  • .agents/scripts/supervisor/todo-sync.sh
    • Modified auto_unblock_resolved_tasks to treat tasks with permanently failed blockers (retries exhausted) in the database as resolved, allowing dependent tasks to proceed.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions
Copy link

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Fri Feb 20 15:23:23 UTC 2026: Code review monitoring started
Fri Feb 20 15:23:23 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 30
  • VULNERABILITIES: 0

Generated on: Fri Feb 20 15:23:26 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a valuable sanity-check mechanism to the supervisor, which should effectively address several causes of silent queue stalls. The new sanity-check.sh module is well-structured and the logic for detecting and fixing inconsistencies is sound. The changes in todo-sync.sh to handle permanently failed blockers are also a great improvement.

My review focuses on improving the shell scripting robustness and maintainability, and ensuring adherence to the repository's established rules. Key feedback points, supported by specific repository rules, include:

  • Refactoring duplicated code in pulse.sh into a helper function for better maintainability.
  • Addressing widespread suppression of stderr with 2>/dev/null in db SELECT operations, as this can hide important errors and violates general rules against suppressing diagnostic output.
  • Aligning local variable declarations with the rule for exit code safety.
  • Suggesting a more robust pattern for complex sed logic in sanity-check.sh for better readability and maintainability.

Overall, this is a strong contribution that enhances the supervisor's resilience. Addressing the feedback will make the implementation even more robust and compliant with repository standards.

# contradictions that cause silent stalls. Fixes: stale claims on DB-failed
# tasks, failed blocker cascades, missing #auto-dispatch tags, DB orphans.
local queued_before_sanity
queued_before_sanity=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM tasks WHERE status = 'queued';" 2>/dev/null || echo 0)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Suppressing stderr with 2>/dev/null can hide important underlying errors from db (like SQL syntax errors or connection issues) and violates general repository rules. For SELECT operations, 2>/dev/null should generally be avoided to ensure diagnostic information is visible for debugging. The || echo 0 guard is sufficient to handle cases where no rows are returned or the command fails gracefully. Please remove 2>/dev/null to allow genuine errors to be visible for debugging. This applies to other similar calls in this file (lines 1955, 1971).

Suggested change
queued_before_sanity=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM tasks WHERE status = 'queued';" 2>/dev/null || echo 0)
queued_before_sanity=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM tasks WHERE status = 'queued';" || echo 0)
References
  1. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging. This rule applies generally, with specific exceptions for non-blocking SQLite writes that may fail due to DB lock or missing columns, which is not the case here for a SELECT operation.

Comment on lines +120 to +124
failed_tasks=$(db -separator '|' "$SUPERVISOR_DB" "
SELECT id, status, error, retries, max_retries FROM tasks
WHERE status IN ('failed', 'blocked')
AND repo = '$(sql_escape "$repo_path")';
" 2>/dev/null || echo "")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Suppressing stderr with 2>/dev/null can hide important underlying errors from db (like SQL syntax errors or connection issues) and violates general repository rules. For SELECT operations, 2>/dev/null should generally be avoided to ensure diagnostic information is visible for debugging. The || echo "" guard is sufficient to handle cases where no rows are returned or the command fails gracefully. Please remove 2>/dev/null to allow genuine errors to be visible for debugging. This applies to other similar calls throughout this script.

Suggested change
failed_tasks=$(db -separator '|' "$SUPERVISOR_DB" "
SELECT id, status, error, retries, max_retries FROM tasks
WHERE status IN ('failed', 'blocked')
AND repo = '$(sql_escape "$repo_path")';
" 2>/dev/null || echo "")
failed_tasks=$(db -separator '|' "$SUPERVISOR_DB" "
SELECT id, status, error, retries, max_retries FROM tasks
WHERE status IN ('failed', 'blocked')
AND repo = '$(sql_escape "$repo_path")';
" || echo "")
References
  1. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging. This rule applies generally, with specific exceptions for non-blocking SQLite writes that may fail due to DB lock or missing columns, which is not the case here for a SELECT operation.

Comment on lines +1206 to +1208
blocker_failed_status=$(db "$SUPERVISOR_DB" \
"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \
2>/dev/null || echo "")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Suppressing stderr with 2>/dev/null can hide important underlying errors from db (like SQL syntax errors or connection issues) and violates general repository rules. For SELECT operations, 2>/dev/null should generally be avoided to ensure diagnostic information is visible for debugging. The || echo "" guard is sufficient to handle cases where no rows are returned or the command fails gracefully. Please remove 2>/dev/null to allow genuine errors to be visible for debugging. This applies to other similar calls in this file (lines 1211, 1212).

Suggested change
blocker_failed_status=$(db "$SUPERVISOR_DB" \
"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \
2>/dev/null || echo "")
blocker_failed_status=$(db "$SUPERVISOR_DB" \
"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \
|| echo "")
References
  1. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging. This rule applies generally, with specific exceptions for non-blocking SQLite writes that may fail due to DB lock or missing columns, which is not the case here for a SELECT operation.

Comment on lines +1949 to +1982
if [[ "$queued_before_sanity" -eq 0 ]]; then
# No queued tasks — check if there are open tasks in TODO.md that should be
if [[ -n "$all_repos" ]]; then
while IFS= read -r repo_path; do
if [[ -f "$repo_path/TODO.md" ]]; then
local open_count
open_count=$(grep -cE '^\s*- \[ \] t[0-9]+' "$repo_path/TODO.md" 2>/dev/null || echo 0)
if [[ "$open_count" -gt 0 ]]; then
local sanity_fixed
sanity_fixed=$(run_sanity_check "$repo_path" 2>>"$SUPERVISOR_LOG")
if [[ "${sanity_fixed:-0}" -gt 0 ]]; then
log_info "Phase 0.9: Sanity check fixed $sanity_fixed issue(s) in $repo_path — re-running auto-pickup"
cmd_auto_pickup --repo "$repo_path" 2>>"$SUPERVISOR_LOG" || true
fi
fi
fi
done <<<"$all_repos"
else
local cwd_todo
cwd_todo="$(pwd)/TODO.md"
if [[ -f "$cwd_todo" ]]; then
local open_count
open_count=$(grep -cE '^\s*- \[ \] t[0-9]+' "$cwd_todo" 2>/dev/null || echo 0)
if [[ "$open_count" -gt 0 ]]; then
local sanity_fixed
sanity_fixed=$(run_sanity_check "$(pwd)" 2>>"$SUPERVISOR_LOG")
if [[ "${sanity_fixed:-0}" -gt 0 ]]; then
log_info "Phase 0.9: Sanity check fixed $sanity_fixed issue(s) — re-running auto-pickup"
cmd_auto_pickup --repo "$(pwd)" 2>>"$SUPERVISOR_LOG" || true
fi
fi
fi
fi
fi

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic inside the if [[ -n "$all_repos" ]] block and the else block is nearly identical. This duplication can be avoided by refactoring the common logic into a helper function. This would make the code more maintainable and easier to read.

_run_sanity_check_for_repo() {
    local repo_path="$1"
    local todo_file="$repo_path/TODO.md"
    if [[ ! -f "$todo_file" ]]; then
        return
    fi

    local open_count
    open_count=$(grep -cE '^\s*- \[ \] t[0-9]+' "$todo_file" || echo 0)
    if [[ "$open_count" -gt 0 ]]; then
        local sanity_fixed
        sanity_fixed=$(run_sanity_check "$repo_path" 2>>"$SUPERVISOR_LOG")
        if [[ "${sanity_fixed:-0}" -gt 0 ]]; then
            log_info "Phase 0.9: Sanity check fixed $sanity_fixed issue(s) in $repo_path — re-running auto-pickup"
            cmd_auto_pickup --repo "$repo_path" 2>>"$SUPERVISOR_LOG" || true
        fi
    fi
}

if [[ "$queued_before_sanity" -eq 0 ]]; then
    # No queued tasks — check if there are open tasks in TODO.md that should be
    if [[ -n "$all_repos" ]]; then
        while IFS= read -r repo_path; do
            _run_sanity_check_for_repo "$repo_path"
        done <<<"$all_repos"
    else
        _run_sanity_check_for_repo "$(pwd)"
    fi
fi
References
  1. In shell scripts, extract repeated logic into an internal helper function to improve maintainability. This applies even for standalone scripts where external source dependencies are avoided.

# 0 on success
#######################################
run_sanity_check() {
local repo_path="${1:-$REPO_PATH}"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The repository rules specify declaring and assigning local variables separately for exit code safety. This function, and others in this file, use local var="$1". Please update this to follow the rule.

Suggested change
local repo_path="${1:-$REPO_PATH}"
local repo_path
repo_path="${1:-$REPO_PATH}"
References
  1. When declaring and assigning a local variable in shell scripts, separate the 'local' declaration from the assignment if command substitution is involved, to ensure exit code safety.

Comment on lines +240 to +253
if [[ "$blocked_by" == "$blocker_id" ]]; then
local escaped_blocker
escaped_blocker=$(printf '%s' "$blocker_id" | sed 's/\./\\./g')
sed_inplace "${line_num}s/ blocked-by:${escaped_blocker}//" "$todo_file"
else
# Multiple blockers — remove just this one
local escaped_blocker
escaped_blocker=$(printf '%s' "$blocker_id" | sed 's/\./\\./g')
# Remove blocker_id from comma-separated list
sed_inplace "${line_num}s/blocked-by:\([^ ]*\)${escaped_blocker},\?/blocked-by:\1/" "$todo_file"
# Clean up trailing/leading commas
sed_inplace "${line_num}s/blocked-by:,/blocked-by:/" "$todo_file"
sed_inplace "${line_num}s/,\( \)/\1/" "$todo_file"
fi

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This series of sed commands to remove a specific blocker from a comma-separated list is quite complex and potentially fragile. A more robust and readable approach is to manipulate the string in bash and then replace the entire blocked-by field. Consider refactoring to this more maintainable pattern.

Suggested change
if [[ "$blocked_by" == "$blocker_id" ]]; then
local escaped_blocker
escaped_blocker=$(printf '%s' "$blocker_id" | sed 's/\./\\./g')
sed_inplace "${line_num}s/ blocked-by:${escaped_blocker}//" "$todo_file"
else
# Multiple blockers — remove just this one
local escaped_blocker
escaped_blocker=$(printf '%s' "$blocker_id" | sed 's/\./\\./g')
# Remove blocker_id from comma-separated list
sed_inplace "${line_num}s/blocked-by:\([^ ]*\)${escaped_blocker},\?/blocked-by:\1/" "$todo_file"
# Clean up trailing/leading commas
sed_inplace "${line_num}s/blocked-by:,/blocked-by:/" "$todo_file"
sed_inplace "${line_num}s/,\( \)/\1/" "$todo_file"
fi
# Example
local all_blockers="t1,t2,t3"
local blocker_to_remove="t2"
# Robustly remove the item
local new_blockers
new_blockers=$(echo ",${all_blockers}," | sed "s/,${blocker_to_remove},/,/" | sed 's/^,//;s/,$//')
# Now use sed to replace the whole field
if [[ -n "$new_blockers" ]]; then
sed_inplace "${line_num}s/blocked-by:[^ ]*/blocked-by:${new_blockers}/" "$todo_file"
else
sed_inplace "${line_num}s/ blocked-by:[^ ]*//" "$todo_file"
fi

…ness (t1273)

Address valid review feedback:
- Extract _run_sanity_check_for_repo() to deduplicate multi-repo/single-repo
  code paths in Phase 0.9
- Replace fragile multi-step sed for comma-separated blocker removal with
  robust string rebuild approach
Copy link
Owner Author

@marcusquinn marcusquinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review response

Addressed (valid feedback)

  • Duplicated code in pulse.sh: Extracted _run_sanity_check_for_repo() helper function — good catch
  • Fragile sed for comma-separated blocker removal: Replaced with robust string rebuild approach (",$list," | sed "s/,$item,/,/")

Dismissed (incorrect or inconsistent with codebase conventions)

2>/dev/null on SELECT operations: This is the established convention across the entire supervisor codebase (279 occurrences in pulse.sh alone, 33 in todo-sync.sh, 98 in evaluate.sh). The db() wrapper can emit SQLite warnings on locked databases during concurrent access — these are expected in a multi-process supervisor and not actionable errors. Removing 2>/dev/null would flood the supervisor log with noise during normal operation. The || echo "" fallback handles actual failures.

local var="$1" pattern: The Gemini suggestion to separate declaration from assignment applies to command substitutions where you need to capture exit codes (e.g., local x; x=$(some_cmd)). For simple parameter assignments like local repo_path="$1", there is no exit code to capture — $1 is a variable expansion, not a command. This is the established pattern throughout the codebase (used in every function in todo-sync.sh, pulse.sh, etc.).

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 20, 2026

Warning

Rate limit exceeded

@marcusquinn has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 3 minutes and 54 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

Walkthrough

Adds a Phase 0.9 supervisor sanity-check: a new sanity-check.sh module performs deterministic checks (TODO.md ↔ DB ↔ repo) when the queue is empty to detect and remediate stale claims, failed blockers, missing auto-dispatch tags, and DB orphans; wired into supervisor init and pulse flow.

Changes

Cohort / File(s) Summary
Supervisor Initialization
​.agents/scripts/supervisor-helper.sh
Source the new sanity-check.sh helper during supervisor startup (sourced after routine-scheduler modules).
Pulse Phase 0.9 Integration
​.agents/scripts/supervisor/pulse.sh
Add _run_sanity_check_for_repo() and call it from cmd_pulse when the queue is empty to iterate repos and run sanity checks, triggering cmd_auto_pickup on fixes.
Sanity Check Implementation
​.agents/scripts/supervisor/sanity-check.sh
New module implementing run_sanity_check() and helpers: _check_db_failed_with_claims, _check_failed_blocker_chains, _check_missing_auto_dispatch, _check_db_orphans, _log_queue_stall_reasons — reconciles TODO.md, DB, and repo state and applies remediation (unclaim/reset/commit/tag/cancel).
Todo Sync Enhancement
​.agents/scripts/supervisor/todo-sync.sh
Extend auto_unblock_resolved_tasks to treat permanently failed blockers (retries exhausted) like resolved blockers so dependents can unblock.

Sequence Diagram

sequenceDiagram
    participant Pulse as Pulse Cycle
    participant Sanity as Sanity Check
    participant TODO as TODO.md
    participant DB as Database
    participant Repo as Git Repo

    Pulse->>Pulse: Run phases 0.0–0.8
    Pulse->>Pulse: If queue empty, enter Phase 0.9
    Pulse->>Sanity: _run_sanity_check_for_repo(repo)
    Sanity->>TODO: Open TODO.md
    Sanity->>DB: Query task rows & statuses
    Sanity->>Sanity: _check_db_failed_with_claims()
    DB-->>Sanity: Stale claims / failed rows
    Sanity->>Sanity: _check_failed_blocker_chains()
    DB-->>Sanity: Blocker status (failed/complete)
    Sanity->>TODO: Edit TODO (remove blocker, add `#auto-dispatch`, etc.)
    Sanity->>DB: _check_db_orphans() -> cancel orphan rows
    Sanity->>Repo: commit_and_push_todo (if edited)
    Sanity->>Pulse: Return fixed count
    alt fixes > 0
        Sanity->>Pulse: trigger cmd_auto_pickup(repo)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🩺 When queues sleep and tasks complain,
Phase 0.9 walks the repo lane.
It scans, it heals, it tags with care,
Unblocks the stuck and clears the air.
A tiny fix, a tidy sweep — automation's gentle sweep.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 't1273: Add supervisor sanity-check for self-healing queue stalls' is specific and directly describes the main change: introducing a Phase 0.9 sanity-check module to address supervisor queue stalls through automated self-healing logic.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/supervisor-sanity-check

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Fri Feb 20 16:16:49 UTC 2026: Code review monitoring started
Fri Feb 20 16:16:49 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 30
  • VULNERABILITIES: 0

Generated on: Fri Feb 20 16:16:52 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (3)
.agents/scripts/supervisor/todo-sync.sh (1)

1188-1218: Consolidate the two separate DB guard blocks into a single SELECT status, retries, max_retries query.

The existing block (lines 1188–1200) and the new block (lines 1202–1218) each open a separate sqlite3 connection for the same blocker_id. In the worst case this is 4 DB calls per blocker (status check → failed-status recheck → retries → max_retries). A single query handles both fast-paths atomically, also eliminating the empty-string hazard noted above.

The WHERE … AND status = 'failed' predicate on line 1207 makes the subsequent == "failed" comparison redundant — a non-empty result is always "failed".

♻️ Proposed refactor — single DB round-trip for both fast-paths
-			# t1247: DB fallback — blocker may be deployed/verified in DB but TODO.md
-			# not yet updated …
-			if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
-				local blocker_db_status=""
-				blocker_db_status=$(db "$SUPERVISOR_DB" \
-					"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' LIMIT 1;" \
-					2>/dev/null || echo "")
-				if [[ "$blocker_db_status" == "complete" ||
-					"$blocker_db_status" == "deployed" ||
-					"$blocker_db_status" == "verified" ||
-					"$blocker_db_status" == "merged" ]]; then
-					log_verbose "  auto-unblock: blocker $blocker_id is '$blocker_db_status' in DB (TODO.md not yet updated) — treating as resolved"
-					continue # Resolved in DB
-				fi
-			fi
-
-			# Check if blocker is permanently failed in DB (retries exhausted)
-			# A failed blocker will never complete — don't let it block dependents forever
-			if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
-				local blocker_failed_status=""
-				blocker_failed_status=$(db "$SUPERVISOR_DB" \
-					"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \
-					2>/dev/null || echo "")
-				if [[ "$blocker_failed_status" == "failed" ]]; then
-					local blocker_retries_left blocker_max_retries_left
-					blocker_retries_left=$(db "$SUPERVISOR_DB" "SELECT retries FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
-					blocker_max_retries_left=$(db "$SUPERVISOR_DB" "SELECT max_retries FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
-					if [[ "$blocker_retries_left" -ge "$blocker_max_retries_left" ]]; then
-						log_verbose "  auto-unblock: blocker $blocker_id is permanently failed in DB ($blocker_retries_left/$blocker_max_retries_left retries) — treating as resolved"
-						continue # Permanently failed = treat as resolved
-					fi
-				fi
-			fi
+			# t1247 / t1273: DB fast-path — single query covers resolved states and
+			# permanently-failed blockers (retries exhausted), keeping the loop to
+			# one sqlite3 round-trip per blocker.
+			if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
+				local _blocker_row="" _bd_status="" _bd_retries="" _bd_max_retries=""
+				_blocker_row=$(db -separator '|' "$SUPERVISOR_DB" \
+					"SELECT status, retries, max_retries FROM tasks WHERE id = '$(sql_escape "$blocker_id")' LIMIT 1;" \
+					2>/dev/null || echo "")
+				if [[ -n "$_blocker_row" ]]; then
+					IFS='|' read -r _bd_status _bd_retries _bd_max_retries <<<"$_blocker_row"
+					# Resolved states: treat as unblocked
+					if [[ "$_bd_status" == "complete" || "$_bd_status" == "deployed" ||
+						"$_bd_status" == "verified" || "$_bd_status" == "merged" ]]; then
+						log_verbose "  auto-unblock: blocker $blocker_id is '$_bd_status' in DB (TODO.md not yet updated) — treating as resolved"
+						continue
+					fi
+					# Permanently failed: retries exhausted — treat as resolved
+					if [[ "$_bd_status" == "failed" && "${_bd_retries:-0}" -ge "${_bd_max_retries:-0}" ]]; then
+						log_verbose "  auto-unblock: blocker $blocker_id is permanently failed in DB (${_bd_retries:-0}/${_bd_max_retries:-0} retries) — treating as resolved"
+						continue
+					fi
+				fi
+			fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/todo-sync.sh around lines 1188 - 1218, Combine
the two DB guard blocks into a single DB call: use db "$SUPERVISOR_DB" "SELECT
status, retries, max_retries FROM tasks WHERE id = '$(sql_escape "$blocker_id")'
LIMIT 1;" and parse the three fields into local variables (e.g., blocker_status,
blocker_retries_left, blocker_max_retries_left). Then apply the same logic: if
blocker_status is one of complete|deployed|verified|merged -> log auto-unblock
and continue; else if blocker_status == failed and blocker_retries_left -ge
blocker_max_retries_left -> log permanently failed and continue. Remove the
separate blocker_db_status/blocker_failed_status queries and the redundant WHERE
... AND status='failed' clause to eliminate multiple sqlite3 connections per
blocker.
.agents/scripts/supervisor/sanity-check.sh (2)

205-262: Use local IFS instead of manual save/restore for safer scoping.

The static analysis flags the IFS manipulation on lines 206 and 262. The current save/restore pattern works, but if an unexpected continue, return, or error occurs between lines 206 and 262, IFS remains modified for the rest of the function. Using local IFS=',' scopes it to the enclosing function, which is actually safe here since the outer while IFS= read sets IFS per-invocation.

Proposed simplification
-		local _saved_ifs="$IFS"
-		IFS=','
+		local IFS=','
 		for blocker_id in $blocked_by; do
 			...
 		done
-		IFS="$_saved_ifs"

Note: local IFS in bash scopes the variable to the function and restores the previous value when the function returns. Since IFS is only needed for the for loop's word splitting, this is safe — the outer while IFS= read loop sets IFS per-read invocation regardless.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/sanity-check.sh around lines 205 - 262, Replace
the manual save/restore IFS pattern with a function-scoped IFS by removing the
_saved_ifs/restore code and declaring local IFS=',' immediately before the for
blocker_id in $blocked_by loop; update the block that iterates over blocked_by
(which references variables like blocked_by, blocker_id, task_id, todo_file,
cmd_reset, sed_inplace) so it uses local IFS=',' and no longer restores IFS at
the end.

335-354: sed insertion targets the first #tag, not the last as the comment states.

Line 337 comment says "after the last #tag before any —" but the sed on line 338 replaces the first match of #tag\s#tag #auto-dispatch\s. For a line like #bugfix #security — desc, this inserts between the first and second tags rather than after the last. Functionally harmless since tag order doesn't affect dispatch, but the comment is misleading.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/sanity-check.sh around lines 335 - 354, The
comment claims the code inserts "#auto-dispatch" after the last tag but the sed
invocation in sed_inplace targets the first tag; fix by either updating the
comment to state "insert before the first `#tag` (or at end of tags)" or change
the sed expression in sed_inplace to match the last tag before the separator
(e.g., use a greedy prefix capture so the replacement targets the final
#[a-zA-Z0-9_-]* token before the space/—); refer to the sed_inplace call and the
surrounding variables todo_file, line_num and task_id when applying the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1966-1983: The inline comment in the Phase 0.9 block is truncated;
update the comment that starts with "No queued tasks — check if there are open
tasks in TODO.md that should be" to a complete phrase (e.g., "...that should be
queued" or "...that should be dispatchable") to clarify intent; locate the block
around the queued_before_sanity variable and the _run_sanity_check_for_repo
calls and replace the truncated comment with the full, unambiguous sentence.

In @.agents/scripts/supervisor/sanity-check.sh:
- Around line 244-254: The sed substitution building new_blockers uses an
unescaped $blocker_id which can contain regex metacharacters (e.g., dots);
update the construction of new_blockers so $blocker_id is escaped for use in the
sed s/// pattern (mirror the escaping approach used for $blocked_by), i.e.,
create an escaped_blocker_id variable from blocker_id before calling sed and use
that escaped variable in the sed "s/,${escaped_blocker_id},/,/" invocation;
ensure this change is applied where new_blockers is computed and keep the
existing sed_inplace and later escaped_blocked_by usage unchanged.
- Around line 44-106: run_sanity_check collects integer results from the helper
functions (_check_db_failed_with_claims, _check_failed_blocker_chains,
_check_missing_auto_dispatch, _check_db_orphans) using command substitution, but
side-effect commands inside those helpers (cmd_unclaim, cmd_reset,
cmd_transition, commit_and_push_todo, etc.) may emit stdout and corrupt the
numeric return values; update each _check_* helper (and the calls to
commit_and_push_todo) to redirect all non-essential stdout/stderr from
side-effect commands into the supervisor log (use "$SUPERVISOR_LOG") or
/dev/null so the helper functions only emit their final numeric result on
stdout, ensuring run_sanity_check's arithmetic (fixed=$((fixed + ...))) receives
clean integers.
- Around line 131-172: The loop reads retries/max_retries into
db_retries/db_max_retries but the DB query can return NULL, causing the integer
comparison in [[ "$db_retries" -lt "$db_max_retries" ]] to fail; fix by ensuring
non-null defaults: update the query that produces failed_tasks to use
COALESCE(retries,0) AS retries and COALESCE(max_retries,0) AS max_retries, or
alternately immediately after the read set db_retries=${db_retries:-0} and
db_max_retries=${db_max_retries:-0} so the comparison in the loop (and the
variables used elsewhere) always sees integers.

In @.agents/scripts/supervisor/todo-sync.sh:
- Around line 1211-1213: The numeric comparison can fail because
blocker_retries_left and blocker_max_retries_left may be empty strings when the
sqlite query returns no rows; change the code that sets blocker_retries_left and
blocker_max_retries_left (the db "SELECT retries..." and "SELECT max_retries..."
calls around variables blocker_retries_left and blocker_max_retries_left in
todo-sync.sh) to coerce empty output to 0 (for example by wrapping the db output
with a default-to-zero check) before doing the [[ ... -ge ... ]] test, or
better, replace the two separate queries with a single consolidated query that
returns both retries and max_retries (and defaults missing values to 0) to avoid
a race between the queries and ensure the numeric comparison never receives an
empty string.

---

Nitpick comments:
In @.agents/scripts/supervisor/sanity-check.sh:
- Around line 205-262: Replace the manual save/restore IFS pattern with a
function-scoped IFS by removing the _saved_ifs/restore code and declaring local
IFS=',' immediately before the for blocker_id in $blocked_by loop; update the
block that iterates over blocked_by (which references variables like blocked_by,
blocker_id, task_id, todo_file, cmd_reset, sed_inplace) so it uses local IFS=','
and no longer restores IFS at the end.
- Around line 335-354: The comment claims the code inserts "#auto-dispatch"
after the last tag but the sed invocation in sed_inplace targets the first tag;
fix by either updating the comment to state "insert before the first `#tag` (or at
end of tags)" or change the sed expression in sed_inplace to match the last tag
before the separator (e.g., use a greedy prefix capture so the replacement
targets the final #[a-zA-Z0-9_-]* token before the space/—); refer to the
sed_inplace call and the surrounding variables todo_file, line_num and task_id
when applying the change.

In @.agents/scripts/supervisor/todo-sync.sh:
- Around line 1188-1218: Combine the two DB guard blocks into a single DB call:
use db "$SUPERVISOR_DB" "SELECT status, retries, max_retries FROM tasks WHERE id
= '$(sql_escape "$blocker_id")' LIMIT 1;" and parse the three fields into local
variables (e.g., blocker_status, blocker_retries_left,
blocker_max_retries_left). Then apply the same logic: if blocker_status is one
of complete|deployed|verified|merged -> log auto-unblock and continue; else if
blocker_status == failed and blocker_retries_left -ge blocker_max_retries_left
-> log permanently failed and continue. Remove the separate
blocker_db_status/blocker_failed_status queries and the redundant WHERE ... AND
status='failed' clause to eliminate multiple sqlite3 connections per blocker.

…ping (t1273)

- Redirect stdout of side-effect commands (cmd_unclaim, cmd_reset,
  cmd_transition, commit_and_push_todo) to SUPERVISOR_LOG inside _check_*
  functions to prevent stdout leakage into $() arithmetic capture
- Add COALESCE for retries/max_retries in SQL queries to handle NULL values
- Add empty-string guards (${var:-0}) for race condition protection
- Escape blocker_id dots in sed regex for task IDs like t215.3
- Fix truncated comment in pulse.sh Phase 0.9
@github-actions
Copy link

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Fri Feb 20 16:25:58 UTC 2026: Code review monitoring started
Fri Feb 20 16:25:58 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 30
  • VULNERABILITIES: 0

Generated on: Fri Feb 20 16:26:01 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
.agents/scripts/supervisor/pulse.sh (1)

1977-1982: _run_sanity_check_for_repo called without || true — inconsistent with every other phase helper in cmd_pulse

Every other helper call in cmd_pulse is guarded with || true (e.g., lines 718, 781, 851, 871, 889, 957, 1957). When _run_sanity_check_for_repo reaches its inner if [[ "${sanity_fixed:-0}" -gt 0 ]] and the condition is false, the function's last-executed command exits 1. While the while loop absorbs exit codes at runtime, the omission is inconsistent and could surprise future maintainers or cause subtle issues if set -e is ever introduced.

♻️ Proposed fix — add `|| true` for consistency
 		while IFS= read -r repo_path; do
-			_run_sanity_check_for_repo "$repo_path"
+			_run_sanity_check_for_repo "$repo_path" || true
 		done <<<"$all_repos"
 	else
-		_run_sanity_check_for_repo "$(pwd)"
+		_run_sanity_check_for_repo "$(pwd)" || true
 	fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 1977 - 1982, The call to
_run_sanity_check_for_repo inside cmd_pulse is missing the customary "|| true"
guard used by all other phase helpers; update both invocation sites (the while
loop body that iterates over repo_path and the else branch that calls
_run_sanity_check_for_repo "$(pwd)") to append "|| true" so the helper's
non-zero exit (e.g., when sanity_fixed is 0) won't propagate and to keep
behavior consistent with other helpers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/sanity-check.sh:
- Line 136: The grep ERE patterns use unescaped task_id (so dots match any char)
in the four _check_* functions; create an escaped_task_id variable (like the
existing escaped_blocker_id) by replacing '.' with '\.' and use escaped_task_id
in the grep -E patterns (e.g., the lines that currently reference "${task_id}")
before each first grep call in the functions that use task_id (the sites at the
grep calls around the current task_id uses on lines noted in the review). This
ensures patterns match literal dots in IDs and prevents accidental matches of
adjacent IDs.
- Around line 205-268: The function currently mutates global IFS (saved in
_saved_ifs) and restores it later, which can be skipped if an early exit occurs;
change the for-loop that iterates over blocked_by to use a safe array split: use
IFS=',' read -r -a _blocker_list <<< "$blocked_by" and then iterate with for
blocker_id in "${_blocker_list[@]}"; ensure all references to $blocker_id and
$blocked_by remain quoted where used (e.g., in db calls, grep, sed_inplace and
the rebuilding logic), remove the _saved_ifs/IFS restore code, and keep the
existing logic for resetting/unblocking (cmd_reset, sed_inplace, log_warn,
fixed, task_id, todo_file) intact so no global IFS mutation can persist on early
exit.

In @.agents/scripts/supervisor/todo-sync.sh:
- Around line 1202-1221: The three separate `db` queries (using db, sql_escape
and blocker_id) create a TOCTOU and a fallback bug where
blocker_max_retries_left can become "0" if `db` fails; replace the three queries
with a single `db` query that SELECTs status, COALESCE(retries,0),
COALESCE(max_retries,3) FROM tasks WHERE id = '...' AND status = 'failed' LIMIT
1, then parse the single result (e.g., IFS='|' read -r _pf_status
blocker_retries_left blocker_max_retries_left) and treat an empty result as "not
permanently failed"; keep the existing guards blocker's retries defaults
(blocker_retries_left:-0, blocker_max_retries_left:-3) and the existing
comparison and log_verbose call (using blocker_id) so we only auto-unblock when
the row exists, status is failed, and retries >= max_retries.

---

Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1977-1982: The call to _run_sanity_check_for_repo inside cmd_pulse
is missing the customary "|| true" guard used by all other phase helpers; update
both invocation sites (the while loop body that iterates over repo_path and the
else branch that calls _run_sanity_check_for_repo "$(pwd)") to append "|| true"
so the helper's non-zero exit (e.g., when sanity_fixed is 0) won't propagate and
to keep behavior consistent with other helpers.

Comment on lines +1202 to +1221
# Check if blocker is permanently failed in DB (retries exhausted)
# A failed blocker will never complete — don't let it block dependents forever
if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
local blocker_failed_status=""
blocker_failed_status=$(db "$SUPERVISOR_DB" \
"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \
2>/dev/null || echo "")
if [[ "$blocker_failed_status" == "failed" ]]; then
local blocker_retries_left blocker_max_retries_left
blocker_retries_left=$(db "$SUPERVISOR_DB" "SELECT COALESCE(retries, 0) FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
blocker_max_retries_left=$(db "$SUPERVISOR_DB" "SELECT COALESCE(max_retries, 3) FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
# Guard against empty strings from race conditions
blocker_retries_left="${blocker_retries_left:-0}"
blocker_max_retries_left="${blocker_max_retries_left:-3}"
if [[ "$blocker_retries_left" -ge "$blocker_max_retries_left" ]]; then
log_verbose " auto-unblock: blocker $blocker_id is permanently failed in DB ($blocker_retries_left/$blocker_max_retries_left retries) — treating as resolved"
continue # Permanently failed = treat as resolved
fi
fi
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Inconsistent max_retries fallback when db fails; consolidating the three queries eliminates both the TOCTOU and the inconsistency.

|| echo "0" on line 1212 fires when db exits non-zero, setting blocker_max_retries_left="0". Because "0" is not an empty string, the :-3 guard on line 1215 never substitutes. If db is briefly unavailable for queries 2–3 (e.g., SQLite BUSY after query 1 succeeds), the comparison becomes [[ 0 -ge 0 ]]true, and the blocker is incorrectly treated as permanently failed — prematurely unblocking a dependent task.

A single query that combines the status filter with the retry columns eliminates the three-query TOCTOU window and makes the guard logic uniform:

♻️ Proposed fix — single-query consolidation
-			# Check if blocker is permanently failed in DB (retries exhausted)
-			# A failed blocker will never complete — don't let it block dependents forever
-			if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
-				local blocker_failed_status=""
-				blocker_failed_status=$(db "$SUPERVISOR_DB" \
-					"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \
-					2>/dev/null || echo "")
-				if [[ "$blocker_failed_status" == "failed" ]]; then
-					local blocker_retries_left blocker_max_retries_left
-					blocker_retries_left=$(db "$SUPERVISOR_DB" "SELECT COALESCE(retries, 0) FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
-					blocker_max_retries_left=$(db "$SUPERVISOR_DB" "SELECT COALESCE(max_retries, 3) FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
-					# Guard against empty strings from race conditions
-					blocker_retries_left="${blocker_retries_left:-0}"
-					blocker_max_retries_left="${blocker_max_retries_left:-3}"
-					if [[ "$blocker_retries_left" -ge "$blocker_max_retries_left" ]]; then
-						log_verbose "  auto-unblock: blocker $blocker_id is permanently failed in DB ($blocker_retries_left/$blocker_max_retries_left retries) — treating as resolved"
-						continue # Permanently failed = treat as resolved
-					fi
-				fi
-			fi
+			# Check if blocker is permanently failed in DB (retries exhausted)
+			# A failed blocker will never complete — don't let it block dependents forever
+			if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
+				local _pf_row _pf_retries _pf_max
+				_pf_row=$(db -separator '|' "$SUPERVISOR_DB" "
+					SELECT COALESCE(retries, 0), COALESCE(max_retries, 3) FROM tasks
+					WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed';
+				" 2>/dev/null || echo "")
+				if [[ -n "$_pf_row" ]]; then
+					IFS='|' read -r _pf_retries _pf_max <<< "$_pf_row"
+					_pf_retries="${_pf_retries:-0}"
+					_pf_max="${_pf_max:-3}"
+					if [[ "$_pf_retries" -ge "$_pf_max" ]]; then
+						log_verbose "  auto-unblock: blocker $blocker_id is permanently failed in DB ($_pf_retries/$_pf_max retries) — treating as resolved"
+						continue # Permanently failed = treat as resolved
+					fi
+				fi
+			fi

The WHERE … AND status = 'failed' filter means an empty _pf_row unambiguously means "not permanently failed" (either task absent, not failed, or db itself failed). The IFS='|' read is command-prefix scoped and does not mutate global IFS.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/todo-sync.sh around lines 1202 - 1221, The three
separate `db` queries (using db, sql_escape and blocker_id) create a TOCTOU and
a fallback bug where blocker_max_retries_left can become "0" if `db` fails;
replace the three queries with a single `db` query that SELECTs status,
COALESCE(retries,0), COALESCE(max_retries,3) FROM tasks WHERE id = '...' AND
status = 'failed' LIMIT 1, then parse the single result (e.g., IFS='|' read -r
_pf_status blocker_retries_left blocker_max_retries_left) and treat an empty
result as "not permanently failed"; keep the existing guards blocker's retries
defaults (blocker_retries_left:-0, blocker_max_retries_left:-3) and the existing
comparison and log_verbose call (using blocker_id) so we only auto-unblock when
the row exists, status is failed, and retries >= max_retries.

…k consistency (t1273)

Address CodeRabbit round 2:
- Escape dots in task_id for all grep ERE patterns across all _check_*
  functions (t215.3 would match t21503 without escaping)
- Replace IFS=',' mutation with read-based splitting via tr to prevent
  IFS corruption on early exit under set -euo pipefail
- Fix max_retries fallback: use empty string + :-3 guard instead of
  || echo '0' which would falsely trigger 'retries exhausted' on db failure
@github-actions
Copy link

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Fri Feb 20 16:42:51 UTC 2026: Code review monitoring started
Fri Feb 20 16:42:51 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 30
  • VULNERABILITIES: 0

Generated on: Fri Feb 20 16:42:54 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@sonarqubecloud
Copy link

@marcusquinn marcusquinn merged commit a2f6c82 into main Feb 20, 2026
11 checks passed
@marcusquinn marcusquinn deleted the feature/supervisor-sanity-check branch February 21, 2026 01:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant