t1273: Add supervisor sanity-check for self-healing queue stalls#2015
t1273: Add supervisor sanity-check for self-healing queue stalls#2015marcusquinn merged 4 commits intomainfrom
Conversation
Phase 0.9 runs when zero tasks are dispatchable but open tasks exist: - Check 1: Strip stale claims on DB-failed tasks (don't wait 24h) - Check 2: Unblock tasks whose blockers permanently failed - Check 3: Auto-tag dispatchable tasks missing #auto-dispatch - Check 4: Cancel DB orphans with no TODO.md entry - Log structured stall breakdown when nothing is fixable Also fixes Phase 0.5d to treat permanently failed blockers as resolved, preventing infinite dependency chain stalls.
Summary of ChangesHello @marcusquinn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a robust self-healing mechanism to the supervisor system, specifically designed to prevent and resolve silent queue stalls. By cross-referencing various state sources (database, TODO.md, system state), the supervisor can now detect and automatically rectify inconsistencies that previously led to tasks remaining idle. This significantly improves the reliability and autonomy of the task dispatching process, ensuring that tasks are processed efficiently even when underlying states diverge. Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 20 15:23:26 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
There was a problem hiding this comment.
Code Review
This pull request introduces a valuable sanity-check mechanism to the supervisor, which should effectively address several causes of silent queue stalls. The new sanity-check.sh module is well-structured and the logic for detecting and fixing inconsistencies is sound. The changes in todo-sync.sh to handle permanently failed blockers are also a great improvement.
My review focuses on improving the shell scripting robustness and maintainability, and ensuring adherence to the repository's established rules. Key feedback points, supported by specific repository rules, include:
- Refactoring duplicated code in
pulse.shinto a helper function for better maintainability. - Addressing widespread suppression of
stderrwith2>/dev/nullindbSELECToperations, as this can hide important errors and violates general rules against suppressing diagnostic output. - Aligning local variable declarations with the rule for exit code safety.
- Suggesting a more robust pattern for complex
sedlogic insanity-check.shfor better readability and maintainability.
Overall, this is a strong contribution that enhances the supervisor's resilience. Addressing the feedback will make the implementation even more robust and compliant with repository standards.
| # contradictions that cause silent stalls. Fixes: stale claims on DB-failed | ||
| # tasks, failed blocker cascades, missing #auto-dispatch tags, DB orphans. | ||
| local queued_before_sanity | ||
| queued_before_sanity=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM tasks WHERE status = 'queued';" 2>/dev/null || echo 0) |
There was a problem hiding this comment.
Suppressing stderr with 2>/dev/null can hide important underlying errors from db (like SQL syntax errors or connection issues) and violates general repository rules. For SELECT operations, 2>/dev/null should generally be avoided to ensure diagnostic information is visible for debugging. The || echo 0 guard is sufficient to handle cases where no rows are returned or the command fails gracefully. Please remove 2>/dev/null to allow genuine errors to be visible for debugging. This applies to other similar calls in this file (lines 1955, 1971).
| queued_before_sanity=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM tasks WHERE status = 'queued';" 2>/dev/null || echo 0) | |
| queued_before_sanity=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM tasks WHERE status = 'queued';" || echo 0) |
References
- Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging. This rule applies generally, with specific exceptions for non-blocking SQLite writes that may fail due to DB lock or missing columns, which is not the case here for a SELECT operation.
| failed_tasks=$(db -separator '|' "$SUPERVISOR_DB" " | ||
| SELECT id, status, error, retries, max_retries FROM tasks | ||
| WHERE status IN ('failed', 'blocked') | ||
| AND repo = '$(sql_escape "$repo_path")'; | ||
| " 2>/dev/null || echo "") |
There was a problem hiding this comment.
Suppressing stderr with 2>/dev/null can hide important underlying errors from db (like SQL syntax errors or connection issues) and violates general repository rules. For SELECT operations, 2>/dev/null should generally be avoided to ensure diagnostic information is visible for debugging. The || echo "" guard is sufficient to handle cases where no rows are returned or the command fails gracefully. Please remove 2>/dev/null to allow genuine errors to be visible for debugging. This applies to other similar calls throughout this script.
| failed_tasks=$(db -separator '|' "$SUPERVISOR_DB" " | |
| SELECT id, status, error, retries, max_retries FROM tasks | |
| WHERE status IN ('failed', 'blocked') | |
| AND repo = '$(sql_escape "$repo_path")'; | |
| " 2>/dev/null || echo "") | |
| failed_tasks=$(db -separator '|' "$SUPERVISOR_DB" " | |
| SELECT id, status, error, retries, max_retries FROM tasks | |
| WHERE status IN ('failed', 'blocked') | |
| AND repo = '$(sql_escape "$repo_path")'; | |
| " || echo "") |
References
- Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging. This rule applies generally, with specific exceptions for non-blocking SQLite writes that may fail due to DB lock or missing columns, which is not the case here for a SELECT operation.
| blocker_failed_status=$(db "$SUPERVISOR_DB" \ | ||
| "SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \ | ||
| 2>/dev/null || echo "") |
There was a problem hiding this comment.
Suppressing stderr with 2>/dev/null can hide important underlying errors from db (like SQL syntax errors or connection issues) and violates general repository rules. For SELECT operations, 2>/dev/null should generally be avoided to ensure diagnostic information is visible for debugging. The || echo "" guard is sufficient to handle cases where no rows are returned or the command fails gracefully. Please remove 2>/dev/null to allow genuine errors to be visible for debugging. This applies to other similar calls in this file (lines 1211, 1212).
| blocker_failed_status=$(db "$SUPERVISOR_DB" \ | |
| "SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \ | |
| 2>/dev/null || echo "") | |
| blocker_failed_status=$(db "$SUPERVISOR_DB" \ | |
| "SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \ | |
| || echo "") |
References
- Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging. This rule applies generally, with specific exceptions for non-blocking SQLite writes that may fail due to DB lock or missing columns, which is not the case here for a SELECT operation.
| if [[ "$queued_before_sanity" -eq 0 ]]; then | ||
| # No queued tasks — check if there are open tasks in TODO.md that should be | ||
| if [[ -n "$all_repos" ]]; then | ||
| while IFS= read -r repo_path; do | ||
| if [[ -f "$repo_path/TODO.md" ]]; then | ||
| local open_count | ||
| open_count=$(grep -cE '^\s*- \[ \] t[0-9]+' "$repo_path/TODO.md" 2>/dev/null || echo 0) | ||
| if [[ "$open_count" -gt 0 ]]; then | ||
| local sanity_fixed | ||
| sanity_fixed=$(run_sanity_check "$repo_path" 2>>"$SUPERVISOR_LOG") | ||
| if [[ "${sanity_fixed:-0}" -gt 0 ]]; then | ||
| log_info "Phase 0.9: Sanity check fixed $sanity_fixed issue(s) in $repo_path — re-running auto-pickup" | ||
| cmd_auto_pickup --repo "$repo_path" 2>>"$SUPERVISOR_LOG" || true | ||
| fi | ||
| fi | ||
| fi | ||
| done <<<"$all_repos" | ||
| else | ||
| local cwd_todo | ||
| cwd_todo="$(pwd)/TODO.md" | ||
| if [[ -f "$cwd_todo" ]]; then | ||
| local open_count | ||
| open_count=$(grep -cE '^\s*- \[ \] t[0-9]+' "$cwd_todo" 2>/dev/null || echo 0) | ||
| if [[ "$open_count" -gt 0 ]]; then | ||
| local sanity_fixed | ||
| sanity_fixed=$(run_sanity_check "$(pwd)" 2>>"$SUPERVISOR_LOG") | ||
| if [[ "${sanity_fixed:-0}" -gt 0 ]]; then | ||
| log_info "Phase 0.9: Sanity check fixed $sanity_fixed issue(s) — re-running auto-pickup" | ||
| cmd_auto_pickup --repo "$(pwd)" 2>>"$SUPERVISOR_LOG" || true | ||
| fi | ||
| fi | ||
| fi | ||
| fi | ||
| fi |
There was a problem hiding this comment.
The logic inside the if [[ -n "$all_repos" ]] block and the else block is nearly identical. This duplication can be avoided by refactoring the common logic into a helper function. This would make the code more maintainable and easier to read.
_run_sanity_check_for_repo() {
local repo_path="$1"
local todo_file="$repo_path/TODO.md"
if [[ ! -f "$todo_file" ]]; then
return
fi
local open_count
open_count=$(grep -cE '^\s*- \[ \] t[0-9]+' "$todo_file" || echo 0)
if [[ "$open_count" -gt 0 ]]; then
local sanity_fixed
sanity_fixed=$(run_sanity_check "$repo_path" 2>>"$SUPERVISOR_LOG")
if [[ "${sanity_fixed:-0}" -gt 0 ]]; then
log_info "Phase 0.9: Sanity check fixed $sanity_fixed issue(s) in $repo_path — re-running auto-pickup"
cmd_auto_pickup --repo "$repo_path" 2>>"$SUPERVISOR_LOG" || true
fi
fi
}
if [[ "$queued_before_sanity" -eq 0 ]]; then
# No queued tasks — check if there are open tasks in TODO.md that should be
if [[ -n "$all_repos" ]]; then
while IFS= read -r repo_path; do
_run_sanity_check_for_repo "$repo_path"
done <<<"$all_repos"
else
_run_sanity_check_for_repo "$(pwd)"
fi
fiReferences
- In shell scripts, extract repeated logic into an internal helper function to improve maintainability. This applies even for standalone scripts where external
sourcedependencies are avoided.
| # 0 on success | ||
| ####################################### | ||
| run_sanity_check() { | ||
| local repo_path="${1:-$REPO_PATH}" |
There was a problem hiding this comment.
The repository rules specify declaring and assigning local variables separately for exit code safety. This function, and others in this file, use local var="$1". Please update this to follow the rule.
| local repo_path="${1:-$REPO_PATH}" | |
| local repo_path | |
| repo_path="${1:-$REPO_PATH}" |
References
- When declaring and assigning a local variable in shell scripts, separate the 'local' declaration from the assignment if command substitution is involved, to ensure exit code safety.
| if [[ "$blocked_by" == "$blocker_id" ]]; then | ||
| local escaped_blocker | ||
| escaped_blocker=$(printf '%s' "$blocker_id" | sed 's/\./\\./g') | ||
| sed_inplace "${line_num}s/ blocked-by:${escaped_blocker}//" "$todo_file" | ||
| else | ||
| # Multiple blockers — remove just this one | ||
| local escaped_blocker | ||
| escaped_blocker=$(printf '%s' "$blocker_id" | sed 's/\./\\./g') | ||
| # Remove blocker_id from comma-separated list | ||
| sed_inplace "${line_num}s/blocked-by:\([^ ]*\)${escaped_blocker},\?/blocked-by:\1/" "$todo_file" | ||
| # Clean up trailing/leading commas | ||
| sed_inplace "${line_num}s/blocked-by:,/blocked-by:/" "$todo_file" | ||
| sed_inplace "${line_num}s/,\( \)/\1/" "$todo_file" | ||
| fi |
There was a problem hiding this comment.
This series of sed commands to remove a specific blocker from a comma-separated list is quite complex and potentially fragile. A more robust and readable approach is to manipulate the string in bash and then replace the entire blocked-by field. Consider refactoring to this more maintainable pattern.
| if [[ "$blocked_by" == "$blocker_id" ]]; then | |
| local escaped_blocker | |
| escaped_blocker=$(printf '%s' "$blocker_id" | sed 's/\./\\./g') | |
| sed_inplace "${line_num}s/ blocked-by:${escaped_blocker}//" "$todo_file" | |
| else | |
| # Multiple blockers — remove just this one | |
| local escaped_blocker | |
| escaped_blocker=$(printf '%s' "$blocker_id" | sed 's/\./\\./g') | |
| # Remove blocker_id from comma-separated list | |
| sed_inplace "${line_num}s/blocked-by:\([^ ]*\)${escaped_blocker},\?/blocked-by:\1/" "$todo_file" | |
| # Clean up trailing/leading commas | |
| sed_inplace "${line_num}s/blocked-by:,/blocked-by:/" "$todo_file" | |
| sed_inplace "${line_num}s/,\( \)/\1/" "$todo_file" | |
| fi | |
| # Example | |
| local all_blockers="t1,t2,t3" | |
| local blocker_to_remove="t2" | |
| # Robustly remove the item | |
| local new_blockers | |
| new_blockers=$(echo ",${all_blockers}," | sed "s/,${blocker_to_remove},/,/" | sed 's/^,//;s/,$//') | |
| # Now use sed to replace the whole field | |
| if [[ -n "$new_blockers" ]]; then | |
| sed_inplace "${line_num}s/blocked-by:[^ ]*/blocked-by:${new_blockers}/" "$todo_file" | |
| else | |
| sed_inplace "${line_num}s/ blocked-by:[^ ]*//" "$todo_file" | |
| fi |
…ness (t1273) Address valid review feedback: - Extract _run_sanity_check_for_repo() to deduplicate multi-repo/single-repo code paths in Phase 0.9 - Replace fragile multi-step sed for comma-separated blocker removal with robust string rebuild approach
marcusquinn
left a comment
There was a problem hiding this comment.
Review response
Addressed (valid feedback)
- Duplicated code in pulse.sh: Extracted
_run_sanity_check_for_repo()helper function — good catch - Fragile sed for comma-separated blocker removal: Replaced with robust string rebuild approach (
",$list," | sed "s/,$item,/,/")
Dismissed (incorrect or inconsistent with codebase conventions)
2>/dev/null on SELECT operations: This is the established convention across the entire supervisor codebase (279 occurrences in pulse.sh alone, 33 in todo-sync.sh, 98 in evaluate.sh). The db() wrapper can emit SQLite warnings on locked databases during concurrent access — these are expected in a multi-process supervisor and not actionable errors. Removing 2>/dev/null would flood the supervisor log with noise during normal operation. The || echo "" fallback handles actual failures.
local var="$1" pattern: The Gemini suggestion to separate declaration from assignment applies to command substitutions where you need to capture exit codes (e.g., local x; x=$(some_cmd)). For simple parameter assignments like local repo_path="$1", there is no exit code to capture — $1 is a variable expansion, not a command. This is the established pattern throughout the codebase (used in every function in todo-sync.sh, pulse.sh, etc.).
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. WalkthroughAdds a Phase 0.9 supervisor sanity-check: a new Changes
Sequence DiagramsequenceDiagram
participant Pulse as Pulse Cycle
participant Sanity as Sanity Check
participant TODO as TODO.md
participant DB as Database
participant Repo as Git Repo
Pulse->>Pulse: Run phases 0.0–0.8
Pulse->>Pulse: If queue empty, enter Phase 0.9
Pulse->>Sanity: _run_sanity_check_for_repo(repo)
Sanity->>TODO: Open TODO.md
Sanity->>DB: Query task rows & statuses
Sanity->>Sanity: _check_db_failed_with_claims()
DB-->>Sanity: Stale claims / failed rows
Sanity->>Sanity: _check_failed_blocker_chains()
DB-->>Sanity: Blocker status (failed/complete)
Sanity->>TODO: Edit TODO (remove blocker, add `#auto-dispatch`, etc.)
Sanity->>DB: _check_db_orphans() -> cancel orphan rows
Sanity->>Repo: commit_and_push_todo (if edited)
Sanity->>Pulse: Return fixed count
alt fixes > 0
Sanity->>Pulse: trigger cmd_auto_pickup(repo)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 20 16:16:52 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (3)
.agents/scripts/supervisor/todo-sync.sh (1)
1188-1218: Consolidate the two separate DB guard blocks into a singleSELECT status, retries, max_retriesquery.The existing block (lines 1188–1200) and the new block (lines 1202–1218) each open a separate sqlite3 connection for the same
blocker_id. In the worst case this is 4 DB calls per blocker (status check → failed-status recheck → retries → max_retries). A single query handles both fast-paths atomically, also eliminating the empty-string hazard noted above.The
WHERE … AND status = 'failed'predicate on line 1207 makes the subsequent== "failed"comparison redundant — a non-empty result is always"failed".♻️ Proposed refactor — single DB round-trip for both fast-paths
- # t1247: DB fallback — blocker may be deployed/verified in DB but TODO.md - # not yet updated … - if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then - local blocker_db_status="" - blocker_db_status=$(db "$SUPERVISOR_DB" \ - "SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' LIMIT 1;" \ - 2>/dev/null || echo "") - if [[ "$blocker_db_status" == "complete" || - "$blocker_db_status" == "deployed" || - "$blocker_db_status" == "verified" || - "$blocker_db_status" == "merged" ]]; then - log_verbose " auto-unblock: blocker $blocker_id is '$blocker_db_status' in DB (TODO.md not yet updated) — treating as resolved" - continue # Resolved in DB - fi - fi - - # Check if blocker is permanently failed in DB (retries exhausted) - # A failed blocker will never complete — don't let it block dependents forever - if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then - local blocker_failed_status="" - blocker_failed_status=$(db "$SUPERVISOR_DB" \ - "SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \ - 2>/dev/null || echo "") - if [[ "$blocker_failed_status" == "failed" ]]; then - local blocker_retries_left blocker_max_retries_left - blocker_retries_left=$(db "$SUPERVISOR_DB" "SELECT retries FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0") - blocker_max_retries_left=$(db "$SUPERVISOR_DB" "SELECT max_retries FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0") - if [[ "$blocker_retries_left" -ge "$blocker_max_retries_left" ]]; then - log_verbose " auto-unblock: blocker $blocker_id is permanently failed in DB ($blocker_retries_left/$blocker_max_retries_left retries) — treating as resolved" - continue # Permanently failed = treat as resolved - fi - fi - fi + # t1247 / t1273: DB fast-path — single query covers resolved states and + # permanently-failed blockers (retries exhausted), keeping the loop to + # one sqlite3 round-trip per blocker. + if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then + local _blocker_row="" _bd_status="" _bd_retries="" _bd_max_retries="" + _blocker_row=$(db -separator '|' "$SUPERVISOR_DB" \ + "SELECT status, retries, max_retries FROM tasks WHERE id = '$(sql_escape "$blocker_id")' LIMIT 1;" \ + 2>/dev/null || echo "") + if [[ -n "$_blocker_row" ]]; then + IFS='|' read -r _bd_status _bd_retries _bd_max_retries <<<"$_blocker_row" + # Resolved states: treat as unblocked + if [[ "$_bd_status" == "complete" || "$_bd_status" == "deployed" || + "$_bd_status" == "verified" || "$_bd_status" == "merged" ]]; then + log_verbose " auto-unblock: blocker $blocker_id is '$_bd_status' in DB (TODO.md not yet updated) — treating as resolved" + continue + fi + # Permanently failed: retries exhausted — treat as resolved + if [[ "$_bd_status" == "failed" && "${_bd_retries:-0}" -ge "${_bd_max_retries:-0}" ]]; then + log_verbose " auto-unblock: blocker $blocker_id is permanently failed in DB (${_bd_retries:-0}/${_bd_max_retries:-0} retries) — treating as resolved" + continue + fi + fi + fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/todo-sync.sh around lines 1188 - 1218, Combine the two DB guard blocks into a single DB call: use db "$SUPERVISOR_DB" "SELECT status, retries, max_retries FROM tasks WHERE id = '$(sql_escape "$blocker_id")' LIMIT 1;" and parse the three fields into local variables (e.g., blocker_status, blocker_retries_left, blocker_max_retries_left). Then apply the same logic: if blocker_status is one of complete|deployed|verified|merged -> log auto-unblock and continue; else if blocker_status == failed and blocker_retries_left -ge blocker_max_retries_left -> log permanently failed and continue. Remove the separate blocker_db_status/blocker_failed_status queries and the redundant WHERE ... AND status='failed' clause to eliminate multiple sqlite3 connections per blocker..agents/scripts/supervisor/sanity-check.sh (2)
205-262: Uselocal IFSinstead of manual save/restore for safer scoping.The static analysis flags the IFS manipulation on lines 206 and 262. The current save/restore pattern works, but if an unexpected
continue,return, or error occurs between lines 206 and 262, IFS remains modified for the rest of the function. Usinglocal IFS=','scopes it to the enclosing function, which is actually safe here since the outerwhile IFS= readsets IFS per-invocation.Proposed simplification
- local _saved_ifs="$IFS" - IFS=',' + local IFS=',' for blocker_id in $blocked_by; do ... done - IFS="$_saved_ifs"Note:
local IFSin bash scopes the variable to the function and restores the previous value when the function returns. SinceIFSis only needed for theforloop's word splitting, this is safe — the outerwhile IFS= readloop sets IFS per-readinvocation regardless.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/sanity-check.sh around lines 205 - 262, Replace the manual save/restore IFS pattern with a function-scoped IFS by removing the _saved_ifs/restore code and declaring local IFS=',' immediately before the for blocker_id in $blocked_by loop; update the block that iterates over blocked_by (which references variables like blocked_by, blocker_id, task_id, todo_file, cmd_reset, sed_inplace) so it uses local IFS=',' and no longer restores IFS at the end.
335-354:sedinsertion targets the first#tag, not the last as the comment states.Line 337 comment says "after the last
#tagbefore any —" but the sed on line 338 replaces the first match of#tag\s→#tag#auto-dispatch\s. For a line like#bugfix#security— desc, this inserts between the first and second tags rather than after the last. Functionally harmless since tag order doesn't affect dispatch, but the comment is misleading.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/sanity-check.sh around lines 335 - 354, The comment claims the code inserts "#auto-dispatch" after the last tag but the sed invocation in sed_inplace targets the first tag; fix by either updating the comment to state "insert before the first `#tag` (or at end of tags)" or change the sed expression in sed_inplace to match the last tag before the separator (e.g., use a greedy prefix capture so the replacement targets the final #[a-zA-Z0-9_-]* token before the space/—); refer to the sed_inplace call and the surrounding variables todo_file, line_num and task_id when applying the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1966-1983: The inline comment in the Phase 0.9 block is truncated;
update the comment that starts with "No queued tasks — check if there are open
tasks in TODO.md that should be" to a complete phrase (e.g., "...that should be
queued" or "...that should be dispatchable") to clarify intent; locate the block
around the queued_before_sanity variable and the _run_sanity_check_for_repo
calls and replace the truncated comment with the full, unambiguous sentence.
In @.agents/scripts/supervisor/sanity-check.sh:
- Around line 244-254: The sed substitution building new_blockers uses an
unescaped $blocker_id which can contain regex metacharacters (e.g., dots);
update the construction of new_blockers so $blocker_id is escaped for use in the
sed s/// pattern (mirror the escaping approach used for $blocked_by), i.e.,
create an escaped_blocker_id variable from blocker_id before calling sed and use
that escaped variable in the sed "s/,${escaped_blocker_id},/,/" invocation;
ensure this change is applied where new_blockers is computed and keep the
existing sed_inplace and later escaped_blocked_by usage unchanged.
- Around line 44-106: run_sanity_check collects integer results from the helper
functions (_check_db_failed_with_claims, _check_failed_blocker_chains,
_check_missing_auto_dispatch, _check_db_orphans) using command substitution, but
side-effect commands inside those helpers (cmd_unclaim, cmd_reset,
cmd_transition, commit_and_push_todo, etc.) may emit stdout and corrupt the
numeric return values; update each _check_* helper (and the calls to
commit_and_push_todo) to redirect all non-essential stdout/stderr from
side-effect commands into the supervisor log (use "$SUPERVISOR_LOG") or
/dev/null so the helper functions only emit their final numeric result on
stdout, ensuring run_sanity_check's arithmetic (fixed=$((fixed + ...))) receives
clean integers.
- Around line 131-172: The loop reads retries/max_retries into
db_retries/db_max_retries but the DB query can return NULL, causing the integer
comparison in [[ "$db_retries" -lt "$db_max_retries" ]] to fail; fix by ensuring
non-null defaults: update the query that produces failed_tasks to use
COALESCE(retries,0) AS retries and COALESCE(max_retries,0) AS max_retries, or
alternately immediately after the read set db_retries=${db_retries:-0} and
db_max_retries=${db_max_retries:-0} so the comparison in the loop (and the
variables used elsewhere) always sees integers.
In @.agents/scripts/supervisor/todo-sync.sh:
- Around line 1211-1213: The numeric comparison can fail because
blocker_retries_left and blocker_max_retries_left may be empty strings when the
sqlite query returns no rows; change the code that sets blocker_retries_left and
blocker_max_retries_left (the db "SELECT retries..." and "SELECT max_retries..."
calls around variables blocker_retries_left and blocker_max_retries_left in
todo-sync.sh) to coerce empty output to 0 (for example by wrapping the db output
with a default-to-zero check) before doing the [[ ... -ge ... ]] test, or
better, replace the two separate queries with a single consolidated query that
returns both retries and max_retries (and defaults missing values to 0) to avoid
a race between the queries and ensure the numeric comparison never receives an
empty string.
---
Nitpick comments:
In @.agents/scripts/supervisor/sanity-check.sh:
- Around line 205-262: Replace the manual save/restore IFS pattern with a
function-scoped IFS by removing the _saved_ifs/restore code and declaring local
IFS=',' immediately before the for blocker_id in $blocked_by loop; update the
block that iterates over blocked_by (which references variables like blocked_by,
blocker_id, task_id, todo_file, cmd_reset, sed_inplace) so it uses local IFS=','
and no longer restores IFS at the end.
- Around line 335-354: The comment claims the code inserts "#auto-dispatch"
after the last tag but the sed invocation in sed_inplace targets the first tag;
fix by either updating the comment to state "insert before the first `#tag` (or at
end of tags)" or change the sed expression in sed_inplace to match the last tag
before the separator (e.g., use a greedy prefix capture so the replacement
targets the final #[a-zA-Z0-9_-]* token before the space/—); refer to the
sed_inplace call and the surrounding variables todo_file, line_num and task_id
when applying the change.
In @.agents/scripts/supervisor/todo-sync.sh:
- Around line 1188-1218: Combine the two DB guard blocks into a single DB call:
use db "$SUPERVISOR_DB" "SELECT status, retries, max_retries FROM tasks WHERE id
= '$(sql_escape "$blocker_id")' LIMIT 1;" and parse the three fields into local
variables (e.g., blocker_status, blocker_retries_left,
blocker_max_retries_left). Then apply the same logic: if blocker_status is one
of complete|deployed|verified|merged -> log auto-unblock and continue; else if
blocker_status == failed and blocker_retries_left -ge blocker_max_retries_left
-> log permanently failed and continue. Remove the separate
blocker_db_status/blocker_failed_status queries and the redundant WHERE ... AND
status='failed' clause to eliminate multiple sqlite3 connections per blocker.
…ping (t1273)
- Redirect stdout of side-effect commands (cmd_unclaim, cmd_reset,
cmd_transition, commit_and_push_todo) to SUPERVISOR_LOG inside _check_*
functions to prevent stdout leakage into $() arithmetic capture
- Add COALESCE for retries/max_retries in SQL queries to handle NULL values
- Add empty-string guards (${var:-0}) for race condition protection
- Escape blocker_id dots in sed regex for task IDs like t215.3
- Fix truncated comment in pulse.sh Phase 0.9
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 20 16:26:01 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
.agents/scripts/supervisor/pulse.sh (1)
1977-1982:_run_sanity_check_for_repocalled without|| true— inconsistent with every other phase helper incmd_pulseEvery other helper call in
cmd_pulseis guarded with|| true(e.g., lines 718, 781, 851, 871, 889, 957, 1957). When_run_sanity_check_for_reporeaches its innerif [[ "${sanity_fixed:-0}" -gt 0 ]]and the condition is false, the function's last-executed command exits1. While thewhileloop absorbs exit codes at runtime, the omission is inconsistent and could surprise future maintainers or cause subtle issues ifset -eis ever introduced.♻️ Proposed fix — add `|| true` for consistency
while IFS= read -r repo_path; do - _run_sanity_check_for_repo "$repo_path" + _run_sanity_check_for_repo "$repo_path" || true done <<<"$all_repos" else - _run_sanity_check_for_repo "$(pwd)" + _run_sanity_check_for_repo "$(pwd)" || true fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh around lines 1977 - 1982, The call to _run_sanity_check_for_repo inside cmd_pulse is missing the customary "|| true" guard used by all other phase helpers; update both invocation sites (the while loop body that iterates over repo_path and the else branch that calls _run_sanity_check_for_repo "$(pwd)") to append "|| true" so the helper's non-zero exit (e.g., when sanity_fixed is 0) won't propagate and to keep behavior consistent with other helpers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/sanity-check.sh:
- Line 136: The grep ERE patterns use unescaped task_id (so dots match any char)
in the four _check_* functions; create an escaped_task_id variable (like the
existing escaped_blocker_id) by replacing '.' with '\.' and use escaped_task_id
in the grep -E patterns (e.g., the lines that currently reference "${task_id}")
before each first grep call in the functions that use task_id (the sites at the
grep calls around the current task_id uses on lines noted in the review). This
ensures patterns match literal dots in IDs and prevents accidental matches of
adjacent IDs.
- Around line 205-268: The function currently mutates global IFS (saved in
_saved_ifs) and restores it later, which can be skipped if an early exit occurs;
change the for-loop that iterates over blocked_by to use a safe array split: use
IFS=',' read -r -a _blocker_list <<< "$blocked_by" and then iterate with for
blocker_id in "${_blocker_list[@]}"; ensure all references to $blocker_id and
$blocked_by remain quoted where used (e.g., in db calls, grep, sed_inplace and
the rebuilding logic), remove the _saved_ifs/IFS restore code, and keep the
existing logic for resetting/unblocking (cmd_reset, sed_inplace, log_warn,
fixed, task_id, todo_file) intact so no global IFS mutation can persist on early
exit.
In @.agents/scripts/supervisor/todo-sync.sh:
- Around line 1202-1221: The three separate `db` queries (using db, sql_escape
and blocker_id) create a TOCTOU and a fallback bug where
blocker_max_retries_left can become "0" if `db` fails; replace the three queries
with a single `db` query that SELECTs status, COALESCE(retries,0),
COALESCE(max_retries,3) FROM tasks WHERE id = '...' AND status = 'failed' LIMIT
1, then parse the single result (e.g., IFS='|' read -r _pf_status
blocker_retries_left blocker_max_retries_left) and treat an empty result as "not
permanently failed"; keep the existing guards blocker's retries defaults
(blocker_retries_left:-0, blocker_max_retries_left:-3) and the existing
comparison and log_verbose call (using blocker_id) so we only auto-unblock when
the row exists, status is failed, and retries >= max_retries.
---
Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1977-1982: The call to _run_sanity_check_for_repo inside cmd_pulse
is missing the customary "|| true" guard used by all other phase helpers; update
both invocation sites (the while loop body that iterates over repo_path and the
else branch that calls _run_sanity_check_for_repo "$(pwd)") to append "|| true"
so the helper's non-zero exit (e.g., when sanity_fixed is 0) won't propagate and
to keep behavior consistent with other helpers.
| # Check if blocker is permanently failed in DB (retries exhausted) | ||
| # A failed blocker will never complete — don't let it block dependents forever | ||
| if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then | ||
| local blocker_failed_status="" | ||
| blocker_failed_status=$(db "$SUPERVISOR_DB" \ | ||
| "SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \ | ||
| 2>/dev/null || echo "") | ||
| if [[ "$blocker_failed_status" == "failed" ]]; then | ||
| local blocker_retries_left blocker_max_retries_left | ||
| blocker_retries_left=$(db "$SUPERVISOR_DB" "SELECT COALESCE(retries, 0) FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0") | ||
| blocker_max_retries_left=$(db "$SUPERVISOR_DB" "SELECT COALESCE(max_retries, 3) FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0") | ||
| # Guard against empty strings from race conditions | ||
| blocker_retries_left="${blocker_retries_left:-0}" | ||
| blocker_max_retries_left="${blocker_max_retries_left:-3}" | ||
| if [[ "$blocker_retries_left" -ge "$blocker_max_retries_left" ]]; then | ||
| log_verbose " auto-unblock: blocker $blocker_id is permanently failed in DB ($blocker_retries_left/$blocker_max_retries_left retries) — treating as resolved" | ||
| continue # Permanently failed = treat as resolved | ||
| fi | ||
| fi | ||
| fi |
There was a problem hiding this comment.
Inconsistent max_retries fallback when db fails; consolidating the three queries eliminates both the TOCTOU and the inconsistency.
|| echo "0" on line 1212 fires when db exits non-zero, setting blocker_max_retries_left="0". Because "0" is not an empty string, the :-3 guard on line 1215 never substitutes. If db is briefly unavailable for queries 2–3 (e.g., SQLite BUSY after query 1 succeeds), the comparison becomes [[ 0 -ge 0 ]] → true, and the blocker is incorrectly treated as permanently failed — prematurely unblocking a dependent task.
A single query that combines the status filter with the retry columns eliminates the three-query TOCTOU window and makes the guard logic uniform:
♻️ Proposed fix — single-query consolidation
- # Check if blocker is permanently failed in DB (retries exhausted)
- # A failed blocker will never complete — don't let it block dependents forever
- if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
- local blocker_failed_status=""
- blocker_failed_status=$(db "$SUPERVISOR_DB" \
- "SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \
- 2>/dev/null || echo "")
- if [[ "$blocker_failed_status" == "failed" ]]; then
- local blocker_retries_left blocker_max_retries_left
- blocker_retries_left=$(db "$SUPERVISOR_DB" "SELECT COALESCE(retries, 0) FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
- blocker_max_retries_left=$(db "$SUPERVISOR_DB" "SELECT COALESCE(max_retries, 3) FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
- # Guard against empty strings from race conditions
- blocker_retries_left="${blocker_retries_left:-0}"
- blocker_max_retries_left="${blocker_max_retries_left:-3}"
- if [[ "$blocker_retries_left" -ge "$blocker_max_retries_left" ]]; then
- log_verbose " auto-unblock: blocker $blocker_id is permanently failed in DB ($blocker_retries_left/$blocker_max_retries_left retries) — treating as resolved"
- continue # Permanently failed = treat as resolved
- fi
- fi
- fi
+ # Check if blocker is permanently failed in DB (retries exhausted)
+ # A failed blocker will never complete — don't let it block dependents forever
+ if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
+ local _pf_row _pf_retries _pf_max
+ _pf_row=$(db -separator '|' "$SUPERVISOR_DB" "
+ SELECT COALESCE(retries, 0), COALESCE(max_retries, 3) FROM tasks
+ WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed';
+ " 2>/dev/null || echo "")
+ if [[ -n "$_pf_row" ]]; then
+ IFS='|' read -r _pf_retries _pf_max <<< "$_pf_row"
+ _pf_retries="${_pf_retries:-0}"
+ _pf_max="${_pf_max:-3}"
+ if [[ "$_pf_retries" -ge "$_pf_max" ]]; then
+ log_verbose " auto-unblock: blocker $blocker_id is permanently failed in DB ($_pf_retries/$_pf_max retries) — treating as resolved"
+ continue # Permanently failed = treat as resolved
+ fi
+ fi
+ fiThe WHERE … AND status = 'failed' filter means an empty _pf_row unambiguously means "not permanently failed" (either task absent, not failed, or db itself failed). The IFS='|' read is command-prefix scoped and does not mutate global IFS.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.agents/scripts/supervisor/todo-sync.sh around lines 1202 - 1221, The three
separate `db` queries (using db, sql_escape and blocker_id) create a TOCTOU and
a fallback bug where blocker_max_retries_left can become "0" if `db` fails;
replace the three queries with a single `db` query that SELECTs status,
COALESCE(retries,0), COALESCE(max_retries,3) FROM tasks WHERE id = '...' AND
status = 'failed' LIMIT 1, then parse the single result (e.g., IFS='|' read -r
_pf_status blocker_retries_left blocker_max_retries_left) and treat an empty
result as "not permanently failed"; keep the existing guards blocker's retries
defaults (blocker_retries_left:-0, blocker_max_retries_left:-3) and the existing
comparison and log_verbose call (using blocker_id) so we only auto-unblock when
the row exists, status is failed, and retries >= max_retries.
…k consistency (t1273) Address CodeRabbit round 2: - Escape dots in task_id for all grep ERE patterns across all _check_* functions (t215.3 would match t21503 without escaping) - Replace IFS=',' mutation with read-based splitting via tr to prevent IFS corruption on early exit under set -euo pipefail - Fix max_retries fallback: use empty string + :-3 guard instead of || echo '0' which would falsely trigger 'retries exhausted' on db failure
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 20 16:42:54 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|



Summary
Problem
The supervisor pulse was running every 2 minutes, logging "No new tasks to pick up" hundreds of times, while 13 open tasks sat idle. Three interlocking failures caused the stall:
assignee:/started:in TODO.md. Phase 0.5e has a 24h threshold, but the DB already knows the task failed — waiting 24h is unnecessary[x]or[-]status#auto-dispatchtag: Tasks with clear specs, model assignments, and estimates were invisible to auto-pickup because they lacked the tagChanges
New:
.agents/scripts/supervisor/sanity-check.shPhase 0.9 module with 5 checks:
#auto-dispatchModified:
.agents/scripts/supervisor/pulse.shModified:
.agents/scripts/supervisor/todo-sync.shauto_unblock_resolved_tasks) now checks if a blocker is permanently failed in DB (retries >= max_retries)Modified:
.agents/scripts/supervisor-helper.shsanity-check.shmoduleDesign Principle
"Don't assume, verify." Deterministic state machines make assumptions that cause silent failures. When state is contradictory, cross-reference all available data sources to resolve it rather than waiting for timeouts.
Testing
Summary by CodeRabbit