t1273: Add supervisor sanity-check for self-healing queue stalls by marcusquinn · Pull Request #2015 · marcusquinn/aidevops

marcusquinn · 2026-02-20T15:22:53Z

Summary

Adds Phase 0.9 sanity check to the supervisor pulse cycle — runs when zero tasks are dispatchable but open tasks exist in TODO.md
Cross-references DB state, TODO.md state, and system state to find contradictions that cause silent queue stalls
Fixes Phase 0.5d to treat permanently failed blockers as resolved, preventing infinite dependency chain stalls

Problem

The supervisor pulse was running every 2 minutes, logging "No new tasks to pick up" hundreds of times, while 13 open tasks sat idle. Three interlocking failures caused the stall:

DB-failed tasks with TODO.md claims (double-lock): Tasks had failed in the DB but still carried assignee:/started: claims in TODO.md. Phase 0.5e only strips such claims after a 24h threshold, but the DB already records the failure, so the 24h wait is unnecessary.
Failed dependency chains never cascade: When a root task fails permanently, its dependents wait forever because Phase 0.5d only checks blockers for [x] or [-] status.
Missing #auto-dispatch tag: Tasks with clear specs, model assignments, and estimates were invisible to auto-pickup because they lacked the tag.

Changes

New: `.agents/scripts/supervisor/sanity-check.sh`

Phase 0.9 module with 5 checks:

Check 1: Strip stale claims on DB-failed tasks immediately (DB state is authoritative)
Check 2: Unblock tasks whose blockers permanently failed (retries exhausted)
Check 3: Auto-tag dispatchable tasks missing #auto-dispatch
Check 4: Cancel DB orphans with no TODO.md entry
Check 5: Log structured stall breakdown when nothing is fixable
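The phase's overall shape can be pictured as a loop that sums what each check fixed. A minimal sketch, assuming each check prints a fixed-issue count on stdout and sends diagnostics to stderr, which matches how pulse.sh captures `sanity_fixed=$(run_sanity_check "$repo_path" 2>>"$SUPERVISOR_LOG")`. The `stub_check_*` functions and `sanity_check_sketch` are hypothetical; the PR's real helpers are the `_check_*` functions named in the CodeRabbit walkthrough, whose exact interfaces are not shown here.

```shell
#!/usr/bin/env bash
# Sketch only. The stub checks are hypothetical stand-ins; the assumed
# contract is "fixed-issue count on stdout, diagnostics on stderr".

stub_check_claims() {
	echo "stripped 2 stale claims" >&2
	echo 2
}
stub_check_blockers() { echo 0; }
stub_check_tags() { echo 1; }
stub_check_orphans() { :; } # prints nothing; treated as 0 below

# Hypothetical aggregator: sums per-check counts and, when nothing was
# fixable, writes a stall note to stderr (the role Check 5 plays).
sanity_check_sketch() {
	local repo_path="$1"
	local total=0 n check
	for check in stub_check_claims stub_check_blockers stub_check_tags stub_check_orphans; do
		n=$("$check" "$repo_path")
		total=$((total + ${n:-0}))
	done
	if ((total == 0)); then
		echo "stall: nothing fixable in $repo_path" >&2
	fi
	echo "$total"
}

# Only the number is captured; the diagnostics are discarded here.
fixed=$(sanity_check_sketch "/tmp/example-repo" 2>/dev/null)
echo "fixed=$fixed"
```

Keeping the count on stdout and everything else on stderr is what lets the caller append diagnostics to a log while still reading a clean integer.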

Modified: `.agents/scripts/supervisor/pulse.sh`

Integrates Phase 0.9 between Phase 1d (post-completion unblock) and Phase 2 (dispatch)
Only runs when zero tasks are queued but open tasks exist
Re-runs auto-pickup after fixes so newly dispatchable tasks enter the queue immediately

Modified: `.agents/scripts/supervisor/todo-sync.sh`

Phase 0.5d (auto_unblock_resolved_tasks) now checks if a blocker is permanently failed in DB (retries >= max_retries)
Permanently failed blockers are treated as resolved, unblocking dependents
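Taken together with the existing resolved-status fallback, the unblock rule fits in one predicate. A minimal sketch, assuming the status names used in the todo-sync.sh diff quoted later in this thread (complete, deployed, verified, merged, failed). `blocker_is_resolved` is a hypothetical name. Missing retry counts default to 0, as in CodeRabbit's proposed refactor, so a failed row with no retry data counts as exhausted.

```shell
#!/usr/bin/env bash
# Hypothetical predicate: exit status 0 means "this blocker no longer blocks".
blocker_is_resolved() {
	local status="$1"
	local retries="${2:-0}"
	local max_retries="${3:-0}"
	case "$status" in
	complete | deployed | verified | merged)
		return 0
		;;
	failed)
		# Permanently failed (retries exhausted): it will never complete,
		# so dependents should stop waiting on it.
		[[ "$retries" -ge "$max_retries" ]]
		return
		;;
	esac
	return 1
}

for args in "merged 0 3" "failed 3 3" "failed 1 3" "running 0 3"; do
	# Intentional word splitting of each "status retries max" tuple.
	# shellcheck disable=SC2086
	if blocker_is_resolved $args; then
		echo "$args -> resolved"
	else
		echo "$args -> still blocking"
	fi
done
```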

Modified: `.agents/scripts/supervisor-helper.sh`

Sources the new sanity-check.sh module

Design Principle

"Don't assume, verify." Deterministic state machines make assumptions that cause silent failures. When state is contradictory, cross-reference all available data sources to resolve it rather than waiting for timeouts.

Testing

ShellCheck: zero violations on all 4 files
Designed to be triggered by the current stalled queue (0 queued, 0 running, 13 open tasks)

Summary by CodeRabbit

Chores
- Supervisor now runs an automated sanity-check phase when the queue is idle.
- Automatically detects and remediates stale, blocked, or inconsistent task states across repos.
- Treats permanently failed blockers as resolved to prevent indefinite blocking.
- Adds automatic tagging for dispatchable-looking tasks missing auto-dispatch.
- Logs structured stall summaries to improve visibility into why the queue is blocked.

Phase 0.9 runs when zero tasks are dispatchable but open tasks exist:
- Check 1: Strip stale claims on DB-failed tasks (don't wait 24h)
- Check 2: Unblock tasks whose blockers permanently failed
- Check 3: Auto-tag dispatchable tasks missing #auto-dispatch
- Check 4: Cancel DB orphans with no TODO.md entry
- Log structured stall breakdown when nothing is fixable

Also fixes Phase 0.5d to treat permanently failed blockers as resolved, preventing infinite dependency chain stalls.

gemini-code-assist · 2026-02-20T15:23:12Z

Summary of Changes


This pull request introduces a robust self-healing mechanism to the supervisor system, specifically designed to prevent and resolve silent queue stalls. By cross-referencing various state sources (database, TODO.md, system state), the supervisor can now detect and automatically rectify inconsistencies that previously led to tasks remaining idle. This significantly improves the reliability and autonomy of the task dispatching process, ensuring that tasks are processed efficiently even when underlying states diverge.

Highlights

New Sanity Check Module: Introduced sanity-check.sh (Phase 0.9) to proactively identify and resolve common causes of supervisor queue stalls by cross-referencing DB, TODO.md, and system states.
Proactive Stall Resolution: Implemented five distinct checks within the new module to address issues such as stale claims on DB-failed tasks, permanently failed blockers, missing #auto-dispatch tags, and DB orphans.
Enhanced Dependency Management: Updated todo-sync.sh to recognize permanently failed blockers in the database as 'resolved,' preventing dependent tasks from being indefinitely blocked.
Dynamic Auto-Pickup Integration: Integrated the new sanity check into pulse.sh to run specifically when the queue is empty but open tasks exist, re-running auto-pickup if any issues are fixed to immediately dispatch newly available tasks.

Changelog

.agents/scripts/supervisor-helper.sh
- Added sourcing of the new sanity-check.sh module to integrate it into the supervisor's helper scripts.
.agents/scripts/supervisor/pulse.sh
- Integrated a new Phase 0.9 sanity check, which executes when no tasks are queued but open tasks exist in TODO.md.
- Implemented logic to re-run auto-pickup if the sanity check fixes any issues, ensuring newly dispatchable tasks are processed immediately.
.agents/scripts/supervisor/sanity-check.sh
- Added a new script implementing Phase 0.9, which performs five distinct checks to identify and fix common queue stalls.
- Included checks for DB-failed tasks with TODO.md claims, permanently failed blockers, tasks missing #auto-dispatch tags, and DB orphans.
- Added functionality to log structured stall breakdowns when no fixable issues are found, providing visibility into persistent stalls.
.agents/scripts/supervisor/todo-sync.sh
- Modified auto_unblock_resolved_tasks to treat tasks with permanently failed blockers (retries exhausted) in the database as resolved, allowing dependent tasks to proceed.


github-actions · 2026-02-20T15:23:27Z

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Fri Feb 20 15:23:23 UTC 2026: Code review monitoring started
Fri Feb 20 15:23:23 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 30
VULNERABILITIES: 0

Generated on: Fri Feb 20 15:23:26 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

gemini-code-assist

Code Review

This pull request introduces a valuable sanity-check mechanism to the supervisor, which should effectively address several causes of silent queue stalls. The new sanity-check.sh module is well-structured and the logic for detecting and fixing inconsistencies is sound. The changes in todo-sync.sh to handle permanently failed blockers are also a great improvement.

My review focuses on improving the shell scripting robustness and maintainability, and ensuring adherence to the repository's established rules. Key feedback points, supported by specific repository rules, include:

Refactoring duplicated code in pulse.sh into a helper function for better maintainability.
Addressing widespread suppression of stderr with 2>/dev/null in db SELECT operations, as this can hide important errors and violates general rules against suppressing diagnostic output.
Aligning local variable declarations with the rule for exit code safety.
Suggesting a more robust pattern for complex sed logic in sanity-check.sh for better readability and maintainability.

Overall, this is a strong contribution that enhances the supervisor's resilience. Addressing the feedback will make the implementation even more robust and compliant with repository standards.

gemini-code-assist · 2026-02-20T15:25:23Z

.agents/scripts/supervisor/pulse.sh

+	# contradictions that cause silent stalls. Fixes: stale claims on DB-failed
+	# tasks, failed blocker cascades, missing #auto-dispatch tags, DB orphans.
+	local queued_before_sanity
+	queued_before_sanity=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM tasks WHERE status = 'queued';" 2>/dev/null || echo 0)


Suppressing stderr with 2>/dev/null can hide important underlying errors from db (like SQL syntax errors or connection issues) and violates general repository rules. For SELECT operations, 2>/dev/null should generally be avoided to ensure diagnostic information is visible for debugging. The || echo 0 guard is sufficient to handle cases where no rows are returned or the command fails gracefully. Please remove 2>/dev/null to allow genuine errors to be visible for debugging. This applies to other similar calls in this file (lines 1955, 1971).

Suggested change

-	queued_before_sanity=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM tasks WHERE status = 'queued';" 2>/dev/null || echo 0)
+	queued_before_sanity=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM tasks WHERE status = 'queued';" || echo 0)

References

Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging. This rule applies generally, with specific exceptions for non-blocking SQLite writes that may fail due to DB lock or missing columns, which is not the case here for a SELECT operation.
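Command substitution captures only stdout, which is why the `|| echo 0` guard supplies a value whether stderr is discarded or kept. A small sketch with a hypothetical `fake_db` that fails the way a locked database might:

```shell
#!/usr/bin/env bash
# fake_db is a hypothetical stand-in: message on stderr, nothing on
# stdout, non-zero exit status.
fake_db() {
	echo "Error: database is locked" >&2
	return 5
}

log_file=$(mktemp)

# $(...) captures stdout only, so "|| echo 0" provides the value in both
# cases; the redirection decides where the warning goes.
count_quiet=$(fake_db 2>/dev/null || echo 0)
count_logged=$(fake_db 2>>"$log_file" || echo 0)

logged_msg=$(cat "$log_file")
rm -f "$log_file"

echo "quiet=$count_quiet logged=$count_logged"
echo "log: $logged_msg"
```

Both captures end up as 0; the redirection decides only whether the warning is thrown away or recorded, which is the trade-off the author's reply weighs against log noise.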

gemini-code-assist · 2026-02-20T15:25:23Z

.agents/scripts/supervisor/sanity-check.sh

+	failed_tasks=$(db -separator '|' "$SUPERVISOR_DB" "
+		SELECT id, status, error, retries, max_retries FROM tasks
+		WHERE status IN ('failed', 'blocked')
+		AND repo = '$(sql_escape "$repo_path")';
+	" 2>/dev/null || echo "")


Suppressing stderr with 2>/dev/null can hide important underlying errors from db (like SQL syntax errors or connection issues) and violates general repository rules. For SELECT operations, 2>/dev/null should generally be avoided to ensure diagnostic information is visible for debugging. The || echo "" guard is sufficient to handle cases where no rows are returned or the command fails gracefully. Please remove 2>/dev/null to allow genuine errors to be visible for debugging. This applies to other similar calls throughout this script.

Suggested change

-	failed_tasks=$(db -separator '|' "$SUPERVISOR_DB" "
-		SELECT id, status, error, retries, max_retries FROM tasks
-		WHERE status IN ('failed', 'blocked')
-		AND repo = '$(sql_escape "$repo_path")';
-	" 2>/dev/null || echo "")
+	failed_tasks=$(db -separator '|' "$SUPERVISOR_DB" "
+		SELECT id, status, error, retries, max_retries FROM tasks
+		WHERE status IN ('failed', 'blocked')
+		AND repo = '$(sql_escape "$repo_path")';
+	" || echo "")

References

Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging. This rule applies generally, with specific exceptions for non-blocking SQLite writes that may fail due to DB lock or missing columns, which is not the case here for a SELECT operation.
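Rows from that `-separator '|'` query are typically consumed with `IFS='|' read`. A minimal sketch; `list_exhausted_tasks` and the sample rows are hypothetical, and the retries check mirrors the permanently-failed rule described in the PR. One caveat: `error` is free text in the middle of the row, so a `|` inside an error message would shift `retries` and `max_retries`. Listing the free-text column last in the SELECT avoids that, because `read` assigns the remainder of the line to its final variable.

```shell
#!/usr/bin/env bash
# Sketch: scan '|'-separated rows shaped like the output of
#   SELECT id, status, error, retries, max_retries ...
# and print ids of failed tasks whose retries are exhausted.
list_exhausted_tasks() {
	local rows="$1"
	local id status retries max_retries
	# "_" discards the free-text error column.
	while IFS='|' read -r id status _ retries max_retries; do
		[[ -n "$id" ]] || continue
		if [[ "$status" == "failed" && "${retries:-0}" -ge "${max_retries:-0}" ]]; then
			printf '%s\n' "$id"
		fi
	done <<<"$rows"
}

rows=$'t101|failed|timeout|3|3\nt102|failed|lint error|1|3\nt103|blocked||0|3'
list_exhausted_tasks "$rows"
```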

gemini-code-assist · 2026-02-20T15:25:23Z

.agents/scripts/supervisor/todo-sync.sh

+				blocker_failed_status=$(db "$SUPERVISOR_DB" \
+					"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \
+					2>/dev/null || echo "")


Suppressing stderr with 2>/dev/null can hide important underlying errors from db (like SQL syntax errors or connection issues) and violates general repository rules. For SELECT operations, 2>/dev/null should generally be avoided to ensure diagnostic information is visible for debugging. The || echo "" guard is sufficient to handle cases where no rows are returned or the command fails gracefully. Please remove 2>/dev/null to allow genuine errors to be visible for debugging. This applies to other similar calls in this file (lines 1211, 1212).

Suggested change

-	blocker_failed_status=$(db "$SUPERVISOR_DB" \
-		"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \
-		2>/dev/null || echo "")
+	blocker_failed_status=$(db "$SUPERVISOR_DB" \
+		"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \
+		|| echo "")

References

Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging. This rule applies generally, with specific exceptions for non-blocking SQLite writes that may fail due to DB lock or missing columns, which is not the case here for a SELECT operation.

gemini-code-assist · 2026-02-20T15:25:23Z

.agents/scripts/supervisor/pulse.sh

+	if [[ "$queued_before_sanity" -eq 0 ]]; then
+		# No queued tasks — check if there are open tasks in TODO.md that should be
+		if [[ -n "$all_repos" ]]; then
+			while IFS= read -r repo_path; do
+				if [[ -f "$repo_path/TODO.md" ]]; then
+					local open_count
+					open_count=$(grep -cE '^\s*- \[ \] t[0-9]+' "$repo_path/TODO.md" 2>/dev/null || echo 0)
+					if [[ "$open_count" -gt 0 ]]; then
+						local sanity_fixed
+						sanity_fixed=$(run_sanity_check "$repo_path" 2>>"$SUPERVISOR_LOG")
+						if [[ "${sanity_fixed:-0}" -gt 0 ]]; then
+							log_info "Phase 0.9: Sanity check fixed $sanity_fixed issue(s) in $repo_path — re-running auto-pickup"
+							cmd_auto_pickup --repo "$repo_path" 2>>"$SUPERVISOR_LOG" || true
+						fi
+					fi
+				fi
+			done <<<"$all_repos"
+		else
+			local cwd_todo
+			cwd_todo="$(pwd)/TODO.md"
+			if [[ -f "$cwd_todo" ]]; then
+				local open_count
+				open_count=$(grep -cE '^\s*- \[ \] t[0-9]+' "$cwd_todo" 2>/dev/null || echo 0)
+				if [[ "$open_count" -gt 0 ]]; then
+					local sanity_fixed
+					sanity_fixed=$(run_sanity_check "$(pwd)" 2>>"$SUPERVISOR_LOG")
+					if [[ "${sanity_fixed:-0}" -gt 0 ]]; then
+						log_info "Phase 0.9: Sanity check fixed $sanity_fixed issue(s) — re-running auto-pickup"
+						cmd_auto_pickup --repo "$(pwd)" 2>>"$SUPERVISOR_LOG" || true
+					fi
+				fi
+			fi
+		fi
+	fi


The logic inside the if [[ -n "$all_repos" ]] block and the else block is nearly identical. This duplication can be avoided by refactoring the common logic into a helper function. This would make the code more maintainable and easier to read.

_run_sanity_check_for_repo() {
	local repo_path="$1"
	local todo_file="$repo_path/TODO.md"
	if [[ ! -f "$todo_file" ]]; then
		return
	fi
	local open_count
	open_count=$(grep -cE '^\s*- \[ \] t[0-9]+' "$todo_file" || echo 0)
	if [[ "$open_count" -gt 0 ]]; then
		local sanity_fixed
		sanity_fixed=$(run_sanity_check "$repo_path" 2>>"$SUPERVISOR_LOG")
		if [[ "${sanity_fixed:-0}" -gt 0 ]]; then
			log_info "Phase 0.9: Sanity check fixed $sanity_fixed issue(s) in $repo_path — re-running auto-pickup"
			cmd_auto_pickup --repo "$repo_path" 2>>"$SUPERVISOR_LOG" || true
		fi
	fi
}

if [[ "$queued_before_sanity" -eq 0 ]]; then
	# No queued tasks — check if there are open tasks in TODO.md that should be
	if [[ -n "$all_repos" ]]; then
		while IFS= read -r repo_path; do
			_run_sanity_check_for_repo "$repo_path"
		done <<<"$all_repos"
	else
		_run_sanity_check_for_repo "$(pwd)"
	fi
fi

References

In shell scripts, extract repeated logic into an internal helper function to improve maintainability. This applies even for standalone scripts where external source dependencies are avoided.
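One detail worth checking in both the original and the suggested helper: `grep -c` prints `0` and exits with status 1 when nothing matches, so `$(grep -c ... || echo 0)` yields two lines ("0" and "0") instead of one. The later `[[ "$open_count" -gt 0 ]]` then has to evaluate a two-line value as a number, which bash reports as an arithmetic error rather than a clean false. A minimal sketch of a single-value count; `count_open_tasks` is a hypothetical helper, and `[[:space:]]` is the POSIX class equivalent of the `\s` used in the PR's pattern.

```shell
#!/usr/bin/env bash
# count_open_tasks is a hypothetical helper, not code from the PR.
count_open_tasks() {
	local todo_file="$1"
	local count
	# On no match, grep -c prints "0" and exits 1; "|| true" keeps that
	# single line, whereas "|| echo 0" would append a second "0".
	# A missing file prints nothing, so the :-0 default covers it.
	count=$(grep -cE '^[[:space:]]*- \[ \] t[0-9]+' "$todo_file" 2>/dev/null || true)
	printf '%s\n' "${count:-0}"
}

demo=$(mktemp)
printf '%s\n' '- [ ] t1 open task' '- [x] t2 done task' '  - [ ] t3 nested open task' >"$demo"
count_open_tasks "$demo"
: >"$demo" # empty file: no matches
count_open_tasks "$demo"
rm -f "$demo"
count_open_tasks "/nonexistent/TODO.md"
```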

gemini-code-assist · 2026-02-20T15:25:24Z

.agents/scripts/supervisor/sanity-check.sh

+#   0 on success
+#######################################
+run_sanity_check() {
+	local repo_path="${1:-$REPO_PATH}"


The repository rules specify declaring and assigning local variables separately for exit code safety. This function, and others in this file, use local var="$1". Please update this to follow the rule.

Suggested change

local repo_path="${1:-$REPO_PATH}"

local repo_path

repo_path="${1:-$REPO_PATH}"

References

When declaring and assigning a local variable in shell scripts, separate the 'local' declaration from the assignment if command substitution is involved, to ensure exit code safety.

gemini-code-assist · 2026-02-20T15:25:24Z

.agents/scripts/supervisor/sanity-check.sh

+						if [[ "$blocked_by" == "$blocker_id" ]]; then
+							local escaped_blocker
+							escaped_blocker=$(printf '%s' "$blocker_id" | sed 's/\./\\./g')
+							sed_inplace "${line_num}s/ blocked-by:${escaped_blocker}//" "$todo_file"
+						else
+							# Multiple blockers — remove just this one
+							local escaped_blocker
+							escaped_blocker=$(printf '%s' "$blocker_id" | sed 's/\./\\./g')
+							# Remove blocker_id from comma-separated list
+							sed_inplace "${line_num}s/blocked-by:\([^ ]*\)${escaped_blocker},\?/blocked-by:\1/" "$todo_file"
+							# Clean up trailing/leading commas
+							sed_inplace "${line_num}s/blocked-by:,/blocked-by:/" "$todo_file"
+							sed_inplace "${line_num}s/,\( \)/\1/" "$todo_file"
+						fi


This series of sed commands to remove a specific blocker from a comma-separated list is quite complex and potentially fragile. A more robust and readable approach is to manipulate the string in bash and then replace the entire blocked-by field. Consider refactoring to this more maintainable pattern.

Suggested change

-	if [[ "$blocked_by" == "$blocker_id" ]]; then
-		local escaped_blocker
-		escaped_blocker=$(printf '%s' "$blocker_id" | sed 's/\./\\./g')
-		sed_inplace "${line_num}s/ blocked-by:${escaped_blocker}//" "$todo_file"
-	else
-		# Multiple blockers — remove just this one
-		local escaped_blocker
-		escaped_blocker=$(printf '%s' "$blocker_id" | sed 's/\./\\./g')
-		# Remove blocker_id from comma-separated list
-		sed_inplace "${line_num}s/blocked-by:\([^ ]*\)${escaped_blocker},\?/blocked-by:\1/" "$todo_file"
-		# Clean up trailing/leading commas
-		sed_inplace "${line_num}s/blocked-by:,/blocked-by:/" "$todo_file"
-		sed_inplace "${line_num}s/,\( \)/\1/" "$todo_file"
-	fi
+	# Example
+	local all_blockers="t1,t2,t3"
+	local blocker_to_remove="t2"
+	# Robustly remove the item
+	local new_blockers
+	new_blockers=$(echo ",${all_blockers}," | sed "s/,${blocker_to_remove},/,/" | sed 's/^,//;s/,$//')
+	# Now use sed to replace the whole field
+	if [[ -n "$new_blockers" ]]; then
+		sed_inplace "${line_num}s/blocked-by:[^ ]*/blocked-by:${new_blockers}/" "$todo_file"
+	else
+		sed_inplace "${line_num}s/ blocked-by:[^ ]*//" "$todo_file"
+	fi

…ness (t1273)

Address valid review feedback:
- Extract _run_sanity_check_for_repo() to deduplicate multi-repo/single-repo code paths in Phase 0.9
- Replace fragile multi-step sed for comma-separated blocker removal with robust string rebuild approach

marcusquinn

Review response

Addressed (valid feedback)

Duplicated code in pulse.sh: Extracted _run_sanity_check_for_repo() helper function — good catch
Fragile sed for comma-separated blocker removal: Replaced with robust string rebuild approach (",$list," | sed "s/,$item,/,/")
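The same padded-comma idea also works without spawning sed, using bash pattern substitution. A minimal sketch; `remove_from_csv` is a hypothetical helper, not the code that landed in the PR. Because glob patterns treat `.` literally, ids such as `t123.1` need no escaping, and the surrounding commas keep `t1` from matching inside `t10`.

```shell
#!/usr/bin/env bash
# remove_from_csv is a hypothetical helper showing the ",$list," idea
# with bash parameter expansion instead of sed.
remove_from_csv() {
	local list="$1" item="$2"
	local padded=",${list},"
	# Quoting "$item" inside the pattern makes any glob characters literal.
	padded="${padded//,"$item",/,}"
	padded="${padded#,}"
	padded="${padded%,}"
	printf '%s\n' "$padded"
}

remove_from_csv "t1,t2,t3" "t2"
remove_from_csv "t10,t1" "t1"
remove_from_csv "t2" "t2"
```

An empty result means the last blocker is gone, which is the case where the whole ` blocked-by:` field should be dropped rather than rewritten.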

Dismissed (incorrect or inconsistent with codebase conventions)

2>/dev/null on SELECT operations: This is the established convention across the entire supervisor codebase (279 occurrences in pulse.sh alone, 33 in todo-sync.sh, 98 in evaluate.sh). The db() wrapper can emit SQLite warnings on locked databases during concurrent access — these are expected in a multi-process supervisor and not actionable errors. Removing 2>/dev/null would flood the supervisor log with noise during normal operation. The || echo "" fallback handles actual failures.

local var="$1" pattern: The Gemini suggestion to separate declaration from assignment applies to command substitutions where you need to capture exit codes (e.g., local x; x=$(some_cmd)). For simple parameter assignments like local repo_path="$1", there is no exit code to capture — $1 is a variable expansion, not a command. This is the established pattern throughout the codebase (used in every function in todo-sync.sh, pulse.sh, etc.).
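The distinction in this reply is easy to confirm: `local` has its own exit status, which replaces that of a command substitution on the same line, while a plain `local x="$1"` runs no command at all. A minimal demonstration with hypothetical function names (ShellCheck reports the first form as SC2155):

```shell
#!/usr/bin/env bash
combined_decl() {
	local out=$(false) # $? is now the status of "local", not of "false"
	echo "combined: $?"
}
split_decl() {
	local out
	out=$(false) # $? is the status of the command substitution
	echo "split: $?"
}
plain_decl() {
	local repo_path="$1" # no command runs, so there is no status to lose
	echo "plain: $? ($repo_path)"
}

combined_decl
split_decl
plain_decl "/tmp/repo"
```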

coderabbitai · 2026-02-20T16:16:52Z


Walkthrough

Adds a Phase 0.9 supervisor sanity-check: a new sanity-check.sh module performs deterministic checks (TODO.md ↔ DB ↔ repo) when the queue is empty to detect and remediate stale claims, failed blockers, missing auto-dispatch tags, and DB orphans; wired into supervisor init and pulse flow.

Changes

Cohort / File(s)	Summary
Supervisor Initialization `.agents/scripts/supervisor-helper.sh`	Source the new `sanity-check.sh` helper during supervisor startup (sourced after routine-scheduler modules).
Pulse Phase 0.9 Integration `.agents/scripts/supervisor/pulse.sh`	Add `_run_sanity_check_for_repo()` and call it from `cmd_pulse` when the queue is empty to iterate repos and run sanity checks, triggering `cmd_auto_pickup` on fixes.
Sanity Check Implementation `.agents/scripts/supervisor/sanity-check.sh`	New module implementing `run_sanity_check()` and helpers: `_check_db_failed_with_claims`, `_check_failed_blocker_chains`, `_check_missing_auto_dispatch`, `_check_db_orphans`, `_log_queue_stall_reasons` — reconciles TODO.md, DB, and repo state and applies remediation (unclaim/reset/commit/tag/cancel).
Todo Sync Enhancement `.agents/scripts/supervisor/todo-sync.sh`	Extend `auto_unblock_resolved_tasks` to treat permanently failed blockers (retries exhausted) like resolved blockers so dependents can unblock.

Sequence Diagram

sequenceDiagram
    participant Pulse as Pulse Cycle
    participant Sanity as Sanity Check
    participant TODO as TODO.md
    participant DB as Database
    participant Repo as Git Repo

    Pulse->>Pulse: Run phases 0.0–0.8
    Pulse->>Pulse: If queue empty, enter Phase 0.9
    Pulse->>Sanity: run_sanity_check(repo) via _run_sanity_check_for_repo
    Sanity->>TODO: Open TODO.md
    Sanity->>DB: Query task rows & statuses
    Sanity->>Sanity: _check_db_failed_with_claims()
    DB-->>Sanity: Stale claims / failed rows
    Sanity->>Sanity: _check_failed_blocker_chains()
    DB-->>Sanity: Blocker status (failed/complete)
    Sanity->>TODO: Edit TODO (remove blocker, add `#auto-dispatch`, etc.)
    Sanity->>DB: _check_db_orphans() -> cancel orphan rows
    Sanity->>Repo: commit_and_push_todo (if edited)
    Sanity->>Pulse: Return fixed count
    alt fixes > 0
        Pulse->>Pulse: cmd_auto_pickup --repo repo
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

fix(supervisor): add cron PATH resolution and no_pr retry limit #435 — Modifies supervisor-helper.sh initialization patterns; overlaps with the new sourcing change.
t1261: Fix dispatch stall from orphaned DB tasks #1974 — Implements DB↔TODO cancellation of orphaned tasks; overlaps with _check_db_orphans and todo-sync changes.
t311.3: Extract supervisor modules #1220 — Moves/adjusts supervisor helper functions (cmd_unclaim, cmd_reset, etc.) that the new sanity checks invoke.


🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 't1273: Add supervisor sanity-check for self-healing queue stalls' is specific and directly describes the main change: introducing a Phase 0.9 sanity-check module to address supervisor queue stalls through automated self-healing logic.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


github-actions · 2026-02-20T16:16:53Z

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Fri Feb 20 16:16:49 UTC 2026: Code review monitoring started
Fri Feb 20 16:16:49 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 30
VULNERABILITIES: 0

Generated on: Fri Feb 20 16:16:52 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (3)

.agents/scripts/supervisor/todo-sync.sh (1)

1188-1218: Consolidate the two separate DB guard blocks into a single SELECT status, retries, max_retries query.

The existing block (lines 1188–1200) and the new block (lines 1202–1218) each open a separate sqlite3 connection for the same blocker_id. In the worst case this is 4 DB calls per blocker (status check → failed-status recheck → retries → max_retries). A single query handles both fast-paths atomically, also eliminating the empty-string hazard noted above.

The WHERE … AND status = 'failed' predicate on line 1207 makes the subsequent == "failed" comparison redundant — a non-empty result is always "failed".

♻️ Proposed refactor — single DB round-trip for both fast-paths

-			# t1247: DB fallback — blocker may be deployed/verified in DB but TODO.md
-			# not yet updated …
-			if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
-				local blocker_db_status=""
-				blocker_db_status=$(db "$SUPERVISOR_DB" \
-					"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' LIMIT 1;" \
-					2>/dev/null || echo "")
-				if [[ "$blocker_db_status" == "complete" ||
-					"$blocker_db_status" == "deployed" ||
-					"$blocker_db_status" == "verified" ||
-					"$blocker_db_status" == "merged" ]]; then
-					log_verbose "  auto-unblock: blocker $blocker_id is '$blocker_db_status' in DB (TODO.md not yet updated) — treating as resolved"
-					continue # Resolved in DB
-				fi
-			fi
-
-			# Check if blocker is permanently failed in DB (retries exhausted)
-			# A failed blocker will never complete — don't let it block dependents forever
-			if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
-				local blocker_failed_status=""
-				blocker_failed_status=$(db "$SUPERVISOR_DB" \
-					"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \
-					2>/dev/null || echo "")
-				if [[ "$blocker_failed_status" == "failed" ]]; then
-					local blocker_retries_left blocker_max_retries_left
-					blocker_retries_left=$(db "$SUPERVISOR_DB" "SELECT retries FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
-					blocker_max_retries_left=$(db "$SUPERVISOR_DB" "SELECT max_retries FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
-					if [[ "$blocker_retries_left" -ge "$blocker_max_retries_left" ]]; then
-						log_verbose "  auto-unblock: blocker $blocker_id is permanently failed in DB ($blocker_retries_left/$blocker_max_retries_left retries) — treating as resolved"
-						continue # Permanently failed = treat as resolved
-					fi
-				fi
-			fi
+			# t1247 / t1273: DB fast-path — single query covers resolved states and
+			# permanently-failed blockers (retries exhausted), keeping the loop to
+			# one sqlite3 round-trip per blocker.
+			if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
+				local _blocker_row="" _bd_status="" _bd_retries="" _bd_max_retries=""
+				_blocker_row=$(db -separator '|' "$SUPERVISOR_DB" \
+					"SELECT status, retries, max_retries FROM tasks WHERE id = '$(sql_escape "$blocker_id")' LIMIT 1;" \
+					2>/dev/null || echo "")
+				if [[ -n "$_blocker_row" ]]; then
+					IFS='|' read -r _bd_status _bd_retries _bd_max_retries <<<"$_blocker_row"
+					# Resolved states: treat as unblocked
+					if [[ "$_bd_status" == "complete" || "$_bd_status" == "deployed" ||
+						"$_bd_status" == "verified" || "$_bd_status" == "merged" ]]; then
+						log_verbose "  auto-unblock: blocker $blocker_id is '$_bd_status' in DB (TODO.md not yet updated) — treating as resolved"
+						continue
+					fi
+					# Permanently failed: retries exhausted — treat as resolved
+					if [[ "$_bd_status" == "failed" && "${_bd_retries:-0}" -ge "${_bd_max_retries:-0}" ]]; then
+						log_verbose "  auto-unblock: blocker $blocker_id is permanently failed in DB (${_bd_retries:-0}/${_bd_max_retries:-0} retries) — treating as resolved"
+						continue
+					fi
+				fi
+			fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/todo-sync.sh around lines 1188 - 1218, Combine
the two DB guard blocks into a single DB call: use db "$SUPERVISOR_DB" "SELECT
status, retries, max_retries FROM tasks WHERE id = '$(sql_escape "$blocker_id")'
LIMIT 1;" and parse the three fields into local variables (e.g., blocker_status,
blocker_retries_left, blocker_max_retries_left). Then apply the same logic: if
blocker_status is one of complete|deployed|verified|merged -> log auto-unblock
and continue; else if blocker_status == failed and blocker_retries_left -ge
blocker_max_retries_left -> log permanently failed and continue. Remove the
separate blocker_db_status/blocker_failed_status queries and the redundant WHERE
... AND status='failed' clause to eliminate multiple sqlite3 connections per
blocker.
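Once the guards are merged into one query, there is at most one row to handle, so the decision comes down to parsing three pipe-separated fields. The sketch below shows only that decision logic. It assumes `db -separator '|'` returns rows shaped like `status|retries|max_retries`. `blocker_resolution` and its output words are hypothetical. The `:-3` fallback for a NULL `max_retries` follows the later review in this thread, not the `:-0` used in the diff above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify one blocker from a single DB row
# "status|retries|max_retries" (empty input = no row or db failure).
blocker_resolution() {
	local row="$1" status="" retries="" max_retries=""
	# No row: the DB cannot vouch for the blocker, so keep it blocking.
	if [[ -z "$row" ]]; then
		echo "blocked"
		return 0
	fi
	# IFS='|' is a command prefix here, so it only affects this read.
	IFS='|' read -r status retries max_retries <<<"$row"
	case "$status" in
	complete | deployed | verified | merged)
		echo "resolved"
		return 0
		;;
	esac
	# Empty (NULL) columns default to 0 retries and an assumed max of 3.
	if [[ "$status" == "failed" && "${retries:-0}" -ge "${max_retries:-3}" ]]; then
		echo "permanently-failed"
		return 0
	fi
	echo "blocked"
}

blocker_resolution 'merged|0|3' # prints "resolved"
blocker_resolution 'failed|3|3' # prints "permanently-failed"
blocker_resolution 'failed|1|3' # prints "blocked"
```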

.agents/scripts/supervisor/sanity-check.sh (2)

205-262: Use local IFS instead of manual save/restore for safer scoping.

The static analysis flags the IFS manipulation on lines 206 and 262. The current save/restore pattern works, but if an unexpected continue, return, or error occurs between lines 206 and 262, IFS remains modified for the rest of the function. Using local IFS=',' scopes it to the enclosing function, which is actually safe here since the outer while IFS= read sets IFS per-invocation.
Proposed simplification
-		local _saved_ifs="$IFS"
-		IFS=','
+		local IFS=','
 		for blocker_id in $blocked_by; do
 			...
 		done
-		IFS="$_saved_ifs"
Note: local IFS in bash scopes the variable to the function and restores the previous value when the function returns. Since IFS is only needed for the for loop's word splitting, this is safe — the outer while IFS= read loop sets IFS per-read invocation regardless.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/sanity-check.sh around lines 205 - 262, Replace
the manual save/restore IFS pattern with a function-scoped IFS by removing the
_saved_ifs/restore code and declaring local IFS=',' immediately before the for
blocker_id in $blocked_by loop; update the block that iterates over blocked_by
(which references variables like blocked_by, blocker_id, task_id, todo_file,
cmd_reset, sed_inplace) so it uses local IFS=',' and no longer restores IFS at
the end.
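The sketch below shows why `local IFS=','` is enough. `split_blockers` is a hypothetical function. The comma separator is active only while the function runs, and bash restores the caller's IFS on every return path, including an early `return`. The unquoted `$1` is deliberate, because it is what triggers splitting. It also exposes each field to globbing, which is harmless only while task IDs contain no `*`, `?` or `[`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each comma-separated blocker ID on its own line.
split_blockers() {
	local IFS=','
	local blocker_id
	# Unquoted on purpose: word splitting uses the function-local IFS.
	for blocker_id in $1; do
		printf '%s\n' "$blocker_id"
	done
}

saved="$IFS"
split_blockers 't1,t215.3,t9' # prints t1, t215.3 and t9 on separate lines
[[ "$IFS" == "$saved" ]] && echo "caller IFS unchanged"
```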
335-354: sed insertion targets the first #tag, not the last as the comment states.

Line 337 comment says "after the last #tag before any —" but the sed on line 338 replaces the first match of #tag\s → #tag #auto-dispatch\s. For a line like #bugfix #security — desc, this inserts between the first and second tags rather than after the last. Functionally harmless since tag order doesn't affect dispatch, but the comment is misleading.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/sanity-check.sh around lines 335 - 354, The
comment claims the code inserts "#auto-dispatch" after the last tag but the sed
invocation in sed_inplace targets the first tag; fix by either updating the
comment to state "insert before the first `#tag` (or at end of tags)" or change
the sed expression in sed_inplace to match the last tag before the separator
(e.g., use a greedy prefix capture so the replacement targets the final
#[a-zA-Z0-9_-]* token before the space/—); refer to the sed_inplace call and the
surrounding variables todo_file, line_num and task_id when applying the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1966-1983: The inline comment in the Phase 0.9 block is truncated;
update the comment that starts with "No queued tasks — check if there are open
tasks in TODO.md that should be" to a complete phrase (e.g., "...that should be
queued" or "...that should be dispatchable") to clarify intent; locate the block
around the queued_before_sanity variable and the _run_sanity_check_for_repo
calls and replace the truncated comment with the full, unambiguous sentence.

In @.agents/scripts/supervisor/sanity-check.sh:
- Around line 244-254: The sed substitution building new_blockers uses an
unescaped $blocker_id which can contain regex metacharacters (e.g., dots);
update the construction of new_blockers so $blocker_id is escaped for use in the
sed s/// pattern (mirror the escaping approach used for $blocked_by), i.e.,
create an escaped_blocker_id variable from blocker_id before calling sed and use
that escaped variable in the sed "s/,${escaped_blocker_id},/,/" invocation;
ensure this change is applied where new_blockers is computed and keep the
existing sed_inplace and later escaped_blocked_by usage unchanged.
- Around line 44-106: run_sanity_check collects integer results from the helper
functions (_check_db_failed_with_claims, _check_failed_blocker_chains,
_check_missing_auto_dispatch, _check_db_orphans) using command substitution, but
side-effect commands inside those helpers (cmd_unclaim, cmd_reset,
cmd_transition, commit_and_push_todo, etc.) may emit stdout and corrupt the
numeric return values; update each _check_* helper (and the calls to
commit_and_push_todo) to redirect all non-essential stdout/stderr from
side-effect commands into the supervisor log (use "$SUPERVISOR_LOG") or
/dev/null so the helper functions only emit their final numeric result on
stdout, ensuring run_sanity_check's arithmetic (fixed=$((fixed + ...))) receives
clean integers.
- Around line 131-172: The loop reads retries/max_retries into
db_retries/db_max_retries but the DB query can return NULL, causing the integer
comparison in [[ "$db_retries" -lt "$db_max_retries" ]] to fail; fix by ensuring
non-null defaults: update the query that produces failed_tasks to use
COALESCE(retries,0) AS retries and COALESCE(max_retries,0) AS max_retries, or
alternately immediately after the read set db_retries=${db_retries:-0} and
db_max_retries=${db_max_retries:-0} so the comparison in the loop (and the
variables used elsewhere) always sees integers.
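The stdout-leakage fix amounts to one rule: inside a `_check_*` helper, stdout carries only the final count. The sketch below applies that rule with hypothetical stand-ins. `noisy_side_effect` plays a command like `cmd_unclaim`, `check_example` plays a real `_check_*` function, and a temporary file replaces `$SUPERVISOR_LOG`.

```bash
#!/usr/bin/env bash
# A temp file stands in for the supervisor log in this sketch.
SUPERVISOR_LOG="$(mktemp)"

# Hypothetical side-effect command that chats on stdout.
noisy_side_effect() {
	echo "Unclaimed $1"
}

# Hypothetical check: stdout must carry nothing but the final integer.
check_example() {
	local fixed=0 task_id
	for task_id in t1 t2; do
		# Without this redirect, "Unclaimed t1" would be captured by $(...)
		# below and the arithmetic expansion would fail with a syntax error.
		noisy_side_effect "$task_id" >>"$SUPERVISOR_LOG" 2>&1
		fixed=$((fixed + 1))
	done
	echo "$fixed"
}

total=0
total=$((total + $(check_example)))
echo "fixed: $total" # prints "fixed: 2"
```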

In @.agents/scripts/supervisor/todo-sync.sh:
- Around line 1211-1213: The numeric comparison can fail because
blocker_retries_left and blocker_max_retries_left may be empty strings when the
sqlite query returns no rows; change the code that sets blocker_retries_left and
blocker_max_retries_left (the db "SELECT retries..." and "SELECT max_retries..."
calls around variables blocker_retries_left and blocker_max_retries_left in
todo-sync.sh) to coerce empty output to 0 (for example by wrapping the db output
with a default-to-zero check) before doing the [[ ... -ge ... ]] test, or
better, replace the two separate queries with a single consolidated query that
returns both retries and max_retries (and defaults missing values to 0) to avoid
a race between the queries and ensure the numeric comparison never receives an
empty string.

---

Nitpick comments:
In @.agents/scripts/supervisor/sanity-check.sh:
- Around line 205-262: Replace the manual save/restore IFS pattern with a
function-scoped IFS by removing the _saved_ifs/restore code and declaring local
IFS=',' immediately before the for blocker_id in $blocked_by loop; update the
block that iterates over blocked_by (which references variables like blocked_by,
blocker_id, task_id, todo_file, cmd_reset, sed_inplace) so it uses local IFS=','
and no longer restores IFS at the end.
- Around line 335-354: The comment claims the code inserts "#auto-dispatch"
after the last tag but the sed invocation in sed_inplace targets the first tag;
fix by either updating the comment to state "insert before the first `#tag` (or at
end of tags)" or change the sed expression in sed_inplace to match the last tag
before the separator (e.g., use a greedy prefix capture so the replacement
targets the final #[a-zA-Z0-9_-]* token before the space/—); refer to the
sed_inplace call and the surrounding variables todo_file, line_num and task_id
when applying the change.

In @.agents/scripts/supervisor/todo-sync.sh:
- Around line 1188-1218: Combine the two DB guard blocks into a single DB call:
use db "$SUPERVISOR_DB" "SELECT status, retries, max_retries FROM tasks WHERE id
= '$(sql_escape "$blocker_id")' LIMIT 1;" and parse the three fields into local
variables (e.g., blocker_status, blocker_retries_left,
blocker_max_retries_left). Then apply the same logic: if blocker_status is one
of complete|deployed|verified|merged -> log auto-unblock and continue; else if
blocker_status == failed and blocker_retries_left -ge blocker_max_retries_left
-> log permanently failed and continue. Remove the separate
blocker_db_status/blocker_failed_status queries and the redundant WHERE ... AND
status='failed' clause to eliminate multiple sqlite3 connections per blocker.

.agents/scripts/supervisor/pulse.sh

.agents/scripts/supervisor/sanity-check.sh

.agents/scripts/supervisor/todo-sync.sh

…ping (t1273)

- Redirect stdout of side-effect commands (cmd_unclaim, cmd_reset, cmd_transition, commit_and_push_todo) to SUPERVISOR_LOG inside _check_* functions to prevent stdout leakage into $() arithmetic capture
- Add COALESCE for retries/max_retries in SQL queries to handle NULL values
- Add empty-string guards (${var:-0}) for race condition protection
- Escape blocker_id dots in sed regex for task IDs like t215.3
- Fix truncated comment in pulse.sh Phase 0.9

github-actions · 2026-02-20T16:26:03Z

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Fri Feb 20 16:25:58 UTC 2026: Code review monitoring started
Fri Feb 20 16:25:58 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 30
VULNERABILITIES: 0

Generated on: Fri Feb 20 16:26:01 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

.agents/scripts/supervisor/pulse.sh (1)
1977-1982: _run_sanity_check_for_repo called without || true — inconsistent with every other phase helper in cmd_pulse

Every other helper call in cmd_pulse is guarded with || true (e.g., lines 718, 781, 851, 871, 889, 957, 1957). When _run_sanity_check_for_repo reaches its inner if [[ "${sanity_fixed:-0}" -gt 0 ]] and the condition is false, the function's last-executed command exits 1. While the while loop absorbs exit codes at runtime, the omission is inconsistent and could surprise future maintainers or cause subtle issues if set -e is ever introduced.
♻️ Proposed fix — add `|| true` for consistency
 		while IFS= read -r repo_path; do
-			_run_sanity_check_for_repo "$repo_path"
+			_run_sanity_check_for_repo "$repo_path" || true
 		done <<<"$all_repos"
 	else
-		_run_sanity_check_for_repo "$(pwd)"
+		_run_sanity_check_for_repo "$(pwd)" || true
 	fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 1977 - 1982, The call to
_run_sanity_check_for_repo inside cmd_pulse is missing the customary "|| true"
guard used by all other phase helpers; update both invocation sites (the while
loop body that iterates over repo_path and the else branch that calls
_run_sanity_check_for_repo "$(pwd)") to append "|| true" so the helper's
non-zero exit (e.g., when sanity_fixed is 0) won't propagate and to keep
behavior consistent with other helpers.
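Whether a helper like this actually returns non-zero when nothing was fixed depends on how its last statement is written. That uncertainty is one more reason the uniform `|| true` guard is cheap insurance. The sketch below uses two hypothetical helpers that differ only in their last statement.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; only the form of the final statement differs.
check_with_and() {
	local sanity_fixed="${1:-0}"
	# A false test on the left of && becomes the function's return status.
	[[ "$sanity_fixed" -gt 0 ]] && echo "re-running auto-pickup"
}

check_with_if() {
	local sanity_fixed="${1:-0}"
	# An if whose condition is false (and has no else) returns 0.
	if [[ "$sanity_fixed" -gt 0 ]]; then
		echo "re-running auto-pickup"
	fi
}

rc=0
check_with_and 0 || rc=$?
echo "&& form: $rc" # prints "&& form: 1"
rc=0
check_with_if 0 || rc=$?
echo "if form: $rc" # prints "if form: 0"

# Under set -e a bare failing call would end the script; the guard does not.
(
	set -e
	check_with_and 0 || true
	echo "still running"
)
```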

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/sanity-check.sh:
- Line 136: The grep ERE patterns use unescaped task_id (so dots match any char)
in the four _check_* functions; create an escaped_task_id variable (like the
existing escaped_blocker_id) by replacing '.' with '\.' and use escaped_task_id
in the grep -E patterns (e.g., the lines that currently reference "${task_id}")
before each first grep call in the functions that use task_id (the sites at the
grep calls around the current task_id uses on lines noted in the review). This
ensures patterns match literal dots in IDs and prevents accidental matches of
adjacent IDs.
- Around line 205-268: The function currently mutates global IFS (saved in
_saved_ifs) and restores it later, which can be skipped if an early exit occurs;
change the for-loop that iterates over blocked_by to use a safe array split: use
IFS=',' read -r -a _blocker_list <<< "$blocked_by" and then iterate with for
blocker_id in "${_blocker_list[@]}"; ensure all references to $blocker_id and
$blocked_by remain quoted where used (e.g., in db calls, grep, sed_inplace and
the rebuilding logic), remove the _saved_ifs/IFS restore code, and keep the
existing logic for resetting/unblocking (cmd_reset, sed_inplace, log_warn,
fixed, task_id, todo_file) intact so no global IFS mutation can persist on early
exit.
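The sketch below shows the array-based split the review proposes, using a hypothetical `list_blockers` helper. The follow-up commit message in this thread describes a `tr`-based read loop instead, so this is the review's variant and not necessarily what was merged. Because `IFS=','` prefixes only the `read`, no IFS change can outlive that one command. The quoted array expansion also avoids the globbing that an unquoted `$blocked_by` would allow.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each blocker ID from a comma-separated list.
list_blockers() {
	local blocked_by="$1" blocker_id
	local -a _blocker_list=()
	# IFS=',' applies to this read only; the array holds one ID per element.
	IFS=',' read -r -a _blocker_list <<<"$blocked_by"
	# Caveat: bash older than 4.4 treats an empty "${arr[@]}" as unbound
	# under set -u, so guard empty input there if that matters.
	for blocker_id in "${_blocker_list[@]}"; do
		printf '%s\n' "$blocker_id"
	done
}

list_blockers 't1,t215.3,t9' # prints t1, t215.3 and t9 on separate lines
```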

In @.agents/scripts/supervisor/todo-sync.sh:
- Around line 1202-1221: The three separate `db` queries (using db, sql_escape
and blocker_id) create a TOCTOU and a fallback bug where
blocker_max_retries_left can become "0" if `db` fails; replace the three queries
with a single `db` query that SELECTs status, COALESCE(retries,0),
COALESCE(max_retries,3) FROM tasks WHERE id = '...' AND status = 'failed' LIMIT
1, then parse the single result (e.g., IFS='|' read -r _pf_status
blocker_retries_left blocker_max_retries_left) and treat an empty result as "not
permanently failed"; keep the existing default guards for the blocker's retry counts
(blocker_retries_left:-0, blocker_max_retries_left:-3) and the existing
comparison and log_verbose call (using blocker_id) so we only auto-unblock when
the row exists, status is failed, and retries >= max_retries.

---

Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1977-1982: The call to _run_sanity_check_for_repo inside cmd_pulse
is missing the customary "|| true" guard used by all other phase helpers; update
both invocation sites (the while loop body that iterates over repo_path and the
else branch that calls _run_sanity_check_for_repo "$(pwd)") to append "|| true"
so the helper's non-zero exit (e.g., when sanity_fixed is 0) won't propagate and
to keep behavior consistent with other helpers.

.agents/scripts/supervisor/sanity-check.sh

coderabbitai · 2026-02-20T16:37:42Z

.agents/scripts/supervisor/todo-sync.sh

+			# Check if blocker is permanently failed in DB (retries exhausted)
+			# A failed blocker will never complete — don't let it block dependents forever
+			if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
+				local blocker_failed_status=""
+				blocker_failed_status=$(db "$SUPERVISOR_DB" \
+					"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \
+					2>/dev/null || echo "")
+				if [[ "$blocker_failed_status" == "failed" ]]; then
+					local blocker_retries_left blocker_max_retries_left
+					blocker_retries_left=$(db "$SUPERVISOR_DB" "SELECT COALESCE(retries, 0) FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
+					blocker_max_retries_left=$(db "$SUPERVISOR_DB" "SELECT COALESCE(max_retries, 3) FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
+					# Guard against empty strings from race conditions
+					blocker_retries_left="${blocker_retries_left:-0}"
+					blocker_max_retries_left="${blocker_max_retries_left:-3}"
+					if [[ "$blocker_retries_left" -ge "$blocker_max_retries_left" ]]; then
+						log_verbose "  auto-unblock: blocker $blocker_id is permanently failed in DB ($blocker_retries_left/$blocker_max_retries_left retries) — treating as resolved"
+						continue # Permanently failed = treat as resolved
+					fi
+				fi
+			fi


⚠️ Potential issue | 🟡 Minor

Inconsistent max_retries fallback when db fails; consolidating the three queries eliminates both the TOCTOU and the inconsistency.

|| echo "0" on line 1212 fires when db exits non-zero, setting blocker_max_retries_left="0". Because "0" is not an empty string, the :-3 guard on line 1215 never substitutes. If db is briefly unavailable for queries 2–3 (e.g., SQLite BUSY after query 1 succeeds), the comparison becomes [[ 0 -ge 0 ]] → true, and the blocker is incorrectly treated as permanently failed — prematurely unblocking a dependent task.

A single query that combines the status filter with the retry columns eliminates the three-query TOCTOU window and makes the guard logic uniform:

♻️ Proposed fix — single-query consolidation

-			# Check if blocker is permanently failed in DB (retries exhausted)
-			# A failed blocker will never complete — don't let it block dependents forever
-			if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
-				local blocker_failed_status=""
-				blocker_failed_status=$(db "$SUPERVISOR_DB" \
-					"SELECT status FROM tasks WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed' LIMIT 1;" \
-					2>/dev/null || echo "")
-				if [[ "$blocker_failed_status" == "failed" ]]; then
-					local blocker_retries_left blocker_max_retries_left
-					blocker_retries_left=$(db "$SUPERVISOR_DB" "SELECT COALESCE(retries, 0) FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
-					blocker_max_retries_left=$(db "$SUPERVISOR_DB" "SELECT COALESCE(max_retries, 3) FROM tasks WHERE id = '$(sql_escape "$blocker_id")';" 2>/dev/null || echo "0")
-					# Guard against empty strings from race conditions
-					blocker_retries_left="${blocker_retries_left:-0}"
-					blocker_max_retries_left="${blocker_max_retries_left:-3}"
-					if [[ "$blocker_retries_left" -ge "$blocker_max_retries_left" ]]; then
-						log_verbose "  auto-unblock: blocker $blocker_id is permanently failed in DB ($blocker_retries_left/$blocker_max_retries_left retries) — treating as resolved"
-						continue # Permanently failed = treat as resolved
-					fi
-				fi
-			fi
+			# Check if blocker is permanently failed in DB (retries exhausted)
+			# A failed blocker will never complete — don't let it block dependents forever
+			if [[ -n "${SUPERVISOR_DB:-}" && -f "${SUPERVISOR_DB}" ]]; then
+				local _pf_row _pf_retries _pf_max
+				_pf_row=$(db -separator '|' "$SUPERVISOR_DB" "
+					SELECT COALESCE(retries, 0), COALESCE(max_retries, 3) FROM tasks
+					WHERE id = '$(sql_escape "$blocker_id")' AND status = 'failed';
+				" 2>/dev/null || echo "")
+				if [[ -n "$_pf_row" ]]; then
+					IFS='|' read -r _pf_retries _pf_max <<< "$_pf_row"
+					_pf_retries="${_pf_retries:-0}"
+					_pf_max="${_pf_max:-3}"
+					if [[ "$_pf_retries" -ge "$_pf_max" ]]; then
+						log_verbose "  auto-unblock: blocker $blocker_id is permanently failed in DB ($_pf_retries/$_pf_max retries) — treating as resolved"
+						continue # Permanently failed = treat as resolved
+					fi
+				fi
+			fi

The WHERE … AND status = 'failed' filter means an empty _pf_row unambiguously means "not permanently failed" (either task absent, not failed, or db itself failed). The IFS='|' read is command-prefix scoped and does not mutate global IFS.
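The `:-` pitfall above takes only a few lines to reproduce. In the sketch below, `failing_db` is a hypothetical stand-in for a `db` call that exits non-zero, for example on SQLITE_BUSY. `${var:-3}` substitutes only when the value is unset or empty. A literal `0` produced by `|| echo "0"` therefore survives the guard, while an empty fallback lets the default apply.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for a db call that fails.
failing_db() { return 1; }

# Fallback "0" is non-empty, so the :-3 guard never applies.
max_retries=$(failing_db || echo "0")
max_retries="${max_retries:-3}"
echo "echo 0 fallback: $max_retries" # prints "echo 0 fallback: 0"

# With an empty fallback the guard substitutes the intended default.
max_retries=$(failing_db || echo "")
max_retries="${max_retries:-3}"
echo "empty fallback: $max_retries" # prints "empty fallback: 3"

# With 0 retries, the first case evaluates [[ 0 -ge 0 ]], which is true,
# and would wrongly treat the blocker as permanently failed.
```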

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/todo-sync.sh around lines 1202 - 1221, The three separate `db` queries (using db, sql_escape and blocker_id) create a TOCTOU and a fallback bug where blocker_max_retries_left can become "0" if `db` fails; replace the three queries with a single `db` query that SELECTs status, COALESCE(retries,0), COALESCE(max_retries,3) FROM tasks WHERE id = '...' AND status = 'failed' LIMIT 1, then parse the single result (e.g., IFS='|' read -r _pf_status blocker_retries_left blocker_max_retries_left) and treat an empty result as "not permanently failed"; keep the existing default guards for the blocker's retry counts (blocker_retries_left:-0, blocker_max_retries_left:-3) and the existing comparison and log_verbose call (using blocker_id) so we only auto-unblock when the row exists, status is failed, and retries >= max_retries.

…k consistency (t1273)

Address CodeRabbit round 2:
- Escape dots in task_id for all grep ERE patterns across all _check_* functions (t215.3 would match t21503 without escaping)
- Replace IFS=',' mutation with read-based splitting via tr to prevent IFS corruption on early exit under set -euo pipefail
- Fix max_retries fallback: use empty string + :-3 guard instead of || echo '0' which would falsely trigger 'retries exhausted' on db failure
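The t215.3 versus t21503 case is easy to reproduce. The sketch below assumes task IDs consist only of letters, digits and dots, so the dot is the only ERE metacharacter that needs escaping. `escape_task_id` is a hypothetical helper, not a function from the PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: backslash-escape dots so an ID is literal in an ERE.
escape_task_id() {
	printf '%s\n' "$1" | sed 's/\./\\./g'
}

todo_line='- [ ] t21503 Unrelated task #bugfix'

# Unescaped, "." matches any character, so t215.3 also matches t21503.
if printf '%s\n' "$todo_line" | grep -qE "t215.3( |$)"; then
	echo "unescaped: false match"
fi

# Escaped, only a literal "t215.3" can match.
escaped_id="$(escape_task_id 't215.3')"
if ! printf '%s\n' "$todo_line" | grep -qE "${escaped_id}( |$)"; then
	echo "escaped: no match"
fi
```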

github-actions · 2026-02-20T16:42:55Z

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Fri Feb 20 16:42:51 UTC 2026: Code review monitoring started
Fri Feb 20 16:42:51 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 30
VULNERABILITIES: 0

Generated on: Fri Feb 20 16:42:54 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

sonarqubecloud · 2026-02-20T16:43:37Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

gemini-code-assist bot reviewed Feb 20, 2026

marcusquinn commented Feb 20, 2026

coderabbitai bot requested changes Feb 20, 2026

marcusquinn merged commit a2f6c82 into main Feb 20, 2026
11 checks passed

github-actions bot mentioned this pull request Feb 20, 2026

t1273: Supervisor sanity-check for self-healing queue stalls #2014

Closed

coderabbitai bot mentioned this pull request Feb 20, 2026

t1276: Subtask-aware queue analysis, fix auto-pickup head-50 limit, orphan issue intake #2026

Merged

marcusquinn deleted the feature/supervisor-sanity-check branch February 21, 2026 01:59

coderabbitai bot mentioned this pull request Feb 24, 2026

fix: register blocked tasks in DB during auto-pickup #2239

Merged

	queued_before_sanity=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM tasks WHERE status = 'queued';" 2>/dev/null || echo 0)
	queued_before_sanity=$(db "$SUPERVISOR_DB" "SELECT COUNT(*) FROM tasks WHERE status = 'queued';" || echo 0)

	local repo_path="${1:-$REPO_PATH}"
	local repo_path
	repo_path="${1:-$REPO_PATH}"

Conversation

marcusquinn commented Feb 20, 2026 • edited by coderabbitai bot

Summary

Problem

Changes

New: .agents/scripts/supervisor/sanity-check.sh

Modified: .agents/scripts/supervisor/pulse.sh

Modified: .agents/scripts/supervisor/todo-sync.sh

Modified: .agents/scripts/supervisor-helper.sh

Design Principle

Testing

Summary by CodeRabbit

gemini-code-assist bot commented Feb 20, 2026

Summary of Changes

Highlights

Footnotes

github-actions bot commented Feb 20, 2026

🔍 Code Quality Report

📈 Current Quality Metrics

gemini-code-assist bot left a comment

Code Review

gemini-code-assist bot Feb 20, 2026

gemini-code-assist bot Feb 20, 2026

gemini-code-assist bot Feb 20, 2026

gemini-code-assist bot Feb 20, 2026

gemini-code-assist bot Feb 20, 2026

gemini-code-assist bot Feb 20, 2026

marcusquinn left a comment

Review response

Addressed (valid feedback)

Dismissed (incorrect or inconsistent with codebase conventions)

coderabbitai bot commented Feb 20, 2026 • edited

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

github-actions bot commented Feb 20, 2026

🔍 Code Quality Report

📈 Current Quality Metrics

coderabbitai bot left a comment

github-actions bot commented Feb 20, 2026

🔍 Code Quality Report

📈 Current Quality Metrics

coderabbitai bot left a comment
