feat: add Phase 3b2 reconciliation and triage command to supervisor #1488
marcusquinn merged 2 commits into main
Add automatic reconciliation of stuck blocked/verify_failed tasks:
- Phase 3b2: runs every pulse, checks blocked/verify_failed tasks against GitHub PR state. If the PR is merged, advances the task to deployed via direct DB update (bypasses the state machine since ground truth is verified). If the PR is closed without merge, resets the task to queued for re-dispatch. Handles obsolete/sentinel PR URLs by cancelling.
- triage command: interactive bulk diagnosis and resolution. Categorizes all stuck tasks by root cause (merged-but-stuck, closed-no-merge, obsolete, rebase-exhausted, open-pr, no-pr). Supports --dry-run for preview and --auto-resolve for automated fix.

Root cause: the supervisor's pulse cycle had no mechanism to detect when a PR was merged externally (by a later pulse or manual action) while the task was stuck in blocked/verify_failed state. 27 tasks were stuck with merged PRs that the DB didn't know about.
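The Phase 3b2 decision described above can be sketched as a pure mapping from PR state to the next task status. This is a minimal sketch, not the PR's implementation: `next_status_for_pr_state` is a hypothetical name, `OBSOLETE` is a hypothetical label for sentinel or unparseable PR URLs (it is not a GitHub state), and the `skip` result for open or unknown states is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a PR state to the status a stuck task would
# be moved to, following the PR description.
next_status_for_pr_state() {
	local pr_state
	pr_state="$1"
	case "$pr_state" in
	MERGED) echo "deployed" ;;    # merged externally: work is already done
	CLOSED) echo "queued" ;;      # closed without merge: re-dispatch
	OBSOLETE) echo "cancelled" ;; # hypothetical label for sentinel PR URLs
	*) echo "skip" ;;             # assumption: leave open/unknown PRs alone
	esac
	return 0
}

next_status_for_pr_state MERGED # prints "deployed"
```

Keeping the decision separate from the DB update makes it possible to test the mapping without a database or network access.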
Summary of Changes

This pull request addresses a critical issue where supervisor tasks could become stuck in 'blocked' or 'verify_failed' states even after their corresponding GitHub Pull Requests were merged or closed externally. This led to an accumulation of unresolved tasks. The changes introduce an automatic reconciliation phase (Phase 3b2) within the supervisor's pulse cycle to detect and correct these discrepancies, ensuring tasks accurately reflect their PR status. Additionally, a new 'triage' command is provided to enable manual, interactive diagnosis and resolution of various stuck task scenarios, significantly improving the supervisor's robustness and maintainability.
Walkthrough

The changes add triage capability to the supervisor system by introducing a Phase 3b2 reconciliation step that validates task status against GitHub PR state, and a cmd_triage() command that diagnoses and resolves stuck tasks through automatic state transitions.
Sequence Diagram(s)
sequenceDiagram
actor User
participant Helper as supervisor-helper.sh
participant Pulse as pulse.sh
participant DB as Task DB
participant GitHub as GitHub API
User->>Helper: supervisor-helper.sh triage [--dry-run/--auto-resolve]
activate Helper
Helper->>Pulse: cmd_triage()
activate Pulse
Pulse->>DB: Query blocked/verify_failed tasks
DB-->>Pulse: Task list with PR URLs
Pulse->>GitHub: gh pr view for each task PR
GitHub-->>Pulse: PR state (MERGED/CLOSED/OPEN)
Pulse->>Pulse: Categorize tasks by state & status
Pulse-->>Helper: Print diagnostic report
alt --auto-resolve flag
Pulse->>DB: Update task states (deployed/queued/cancel)
Pulse->>DB: Update state logs & cleanup records
DB-->>Pulse: Confirmation
Pulse->>GitHub: Update issue status if needed
end
deactivate Pulse
deactivate Helper
Helper-->>User: Report & resolution summary
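The "Categorize tasks by state & status" step in the diagram can be sketched as a small classifier over the six categories named in the PR description. This is a sketch under assumptions: `triage_category` is a hypothetical name, the sentinel values and the threshold of 3 are taken from the reviewed code, and filing any unrecognized state under `no-pr` is an assumption modeled on how unreachable PRs are handled there.

```shell
#!/usr/bin/env bash
# Hypothetical classifier.
# Arguments: PR URL (or sentinel), PR state from GitHub, rebase attempts.
triage_category() {
	local pr_url pr_state rebase_attempts max_rebase_attempts
	pr_url="$1"
	pr_state="$2"
	rebase_attempts="${3:-0}"
	max_rebase_attempts=3 # threshold used by the reviewed code

	# Sentinel values are classified without consulting GitHub.
	case "$pr_url" in
	"" | no_pr | task_only)
		echo "no-pr"
		return 0
		;;
	task_obsolete)
		echo "obsolete"
		return 0
		;;
	esac

	case "$pr_state" in
	MERGED) echo "merged-but-stuck" ;;
	CLOSED) echo "closed-no-merge" ;;
	OPEN)
		if [ "$rebase_attempts" -ge "$max_rebase_attempts" ]; then
			echo "rebase-exhausted"
		else
			echo "open-pr"
		fi
		;;
	*) echo "no-pr" ;; # assumption: unreachable or unknown state
	esac
	return 0
}
```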
Estimated code review effort: 4 (Complex), ~45 minutes
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Sun Feb 15 02:15:55 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1912-1962: The gh CLI call in the while-loop uses gh pr view
"$pr_num" but ignores the repo captured in _trepo; update the call to include
the repository by passing --repo "$_trepo" (i.e., change pr_json=$(gh pr view
"$pr_num" --json state,mergedAt ...) to pr_json=$(gh pr view --repo "$_trepo"
"$pr_num" --json state,mergedAt ...) ), ensuring the loop's read fields (_trepo)
are actually used when querying PR info and keeping the existing error
handling/fallback unchanged.
- Around line 557-579: The gh PR lookup omits repository context and currently
calls gh pr view "$pr_number" which can target the wrong repo; update the
read/variables so the repo from the input is preserved (use stale_repo instead
of _stale_repo) and call gh pr view with the --repo flag (e.g., gh pr view
"$pr_number" --repo "$stale_repo" --json state,mergedAt ...) so the PR is
queried in the correct repository; keep the existing fallback behavior for
unreachable PRs.
- Around line 616-638: The CLOSED branch currently updates tasks directly
without recording the transition in state_log; add a state_log INSERT when
resetting the task to queued (similar to the MERGED path) immediately after the
DB UPDATE that sets status='queued'. Use the same escaped_stale_id used for the
UPDATE, record previous status ($stale_status) and new status 'queued', include
relevant metadata (actor like "pulse:phase_3b2" and a reason such as "PR closed
without merge"), and ensure this runs before cleanup_after_merge/write_proof_log
and that failures do not block the rest of the branch (match surrounding || true
behavior).
- Around line 565-571: The current logic cancels tasks when pr_number is empty
because the regex `/pull/([0-9]+)$` is too strict; update the extraction around
pr_number/stale_pr to be more tolerant (e.g., strip trailing slashes and query
strings, or use a regex that finds the first numeric ID after "/pull/" anywhere
in the path) and, if still not found, log a warning with stale_pr via log_warn
and skip the cancel path (increment reconciled_obsolete as needed and continue)
instead of calling cmd_transition "$stale_id" "cancelled" --error "..."; keep
the existing log message but remove the cancellation command so tasks with
slightly different PR URL formats are not prematurely cancelled.
- Around line 1903-1962: The category builders (cat_merged_stuck,
cat_closed_no_merge, cat_obsolete, cat_rebase_exhausted, cat_no_pr, cat_open_pr)
currently concatenate entries with literal "\n" and later rely on echo -e, which
misinterprets backslash sequences in error text; change each of those string
variables into an array (e.g., cat_no_pr -> cat_no_pr_array) and push entries
into the array instead of concatenating strings inside the while loop that reads
stuck_tasks (refer to the while loop and the variables cat_merged_stuck,
cat_closed_no_merge, cat_obsolete, cat_rebase_exhausted, cat_no_pr,
cat_open_pr); when you need to emit the lists, use printf '%s\n'
"${cat_no_pr_array[@]}" (and similarly for the other arrays) so entries are
printed exactly without interpreting backslash escapes, and update any
downstream consumers that expected newline-joined strings to accept the printed
output or the array form.
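The more tolerant PR-number extraction suggested for lines 565-571 could look like the following. This is a minimal sketch: `extract_pr_number` is a hypothetical name, and the set of accepted suffixes (trailing slash, sub-path, query string, fragment) is an assumption about which URL variants occur in practice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the PR number found after "/pull/" and
# return 0, or print nothing and return 1 when no number is present.
extract_pr_number() {
	local url re
	url="$1"
	# Digits after "/pull/", optionally followed by "/", "?" or "#"
	# and anything else up to the end of the string.
	re='/pull/([0-9]+)([/?#].*)?$'
	if [[ "$url" =~ $re ]]; then
		echo "${BASH_REMATCH[1]}"
		return 0
	fi
	return 1
}
```

A caller following the review suggestion would log a warning and skip the task when this returns non-zero, rather than cancelling it.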
🧹 Nitpick comments (2)
.agents/scripts/supervisor/pulse.sh (2)
2002-2055: Duplicated resolution logic between Phase 3b2 and cmd_triage --auto-resolve.
The merged→deployed DB update (lines 2011-2019 vs 592-597), closed→queued reset (lines 2035-2040 vs 622-631), and obsolete cancellation patterns are near-identical copies. Consider extracting shared helpers (e.g., _resolve_merged_task, _reset_closed_task) to keep them in sync and reduce maintenance burden.
1955-1959: Hardcoded rebase threshold of 3 should reference max_retry_cycles.
Phase 3.5 (line 730) defines local max_retry_cycles=3 for the same threshold. The triage command hardcodes 3 directly, so if the threshold changes in Phase 3.5, triage categorization will drift silently. Consider extracting this to a shared constant or at minimum adding a comment noting the dependency.
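One way to act on the second nitpick is a single script-level default that both Phase 3.5 and triage read. This is only a sketch: `MAX_RETRY_CYCLES` and `is_rebase_exhausted` are hypothetical names, not identifiers from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical shared threshold; can be overridden from the environment.
MAX_RETRY_CYCLES="${MAX_RETRY_CYCLES:-3}"

# Hypothetical predicate: succeeds when the attempt count has reached
# the shared threshold. An empty count is treated as 0.
is_rebase_exhausted() {
	local attempts
	attempts="${1:-0}"
	[ "$attempts" -ge "$MAX_RETRY_CYCLES" ]
}
```

With this in place the triage check would read `if is_rebase_exhausted "$trebase"; then ...` and would follow any later change to the threshold.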
while IFS='|' read -r stale_id stale_status stale_pr _stale_repo; do
    [[ -z "$stale_id" ]] && continue

    # Extract PR number from URL
    local pr_number=""
    if [[ "$stale_pr" =~ /pull/([0-9]+)$ ]]; then
        pr_number="${BASH_REMATCH[1]}"
    fi
    if [[ -z "$pr_number" ]]; then
        # Non-standard PR URL or obsolete marker — mark as cancelled
        log_warn " Phase 3b2: $stale_id has non-parseable PR URL '$stale_pr' — cancelling"
        cmd_transition "$stale_id" "cancelled" --error "PR URL not parseable: $stale_pr" 2>>"$SUPERVISOR_LOG" || true
        reconciled_obsolete=$((reconciled_obsolete + 1))
        continue
    fi

    # Query GitHub for actual PR state
    local pr_json
    pr_json=$(gh pr view "$pr_number" --json state,mergedAt 2>/dev/null || echo "")
    if [[ -z "$pr_json" ]]; then
        log_warn " Phase 3b2: $stale_id PR #$pr_number unreachable — skipping"
        continue
    fi
gh pr view missing --repo flag — will query wrong repo in multi-repo setups.
The SQL query selects repo but it's assigned to the unused _stale_repo variable. Without --repo, gh pr view uses the current working directory's remote, which may not match the task's repo. This could produce incorrect PR state lookups or silent failures.
🔧 Proposed fix: pass repo context to `gh pr view`
- while IFS='|' read -r stale_id stale_status stale_pr _stale_repo; do
+ while IFS='|' read -r stale_id stale_status stale_pr stale_repo; do
[[ -z "$stale_id" ]] && continue
# Extract PR number from URL
local pr_number=""
if [[ "$stale_pr" =~ /pull/([0-9]+)$ ]]; then
pr_number="${BASH_REMATCH[1]}"
fi
if [[ -z "$pr_number" ]]; then
# Non-standard PR URL or obsolete marker — mark as cancelled
log_warn " Phase 3b2: $stale_id has non-parseable PR URL '$stale_pr' — cancelling"
cmd_transition "$stale_id" "cancelled" --error "PR URL not parseable: $stale_pr" 2>>"$SUPERVISOR_LOG" || true
reconciled_obsolete=$((reconciled_obsolete + 1))
continue
fi
# Query GitHub for actual PR state
+ local repo_slug=""
+ repo_slug=$(detect_repo_slug "${stale_repo:-.}" 2>/dev/null || echo "")
local pr_json
- pr_json=$(gh pr view "$pr_number" --json state,mergedAt 2>/dev/null || echo "")
+ pr_json=$(gh pr view "$pr_number" ${repo_slug:+--repo "$repo_slug"} --json state,mergedAt 2>/dev/null || echo "")

As per coding guidelines, .agents/scripts/*.sh: "Reliability and robustness".
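The `${repo_slug:+--repo "$repo_slug"}` expansion in the proposed fix adds the flag and its value only when a slug was detected, and adds nothing at all when it is empty. The sketch below shows that behavior in isolation; `count_args` is a hypothetical helper and `owner/repo` is an example value, neither comes from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many arguments it received.
count_args() {
	echo "$#"
	return 0
}

repo_slug="owner/repo" # example value
# Expands to two words: --repo and owner/repo
count_args view 42 ${repo_slug:+--repo "$repo_slug"} # prints 4

repo_slug=""
# Expands to nothing, so no empty argument is passed
count_args view 42 ${repo_slug:+--repo "$repo_slug"} # prints 2
```

The expansion itself is deliberately left unquoted so that it can produce zero words; the inner quotes still keep the slug as a single argument.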
# Categorize
local cat_merged_stuck=""
local cat_closed_no_merge=""
local cat_obsolete=""
local cat_rebase_exhausted=""
local cat_no_pr=""
local cat_open_pr=""
local total_stuck=0

while IFS='|' read -r tid tstatus tpr terror _trepo trebase; do
    [[ -z "$tid" ]] && continue
    total_stuck=$((total_stuck + 1))

    # No PR or sentinel
    if [[ -z "$tpr" || "$tpr" == "no_pr" || "$tpr" == "task_only" ]]; then
        cat_no_pr="${cat_no_pr}${tid}|${tstatus}|${terror}\n"
        continue
    fi
    if [[ "$tpr" == "task_obsolete" ]]; then
        cat_obsolete="${cat_obsolete}${tid}|${tstatus}|${tpr}\n"
        continue
    fi

    # Extract PR number
    local pr_num=""
    if [[ "$tpr" =~ /pull/([0-9]+)$ ]]; then
        pr_num="${BASH_REMATCH[1]}"
    fi
    if [[ -z "$pr_num" ]]; then
        cat_obsolete="${cat_obsolete}${tid}|${tstatus}|unparseable:${tpr}\n"
        continue
    fi

    # Query GitHub
    local pr_json
    pr_json=$(gh pr view "$pr_num" --json state,mergedAt 2>/dev/null || echo "")
    if [[ -z "$pr_json" ]]; then
        cat_no_pr="${cat_no_pr}${tid}|${tstatus}|PR #${pr_num} unreachable\n"
        continue
    fi

    local pr_state
    pr_state=$(echo "$pr_json" | grep -o '"state":"[^"]*"' | cut -d'"' -f4)

    case "$pr_state" in
    MERGED)
        cat_merged_stuck="${cat_merged_stuck}${tid}|${tstatus}|PR #${pr_num} MERGED\n"
        ;;
    CLOSED)
        cat_closed_no_merge="${cat_closed_no_merge}${tid}|${tstatus}|PR #${pr_num} CLOSED\n"
        ;;
    OPEN)
        if [[ "${trebase:-0}" -ge 3 ]]; then
            cat_rebase_exhausted="${cat_rebase_exhausted}${tid}|${tstatus}|PR #${pr_num} rebase_attempts=${trebase}\n"
        else
            cat_open_pr="${cat_open_pr}${tid}|${tstatus}|PR #${pr_num} OPEN (${terror})\n"
        fi
        ;;
    esac
done <<<"$stuck_tasks"
Building category lists with \n string concatenation + echo -e is fragile.
If any task error message contains backslash sequences (e.g., \n, \t, \\), echo -e on lines 2024/2043/2055 will interpret them, corrupting the output or breaking the read loop. This is a real risk since error messages can contain arbitrary text.
💡 Safer alternative: use arrays
Replace the string concatenation pattern with arrays:
- local cat_merged_stuck=""
+ local -a cat_merged_stuck=()
...
- cat_merged_stuck="${cat_merged_stuck}${tid}|${tstatus}|PR #${pr_num} MERGED\n"
+ cat_merged_stuck+=("${tid}|${tstatus}|PR #${pr_num} MERGED")
...
# In auto-resolve loop:
- done <<<"$(echo -e "$cat_merged_stuck")"
+ done < <(printf '%s\n' "${cat_merged_stuck[@]}")

This avoids interpreting backslash sequences in error messages.
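To see the failure mode concretely, the sketch below pushes one entry whose error text contains backslashes through both approaches and counts the resulting non-empty lines. The function names, task ID, and error text are hypothetical examples, not values from the PR.

```shell
#!/usr/bin/env bash
# String concatenation + echo -e: escapes inside the data are interpreted
# together with the "\n" separator, so one entry can become several lines.
count_lines_echo_e() {
	local joined
	joined="${1}\n"
	echo -e "$joined" | grep -c .
}

# Array + printf '%s\n': each element is written verbatim on one line.
count_lines_printf() {
	local -a entries
	entries=("$1")
	printf '%s\n' "${entries[@]}" | grep -c .
}

entry='t001|blocked|cannot read C:\new\table' # hypothetical entry
count_lines_echo_e "$entry" # prints 2: "\n" in the error text split the entry
count_lines_printf "$entry" # prints 1
```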
Same missing --repo flag on gh pr view as in Phase 3b2.
Line 1938 calls gh pr view "$pr_num" without --repo, and the repo column is captured as _trepo (unused). Apply the same fix as recommended for Phase 3b2.
Code Review
This pull request introduces Phase 3b2 to the supervisor pulse cycle, enabling automatic reconciliation of stuck tasks against their actual GitHub PR state. It also adds a comprehensive triage command for manual diagnosis and resolution of queue health issues. The changes significantly improve the framework's ability to self-heal from external PR merges or closures. My review focuses on adherence to the aidevops Shell Script Style Guide, specifically regarding variable declaration safety, function return statements, and secure handling of dynamic data in reports. While the logic is sound and addresses a critical gap in the supervisor's state management, several violations of the repository's strict shell standards (Rules 11 and 12) were identified and should be corrected to ensure robustness and consistency.
local reconciled_merged=0
local reconciled_closed=0
local reconciled_obsolete=0
Declare and assign local variables separately so that the exit status of `local` does not mask the exit status of a command substitution on the right-hand side, per Rule 11 of the style guide.
- local reconciled_merged=0
- local reconciled_closed=0
- local reconciled_obsolete=0
+ local reconciled_merged
+ local reconciled_closed
+ local reconciled_obsolete
+ reconciled_merged=0
+ reconciled_closed=0
+ reconciled_obsolete=0
References
- Use local var="$1" pattern in functions (declare and assign separately for exit code safety)
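The exit-code concern behind this rule only arises when the right-hand side is a command substitution. A minimal sketch of the difference follows; the function names `combined` and `separate` are hypothetical.

```shell
#!/usr/bin/env bash
# Combined form: the status of `local` itself (0) replaces the failure
# status of the command substitution.
combined() {
	local value=$(false)
	echo "$?"
	return 0
}

# Separate form: the assignment reports the substitution's status.
separate() {
	local value
	value=$(false)
	echo "$?"
	return 0
}

combined # prints 0
separate # prints 1
```

For literal assignments such as `=0` there is no command whose status could be lost, so in those cases the split is a matter of consistency with the style guide rather than a behavior change.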
[[ -z "$stale_id" ]] && continue

# Extract PR number from URL
local pr_number=""
Separate declaration and assignment for local variables as required by Rule 11.
- local pr_number=""
+ local pr_number
+ pr_number=""
References
- Use local var="$1" pattern in functions (declare and assign separately for exit code safety)
pr_state=$(echo "$pr_json" | grep -o '"state":"[^"]*"' | cut -d'"' -f4)
pr_merged_at=$(echo "$pr_json" | grep -o '"mergedAt":"[^"]*"' | cut -d'"' -f4)

db "$SUPERVISOR_DB" "UPDATE tasks SET
    status = 'deployed',
    error = NULL,
    completed_at = strftime('%Y-%m-%dT%H:%M:%SZ','now'),
    updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
    WHERE id = '$escaped_stale_id';" 2>/dev/null || true
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Sun Feb 15 02:29:25 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Summary
- `triage` command (`supervisor-helper.sh triage [--dry-run] [--auto-resolve]`) for interactive bulk diagnosis and resolution of stuck tasks

Root Cause
The supervisor had no mechanism to detect when a PR was merged externally (by a later pulse cycle or manual action) while the task was stuck in `blocked` or `verify_failed` state. This caused 27 tasks to accumulate as stuck despite their PRs being successfully merged.

Changes
Phase 3b2 (automatic, runs every pulse)
- Checks `blocked`/`verify_failed` tasks with PR URLs against the actual PR state via `gh pr view`
- Merged PR: advances the task to `deployed` (bypasses state machine since ground truth is verified), cleans up worktree, updates TODO.md, syncs GitHub issue labels
- Closed PR without merge: resets the task to `queued` with clean slate for re-dispatch
- Obsolete or sentinel PR URL (`task_obsolete`): Cancels the task

Triage command (manual, on-demand)
- `--dry-run`: Preview only, no changes
- `--auto-resolve`: Automatically fixes resolvable categories

Verification
Tested against the live queue:
- `triage --dry-run` now reports "No stuck tasks found — queue is healthy"