t1183: Fix 3 supervisor pipeline bugs — PR-aware reaping, atomic eval, batch merge#1790
Conversation
…itions, batch merge
Bug 1 (Phase 1c): Check pr_url before retrying stuck evaluating tasks.
Tasks with PRs go to pr_review instead of retrying, preventing duplicate
worker dispatch and duplicate PRs.
Bug 2 (Phase 1 eval): cmd_transition('complete') now executes immediately
after quality gate passes, before non-critical post-processing (proof-log,
patterns, notifications). Prevents lost transitions when pulse is killed
by stale lock breaker at 600s timeout.
Bug 3 (Phase 3): Pull main after each merge so subsequent PRs can rebase
cleanly in the same pulse. Cap at SUPERVISOR_MAX_MERGES_PER_PULSE (default
5) to prevent runaway merge loops when workers create PRs faster than
Phase 3 can process them.
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
WalkthroughEnhanced supervisor pipeline pulse script with crash-safe checkpoint transitions after task completion, PR-aware retry routing in Phase 1c/0.7 to avoid duplicate PRs, per-pulse merge cap (default 5) in Phase 3, and post-merge main branch sync for cleaner cascading PR rebases. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Wed Feb 18 21:05:41 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
.agents/scripts/supervisor/pulse.sh (2)
2707-2716:git pull --rebasecould fail or cause unintended rebases if the repo isn't onmain.
git pull --rebase origin mainrebases the current branch ontoorigin/main. If the repo working directory happens to be checked out on a different branch (e.g., after a worktree operation or interrupted cleanup), this either fails or rebases the wrong branch. The|| log_warnprevents breakage, but the update intent is silently lost.If the goal is simply to ensure subsequent
git rebase origin/maincommands in worktrees see the latestmain,git fetch origin mainis sufficient and branch-agnostic.Proposed: use fetch instead of pull --rebase
if [[ -n "$merge_repo" && -d "$merge_repo" ]]; then - git -C "$merge_repo" pull --rebase origin main 2>>"$SUPERVISOR_LOG" || + git -C "$merge_repo" fetch origin main 2>>"$SUPERVISOR_LOG" || log_warn " $tid: failed to pull main in $merge_repo after merge (non-blocking)" fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh around lines 2707 - 2716, The current command git -C "$merge_repo" pull --rebase origin main can rebase the checked-out branch or fail if not on main; replace it with a branch-agnostic fetch: run git -C "$merge_repo" fetch origin main 2>>"$SUPERVISOR_LOG" || log_warn " $tid: failed to fetch main in $merge_repo after merge (non-blocking)". Update the block using merge_repo, SUPERVISOR_LOG and log_warn so it calls db/sql_escape/tid unchanged but uses git ... fetch origin main instead of pull --rebase.
2650-2666: Merge cap is effective, butbreakalso defers no-PR deploy transitions.The
breakat line 2665 exits the entire loop, which stops processing all remaining eligible tasks — includingcompletetasks withno_pr/task_onlythat would otherwise get a cheap direct-to-deployedtransition at line 2673 (no actual git merge involved). Those deferred deploys don't contribute to the cascading rebase problem the cap is designed to prevent.This is low-impact (they'll process next pulse in ~5 min), but if you want to preserve throughput for no-PR tasks while still capping real merges, a
continuewith a flag could skip only merge-eligible tasks.Optional: skip only merge-eligible tasks instead of breaking
- # t1183 Bug 3: Stop processing if we've hit the merge cap. - # Remaining tasks will be picked up in the next pulse cycle. - if [[ "$merged_count" -ge "$max_merges_per_pulse" ]]; then - log_info " Phase 3: reached max merges per pulse ($max_merges_per_pulse), deferring rest to next cycle" - break - fi + # t1183 Bug 3: Once merge cap is reached, only allow no-PR fast-path deploys. + # Real merges are deferred to the next pulse cycle. + local merge_cap_reached=false + if [[ "$merged_count" -ge "$max_merges_per_pulse" ]]; then + merge_cap_reached=true + fiThen guard the
cmd_pr_lifecyclecall (around line 2697) with:if [[ "$merge_cap_reached" == "true" ]]; then log_info " $tid: deferred (merge cap reached)" deferred_count=$((deferred_count + 1)) continue fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh around lines 2650 - 2666, The loop currently uses break when merged_count >= max_merges_per_pulse which prevents any further task processing (including no-PR/complete deploy-only tasks); change that to set a flag (e.g. merge_cap_reached="true") and continue so the loop keeps iterating for non-merge work; then, before attempting any merge work (identify the merge path around cmd_pr_lifecycle and where sibling/parent checks use merged_parents and merged_count), guard the merge branch with if [[ "$merge_cap_reached" == "true" ]] then log_info " $tid: deferred (merge cap reached)"; deferred_count=$((deferred_count+1)); continue; fi so only merge-eligible tasks are skipped while no-PR/complete transitions still run.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 739-741: The condition that routes tasks to pr_review should
exclude the "verified_complete" sentinel and also preserve the audit flags
passed in Phase 0.7; update the if-check that uses stuck_pr_url to add a check
that "$stuck_pr_url" != "verified_complete", and change the cmd_transition call
(the invocation of cmd_transition "$stuck_id" "pr_review") to include the same
--pr-url "$stuck_pr_url" and --error "$stuck_error" arguments (and keep the
2>>"$SUPERVISOR_LOG" || true redirection) so the transition and log_info call
maintain parity with Phase 0.7 behavior and existing sentinel handling.
- Around line 749-753: The Phase 1c branch double-increments task retries:
cmd_transition "$stuck_id" "retrying" already causes state.sh to increment
retries, and the following manual db UPDATE (UPDATE tasks SET retries = retries
+ 1 ...) adds a second increment; remove the redundant manual increment line and
keep the cmd_transition call (or replace cmd_transition with an explicit set to
the intended absolute value if you truly need to control retries), i.e. delete
or comment out the db "$SUPERVISOR_DB" "UPDATE tasks SET retries = retries +
1..." statement so retries increase only once when cmd_transition is invoked.
---
Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 2707-2716: The current command git -C "$merge_repo" pull --rebase
origin main can rebase the checked-out branch or fail if not on main; replace it
with a branch-agnostic fetch: run git -C "$merge_repo" fetch origin main
2>>"$SUPERVISOR_LOG" || log_warn " $tid: failed to fetch main in $merge_repo
after merge (non-blocking)". Update the block using merge_repo, SUPERVISOR_LOG
and log_warn so it calls db/sql_escape/tid unchanged but uses git ... fetch
origin main instead of pull --rebase.
- Around line 2650-2666: The loop currently uses break when merged_count >=
max_merges_per_pulse which prevents any further task processing (including
no-PR/complete deploy-only tasks); change that to set a flag (e.g.
merge_cap_reached="true") and continue so the loop keeps iterating for non-merge
work; then, before attempting any merge work (identify the merge path around
cmd_pr_lifecycle and where sibling/parent checks use merged_parents and
merged_count), guard the merge branch with if [[ "$merge_cap_reached" == "true"
]] then log_info " $tid: deferred (merge cap reached)";
deferred_count=$((deferred_count+1)); continue; fi so only merge-eligible tasks
are skipped while no-PR/complete transitions still run.
| if [[ -n "$stuck_pr_url" && "$stuck_pr_url" != "no_pr" && "$stuck_pr_url" != "task_only" ]]; then | ||
| cmd_transition "$stuck_id" "pr_review" 2>>"$SUPERVISOR_LOG" || true | ||
| log_info " Phase 1c: $stuck_id → pr_review (has PR: $stuck_pr_url, skipping retry)" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, examine the exact lines in pulse.sh at 739-741
echo "=== Lines 739-741 in pulse.sh ==="
sed -n '739,741p' .agents/scripts/supervisor/pulse.sh
echo ""
echo "=== Context around lines 739-741 (735-745) ==="
sed -n '735,745p' .agents/scripts/supervisor/pulse.shRepository: marcusquinn/aidevops
Length of output: 1077
🏁 Script executed:
# Search for verified_complete sentinel usage across the codebase
echo "=== All verified_complete references in pulse.sh ==="
rg -n "verified_complete" .agents/scripts/supervisor/pulse.sh -B2 -A2
echo ""
echo "=== Compare with no_pr and task_only usage ==="
rg -n 'no_pr.*task_only|task_only.*no_pr' .agents/scripts/supervisor/pulse.sh -B1 -A1Repository: marcusquinn/aidevops
Length of output: 2166
🏁 Script executed:
# Find Phase 0.7 and the transition at line 289
echo "=== Phase 0.7 context (around line 289) ==="
sed -n '280,300p' .agents/scripts/supervisor/pulse.sh
echo ""
echo "=== cmd_transition calls with pr_review in pulse.sh ==="
rg -n 'cmd_transition.*pr_review' .agents/scripts/supervisor/pulse.sh -B2 -A2Repository: marcusquinn/aidevops
Length of output: 3180
🏁 Script executed:
# Check state.sh for cmd_transition definition and flag handling
echo "=== cmd_transition implementation in state.sh ==="
if [ -f ".agents/scripts/supervisor/state.sh" ]; then
rg -n "cmd_transition" .agents/scripts/supervisor/state.sh | head -5
# Look for the function definition
sed -n '1,200p' .agents/scripts/supervisor/state.sh | rg -n "cmd_transition|--pr-url|--error" -B2 -A2
else
echo "state.sh not found"
fiRepository: marcusquinn/aidevops
Length of output: 862
🏁 Script executed:
# Verify all sentinel checks in conditional statements
echo "=== All conditional checks for pr_url sentinels ==="
rg -n 'if.*\$\{.*pr.*url.*!=' .agents/scripts/supervisor/pulse.sh -B1 -A1 | head -60Repository: marcusquinn/aidevops
Length of output: 115
Add missing verified_complete sentinel check and audit trail parity with Phase 0.7.
Line 739 filters no_pr and task_only but omits verified_complete, which is explicitly checked in Phase 3b2 (line 960), Phase 3 (line 2673), and other pr_url sentinel validations (line 1144). If a task has pr_url == "verified_complete", the condition evaluates true and incorrectly routes to pr_review.
Additionally, Phase 0.7 (line 289) passes --pr-url and --error flags to cmd_transition for the same type of transition; Phase 1c (line 740) omits both, losing audit trail and reducing consistency.
Proposed fix
- if [[ -n "$stuck_pr_url" && "$stuck_pr_url" != "no_pr" && "$stuck_pr_url" != "task_only" ]]; then
- cmd_transition "$stuck_id" "pr_review" 2>>"$SUPERVISOR_LOG" || true
+ if [[ -n "$stuck_pr_url" && "$stuck_pr_url" != "no_pr" && "$stuck_pr_url" != "task_only" && "$stuck_pr_url" != "verified_complete" ]]; then
+ cmd_transition "$stuck_id" "pr_review" --pr-url "$stuck_pr_url" --error "Stuck evaluating recovery (Phase 1c/t1183): worker dead, PR exists" 2>>"$SUPERVISOR_LOG" || true📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if [[ -n "$stuck_pr_url" && "$stuck_pr_url" != "no_pr" && "$stuck_pr_url" != "task_only" ]]; then | |
| cmd_transition "$stuck_id" "pr_review" 2>>"$SUPERVISOR_LOG" || true | |
| log_info " Phase 1c: $stuck_id → pr_review (has PR: $stuck_pr_url, skipping retry)" | |
| if [[ -n "$stuck_pr_url" && "$stuck_pr_url" != "no_pr" && "$stuck_pr_url" != "task_only" && "$stuck_pr_url" != "verified_complete" ]]; then | |
| cmd_transition "$stuck_id" "pr_review" --pr-url "$stuck_pr_url" --error "Stuck evaluating recovery (Phase 1c/t1183): worker dead, PR exists" 2>>"$SUPERVISOR_LOG" || true | |
| log_info " Phase 1c: $stuck_id → pr_review (has PR: $stuck_pr_url, skipping retry)" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.agents/scripts/supervisor/pulse.sh around lines 739 - 741, The condition
that routes tasks to pr_review should exclude the "verified_complete" sentinel
and also preserve the audit flags passed in Phase 0.7; update the if-check that
uses stuck_pr_url to add a check that "$stuck_pr_url" != "verified_complete",
and change the cmd_transition call (the invocation of cmd_transition "$stuck_id"
"pr_review") to include the same --pr-url "$stuck_pr_url" and --error
"$stuck_error" arguments (and keep the 2>>"$SUPERVISOR_LOG" || true redirection)
so the transition and log_info call maintain parity with Phase 0.7 behavior and
existing sentinel handling.
| if [[ "$stuck_retries" -lt "$stuck_max_retries" ]]; then | ||
| # Transition to retrying so it gets re-dispatched | ||
| cmd_transition "$stuck_id" "retrying" --error "Auto-reaped: stuck in evaluating >10min with dead worker (Phase 1c)" 2>>"$SUPERVISOR_LOG" || true | ||
| db "$SUPERVISOR_DB" "UPDATE tasks SET retries = retries + 1, updated_at = strftime('%Y-%m-%dT%H:%M:%SZ', 'now') WHERE id = '$(sql_escape "$stuck_id")';" 2>/dev/null || true | ||
| log_info " Phase 1c: $stuck_id → retrying (retry $((stuck_retries + 1))/$stuck_max_retries)" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify cmd_transition auto-increment behavior and identify redundant increments
echo "=== Checking state.sh for 'retrying' state auto-increment logic ==="
fd state.sh --type f | head -5Repository: marcusquinn/aidevops
Length of output: 167
🏁 Script executed:
#!/bin/bash
# Search for retrying state increment pattern
rg -n "retrying" --type=sh -B3 -A3 | head -80Repository: marcusquinn/aidevops
Length of output: 5627
🏁 Script executed:
#!/bin/bash
# Find all retry increment patterns in pulse.sh
fd pulse.sh --type fRepository: marcusquinn/aidevops
Length of output: 137
🏁 Script executed:
#!/bin/bash
# Get state.sh content around retrying logic
cat -n .agents/scripts/supervisor/state.sh | grep -A10 -B10 "retrying"Repository: marcusquinn/aidevops
Length of output: 5418
🏁 Script executed:
#!/bin/bash
# Get pulse.sh content around lines 749-753
sed -n '740,760p' .agents/scripts/supervisor/pulse.sh | cat -nRepository: marcusquinn/aidevops
Length of output: 1597
🏁 Script executed:
#!/bin/bash
# Find cmd_transition definition and check its behavior with retrying
rg -n "cmd_transition" .agents/scripts/supervisor/ --type=sh -A15 | head -100Repository: marcusquinn/aidevops
Length of output: 7774
🏁 Script executed:
#!/bin/bash
# Verify Phase 0.7 pattern (lines ~290-310)
sed -n '280,320p' .agents/scripts/supervisor/todo-sync.sh | cat -nRepository: marcusquinn/aidevops
Length of output: 1730
🏁 Script executed:
#!/bin/bash
# Search for other manual retry increments in pulse.sh to see if this is a pattern
rg -n "retries = retries \+ 1|retries \+= 1" .agents/scripts/supervisor/pulse.sh -C3Repository: marcusquinn/aidevops
Length of output: 689
🏁 Script executed:
#!/bin/bash
# Search for Phase 0.7 in pulse.sh to verify the comparison
sed -n '290,310p' .agents/scripts/supervisor/pulse.sh | cat -nRepository: marcusquinn/aidevops
Length of output: 1583
🏁 Script executed:
#!/bin/bash
# Look for all Phase patterns and retry logic in pulse.sh
rg -n "Phase 0\.|Phase 1" .agents/scripts/supervisor/pulse.sh -B2 -A8 | head -150Repository: marcusquinn/aidevops
Length of output: 8105
🏁 Script executed:
#!/bin/bash
# Find cmd_transition function definition to verify it executes the update_parts
sed -n '1,300p' .agents/scripts/supervisor/state.sh | grep -A30 "^cmd_transition()"Repository: marcusquinn/aidevops
Length of output: 648
🏁 Script executed:
#!/bin/bash
# Check if db function is being called to execute the UPDATE in cmd_transition
sed -n '1,250p' .agents/scripts/supervisor/state.sh | tail -100Repository: marcusquinn/aidevops
Length of output: 3714
Double retry increment — each Phase 1c recovery consumes two retries instead of one.
When cmd_transition "$stuck_id" "retrying" executes (line 751), state.sh appends retries = retries + 1 to the UPDATE statement (lines 182–184). The subsequent manual UPDATE tasks SET retries = retries + 1 on line 752 increments it a second time. Tasks exhaust their retry budget at 2× the intended rate.
Compare with Phase 0.7 (lines 298–301), which correctly sets an absolute value (retries = $new_retries) before calling cmd_transition to "queued" — a state that does not auto-increment retries in cmd_transition.
Proposed fix: remove the redundant increment
if [[ "$stuck_retries" -lt "$stuck_max_retries" ]]; then
# Transition to retrying so it gets re-dispatched
cmd_transition "$stuck_id" "retrying" --error "Auto-reaped: stuck in evaluating >10min with dead worker (Phase 1c)" 2>>"$SUPERVISOR_LOG" || true
- db "$SUPERVISOR_DB" "UPDATE tasks SET retries = retries + 1, updated_at = strftime('%Y-%m-%dT%H:%M:%SZ', 'now') WHERE id = '$(sql_escape "$stuck_id")';" 2>/dev/null || true
log_info " Phase 1c: $stuck_id → retrying (retry $((stuck_retries + 1))/$stuck_max_retries)"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if [[ "$stuck_retries" -lt "$stuck_max_retries" ]]; then | |
| # Transition to retrying so it gets re-dispatched | |
| cmd_transition "$stuck_id" "retrying" --error "Auto-reaped: stuck in evaluating >10min with dead worker (Phase 1c)" 2>>"$SUPERVISOR_LOG" || true | |
| db "$SUPERVISOR_DB" "UPDATE tasks SET retries = retries + 1, updated_at = strftime('%Y-%m-%dT%H:%M:%SZ', 'now') WHERE id = '$(sql_escape "$stuck_id")';" 2>/dev/null || true | |
| log_info " Phase 1c: $stuck_id → retrying (retry $((stuck_retries + 1))/$stuck_max_retries)" | |
| if [[ "$stuck_retries" -lt "$stuck_max_retries" ]]; then | |
| # Transition to retrying so it gets re-dispatched | |
| cmd_transition "$stuck_id" "retrying" --error "Auto-reaped: stuck in evaluating >10min with dead worker (Phase 1c)" 2>>"$SUPERVISOR_LOG" || true | |
| log_info " Phase 1c: $stuck_id → retrying (retry $((stuck_retries + 1))/$stuck_max_retries)" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.agents/scripts/supervisor/pulse.sh around lines 749 - 753, The Phase 1c
branch double-increments task retries: cmd_transition "$stuck_id" "retrying"
already causes state.sh to increment retries, and the following manual db UPDATE
(UPDATE tasks SET retries = retries + 1 ...) adds a second increment; remove the
redundant manual increment line and keep the cmd_transition call (or replace
cmd_transition with an explicit set to the intended absolute value if you truly
need to control retries), i.e. delete or comment out the db "$SUPERVISOR_DB"
"UPDATE tasks SET retries = retries + 1..." statement so retries increase only
once when cmd_transition is invoked.



Summary
Fixes three critical supervisor pipeline bugs that required repeated manual intervention during the PR backlog clearing session.
Bug 1: Phase 1c PR-aware reaping (lines 728-760)
evaluatingtasks with dead workers and blindly retries them — even when the task already has a PR. This dispatches a duplicate worker that may create a duplicate PR.pr_urlin the DB. If a PR exists, transition topr_review(Phase 3 handles it) instead ofretrying.Bug 2: Atomic eval transitions (lines 474-512)
cmd_transition("complete")executed after proof-log, budget tracking, and other non-critical operations. If the pulse is killed by the stale lock breaker (600s timeout) between evaluation and transition, the task stays stuck inevaluatingforever — re-evaluated every pulse but never transitioning.cmd_transition+cleanup_worker_processesto execute immediately after the quality gate passes. All subsequent operations (proof-log, patterns, notifications, model labels) are non-critical and can be lost without affecting correctness.Bug 3: Phase 3 batch merge (lines 2647-2720)
mainforward, dirtying remaining PRs. If rebase fails, the PR is deferred to the next pulse — creating a growing backlog when workers create PRs faster than Phase 3 can merge them.git pull --rebase origin mainin the task's repo so subsequent PRs can rebase cleanly in the same pulse. AddedSUPERVISOR_MAX_MERGES_PER_PULSEcap (default: 5) to prevent runaway merge loops.Testing
bash -nsyntax check: passshellcheck -x -s bash: zero violations.agents/scripts/supervisor/pulse.shCloses #1788
Summary by CodeRabbit
Bug Fixes