Skip to content

Comments

t1183: Fix 3 supervisor pipeline bugs — PR-aware reaping, atomic eval, batch merge#1790

Merged
marcusquinn merged 1 commit intomainfrom
bugfix/t1183-supervisor-pipeline
Feb 18, 2026
Merged

t1183: Fix 3 supervisor pipeline bugs — PR-aware reaping, atomic eval, batch merge#1790
marcusquinn merged 1 commit intomainfrom
bugfix/t1183-supervisor-pipeline

Conversation

@marcusquinn
Copy link
Owner

@marcusquinn marcusquinn commented Feb 18, 2026

Summary

Fixes three critical supervisor pipeline bugs that required repeated manual intervention during the PR backlog clearing session.

Bug 1: Phase 1c PR-aware reaping (lines 728-760)

  • Problem: Phase 1c catches stuck evaluating tasks with dead workers and blindly retries them — even when the task already has a PR. This dispatches a duplicate worker that may create a duplicate PR.
  • Fix: Before the retry/fail decision, check pr_url in the DB. If a PR exists, transition to pr_review (Phase 3 handles it) instead of retrying.

Bug 2: Atomic eval transitions (lines 474-512)

  • Problem: cmd_transition("complete") executed after proof-log, budget tracking, and other non-critical operations. If the pulse is killed by the stale lock breaker (600s timeout) between evaluation and transition, the task stays stuck in evaluating forever — re-evaluated every pulse but never transitioning.
  • Fix: Move cmd_transition + cleanup_worker_processes to execute immediately after the quality gate passes. All subsequent operations (proof-log, patterns, notifications, model labels) are non-critical and can be lost without affecting correctness.

Bug 3: Phase 3 batch merge (lines 2647-2720)

  • Problem: Each merge moves main forward, dirtying remaining PRs. If rebase fails, the PR is deferred to the next pulse — creating a growing backlog when workers create PRs faster than Phase 3 can merge them.
  • Fix: After each successful merge, git pull --rebase origin main in the task's repo so subsequent PRs can rebase cleanly in the same pulse. Added SUPERVISOR_MAX_MERGES_PER_PULSE cap (default: 5) to prevent runaway merge loops.

Testing

  • bash -n syntax check: pass
  • shellcheck -x -s bash: zero violations
  • All changes in a single file: .agents/scripts/supervisor/pulse.sh

Closes #1788

Summary by CodeRabbit

Bug Fixes

  • Fixed potential duplicate pull requests in task workflows
  • Improved recovery handling for stale task states
  • Enhanced task completion checkpointing for crash safety
  • Added merge rate limiting (5 per cycle) for improved stability under heavy workloads
  • Optimized repository synchronization for cleaner task cascades

…itions, batch merge

Bug 1 (Phase 1c): Check pr_url before retrying stuck evaluating tasks.
Tasks with PRs go to pr_review instead of retrying, preventing duplicate
worker dispatch and duplicate PRs.

Bug 2 (Phase 1 eval): cmd_transition('complete') now executes immediately
after quality gate passes, before non-critical post-processing (proof-log,
patterns, notifications). Prevents lost transitions when pulse is killed
by stale lock breaker at 600s timeout.

Bug 3 (Phase 3): Pull main after each merge so subsequent PRs can rebase
cleanly in the same pulse. Cap at SUPERVISOR_MAX_MERGES_PER_PULSE (default
5) to prevent runaway merge loops when workers create PRs faster than
Phase 3 can process them.
@gemini-code-assist
Copy link

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 18, 2026

Walkthrough

Enhanced supervisor pipeline pulse script with crash-safe checkpoint transitions after task completion, PR-aware retry routing in Phase 1c/0.7 to avoid duplicate PRs, per-pulse merge cap (default 5) in Phase 3, and post-merge main branch sync for cleaner cascading PR rebases.

Changes

Cohort / File(s) Summary
Supervisor Pulse Phase Enhancements
.agents/scripts/supervisor/pulse.sh
Adds immediate checkpoint transitions with state cleanup after task completion; enhances Phase 1c/0.7 stale-evaluating detection to route to pr_review if PR exists, otherwise retry; introduces per-pulse merge cap (max_merges_per_pulse=5) in Phase 3; adds post-merge main branch sync to rebase subsequent PRs against updated main.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • #1488: Extends Phase 3b2 PR-driven reconciliation with merged/closed PR handling and related recovery paths
  • #917: Implements per-task orphaned-PR detection to avoid unnecessary retries and transitions tasks to pr_review when PRs exist
  • #1704: Adds PR adoption into Phase 3a pipeline alongside PR-aware retry/transition logic and post-merge sync handling

Poem

🚀 Checkpoints lock in, clean and fast,
Duplicates vanish—a thing of the past,
Phases align with PR-aware might,
Five merges per pulse, rebasing just right!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the three main bugs being fixed (PR-aware reaping, atomic eval, batch merge) and directly relates to the changeset.
Linked Issues check ✅ Passed The PR successfully implements all three objectives from #1788: PR-aware reaping in Phase 1c, atomic evaluation transitions in Phase 1b, and Phase 3 batch merge handling with merge caps.
Out of Scope Changes check ✅ Passed All changes in the pulse.sh file are directly scoped to fixing the three identified bugs; no unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch bugfix/t1183-supervisor-pipeline

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 26 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Wed Feb 18 21:05:38 UTC 2026: Code review monitoring started
Wed Feb 18 21:05:38 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 26

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 26
  • VULNERABILITIES: 0

Generated on: Wed Feb 18 21:05:41 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@sonarqubecloud
Copy link

@marcusquinn marcusquinn merged commit 921c1f9 into main Feb 18, 2026
10 of 11 checks passed
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
.agents/scripts/supervisor/pulse.sh (2)

2707-2716: git pull --rebase could fail or cause unintended rebases if the repo isn't on main.

git pull --rebase origin main rebases the current branch onto origin/main. If the repo working directory happens to be checked out on a different branch (e.g., after a worktree operation or interrupted cleanup), this either fails or rebases the wrong branch. The || log_warn prevents breakage, but the update intent is silently lost.

If the goal is simply to ensure subsequent git rebase origin/main commands in worktrees see the latest main, git fetch origin main is sufficient and branch-agnostic.

Proposed: use fetch instead of pull --rebase
 				if [[ -n "$merge_repo" && -d "$merge_repo" ]]; then
-					git -C "$merge_repo" pull --rebase origin main 2>>"$SUPERVISOR_LOG" ||
+					git -C "$merge_repo" fetch origin main 2>>"$SUPERVISOR_LOG" ||
 						log_warn "  $tid: failed to pull main in $merge_repo after merge (non-blocking)"
 				fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 2707 - 2716, The current
command git -C "$merge_repo" pull --rebase origin main can rebase the
checked-out branch or fail if not on main; replace it with a branch-agnostic
fetch: run git -C "$merge_repo" fetch origin main 2>>"$SUPERVISOR_LOG" ||
log_warn "  $tid: failed to fetch main in $merge_repo after merge
(non-blocking)". Update the block using merge_repo, SUPERVISOR_LOG and log_warn
so it calls db/sql_escape/tid unchanged but uses git ... fetch origin main
instead of pull --rebase.

2650-2666: Merge cap is effective, but break also defers no-PR deploy transitions.

The break at line 2665 exits the entire loop, which stops processing all remaining eligible tasks — including complete tasks with no_pr/task_only that would otherwise get a cheap direct-to-deployed transition at line 2673 (no actual git merge involved). Those deferred deploys don't contribute to the cascading rebase problem the cap is designed to prevent.

This is low-impact (they'll process next pulse in ~5 min), but if you want to preserve throughput for no-PR tasks while still capping real merges, a continue with a flag could skip only merge-eligible tasks.

Optional: skip only merge-eligible tasks instead of breaking
-		# t1183 Bug 3: Stop processing if we've hit the merge cap.
-		# Remaining tasks will be picked up in the next pulse cycle.
-		if [[ "$merged_count" -ge "$max_merges_per_pulse" ]]; then
-			log_info "  Phase 3: reached max merges per pulse ($max_merges_per_pulse), deferring rest to next cycle"
-			break
-		fi
+		# t1183 Bug 3: Once merge cap is reached, only allow no-PR fast-path deploys.
+		# Real merges are deferred to the next pulse cycle.
+		local merge_cap_reached=false
+		if [[ "$merged_count" -ge "$max_merges_per_pulse" ]]; then
+			merge_cap_reached=true
+		fi

Then guard the cmd_pr_lifecycle call (around line 2697) with:

if [[ "$merge_cap_reached" == "true" ]]; then
    log_info "  $tid: deferred (merge cap reached)"
    deferred_count=$((deferred_count + 1))
    continue
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 2650 - 2666, The loop
currently uses break when merged_count >= max_merges_per_pulse which prevents
any further task processing (including no-PR/complete deploy-only tasks); change
that to set a flag (e.g. merge_cap_reached="true") and continue so the loop
keeps iterating for non-merge work; then, before attempting any merge work
(identify the merge path around cmd_pr_lifecycle and where sibling/parent checks
use merged_parents and merged_count), guard the merge branch with if [[
"$merge_cap_reached" == "true" ]] then log_info "  $tid: deferred (merge cap
reached)"; deferred_count=$((deferred_count+1)); continue; fi so only
merge-eligible tasks are skipped while no-PR/complete transitions still run.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 739-741: The condition that routes tasks to pr_review should
exclude the "verified_complete" sentinel and also preserve the audit flags
passed in Phase 0.7; update the if-check that uses stuck_pr_url to add a check
that "$stuck_pr_url" != "verified_complete", and change the cmd_transition call
(the invocation of cmd_transition "$stuck_id" "pr_review") to include the same
--pr-url "$stuck_pr_url" and --error "$stuck_error" arguments (and keep the
2>>"$SUPERVISOR_LOG" || true redirection) so the transition and log_info call
maintain parity with Phase 0.7 behavior and existing sentinel handling.
- Around line 749-753: The Phase 1c branch double-increments task retries:
cmd_transition "$stuck_id" "retrying" already causes state.sh to increment
retries, and the following manual db UPDATE (UPDATE tasks SET retries = retries
+ 1 ...) adds a second increment; remove the redundant manual increment line and
keep the cmd_transition call (or replace cmd_transition with an explicit set to
the intended absolute value if you truly need to control retries), i.e. delete
or comment out the db "$SUPERVISOR_DB" "UPDATE tasks SET retries = retries +
1..." statement so retries increase only once when cmd_transition is invoked.

---

Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 2707-2716: The current command git -C "$merge_repo" pull --rebase
origin main can rebase the checked-out branch or fail if not on main; replace it
with a branch-agnostic fetch: run git -C "$merge_repo" fetch origin main
2>>"$SUPERVISOR_LOG" || log_warn "  $tid: failed to fetch main in $merge_repo
after merge (non-blocking)". Update the block using merge_repo, SUPERVISOR_LOG
and log_warn so it calls db/sql_escape/tid unchanged but uses git ... fetch
origin main instead of pull --rebase.
- Around line 2650-2666: The loop currently uses break when merged_count >=
max_merges_per_pulse which prevents any further task processing (including
no-PR/complete deploy-only tasks); change that to set a flag (e.g.
merge_cap_reached="true") and continue so the loop keeps iterating for non-merge
work; then, before attempting any merge work (identify the merge path around
cmd_pr_lifecycle and where sibling/parent checks use merged_parents and
merged_count), guard the merge branch with if [[ "$merge_cap_reached" == "true"
]] then log_info "  $tid: deferred (merge cap reached)";
deferred_count=$((deferred_count+1)); continue; fi so only merge-eligible tasks
are skipped while no-PR/complete transitions still run.

Comment on lines +739 to +741
if [[ -n "$stuck_pr_url" && "$stuck_pr_url" != "no_pr" && "$stuck_pr_url" != "task_only" ]]; then
cmd_transition "$stuck_id" "pr_review" 2>>"$SUPERVISOR_LOG" || true
log_info " Phase 1c: $stuck_id → pr_review (has PR: $stuck_pr_url, skipping retry)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, examine the exact lines in pulse.sh at 739-741
echo "=== Lines 739-741 in pulse.sh ==="
sed -n '739,741p' .agents/scripts/supervisor/pulse.sh

echo ""
echo "=== Context around lines 739-741 (735-745) ==="
sed -n '735,745p' .agents/scripts/supervisor/pulse.sh

Repository: marcusquinn/aidevops

Length of output: 1077


🏁 Script executed:

# Search for verified_complete sentinel usage across the codebase
echo "=== All verified_complete references in pulse.sh ==="
rg -n "verified_complete" .agents/scripts/supervisor/pulse.sh -B2 -A2

echo ""
echo "=== Compare with no_pr and task_only usage ==="
rg -n 'no_pr.*task_only|task_only.*no_pr' .agents/scripts/supervisor/pulse.sh -B1 -A1

Repository: marcusquinn/aidevops

Length of output: 2166


🏁 Script executed:

# Find Phase 0.7 and the transition at line 289
echo "=== Phase 0.7 context (around line 289) ==="
sed -n '280,300p' .agents/scripts/supervisor/pulse.sh

echo ""
echo "=== cmd_transition calls with pr_review in pulse.sh ==="
rg -n 'cmd_transition.*pr_review' .agents/scripts/supervisor/pulse.sh -B2 -A2

Repository: marcusquinn/aidevops

Length of output: 3180


🏁 Script executed:

# Check state.sh for cmd_transition definition and flag handling
echo "=== cmd_transition implementation in state.sh ==="
if [ -f ".agents/scripts/supervisor/state.sh" ]; then
  rg -n "cmd_transition" .agents/scripts/supervisor/state.sh | head -5
  # Look for the function definition
  sed -n '1,200p' .agents/scripts/supervisor/state.sh | rg -n "cmd_transition|--pr-url|--error" -B2 -A2
else
  echo "state.sh not found"
fi

Repository: marcusquinn/aidevops

Length of output: 862


🏁 Script executed:

# Verify all sentinel checks in conditional statements
echo "=== All conditional checks for pr_url sentinels ==="
rg -n 'if.*\$\{.*pr.*url.*!=' .agents/scripts/supervisor/pulse.sh -B1 -A1 | head -60

Repository: marcusquinn/aidevops

Length of output: 115


Add missing verified_complete sentinel check and audit trail parity with Phase 0.7.

Line 739 filters no_pr and task_only but omits verified_complete, which is explicitly checked in Phase 3b2 (line 960), Phase 3 (line 2673), and other pr_url sentinel validations (line 1144). If a task has pr_url == "verified_complete", the condition evaluates true and incorrectly routes to pr_review.

Additionally, Phase 0.7 (line 289) passes --pr-url and --error flags to cmd_transition for the same type of transition; Phase 1c (line 740) omits both, losing audit trail and reducing consistency.

Proposed fix
-			if [[ -n "$stuck_pr_url" && "$stuck_pr_url" != "no_pr" && "$stuck_pr_url" != "task_only" ]]; then
-				cmd_transition "$stuck_id" "pr_review" 2>>"$SUPERVISOR_LOG" || true
+			if [[ -n "$stuck_pr_url" && "$stuck_pr_url" != "no_pr" && "$stuck_pr_url" != "task_only" && "$stuck_pr_url" != "verified_complete" ]]; then
+				cmd_transition "$stuck_id" "pr_review" --pr-url "$stuck_pr_url" --error "Stuck evaluating recovery (Phase 1c/t1183): worker dead, PR exists" 2>>"$SUPERVISOR_LOG" || true
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if [[ -n "$stuck_pr_url" && "$stuck_pr_url" != "no_pr" && "$stuck_pr_url" != "task_only" ]]; then
cmd_transition "$stuck_id" "pr_review" 2>>"$SUPERVISOR_LOG" || true
log_info " Phase 1c: $stuck_id → pr_review (has PR: $stuck_pr_url, skipping retry)"
if [[ -n "$stuck_pr_url" && "$stuck_pr_url" != "no_pr" && "$stuck_pr_url" != "task_only" && "$stuck_pr_url" != "verified_complete" ]]; then
cmd_transition "$stuck_id" "pr_review" --pr-url "$stuck_pr_url" --error "Stuck evaluating recovery (Phase 1c/t1183): worker dead, PR exists" 2>>"$SUPERVISOR_LOG" || true
log_info " Phase 1c: $stuck_id → pr_review (has PR: $stuck_pr_url, skipping retry)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 739 - 741, The condition
that routes tasks to pr_review should exclude the "verified_complete" sentinel
and also preserve the audit flags passed in Phase 0.7; update the if-check that
uses stuck_pr_url to add a check that "$stuck_pr_url" != "verified_complete",
and change the cmd_transition call (the invocation of cmd_transition "$stuck_id"
"pr_review") to include the same --pr-url "$stuck_pr_url" and --error
"$stuck_error" arguments (and keep the 2>>"$SUPERVISOR_LOG" || true redirection)
so the transition and log_info call maintain parity with Phase 0.7 behavior and
existing sentinel handling.

Comment on lines +749 to +753
if [[ "$stuck_retries" -lt "$stuck_max_retries" ]]; then
# Transition to retrying so it gets re-dispatched
cmd_transition "$stuck_id" "retrying" --error "Auto-reaped: stuck in evaluating >10min with dead worker (Phase 1c)" 2>>"$SUPERVISOR_LOG" || true
db "$SUPERVISOR_DB" "UPDATE tasks SET retries = retries + 1, updated_at = strftime('%Y-%m-%dT%H:%M:%SZ', 'now') WHERE id = '$(sql_escape "$stuck_id")';" 2>/dev/null || true
log_info " Phase 1c: $stuck_id → retrying (retry $((stuck_retries + 1))/$stuck_max_retries)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify cmd_transition auto-increment behavior and identify redundant increments

echo "=== Checking state.sh for 'retrying' state auto-increment logic ==="
fd state.sh --type f | head -5

Repository: marcusquinn/aidevops

Length of output: 167


🏁 Script executed:

#!/bin/bash
# Search for retrying state increment pattern
rg -n "retrying" --type=sh -B3 -A3 | head -80

Repository: marcusquinn/aidevops

Length of output: 5627


🏁 Script executed:

#!/bin/bash
# Find all retry increment patterns in pulse.sh
fd pulse.sh --type f

Repository: marcusquinn/aidevops

Length of output: 137


🏁 Script executed:

#!/bin/bash
# Get state.sh content around retrying logic
cat -n .agents/scripts/supervisor/state.sh | grep -A10 -B10 "retrying"

Repository: marcusquinn/aidevops

Length of output: 5418


🏁 Script executed:

#!/bin/bash
# Get pulse.sh content around lines 749-753
sed -n '740,760p' .agents/scripts/supervisor/pulse.sh | cat -n

Repository: marcusquinn/aidevops

Length of output: 1597


🏁 Script executed:

#!/bin/bash
# Find cmd_transition definition and check its behavior with retrying
rg -n "cmd_transition" .agents/scripts/supervisor/ --type=sh -A15 | head -100

Repository: marcusquinn/aidevops

Length of output: 7774


🏁 Script executed:

#!/bin/bash
# Verify Phase 0.7 pattern (lines ~290-310)
sed -n '280,320p' .agents/scripts/supervisor/todo-sync.sh | cat -n

Repository: marcusquinn/aidevops

Length of output: 1730


🏁 Script executed:

#!/bin/bash
# Search for other manual retry increments in pulse.sh to see if this is a pattern
rg -n "retries = retries \+ 1|retries \+= 1" .agents/scripts/supervisor/pulse.sh -C3

Repository: marcusquinn/aidevops

Length of output: 689


🏁 Script executed:

#!/bin/bash
# Search for Phase 0.7 in pulse.sh to verify the comparison
sed -n '290,310p' .agents/scripts/supervisor/pulse.sh | cat -n

Repository: marcusquinn/aidevops

Length of output: 1583


🏁 Script executed:

#!/bin/bash
# Look for all Phase patterns and retry logic in pulse.sh
rg -n "Phase 0\.|Phase 1" .agents/scripts/supervisor/pulse.sh -B2 -A8 | head -150

Repository: marcusquinn/aidevops

Length of output: 8105


🏁 Script executed:

#!/bin/bash
# Find cmd_transition function definition to verify it executes the update_parts
sed -n '1,300p' .agents/scripts/supervisor/state.sh | grep -A30 "^cmd_transition()"

Repository: marcusquinn/aidevops

Length of output: 648


🏁 Script executed:

#!/bin/bash
# Check if db function is being called to execute the UPDATE in cmd_transition
sed -n '1,250p' .agents/scripts/supervisor/state.sh | tail -100

Repository: marcusquinn/aidevops

Length of output: 3714


Double retry increment — each Phase 1c recovery consumes two retries instead of one.

When cmd_transition "$stuck_id" "retrying" executes (line 751), state.sh appends retries = retries + 1 to the UPDATE statement (lines 182–184). The subsequent manual UPDATE tasks SET retries = retries + 1 on line 752 increments it a second time. Tasks exhaust their retry budget at 2× the intended rate.

Compare with Phase 0.7 (lines 298–301), which correctly sets an absolute value (retries = $new_retries) before calling cmd_transition to "queued" — a state that does not auto-increment retries in cmd_transition.

Proposed fix: remove the redundant increment
 			if [[ "$stuck_retries" -lt "$stuck_max_retries" ]]; then
 				# Transition to retrying so it gets re-dispatched
 				cmd_transition "$stuck_id" "retrying" --error "Auto-reaped: stuck in evaluating >10min with dead worker (Phase 1c)" 2>>"$SUPERVISOR_LOG" || true
-				db "$SUPERVISOR_DB" "UPDATE tasks SET retries = retries + 1, updated_at = strftime('%Y-%m-%dT%H:%M:%SZ', 'now') WHERE id = '$(sql_escape "$stuck_id")';" 2>/dev/null || true
 				log_info "  Phase 1c: $stuck_id → retrying (retry $((stuck_retries + 1))/$stuck_max_retries)"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if [[ "$stuck_retries" -lt "$stuck_max_retries" ]]; then
# Transition to retrying so it gets re-dispatched
cmd_transition "$stuck_id" "retrying" --error "Auto-reaped: stuck in evaluating >10min with dead worker (Phase 1c)" 2>>"$SUPERVISOR_LOG" || true
db "$SUPERVISOR_DB" "UPDATE tasks SET retries = retries + 1, updated_at = strftime('%Y-%m-%dT%H:%M:%SZ', 'now') WHERE id = '$(sql_escape "$stuck_id")';" 2>/dev/null || true
log_info " Phase 1c: $stuck_id → retrying (retry $((stuck_retries + 1))/$stuck_max_retries)"
if [[ "$stuck_retries" -lt "$stuck_max_retries" ]]; then
# Transition to retrying so it gets re-dispatched
cmd_transition "$stuck_id" "retrying" --error "Auto-reaped: stuck in evaluating >10min with dead worker (Phase 1c)" 2>>"$SUPERVISOR_LOG" || true
log_info " Phase 1c: $stuck_id → retrying (retry $((stuck_retries + 1))/$stuck_max_retries)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 749 - 753, The Phase 1c
branch double-increments task retries: cmd_transition "$stuck_id" "retrying"
already causes state.sh to increment retries, and the following manual db UPDATE
(UPDATE tasks SET retries = retries + 1 ...) adds a second increment; remove the
redundant manual increment line and keep the cmd_transition call (or replace
cmd_transition with an explicit set to the intended absolute value if you truly
need to control retries), i.e. delete or comment out the db "$SUPERVISOR_DB"
"UPDATE tasks SET retries = retries + 1..." statement so retries increase only
once when cmd_transition is invoked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

t1183: Fix 3 supervisor pipeline bugs: Phase 1c PR-aware reaping, atomic evaluation transitions, Phase 3 batch merge

1 participant