Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
135 changes: 135 additions & 0 deletions .agents/scripts/supervisor/pulse.sh
Original file line number Diff line number Diff line change
Expand Up @@ -637,6 +637,141 @@ cmd_pulse() {
done <<<"$blocked_tasks"
fi

# Phase 3.6: Escalate rebase-blocked PRs to opus worker (t1050)
# When auto-rebase fails max_retry_cycles times, dispatch an opus worker to
# manually rebase, resolve conflicts, and merge the PR. Only ONE escalation
# runs at a time (sequential) so each subsequent rebase has a clean base.
local escalation_lock="${SUPERVISOR_DIR}/rebase-escalation.lock"
local escalation_cooldown=300 # 5 minutes between escalations

# Check if an escalation is already running or recently completed
local should_escalate=true
if [[ -f "$escalation_lock" ]]; then
local lock_age
lock_age=$(($(date +%s) - $(stat -f %m "$escalation_lock" 2>/dev/null || stat -c %Y "$escalation_lock" 2>/dev/null || echo "0")))
if [[ "$lock_age" -lt "$escalation_cooldown" ]]; then
should_escalate=false
log_verbose " Phase 3.6: escalation cooldown (${lock_age}s/${escalation_cooldown}s)"
else
# Stale lock — remove it
rm -f "$escalation_lock" 2>/dev/null || true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using 2>/dev/null here violates the repository's style guide (rule 50), which states that stderr should only be redirected to a log file. This pattern is repeated throughout the new code block (e.g., lines 651, 673, 683, 687, 690, 695, 736, 740, and 766). Suppressing errors can make debugging more difficult.

Please redirect to the supervisor log file instead to adhere to the style guide and improve observability.

Suggested change
rm -f "$escalation_lock" 2>/dev/null || true
rm -f "$escalation_lock" 2>>"$SUPERVISOR_LOG" || true
References
  1. The style guide specifies that 2>/dev/null is only acceptable when redirecting to log files, not for blanket suppression of errors. This change violates that rule. (link)

fi
fi

if [[ "$should_escalate" == "true" ]]; then
# Find ONE task that has exhausted auto-rebase retries
local escalation_candidate
escalation_candidate=$(db "$SUPERVISOR_DB" "
SELECT t.id, t.repo, t.pr_url, t.branch, t.rebase_attempts
FROM tasks t
WHERE t.status = 'blocked'
AND t.error LIKE '%Merge conflict%auto-rebase failed%'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Relying on a specific error string with LIKE can be brittle. If the error message from the rebase process changes in the future, this query will fail to find candidates for escalation.

For better long-term maintainability, consider adding a dedicated column (e.g., rebase_failure_reason) to the tasks table to store a structured error code instead of relying on string matching in the error text.

AND t.rebase_attempts >= $max_retry_cycles
AND t.pr_url IS NOT NULL AND t.pr_url != '' AND t.pr_url != 'no_pr'
ORDER BY t.rebase_attempts ASC, t.id ASC
LIMIT 1;
" 2>/dev/null || echo "")
Comment on lines +661 to +673
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Bug: $max_retry_cycles is unset when no Phase 3.5 blocked tasks exist.

max_retry_cycles is declared local at line 597 inside the Phase 3.5 while loop body, which only executes when blocked_tasks is non-empty. If no merge-conflict-blocked tasks exist (the if at line 592 is false), $max_retry_cycles is never assigned. The SQL at line 669 then expands to AND t.rebase_attempts >= — a SQLite syntax error, silently swallowed by || echo "".

This means Phase 3.6 escalation never fires when Phase 3.5 has nothing to process, which is precisely when escalation candidates (retries already exhausted) are most likely to exist.

🐛 Proposed fix: define the constant at Phase 3.5/3.6 shared scope
+	# Shared constant for Phase 3.5 and 3.6
+	local max_retry_cycles=3
+
 	# Phase 3.5: Auto-retry blocked merge-conflict tasks (t1029)
 	# When a task is blocked with "Merge conflict — auto-rebase failed", periodically
 	# re-attempt the rebase after main advances. Other PRs merging often resolve conflicts.
 	local blocked_tasks

And remove the duplicate declaration inside the loop:

-			# Cap at 3 total retry cycles to prevent infinite loops
-			local max_retry_cycles=3
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/pulse.sh around lines 661 - 673, The bug is that
max_retry_cycles is declared only inside the Phase 3.5 while loop (local at the
Phase 3.5 scope) so it can be unset when Phase 3.5 doesn't run, breaking the
Phase 3.6 SQL that uses it; fix by defining max_retry_cycles in the shared scope
before Phase 3.5/3.6 (e.g., immediately above the Phase 3.5 loop) and remove the
duplicate local declaration inside the while loop so both Phase 3.5 logic and
the escalation SQL (the db query that checks AND t.rebase_attempts >=
$max_retry_cycles) always sees a defined value.


if [[ -n "$escalation_candidate" ]]; then
local esc_id esc_repo esc_pr esc_branch esc_attempts
IFS='|' read -r esc_id esc_repo esc_pr esc_branch esc_attempts <<<"$escalation_candidate"

if [[ -n "$esc_id" ]]; then
log_info " Phase 3.6: escalating $esc_id to opus worker (rebase_attempts=$esc_attempts, pr=$esc_pr)"

# Create lock file
date +%s >"$escalation_lock" 2>/dev/null || true

# Resolve AI CLI
local esc_ai_cli
esc_ai_cli=$(resolve_ai_cli 2>/dev/null || echo "")
if [[ -z "$esc_ai_cli" ]]; then
log_warn " Phase 3.6: no AI CLI available for escalation"
rm -f "$escalation_lock" 2>/dev/null || true
else
# Find the worktree path
local esc_worktree=""
local esc_wt_row
esc_wt_row=$(db "$SUPERVISOR_DB" "SELECT worktree FROM tasks WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || echo "")
if [[ -n "$esc_wt_row" && -d "$esc_wt_row" ]]; then
esc_worktree="$esc_wt_row"
fi

# Build the escalation prompt
local esc_prompt="You are resolving a merge conflict that automated tools could not handle.

TASK: $esc_id
BRANCH: $esc_branch
PR: $esc_pr
REPO: $esc_repo
WORKTREE: ${esc_worktree:-$esc_repo}

STEPS:
1. cd to the worktree (or repo if no worktree)
2. Run: git fetch origin main
3. Abort any in-progress rebase: git rebase --abort (ignore errors)
4. Clean any dirty state: git stash push -m 'pre-escalation' (ignore errors)
5. Run: git rebase origin/main
6. If conflicts occur, resolve ALL of them:
- Read each conflicting file
- Understand both sides' intent
- Merge intelligently (keep both sides' changes where possible)
- Remove ALL conflict markers
- git add each resolved file
- git rebase --continue
- Repeat for each commit in the rebase
7. After rebase completes: git push --force-with-lease origin $esc_branch
8. Verify the PR is no longer in conflict: gh pr view $esc_pr --json mergeStateStatus
9. If CI passes, merge: gh pr merge $esc_pr --squash
10. Output ONLY: 'ESCALATION_MERGED: $esc_id' if merged, 'ESCALATION_REBASED: $esc_id' if rebased but not merged, or 'ESCALATION_FAILED: reason' if failed

RULES:
- Do NOT modify the intent of any code — only resolve conflicts
- Prefer the feature branch for new functionality, main for structural changes
- If a file has been deleted on main but modified on the branch, keep the branch version
- Do NOT create new commits beyond what the rebase produces"
Comment on lines +700 to +732
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Read the file around the specified lines
wc -l .agents/scripts/supervisor/pulse.sh

Repository: marcusquinn/aidevops

Length of output: 104


🏁 Script executed:

# Get context around lines 700-732
sed -n '690,740p' .agents/scripts/supervisor/pulse.sh

Repository: marcusquinn/aidevops

Length of output: 2190


🏁 Script executed:

# Check if ESCALATION_MERGED/REBASED/FAILED markers are referenced anywhere
rg -n 'ESCALATION_MERGED|ESCALATION_REBASED|ESCALATION_FAILED' --type=sh

Repository: marcusquinn/aidevops

Length of output: 258


🏁 Script executed:

# Check how esc_branch, esc_pr, esc_repo are sourced and quoted
rg -n 'esc_branch\|esc_pr\|esc_repo' --type=sh -B2 -A2 | head -100

Repository: marcusquinn/aidevops

Length of output: 46


🏁 Script executed:

# Look for evaluate_worker function
rg -n 'evaluate_worker' --type=sh -A20

Repository: marcusquinn/aidevops

Length of output: 20028


🏁 Script executed:

# Check if there's any CI status checking before merge
rg -n 'mergeStateStatus|ciStatus|check.*pass' --type=sh

Repository: marcusquinn/aidevops

Length of output: 6926


🏁 Script executed:

# Check where the escalation worker is dispatched and what happens with its output
rg -n 'escalation' .agents/scripts/supervisor/pulse.sh -A5 -B5 | grep -E '(dispatch|worker|eval|outcome|log|ESCALATION)' | head -60

Repository: marcusquinn/aidevops

Length of output: 1124


🏁 Script executed:

# Look for where the escalation process is expected to complete or be handled
rg -n 'esc_log_dir|esc_model|dispatch.*esc' .agents/scripts/supervisor/pulse.sh -A10

Repository: marcusquinn/aidevops

Length of output: 2570


🏁 Script executed:

# Check if there's a comment about handling escalation outcomes
rg -n 'escalation.*outcome|ESCALATION.*handling|phase.*escalation' .agents/scripts/supervisor/ -i

Repository: marcusquinn/aidevops

Length of output: 296


🏁 Script executed:

# Verify variable quoting in actual shell commands within the prompt  
sed -n '710,735p' .agents/scripts/supervisor/pulse.sh | grep -E 'git|gh|branch|pr|repo'

Repository: marcusquinn/aidevops

Length of output: 817


🏁 Script executed:

# Look at the full escalation section to see variable sourcing
sed -n '640,750p' .agents/scripts/supervisor/pulse.sh

Repository: marcusquinn/aidevops

Length of output: 4709


🏁 Script executed:

# Check where esc_branch, esc_pr, esc_repo are first set
rg -n 'esc_branch=|esc_pr=|esc_repo=' .agents/scripts/supervisor/pulse.sh

Repository: marcusquinn/aidevops

Length of output: 46


🏁 Script executed:

# Check if escalation outcome is parsed anywhere after dispatch
rg -n 'esc_log_file|escalation.*log' .agents/scripts/supervisor/pulse.sh -A5

Repository: marcusquinn/aidevops

Length of output: 750


🏁 Script executed:

# Look for any subsequent handling of escalated tasks
rg -n 'escalation_candidate|esc_id' .agents/scripts/supervisor/pulse.sh | tail -20

Repository: marcusquinn/aidevops

Length of output: 1227


Escalation outcome markers are defined in the prompt but never parsed by the supervisor.

The prompt (line 726) instructs the worker to output ESCALATION_MERGED: $esc_id, ESCALATION_REBASED: $esc_id, or ESCALATION_FAILED: reason, but the supervisor never reads the escalation log file or acts on these markers. After dispatch, the worker runs in the background and the outcome is discarded — the task stays in running state until a timeout or health check transitions it. Phase 1 evaluation will need to handle outcome parsing.

Also: $esc_branch, $esc_pr, and $esc_repo are interpolated unquoted into the prompt text (steps 7–9). Low risk since they originate from your DB, but defensive quoting would harden it. Additionally, the prompt relies on the worker to verify CI passes before merging with no supervisor-side validation — gh pr merge will reject if required checks fail, but explicit pre-merge CI validation in the supervisor would be safer.

🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/pulse.sh around lines 700 - 732, The escalation
prompt stored in esc_prompt instructs the worker to emit outcome markers
(ESCALATION_MERGED, ESCALATION_REBASED, ESCALATION_FAILED) but the supervisor
never reads or parses that worker output; update the supervisor to capture and
parse the worker's escalation log (or wait for the worker to finish) and detect
those markers to transition the task state accordingly, mapping
ESCALATION_MERGED/REBASED to success states and ESCALATION_FAILED to failure
with the provided reason; also harden esc_prompt by quoting/interpolating
$esc_branch, $esc_pr, and $esc_repo in the prompt text to avoid
word-splitting/injection, and add an explicit supervisor-side CI check (e.g.,
call gh pr view --json checkSuites/commit/status to ensure required checks
passed) before performing gh pr merge so merges are only attempted when CI is
green.


# Resolve model — use opus for complex conflict resolution
local esc_model
esc_model=$(resolve_model "opus" "$esc_ai_cli" 2>/dev/null || echo "")

# Dispatch the worker
local esc_log_dir="${SUPERVISOR_DIR}/logs"
mkdir -p "$esc_log_dir" 2>/dev/null || true
local esc_log_file
esc_log_file="${esc_log_dir}/escalation-${esc_id}-$(date +%Y%m%d-%H%M%S).log"

local esc_workdir="${esc_worktree:-$esc_repo}"
if [[ "$esc_ai_cli" == "opencode" ]]; then
(cd "$esc_workdir" && $esc_ai_cli run \
${esc_model:+--model "$esc_model"} \
--format json \
--title "escalation-rebase-${esc_id}" \
"$esc_prompt" \
>"$esc_log_file" 2>&1) &
local esc_pid=$!
else
(cd "$esc_workdir" && $esc_ai_cli -p "$esc_prompt" \
${esc_model:+--model "$esc_model"} \
>"$esc_log_file" 2>&1) &
local esc_pid=$!
fi

# Record the escalation in the DB
db "$SUPERVISOR_DB" "UPDATE tasks SET
status = 'running',
error = 'Escalation: opus rebase worker (PID $esc_pid)',
worker_pid = $esc_pid,
updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true
Comment on lines +760 to +766
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Status update bypasses cmd_transition — GitHub issue labels won't sync.

The PR description states "Uses cmd_transition so GitHub issue labels sync automatically," but line 761–766 performs a raw UPDATE tasks SET status = 'running' instead of calling cmd_transition. This skips label synchronization, state_log recording, and any other side effects that cmd_transition provides.

♻️ Proposed fix: use cmd_transition
-				db "$SUPERVISOR_DB" "UPDATE tasks SET
-					status = 'running',
-					error = 'Escalation: opus rebase worker (PID $esc_pid)',
-					worker_pid = $esc_pid,
-					updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
-				WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true
+				# Transition via cmd_transition for label sync + state_log
+				cmd_transition "$esc_id" "running" --error "Escalation: opus rebase worker (PID $esc_pid)" 2>>"$SUPERVISOR_LOG" || true
+				# Update worker_pid separately (cmd_transition doesn't handle this field)
+				db "$SUPERVISOR_DB" "UPDATE tasks SET
+					worker_pid = $esc_pid,
+					log_file = '$(sql_escape "$esc_log_file")'
+				WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/pulse.sh around lines 760 - 766, The block that
directly runs db "$SUPERVISOR_DB" "UPDATE tasks SET status = 'running' ..."
bypasses cmd_transition and therefore skips GitHub label sync and state_log;
replace this raw UPDATE with a call to cmd_transition so the task with id
"$esc_id" is transitioned to 'running' and the worker_pid is set to $esc_pid
(use the same variables esc_id and esc_pid and ensure any returned errors are
handled similarly to the original db call), preserving the updated_at behavior
if cmd_transition doesn't set it automatically.


log_success " Phase 3.6: dispatched opus worker PID $esc_pid for $esc_id"
send_task_notification "$esc_id" "escalated" "Opus rebase worker dispatched (PID $esc_pid)" 2>>"$SUPERVISOR_LOG" || true
Comment on lines +752 to +769
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Bug: No PID file written — next pulse will orphan-kill the escalation worker.

The background worker PID is captured (esc_pid=$!) and recorded in the DB (worker_pid), but no PID file is created under $SUPERVISOR_DIR/pids/. Phase 1 (line 147–156) and Phase 4b (line 866–883) both rely on PID files to track live workers. Without one:

  • Phase 4b will immediately flag this task as a "DB orphan" (running status, no PID file) and force-fail it.
  • The opus worker gets killed or ignored before it can finish the rebase.

Additionally, the log_file column is not updated in the DB, so evaluation won't find the worker's output.

🐛 Proposed fix: write PID file and update log_file
 					local esc_pid=$!
 				fi

+				# Write PID file so Phase 1/4 can track the worker
+				mkdir -p "$SUPERVISOR_DIR/pids" 2>/dev/null || true
+				echo "$esc_pid" > "$SUPERVISOR_DIR/pids/${esc_id}.pid"
+
 				# Record the escalation in the DB
 				db "$SUPERVISOR_DB" "UPDATE tasks SET
 					status = 'running',
 					error = 'Escalation: opus rebase worker (PID $esc_pid)',
 					worker_pid = $esc_pid,
+					log_file = '$(sql_escape "$esc_log_file")',
 					updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
 				WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
local esc_pid=$!
else
(cd "$esc_workdir" && $esc_ai_cli -p "$esc_prompt" \
${esc_model:+--model "$esc_model"} \
>"$esc_log_file" 2>&1) &
local esc_pid=$!
fi
# Record the escalation in the DB
db "$SUPERVISOR_DB" "UPDATE tasks SET
status = 'running',
error = 'Escalation: opus rebase worker (PID $esc_pid)',
worker_pid = $esc_pid,
updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true
log_success " Phase 3.6: dispatched opus worker PID $esc_pid for $esc_id"
send_task_notification "$esc_id" "escalated" "Opus rebase worker dispatched (PID $esc_pid)" 2>>"$SUPERVISOR_LOG" || true
local esc_pid=$!
else
(cd "$esc_workdir" && $esc_ai_cli -p "$esc_prompt" \
${esc_model:+--model "$esc_model"} \
>"$esc_log_file" 2>&1) &
local esc_pid=$!
fi
# Write PID file so Phase 1/4 can track the worker
mkdir -p "$SUPERVISOR_DIR/pids" 2>/dev/null || true
echo "$esc_pid" > "$SUPERVISOR_DIR/pids/${esc_id}.pid"
# Record the escalation in the DB
db "$SUPERVISOR_DB" "UPDATE tasks SET
status = 'running',
error = 'Escalation: opus rebase worker (PID $esc_pid)',
worker_pid = $esc_pid,
log_file = '$(sql_escape "$esc_log_file")',
updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true
log_success " Phase 3.6: dispatched opus worker PID $esc_pid for $esc_id"
send_task_notification "$esc_id" "escalated" "Opus rebase worker dispatched (PID $esc_pid)" 2>>"$SUPERVISOR_LOG" || true
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/pulse.sh around lines 752 - 769, The escalation
branch records esc_pid in the DB but never writes a PID file or updates the
log_file column; add code after capturing esc_pid to create a PID file (e.g.
"$SUPERVISOR_DIR/pids/$esc_pid.pid" containing the PID) so Phase 1/4b can detect
the live worker, and modify the DB UPDATE (the db(...) call that sets
status/error/worker_pid for esc_id) to also set log_file = '$(sql_escape
"$esc_log_file")'; ensure the PID file is created atomically and is associated
with esc_pid/esc_id so cleanup logic can remove it when the worker exits.

fi
fi
fi
fi

# Phase 4: Worker health checks - detect dead, hung, and orphaned workers
local worker_timeout_seconds="${SUPERVISOR_WORKER_TIMEOUT:-3600}" # 1 hour default (t314: restored after merge overwrite)
# Absolute max runtime: kill workers regardless of log activity.
Expand Down
Loading