Merged
118 changes: 118 additions & 0 deletions .agents/scripts/supervisor/dispatch.sh
@@ -1503,6 +1503,109 @@ reset_failure_dedup_state() {
return 0
}

#######################################
# Pre-dispatch CLI health check (t1113)
#
# Verifies the AI CLI binary exists, is executable, and can produce output
# before spawning a worker. This prevents wasting retries on environment
# issues where the CLI was invoked but never produced output (the
# "worker_never_started:no_sentinel" failure pattern).
#
# Strategy:
# 1. Check binary exists in PATH (command -v)
# 2. Run a lightweight version/help check to verify it can execute
# 3. Cache result for the pulse duration (pulse-level flag)
#
# $1: ai_cli - the CLI binary name (e.g., "opencode", "claude")
#
# Exit codes:
# 0 = CLI healthy, proceed with dispatch
# 1 = CLI not found or not executable
#
# Outputs: diagnostic message on failure (for dispatch log)
#######################################
check_cli_health() {
local ai_cli="$1"

# Pulse-level fast path: if CLI was already verified in this pulse, skip
if [[ -n "${_PULSE_CLI_VERIFIED:-}" ]]; then
log_verbose "CLI health: pulse-verified OK (skipping check)"
return 0
fi

# File-based cache: avoid re-checking within 5 minutes
local cache_dir="$SUPERVISOR_DIR/health"
mkdir -p "$cache_dir"
local cli_cache_file="$cache_dir/cli-${ai_cli}"
if [[ -f "$cli_cache_file" ]]; then
local cached_at
cached_at=$(cat "$cli_cache_file" 2>/dev/null || echo "0")
local now
now=$(date +%s)
local age=$((now - cached_at))
Comment on lines +1541 to +1545


high

The arithmetic expansion at line 1545 is unsafe: if the cache file exists but is empty, cat succeeds with empty output, so the || echo "0" fallback never runs and cached_at ends up as an empty string, which makes the arithmetic expansion a syntax error. Additionally, per the repository style guide (line 11), declarations and assignments for expansions that can fail should be separated to preserve exit code safety.

Suggested change
 	local cached_at
 	cached_at=$(cat "$cli_cache_file" 2>/dev/null || echo "0")
 	local now
 	now=$(date +%s)
-	local age=$((now - cached_at))
+	local age
+	age=$((now - ${cached_at:-0}))
References
  1. Use local var="$1" pattern in functions (declare and assign separately for exit code safety) (link)
  2. In shell scripts with set -e, var=$(cat file || echo 0) is not safe for arithmetic if the file can be empty. cat on an empty file succeeds, making var an empty string and causing arithmetic expansion to fail. Use parameter expansion with a default value (e.g., $((... - ${var:-0}))) to handle empty strings robustly.
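The pitfall in reference 2 is easy to reproduce in isolation. A minimal standalone sketch (the temp file stands in for the real cache file; nothing here is the repository's code):

```shell
# An empty cache file: `cat` succeeds with empty output, so the
# `|| echo "0"` fallback never fires and cached_at ends up empty.
cache_file=$(mktemp)

cached_at=$(cat "$cache_file" 2>/dev/null || echo "0")
# cached_at is "" here; a bare $((now - cached_at)) would be a syntax
# error and, under `set -e`, would abort the script.

now=$(date +%s)
# Defaulting the empty string at expansion time keeps the math well-formed:
age=$((now - ${cached_at:-0}))

echo "cached_at='${cached_at}' age=${age}"
rm -f "$cache_file"
```

Writing the timestamp with `date +%s > file` and reading it back through `${var:-0}` keeps both the empty-file and missing-file cases on the safe path.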

if [[ "$age" -lt 300 ]]; then
log_verbose "CLI health: cached OK ($age seconds ago)"
_PULSE_CLI_VERIFIED="true"
return 0
fi
fi

# Check 1: binary exists in PATH
if ! command -v "$ai_cli" &>/dev/null; then
log_error "CLI health check FAILED: '$ai_cli' not found in PATH"
log_error "PATH=$PATH"
echo "cli_not_found:${ai_cli}"
return 1
fi

# Check 2: binary is executable and can produce version output
local version_output=""
local version_exit=1


high

Initializing version_exit to 1 causes a logic error in the health check. When the version command succeeds (exit 0), the || version_exit=$? part is skipped, leaving version_exit at its initial value of 1. This causes the first condition of the health check at line 1589 ([[ "$version_exit" -eq 0 ]]) to fail even on success, forcing the check to rely on the more lenient output-based condition. It should be initialized to 0.

Suggested change
-	local version_exit=1
+	local version_exit=0


# Use timeout to prevent hanging on broken installations
local timeout_cmd=""
if command -v gtimeout &>/dev/null; then
timeout_cmd="gtimeout"
elif command -v timeout &>/dev/null; then
timeout_cmd="timeout"
fi

if [[ "$ai_cli" == "opencode" ]]; then
if [[ -n "$timeout_cmd" ]]; then
version_output=$("$timeout_cmd" 10 "$ai_cli" version 2>&1) || version_exit=$?
else
version_output=$("$ai_cli" version 2>&1) || version_exit=$?
fi
else
# claude CLI
if [[ -n "$timeout_cmd" ]]; then
version_output=$("$timeout_cmd" 10 "$ai_cli" --version 2>&1) || version_exit=$?
else
version_output=$("$ai_cli" --version 2>&1) || version_exit=$?
fi
fi

# If version command succeeded (exit 0) or produced output, CLI is working
if [[ "$version_exit" -eq 0 ]] || [[ -n "$version_output" && "$version_exit" -ne 124 && "$version_exit" -ne 137 ]]; then


medium

The health check logic is currently too lenient. If the version command fails (e.g., exit 1) but still produces some output (like an error message), it will be considered 'healthy' because of the [[ -n "$version_output" ... ]] condition. Given that the goal is to detect broken installations, a non-zero exit code should generally be treated as a failure unless there's a specific reason to trust the output of a failed command.

Suggested change
-	if [[ "$version_exit" -eq 0 ]] || [[ -n "$version_output" && "$version_exit" -ne 124 && "$version_exit" -ne 137 ]]; then
+	if [[ "$version_exit" -eq 0 ]]; then
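The false positive is straightforward to demonstrate with a stand-in for a broken installation (`broken_cli` below is hypothetical, not a real binary):

```shell
# broken_cli mimics a damaged installation: non-zero exit, but it still
# writes an error message, which 2>&1 captures into version_output.
broken_cli() { echo "fatal: cannot load config" >&2; return 1; }

version_exit=0
version_output=$(broken_cli 2>&1) || version_exit=$?

lenient=unhealthy strict=unhealthy
# Lenient check (as merged): any non-timeout output counts as healthy.
if [ "$version_exit" -eq 0 ] || { [ -n "$version_output" ] && [ "$version_exit" -ne 124 ] && [ "$version_exit" -ne 137 ]; }; then
	lenient=healthy
fi
# Strict check (as suggested): only a clean exit counts.
if [ "$version_exit" -eq 0 ]; then
	strict=healthy
fi
echo "lenient=$lenient strict=$strict"   # lenient=healthy strict=unhealthy
```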

# Cache the healthy result
date +%s >"$cli_cache_file" 2>/dev/null || true
_PULSE_CLI_VERIFIED="true"
log_info "CLI health: OK ($ai_cli: ${version_output:0:80})"
return 0
fi
Comment on lines +1562 to +1595

⚠️ Potential issue | 🟡 Minor

version_exit=1 initialization makes the exit_code==0 branch unreachable and enables a false-positive health result.

The || version_exit=$? idiom only assigns when the command exits non-zero (short-circuit evaluation). Because version_exit is initialized to 1, if opencode version / claude --version exits successfully (code 0), version_exit stays at 1 — the "$version_exit" -eq 0 guard on Line 1589 is permanently dead code.

More importantly this creates a false positive: a binary that exits non-zero (broken installation, startup crash) but still emits any output to stderr/stdout (which 2>&1 captures) will satisfy the second condition [[ -n "$version_output" && "$version_exit" -ne 124 && "$version_exit" -ne 137 ]] and be reported as healthy — exactly the class of environment failure this check is designed to catch.

🐛 Proposed fix — separate assignment from capture
-	local version_output=""
-	local version_exit=1
+	local version_output=""
+	local version_exit=0

 	if [[ "$ai_cli" == "opencode" ]]; then
 		if [[ -n "$timeout_cmd" ]]; then
-			version_output=$("$timeout_cmd" 10 "$ai_cli" version 2>&1) || version_exit=$?
+			version_output=$("$timeout_cmd" 10 "$ai_cli" version 2>&1)
+			version_exit=$?
 		else
-			version_output=$("$ai_cli" version 2>&1) || version_exit=$?
+			version_output=$("$ai_cli" version 2>&1)
+			version_exit=$?
 		fi
 	else
 		if [[ -n "$timeout_cmd" ]]; then
-			version_output=$("$timeout_cmd" 10 "$ai_cli" --version 2>&1) || version_exit=$?
+			version_output=$("$timeout_cmd" 10 "$ai_cli" --version 2>&1)
+			version_exit=$?
 		else
-			version_output=$("$ai_cli" --version 2>&1) || version_exit=$?
+			version_output=$("$ai_cli" --version 2>&1)
+			version_exit=$?
 		fi
 	fi

With this change, a clean exit 0 (even with empty output) is treated as healthy via the first condition, and a non-zero exit is correctly treated as a failure unless there is output (which preserves the "output-even-on-non-zero" lenient path intentionally).
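The short-circuit behavior is worth seeing in two lines; a minimal sketch (plain `echo` and `true` stand in for the CLI):

```shell
# `||` only evaluates its right side when the left side fails, so a
# pre-seeded rc is never corrected on success:
rc=1
out=$(echo "v1.2.3") || rc=$?
echo "idiom rc=$rc"      # still 1, despite a successful command

# Capturing $? immediately after the substitution records the real status:
out=$(echo "v1.2.3")
rc=$?
echo "capture rc=$rc"    # 0
```

One caveat: under `set -e`, a plain `var=$(cmd)` aborts the script when `cmd` fails, so scripts that must survive the failure typically guard the assignment, e.g. `out=$(cmd) && rc=0 || rc=$?` with `rc` declared beforehand.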

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/dispatch.sh around lines 1562 - 1595, The
health-check incorrectly initializes version_exit=1 and uses the "cmd ||
version_exit=$?" idiom, causing successful commands to leave version_exit at 1
and producing false positives; change each invocation that currently does
version_output=$(... ) || version_exit=$? to first capture output then
immediately set version_exit=$? (e.g., version_output=$(... 2>&1);
version_exit=$?) for all branches that run the CLI (the blocks referencing
timeout_cmd, ai_cli, and the "opencode" vs "claude CLI" branches), leaving the
rest of the logic (the check using version_exit, version_output, cli_cache_file,
_PULSE_CLI_VERIFIED, and log_info) unchanged.


# Version check failed
if [[ "$version_exit" -eq 124 || "$version_exit" -eq 137 ]]; then
log_error "CLI health check FAILED: '$ai_cli' timed out (10s)"
echo "cli_timeout:${ai_cli}"
else
log_error "CLI health check FAILED: '$ai_cli' exited with code $version_exit"
log_error "Output: ${version_output:0:200}"
echo "cli_error:${ai_cli}:exit_${version_exit}"
fi
return 1
}
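The file-based cache in `check_cli_health` is an instance of a timestamp-TTL pattern that can be sketched standalone (the function and paths below are illustrative, not part of the patch):

```shell
# A cache file holds the epoch seconds of the last successful check;
# it is "fresh" while younger than max_age seconds.
cache_fresh() { # $1=cache_file  $2=max_age_seconds
	[ -f "$1" ] || return 1
	local stamp now
	stamp=$(cat "$1" 2>/dev/null || echo "0")
	now=$(date +%s)
	[ $((now - ${stamp:-0})) -lt "$2" ]
}

f=$(mktemp)
date +%s >"$f"
cache_fresh "$f" 300 && echo "fresh"    # just written, well under 5 minutes
echo "0" >"$f"
cache_fresh "$f" 300 || echo "stale"    # epoch 0 is far older than 5 minutes
rm -f "$f"
```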

#######################################
# Pre-dispatch model health check (t132.3, t233)
# Two-tier probe strategy:
@@ -2354,6 +2457,20 @@ cmd_dispatch() {
local ai_cli
ai_cli=$(resolve_ai_cli) || return 1

# Pre-dispatch CLI health check (t1113): verify the AI CLI binary exists and
# can execute before creating worktrees and spawning workers. This prevents
# the "worker_never_started:no_sentinel" failure pattern where the CLI is
# invoked but never produces output due to environment issues (missing binary,
# broken installation, PATH misconfiguration). Deferring here avoids burning
# retries on environment problems that won't resolve between retry attempts.
local cli_health_exit=0 cli_health_detail=""
cli_health_detail=$(check_cli_health "$ai_cli") || cli_health_exit=$?
if [[ "$cli_health_exit" -ne 0 ]]; then
log_error "CLI health check failed for $task_id ($ai_cli): $cli_health_detail — deferring dispatch"
log_error "Fix: ensure '$ai_cli' is installed and in PATH, then retry"
return 3 # Defer to next pulse (same as provider unavailable)
fi

# Pre-dispatch model availability check (t233 — replaces simple health check)
# Calls model-availability-helper.sh check before spawning workers.
# Distinct exit codes prevent wasted dispatch attempts:
Expand Down Expand Up @@ -2450,6 +2567,7 @@ cmd_dispatch() {
echo "dispatch_type=${verify_mode:+verify}"
echo "verify_reason=${verify_reason:-}"
echo "hung_timeout_seconds=${dispatch_hung_timeout}"
echo "cli_health=ok"
echo "=== END DISPATCH METADATA ==="
echo ""
} >"$log_file" 2>/dev/null || true
24 changes: 17 additions & 7 deletions .agents/scripts/supervisor/evaluate.sh
@@ -666,11 +666,15 @@ link_pr_to_task() {
# five broad categories for pattern tracking and model routing decisions.
#
# Categories:
-# TRANSIENT - recoverable with retry (rate limits, timeouts, backend blips)
-# RESOURCE - infrastructure/environment issue (auth, OOM, disk)
-# LOGIC - task/code problem (merge conflict, test failure, build error)
-# BLOCKED - external dependency (human needed, upstream, missing context)
-# AMBIGUOUS - unclear cause (clean exit, max retries, unknown)
+# TRANSIENT - recoverable with retry (rate limits, timeouts, backend blips)
+# RESOURCE - infrastructure/environment issue (auth, OOM, disk)
+# ENVIRONMENT - dispatch infrastructure failure (t1113: CLI missing, worker never
+# started, log file missing). These are NOT task/code problems —
+# retrying won't help until the environment is fixed. The pulse
+# handles these by deferring re-queue without burning retry count.
+# LOGIC - task/code problem (merge conflict, test failure, build error)
+# BLOCKED - external dependency (human needed, upstream, missing context)
+# AMBIGUOUS - unclear cause (clean exit, max retries, unknown)
#
# $1: outcome_detail (e.g., "rate_limited", "auth_error", "merge_conflict")
#
@@ -691,9 +695,15 @@ classify_failure_mode() {
billing_credits_exhausted | out_of_memory)
echo "RESOURCE"
;;
-merge_conflict | test_fail* | lint_* | build_* | \
-	worker_never_started* | log_file_missing* | log_file_empty | \
+worker_never_started* | log_file_missing* | log_file_empty | \
 	no_log_path_in_db* | dispatch_script_not_executable)
+	# t1113: Reclassified from LOGIC to ENVIRONMENT. These failures indicate
+	# the dispatch infrastructure (CLI binary, worktree, permissions) is broken,
+	# not the task itself. Retrying the task won't help — the environment must
+	# be fixed first. The pulse defers these without burning retry count.
+	echo "ENVIRONMENT"
+	;;
+merge_conflict | test_fail* | lint_* | build_*)
echo "LOGIC"
;;
blocked:* | waiting* | upstream* | missing_context* | \
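The case-pattern routing above can be exercised with a reduced standalone sketch (a subset of the real patterns; the function name is illustrative):

```shell
# Glob patterns in a case statement map an outcome-detail string to a
# failure-mode category, mirroring the structure of classify_failure_mode.
classify_sketch() {
	case "$1" in
		rate_limit* | timeout*) echo "TRANSIENT" ;;
		worker_never_started* | log_file_missing* | log_file_empty) echo "ENVIRONMENT" ;;
		merge_conflict | test_fail* | lint_* | build_*) echo "LOGIC" ;;
		blocked:* | waiting*) echo "BLOCKED" ;;
		*) echo "AMBIGUOUS" ;;
	esac
}

classify_sketch "worker_never_started:no_sentinel"   # → ENVIRONMENT
classify_sketch "merge_conflict"                     # → LOGIC
```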
66 changes: 48 additions & 18 deletions .agents/scripts/supervisor/pulse.sh
@@ -1616,24 +1616,54 @@ cmd_pulse() {
attempt_self_heal "$tid" "blocked" "$outcome_detail" "${batch_id:-}" 2>>"$SUPERVISOR_LOG" || true
;;
failed)
-log_error " $tid: FAILED ($outcome_detail)"
-# Proof-log: failed decision (t218)
-write_proof_log --task "$tid" --event "failed" --stage "evaluate" \
---decision "failed:$outcome_detail" \
---maker "pulse:phase1" 2>/dev/null || true
-cmd_transition "$tid" "failed" --error "$outcome_detail" 2>>"$SUPERVISOR_LOG" || true
-failed_count=$((failed_count + 1))
-# Clean up worker process tree and PID file (t128.7)
-cleanup_worker_processes "$tid"
-# Auto-update TODO.md and send notification (t128.4)
-update_todo_on_blocked "$tid" "FAILED: $outcome_detail" 2>>"$SUPERVISOR_LOG" || true
-send_task_notification "$tid" "failed" "$outcome_detail" 2>>"$SUPERVISOR_LOG" || true
-# Store failure pattern in memory (t128.6)
-store_failure_pattern "$tid" "failed" "$outcome_detail" "$tid_desc" 2>>"$SUPERVISOR_LOG" || true
-# Add failed:model label to GitHub issue (t1010)
-add_model_label "$tid" "failed" "$tid_model" "${tid_repo:-.}" 2>>"$SUPERVISOR_LOG" || true
-# Self-heal: attempt diagnostic subtask (t150)
-attempt_self_heal "$tid" "failed" "$outcome_detail" "${batch_id:-}" 2>>"$SUPERVISOR_LOG" || true
+# t1113: Classify failure mode to distinguish environment issues from
+# task/code problems. Environment failures (worker_never_started,
+# log_file_missing, etc.) are re-queued without burning retry count
+# since the task itself isn't at fault.
+local failed_fmode=""
+failed_fmode=$(classify_failure_mode "$outcome_detail" 2>/dev/null) || failed_fmode="AMBIGUOUS"
+
+if [[ "$failed_fmode" == "ENVIRONMENT" ]]; then
+# t1113: Environment failure — re-queue without incrementing retry count.
+# The CLI/environment was broken, not the task. Burning retries here
+# would exhaust max_retries on infrastructure issues, permanently
+# failing tasks that would succeed once the environment is fixed.
+log_warn " $tid: ENVIRONMENT failure ($outcome_detail) — re-queuing without retry increment (t1113)"
+write_proof_log --task "$tid" --event "environment_failure" --stage "evaluate" \
+--decision "requeue:$outcome_detail" \
+--evidence "failure_mode=ENVIRONMENT,retry_preserved=true" \
+--maker "pulse:phase1:t1113" 2>/dev/null || true
+# Clean up worker process tree and PID file
+cleanup_worker_processes "$tid"
+# Transition back to queued (preserves current retry count)
+cmd_transition "$tid" "queued" --error "environment:$outcome_detail" 2>>"$SUPERVISOR_LOG" || true
+# Store pattern for diagnostics but don't mark as task failure
+store_failure_pattern "$tid" "environment" "$outcome_detail" "$tid_desc" 2>>"$SUPERVISOR_LOG" || true
+# Invalidate CLI health cache so next pulse re-checks
+local cli_cache_dir="${SUPERVISOR_DIR}/health"
+rm -f "$cli_cache_dir"/cli-* 2>/dev/null || true
+_PULSE_CLI_VERIFIED=""
+log_info " $tid: CLI health cache invalidated — next dispatch will re-verify"
+else
+log_error " $tid: FAILED ($outcome_detail)"
+# Proof-log: failed decision (t218)
+write_proof_log --task "$tid" --event "failed" --stage "evaluate" \
+--decision "failed:$outcome_detail" \
+--maker "pulse:phase1" 2>/dev/null || true
+cmd_transition "$tid" "failed" --error "$outcome_detail" 2>>"$SUPERVISOR_LOG" || true
+failed_count=$((failed_count + 1))
+# Clean up worker process tree and PID file (t128.7)
+cleanup_worker_processes "$tid"
+# Auto-update TODO.md and send notification (t128.4)
+update_todo_on_blocked "$tid" "FAILED: $outcome_detail" 2>>"$SUPERVISOR_LOG" || true
+send_task_notification "$tid" "failed" "$outcome_detail" 2>>"$SUPERVISOR_LOG" || true
+# Store failure pattern in memory (t128.6)
+store_failure_pattern "$tid" "failed" "$outcome_detail" "$tid_desc" 2>>"$SUPERVISOR_LOG" || true
+# Add failed:model label to GitHub issue (t1010)
+add_model_label "$tid" "failed" "$tid_model" "${tid_repo:-.}" 2>>"$SUPERVISOR_LOG" || true
+# Self-heal: attempt diagnostic subtask (t150)
+attempt_self_heal "$tid" "failed" "$outcome_detail" "${batch_id:-}" 2>>"$SUPERVISOR_LOG" || true
+fi
;;
esac
done <<<"$running_tasks"
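The retry-preservation branch reduces to a small decision sketch (names are illustrative; the real handler also does transitions, logging, and cache invalidation):

```shell
# ENVIRONMENT failures requeue without consuming a retry; any other
# failure mode burns one, as in the pulse handler above.
retries_left=3
handle_failure() { # $1=failure_mode
	if [ "$1" = "ENVIRONMENT" ]; then
		echo "requeue (retries_left=$retries_left)"
	else
		retries_left=$((retries_left - 1))
		echo "failed (retries_left=$retries_left)"
	fi
}

handle_failure "ENVIRONMENT"   # requeue (retries_left=3)
handle_failure "LOGIC"         # failed (retries_left=2)
```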