diff --git a/.agents/scripts/commands/full-loop.md b/.agents/scripts/commands/full-loop.md
index 576ac9aca..45fd96c03 100644
--- a/.agents/scripts/commands/full-loop.md
+++ b/.agents/scripts/commands/full-loop.md
@@ -122,9 +122,69 @@ fi
 | `status:done` | PR merged | sync-on-pr-merge workflow (automated) |
 | `status:verify-failed` | Post-merge verification failed | Worker (contextual) |
 | `status:needs-testing` | Code merged, needs manual testing | Worker (contextual) |
+| `dispatched:{model}` | Worker started on task | **Worker (Step 0.7)** |
 
 Only `status:available`, `status:claimed`, and `status:done` are fully automated. All other transitions are set contextually by the agent that best understands the current state. When setting a new status label, always remove the prior status labels to keep exactly one active.
 
+### Step 0.7: Label Dispatch Model — `dispatched:{model}`
+
+After setting `status:in-progress`, tag the issue with the model running this worker. This provides observability into which model solved each task — essential for cost/quality analysis.
+
+**Detect the current model** from the system prompt or environment. The model name appears in the system prompt as "You are powered by the model named X" or via `ANTHROPIC_MODEL` / `CLAUDE_MODEL` environment variables. Map to a short label:
+
+| Model contains | Label |
+|----------------|-------|
+| `opus` | `dispatched:opus` |
+| `sonnet` | `dispatched:sonnet` |
+| `haiku` | `dispatched:haiku` |
+| unknown | skip labeling |
+
+```bash
+# Detect model — check env vars first, fall back to known model identity.
+# Use ${VAR:-} expansions so this is safe under `set -u` when unset.
+MODEL_SHORT=""
+for VAR in "${ANTHROPIC_MODEL:-}" "${CLAUDE_MODEL:-}"; do
+  case "$VAR" in
+    *opus*) MODEL_SHORT="opus" ;;
+    *sonnet*) MODEL_SHORT="sonnet" ;;
+    *haiku*) MODEL_SHORT="haiku" ;;
+  esac
+  [[ -n "$MODEL_SHORT" ]] && break
+done
+
+# Fallback: the agent knows its own model from the system prompt.
+# If env vars are empty, set MODEL_SHORT based on your model identity.
+# Example: if you are claude-opus-4-6, set MODEL_SHORT="opus" + +if [[ -n "$MODEL_SHORT" && -n "$ISSUE_NUM" && "$ISSUE_NUM" != "null" ]]; then + REPO=$(gh repo view --json nameWithOwner -q .nameWithOwner) + + # Remove stale dispatched:* labels so attribution is unambiguous + for OLD in "dispatched:opus" "dispatched:sonnet" "dispatched:haiku"; do + if [[ "$OLD" != "dispatched:${MODEL_SHORT}" ]]; then + if ! gh issue edit "$ISSUE_NUM" --repo "$REPO" --remove-label "$OLD" 2>/dev/null; then + : # Label not present — expected, not an error + fi + fi + done + + # Create the label if it doesn't exist yet + if ! LABEL_ERR=$(gh label create "dispatched:${MODEL_SHORT}" --repo "$REPO" \ + --description "Task dispatched to ${MODEL_SHORT} model" --color "1D76DB" 2>&1); then + # "already exists" is expected — only warn on other failures + if [[ "$LABEL_ERR" != *"already exists"* ]]; then + echo "[dispatch-label] Warning: label create failed for dispatched:${MODEL_SHORT} on ${REPO}: ${LABEL_ERR}" >&2 + fi + fi + + if ! EDIT_ERR=$(gh issue edit "$ISSUE_NUM" --repo "$REPO" \ + --add-label "dispatched:${MODEL_SHORT}" 2>&1); then + echo "[dispatch-label] Warning: could not add dispatched:${MODEL_SHORT} to issue #${ISSUE_NUM} on ${REPO}: ${EDIT_ERR}" >&2 + fi +fi +``` + +**For interactive sessions** (not headless dispatch): If you are working on a task interactively and the issue exists, apply the label based on your own model identity. This ensures all task work is attributed, not just headless dispatches. + ### Step 1: Auto-Branch Setup The loop automatically handles branch setup when on main/master: diff --git a/.agents/scripts/commands/pulse.md b/.agents/scripts/commands/pulse.md index ce7c1b228..8e8db998e 100644 --- a/.agents/scripts/commands/pulse.md +++ b/.agents/scripts/commands/pulse.md @@ -8,12 +8,12 @@ You are the supervisor pulse. You run every 2 minutes via launchd — **there is **AUTONOMOUS EXECUTION REQUIRED:** You MUST execute every step including dispatching workers. 
NEVER present a summary and stop. NEVER ask "what would you like to action/do/work on?" — there is nobody to answer. Your output is a log of actions you ALREADY TOOK (past tense), not a menu of options. If you finish without having run `opencode run` or `gh pr merge` commands, you have failed. -**TARGET: 6 concurrent workers at all times.** If slots are available and work exists, dispatch workers to fill them. An idle slot is wasted capacity. +**TARGET: fill all available worker slots.** The max worker count is calculated dynamically by `pulse-wrapper.sh` based on available RAM (1 GB per worker, 8 GB reserved for OS + user apps, capped at 8). Read it at the start of each pulse — do not hardcode a number. Your job is simple: 1. Check the circuit breaker. If tripped, exit immediately. -2. Count running workers. If all 6 slots are full, continue to Step 2 (you can still merge ready PRs and observe outcomes). +2. Read the dynamic max workers limit and count running workers. If all slots are full, continue to Step 2 (you can still merge ready PRs and observe outcomes). 3. Fetch open issues and PRs from the managed repos. 4. **Observe outcomes** — check for stuck or failed work and file improvement issues. 5. Pick the highest-value items to fill available worker slots. @@ -22,7 +22,7 @@ Your job is simple: That's it. Minimal state (circuit breaker only). No databases. GitHub is the state DB. -**Max concurrency: 6 workers.** +**Max concurrency is dynamic** — determined by available RAM at pulse time. See Step 1. ## Step 0: Circuit Breaker Check (t1331) @@ -36,16 +36,28 @@ That's it. Minimal state (circuit breaker only). No databases. GitHub is the sta The circuit breaker trips after 3 consecutive task failures (configurable via `SUPERVISOR_CIRCUIT_BREAKER_THRESHOLD`). It auto-resets after 30 minutes or on manual reset (`circuit-breaker-helper.sh reset`). Any task success resets the counter to 0. 
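These semantics — trip after N consecutive failures, reset on any success — can be sketched in isolation. This is a minimal illustration only; the real logic lives in `circuit-breaker-helper.sh`, whose internals are not shown here, and it additionally handles the 30-minute auto-reset and its own state file:

```bash
# Illustrative sketch of the circuit-breaker semantics: trips after
# THRESHOLD consecutive failures; any success resets the counter.
THRESHOLD="${SUPERVISOR_CIRCUIT_BREAKER_THRESHOLD:-3}"
FAILURES=0

record_failure() { FAILURES=$((FAILURES + 1)); }
record_success() { FAILURES=0; }
is_tripped()     { [ "$FAILURES" -ge "$THRESHOLD" ]; }

record_failure; record_failure; record_failure
is_tripped && echo "circuit tripped — skip dispatch"
record_success
is_tripped || echo "counter reset — dispatch allowed"
```

The key property to preserve in any implementation: the counter tracks *consecutive* failures, so a single success anywhere in the stream re-arms the full threshold.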
-## Step 1: Count Running Workers +## Step 1: Read Max Workers and Count Running Workers ```bash -# Count running full-loop workers (macOS pgrep has no -c flag) -WORKER_COUNT=$(pgrep -f '/full-loop' 2>/dev/null | wc -l | tr -d ' ') -echo "Running workers: $WORKER_COUNT / 6" +# Read dynamic max workers (calculated by pulse-wrapper.sh from available RAM). +# Falls back to 4 if the file doesn't exist (conservative default). +MAX_WORKERS=$(cat ~/.aidevops/logs/pulse-max-workers 2>/dev/null || echo 4) + +# Count running full-loop workers — IMPORTANT: each opencode run spawns TWO +# processes (a node launcher + the .opencode binary). Only count the .opencode +# binaries to avoid 2x inflation. Filter by the binary path, not just '/full-loop'. +WORKER_COUNT=0 +while IFS= read -r pid; do + cmd=$(ps -p "$pid" -o command= 2>/dev/null || true) + if echo "$cmd" | grep -q '\.opencode'; then + WORKER_COUNT=$((WORKER_COUNT + 1)) + fi +done < <(pgrep -f '/full-loop' 2>/dev/null || true) +echo "Running workers: $WORKER_COUNT / $MAX_WORKERS" ``` -- If `WORKER_COUNT >= 6`: set `AVAILABLE=0` — no new workers, but continue to Step 2 (merges and outcome observation don't need slots). -- Otherwise: calculate `AVAILABLE=$((6 - WORKER_COUNT))` — this is how many workers you can dispatch. +- If `WORKER_COUNT >= MAX_WORKERS`: set `AVAILABLE=0` — no new workers, but continue to Step 2 (merges and outcome observation don't need slots). +- Otherwise: calculate `AVAILABLE=$((MAX_WORKERS - WORKER_COUNT))` — this is how many workers you can dispatch. ## Step 2: Fetch GitHub State @@ -87,7 +99,7 @@ If you see a pattern (same type of failure, same error), create an improvement i **Duplicate work:** If two open PRs target the same issue or have very similar titles, flag it by commenting on the newer one. -**Long-running workers:** Check the runtime of each running worker process with `ps axo pid,etime,command | grep '/full-loop'`. 
The `etime` column shows elapsed time (format: `HH:MM` or `D-HH:MM:SS`). Parse it to get hours. +**Long-running workers:** Check the runtime of each running worker process with `ps axo pid,etime,command | grep '/full-loop' | grep '\.opencode'` (filter to `.opencode` binaries only — each worker has a `node` launcher + `.opencode` binary; only check the binary to avoid double-counting). The `etime` column shows elapsed time (format: `HH:MM` or `D-HH:MM:SS`). Parse it to get hours. Workers now have a self-imposed 2-hour time budget (see full-loop.md rule 8), but the supervisor enforces a safety net. For any worker running 2+ hours, **assess whether it's making progress** before deciding to kill: @@ -233,7 +245,7 @@ This turns blocked issues from a dead end into an actively managed queue. **Skip issues that already have an open PR:** If an issue number appears in the title or branch name of an open PR, a worker has already produced output for it. Do not dispatch another worker for the same issue. Check the PR list you already fetched — if any PR's `headRefName` or `title` contains the issue number, skip that issue. -**Deduplication — check running processes:** Before dispatching, check `ps axo command | grep '/full-loop'` for any running worker whose command line contains the issue/PR number you're about to dispatch. Different pulse runs may have used different title formats for the same work (e.g., "issue-2300-simplify-infra-scripts" vs "Issue #2300: t1337 Simplify Tier 3"). Extract the canonical number (e.g., `2300`, `t1337`) and check if ANY running worker references it. If so, skip — do not dispatch a duplicate. +**Deduplication — check running processes:** Before dispatching, check `ps axo command | grep '/full-loop' | grep '\.opencode'` for any running worker whose command line contains the issue/PR number you're about to dispatch (filter to `.opencode` binaries to avoid double-counting node launchers). 
Different pulse runs may have used different title formats for the same work (e.g., "issue-2300-simplify-infra-scripts" vs "Issue #2300: t1337 Simplify Tier 3"). Extract the canonical number (e.g., `2300`, `t1337`) and check if ANY running worker references it. If so, skip — do not dispatch a duplicate. **Blocker-chain validation (MANDATORY before dispatch):** Before dispatching a worker for any issue, validate that its entire dependency chain is resolved — not just the immediate `status:blocked` label. This prevents the #1 cause of workers running 3-9 hours without producing PRs: they start work on tasks whose prerequisites aren't merged yet, then spin trying to work around missing schemas, APIs, or migrations. @@ -276,7 +288,7 @@ If you're unsure whether it needs decomposition, dispatch the worker — but pre ## Step 4: Execute Dispatches NOW -**CRITICAL: Do not stop after Step 3. Do not present a summary and wait. Execute the commands below for every item you selected in Step 3. The goal is 6 concurrent workers at all times — if you have available slots, fill them. An idle slot is wasted capacity.** +**CRITICAL: Do not stop after Step 3. Do not present a summary and wait. Execute the commands below for every item you selected in Step 3. The goal is MAX_WORKERS concurrent workers at all times — if you have available slots, fill them. An idle slot is wasted capacity.** ### For PRs that just need merging (CI green, approved): @@ -496,7 +508,7 @@ fi The strategic review does what sonnet cannot: meta-reasoning about queue health, resource utilisation, stuck chains, stale state, and systemic issues. It can take corrective actions (merge ready PRs, file issues, clean worktrees, dispatch high-value work). -This does NOT count against the 6-worker concurrency limit — it's a supervisor function, not a task worker. +This does NOT count against the MAX_WORKERS concurrency limit — it's a supervisor function, not a task worker. 
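The counting distinction matters in practice: only `.opencode` binaries running `/full-loop` occupy worker slots, while node launchers and supervisor functions must be excluded. A hypothetical filter over `ps axo command` output (illustrative — `count_task_workers` is not an existing repo helper) could look like:

```bash
# Hypothetical helper: count task workers from process command lines,
# excluding node launchers (no `.opencode` in path) and supervisor
# functions (pulse, strategic review).
count_task_workers() {
  grep '\.opencode' | grep '/full-loop' \
    | grep -vE 'Supervisor Pulse|Strategic Review' \
    | wc -l | tr -d ' '
}

# Sample input: one real worker, one node launcher, one supervisor
# pulse — only the first line should be counted (prints 1).
count_task_workers <<'EOF'
/Users/ci/.opencode run /full-loop --title "Issue #2300"
node /opt/homebrew/bin/opencode run /full-loop
/Users/ci/.opencode run /pulse --title "Supervisor Pulse"
EOF
```

Note the node launcher is excluded because `bin/opencode` does not contain the literal `.opencode` — the same trick the worker-count snippet in Step 1 relies on.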
See `scripts/commands/strategic-review.md` for the full review prompt. @@ -504,7 +516,7 @@ See `scripts/commands/strategic-review.md` for the full review prompt. - Do NOT maintain state files, databases, or logs (the circuit breaker, stuck detection, and opus review helpers manage their own state files — those are the only exceptions) - Do NOT auto-kill workers based on stuck detection alone — stuck detection (Step 2b) is advisory only. The kill decision is separate (Step 2a) and requires your judgment -- Do NOT dispatch more workers than available slots (max 6 total) +- Do NOT dispatch more workers than available slots (max MAX_WORKERS total, read from Step 1) - Do NOT try to implement anything yourself — you are the supervisor, not a worker - Do NOT read source code, run tests, or do any task work - Do NOT retry failed workers — the next pulse will pick up where things left off diff --git a/.agents/scripts/pulse-wrapper.sh b/.agents/scripts/pulse-wrapper.sh new file mode 100755 index 000000000..4e0249bfc --- /dev/null +++ b/.agents/scripts/pulse-wrapper.sh @@ -0,0 +1,351 @@ +#!/usr/bin/env bash +# pulse-wrapper.sh - Wrapper for supervisor pulse with timeout and dedup +# +# Solves: opencode run enters idle state after completing the pulse prompt +# but never exits, blocking all future pulses via the pgrep dedup guard. +# +# This wrapper: +# 1. Uses a PID file with staleness check (not pgrep) for dedup +# 2. Runs opencode run with a hard timeout (default: 10 min) +# 3. Guarantees the process is killed if it hangs after completion +# +# Called by launchd every 120s via the supervisor-pulse plist. + +set -euo pipefail + +####################################### +# Configuration +####################################### +PULSE_TIMEOUT="${PULSE_TIMEOUT:-600}" # 10 minutes max per pulse +PULSE_STALE_THRESHOLD="${PULSE_STALE_THRESHOLD:-900}" # 15 min = definitely stuck + +# Validate numeric configuration +if ! 
[[ "$PULSE_TIMEOUT" =~ ^[0-9]+$ ]]; then + echo "[pulse-wrapper] Invalid PULSE_TIMEOUT: $PULSE_TIMEOUT — using default 600" >&2 + PULSE_TIMEOUT=600 +fi +if ! [[ "$PULSE_STALE_THRESHOLD" =~ ^[0-9]+$ ]]; then + echo "[pulse-wrapper] Invalid PULSE_STALE_THRESHOLD: $PULSE_STALE_THRESHOLD — using default 900" >&2 + PULSE_STALE_THRESHOLD=900 +fi +PIDFILE="${HOME}/.aidevops/logs/pulse.pid" +LOGFILE="${HOME}/.aidevops/logs/pulse.log" +OPENCODE_BIN="${OPENCODE_BIN:-/opt/homebrew/bin/opencode}" +PULSE_DIR="${PULSE_DIR:-${HOME}/Git/aidevops}" +PULSE_MODEL="${PULSE_MODEL:-anthropic/claude-sonnet-4-6}" +ORPHAN_MAX_AGE="${ORPHAN_MAX_AGE:-7200}" # 2 hours — kill orphans older than this +RAM_PER_WORKER_MB="${RAM_PER_WORKER_MB:-1024}" # 1 GB per worker +RAM_RESERVE_MB="${RAM_RESERVE_MB:-8192}" # 8 GB reserved for OS + user apps +MAX_WORKERS_CAP="${MAX_WORKERS_CAP:-8}" # Hard ceiling regardless of RAM + +####################################### +# Ensure log directory exists +####################################### +mkdir -p "$(dirname "$PIDFILE")" + +####################################### +# Check for stale PID file and clean up +# Returns: 0 if safe to proceed, 1 if another pulse is genuinely running +####################################### +check_dedup() { + if [[ ! -f "$PIDFILE" ]]; then + return 0 + fi + + local old_pid + old_pid=$(cat "$PIDFILE" 2>/dev/null || echo "") + + if [[ -z "$old_pid" ]]; then + rm -f "$PIDFILE" + return 0 + fi + + # Check if the process is still running + if ! 
kill -0 "$old_pid" 2>/dev/null; then + # Process is dead, clean up stale PID file + rm -f "$PIDFILE" + return 0 + fi + + # Process is running — check how long + local elapsed_seconds + elapsed_seconds=$(_get_process_age "$old_pid") + + if [[ "$elapsed_seconds" -gt "$PULSE_STALE_THRESHOLD" ]]; then + # Process has been running too long — it's stuck + echo "[pulse-wrapper] Killing stale pulse process $old_pid (running ${elapsed_seconds}s, threshold ${PULSE_STALE_THRESHOLD}s)" >>"$LOGFILE" + _kill_tree "$old_pid" + sleep 2 + # Force kill if still alive + if kill -0 "$old_pid" 2>/dev/null; then + _force_kill_tree "$old_pid" + fi + rm -f "$PIDFILE" + return 0 + fi + + # Process is running and within time limit — genuine dedup + echo "[pulse-wrapper] Pulse already running (PID $old_pid, ${elapsed_seconds}s elapsed). Skipping." >>"$LOGFILE" + return 1 +} + +####################################### +# Kill a process and all its children (macOS-compatible) +# Arguments: +# $1 - PID to kill +####################################### +_kill_tree() { + local pid="$1" + # Find all child processes recursively (bash 3.2 compatible — no mapfile) + local child + while IFS= read -r child; do + [[ -n "$child" ]] && _kill_tree "$child" + done < <(pgrep -P "$pid" 2>/dev/null || true) + kill "$pid" 2>/dev/null || true + return 0 +} + +####################################### +# Force kill a process and all its children +# Arguments: +# $1 - PID to kill +####################################### +_force_kill_tree() { + local pid="$1" + local child + while IFS= read -r child; do + [[ -n "$child" ]] && _force_kill_tree "$child" + done < <(pgrep -P "$pid" 2>/dev/null || true) + kill -9 "$pid" 2>/dev/null || true + return 0 +} + +####################################### +# Get process age in seconds +# Arguments: +# $1 - PID +# Returns: elapsed seconds via stdout +####################################### +_get_process_age() { + local pid="$1" + local etime + # macOS ps etime format: MM:SS or 
HH:MM:SS or D-HH:MM:SS + etime=$(ps -p "$pid" -o etime= 2>/dev/null | tr -d ' ') || etime="" + + if [[ -z "$etime" ]]; then + echo "0" + return 0 + fi + + local days=0 hours=0 minutes=0 seconds=0 + + # Parse D-HH:MM:SS format + if [[ "$etime" == *-* ]]; then + days="${etime%%-*}" + etime="${etime#*-}" + fi + + # Count colons to determine format + local colon_count + colon_count=$(echo "$etime" | tr -cd ':' | wc -c | tr -d ' ') + + if [[ "$colon_count" -eq 2 ]]; then + # HH:MM:SS + IFS=':' read -r hours minutes seconds <<<"$etime" + elif [[ "$colon_count" -eq 1 ]]; then + # MM:SS + IFS=':' read -r minutes seconds <<<"$etime" + else + seconds="$etime" + fi + + # Remove leading zeros to avoid octal interpretation + days=$((10#${days})) + hours=$((10#${hours})) + minutes=$((10#${minutes})) + seconds=$((10#${seconds})) + + echo $((days * 86400 + hours * 3600 + minutes * 60 + seconds)) + return 0 +} + +####################################### +# Run the pulse with timeout +####################################### +run_pulse() { + echo "[pulse-wrapper] Starting pulse at $(date -u +%Y-%m-%dT%H:%M:%SZ)" >>"$LOGFILE" + + # Start opencode run in background + "$OPENCODE_BIN" run "/pulse" \ + --dir "$PULSE_DIR" \ + -m "$PULSE_MODEL" \ + --title "Supervisor Pulse" \ + >>"$LOGFILE" 2>&1 & + + local opencode_pid=$! 
+ echo "$opencode_pid" >"$PIDFILE" + + echo "[pulse-wrapper] opencode PID: $opencode_pid, timeout: ${PULSE_TIMEOUT}s" >>"$LOGFILE" + + # Wait for completion OR timeout + local waited=0 + local check_interval=5 + + while kill -0 "$opencode_pid" 2>/dev/null; do + if [[ "$waited" -ge "$PULSE_TIMEOUT" ]]; then + echo "[pulse-wrapper] Timeout after ${PULSE_TIMEOUT}s — killing opencode PID $opencode_pid and children" >>"$LOGFILE" + _kill_tree "$opencode_pid" + sleep 2 + # Force kill if graceful shutdown didn't work + if kill -0 "$opencode_pid" 2>/dev/null; then + echo "[pulse-wrapper] Force killing PID $opencode_pid" >>"$LOGFILE" + _force_kill_tree "$opencode_pid" + fi + break + fi + sleep "$check_interval" + waited=$((waited + check_interval)) + done + + # Clean up PID file + rm -f "$PIDFILE" + + # Wait to collect exit status (avoid zombie) + wait "$opencode_pid" 2>/dev/null || true + + echo "[pulse-wrapper] Pulse completed at $(date -u +%Y-%m-%dT%H:%M:%SZ) (ran ${waited}s)" >>"$LOGFILE" + return 0 +} + +####################################### +# Main +####################################### +main() { + if ! check_dedup; then + return 0 + fi + + cleanup_orphans + calculate_max_workers + run_pulse + return 0 +} + +####################################### +# Kill orphaned opencode processes +# +# Criteria (ALL must be true): +# - No TTY (headless — not a user's terminal tab) +# - Not a current worker (/full-loop not in command) +# - Not the supervisor pulse (Supervisor Pulse not in command) +# - Not a strategic review (Strategic Review not in command) +# - Older than ORPHAN_MAX_AGE seconds +# +# These are completed headless sessions where opencode entered idle +# state with a file watcher and never exited. 
+####################################### +cleanup_orphans() { + local killed=0 + local total_mb=0 + + while IFS= read -r line; do + local pid tty etime rss cmd + pid=$(echo "$line" | awk '{print $1}') + tty=$(echo "$line" | awk '{print $2}') + etime=$(echo "$line" | awk '{print $3}') + rss=$(echo "$line" | awk '{print $4}') + cmd=$(echo "$line" | cut -d' ' -f5-) + + # Skip interactive sessions (has a real TTY) + if [[ "$tty" != "??" ]]; then + continue + fi + + # Skip active workers, pulse, strategic reviews, and language servers + if echo "$cmd" | grep -qE '/full-loop|Supervisor Pulse|Strategic Review|language-server|eslintServer'; then + continue + fi + + # Skip young processes + local age_seconds + age_seconds=$(_get_process_age "$pid") + if [[ "$age_seconds" -lt "$ORPHAN_MAX_AGE" ]]; then + continue + fi + + # This is an orphan — kill it + local mb=$((rss / 1024)) + kill "$pid" 2>/dev/null || true + killed=$((killed + 1)) + total_mb=$((total_mb + mb)) + done < <(ps axo pid,tty,etime,rss,command | grep '\.opencode' | grep -v grep | grep -v 'bash-language-server') + + # Also kill orphaned node launchers (parent of .opencode processes) + while IFS= read -r line; do + local pid tty etime rss cmd + pid=$(echo "$line" | awk '{print $1}') + tty=$(echo "$line" | awk '{print $2}') + etime=$(echo "$line" | awk '{print $3}') + rss=$(echo "$line" | awk '{print $4}') + cmd=$(echo "$line" | cut -d' ' -f5-) + + [[ "$tty" != "??" 
]] && continue + echo "$cmd" | grep -qE '/full-loop|Supervisor Pulse|Strategic Review|language-server|eslintServer' && continue + + local age_seconds + age_seconds=$(_get_process_age "$pid") + [[ "$age_seconds" -lt "$ORPHAN_MAX_AGE" ]] && continue + + kill "$pid" 2>/dev/null || true + local mb=$((rss / 1024)) + killed=$((killed + 1)) + total_mb=$((total_mb + mb)) + done < <(ps axo pid,tty,etime,rss,command | grep 'node.*opencode' | grep -v grep | grep -v '\.opencode') + + if [[ "$killed" -gt 0 ]]; then + echo "[pulse-wrapper] Cleaned up $killed orphaned opencode processes (freed ~${total_mb}MB)" >>"$LOGFILE" + fi + return 0 +} + +####################################### +# Calculate max workers from available RAM +# +# Formula: (free_ram - RAM_RESERVE_MB) / RAM_PER_WORKER_MB +# Clamped to [1, MAX_WORKERS_CAP] +# +# Writes MAX_WORKERS to a file that pulse.md reads via bash. +####################################### +calculate_max_workers() { + local free_mb + if [[ "$(uname)" == "Darwin" ]]; then + # macOS: use vm_stat for free + inactive (reclaimable) pages + local page_size free_pages inactive_pages + page_size=$(sysctl -n hw.pagesize 2>/dev/null || echo 16384) + free_pages=$(vm_stat 2>/dev/null | awk '/Pages free/ {gsub(/\./,"",$3); print $3}') + inactive_pages=$(vm_stat 2>/dev/null | awk '/Pages inactive/ {gsub(/\./,"",$3); print $3}') + free_mb=$(((free_pages + inactive_pages) * page_size / 1024 / 1024)) + else + # Linux: use MemAvailable from /proc/meminfo + free_mb=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo 2>/dev/null || echo 8192) + fi + + local available_mb=$((free_mb - RAM_RESERVE_MB)) + local max_workers=$((available_mb / RAM_PER_WORKER_MB)) + + # Clamp to [1, MAX_WORKERS_CAP] + if [[ "$max_workers" -lt 1 ]]; then + max_workers=1 + elif [[ "$max_workers" -gt "$MAX_WORKERS_CAP" ]]; then + max_workers="$MAX_WORKERS_CAP" + fi + + # Write to a file that pulse.md can read + local max_workers_file="${HOME}/.aidevops/logs/pulse-max-workers" + echo 
"$max_workers" >"$max_workers_file" + + echo "[pulse-wrapper] Available RAM: ${free_mb}MB, reserve: ${RAM_RESERVE_MB}MB, max workers: ${max_workers}" >>"$LOGFILE" + return 0 +} + +main "$@" diff --git a/.agents/scripts/supervisor-archived/launchd.sh b/.agents/scripts/supervisor-archived/launchd.sh index 92664b90c..17e01e9fb 100755 --- a/.agents/scripts/supervisor-archived/launchd.sh +++ b/.agents/scripts/supervisor-archived/launchd.sh @@ -131,33 +131,33 @@ _generate_supervisor_pulse_plist() { local label="com.aidevops.aidevops-supervisor-pulse" - # Build ProgramArguments array - local prog_args - prog_args="${script_path} - pulse" - if [[ -n "$batch_arg" ]]; then - # batch_arg is "--batch id" — split into two strings - local batch_id - batch_id="${batch_arg#--batch }" - prog_args="${prog_args} - --batch - ${batch_id}" + # Use pulse-wrapper.sh which handles dedup (PID file with staleness), + # timeout (kills opencode if it hangs after completing), and cleanup. + # This replaces the old pgrep-based dedup guard that failed when + # opencode run entered idle state without exiting (t1345). + local wrapper_path + wrapper_path="$(dirname "$script_path")/../scripts/pulse-wrapper.sh" + # Resolve to absolute path + if [[ -f "$HOME/.aidevops/agents/scripts/pulse-wrapper.sh" ]]; then + wrapper_path="$HOME/.aidevops/agents/scripts/pulse-wrapper.sh" fi - # Build EnvironmentVariables dict + # Validate wrapper exists before generating plist + if [[ ! -f "$wrapper_path" ]]; then + echo "ERROR: pulse-wrapper.sh not found at $wrapper_path" >&2 + return 1 + fi + + # Build EnvironmentVariables dict — wrapper reads these local env_dict env_dict="PATH ${env_path} - SUPERVISOR_PULSE_LOCK_TIMEOUT - 1800 - SUPERVISOR_AI_TIMEOUT - 600" - # SUPERVISOR_PULSE_LOCK_TIMEOUT=1800s (30 min): prevents concurrent pulses when - # Phase 14 AI reasoning runs long (108KB+ context can take 6+ min). 
The default - # 600s was shorter than the total pulse duration, causing concurrent pulses that - # interfered with each other via Phase 4e orphan killing (t1301). - # SUPERVISOR_AI_TIMEOUT=600s (10 min): gives the AI CLI more time to respond to - # large context prompts before portable_timeout fires (t1301). + HOME + ${HOME} + PULSE_TIMEOUT + 600 + PULSE_STALE_THRESHOLD + 900" if [[ -n "$gh_token" ]]; then env_dict="${env_dict} GH_TOKEN @@ -173,7 +173,8 @@ _generate_supervisor_pulse_plist() { ${label} ProgramArguments - ${prog_args} + /bin/bash + ${wrapper_path} StartInterval ${interval_seconds} diff --git a/setup.sh b/setup.sh index a9eb48f11..02624bcb0 100755 --- a/setup.sh +++ b/setup.sh @@ -662,32 +662,41 @@ main() { fi fi - # Enable supervisor pulse scheduler if not already installed - # The pulse dispatches AI workers via opencode run "/pulse" every 2 minutes. - # macOS: launchd plist | Linux: cron entry - local opencode_path - opencode_path=$(command -v opencode 2>/dev/null) || opencode_path="" - if [[ -n "$opencode_path" ]] && [[ "${AIDEVOPS_SUPERVISOR_PULSE:-true}" != "false" ]]; then - local _pulse_installed=false + # Enable supervisor pulse scheduler + # Uses pulse-wrapper.sh which handles timeout, dedup, and process cleanup. + # macOS: launchd plist invoking wrapper | Linux: cron entry invoking wrapper + # The wrapper invokes `opencode run "/pulse"` (AI-guided supervisor) with a + # hard timeout to prevent hangs when opencode enters idle state without exiting. 
+ if [[ "${AIDEVOPS_SUPERVISOR_PULSE:-true}" != "false" ]]; then + local wrapper_script="$HOME/.aidevops/agents/scripts/pulse-wrapper.sh" + local pulse_label="com.aidevops.aidevops-supervisor-pulse" local _aidevops_dir _aidevops_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + # Detect if pulse is already installed (any platform) + local _pulse_installed=false if [[ "$(uname -s)" == "Darwin" ]]; then - # macOS: check launchd - if _launchd_has_agent "com.aidevops.aidevops-supervisor-pulse"; then - _pulse_installed=true + local pulse_plist="$HOME/Library/LaunchAgents/${pulse_label}.plist" + if _launchd_has_agent "$pulse_label"; then + # Check if it's the old format (needs upgrade to wrapper-based) + if [[ -f "$pulse_plist" ]] && grep -q "pulse-wrapper" "$pulse_plist" 2>/dev/null; then + _pulse_installed=true + fi + # Old format will fall through to upgrade path fi fi # Both platforms: check cron - if [[ "$_pulse_installed" == "false" ]] && crontab -l 2>/dev/null | grep -qF "aidevops: supervisor-pulse"; then + if [[ "$_pulse_installed" == "false" ]] && crontab -l 2>/dev/null | grep -qF "pulse-wrapper" 2>/dev/null; then _pulse_installed=true fi - if [[ "$_pulse_installed" == "false" ]]; then - local _install_pulse=false - if [[ "$NON_INTERACTIVE" == "true" ]]; then - _install_pulse=true - else + if [[ "$_pulse_installed" == "false" && -f "$wrapper_script" ]]; then + # Detect opencode binary location + local opencode_bin + opencode_bin=$(command -v opencode 2>/dev/null || echo "/opt/homebrew/bin/opencode") + + local _do_install=true + if [[ "$NON_INTERACTIVE" != "true" ]]; then echo "" echo "The supervisor pulse enables autonomous orchestration:" echo " - Dispatches AI workers to implement tasks from GitHub issues" @@ -696,27 +705,95 @@ main() { echo " - Circuit breaker pauses dispatch on consecutive failures" echo "" read -r -p "Enable supervisor pulse? 
[Y/n]: " enable_pulse - if [[ "$enable_pulse" =~ ^[Yy]?$ || -z "$enable_pulse" ]]; then - _install_pulse=true - else - print_info "Skipped. Enable later — see: runners.md 'Pulse Scheduler Setup'" + if [[ ! "$enable_pulse" =~ ^[Yy]?$ && -n "$enable_pulse" ]]; then + _do_install=false + print_info "Skipped. Enable later: ./setup.sh (re-run)" fi fi - if [[ "$_install_pulse" == "true" ]]; then + if [[ "$_do_install" == "true" ]]; then mkdir -p "$HOME/.aidevops/logs" - local _pulse_cmd="$opencode_path run \"/pulse\" --dir $_aidevops_dir -m anthropic/claude-sonnet-4-6 --title \"Supervisor Pulse\"" - # Install as cron entry (works on both macOS and Linux) - ( - crontab -l 2>/dev/null | grep -v 'aidevops: supervisor-pulse' - echo "*/2 * * * * pgrep -f 'Supervisor Pulse' >/dev/null || $_pulse_cmd >> $HOME/.aidevops/logs/pulse.log 2>&1 # aidevops: supervisor-pulse" - ) | crontab - 2>/dev/null || true - if crontab -l 2>/dev/null | grep -qF "aidevops: supervisor-pulse"; then - print_info "Supervisor pulse enabled (cron, every 2 min). 
Disable: crontab -e and remove the supervisor-pulse line"
+
+      if [[ "$(uname -s)" == "Darwin" ]]; then
+        # macOS: use launchd plist with wrapper
+        local pulse_plist="$HOME/Library/LaunchAgents/${pulse_label}.plist"
+
+        # Unload old plist if upgrading
+        if _launchd_has_agent "$pulse_label"; then
+          launchctl unload "$pulse_plist" 2>/dev/null || true
+          pkill -f 'Supervisor Pulse' 2>/dev/null || true
+        fi
+
+        # Also clean up old label if present
+        local old_plist="$HOME/Library/LaunchAgents/com.aidevops.supervisor-pulse.plist"
+        if [[ -f "$old_plist" ]]; then
+          launchctl unload "$old_plist" 2>/dev/null || true
+          rm -f "$old_plist"
+        fi
+
+        # Write the plist
+        cat >"$pulse_plist" <<PLIST
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
+<plist version="1.0">
+<dict>
+  <key>Label</key>
+  <string>${pulse_label}</string>
+  <key>ProgramArguments</key>
+  <array>
+    <string>/bin/bash</string>
+    <string>${wrapper_script}</string>
+  </array>
+  <key>StartInterval</key>
+  <integer>120</integer>
+  <key>StandardOutPath</key>
+  <string>${HOME}/.aidevops/logs/pulse-wrapper.log</string>
+  <key>StandardErrorPath</key>
+  <string>${HOME}/.aidevops/logs/pulse-wrapper.log</string>
+  <key>EnvironmentVariables</key>
+  <dict>
+    <key>PATH</key>
+    <string>${PATH}</string>
+    <key>HOME</key>
+    <string>${HOME}</string>
+    <key>OPENCODE_BIN</key>
+    <string>${opencode_bin}</string>
+    <key>PULSE_DIR</key>
+    <string>${_aidevops_dir}</string>
+    <key>PULSE_TIMEOUT</key>
+    <string>600</string>
+    <key>PULSE_STALE_THRESHOLD</key>
+    <string>900</string>
+  </dict>
+  <key>RunAtLoad</key>
+  <false/>
+  <key>KeepAlive</key>
+  <false/>
+</dict>
+</plist>
+PLIST
+
+        if launchctl load "$pulse_plist" 2>/dev/null; then
+          print_info "Supervisor pulse enabled (launchd, every 2 min)"
+        else
+          print_warning "Failed to load supervisor pulse LaunchAgent"
+        fi
       else
-        print_warning "Failed to install supervisor pulse cron entry. See runners.md for manual setup."
+        # Linux: use cron entry with wrapper
+        # Remove old-style cron entries (direct opencode invocation)
+        (
+          crontab -l 2>/dev/null | grep -v 'aidevops: supervisor-pulse'
+          echo "*/2 * * * * OPENCODE_BIN=${opencode_bin} PULSE_DIR=${_aidevops_dir} /bin/bash ${wrapper_script} >> $HOME/.aidevops/logs/pulse-wrapper.log 2>&1 # aidevops: supervisor-pulse"
+        ) | crontab - 2>/dev/null || true
+        if crontab -l 2>/dev/null | grep -qF "aidevops: supervisor-pulse"; then
+          print_info "Supervisor pulse enabled (cron, every 2 min). Disable: crontab -e and remove the supervisor-pulse line"
+        else
+          print_warning "Failed to install supervisor pulse cron entry. See runners.md for manual setup."
+        fi
+      fi
     fi
+  elif [[ ! -f "$wrapper_script" ]]; then
+    print_warning "pulse-wrapper.sh not found — supervisor pulse not installed"
   fi
 fi
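The `etime` parsing in `_get_process_age` is the easiest part of this patch to get wrong (three formats, leading-zero octal pitfalls), so the conversion logic can be exercised standalone. The sketch below mirrors that function's arithmetic under the stated assumptions — it is not a drop-in replacement, since the real function reads `ps` directly:

```bash
# Convert a BSD ps etime string (SS, MM:SS, HH:MM:SS, or D-HH:MM:SS)
# into total seconds — mirrors the parsing in _get_process_age.
parse_etime() {
  local etime="$1" days=0 hours=0 minutes=0 seconds=0
  # Split off the day count if present (D-HH:MM:SS)
  if [[ "$etime" == *-* ]]; then
    days="${etime%%-*}"
    etime="${etime#*-}"
  fi
  # Count colons to distinguish HH:MM:SS from MM:SS from bare seconds
  local colon_count
  colon_count=$(tr -cd ':' <<<"$etime" | wc -c | tr -d ' ')
  if [[ "$colon_count" -eq 2 ]]; then
    IFS=':' read -r hours minutes seconds <<<"$etime"
  elif [[ "$colon_count" -eq 1 ]]; then
    IFS=':' read -r minutes seconds <<<"$etime"
  else
    seconds="$etime"
  fi
  # Force base-10 so fields like "08" aren't rejected as invalid octal
  echo $((10#$days * 86400 + 10#$hours * 3600 + 10#$minutes * 60 + 10#$seconds))
}

parse_etime "02:03"        # → 123
parse_etime "01:02:03"     # → 3723
parse_etime "2-01:02:03"   # → 176523
```

The `10#` base prefix is the load-bearing detail: without it, `08` and `09` fields from `ps` abort the arithmetic expansion under `set -e`.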