Merged
66 changes: 45 additions & 21 deletions .agents/scripts/pulse-wrapper.sh
@@ -8,13 +8,14 @@
# 1. Uses a PID file with staleness check (not pgrep) for dedup
# 2. Cleans up orphaned opencode processes before each pulse
# 3. Calculates dynamic worker concurrency from available RAM
-# 4. Lets the pulse run to completion — no hard timeout
+# 4. Internal watchdog kills stuck pulses after PULSE_STALE_THRESHOLD (t1397)
#
# Lifecycle: launchd fires every 120s. If a pulse is still running, the
-# dedup check skips. If a pulse has been running longer than PULSE_STALE_THRESHOLD
-# (default 30 min), it's assumed stuck (opencode idle bug) and killed so the
-# next invocation can start fresh. This is the ONLY kill mechanism — no
-# arbitrary timeouts that would interrupt active work.
+# dedup check skips. run_pulse() has an internal watchdog that polls every
+# 60s and kills the opencode process if it exceeds PULSE_STALE_THRESHOLD
+# (default 30 min). This ensures the wrapper always exits, allowing launchd
+# to fire the next invocation. check_dedup() serves as a secondary safety
+# net for edge cases where the wrapper itself gets stuck.
#
# Called by launchd every 120s via the supervisor-pulse plist.

@@ -842,15 +843,18 @@ prefetch_active_workers() {
}

#######################################
-# Run the pulse — no hard timeout
+# Run the pulse — with internal watchdog timeout (t1397)
#
-# The pulse runs until opencode exits naturally. If opencode enters its
-# idle-state bug (file watcher keeps process alive after session completes),
-# the NEXT launchd invocation's check_dedup() will detect the stale process
-# (age > PULSE_STALE_THRESHOLD) and kill it. This is correct because:
-# - Active pulses doing real work are never interrupted
-# - Stuck pulses are detected by the next invocation (120s later)
-# - The stale threshold (30 min) is generous enough for any real workload
+# The pulse runs until opencode exits naturally. A watchdog loop checks
+# every 60s whether the process has exceeded PULSE_STALE_THRESHOLD. If so,
+# it kills the process tree and returns, allowing the wrapper to continue
+# to the quality sweep and health issue phases.
+#
+# Previous design relied on the NEXT launchd invocation's check_dedup()
+# to kill stale processes. This failed because launchd StartInterval only
+# fires when the previous invocation has exited — and the wrapper blocks
+# on `wait`, so the next invocation never starts. The watchdog is now
+# internal to the same process that spawned opencode.
#######################################
run_pulse() {
local start_epoch
@@ -869,7 +873,7 @@ ${state_content}
--- END PRE-FETCHED STATE ---"
fi

-# Run opencode — blocks until it exits (or is killed by next invocation's stale check)
+# Run opencode in background
"$OPENCODE_BIN" run "$prompt" \
--dir "$PULSE_DIR" \
-m "$PULSE_MODEL" \
@@ -881,7 +885,29 @@ ${state_content}

echo "[pulse-wrapper] opencode PID: $opencode_pid" >>"$LOGFILE"

-# Wait for natural exit
+# Watchdog loop: check every 60s if the process is still alive and within
+# the stale threshold. This replaces the bare `wait` that blocked the
+# wrapper indefinitely when opencode hung.
+while kill -0 "$opencode_pid" 2>/dev/null; do

medium

To align with the project's general rules of not suppressing stderr unnecessarily, consider using ps to check for process existence. This avoids masking potential permission errors (EPERM) that kill -0 with 2>/dev/null would hide.

Suggested change
-while kill -0 "$opencode_pid" 2>/dev/null; do
+while ps -p "$opencode_pid" >/dev/null; do
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

+local now
+now=$(date +%s)
+local elapsed=$((now - start_epoch))
+if [[ "$elapsed" -gt "$PULSE_STALE_THRESHOLD" ]]; then
+echo "[pulse-wrapper] Pulse exceeded stale threshold (${elapsed}s > ${PULSE_STALE_THRESHOLD}s) — killing" >>"$LOGFILE"
+_kill_tree "$opencode_pid"
+sleep 2
+# Force kill if still alive
+if kill -0 "$opencode_pid" 2>/dev/null; then

medium

Similar to the while loop condition, using ps here is more robust as it avoids suppressing potentially important errors that kill -0 with 2>/dev/null would hide.

Suggested change
-if kill -0 "$opencode_pid" 2>/dev/null; then
+if ps -p "$opencode_pid" >/dev/null; then
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

+_force_kill_tree "$opencode_pid"
+fi
+break
+fi
+# Sleep 60s then re-check. Portable across bash 3.2+ (macOS default).
+# The process may exit during sleep — kill -0 at top of loop catches that.
+sleep 60
+done
+
+# Reap the process (may already be dead)
wait "$opencode_pid" 2>/dev/null || true

# Clean up PID file
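
The watchdog pattern in the hunk above can be sketched in isolation with the timings scaled down to seconds. This is a minimal sketch and not the wrapper's actual code: `demo_threshold` and `poll_interval` are hypothetical stand-ins for `PULSE_STALE_THRESHOLD` and the 60s poll, `sleep 30` stands in for the opencode process, and a plain `kill` stands in for `_kill_tree`, whose implementation this diff does not show. The liveness check mirrors the `kill -0 ... 2>/dev/null` form as it appears in the diff.

```shell
# Scaled-down watchdog sketch. All names below are hypothetical.
demo_threshold=2   # stand-in for PULSE_STALE_THRESHOLD, in seconds
poll_interval=1    # stand-in for the 60s poll

sleep 30 &         # stand-in for the long-running child process
child_pid=$!
start_epoch=$(date +%s)
killed=no

# Poll while the child is alive; kill it once it outlives the threshold.
while kill -0 "$child_pid" 2>/dev/null; do
	now=$(date +%s)
	elapsed=$((now - start_epoch))
	if [ "$elapsed" -gt "$demo_threshold" ]; then
		kill "$child_pid"   # stand-in for _kill_tree
		killed=yes
		break
	fi
	sleep "$poll_interval"
done

# Reap the child so it does not linger as a zombie.
wait "$child_pid" 2>/dev/null || true

end_epoch=$(date +%s)
total=$((end_epoch - start_epoch))
echo "killed=$killed after ${total}s"
```

Because the loop and the child live in the same script, the script always reaches the line after `wait`, which is the property the new design relies on so that launchd can start the next invocation.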
@@ -1467,12 +1493,10 @@ run_daily_quality_sweep() {
# Timestamp guard — run at most once per QUALITY_SWEEP_INTERVAL
if [[ -f "$QUALITY_SWEEP_LAST_RUN" ]]; then
local last_run
-last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" || echo "0")
-# Validate integer before arithmetic expansion (prevents command injection)
-if ! [[ "$last_run" =~ ^[0-9]+$ ]]; then
-echo "[pulse-wrapper] Corrupt sweep timestamp '${last_run}' — resetting" >>"$LOGFILE"
-last_run=0
-fi
+last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")

medium

One of the general rules for this repository advises against using 2>/dev/null on file operations when the file's existence has already been checked. Since line 1494 already verifies that $QUALITY_SWEEP_LAST_RUN is a file, redirecting stderr to /dev/null here could mask important issues like read permission errors. The || echo "0" fallback is sufficient to handle cases where cat fails.

Suggested change
-last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")
+last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" || echo "0")
References
  1. Avoid using 2>/dev/null to suppress errors on file operations if the file's existence has already been verified by a preceding check (e.g., [[ -f "$file" ]] or an early return). This practice is redundant for 'file not found' errors and can mask other important issues like permissions problems.

+# Strip whitespace/newlines and validate integer (t1397)
+last_run="${last_run//[^0-9]/}"
+last_run="${last_run:-0}"
local now
now=$(date +%s)
local elapsed=$((now - last_run))
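
The digit-stripping step added in this hunk can be tried in isolation. The sketch below is an illustration under stated assumptions, not the script's code: `sanitize_epoch` is a hypothetical helper, and it uses `tr -cd '0-9'` as a POSIX stand-in for the bash expansion `${last_run//[^0-9]/}` so that it also runs under plain `sh`.

```shell
# Hypothetical helper: keep only the digits of its argument, default to 0.
# tr -cd '0-9' deletes every character that is not a digit, which is a
# POSIX stand-in for the bash expansion ${last_run//[^0-9]/} in the diff.
sanitize_epoch() (
	digits=$(printf '%s' "$1" | tr -cd '0-9')
	printf '%s\n' "${digits:-0}"
)

clean=$(sanitize_epoch ' 1700000000 ')   # surrounding whitespace is dropped
empty=$(sanitize_epoch '')               # empty input falls back to 0
mixed=$(sanitize_epoch '12a34')          # letters are dropped, digits are joined

echo "$clean $empty $mixed"
```

Stripping is not the same as validating: an input such as `12a34` becomes `1234` instead of being rejected, whereas the regex check removed in this hunk reset any non-numeric value to 0 and logged it.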