t1397: fix pulse-wrapper stuck process — add watchdog timeout #2853
alex-solovyev merged 1 commit into main
The bare `wait $opencode_pid` blocked the wrapper indefinitely when opencode hung. Since launchd StartInterval only fires when the previous invocation exits, the stale-process check in check_dedup() never ran — the wrapper was still alive, just blocked on wait. Replace with a watchdog loop that polls every 60s and kills the process tree if it exceeds PULSE_STALE_THRESHOLD (30min). Portable across bash 3.2+ (macOS default). Also simplify quality sweep timestamp validation — strip non-numeric chars and default to 0, replacing the verbose regex check.
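The polling approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from `pulse-wrapper.sh`: the helper names `kill_tree` and `run_with_watchdog`, the exit code 124, and the use of `pgrep -P` to find child processes are all assumptions made for the example. It sticks to constructs available in bash 3.2.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a polling watchdog; names and defaults are assumptions.

# Signal a process and all of its descendants, children first.
# Assumes pgrep is available; if it is not, only the top-level PID is signalled.
kill_tree() {
	local pid="$1" sig="${2:-TERM}" child
	for child in $(pgrep -P "$pid" 2>/dev/null); do
		kill_tree "$child" "$sig"
	done
	kill "-$sig" "$pid" 2>/dev/null || true
}

# Usage: run_with_watchdog <threshold_seconds> <poll_seconds> <command...>
# Runs the command in the background and polls until it exits or exceeds the
# threshold. Returns the command's exit status, or 124 if it had to be killed.
run_with_watchdog() {
	local threshold="$1" poll="$2"
	shift 2
	"$@" &
	local pid=$! start elapsed
	start=$(date +%s)
	while kill -0 "$pid" 2>/dev/null; do
		elapsed=$(( $(date +%s) - start ))
		if [ "$elapsed" -ge "$threshold" ]; then
			kill_tree "$pid" TERM
			sleep 2
			# Force kill if still alive after the grace period
			if kill -0 "$pid" 2>/dev/null; then
				kill_tree "$pid" KILL
			fi
			wait "$pid" 2>/dev/null
			return 124
		fi
		sleep "$poll"
	done
	# The child has already exited; wait only collects its status.
	wait "$pid"
}
```

With the values from this PR, a call might look like `run_with_watchdog 1800 60 opencode run ...`. The caller regains control after at most the threshold plus one poll interval and the grace period, so later phases still run even when the child hangs.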
Walkthrough
Convert opencode invocation from blocking wait to background execution with internal watchdog monitoring. Kill process tree when exceeding PULSE_STALE_THRESHOLD via 60-second polling loop. Enhance quality sweep timestamp handling with robust sanitization. Update related documentation.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Thu Mar 5 01:20:01 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Missing issue link. This PR has task ID Add
Pulse triage: PR is mergeable (author: marcusquinn, maintainer). CodeRabbit review is in-progress; will merge once CodeRabbit posts its final review. All other checks pass (SonarCloud, Codacy, qlty, Socket).
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.agents/scripts/pulse-wrapper.sh (1)
1496-1503: ⚠️ Potential issue | 🟠 Major
Timestamp sanitization via digit-stripping can corrupt the elapsed-time guard, causing sweep suppression.
The parameter expansion `${last_run//[^0-9]/}` (line 1499) strips all non-digits and concatenates the remainder. When the timestamp file contains newlines or whitespace between digits (e.g., `1700000000\n9999999999`), it produces `17000000009999999999`, which invalidates the `elapsed` calculation and breaks the interval guard.
Implement strict validation: accept only pure epoch integers and reject future-dated or corrupted values.
Proposed fix
-	last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")
-	# Strip whitespace/newlines and validate integer (t1397)
-	last_run="${last_run//[^0-9]/}"
-	last_run="${last_run:-0}"
-	local now
-	now=$(date +%s)
+	last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")
+	local now
+	now=$(date +%s)
+	# Accept only a pure epoch integer; reset invalid/future values.
+	if ! [[ "$last_run" =~ ^[0-9]+$ ]] || [[ "$last_run" -gt "$now" ]]; then
+		echo "[pulse-wrapper] Invalid quality sweep timestamp '${last_run}' — resetting to 0" >>"$LOGFILE"
+		last_run=0
+	fi
This ensures reliability as required by automation script guidelines and prevents long-term suppression of the quality sweep due to corrupted timestamps.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/pulse-wrapper.sh around lines 1496 - 1503, The timestamp sanitization currently using last_run="${last_run//[^0-9]/}" can produce corrupted multi-timestamp concatenations; change the logic around QUALITY_SWEEP_LAST_RUN/last_run to be strict: read only the first token/line, trim whitespace, validate that it matches the regex ^[0-9]+$ (pure digits), parse it as an integer and check it is <= current epoch (now=$(date +%s)); if validation fails or it is future-dated, reset last_run to 0 so elapsed calculation uses a safe value; ensure the elapsed calculation and the guard against QUALITY_SWEEP_INTERVAL use this validated last_run variable.
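The stricter validation described in this prompt (first line only, trimmed, digits only, not in the future) could look roughly like the sketch below. The function name `read_last_run`, its optional second argument for injecting the current time, and the 15-digit length cap are assumptions for illustration; none of them come from `pulse-wrapper.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a validated epoch from a timestamp file, or 0.
# Usage: read_last_run <file> [now_epoch]
read_last_run() {
	local file="$1" now="${2:-$(date +%s)}" line=""
	# Read only the first line. With the default IFS, read also trims leading
	# and trailing whitespace. A missing or unreadable file leaves line empty.
	if [ -r "$file" ]; then
		read -r line <"$file" || true
	fi
	# Reject empty values and anything containing a non-digit character.
	case "$line" in
	'' | *[!0-9]*)
		echo 0
		return
		;;
	esac
	# Assumption: more than 15 digits is not a plausible epoch, and very long
	# digit strings could overflow the integer comparison below.
	if [ "${#line}" -gt 15 ]; then
		echo 0
		return
	fi
	# Reject future-dated values.
	if [ "$line" -gt "$now" ]; then
		echo 0
	else
		echo "$line"
	fi
}
```

Because only the first line is read, a file containing `1700000000` followed by `9999999999` on a second line yields `1700000000` rather than a concatenation of the two. A caller would then compute `elapsed` from the returned value and compare it against `QUALITY_SWEEP_INTERVAL` as before.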
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1bb1ab44-f0ee-4c09-b895-6d2800bca53b
📒 Files selected for processing (1)
.agents/scripts/pulse-wrapper.sh
Code Review
This pull request introduces a watchdog mechanism to the pulse-wrapper.sh script to prevent it from getting stuck on a hung opencode process. The implementation is sound and correctly addresses the issue of launchd not re-triggering the script if the previous invocation hasn't exited. My review includes a few suggestions to improve the robustness of the script by adhering more closely to the project's guidelines on error handling, specifically regarding the suppression of stderr. All comments align with the provided rules and are kept as is.
	# Watchdog loop: check every 60s if the process is still alive and within
	# the stale threshold. This replaces the bare `wait` that blocked the
	# wrapper indefinitely when opencode hung.
	while kill -0 "$opencode_pid" 2>/dev/null; do
To align with the project's general rules of not suppressing stderr unnecessarily, consider using `ps` to check for process existence. This avoids masking potential permission errors (EPERM) that `kill -0` with `2>/dev/null` would hide.
-	while kill -0 "$opencode_pid" 2>/dev/null; do
+	while ps -p "$opencode_pid" >/dev/null; do
References
- Avoid using `2>/dev/null` for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
	_kill_tree "$opencode_pid"
	sleep 2
	# Force kill if still alive
	if kill -0 "$opencode_pid" 2>/dev/null; then
Similar to the while loop condition, using `ps` here is more robust as it avoids suppressing potentially important errors that `kill -0` with `2>/dev/null` would hide.
-	if kill -0 "$opencode_pid" 2>/dev/null; then
+	if ps -p "$opencode_pid" >/dev/null; then
References
- Avoid using `2>/dev/null` for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
	echo "[pulse-wrapper] Corrupt sweep timestamp '${last_run}' — resetting" >>"$LOGFILE"
	last_run=0
	fi
	last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")
One of the general rules for this repository advises against using `2>/dev/null` on file operations when the file's existence has already been checked. Since line 1494 already verifies that `$QUALITY_SWEEP_LAST_RUN` is a file, redirecting stderr to `/dev/null` here could mask important issues like read permission errors. The `|| echo "0"` fallback is sufficient to handle cases where `cat` fails.
-	last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")
+	last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" || echo "0")
References
- Avoid using `2>/dev/null` to suppress errors on file operations if the file's existence has already been verified by a preceding check (e.g., `[[ -f "$file" ]]` or an early return). This practice is redundant for 'file not found' errors and can mask other important issues like permissions problems.



Summary
- Replace bare `wait $opencode_pid` in `run_pulse()` with a watchdog loop that polls every 60s and kills the process tree if it exceeds `PULSE_STALE_THRESHOLD` (30min)
Root Cause
The `marcusquinn` pulse runner was stuck for 28+ hours. The design assumed launchd `StartInterval` would fire new invocations every 120s, and `check_dedup()` in the new invocation would detect and kill stale processes. But launchd `StartInterval` only fires when the previous invocation has exited, and the wrapper was blocked on `wait $opencode_pid`, so it never exited and no new invocation ever started.
The `opencode run` process hung after starting the pulse workflow: the LLM session started but never produced output or completed. With no watchdog, the wrapper waited indefinitely.
Fix
The watchdog is now internal to `run_pulse()`: the same process that spawned opencode checks every 60s whether the child has exceeded the stale threshold. If so, it kills the process tree (SIGTERM then SIGKILL) and continues to the quality sweep and health issue phases. `check_dedup()` remains as a secondary safety net for edge cases where the wrapper itself gets stuck.
Portable across bash 3.2+ (macOS default), with no `wait -n -t` dependency.
Testing
- `while kill -0` loop exits when the process dies naturally OR when elapsed > threshold
Summary by CodeRabbit