Skip to content

t1397: fix pulse-wrapper stuck process — add watchdog timeout#2853

Merged
alex-solovyev merged 1 commit intomainfrom
bugfix/pulse-wrapper-stuck
Mar 5, 2026
Merged

t1397: fix pulse-wrapper stuck process — add watchdog timeout#2853
alex-solovyev merged 1 commit intomainfrom
bugfix/pulse-wrapper-stuck

Conversation

@marcusquinn
Copy link
Owner

@marcusquinn marcusquinn commented Mar 5, 2026

Summary

  • Replace bare wait $opencode_pid in run_pulse() with a watchdog loop that polls every 60s and kills the process tree if it exceeds PULSE_STALE_THRESHOLD (30min)
  • Simplify quality sweep timestamp validation — strip non-numeric chars and default to 0

Root Cause

The marcusquinn pulse runner was stuck for 28+ hours. The design assumed launchd StartInterval would fire new invocations every 120s, and check_dedup() in the new invocation would detect and kill stale processes. But launchd StartInterval only fires when the previous invocation has exited — and the wrapper was blocked on wait $opencode_pid, so it never exited, and no new invocation ever started.

The opencode run process hung after starting the pulse workflow — the LLM session started but never produced output or completed. With no watchdog, the wrapper waited indefinitely.

Fix

The watchdog is now internal to run_pulse() — the same process that spawned opencode checks every 60s whether the child has exceeded the stale threshold. If so, it kills the process tree (SIGTERM then SIGKILL) and continues to the quality sweep and health issue phases.

check_dedup() remains as a secondary safety net for edge cases where the wrapper itself gets stuck.

Portable across bash 3.2+ (macOS default) — no wait -n -t dependency.

Testing

  • ShellCheck clean
  • Logic verified: the while kill -0 loop exits when the process dies naturally OR when elapsed > threshold
  • The stuck processes (PIDs 56504/56743/56744) need manual kill after merge + deploy

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of stalled processes to prevent indefinite hangs
    • Enhanced process monitoring and automatic cleanup for better system stability
    • Fixed timestamp handling to be more robust and reliable

The bare `wait $opencode_pid` blocked the wrapper indefinitely when
opencode hung. Since launchd StartInterval only fires when the previous
invocation exits, the stale-process check in check_dedup() never ran —
the wrapper was still alive, just blocked on wait.

Replace with a watchdog loop that polls every 60s and kills the process
tree if it exceeds PULSE_STALE_THRESHOLD (30min). Portable across bash
3.2+ (macOS default).

Also simplify quality sweep timestamp validation — strip non-numeric
chars and default to 0, replacing the verbose regex check.
@github-actions github-actions bot added the bug Auto-created from TODO.md tag label Mar 5, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 5, 2026

Walkthrough

Convert opencode invocation from blocking wait to background execution with internal watchdog monitoring. Kill process tree when exceeding PULSE_STALE_THRESHOLD via 60-second polling loop. Enhance quality sweep timestamp handling with robust sanitization. Update related documentation.

Changes

Cohort / File(s) Summary
Internal Watchdog & Process Management
.agents/scripts/pulse-wrapper.sh
Replaced blocking wait with background execution plus polling loop; monitors elapsed time and kills process tree if PULSE_STALE_THRESHOLD exceeded. Converts opencode invocation to background run with PID tracking via PIDFILE. Added 60-second sleep-based loop for liveness and staleness checks with force-kill fallback.
Quality Sweep Robustness
.agents/scripts/pulse-wrapper.sh
Enhanced timestamp handling with 2>/dev/null error suppression and numeric sanitization; strips non-numeric characters and defaults to 0 if invalid. Fixes regression where sweep was posting unconditionally regardless of metric stability.
Documentation & Clarification
.agents/scripts/pulse-wrapper.sh
Updated comments and section headers to reflect internal watchdog behavior. Clarified dedup lifecycle as secondary safety net mechanism and adjusted pre-fetch/pulse flow documentation to align with new background execution model.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PR #431: Modifies background process management and PID recording/disowning for pulse-era subprocess handling
  • PR #1485: Introduces PID-based liveness checks and adjusts pulse dispatch lifecycle with stale-lock/process cleanup logic

Suggested labels

bug

Poem

🐕 The watchdog now stands guard with care,
Sixty seconds of polling through the air,
No more unconditional sweeps in the night—
Stale pulses fall, metrics shine bright! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR addresses issue #2851 by fixing the stuck pulse-wrapper process with an internal watchdog, but the linked issue also requires auditing and fixing the sweep logic regression to prevent unconditional CodeRabbit mentions—this coding requirement is not evident in the changeset. Review the sweep logic in pulse-wrapper.sh to verify that conditional posting for @coderabbitai mentions is correctly implemented and only triggers when Quality Gate fails or issues increase by +10, as specified in #2851.
Out of Scope Changes check ❓ Inconclusive The changes focus on the watchdog timeout implementation and timestamp validation, which are in scope for fixing the stuck process issue, but do not comprehensively address the sweep logic regression mentioned in linked issue #2851. Clarify whether the sweep logic regression fix is part of this PR or deferred to a separate issue, to ensure all #2851 requirements are properly scoped and addressed.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the primary change: adding a watchdog timeout mechanism to fix stuck pulse-wrapper processes, which is the main objective of this PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch bugfix/pulse-wrapper-stuck

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where opencode processes managed by the pulse-wrapper script could become stuck indefinitely, preventing subsequent scheduled runs. By introducing an internal watchdog mechanism, the pulse-wrapper now actively monitors and terminates unresponsive opencode processes, ensuring system stability and continuous operation. Additionally, the change refines the quality sweep timestamp validation for improved reliability.

Highlights

  • Introduced Internal Watchdog: Implemented a watchdog loop within run_pulse() to monitor opencode processes, ensuring they are terminated if they exceed PULSE_STALE_THRESHOLD (30 minutes).
  • Resolved launchd Stalling Issue: Corrected the pulse-wrapper's behavior by replacing a blocking wait command with the internal watchdog, allowing the wrapper to exit and launchd to schedule new invocations.
  • Improved Timestamp Validation: Simplified the logic for validating quality sweep timestamps by stripping non-numeric characters and defaulting to zero, enhancing robustness.
Changelog
  • .agents/scripts/pulse-wrapper.sh
    • Updated header comments to describe the new internal watchdog and its role in preventing stuck processes.
    • Modified the run_pulse function's documentation to reflect the new watchdog implementation and clarify the launchd interaction.
    • Replaced the wait $opencode_pid call with a while kill -0 watchdog loop that checks process status and elapsed time, killing the process tree if the stale threshold is exceeded.
    • Adjusted the quality sweep timestamp validation logic to strip non-numeric characters and set a default of 0, simplifying the check.
Activity
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions
Copy link

github-actions bot commented Mar 5, 2026

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 107 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Thu Mar 5 01:19:58 UTC 2026: Code review monitoring started
Thu Mar 5 01:19:59 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 107

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 107
  • VULNERABILITIES: 0

Generated on: Thu Mar 5 01:20:01 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@github-actions
Copy link

github-actions bot commented Mar 5, 2026

Missing issue link. This PR has task ID t1397 with a matching open issue #0, but the PR body doesn't contain a closing keyword.

Add Closes #0 to the PR description so GitHub auto-links and auto-closes the issue on merge.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Mar 5, 2026

@alex-solovyev
Copy link
Collaborator

Pulse triage: PR is mergeable (author: marcusquinn, maintainer). CodeRabbit review is in-progress — will merge once CodeRabbit posts its final review. All other checks pass (SonarCloud, Codacy, qlty, Socket).

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.agents/scripts/pulse-wrapper.sh (1)

1496-1503: ⚠️ Potential issue | 🟠 Major

Timestamp sanitization via digit-stripping can corrupt the elapsed-time guard, causing sweep suppression.

The parameter expansion ${last_run//[^0-9]/} (line 1499) strips all non-digits and concatenates the remainder. When the timestamp file contains newlines or whitespace between digits (e.g., 1700000000\n9999999999), it produces 17000000009999999999, which invalidates the elapsed calculation and breaks the interval guard.

Implement strict validation: accept only pure epoch integers and reject future-dated or corrupted values.

Proposed fix
-		last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")
-		# Strip whitespace/newlines and validate integer (t1397)
-		last_run="${last_run//[^0-9]/}"
-		last_run="${last_run:-0}"
-		local now
-		now=$(date +%s)
+		last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")
+		local now
+		now=$(date +%s)
+		# Accept only a pure epoch integer; reset invalid/future values.
+		if ! [[ "$last_run" =~ ^[0-9]+$ ]] || [[ "$last_run" -gt "$now" ]]; then
+			echo "[pulse-wrapper] Invalid quality sweep timestamp '${last_run}' — resetting to 0" >>"$LOGFILE"
+			last_run=0
+		fi

This ensures reliability as required by automation script guidelines and prevents long-term suppression of the quality sweep due to corrupted timestamps.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/pulse-wrapper.sh around lines 1496 - 1503, The timestamp
sanitization currently using last_run="${last_run//[^0-9]/}" can produce
corrupted multi-timestamp concatenations; change the logic around
QUALITY_SWEEP_LAST_RUN/last_run to be strict: read only the first token/line,
trim whitespace, validate that it matches the regex ^[0-9]+$ (pure digits),
parse it as an integer and check it is <= current epoch (now=$(date +%s)); if
validation fails or it is future-dated, reset last_run to 0 so elapsed
calculation uses a safe value; ensure the elapsed calculation and the guard
against QUALITY_SWEEP_INTERVAL use this validated last_run variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In @.agents/scripts/pulse-wrapper.sh:
- Around line 1496-1503: The timestamp sanitization currently using
last_run="${last_run//[^0-9]/}" can produce corrupted multi-timestamp
concatenations; change the logic around QUALITY_SWEEP_LAST_RUN/last_run to be
strict: read only the first token/line, trim whitespace, validate that it
matches the regex ^[0-9]+$ (pure digits), parse it as an integer and check it is
<= current epoch (now=$(date +%s)); if validation fails or it is future-dated,
reset last_run to 0 so elapsed calculation uses a safe value; ensure the elapsed
calculation and the guard against QUALITY_SWEEP_INTERVAL use this validated
last_run variable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1bb1ab44-f0ee-4c09-b895-6d2800bca53b

📥 Commits

Reviewing files that changed from the base of the PR and between 7160f7b and 3b20c5b.

📒 Files selected for processing (1)
  • .agents/scripts/pulse-wrapper.sh

@alex-solovyev alex-solovyev merged commit 61fcb8f into main Mar 5, 2026
33 of 35 checks passed
@alex-solovyev alex-solovyev deleted the bugfix/pulse-wrapper-stuck branch March 5, 2026 01:25
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a watchdog mechanism to the pulse-wrapper.sh script to prevent it from getting stuck on a hung opencode process. The implementation is sound and correctly addresses the issue of launchd not re-triggering the script if the previous invocation hasn't exited. My review includes a few suggestions to improve the robustness of the script by adhering more closely to the project's guidelines on error handling, specifically regarding the suppression of stderr. All comments align with the provided rules and are kept as is.

# Watchdog loop: check every 60s if the process is still alive and within
# the stale threshold. This replaces the bare `wait` that blocked the
# wrapper indefinitely when opencode hung.
while kill -0 "$opencode_pid" 2>/dev/null; do

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To align with the project's general rules of not suppressing stderr unnecessarily, consider using ps to check for process existence. This avoids masking potential permission errors (EPERM) that kill -0 with 2>/dev/null would hide.

Suggested change
while kill -0 "$opencode_pid" 2>/dev/null; do
while ps -p "$opencode_pid" >/dev/null; do
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

_kill_tree "$opencode_pid"
sleep 2
# Force kill if still alive
if kill -0 "$opencode_pid" 2>/dev/null; then

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the while loop condition, using ps here is more robust as it avoids suppressing potentially important errors that kill -0 with 2>/dev/null would hide.

Suggested change
if kill -0 "$opencode_pid" 2>/dev/null; then
if ps -p "$opencode_pid" >/dev/null; then
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

echo "[pulse-wrapper] Corrupt sweep timestamp '${last_run}' — resetting" >>"$LOGFILE"
last_run=0
fi
last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

One of the general rules for this repository advises against using 2>/dev/null on file operations when the file's existence has already been checked. Since line 1494 already verifies that $QUALITY_SWEEP_LAST_RUN is a file, redirecting stderr to /dev/null here could mask important issues like read permission errors. The || echo "0" fallback is sufficient to handle cases where cat fails.

Suggested change
last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")
last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" || echo "0")
References
  1. Avoid using 2>/dev/null to suppress errors on file operations if the file's existence has already been verified by a preceding check (e.g., [[ -f "$file" ]] or an early return). This practice is redundant for 'file not found' errors and can mask other important issues like permissions problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Auto-created from TODO.md tag

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants