t1397: fix pulse-wrapper stuck process — add watchdog timeout by marcusquinn · Pull Request #2853 · marcusquinn/aidevops

marcusquinn · 2026-03-05T01:19:35Z

Summary

Replace bare wait $opencode_pid in run_pulse() with a watchdog loop that polls every 60s and kills the process tree if it exceeds PULSE_STALE_THRESHOLD (30min)
Simplify quality sweep timestamp validation — strip non-numeric chars and default to 0

Root Cause

The marcusquinn pulse runner was stuck for 28+ hours. The design assumed launchd StartInterval would fire new invocations every 120s, and check_dedup() in the new invocation would detect and kill stale processes. But launchd StartInterval only fires when the previous invocation has exited — and the wrapper was blocked on wait $opencode_pid, so it never exited, and no new invocation ever started.

The opencode run process hung after starting the pulse workflow — the LLM session started but never produced output or completed. With no watchdog, the wrapper waited indefinitely.

Fix

The watchdog is now internal to run_pulse() — the same process that spawned opencode checks every 60s whether the child has exceeded the stale threshold. If so, it kills the process tree (SIGTERM then SIGKILL) and continues to the quality sweep and health issue phases.

check_dedup() remains as a secondary safety net for edge cases where the wrapper itself gets stuck.

Portable across bash 3.2+ (macOS default) — no wait -n -t dependency.

Testing

ShellCheck clean
Logic verified: the while kill -0 loop exits when the process dies naturally OR when elapsed > threshold
The stuck processes (PIDs 56504/56743/56744) need manual kill after merge + deploy

Summary by CodeRabbit

Bug Fixes
- Improved handling of stalled processes to prevent indefinite hangs
- Enhanced process monitoring and automatic cleanup for better system stability
- Fixed timestamp handling to be more robust and reliable

The bare `wait $opencode_pid` blocked the wrapper indefinitely when opencode hung. Since launchd StartInterval only fires when the previous invocation exits, the stale-process check in check_dedup() never ran — the wrapper was still alive, just blocked on wait. Replace with a watchdog loop that polls every 60s and kills the process tree if it exceeds PULSE_STALE_THRESHOLD (30min). Portable across bash 3.2+ (macOS default). Also simplify quality sweep timestamp validation — strip non-numeric chars and default to 0, replacing the verbose regex check.

coderabbitai · 2026-03-05T01:19:50Z

Walkthrough

Convert opencode invocation from blocking wait to background execution with internal watchdog monitoring. Kill process tree when exceeding PULSE_STALE_THRESHOLD via 60-second polling loop. Enhance quality sweep timestamp handling with robust sanitization. Update related documentation.

Changes

Cohort / File(s)	Summary
Internal Watchdog & Process Management `.agents/scripts/pulse-wrapper.sh`	Replaced blocking wait with background execution plus polling loop; monitors elapsed time and kills process tree if PULSE_STALE_THRESHOLD exceeded. Converts opencode invocation to background run with PID tracking via PIDFILE. Added 60-second sleep-based loop for liveness and staleness checks with force-kill fallback.
Quality Sweep Robustness `.agents/scripts/pulse-wrapper.sh`	Enhanced timestamp handling with `2>/dev/null` error suppression and numeric sanitization; strips non-numeric characters and defaults to 0 if invalid. Fixes regression where sweep was posting unconditionally regardless of metric stability.
Documentation & Clarification `.agents/scripts/pulse-wrapper.sh`	Updated comments and section headers to reflect internal watchdog behavior. Clarified dedup lifecycle as secondary safety net mechanism and adjusted pre-fetch/pulse flow documentation to align with new background execution model.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

PR #431: Modifies background process management and PID recording/disowning for pulse-era subprocess handling
PR #1485: Introduces PID-based liveness checks and adjusts pulse dispatch lifecycle with stale-lock/process cleanup logic

Suggested labels

bug

Poem

🐕 The watchdog now stands guard with care,
Sixty seconds of polling through the air,
No more unconditional sweeps in the night—
Stale pulses fall, metrics shine bright! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Linked Issues check	⚠️ Warning	The PR addresses issue `#2851` by fixing the stuck pulse-wrapper process with an internal watchdog, but the linked issue also requires auditing and fixing the sweep logic regression to prevent unconditional CodeRabbit mentions—this coding requirement is not evident in the changeset.	Review the sweep logic in pulse-wrapper.sh to verify that conditional posting for `@coderabbitai` mentions is correctly implemented and only triggers when Quality Gate fails or issues increase by +10, as specified in `#2851`.
Out of Scope Changes check	❓ Inconclusive	The changes focus on the watchdog timeout implementation and timestamp validation, which are in scope for fixing the stuck process issue, but do not comprehensively address the sweep logic regression mentioned in linked issue `#2851`.	Clarify whether the sweep logic regression fix is part of this PR or deferred to a separate issue, to ensure all `#2851` requirements are properly scoped and addressed.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately reflects the primary change: adding a watchdog timeout mechanism to fix stuck pulse-wrapper processes, which is the main objective of this PR.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch bugfix/pulse-wrapper-stuck

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist · 2026-03-05T01:19:52Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where opencode processes managed by the pulse-wrapper script could become stuck indefinitely, preventing subsequent scheduled runs. By introducing an internal watchdog mechanism, the pulse-wrapper now actively monitors and terminates unresponsive opencode processes, ensuring system stability and continuous operation. Additionally, the change refines the quality sweep timestamp validation for improved reliability.

Highlights

Introduced Internal Watchdog: Implemented a watchdog loop within run_pulse() to monitor opencode processes, ensuring they are terminated if they exceed PULSE_STALE_THRESHOLD (30 minutes).
Resolved launchd Stalling Issue: Corrected the pulse-wrapper's behavior by replacing a blocking wait command with the internal watchdog, allowing the wrapper to exit and launchd to schedule new invocations.
Improved Timestamp Validation: Simplified the logic for validating quality sweep timestamps by stripping non-numeric characters and defaulting to zero, enhancing robustness.

Changelog

.agents/scripts/pulse-wrapper.sh
- Updated header comments to describe the new internal watchdog and its role in preventing stuck processes.
- Modified the run_pulse function's documentation to reflect the new watchdog implementation and clarify the launchd interaction.
- Replaced the wait $opencode_pid call with a while kill -0 watchdog loop that checks process status and elapsed time, killing the process tree if the stale threshold is exceeded.
- Adjusted the quality sweep timestamp validation logic to strip non-numeric characters and set a default of 0, simplifying the check.

Activity

The author identified a root cause where the pulse-wrapper blocked on wait $opencode_pid, preventing launchd from firing new invocations.
The proposed fix involves an internal watchdog in run_pulse() to kill stale processes, ensuring the wrapper exits.
Testing included ShellCheck validation and logic verification for the new watchdog loop.
The pull request closes issue fix: pulse-wrapper.sh CodeRabbit sweep still posting unconditionally (regression after t1390/t1392) #2851.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

github-actions · 2026-03-05T01:20:02Z

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 107 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Thu Mar 5 01:19:58 UTC 2026: Code review monitoring started
Thu Mar 5 01:19:59 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 107

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 107
VULNERABILITIES: 0

Generated on: Thu Mar 5 01:20:01 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

github-actions · 2026-03-05T01:20:12Z

Missing issue link. This PR has task ID t1397 with a matching open issue #0, but the PR body doesn't contain a closing keyword.

Add Closes #0 to the PR description so GitHub auto-links and auto-closes the issue on merge.

sonarqubecloud · 2026-03-05T01:21:02Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

alex-solovyev · 2026-03-05T01:24:13Z

Pulse triage: PR is mergeable (author: marcusquinn, maintainer). CodeRabbit review is in-progress — will merge once CodeRabbit posts its final review. All other checks pass (SonarCloud, Codacy, qlty, Socket).

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.agents/scripts/pulse-wrapper.sh (1)
1496-1503: ⚠️ Potential issue | 🟠 Major

Timestamp sanitization via digit-stripping can corrupt the elapsed-time guard, causing sweep suppression.

The parameter expansion ${last_run//[^0-9]/} (line 1499) strips all non-digits and concatenates the remainder. When the timestamp file contains newlines or whitespace between digits (e.g., 1700000000\n9999999999), it produces 17000000009999999999, which invalidates the elapsed calculation and breaks the interval guard.

Implement strict validation: accept only pure epoch integers and reject future-dated or corrupted values.
Proposed fix
-		last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")
-		# Strip whitespace/newlines and validate integer (t1397)
-		last_run="${last_run//[^0-9]/}"
-		last_run="${last_run:-0}"
-		local now
-		now=$(date +%s)
+		last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")
+		local now
+		now=$(date +%s)
+		# Accept only a pure epoch integer; reset invalid/future values.
+		if ! [[ "$last_run" =~ ^[0-9]+$ ]] || [[ "$last_run" -gt "$now" ]]; then
+			echo "[pulse-wrapper] Invalid quality sweep timestamp '${last_run}' — resetting to 0" >>"$LOGFILE"
+			last_run=0
+		fi
This ensures reliability as required by automation script guidelines and prevents long-term suppression of the quality sweep due to corrupted timestamps.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/pulse-wrapper.sh around lines 1496 - 1503, The timestamp
sanitization currently using last_run="${last_run//[^0-9]/}" can produce
corrupted multi-timestamp concatenations; change the logic around
QUALITY_SWEEP_LAST_RUN/last_run to be strict: read only the first token/line,
trim whitespace, validate that it matches the regex ^[0-9]+$ (pure digits),
parse it as an integer and check it is <= current epoch (now=$(date +%s)); if
validation fails or it is future-dated, reset last_run to 0 so elapsed
calculation uses a safe value; ensure the elapsed calculation and the guard
against QUALITY_SWEEP_INTERVAL use this validated last_run variable.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In @.agents/scripts/pulse-wrapper.sh:
- Around line 1496-1503: The timestamp sanitization currently using
last_run="${last_run//[^0-9]/}" can produce corrupted multi-timestamp
concatenations; change the logic around QUALITY_SWEEP_LAST_RUN/last_run to be
strict: read only the first token/line, trim whitespace, validate that it
matches the regex ^[0-9]+$ (pure digits), parse it as an integer and check it is
<= current epoch (now=$(date +%s)); if validation fails or it is future-dated,
reset last_run to 0 so elapsed calculation uses a safe value; ensure the elapsed
calculation and the guard against QUALITY_SWEEP_INTERVAL use this validated
last_run variable.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1bb1ab44-f0ee-4c09-b895-6d2800bca53b

📥 Commits

Reviewing files that changed from the base of the PR and between 7160f7b and 3b20c5b.

📒 Files selected for processing (1)

.agents/scripts/pulse-wrapper.sh

gemini-code-assist

Code Review

This pull request introduces a watchdog mechanism to the pulse-wrapper.sh script to prevent it from getting stuck on a hung opencode process. The implementation is sound and correctly addresses the issue of launchd not re-triggering the script if the previous invocation hasn't exited. My review includes a few suggestions to improve the robustness of the script by adhering more closely to the project's guidelines on error handling, specifically regarding the suppression of stderr. All comments align with the provided rules and are kept as is.

gemini-code-assist · 2026-03-05T01:31:55Z

.agents/scripts/pulse-wrapper.sh

+	# Watchdog loop: check every 60s if the process is still alive and within
+	# the stale threshold. This replaces the bare `wait` that blocked the
+	# wrapper indefinitely when opencode hung.
+	while kill -0 "$opencode_pid" 2>/dev/null; do


To align with the project's general rules of not suppressing stderr unnecessarily, consider using ps to check for process existence. This avoids masking potential permission errors (EPERM) that kill -0 with 2>/dev/null would hide.

Suggested change

while kill -0 "$opencode_pid" 2>/dev/null; do

while ps -p "$opencode_pid" >/dev/null; do

References

Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

gemini-code-assist · 2026-03-05T01:31:55Z

.agents/scripts/pulse-wrapper.sh

+			_kill_tree "$opencode_pid"
+			sleep 2
+			# Force kill if still alive
+			if kill -0 "$opencode_pid" 2>/dev/null; then


Similar to the while loop condition, using ps here is more robust as it avoids suppressing potentially important errors that kill -0 with 2>/dev/null would hide.

Suggested change

if kill -0 "$opencode_pid" 2>/dev/null; then

if ps -p "$opencode_pid" >/dev/null; then

References

Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

gemini-code-assist · 2026-03-05T01:31:55Z

.agents/scripts/pulse-wrapper.sh

-			echo "[pulse-wrapper] Corrupt sweep timestamp '${last_run}' — resetting" >>"$LOGFILE"
-			last_run=0
-		fi
+		last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")


One of the general rules for this repository advises against using 2>/dev/null on file operations when the file's existence has already been checked. Since line 1494 already verifies that $QUALITY_SWEEP_LAST_RUN is a file, redirecting stderr to /dev/null here could mask important issues like read permission errors. The || echo "0" fallback is sufficient to handle cases where cat fails.

Suggested change

last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null || echo "0")

last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" || echo "0")

References

Avoid using 2>/dev/null to suppress errors on file operations if the file's existence has already been verified by a preceding check (e.g., [[ -f "$file" ]] or an early return). This practice is redundant for 'file not found' errors and can mask other important issues like permissions problems.

github-actions bot added the bug Auto-created from TODO.md tag label Mar 5, 2026

alex-solovyev mentioned this pull request Mar 5, 2026

[Supervisor:alex-solovyev] 3 PRs, 1 assigned, 0 workers at 23:45 UTC #2646

Open

alex-solovyev mentioned this pull request Mar 5, 2026

t2851: fix CodeRabbit sweep unconditional posting in pulse-wrapper.sh #2852

Merged

coderabbitai bot reviewed Mar 5, 2026

View reviewed changes

coderabbitai bot approved these changes Mar 5, 2026

View reviewed changes

alex-solovyev merged commit 61fcb8f into main Mar 5, 2026
33 of 35 checks passed

alex-solovyev deleted the bugfix/pulse-wrapper-stuck branch March 5, 2026 01:25

gemini-code-assist bot reviewed Mar 5, 2026

View reviewed changes

coderabbitai bot mentioned this pull request Mar 5, 2026

t1398: Fix runaway memory consumption — process guards, ShellCheck limits, session awareness #2855

Merged

5 tasks

marcusquinn mentioned this pull request Mar 5, 2026

t1400: fix: pulse-wrapper.sh 2>/dev/null suppressions #2857

Closed

alex-solovyev mentioned this pull request Mar 5, 2026

t2857: Replace kill -0 with ps -p and remove redundant 2>/dev/null #2873

Closed

This was referenced Mar 5, 2026

t2893: Address quality-debt review feedback on pulse-wrapper.sh #2910

Merged

t1425, t1426, t1427: Fix pulse-wrapper zombie accumulation and blocking #4020

Merged

t1429: Separate stats from pulse to prevent rate-limit blocking #4094

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

t1397: fix pulse-wrapper stuck process — add watchdog timeout#2853

t1397: fix pulse-wrapper stuck process — add watchdog timeout#2853
alex-solovyev merged 1 commit intomainfrom
bugfix/pulse-wrapper-stuck

marcusquinn commented Mar 5, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 5, 2026 •

edited

Loading

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

gemini-code-assist bot commented Mar 5, 2026

Uh oh!

github-actions bot commented Mar 5, 2026

Uh oh!

github-actions bot commented Mar 5, 2026

Uh oh!

sonarqubecloud bot commented Mar 5, 2026

Uh oh!

alex-solovyev commented Mar 5, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Mar 5, 2026

Uh oh!

gemini-code-assist bot Mar 5, 2026

Uh oh!

gemini-code-assist bot Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	while kill -0 "$opencode_pid" 2>/dev/null; do
	while ps -p "$opencode_pid" >/dev/null; do

	if kill -0 "$opencode_pid" 2>/dev/null; then
	if ps -p "$opencode_pid" >/dev/null; then

	last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" 2>/dev/null \|\| echo "0")
	last_run=$(cat "$QUALITY_SWEEP_LAST_RUN" \|\| echo "0")

Conversation

marcusquinn commented Mar 5, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Fix

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

gemini-code-assist bot commented Mar 5, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

github-actions bot commented Mar 5, 2026

🔍 Code Quality Report

📈 Current Quality Metrics

Uh oh!

github-actions bot commented Mar 5, 2026

Uh oh!

sonarqubecloud bot commented Mar 5, 2026

Quality Gate passed

Uh oh!

alex-solovyev commented Mar 5, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

marcusquinn commented Mar 5, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 5, 2026 •

edited

Loading