t1482: fix pulse PID self-clash blocks dispatch when prefetch hangs#4575
alex-solovyev merged 3 commits into main
Conversation
Two systemic bugs caused the pulse to stop dispatching workers:

1. **PID self-clash:** `main()` wrote the wrapper's own PID (`$$`) to the PID file during setup. When `prefetch_state` hung and the wrapper was killed by launchd, the next invocation saw the PID file with a live wrapper PID and blocked via `check_dedup`. Fix: use a `SETUP:$$` sentinel during the setup phase; `check_dedup` treats `SETUP:` PIDs as non-blocking, since the instance lock already prevents true concurrency. Also adds self-PID detection as a safety net.
2. **Prefetch sub-helper hangs:** sequential helpers called after the parallel `gh` fetches (`pr-salvage-helper`, `gh-failure-miner-helper`) could hang indefinitely on `gh` API calls, blocking the entire `prefetch_state` stage and preventing `run_pulse()` from ever being reached. Fix: wrap slow sub-helpers with `run_cmd_with_timeout` (90-120s per helper). Timeouts are non-fatal: the pulse proceeds with degraded state rather than blocking forever. Also adds a 120s hard timeout on the parallel `gh` fetch wait loop.
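A rough, self-contained sketch of what such a per-helper timeout wrapper can look like in bash. The function name matches the PR, but every detail below is an assumption rather than the PR's actual implementation; `setsid` (util-linux) makes the sketch Linux-oriented:

```shell
# Hedged sketch of a per-command timeout helper modeled on the PR's
# run_cmd_with_timeout(). Details are assumptions, not the real code.
run_cmd_with_timeout() {
  local timeout_s="$1"; shift
  setsid "$@" &                        # child becomes its own process group
  local pid=$!
  (                                    # watchdog: TERM, then KILL, the group
    sleep "$timeout_s"
    kill -TERM -- "-$pid" 2>/dev/null
    sleep 1
    kill -KILL -- "-$pid" 2>/dev/null
  ) &
  local watchdog=$!
  local rc=0
  wait "$pid" || rc=$?
  kill "$watchdog" 2>/dev/null || true # cancel watchdog if command finished
  wait "$watchdog" 2>/dev/null || true
  if [ "$rc" -ge 128 ]; then           # died by signal: report as timeout
    return 124                         # GNU timeout's exit-code convention
  fi
  return "$rc"
}
```

One caveat of this sketch: any signal death of the wrapped command is reported as 124, so a command killed for unrelated reasons is indistinguishable from a timeout.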
Walkthrough

Adds a `SETUP:PID` sentinel during wrapper startup, implements `run_cmd_with_timeout()` for per-call timeouts, integrates timeouts into prefetch sub-steps, and updates `check_dedup` / status logic to handle SETUP/IDLE sentinels, self-detection, and zombie/invalid PID cases.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Wrapper as "pulse-wrapper.sh"
    participant PIDFILE as "PID file"
    participant Prefetch as "prefetch helper(s) / gh"
    participant OS as "OS (processes)"
    Wrapper->>PIDFILE: write "SETUP:<PID>"
    Wrapper->>Prefetch: run_cmd_with_timeout(prefetch_hygiene, 120s)
    alt prefetch completes
        Prefetch-->>Wrapper: results
    else timeout (returns 124)
        Wrapper->>OS: kill process group of Prefetch (TERM/kill)
        OS-->>Wrapper: processes terminated
        Wrapper-->>PIDFILE: write "IDLE:*" or reset sentinel
    end
    Wrapper->>PIDFILE: write "<PID>" (normal runtime sentinel)
    Wrapper->>OS: spawn main runtime processes
```
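The dedup logic that consumes these sentinels can be sketched as follows. This is a hedged illustration only: `PIDFILE` and the `SETUP:`/`IDLE:` formats come from the PR text, while the exact branching is assumed:

```shell
# Sketch of sentinel-aware dedup. Returns 0 (proceed) or 1 (block).
check_dedup_sentinel() {
  local pid_content
  pid_content=$(cat "$PIDFILE" 2>/dev/null || true)
  case "$pid_content" in
    SETUP:*)
      local setup_pid="${pid_content#SETUP:}"
      if [ "$setup_pid" = "$$" ]; then
        return 0                       # we wrote this ourselves: proceed
      fi
      if ! kill -0 "$setup_pid" 2>/dev/null; then
        # stale sentinel left by a killed wrapper: reset it
        echo "IDLE:$(date -u +%Y-%m-%dT%H:%M:%SZ)" >"$PIDFILE"
      fi
      return 0  # SETUP never blocks; the instance lock prevents concurrency
      ;;
    IDLE:*|"")
      return 0
      ;;
    *)
      # plain numeric PID: block only while that process is alive
      if [[ "$pid_content" =~ ^[0-9]+$ ]] && kill -0 "$pid_content" 2>/dev/null; then
        return 1
      fi
      return 0
      ;;
  esac
}
```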
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Sat Mar 14 02:55:36 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Actionable comments posted: 3
🧹 Nitpick comments (1)
.agents/scripts/pulse-wrapper.sh (1)
1604-1607: Remove orphaned duplicate doc block

Lines 1604-1607 appear to be a leftover header fragment duplicated before `run_stage_with_timeout()`. Safe to delete for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/pulse-wrapper.sh around lines 1604 - 1607, Remove the orphaned duplicate doc block fragment that precedes the run_stage_with_timeout() function; locate the stray header comments describing exit codes (the duplicated 0/124/else lines) right before the run_stage_with_timeout() definition and delete that duplicate block so only the single, correct documentation remains adjacent to run_stage_with_timeout().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/pulse-wrapper.sh:
- Around line 2730-2734: Update the PIDFILE reader in pulse-session-helper.sh to
recognize the "SETUP:" and "IDLE:" sentinels written by pulse-wrapper.sh before
treating the file content as a numeric PID: read PIDFILE into a variable, check
for prefixes "SETUP:" and "IDLE:" and handle those cases (render setup/idle
status) instead of immediately calling ps -p on the raw content; alternatively,
if you prefer backward compatibility, expose the numeric PID in parallel (e.g.,
write a separate numeric PID file or append the raw PID after the sentinel) so
existing ps -p usage still works—adjust the code paths that currently call ps -p
"$pid" to first normalize or extract the numeric PID when present.
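A minimal sketch of the sentinel-aware reader this comment asks for. The function and status names are invented for illustration; only the `SETUP:`/`IDLE:` prefixes come from the PR:

```shell
# Hypothetical status reader for pulse-session-helper.sh that parses
# sentinels before treating PIDFILE content as a numeric PID.
pulse_status() {
  local content
  content=$(cat "$PIDFILE" 2>/dev/null || true)
  case "$content" in
    "")      echo "not-running" ;;
    SETUP:*) echo "setup" ;;        # wrapper pre-flight phase
    IDLE:*)  echo "idle" ;;         # last run finished cleanly
    *)
      if [[ "$content" =~ ^[0-9]+$ ]] && kill -0 "$content" 2>/dev/null; then
        echo "running"
      else
        echo "not-running"          # dead or invalid PID content
      fi
      ;;
  esac
}
```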
- Around line 525-541: The loop currently uses a blocking wait "$pid" which can
hang and prevent re-checking the 120s limit; replace the blocking wait with a
non-blocking polling loop for each PID (use the existing pids array and computed
wait_start/wait_elapsed) that repeatedly checks whether the PID is still alive
(e.g., kill -0 or wait -n semantics), updates wait_elapsed, and if elapsed > 120
logs to LOGFILE and kills remaining PIDs; ensure you preserve the existing log
message and the kill/cleanup behavior when timing out.
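The non-blocking polling loop described above might be sketched like this. Variable names and the 1-second poll interval are assumptions (the PR reportedly uses a 2-second interval and `_kill_tree` helpers); the zombie check uses `ps` because `kill -0` also succeeds on unreaped children:

```shell
# true if pid exists and is not a zombie awaiting reaping
_alive() {
  local st
  st=$(ps -o state= -p "$1" 2>/dev/null | tr -d ' ')
  [ -n "$st" ] && [ "${st#Z}" = "$st" ]
}

# Poll the given PIDs instead of a blocking `wait`, enforcing a cap.
wait_with_cap() {
  local cap="$1"; shift
  local pids=("$@") pid
  local start="$SECONDS"
  while :; do
    local survivors=()
    for pid in "${pids[@]}"; do
      if _alive "$pid"; then survivors+=("$pid"); fi
    done
    if [ "${#survivors[@]}" -eq 0 ]; then break; fi
    if [ $((SECONDS - start)) -gt "$cap" ]; then
      for pid in "${survivors[@]}"; do   # TERM, then KILL, the stragglers
        kill -TERM "$pid" 2>/dev/null || true
        kill -KILL "$pid" 2>/dev/null || true
      done
      break
    fi
    sleep 1
  done
  for pid in "${pids[@]}"; do
    wait "$pid" 2>/dev/null || true      # reap so no zombies linger
  done
}
```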
- Around line 312-331: The code currently kills whatever process holds setup_pid
without verifying it belongs to pulse-wrapper; change check_dedup so before
calling _kill_tree/_force_kill_tree you validate the PID refers to the expected
prior pulse-wrapper process by checking its commandline or executable and owner:
read /proc/"$setup_pid"/cmdline (or use ps -p "$setup_pid" -o args=) and ensure
it contains the pulse-wrapper identifier (e.g. "pulse-wrapper.sh" or the exact
argv used), and confirm the UID matches the expected user (compare
/proc/"$setup_pid"/status Uid or ps -o uid=); only if both match, proceed to log
and invoke _kill_tree/_force_kill_tree, otherwise log a warning and do not kill,
but still write the IDLE sentinel to "$PIDFILE" as appropriate.
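A sketch of the identity check this comment suggests, using only standard `ps` columns. The `pulse-wrapper.sh` marker is the one named in the comment; everything else is assumed:

```shell
# Return 0 only if the PID looks like a prior pulse-wrapper run owned by us.
is_our_wrapper() {
  local pid="$1" cmd uid
  cmd=$(ps -p "$pid" -o args= 2>/dev/null) || return 1   # dead PID
  [[ "$cmd" == *pulse-wrapper.sh* ]] || return 1         # marker is assumed
  uid=$(ps -p "$pid" -o uid= 2>/dev/null | tr -d ' ')
  [ "$uid" = "$(id -u)" ]                                # must be our user
}
```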
---
Nitpick comments:
In @.agents/scripts/pulse-wrapper.sh:
- Around line 1604-1607: Remove the orphaned duplicate doc block fragment that
precedes the run_stage_with_timeout() function; locate the stray header comments
describing exit codes (the duplicated 0/124/else lines) right before the
run_stage_with_timeout() definition and delete that duplicate block so only the
single, correct documentation remains adjacent to run_stage_with_timeout().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c066a1fe-f131-42b0-b758-fa364d2368a3
📒 Files selected for processing (1)
.agents/scripts/pulse-wrapper.sh
…tate: The parallel gh fetch wait loop used `wait $pid`, which blocks until the process exits. The elapsed-time check between waits never fires when a single wait hangs for minutes. Replace with `kill -0` polling (2s sleep interval) that reliably enforces the 120s timeout, matching the pattern already used by `run_cmd_with_timeout()`. Survivors are killed via `_kill_tree`/`_force_kill_tree` before reaping.
♻️ Duplicate comments (1)
.agents/scripts/pulse-wrapper.sh (1)
312-331: ⚠️ Potential issue | 🟠 Major

PID reuse safety concern remains unaddressed

The logic at lines 325-330 kills `setup_pid` without verifying it belongs to a `pulse-wrapper.sh` process. Since the instance lock is already held by the current process, a live process at `setup_pid` after passing `acquire_instance_lock()` indicates PID reuse occurred. The suggested hardening from the prior review (validating that the process command line contains `pulse-wrapper.sh` before killing) should be applied to prevent killing unrelated processes.

🛡️ Suggested hardening
```diff
 if [[ "$pid_content" == SETUP:* ]]; then
     local setup_pid="${pid_content#SETUP:}"
+    if ! [[ "$setup_pid" =~ ^[0-9]+$ ]]; then
+        echo "[pulse-wrapper] check_dedup: invalid SETUP sentinel '${pid_content}' — resetting to IDLE" >>"$LOGFILE"
+        echo "IDLE:$(date -u +%Y-%m-%dT%H:%M:%SZ)" >"$PIDFILE"
+        return 0
+    fi
     if [[ "$setup_pid" == "$$" ]]; then
         # We wrote this ourselves — proceed
         return 0
     fi
-    if ! ps -p "$setup_pid" >/dev/null 2>&1; then
+    local setup_cmd=""
+    setup_cmd=$(ps -p "$setup_pid" -o command= 2>/dev/null || echo "")
+    if [[ -z "$setup_cmd" ]]; then
         echo "[pulse-wrapper] check_dedup: SETUP wrapper $setup_pid is dead — proceeding" >>"$LOGFILE"
         echo "IDLE:$(date -u +%Y-%m-%dT%H:%M:%SZ)" >"$PIDFILE"
         return 0
     fi
+    if [[ "$setup_cmd" != *"pulse-wrapper.sh"* ]]; then
+        echo "[pulse-wrapper] check_dedup: SETUP PID $setup_pid belongs to non-wrapper process ('${setup_cmd:0:60}...'); resetting sentinel" >>"$LOGFILE"
+        echo "IDLE:$(date -u +%Y-%m-%dT%H:%M:%SZ)" >"$PIDFILE"
+        return 0
+    fi
     # SETUP wrapper is alive but we hold the instance lock — it's a zombie,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/pulse-wrapper.sh around lines 312 - 331, The current check_dedup flow kills the process at setup_pid without confirming it's an instance of pulse-wrapper.sh, risking killing an unrelated reused PID; before invoking _kill_tree or _force_kill_tree on setup_pid, probe the process cmdline (e.g., via ps -p "$setup_pid" -o args=) and verify it contains "pulse-wrapper.sh" (or another unique marker you use) and only then proceed with killing and logging; if the cmdline does not match, log a warning, treat it as PID reuse by an unrelated process, write the IDLE timestamp to PIDFILE and return without killing. Ensure the checks are applied in the same branch that currently calls _kill_tree and _force_kill_tree so you avoid terminating non-wrapper processes.
🧹 Nitpick comments (1)
.agents/scripts/pulse-wrapper.sh (1)
1627-1631: Orphaned documentation block should be removed

Lines 1627-1630 contain exit code documentation that appears between `run_cmd_with_timeout()` (ends at 1625) and `run_stage_with_timeout()` (starts at 1631). This duplicates the documentation that properly appears at lines 1639-1642. Likely a copy-paste remnant.

🧹 Proposed cleanup
```diff
     return $?
 }

-# 0    - stage completed successfully
-# 124  - stage timed out and was killed
-# else - stage exited with command exit code
-#######################################
 #######################################
 # Run a stage with a wall-clock timeout
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/pulse-wrapper.sh around lines 1627 - 1631, Remove the orphaned exit-code documentation block that lives between run_cmd_with_timeout() and run_stage_with_timeout() (the duplicate comment lines "0 - stage completed successfully", "124 - stage timed out and was killed", "else- stage exited with command exit code"); it is a copy-paste remnant and should be deleted so only the proper documentation block (the one at the later location) remains.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.agents/scripts/pulse-wrapper.sh:
- Around line 312-331: The current check_dedup flow kills the process at
setup_pid without confirming it's an instance of pulse-wrapper.sh, risking
killing an unrelated reused PID; before invoking _kill_tree or _force_kill_tree
on setup_pid, probe the process cmdline (e.g., via ps -p "$setup_pid" -o args=)
and verify it contains "pulse-wrapper.sh" (or another unique marker you use) and
only then proceed with killing and logging; if the cmdline does not match, log a
warning, treat it as PID reuse by an unrelated process, write the IDLE timestamp
to PIDFILE and return without killing. Ensure the checks are applied in the same
branch that currently calls _kill_tree and _force_kill_tree so you avoid
terminating non-wrapper processes.
---
Nitpick comments:
In @.agents/scripts/pulse-wrapper.sh:
- Around line 1627-1631: Remove the orphaned exit-code documentation block that
lives between run_cmd_with_timeout() and run_stage_with_timeout() (the duplicate
comment lines "0 - stage completed successfully", "124 - stage timed out and was
killed", "else- stage exited with command exit code"); it is a copy-paste
remnant and should be deleted so only the proper documentation block (the one at
the later location) remains.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4b09ce8f-9fad-4e9d-a38b-1c1ae33653c9
📒 Files selected for processing (1)
.agents/scripts/pulse-wrapper.sh
… sentinel-aware status check_dedup (GH#4575):
- Add numeric validation of SETUP sentinel PID before any process checks
- Validate cmdline contains 'pulse-wrapper.sh' via ps -p before killing; if PID was reused by an unrelated process, log warning and reset to IDLE instead of killing an innocent process

pulse-wrapper.sh:
- Remove orphaned duplicate doc block between run_cmd_with_timeout and run_stage_with_timeout (exit codes fragment from copy-paste)

pulse-session-helper.sh:
- is_pulse_running() now parses SETUP:/IDLE: sentinels before treating PIDFILE content as numeric PID — prevents ps -p failures on non-numeric content and correctly reports IDLE as not-running
- Status display distinguishes SETUP phase (yellow 'setup') from active pulse (green 'running')
🧹 Nitpick comments (1)
.agents/scripts/pulse-wrapper.sh (1)
624-626: Differentiate timeout vs generic failure in helper logs.

Each `|| { ... timed out ... }` branch currently logs a timeout for any non-zero return. That can mask real helper errors as timeouts and make incident triage noisy.

♻️ Suggested refactor
```diff
-    run_cmd_with_timeout 120 prefetch_hygiene >"$hygiene_tmp" 2>/dev/null || {
-        echo "[pulse-wrapper] prefetch_hygiene timed out after 120s (non-fatal)" >>"$LOGFILE"
-    }
+    local hygiene_rc=0
+    run_cmd_with_timeout 120 prefetch_hygiene >"$hygiene_tmp" 2>/dev/null || hygiene_rc=$?
+    if [[ "$hygiene_rc" -eq 124 ]]; then
+        echo "[pulse-wrapper] prefetch_hygiene timed out after 120s (non-fatal)" >>"$LOGFILE"
+    elif [[ "$hygiene_rc" -ne 0 ]]; then
+        echo "[pulse-wrapper] prefetch_hygiene failed with exit ${hygiene_rc} (non-fatal)" >>"$LOGFILE"
+    fi

-    run_cmd_with_timeout 90 prefetch_ci_failures >"$ci_tmp" 2>/dev/null || {
-        echo "[pulse-wrapper] prefetch_ci_failures timed out after 90s (non-fatal)" >>"$LOGFILE"
-    }
+    local ci_rc=0
+    run_cmd_with_timeout 90 prefetch_ci_failures >"$ci_tmp" 2>/dev/null || ci_rc=$?
+    if [[ "$ci_rc" -eq 124 ]]; then
+        echo "[pulse-wrapper] prefetch_ci_failures timed out after 90s (non-fatal)" >>"$LOGFILE"
+    elif [[ "$ci_rc" -ne 0 ]]; then
+        echo "[pulse-wrapper] prefetch_ci_failures failed with exit ${ci_rc} (non-fatal)" >>"$LOGFILE"
+    fi

-    run_cmd_with_timeout 90 prefetch_gh_failure_notifications >"$ghfail_tmp" 2>/dev/null || {
-        echo "[pulse-wrapper] prefetch_gh_failure_notifications timed out after 90s (non-fatal)" >>"$LOGFILE"
-    }
+    local ghfail_rc=0
+    run_cmd_with_timeout 90 prefetch_gh_failure_notifications >"$ghfail_tmp" 2>/dev/null || ghfail_rc=$?
+    if [[ "$ghfail_rc" -eq 124 ]]; then
+        echo "[pulse-wrapper] prefetch_gh_failure_notifications timed out after 90s (non-fatal)" >>"$LOGFILE"
+    elif [[ "$ghfail_rc" -ne 0 ]]; then
+        echo "[pulse-wrapper] prefetch_gh_failure_notifications failed with exit ${ghfail_rc} (non-fatal)" >>"$LOGFILE"
+    fi
```

As per coding guidelines "Automation scripts - focus on: Reliability and robustness" and "Clear logging and feedback".
Also applies to: 633-635, 651-653
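The discrimination pattern this comment suggests can be exercised stand-alone with GNU `timeout`, which uses the same exit-code 124 convention the PR's `run_cmd_with_timeout()` reportedly follows (the "helper" here is just `sleep` standing in for a hanging prefetch step):

```shell
# Distinguish timeout (124) from other non-zero exits before logging.
rc=0
timeout 1 sleep 10 || rc=$?   # stand-in for a hanging prefetch helper
if [ "$rc" -eq 124 ]; then
  echo "helper timed out (non-fatal)"
elif [ "$rc" -ne 0 ]; then
  echo "helper failed with exit $rc"
else
  echo "helper ok"
fi
# → prints "helper timed out (non-fatal)"
```

Capturing `rc=$?` immediately after the call, before any other command runs, is the important part; a later `$?` would reflect the wrong command.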
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/pulse-wrapper.sh around lines 624 - 626, The log currently treats any non-zero exit from run_cmd_with_timeout as a timeout; change the pattern used around the prefetch_hygiene invocation (and the similar blocks for lines referenced) to capture the command exit status into a variable (e.g., rc=$?), then discriminate timeout vs other failures by checking for the timeout exit code your timeout helper uses (commonly 124 for GNU timeout or the wrapper-defined code) and log either "[pulse-wrapper] prefetch_hygiene timed out after 120s (non-fatal)" when rc matches the timeout code, or "[pulse-wrapper] prefetch_hygiene failed with exit $rc" (and include any available stderr or context) for other non-zero rc values; apply the same change to the other occurrences (the blocks around prefetch_hygiene at the other noted locations) and ensure you capture rc immediately after run_cmd_with_timeout before any other command runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.agents/scripts/pulse-wrapper.sh:
- Around line 624-626: The log currently treats any non-zero exit from
run_cmd_with_timeout as a timeout; change the pattern used around the
prefetch_hygiene invocation (and the similar blocks for lines referenced) to
capture the command exit status into a variable (e.g., rc=$?), then discriminate
timeout vs other failures by checking for the timeout exit code your timeout
helper uses (commonly 124 for GNU timeout or the wrapper-defined code) and log
either "[pulse-wrapper] prefetch_hygiene timed out after 120s (non-fatal)" when
rc matches the timeout code, or "[pulse-wrapper] prefetch_hygiene failed with
exit $rc" (and include any available stderr or context) for other non-zero rc
values; apply the same change to the other occurrences (the blocks around
prefetch_hygiene at the other noted locations) and ensure you capture rc
immediately after run_cmd_with_timeout before any other command runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2322812d-f03b-4274-8358-478a4939da34
📒 Files selected for processing (2)
.agents/scripts/pulse-session-helper.sh.agents/scripts/pulse-wrapper.sh



Summary
- Fix `check_dedup` so it no longer blocks all future invocations when `prefetch_state` hung
- Ensure a hanging `gh` API call can't block the entire pulse indefinitely
- Add a hard timeout to the parallel `gh` fetch wait loop

Root Cause
The pulse stopped dispatching workers ~12.5 hours ago because:
- `main()` wrote `$$` (wrapper PID) to the PID file at line 2610
- `prefetch_state()` hung on `pr-salvage-helper.sh` doing sequential `gh api` calls across 8 repos
- `check_dedup()` saw the old wrapper PID as alive and blocked `run_pulse()`
- `check_dedup()` (PID self-clash fix):
  - `SETUP:PID` sentinel format is treated as non-blocking (instance lock prevents true concurrency)
  - Self-PID detection (`$$ == old_pid`) as a safety net
- `main()` (PID file format):
  - Writes `SETUP:$$` during pre-flight stages instead of plain `$$`
  - `run_pulse()` still overwrites with the plain opencode PID for watchdog tracking
- `prefetch_state()` (sub-helper timeouts):
  - `prefetch_hygiene` (includes pr-salvage): 120s timeout
  - `prefetch_ci_failures`: 90s timeout
  - `prefetch_gh_failure_notifications`: 90s timeout
  - `gh` fetch wait loop: 120s hard timeout
- `run_cmd_with_timeout()` (new utility):
  - Uses `_kill_tree` for clean process group cleanup

Self-healing
These fixes are systemic: they activate on the next `aidevops update` for all users. No manual intervention needed. The wrapper self-heals by:

Closes #4576
Testing
(`IDLE:*` and numeric PIDs still handled)