[ATL-746] fix(daemon): reap orphaned subprocesses reparented to PID 1#33424
Conversation
Add a daemon-side OrphanReaper that scans /proc for zombie children of the daemon and reaps them by specific PID with waitpid(pid, WNOHANG) via bun:ffi. Avoids waitpid(-1) so it never races libuv's own child reaping; a one-interval grace window lets libuv reap its tracked children first. Gated to PID 1 on Linux and degrades safely if the FFI binding is unavailable. Closes ATL-746 Co-Authored-By: vargas@vellum.ai <vargas@vellum.ai>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
| return; | ||
| } | ||
| if (!initWaitpid()) return; | ||
| seenLastScan = new Set(); |
There was a problem hiding this comment.
let's only start this when a certain /workspace/config.json flag is enabled, so we can QA this and make sure bad stuff doesn't happen
There was a problem hiding this comment.
Done in 0b7b44d. Added an opt-in flag daemon.reapOrphanedSubprocesses (default false) to DaemonConfigSchema; startOrphanReaper() now reads it via getConfigReadOnly() and no-ops with a log line when disabled (on top of the existing PID-1/Linux gate). So nothing changes until a workspace sets it in config.json:
{ "daemon": { "reapOrphanedSubprocesses": true } }Read is wrapped in try/catch so a missing/malformed config can't block startup. Added tests asserting the default-off and explicit opt-in behavior. Resolving this thread.
| /** Linux `WNOHANG` — return immediately if no child has changed state. */ | ||
| const WNOHANG = 1; | ||
|
|
||
| const SCAN_INTERVAL_MS = 10_000; |
There was a problem hiding this comment.
can bump this up to a minute
There was a problem hiding this comment.
Done in 0b7b44d — SCAN_INTERVAL_MS bumped to 60_000. Note this also lengthens the grace window to ~60s (a zombie must survive one full interval before it's reaped), which is still well clear of libuv's own reaping (milliseconds after SIGCHLD). Resolving this thread.
… interval to 60s Add an opt-in config flag (default off) so the reaper can be enabled per workspace for QA before becoming the default, and lengthen the scan interval from 10s to 60s. Co-Authored-By: vargas@vellum.ai <vargas@vellum.ai>
Adds a daemon-side orphan reaper so subprocesses that reparent to the daemon when it runs as PID 1 in the managed container are actually
waitpid()-ed instead of piling up as<defunct>zombies that consume PID slots. This closes the confirmed leak (~120 zombies observed: git transport helpers from bash-tool timeouts andbunskill-runner processes) with no Dockerfile or entrypoint change.OrphanReaperruns on a 60s interval, scans/procfor processes in stateZwhose parent is the daemon, and reaps each by specific PID viawaitpid(pid, WNOHANG)bound throughbun:ffi. It is gated three ways:process.pid === 1on Linux, an opt-indaemon.reapOrphanedSubprocessesconfig flag (defaultfalse, so it can be enabled per workspace for validation before becoming the default), and graceful degradation to a no-op (logged warning, never throws) if the FFI binding is unavailable — per the daemon's "never block startup on a subsystem failure" rule.Closes ATL-746
Why this is safe (the libuv coexistence problem)
The subtle risk with any in-process reaper under Bun/Node is racing libuv. libuv installs its own
SIGCHLDhandler and reaps the children it spawned (everyBun.spawn) by specific PID (uv__wait_children→waitpid(process->pid, …)). A blanketwaitpid(-1)reaper would race that: if our call wins on a libuv-tracked child, libuv's ownwaitpidreturnsECHILD, itsexitevent never fires, andproc.exitedhangs with a lost exit code (libuv's source literally drops the result in this case — "someone else stole the waitpid from us. Handle this by not handling it at all.").This reaper avoids the race two ways:
waitpid(-1)— onlywaitpid(pid, WNOHANG)for PIDs we positively identified as zombies parented to us in/proc.SIGCHLD, so anything still<defunct>a full interval later is a genuine orphan libuv is not tracking.ECHILD/0returns are tolerated as no-ops.PR_SET_CHILD_SUBREAPERis intentionally not used: because the daemon is PID 1, orphans already reparent to it, so the only missing piece is the reaping loop.Verification
/proc/<pid>/statparsing (includingcommvalues containing spaces and parentheses), the defer-one-interval grace selection, and the opt-in config gate (default off, honors explicit enable).PR_SET_CHILD_SUBREAPERmakes the test process a subreaper so it reproduces the PID-1 reparenting exactly) spawns a libuv-tracked child that detaches a grandchild, then asserts: libuv independently reaps the tracked child, the orphaned grandchild is deferred on the first scan and reaped on the second, and no<defunct>entry remains.ps -eo pid,ppid,stat,comm | awk '$2==1 && $3 ~ /Z/') dropping to ~0 once shipped to a PID-1 container.Root cause analysis
How did the code get into this state? Every command-running tool spawns
detached: true(its own process group) and group-kills on timeout/abort withprocess.kill(-pgid, SIGKILL)(thebashandhost_bashtools, the skill sandbox runner, and the debug-bash route). This correctly kills the whole tree, and Bun/libuv reaps the immediate child. But descendants the daemon never spawned (git's transport helpers; the skill runner'sbash -c "bun run …"process) reparent to PID 1 when the group dies. Bun-as-PID-1 is not an init and neverwaitpid()s reparented orphans, so they leak as zombies.What decisions led to it? The detached-process-group + group-kill design is correct for killing. The gap is conceptual: killing ≠ reaping. Reaping a reparented process is, by POSIX, the responsibility of its current parent — which after reparenting is PID 1. Running the application directly as PID 1 (
exec bun … main.ts) without an init meant nothing fulfilled that responsibility.Were there warning signs? Yes — the observability shipped in #32595 ("Shell process group SIGKILL'd — orphans expected to reparent to PID 1") names the exact mechanism. The leak was confirmed by counting
<defunct>PPID=1 entries in a live container.How do we prevent recurrence? The reaper is a universal safety net: it catches orphans from all current and future kill sites, not just the four today, so new tools that spawn detached trees can't reintroduce the leak. The structured "Reaped orphaned subprocesses" log makes any residual leakage visible.
AGENTS.md guidance? No new rule added. The failure was environmental (app-as-PID-1 with no reaper), not a coding-convention gap, and the reaper now handles it generically; a prescriptive "always X when spawning" rule would risk going stale without preventing this class of bug.
Alternatives considered and rejected
tini/docker run --initas PID 1 (the ticket's original recommendation). The canonical fix and ~2 lines, but it is an infrastructure change. In the managed environment the k8s pod template overrides the imageENTRYPOINT(command/argsinstateful_template.yaml), so a DockerfileENTRYPOINT ["tini","--"]would be silently ignored; it would have to go indocker-entrypoint.sh. Deliberately avoided per the requirement to keep this daemon-side and Dockerfile-free, given the deployment risk of entrypoint changes. The daemon reaper is the in-process equivalent of "make the app a proper subreaper" and catches the same orphans.waitpid(-1, WNOHANG)reaper loop in the daemon. Rejected: races libuv as described above and can swallow tracked children's exit codes.References
SIGCHLD(uv__wait_children): https://github.com/nodejs/node/blob/main/deps/uv/src/unix/process.cWNOHANG, never-1, grace window): https://github.com/coopergwrenn/prctl-subreaperwaitpid(2): https://man7.org/linux/man-pages/man2/waitpid.2.htmlprctl(2)PR_SET_CHILD_SUBREAPER(why PID 1 already is the subreaper): https://man7.org/linux/man-pages/man2/prctl.2.html