[ATL-746] fix(daemon): reap orphaned subprocesses reparented to PID 1 by devin-ai-integration[bot] · Pull Request #33424 · vellum-ai/vellum-assistant

devin-ai-integration · 2026-06-04T14:19:29Z

Adds a daemon-side orphan reaper so subprocesses that reparent to the daemon when it runs as PID 1 in the managed container are actually waitpid()-ed instead of piling up as <defunct> zombies that consume PID slots. This closes the confirmed leak (~120 zombies observed: git transport helpers from bash-tool timeouts and bun skill-runner processes) with no Dockerfile or entrypoint change.

OrphanReaper runs on a 60s interval, scans /proc for processes in state Z whose parent is the daemon, and reaps each by specific PID via waitpid(pid, WNOHANG) bound through bun:ffi. It is gated three ways: process.pid === 1 on Linux, an opt-in daemon.reapOrphanedSubprocesses config flag (default false, so it can be enabled per workspace for validation before becoming the default), and graceful degradation to a no-op (logged warning, never throws) if the FFI binding is unavailable — per the daemon's "never block startup on a subsystem failure" rule.

Closes ATL-746

Why this is safe (the libuv coexistence problem)

The subtle risk with any in-process reaper under Bun/Node is racing libuv. libuv installs its own SIGCHLD handler and reaps the children it spawned (every Bun.spawn) by specific PID (uv__wait_children → waitpid(process->pid, …)). A blanket waitpid(-1) reaper would race that: if our call wins on a libuv-tracked child, libuv's own waitpid returns ECHILD, its exit event never fires, and proc.exited hangs with a lost exit code (libuv's source literally drops the result in this case — "someone else stole the waitpid from us. Handle this by not handling it at all.").

This reaper avoids the race two ways:

Never waitpid(-1) — only waitpid(pid, WNOHANG) for PIDs we positively identified as zombies parented to us in /proc.
One-interval grace window — a zombie is only reaped if it was already seen on the previous scan. libuv reaps its own tracked children within milliseconds of SIGCHLD, so anything still <defunct> a full interval later is a genuine orphan libuv is not tracking. ECHILD/0 returns are tolerated as no-ops.

PR_SET_CHILD_SUBREAPER is intentionally not used: because the daemon is PID 1, orphans already reparent to it, so the only missing piece is the reaping loop.

Verification

Unit tests cover the subtle logic: /proc/<pid>/stat parsing (including comm values containing spaces and parentheses), the defer-one-interval grace selection, and the opt-in config gate (default off, honors explicit enable).
An integration test (PR_SET_CHILD_SUBREAPER makes the test process a subreaper so it reproduces the PID-1 reparenting exactly) spawns a libuv-tracked child that detaches a grandchild, then asserts: libuv independently reaps the tracked child, the orphaned grandchild is deferred on the first scan and reaped on the second, and no <defunct> entry remains.
Reproduces the original repro (ps -eo pid,ppid,stat,comm | awk '$2==1 && $3 ~ /Z/') dropping to ~0 once shipped to a PID-1 container.

Root cause analysis

How did the code get into this state? Every command-running tool spawns detached: true (its own process group) and group-kills on timeout/abort with process.kill(-pgid, SIGKILL) (the bash and host_bash tools, the skill sandbox runner, and the debug-bash route). This correctly kills the whole tree, and Bun/libuv reaps the immediate child. But descendants the daemon never spawned (git's transport helpers; the skill runner's bash -c "bun run …" process) reparent to PID 1 when the group dies. Bun-as-PID-1 is not an init and never waitpid()s reparented orphans, so they leak as zombies.

What decisions led to it? The detached-process-group + group-kill design is correct for killing. The gap is conceptual: killing ≠ reaping. Reaping a reparented process is, by POSIX, the responsibility of its current parent — which after reparenting is PID 1. Running the application directly as PID 1 (exec bun … main.ts) without an init meant nothing fulfilled that responsibility.

Were there warning signs? Yes — the observability shipped in #32595 ("Shell process group SIGKILL'd — orphans expected to reparent to PID 1") names the exact mechanism. The leak was confirmed by counting <defunct> PPID=1 entries in a live container.

How do we prevent recurrence? The reaper is a universal safety net: it catches orphans from all current and future kill sites, not just the four today, so new tools that spawn detached trees can't reintroduce the leak. The structured "Reaped orphaned subprocesses" log makes any residual leakage visible.

AGENTS.md guidance? No new rule added. The failure was environmental (app-as-PID-1 with no reaper), not a coding-convention gap, and the reaper now handles it generically; a prescriptive "always X when spawning" rule would risk going stale without preventing this class of bug.

Alternatives considered and rejected

tini / docker run --init as PID 1 (the ticket's original recommendation). The canonical fix and ~2 lines, but it is an infrastructure change. In the managed environment the k8s pod template overrides the image ENTRYPOINT (command/args in stateful_template.yaml), so a Dockerfile ENTRYPOINT ["tini","--"] would be silently ignored; it would have to go in docker-entrypoint.sh. Deliberately avoided per the requirement to keep this daemon-side and Dockerfile-free, given the deployment risk of entrypoint changes. The daemon reaper is the in-process equivalent of "make the app a proper subreaper" and catches the same orphans.
waitpid(-1, WNOHANG) reaper loop in the daemon. Rejected: races libuv as described above and can swallow tracked children's exit codes.
Graceful kill (SIGTERM → grace → SIGKILL) at the four kill sites. Useful defense-in-depth (cooperative tools like git would reap their own descendants before dying, reducing orphan creation), but it only reduces, never eliminates — SIGKILL on a hung/uncooperative tree always orphans. Deferred as an optional fast-follow to keep this PR's surface minimal and its timeout/abort semantics unchanged; the reaper is the actual fix regardless.

References

libuv reaps its own children by pid on SIGCHLD (uv__wait_children): https://github.com/nodejs/node/blob/main/deps/uv/src/unix/process.c
Subreaper reaping pattern for runtimes embedding libuv (specific-pid + WNOHANG, never -1, grace window): https://github.com/coopergwrenn/prctl-subreaper
waitpid(2): https://man7.org/linux/man-pages/man2/waitpid.2.html
prctl(2) PR_SET_CHILD_SUBREAPER (why PID 1 already is the subreaper): https://man7.org/linux/man-pages/man2/prctl.2.html

Requested by: @dvargas92495
Session: https://app.devin.ai/sessions/68f0940f2cec4c88990c2c5d1ceb728c

Add a daemon-side OrphanReaper that scans /proc for zombie children of the daemon and reaps them by specific PID with waitpid(pid, WNOHANG) via bun:ffi. Avoids waitpid(-1) so it never races libuv's own child reaping; a one-interval grace window lets libuv reap its tracked children first. Gated to PID 1 on Linux and degrades safely if the FFI binding is unavailable. Closes ATL-746 Co-Authored-By: vargas@vellum.ai <vargas@vellum.ai>

devin-ai-integration · 2026-06-04T14:19:33Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment, CI, and merge conflict monitoring

linear · 2026-06-04T14:19:33Z

ATL-746

dvargasfuertes · 2026-06-04T14:46:01Z

+    return;
+  }
+  if (!initWaitpid()) return;
+  seenLastScan = new Set();


let's only start this when a certain /workspace/config.json flag is enabled, so we can QA this and make sure bad stuff doesn't happen

Done in 0b7b44d. Added an opt-in flag daemon.reapOrphanedSubprocesses (default false) to DaemonConfigSchema; startOrphanReaper() now reads it via getConfigReadOnly() and no-ops with a log line when disabled (on top of the existing PID-1/Linux gate). So nothing changes until a workspace sets it in config.json:

{ "daemon": { "reapOrphanedSubprocesses": true } }

Read is wrapped in try/catch so a missing/malformed config can't block startup. Added tests asserting the default-off and explicit opt-in behavior. Resolving this thread.

dvargasfuertes · 2026-06-04T14:46:11Z

+/** Linux `WNOHANG` — return immediately if no child has changed state. */
+const WNOHANG = 1;
+
+const SCAN_INTERVAL_MS = 10_000;


can bump this up to a minute

Done in 0b7b44d — SCAN_INTERVAL_MS bumped to 60_000. Note this also lengthens the grace window to ~60s (a zombie must survive one full interval before it's reaped), which is still well clear of libuv's own reaping (milliseconds after SIGCHLD). Resolving this thread.

… interval to 60s Add an opt-in config flag (default off) so the reaper can be enabled per workspace for QA before becoming the default, and lengthen the scan interval from 10s to 60s. Co-Authored-By: vargas@vellum.ai <vargas@vellum.ai>

dvargasfuertes requested changes Jun 4, 2026

View reviewed changes

dvargasfuertes approved these changes Jun 4, 2026

View reviewed changes

dvargasfuertes merged commit 85bce3c into main Jun 4, 2026
13 checks passed

dvargasfuertes deleted the devin/1780581974-atl-746-orphan-reaper branch June 4, 2026 15:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ATL-746] fix(daemon): reap orphaned subprocesses reparented to PID 1#33424

[ATL-746] fix(daemon): reap orphaned subprocesses reparented to PID 1#33424
dvargasfuertes merged 2 commits into
mainfrom
devin/1780581974-atl-746-orphan-reaper

devin-ai-integration Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

devin-ai-integration Bot commented Jun 4, 2026

Uh oh!

linear Bot commented Jun 4, 2026

Uh oh!

dvargasfuertes Jun 4, 2026

Uh oh!

devin-ai-integration Bot Jun 4, 2026

Uh oh!

dvargasfuertes Jun 4, 2026

Uh oh!

devin-ai-integration Bot Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

devin-ai-integration Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why this is safe (the libuv coexistence problem)

Verification

Root cause analysis

Alternatives considered and rejected

References

Uh oh!

devin-ai-integration Bot commented Jun 4, 2026

🤖 Devin AI Engineer

Uh oh!

linear Bot commented Jun 4, 2026

Uh oh!

dvargasfuertes Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

dvargasfuertes Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

devin-ai-integration Bot commented Jun 4, 2026 •

edited

Loading