Skip to content

[ATL-746] fix(daemon): reap orphaned subprocesses reparented to PID 1#33424

Merged
dvargasfuertes merged 2 commits into
mainfrom
devin/1780581974-atl-746-orphan-reaper
Jun 4, 2026
Merged

[ATL-746] fix(daemon): reap orphaned subprocesses reparented to PID 1#33424
dvargasfuertes merged 2 commits into
mainfrom
devin/1780581974-atl-746-orphan-reaper

Conversation

@devin-ai-integration
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot commented Jun 4, 2026

Adds a daemon-side orphan reaper so subprocesses that reparent to the daemon when it runs as PID 1 in the managed container are actually waitpid()-ed instead of piling up as <defunct> zombies that consume PID slots. This closes the confirmed leak (~120 zombies observed: git transport helpers from bash-tool timeouts and bun skill-runner processes) with no Dockerfile or entrypoint change.

OrphanReaper runs on a 60s interval, scans /proc for processes in state Z whose parent is the daemon, and reaps each by specific PID via waitpid(pid, WNOHANG) bound through bun:ffi. It is gated three ways: process.pid === 1 on Linux, an opt-in daemon.reapOrphanedSubprocesses config flag (default false, so it can be enabled per workspace for validation before becoming the default), and graceful degradation to a no-op (logged warning, never throws) if the FFI binding is unavailable — per the daemon's "never block startup on a subsystem failure" rule.

Closes ATL-746

Why this is safe (the libuv coexistence problem)

The subtle risk with any in-process reaper under Bun/Node is racing libuv. libuv installs its own SIGCHLD handler and reaps the children it spawned (every Bun.spawn) by specific PID (uv__wait_childrenwaitpid(process->pid, …)). A blanket waitpid(-1) reaper would race that: if our call wins on a libuv-tracked child, libuv's own waitpid returns ECHILD, its exit event never fires, and proc.exited hangs with a lost exit code (libuv's source literally drops the result in this case — "someone else stole the waitpid from us. Handle this by not handling it at all.").

This reaper avoids the race two ways:

  • Never waitpid(-1) — only waitpid(pid, WNOHANG) for PIDs we positively identified as zombies parented to us in /proc.
  • One-interval grace window — a zombie is only reaped if it was already seen on the previous scan. libuv reaps its own tracked children within milliseconds of SIGCHLD, so anything still <defunct> a full interval later is a genuine orphan libuv is not tracking. ECHILD/0 returns are tolerated as no-ops.

PR_SET_CHILD_SUBREAPER is intentionally not used: because the daemon is PID 1, orphans already reparent to it, so the only missing piece is the reaping loop.

Verification

  • Unit tests cover the subtle logic: /proc/<pid>/stat parsing (including comm values containing spaces and parentheses), the defer-one-interval grace selection, and the opt-in config gate (default off, honors explicit enable).
  • An integration test (PR_SET_CHILD_SUBREAPER makes the test process a subreaper so it reproduces the PID-1 reparenting exactly) spawns a libuv-tracked child that detaches a grandchild, then asserts: libuv independently reaps the tracked child, the orphaned grandchild is deferred on the first scan and reaped on the second, and no <defunct> entry remains.
  • Reproduces the original repro (ps -eo pid,ppid,stat,comm | awk '$2==1 && $3 ~ /Z/') dropping to ~0 once shipped to a PID-1 container.

Root cause analysis

How did the code get into this state? Every command-running tool spawns detached: true (its own process group) and group-kills on timeout/abort with process.kill(-pgid, SIGKILL) (the bash and host_bash tools, the skill sandbox runner, and the debug-bash route). This correctly kills the whole tree, and Bun/libuv reaps the immediate child. But descendants the daemon never spawned (git's transport helpers; the skill runner's bash -c "bun run …" process) reparent to PID 1 when the group dies. Bun-as-PID-1 is not an init and never waitpid()s reparented orphans, so they leak as zombies.

What decisions led to it? The detached-process-group + group-kill design is correct for killing. The gap is conceptual: killing ≠ reaping. Reaping a reparented process is, by POSIX, the responsibility of its current parent — which after reparenting is PID 1. Running the application directly as PID 1 (exec bun … main.ts) without an init meant nothing fulfilled that responsibility.

Were there warning signs? Yes — the observability shipped in #32595 ("Shell process group SIGKILL'd — orphans expected to reparent to PID 1") names the exact mechanism. The leak was confirmed by counting <defunct> PPID=1 entries in a live container.

How do we prevent recurrence? The reaper is a universal safety net: it catches orphans from all current and future kill sites, not just the four today, so new tools that spawn detached trees can't reintroduce the leak. The structured "Reaped orphaned subprocesses" log makes any residual leakage visible.

AGENTS.md guidance? No new rule added. The failure was environmental (app-as-PID-1 with no reaper), not a coding-convention gap, and the reaper now handles it generically; a prescriptive "always X when spawning" rule would risk going stale without preventing this class of bug.

Alternatives considered and rejected

  • tini / docker run --init as PID 1 (the ticket's original recommendation). The canonical fix and ~2 lines, but it is an infrastructure change. In the managed environment the k8s pod template overrides the image ENTRYPOINT (command/args in stateful_template.yaml), so a Dockerfile ENTRYPOINT ["tini","--"] would be silently ignored; it would have to go in docker-entrypoint.sh. Deliberately avoided per the requirement to keep this daemon-side and Dockerfile-free, given the deployment risk of entrypoint changes. The daemon reaper is the in-process equivalent of "make the app a proper subreaper" and catches the same orphans.
  • waitpid(-1, WNOHANG) reaper loop in the daemon. Rejected: races libuv as described above and can swallow tracked children's exit codes.
  • Graceful kill (SIGTERM → grace → SIGKILL) at the four kill sites. Useful defense-in-depth (cooperative tools like git would reap their own descendants before dying, reducing orphan creation), but it only reduces, never eliminates — SIGKILL on a hung/uncooperative tree always orphans. Deferred as an optional fast-follow to keep this PR's surface minimal and its timeout/abort semantics unchanged; the reaper is the actual fix regardless.

References


Add a daemon-side OrphanReaper that scans /proc for zombie children of the
daemon and reaps them by specific PID with waitpid(pid, WNOHANG) via bun:ffi.
Avoids waitpid(-1) so it never races libuv's own child reaping; a one-interval
grace window lets libuv reap its tracked children first. Gated to PID 1 on
Linux and degrades safely if the FFI binding is unavailable.

Closes ATL-746

Co-Authored-By: vargas@vellum.ai <vargas@vellum.ai>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring

@linear
Copy link
Copy Markdown

linear Bot commented Jun 4, 2026

ATL-746

return;
}
if (!initWaitpid()) return;
seenLastScan = new Set();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's only start this when a certain /workspace/config.json flag is enabled, so we can QA this and make sure bad stuff doesn't happen

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0b7b44d. Added an opt-in flag daemon.reapOrphanedSubprocesses (default false) to DaemonConfigSchema; startOrphanReaper() now reads it via getConfigReadOnly() and no-ops with a log line when disabled (on top of the existing PID-1/Linux gate). So nothing changes until a workspace sets it in config.json:

{ "daemon": { "reapOrphanedSubprocesses": true } }

Read is wrapped in try/catch so a missing/malformed config can't block startup. Added tests asserting the default-off and explicit opt-in behavior. Resolving this thread.

Comment thread assistant/src/daemon/orphan-reaper.ts Outdated
/** Linux `WNOHANG` — return immediately if no child has changed state. */
const WNOHANG = 1;

const SCAN_INTERVAL_MS = 10_000;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can bump this up to a minute

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0b7b44dSCAN_INTERVAL_MS bumped to 60_000. Note this also lengthens the grace window to ~60s (a zombie must survive one full interval before it's reaped), which is still well clear of libuv's own reaping (milliseconds after SIGCHLD). Resolving this thread.

… interval to 60s

Add an opt-in config flag (default off) so the reaper can be enabled per
workspace for QA before becoming the default, and lengthen the scan interval
from 10s to 60s.

Co-Authored-By: vargas@vellum.ai <vargas@vellum.ai>
@dvargasfuertes dvargasfuertes merged commit 85bce3c into main Jun 4, 2026
13 checks passed
@dvargasfuertes dvargasfuertes deleted the devin/1780581974-atl-746-orphan-reaper branch June 4, 2026 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant