Observability for subprocess orphans (bug report #3) by vellum-apollo-bot[bot] · Pull Request #32595 · vellum-ai/vellum-assistant

vellum-apollo-bot · 2026-05-29T19:13:04Z

Observability for subprocess orphans (bug report #3)

Adds two structured log signals on the assistant's bash tool so the next zombie-accumulation report points at a specific call site instead of a guess. No semantics change to the bash tool — orphans still happen, but now they're observable. The actual fix (tini / reaper / group-wait) is tracked in ATL-746.

What's in here

1. `buildKillTree` warn log (the ground-truth event)

The shell tool's buildKillTree previously did process.kill(-pid, SIGKILL) silently. It now emits a structured warn before the kill:

Shell process group SIGKILL'd — orphans expected to reparent to PID 1
  reason: "timeout" | "abort" | "spawn_error"
  command: <redacted>
  conversationId, toolName, invocationId
  groupPid, durationMs

This is the event that creates orphans. The five call sites (timeout × 2, abort × 2, cancel × 1) all go through one helper that returns the actual kill closure, so the log and the kill cannot drift.

2. `logShellExit` info log (the bookend)

Every shell invocation now has a paired start/exit log. The start log (Executing shell command) already exists; we add a matching exit log so an operator can correlate them with invocationId:

Shell command exited
  mode: "foreground" | "background"
  exitCode, signal, timedOut, spawnError
  durationMs
  invocationId, conversationId, command (redacted)

High-conviction RCA (waiting on these logs to confirm per-incident)

Model invokes the bash tool with a command like git push.
shell.ts spawns bash -c … detached in its own process group with a 120s default timeout.
The git command triggers .git/hooks/pre-push (lint + typecheck + tests) which routinely exceeds 120s.
The timeout fires process.kill(-pid, SIGKILL) on the whole group.
Bash dies before reaping git's grandchildren (git-remote-http, pre-push, hook-spawned sleep/tail/timeout, etc.).
Those grandchildren reparent to PID 1 — but PID 1 is bun --smol run src/daemon/main.ts, which does not install a SIGCHLD reaper, so they accumulate as <defunct>.

Rate ~5 zombies/hour matches the 118/26h incident the user reported. The leak does not trace back to the daemon's own git-service.execGit (uses execFileAsync with a 10s timeout; repros confirm zero zombies from that path). Every observed zombie is model-issued git through the bash tool.

What is not in this PR

No PID-1 reaper (tini or otherwise). That's ATL-746, parked at Medium pending these logs.
No kill-semantics change. Group-SIGKILL behavior is unchanged.
No timeout change. The 120s default is independently questionable for git push against a heavy pre-push hook, but tradeoffs there are separate.

Test plan

assistant/src/__tests__/shell-observability.test.ts — 4 tests covering foreground exit log, timeout → killTree warn + exit log, background mode exit log, abort → killTree warn with reason="abort".
Existing shell-credential-ref.test.ts continues to pass (9/9).
Pre-existing failures in terminal-tools.test.ts (buildSanitizedEnv) are on origin/main, not introduced here.

Files touched

assistant/src/tools/terminal/shell.ts — buildKillTree signature change + logShellExit helper; 5 kill call sites wired.
assistant/src/__tests__/shell-observability.test.ts — new, 4 tests.
assistant/AGENTS.md — narrows the test-isolation rule to preload-time infrastructure only.

🚀

Adds three structured log signals so the next zombie-accumulation report points at a specific call site instead of a guess: 1. buildKillTree() warns on every group-SIGKILL with reason (timeout/abort/spawn_error), command, conversationId, groupPid, and durationMs. This is the ground-truth event that creates orphans. 2. logShellExit() info log on every shell teardown — pairs with the existing 'Executing shell command' start log so every invocation has a start/exit pair carrying mode (fg/bg), exitCode, signal, timedOut, durationMs, spawnError. 3. Periodic /proc-based zombie sampler (daemon/zombie-process- sampler.ts) — every 5 min counts zombies parented to the daemon, breaks down by comm, warns above a threshold. Linux-only; no-op on macOS desktop. No semantics change to the bash tool — orphans still happen, but now they're observable.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23e30ddbdb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-29T19:15:20Z

+import type { WakeOptions } from "../runtime/agent-wake.js";
+import type { BackgroundTool } from "../tools/background-tool-registry.js";
+import type { Tool } from "../tools/types.js";


Remove extra production imports from the shell test

assistant/AGENTS.md says files under __tests__/ must not import from src/ except for the production module they are testing. This new test imports agent-wake, background-tool-registry, and tools/types only to type its mocks, coupling the test harness to production modules outside shell.ts; if those modules gain import-time side effects, they can run before the mocks/workspace isolation this rule is meant to protect. Duplicate the small mock shapes locally or derive types from the shell module under test instead.

Useful? React with 👍 / 👎.

We should update the AGENTS.md, this should only be strictly true preload's

Updated assistant/AGENTS.md in cbbc0b7 to scope the strict no-src/ rule to preload/verifier/helpers only — *.test.ts files run after preload inside the workspace override, so the inverted-invariants concern doesnt apply to them. Imports stay as-is.

chatgpt-codex-connector · 2026-05-29T19:15:20Z

+ */
+import { afterEach, beforeEach, describe, expect, mock, test } from "bun:test";
+
+import { isLinux } from "../../util/platform.js";


Avoid importing platform utilities in this sampler test

assistant/AGENTS.md forbids __tests__/ files from importing production src/ modules other than the module being tested. This test is for zombie-process-sampler.ts, but it also imports ../../util/platform.js; that reintroduces the production/test coupling the guideline calls out and can run unrelated import-time code before test isolation is established. Use process.platform === "linux" or a local helper in the test instead.

Useful? React with 👍 / 👎.

dvargasfuertes · 2026-05-29T19:38:34Z

+import type { WakeOptions } from "../runtime/agent-wake.js";
+import type { BackgroundTool } from "../tools/background-tool-registry.js";
+import type { Tool } from "../tools/types.js";


We should update the AGENTS.md, this should only be strictly true preload's

dvargasfuertes · 2026-05-29T19:39:52Z

+ *
+ * Exposed for testing — production code calls the singleton sampler.
+ */
+export function sampleZombies(parentPid: number): ZombieSample {


changed my mind, let's keep the shell improvements but remove the zombie sampler

Done — cbbc0b7 drops the sampler, its test, and the lifecycle wiring. PR now only adds the shell-side observability (buildKillTree warn + logShellExit info).

Per PR review: - Remove zombie-process-sampler.ts + test + lifecycle wiring. Keeping just the shell-side observability (buildKillTree + logShellExit). - Update assistant/AGENTS.md so the strict 'no src/ imports' rule applies only to preload-time infrastructure (preload, verifier, shared helpers), not to ordinary *.test.ts files.

PR #32595 ('Observability for subprocess orphans') merged into main after this branch opened. It introduced shell-observability.test.ts using `let shellTool: Tool` + `shellTool.execute(...)` — the same pattern fad576a fixed in the other shell tests. CI's typecheck on the merge commit catches this; rebasing surfaces the file so the fix goes in-branch. Same mechanical fix: `Tool` → `ToolDefinition` for the variable type, `execute!` on the four call sites.

…type core tools (#32631) * refactor(tools): move name to ToolDefinition, delete ToolManifest, retype core tools Closes the ToolDefinition refactor cycle started by #32362 (executionMode) and #32472 (category) — every tool author writes a `ToolDefinition` literal, the registry finalizes it into a `Tool`. ## name → ToolDefinition (PR #32472 follow-up review comment) `name?: string` is now an optional field on `ToolDefinition` with a JSDoc note that loaders default it to the source file basename (`tools/read.ts` → `read`). `Tool` collapses from `Required<ToolDefinition> & { name: string }` to bare `Required<ToolDefinition>` since name is now part of the base shape. `finalizeTool(tool, defaultName)` now prefers `tool.name` over the file-derived default, so a literal that overrides the file name still wins. ## Delete ToolManifest The Zod-derived `ToolManifest` type in `ipc/skill-routes/registries.ts` was a near-duplicate of `ToolDefinition` minus `execute` (functions don't cross IPC). Replaced with `ToolDefinition` directly; renamed the schema to `WireToolDefinitionSchema` to reflect that it's the wire form of the canonical type. `buildProxyTool(definition: ToolDefinition)` uses `!` assertions on fields the Zod schema guarantees are set. ## Core tools as ToolDefinition Every core tool literal — 22 `class XxxTool implements ToolDefinition` sites plus 14 `: ToolDefinition = { ... }` object literals across `apps/`, `ui-surface/`, `computer-use/` definitions — now uses the relaxed authoring contract. Aggregate lists (`explicitTools`, `cesTools`, `coreAppProxyTools`, `allUiSurfaceTools`) are typed `ToolDefinition[]`. `registerTool` widens to accept `ToolDefinition` and finalizes internally via a `WeakMap<ToolDefinition, Tool>` so idempotent re-registration (test reset helpers, ESM module re-imports) stays the same silent no-op the old `existing === tool` check provided. Throws if `tool.name` is missing — the file-basename default is a loader concern, not a registry concern. ## Verified locally - `eslint src/tools src/ipc/skill-routes/registries.ts` — clean. - `bun test plugin-bootstrap plugin-tool-contribution plugin-types registry skill-tool-manifest tool-executor tool-approval-handler conversation-skill-tools conversation-app-control-instantiation skill-projection-feature-flag` — 263/263 pass. - Broader `bun test src/tools/` reproduces a 410/489 pass baseline identical to `origin/main` HEAD — no new regressions. * fix(tools): typecheck after ToolDefinition widening PR #32472 widened `ToolDefinition` (made name/description/etc. optional). This commit fixes the cascade of typecheck failures in core-tool tests + two production-side mismatches. ## Registry returns Tool[], not ToolDefinition[] `getAllToolDefinitions()` and `getMcpToolDefinitions()` previously returned `ToolDefinition[]`. At runtime they always return finalized `Tool` objects (full Required<ToolDefinition>); the loose typing was lying. Tightening to `Tool[]` fixes: - `conversation.ts` and `conversation-tool-setup.ts` `.name` reads that became `string | undefined` under the widened type. - Cross-package assignability to `@vellumai/skill-host-contracts`'s stricter `ToolDefinition` (which requires name/description/input_schema). - `btw-routes.ts` passing the result to `runBtwSidechain` whose `tools` parameter is the contracts-package shape. ## buildProxyTool takes Zod-inferred wire shape `buildProxyTool(definition: ToolDefinition)` was being passed elements from `z.array(WireToolDefinitionSchema)`. The Zod-inferred type has strict field types (e.g. `defaultRiskLevel: "low"|"medium"|"high"`) that don't match assistant's enum-typed `ToolDefinition` even though they're structurally equivalent. Switching the parameter to `z.infer<typeof WireToolDefinitionSchema>` aligns the types and drops the `!` assertions / `as` casts that the prior version needed. ## Tests: ! on .execute() calls and tool.description 13 test files invoke `.execute()` directly on imported tool exports typed as `ToolDefinition`. Added non-null assertions matching the runtime invariant — these tools are class instances or fully-specified literals where every field is implemented. Also: `computer-use-tools.test.ts` `schema()` helper now accepts `input_schema?: object` instead of requiring it; `background-shell-bash` and `terminal-tools` local `shellTool: Tool` declarations changed to `shellTool: ToolDefinition` matching the module export type; `credential-execution-tools` drops an explicit `(t: Tool)` callback annotation that tripped the variance check. * fix(tools): update shell-observability test to widened ToolDefinition PR #32595 ('Observability for subprocess orphans') merged into main after this branch opened. It introduced shell-observability.test.ts using `let shellTool: Tool` + `shellTool.execute(...)` — the same pattern fad576a fixed in the other shell tests. CI's typecheck on the merge commit catches this; rebasing surfaces the file so the fix goes in-branch. Same mechanical fix: `Tool` → `ToolDefinition` for the variable type, `execute!` on the four call sites. * refactor(tools): convert class-based tools to ToolDefinition literals + consolidate schemas Replaces all 22 class-based tools with plain object literals typed as `ToolDefinition`, addressing PR review feedback. Class+instance indirection is gone; `finalizeTool` no longer has anything to do with prototypes. Schema consolidation: - `ToolDefinitionSchema` (tools/types.ts) is the single source of truth. `WireToolDefinitionSchema = ToolDefinitionSchema.required({...})` with `executionTarget` optional. - `ipc/skill-routes/registries.ts` imports the canonical schema; the local duplicate and `RiskLevel` cast are removed (`z.enum(RiskLevel)` infers as the native enum). Refactors: - ask-question: constructor-injected `prompterFactory` becomes a `createAskQuestionTool()` factory; default export `askQuestionTool`. Test file updated to import the literal + factory. - host-filesystem/transfer: `executeLocal` lifted from a private method to a module-level function (registry stores finalized literal references, so `this`-based dispatch is unavailable). - 20 other tools converted to literals (zero `this.` usage). Tests: - ask-question 23/23, host-filesystem 29/29, memory 14/14, network web-search 36/36 pass in isolation. Broad-batch failures are pre-existing Bun mock-leakage: baseline at merge-base shows 167 fail / 537 tests; this branch shows 79 fail / 489 tests under the same conditions. * fix(tools): make ToolDefinition.execute required + widen input_schema to object CI typecheck caught three regressions from the schema-driven type: 1. `ExecutionTarget` type import was removed by accident while restructuring the import block. Restored. 2. `ToolDefinition.execute` was inferred as optional (because the schema doesn't include execute), so every test calling `tool.execute(...)` failed with TS18048. Changed the type overlay to require execute — every in-process tool literal has one, and the wire schema still parses without it (closures can't cross IPC; `buildProxyTool` synthesizes execute on arrival). 3. `input_schema` was inferred as `Record<string, unknown>` from `z.record(z.string(), z.unknown())`, which is stricter than the prior `object` interface field. Factories and tests assigning typed JSON-schema objects (`as object`) failed. Widened the overlay to `input_schema?: object` so authors can keep using JSON schema literals; wire form still parses to `Record<string, unknown>` via `WireToolDefinitionSchema`. Also updated `managed-skill-lifecycle.test.ts` to use the literal `skillLoadTool` import instead of the (removed) `SkillLoadTool` class. * fix(tools): keep execute optional on ToolDefinition, bang test call sites Reverting the previous `execute: required` change — there are schema-only ToolDefinitions in the codebase (`graphRememberDefinition`, `storeStyleAnalysisTool`, `SWEEP_TOOL`) that are handed to providers as function-calling schemas without ever being registered for execution. Requiring execute on the type would break those legitimate use cases, plus the plugin-api contract test that asserts an empty literal satisfies `ToolDefinition`. Instead: keep `execute` optional, and add the `!` bang to test call sites (mirroring the existing pattern in `host-filesystem/*.test.ts`). Sed-replaced 111 call sites across 8 test files. Affected tools: credentialStoreTool, fileListTool, makeAuthenticatedRequestTool, recallTool, rememberTool, runAuthenticatedCommandTool, webSearchTool. Also added a comment in types.ts documenting why execute is optional and the schema-only ToolDefinition pattern, with pointers to the three existing instances. * fix(tools): bang skillLoadTool.execute in managed-skill-lifecycle test * refactor(tools): consolidate to one schema + finalizeTool at IPC boundary Round 2 feedback on PR #32631: kill the dual-schema design. types.ts: - Delete WireToolDefinitionSchema and its doc block. - ToolDefinitionSchema is the single source of truth — all fields optional, both author and wire paths share it. - Update ToolDefinition's doc to reflect the new contract and note the `satisfies ToolDefinition` pattern. registries.ts (skill-routes): - Drop the Wire-level-schemas comment block. - Use ToolDefinitionSchema in z.array(). - Delete buildProxyTool + WireToolDefinition type. - Replace tools.map(buildProxyTool) with tools.map((t) => finalizeTool(t, t.name ?? "")) — finalizeTool already synthesizes a no-op error closure for the no-supervisor path. Tool sources (28 files: terminal/shell, host-terminal/host-shell, host-filesystem/*, filesystem/*, credentials/vault, credential-execution/*, memory/register, memory/graph/tools, memory/v2/sweep-job, messaging/style-analyzer, ui-surface, computer-use/definitions, network/{web-fetch,web-search}, system/request-permission, subagent/notify-parent, skills/execute, apps/definitions): - Switch from `: ToolDefinition = { ... }` to `... } satisfies ToolDefinition` so the inferred per-export type preserves `execute` as required at call sites that statically import the literal. The schema-only definitions (graphRememberDefinition, storeStyleAnalysisTool, SWEEP_TOOL) intentionally keep `execute` absent — ToolDefinition itself still has it optional. Test files (27 files): - Drop the `.execute!(` bang now that satisfies preserves the required type at the call site. - Pattern A refactor: replace inline `await import("../tools/...")` in beforeEach with module-scope static imports for background-shell-bash and terminal-tools. shell-observability keeps a dynamic import because the test mocks `../util/logger.js` and shell.ts captures the logger at module-eval time — static import there would hoist past mock.module() and the test would see the real logger. Type-narrow via `typeof import(...)["shellTool"]` so no bang is needed. Intentional reverts: - openai-responses-provider.test.ts and plugin-api-tool-definition.test.ts keep `: ToolDefinition = { ... }` — they assert type shape (including the empty-literal contract test) rather than calling .execute, so the widened type is the right tool there. registries.test.ts: - "rejects missing required fields" → "fills defaults for partial tool entries": partial entries no longer reject (one all-optional schema + finalizeTool fills defaults). - "proxy execute throws when no supervisor is attached" → "proxy execute surfaces an error result when no supervisor is attached": finalizeTool synthesizes a no-op error closure now, not a throw. * fix(tools): convert skillLoadTool + ask-question to satisfies, finalize before supervisor short-circuit Round 2 CI follow-up — three type-check failures from PR #32631 round 2: 1. `skillLoadTool` (skills/load.ts) and `askQuestionTool` (ask-question/ask-question-tool.ts) source files were missed in the first satisfies pass: - `skills/load.ts` contains a regex literal (`/!`[^`]*`/g`) inside the execute body. The AST converter at scratch/satisfies-converter.ts doesn't track regex literals, so its brace matcher walked through the backticks inside the character class, lost depth tracking, and silently emitted "no spots found". Converted manually. - `ask-question/ask-question-tool.ts` wraps a factory call (`createAskQuestionTool()` returns ToolDefinition). The cleanest fix is to drop the factory's explicit return annotation and apply `satisfies ToolDefinition` inside the returned literal so the inferred return type carries the narrow shape through the export. 2. `ipc/skill-routes/registries.ts` supervisor short-circuit log + return read `tools.map((t) => t.name)` where `t.name` is `string | undefined` per the all-optional schema. Pulled `finalizeTool` ahead of the supervisor branch so both paths see a `Tool[]` (name guaranteed). 3. `ipc/skill-routes/__tests__/registries.test.ts` "fills defaults" assertion compared `installed.defaultRiskLevel` (typed `RiskLevel`) against the string literal `"medium"` — strict enum compare needs `RiskLevel.Medium`. * fix(tests): drop widening return type from makeToolWithStub in ask-question test The helper had `tool: ToolDefinition` in its explicit return type, which collapses the satisfies-narrowed shape returned by `createAskQuestionTool()` back to the public author-facing shape (execute optional). Remove the explicit return type so callers see the inferred narrow shape and can invoke `tool.execute(...)` without a `!` bang. Also drop the now-unused `ToolDefinition` type import. --------- Co-authored-by: vellum-apollo-bot[bot] <242025090+vellum-apollo-bot[bot]@users.noreply.github.com>

vellum-apollo-bot Bot requested a review from siddseethepalli as a code owner May 29, 2026 19:13

chatgpt-codex-connector Bot reviewed May 29, 2026

View reviewed changes

dvargasfuertes requested changes May 29, 2026

View reviewed changes

vellum-apollo-bot Bot requested a review from dvargasfuertes May 29, 2026 19:51

dvargasfuertes approved these changes May 29, 2026

View reviewed changes

dvargasfuertes merged commit 5712c7a into main May 29, 2026
13 checks passed

dvargasfuertes deleted the apollo/observe-subprocess-orphans branch May 29, 2026 20:46

devin-ai-integration Bot mentioned this pull request Jun 4, 2026

[ATL-746] fix(daemon): reap orphaned subprocesses reparented to PID 1 #33424

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Observability for subprocess orphans (bug report #3)#32595

Observability for subprocess orphans (bug report #3)#32595
dvargasfuertes merged 2 commits into
mainfrom
apollo/observe-subprocess-orphans

vellum-apollo-bot Bot commented May 29, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 29, 2026

Uh oh!

dvargasfuertes May 29, 2026

Uh oh!

vellum-apollo-bot Bot May 29, 2026

Uh oh!

chatgpt-codex-connector Bot May 29, 2026

Uh oh!

dvargasfuertes May 29, 2026

Uh oh!

dvargasfuertes May 29, 2026

Uh oh!

vellum-apollo-bot Bot May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

vellum-apollo-bot Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Observability for subprocess orphans (bug report #3)

What's in here

1. buildKillTree warn log (the ground-truth event)

2. logShellExit info log (the bookend)

High-conviction RCA (waiting on these logs to confirm per-incident)

What is not in this PR

Test plan

Files touched

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

dvargasfuertes May 29, 2026

Choose a reason for hiding this comment

Uh oh!

vellum-apollo-bot Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

dvargasfuertes May 29, 2026

Choose a reason for hiding this comment

Uh oh!

dvargasfuertes May 29, 2026

Choose a reason for hiding this comment

Uh oh!

vellum-apollo-bot Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vellum-apollo-bot Bot commented May 29, 2026 •

edited

Loading

1. `buildKillTree` warn log (the ground-truth event)

2. `logShellExit` info log (the bookend)