Skip to content

Observability for subprocess orphans (bug report #3)#32595

Merged
dvargasfuertes merged 2 commits into
mainfrom
apollo/observe-subprocess-orphans
May 29, 2026
Merged

Observability for subprocess orphans (bug report #3)#32595
dvargasfuertes merged 2 commits into
mainfrom
apollo/observe-subprocess-orphans

Conversation

@vellum-apollo-bot
Copy link
Copy Markdown
Contributor

@vellum-apollo-bot vellum-apollo-bot Bot commented May 29, 2026

Observability for subprocess orphans (bug report #3)

Adds two structured log signals on the assistant's bash tool so the next zombie-accumulation report points at a specific call site instead of a guess. No semantics change to the bash tool — orphans still happen, but now they're observable. The actual fix (tini / reaper / group-wait) is tracked in ATL-746.

What's in here

1. buildKillTree warn log (the ground-truth event)

The shell tool's buildKillTree previously did process.kill(-pid, SIGKILL) silently. It now emits a structured warn before the kill:

Shell process group SIGKILL'd — orphans expected to reparent to PID 1
  reason: "timeout" | "abort" | "spawn_error"
  command: <redacted>
  conversationId, toolName, invocationId
  groupPid, durationMs

This is the event that creates orphans. The five call sites (timeout × 2, abort × 2, cancel × 1) all go through one helper that returns the actual kill closure, so the log and the kill cannot drift.

2. logShellExit info log (the bookend)

Every shell invocation now has a paired start/exit log. The start log (Executing shell command) already exists; we add a matching exit log so an operator can correlate them with invocationId:

Shell command exited
  mode: "foreground" | "background"
  exitCode, signal, timedOut, spawnError
  durationMs
  invocationId, conversationId, command (redacted)

High-conviction RCA (waiting on these logs to confirm per-incident)

  1. Model invokes the bash tool with a command like git push.
  2. shell.ts spawns bash -c … detached in its own process group with a 120s default timeout.
  3. The git command triggers .git/hooks/pre-push (lint + typecheck + tests) which routinely exceeds 120s.
  4. The timeout fires process.kill(-pid, SIGKILL) on the whole group.
  5. Bash dies before reaping git's grandchildren (git-remote-http, pre-push, hook-spawned sleep/tail/timeout, etc.).
  6. Those grandchildren reparent to PID 1 — but PID 1 is bun --smol run src/daemon/main.ts, which does not install a SIGCHLD reaper, so they accumulate as <defunct>.

Rate ~5 zombies/hour matches the 118/26h incident the user reported. The leak does not trace back to the daemon's own git-service.execGit (uses execFileAsync with a 10s timeout; repros confirm zero zombies from that path). Every observed zombie is model-issued git through the bash tool.

What is not in this PR

  • No PID-1 reaper (tini or otherwise). That's ATL-746, parked at Medium pending these logs.
  • No kill-semantics change. Group-SIGKILL behavior is unchanged.
  • No timeout change. The 120s default is independently questionable for git push against a heavy pre-push hook, but tradeoffs there are separate.

Test plan

  • assistant/src/__tests__/shell-observability.test.ts — 4 tests covering foreground exit log, timeout → killTree warn + exit log, background mode exit log, abort → killTree warn with reason="abort".
  • Existing shell-credential-ref.test.ts continues to pass (9/9).
  • Pre-existing failures in terminal-tools.test.ts (buildSanitizedEnv) are on origin/main, not introduced here.

Files touched

  • assistant/src/tools/terminal/shell.tsbuildKillTree signature change + logShellExit helper; 5 kill call sites wired.
  • assistant/src/__tests__/shell-observability.test.ts — new, 4 tests.
  • assistant/AGENTS.md — narrows the test-isolation rule to preload-time infrastructure only.

🚀

Adds three structured log signals so the next zombie-accumulation
report points at a specific call site instead of a guess:

1. buildKillTree() warns on every group-SIGKILL with reason
   (timeout/abort/spawn_error), command, conversationId, groupPid,
   and durationMs. This is the ground-truth event that creates
   orphans.

2. logShellExit() info log on every shell teardown — pairs with
   the existing 'Executing shell command' start log so every
   invocation has a start/exit pair carrying mode (fg/bg), exitCode,
   signal, timedOut, durationMs, spawnError.

3. Periodic /proc-based zombie sampler (daemon/zombie-process-
   sampler.ts) — every 5 min counts zombies parented to the daemon,
   breaks down by comm, warns above a threshold. Linux-only;
   no-op on macOS desktop.

No semantics change to the bash tool — orphans still happen, but
now they're observable.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23e30ddbdb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +17 to +19
import type { WakeOptions } from "../runtime/agent-wake.js";
import type { BackgroundTool } from "../tools/background-tool-registry.js";
import type { Tool } from "../tools/types.js";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Remove extra production imports from the shell test

assistant/AGENTS.md says files under __tests__/ must not import from src/ except for the production module they are testing. This new test imports agent-wake, background-tool-registry, and tools/types only to type its mocks, coupling the test harness to production modules outside shell.ts; if those modules gain import-time side effects, they can run before the mocks/workspace isolation this rule is meant to protect. Duplicate the small mock shapes locally or derive types from the shell module under test instead.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should update the AGENTS.md, this should only be strictly true preload's

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated assistant/AGENTS.md in cbbc0b7 to scope the strict no-src/ rule to preload/verifier/helpers only — *.test.ts files run after preload inside the workspace override, so the inverted-invariants concern doesnt apply to them. Imports stay as-is.

*/
import { afterEach, beforeEach, describe, expect, mock, test } from "bun:test";

import { isLinux } from "../../util/platform.js";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid importing platform utilities in this sampler test

assistant/AGENTS.md forbids __tests__/ files from importing production src/ modules other than the module being tested. This test is for zombie-process-sampler.ts, but it also imports ../../util/platform.js; that reintroduces the production/test coupling the guideline calls out and can run unrelated import-time code before test isolation is established. Use process.platform === "linux" or a local helper in the test instead.

Useful? React with 👍 / 👎.

Comment on lines +17 to +19
import type { WakeOptions } from "../runtime/agent-wake.js";
import type { BackgroundTool } from "../tools/background-tool-registry.js";
import type { Tool } from "../tools/types.js";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should update the AGENTS.md, this should only be strictly true preload's

*
* Exposed for testing — production code calls the singleton sampler.
*/
export function sampleZombies(parentPid: number): ZombieSample {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed my mind, let's keep the shell improvements but remove the zombie sampler

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — cbbc0b7 drops the sampler, its test, and the lifecycle wiring. PR now only adds the shell-side observability (buildKillTree warn + logShellExit info).

Per PR review:
- Remove zombie-process-sampler.ts + test + lifecycle wiring.
  Keeping just the shell-side observability (buildKillTree + logShellExit).
- Update assistant/AGENTS.md so the strict 'no src/ imports' rule
  applies only to preload-time infrastructure (preload, verifier,
  shared helpers), not to ordinary *.test.ts files.
@dvargasfuertes dvargasfuertes merged commit 5712c7a into main May 29, 2026
13 checks passed
@dvargasfuertes dvargasfuertes deleted the apollo/observe-subprocess-orphans branch May 29, 2026 20:46
vellum-apollo-bot Bot added a commit that referenced this pull request May 29, 2026
PR #32595 ('Observability for subprocess orphans') merged into main
after this branch opened. It introduced shell-observability.test.ts
using `let shellTool: Tool` + `shellTool.execute(...)` — the same
pattern fad576a fixed in the other shell tests. CI's typecheck on
the merge commit catches this; rebasing surfaces the file so the fix
goes in-branch.

Same mechanical fix: `Tool` → `ToolDefinition` for the variable
type, `execute!` on the four call sites.
dvargasfuertes pushed a commit that referenced this pull request May 30, 2026
…type core tools (#32631)

* refactor(tools): move name to ToolDefinition, delete ToolManifest, retype core tools

Closes the ToolDefinition refactor cycle started by #32362 (executionMode)
and #32472 (category) — every tool author writes a `ToolDefinition`
literal, the registry finalizes it into a `Tool`.

## name → ToolDefinition (PR #32472 follow-up review comment)

`name?: string` is now an optional field on `ToolDefinition` with a
JSDoc note that loaders default it to the source file basename
(`tools/read.ts` → `read`). `Tool` collapses from
`Required<ToolDefinition> & { name: string }` to bare
`Required<ToolDefinition>` since name is now part of the base shape.

`finalizeTool(tool, defaultName)` now prefers `tool.name` over the
file-derived default, so a literal that overrides the file name still
wins.

## Delete ToolManifest

The Zod-derived `ToolManifest` type in `ipc/skill-routes/registries.ts`
was a near-duplicate of `ToolDefinition` minus `execute` (functions
don't cross IPC). Replaced with `ToolDefinition` directly; renamed the
schema to `WireToolDefinitionSchema` to reflect that it's the wire form
of the canonical type. `buildProxyTool(definition: ToolDefinition)`
uses `!` assertions on fields the Zod schema guarantees are set.

## Core tools as ToolDefinition

Every core tool literal — 22 `class XxxTool implements ToolDefinition`
sites plus 14 `: ToolDefinition = { ... }` object literals across
`apps/`, `ui-surface/`, `computer-use/` definitions — now uses the
relaxed authoring contract. Aggregate lists (`explicitTools`,
`cesTools`, `coreAppProxyTools`, `allUiSurfaceTools`) are typed
`ToolDefinition[]`.

`registerTool` widens to accept `ToolDefinition` and finalizes
internally via a `WeakMap<ToolDefinition, Tool>` so idempotent
re-registration (test reset helpers, ESM module re-imports) stays the
same silent no-op the old `existing === tool` check provided. Throws
if `tool.name` is missing — the file-basename default is a loader
concern, not a registry concern.

## Verified locally

- `eslint src/tools src/ipc/skill-routes/registries.ts` — clean.
- `bun test plugin-bootstrap plugin-tool-contribution plugin-types
  registry skill-tool-manifest tool-executor tool-approval-handler
  conversation-skill-tools conversation-app-control-instantiation
  skill-projection-feature-flag` — 263/263 pass.
- Broader `bun test src/tools/` reproduces a 410/489 pass baseline
  identical to `origin/main` HEAD — no new regressions.

* fix(tools): typecheck after ToolDefinition widening

PR #32472 widened `ToolDefinition` (made name/description/etc.
optional). This commit fixes the cascade of typecheck failures
in core-tool tests + two production-side mismatches.

## Registry returns Tool[], not ToolDefinition[]

`getAllToolDefinitions()` and `getMcpToolDefinitions()` previously
returned `ToolDefinition[]`. At runtime they always return finalized
`Tool` objects (full Required<ToolDefinition>); the loose typing was
lying. Tightening to `Tool[]` fixes:

- `conversation.ts` and `conversation-tool-setup.ts` `.name` reads
  that became `string | undefined` under the widened type.
- Cross-package assignability to `@vellumai/skill-host-contracts`'s
  stricter `ToolDefinition` (which requires name/description/input_schema).
- `btw-routes.ts` passing the result to `runBtwSidechain` whose
  `tools` parameter is the contracts-package shape.

## buildProxyTool takes Zod-inferred wire shape

`buildProxyTool(definition: ToolDefinition)` was being passed elements
from `z.array(WireToolDefinitionSchema)`. The Zod-inferred type has
strict field types (e.g. `defaultRiskLevel: "low"|"medium"|"high"`)
that don't match assistant's enum-typed `ToolDefinition` even though
they're structurally equivalent. Switching the parameter to
`z.infer<typeof WireToolDefinitionSchema>` aligns the types and drops
the `!` assertions / `as` casts that the prior version needed.

## Tests: ! on .execute() calls and tool.description

13 test files invoke `.execute()` directly on imported tool exports
typed as `ToolDefinition`. Added non-null assertions matching the
runtime invariant — these tools are class instances or fully-specified
literals where every field is implemented.

Also: `computer-use-tools.test.ts` `schema()` helper now accepts
`input_schema?: object` instead of requiring it; `background-shell-bash`
and `terminal-tools` local `shellTool: Tool` declarations changed to
`shellTool: ToolDefinition` matching the module export type;
`credential-execution-tools` drops an explicit `(t: Tool)` callback
annotation that tripped the variance check.

* fix(tools): update shell-observability test to widened ToolDefinition

PR #32595 ('Observability for subprocess orphans') merged into main
after this branch opened. It introduced shell-observability.test.ts
using `let shellTool: Tool` + `shellTool.execute(...)` — the same
pattern fad576a fixed in the other shell tests. CI's typecheck on
the merge commit catches this; rebasing surfaces the file so the fix
goes in-branch.

Same mechanical fix: `Tool` → `ToolDefinition` for the variable
type, `execute!` on the four call sites.

* refactor(tools): convert class-based tools to ToolDefinition literals + consolidate schemas

Replaces all 22 class-based tools with plain object literals typed as
`ToolDefinition`, addressing PR review feedback. Class+instance
indirection is gone; `finalizeTool` no longer has anything to do with
prototypes.

Schema consolidation:
- `ToolDefinitionSchema` (tools/types.ts) is the single source of truth.
  `WireToolDefinitionSchema = ToolDefinitionSchema.required({...})` with
  `executionTarget` optional.
- `ipc/skill-routes/registries.ts` imports the canonical schema; the
  local duplicate and `RiskLevel` cast are removed
  (`z.enum(RiskLevel)` infers as the native enum).

Refactors:
- ask-question: constructor-injected `prompterFactory` becomes a
  `createAskQuestionTool()` factory; default export `askQuestionTool`.
  Test file updated to import the literal + factory.
- host-filesystem/transfer: `executeLocal` lifted from a private method
  to a module-level function (registry stores finalized literal
  references, so `this`-based dispatch is unavailable).
- 20 other tools converted to literals (zero `this.` usage).

Tests:
- ask-question 23/23, host-filesystem 29/29, memory 14/14, network
  web-search 36/36 pass in isolation. Broad-batch failures are
  pre-existing Bun mock-leakage: baseline at merge-base shows
  167 fail / 537 tests; this branch shows 79 fail / 489 tests under
  the same conditions.

* fix(tools): make ToolDefinition.execute required + widen input_schema to object

CI typecheck caught three regressions from the schema-driven type:

1. `ExecutionTarget` type import was removed by accident while
   restructuring the import block. Restored.

2. `ToolDefinition.execute` was inferred as optional (because the
   schema doesn't include execute), so every test calling
   `tool.execute(...)` failed with TS18048. Changed the type overlay
   to require execute — every in-process tool literal has one, and the
   wire schema still parses without it (closures can't cross IPC;
   `buildProxyTool` synthesizes execute on arrival).

3. `input_schema` was inferred as `Record<string, unknown>` from
   `z.record(z.string(), z.unknown())`, which is stricter than the
   prior `object` interface field. Factories and tests assigning
   typed JSON-schema objects (`as object`) failed. Widened the
   overlay to `input_schema?: object` so authors can keep using JSON
   schema literals; wire form still parses to `Record<string,
   unknown>` via `WireToolDefinitionSchema`.

Also updated `managed-skill-lifecycle.test.ts` to use the literal
`skillLoadTool` import instead of the (removed) `SkillLoadTool`
class.

* fix(tools): keep execute optional on ToolDefinition, bang test call sites

Reverting the previous `execute: required` change — there are
schema-only ToolDefinitions in the codebase (`graphRememberDefinition`,
`storeStyleAnalysisTool`, `SWEEP_TOOL`) that are handed to providers
as function-calling schemas without ever being registered for
execution. Requiring execute on the type would break those legitimate
use cases, plus the plugin-api contract test that asserts an empty
literal satisfies `ToolDefinition`.

Instead: keep `execute` optional, and add the `!` bang to test
call sites (mirroring the existing pattern in `host-filesystem/*.test.ts`).
Sed-replaced 111 call sites across 8 test files. Affected tools:
credentialStoreTool, fileListTool, makeAuthenticatedRequestTool,
recallTool, rememberTool, runAuthenticatedCommandTool, webSearchTool.

Also added a comment in types.ts documenting why execute is optional
and the schema-only ToolDefinition pattern, with pointers to the
three existing instances.

* fix(tools): bang skillLoadTool.execute in managed-skill-lifecycle test

* refactor(tools): consolidate to one schema + finalizeTool at IPC boundary

Round 2 feedback on PR #32631: kill the dual-schema design.

types.ts:
- Delete WireToolDefinitionSchema and its doc block.
- ToolDefinitionSchema is the single source of truth — all fields
  optional, both author and wire paths share it.
- Update ToolDefinition's doc to reflect the new contract and note the
  `satisfies ToolDefinition` pattern.

registries.ts (skill-routes):
- Drop the Wire-level-schemas comment block.
- Use ToolDefinitionSchema in z.array().
- Delete buildProxyTool + WireToolDefinition type.
- Replace tools.map(buildProxyTool) with
  tools.map((t) => finalizeTool(t, t.name ?? "")) — finalizeTool already
  synthesizes a no-op error closure for the no-supervisor path.

Tool sources (28 files: terminal/shell, host-terminal/host-shell,
host-filesystem/*, filesystem/*, credentials/vault, credential-execution/*,
memory/register, memory/graph/tools, memory/v2/sweep-job,
messaging/style-analyzer, ui-surface, computer-use/definitions,
network/{web-fetch,web-search}, system/request-permission,
subagent/notify-parent, skills/execute, apps/definitions):
- Switch from `: ToolDefinition = { ... }` to
  `... } satisfies ToolDefinition` so the inferred per-export type
  preserves `execute` as required at call sites that statically import
  the literal. The schema-only definitions (graphRememberDefinition,
  storeStyleAnalysisTool, SWEEP_TOOL) intentionally keep `execute`
  absent — ToolDefinition itself still has it optional.

Test files (27 files):
- Drop the `.execute!(` bang now that satisfies preserves the required
  type at the call site.
- Pattern A refactor: replace inline `await import("../tools/...")` in
  beforeEach with module-scope static imports for background-shell-bash
  and terminal-tools. shell-observability keeps a dynamic import because
  the test mocks `../util/logger.js` and shell.ts captures the logger at
  module-eval time — static import there would hoist past mock.module()
  and the test would see the real logger. Type-narrow via
  `typeof import(...)["shellTool"]` so no bang is needed.

Intentional reverts:
- openai-responses-provider.test.ts and plugin-api-tool-definition.test.ts
  keep `: ToolDefinition = { ... }` — they assert type shape (including
  the empty-literal contract test) rather than calling .execute, so the
  widened type is the right tool there.

registries.test.ts:
- "rejects missing required fields" → "fills defaults for partial tool
  entries": partial entries no longer reject (one all-optional schema +
  finalizeTool fills defaults).
- "proxy execute throws when no supervisor is attached" → "proxy execute
  surfaces an error result when no supervisor is attached": finalizeTool
  synthesizes a no-op error closure now, not a throw.

* fix(tools): convert skillLoadTool + ask-question to satisfies, finalize before supervisor short-circuit

Round 2 CI follow-up — three type-check failures from PR #32631 round 2:

1. `skillLoadTool` (skills/load.ts) and `askQuestionTool`
   (ask-question/ask-question-tool.ts) source files were missed in the
   first satisfies pass:
   - `skills/load.ts` contains a regex literal (`/!`[^`]*`/g`) inside the
     execute body. The AST converter at scratch/satisfies-converter.ts
     doesn't track regex literals, so its brace matcher walked through
     the backticks inside the character class, lost depth tracking, and
     silently emitted "no spots found". Converted manually.
   - `ask-question/ask-question-tool.ts` wraps a factory call
     (`createAskQuestionTool()` returns ToolDefinition). The cleanest
     fix is to drop the factory's explicit return annotation and apply
     `satisfies ToolDefinition` inside the returned literal so the
     inferred return type carries the narrow shape through the export.

2. `ipc/skill-routes/registries.ts` supervisor short-circuit log + return
   read `tools.map((t) => t.name)` where `t.name` is `string | undefined`
   per the all-optional schema. Pulled `finalizeTool` ahead of the
   supervisor branch so both paths see a `Tool[]` (name guaranteed).

3. `ipc/skill-routes/__tests__/registries.test.ts` "fills defaults"
   assertion compared `installed.defaultRiskLevel` (typed `RiskLevel`)
   against the string literal `"medium"` — strict enum compare needs
   `RiskLevel.Medium`.

* fix(tests): drop widening return type from makeToolWithStub in ask-question test

The helper had `tool: ToolDefinition` in its explicit return type, which
collapses the satisfies-narrowed shape returned by `createAskQuestionTool()`
back to the public author-facing shape (execute optional). Remove the
explicit return type so callers see the inferred narrow shape and can
invoke `tool.execute(...)` without a `!` bang.

Also drop the now-unused `ToolDefinition` type import.

---------

Co-authored-by: vellum-apollo-bot[bot] <242025090+vellum-apollo-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant