Skip to content

fix(pty-daemon): raise daemon RLIMIT_NOFILE and surface real spawn errno#5067

Merged
saddlepaddle merged 1 commit into
mainfrom
fix-superset-spawn-errors
Jun 3, 2026
Merged

fix(pty-daemon): raise daemon RLIMIT_NOFILE and surface real spawn errno#5067
saddlepaddle merged 1 commit into
mainfrom
fix-superset-spawn-errors

Conversation

@saddlepaddle
Copy link
Copy Markdown
Collaborator

@saddlepaddle saddlepaddle commented Jun 3, 2026

Problem

Users on dense shared hosts were getting this from the SDK:

open <uuid>: spawn failed (shell=/bin/zsh cwd=/Users/.../worktrees/.../public-method): posix_spawnp failed.

Diagnosis

  • The committed cwd pre-flight in Pty.spawn (statSync) had already passed, so the cwd existed — this is not a deleted/missing worktree.
  • node-pty's native pty.fork throws a bare Error("posix_spawnp failed.") with .code: undefined, .errno: undefined — the errno is lost at the boundary.
  • Empirically, a bad shell/cwd does not produce this message (node-pty spawns a helper; bad exec surfaces async via exit). posix_spawnp failed. is thrown specifically when spawning node-pty's helper fails — i.e. resource exhaustion (EMFILE/EAGAIN).
  • Prod DB confirms the scale: one org's daemon (thrive-holdings) was hosting 118 live worktrees on a single machine (202 across 7 hosts), creation rate accelerating (1 → 21 → 42 → 56/week), no pruning. The daemon is one process per org per host and never raised RLIMIT_NOFILE, against macOS's 256 soft default. ~40–50 worktrees with a live agent shell exhausts 256 fds, after which every new PTY spawn fails with EMFILE for every user on that daemon.

Changes

DaemonSupervisor.ts — raise RLIMIT_NOFILE
Spawn the daemon through /bin/sh -c 'ulimit -n 1048576 …; exec "$@"' (POSIX only; Windows path unchanged). exec preserves PID/argv/process-group (adoption-by-PID and ps matching unaffected), and handoff successors the daemon spawns from itself inherit the raised limit. 1048576 matches systemd's LimitNOFILE default; macOS clamps to the kernel ceiling.

Pty.ts — surface the real errno
On spawn failure, re-probe the same shell+cwd with spawnSync (which does report the errno) and include errno= in the wire error:

spawn failed (shell=/bin/zsh cwd=... errno=EMFILE): posix_spawnp failed.

Only runs on the (rare) failure path; resource errors like EMFILE are sticky enough to reproduce immediately.

Testing

  • typecheck (both packages) ✅ · bun run lint (4618 files) ✅ · pty-daemon unit tests 13/13 ✅
  • EMFILE path proven end-to-end: a real Server over a unix socket under node --experimental-strip-types at ulimit -n 256, fds exhausted → daemon PTY spawn fails → error frame correctly carries code: ESPAWN + errno=EMFILE. (Smoke test was removed rather than kept, since it can only run under a low ulimit and would self-skip in CI.)
  • The one failing host-service integration test (terminal disposal cleans up background process groups) is pre-existing — fails identically with this change stashed.

Open in Stage

Summary by cubic

Raise the daemon’s file descriptor limit to prevent EMFILE-triggered PTY spawn failures on dense hosts, and surface the real errno when a PTY spawn fails. This reduces crashes and makes errors easier to diagnose.

  • Bug Fixes
    • host-service: On POSIX, start the daemon via /bin/sh -c 'ulimit -n 1048576 || ulimit -n "$(ulimit -Hn)" || true; exec "$@"' so RLIMIT_NOFILE is raised before exec and inherited by children; Windows unchanged.
    • pty-daemon: On spawn failure, re-probe the same shell+cwd with spawnSync and include errno= in the error (e.g., EMFILE/EAGAIN/ENOENT); only runs on the failure path.

Written for commit f6536a1. Summary will update on new commits.

Review in cubic

Summary by CodeRabbit

  • Bug Fixes
    • Improved error reporting for terminal creation failures with more detailed diagnostic information.
    • Enhanced daemon stability on macOS by optimizing system resource utilization during startup.

Users on dense shared hosts hit "spawn failed (...): posix_spawnp failed."
from the SDK. node-pty's native error drops the errno, and the message is
thrown specifically when spawning node-pty's helper fails — i.e. resource
exhaustion (EMFILE), not a bad shell/cwd. One org daemon was hosting 100+
worktrees on a single machine against macOS's 256 fd soft default.

- DaemonSupervisor: spawn the daemon through `/bin/sh -c 'ulimit -n ...; exec'`
  (POSIX only) so RLIMIT_NOFILE is raised before exec; exec preserves PID/argv/
  process-group, and handoff successors inherit the raised limit.
- Pty.spawn: on failure, re-probe the same shell+cwd with spawnSync (which
  surfaces the real code) and include `errno=` in the wire error, so the next
  occurrence is diagnosable (EMFILE/EAGAIN/ENOENT) instead of a bare string.
@capy-ai
Copy link
Copy Markdown

capy-ai Bot commented Jun 3, 2026

Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews.

@stage-review
Copy link
Copy Markdown

stage-review Bot commented Jun 3, 2026

Ready to review this PR? Stage has broken it down into 2 individual chapters for you:

Title
1 Raise RLIMIT_NOFILE for the pty-daemon
2 Surface real errno on PTY spawn failure
Open in Stage

Chapters generated by Stage for commit f6536a1 on Jun 3, 2026 1:56am UTC.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 3, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

Two independent runtime improvements: daemon process spawning now raises file descriptor limits on non-Windows platforms before executing the daemon binary; and PTY spawn errors now include reprobed errno details to surface underlying shell failures that the native node-pty library would otherwise hide.

Changes

Daemon File Descriptor Limit Raising

Layer / File(s) Summary
Daemon spawn with platform-specific ulimit routing
packages/host-service/src/daemon/DaemonSupervisor.ts
Child process creation in DaemonSupervisor.spawn() now uses /bin/sh to run ulimit -n (with fallback to hard limit) before execing the Node executable on non-Windows, while Windows continues direct Node executable invocation. Spawn options and subsequent readiness/exit handling remain unchanged.

PTY Spawn Error Diagnosis

Layer / File(s) Summary
reprobeErrno helper for spawn diagnostics
packages/pty-daemon/src/Pty/Pty.ts
Local reprobeErrno(meta) helper added that uses child_process.spawnSync with stdio: "ignore" and 1s timeout to determine whether an error occurred during shell invocation; returns "ok" when no error, the error code (or message fallback), or "reprobe-failed:<message>" if reprobe itself throws.
Spawn error reporting with reprobed errno
packages/pty-daemon/src/Pty/Pty.ts
Pty.spawn() catch path updated to call reprobeErrno(meta) and embed errno=<...> in the thrown error message, preserving existing shell and cwd context.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A rabbit springs forth with clearer sight,
The daemon bounds with unlimited might,
And when errors brew in the PTY shell,
We probe deep to find what stories tell!
Two platforms smile as limits grow,
And diagnoses shine with reprobéd glow.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes both main changes: raising RLIMIT_NOFILE for the daemon and surfacing the real spawn errno, matching the core objectives.
Description check ✅ Passed The description comprehensively covers the problem, diagnosis, changes, and testing, but lacks formal sections matching the template structure (Related Issues, Type of Change, Screenshots).
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-superset-spawn-errors

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

This PR addresses fd exhaustion on dense shared hosts where the per-org daemon was started with macOS's default 256 soft RLIMIT_NOFILE, causing node-pty to fail silently with posix_spawnp failed once ~40–50 live PTY sessions were open. It also surfaces the real errno (e.g., EMFILE) that node-pty's native layer drops.

  • DaemonSupervisor.ts: Non-Windows daemon spawns are now wrapped in /bin/sh -c 'ulimit -n 1048576 … ; exec \"$@\"', raising the open-file limit before exec hands off to node — the PID is preserved so adoption-by-PID logic is unaffected, and child processes inherit the raised limit.
  • Pty.ts: New reprobeErrno helper runs a spawnSync no-op immediately after a node-pty failure to recover the errno that the native layer swallows, embedding errno=EMFILE (or similar) in the wire-level error message.

Confidence Score: 4/5

Both changes are well-scoped to the failure path; the ulimit wrapper preserves PID and process-group semantics via exec, and the reprobe probe only fires on an already-errored spawn.

The ulimit shell wrapper is a battle-tested pattern, exec correctly excludes $0 so the node daemon is invoked with the right arguments, and the Windows code path is untouched. The reprobeErrno helper is non-invasive and only runs after node-pty has already thrown. The one rough edge is that a resolved/transient failure produces errno=ok in the error message, which reads oddly in an exception context but does not affect correctness.

No files require special attention; both changed files are straightforward and the logic is confined to daemon-startup and error-reporting paths.

Important Files Changed

Filename Overview
packages/host-service/src/daemon/DaemonSupervisor.ts Wraps non-Windows daemon spawn in /bin/sh with ulimit to raise RLIMIT_NOFILE before handing off to the node daemon; exec preserves the PID so adoption and ps-matching remain correct.
packages/pty-daemon/src/Pty/Pty.ts Adds reprobeErrno to recover the actual errno from a failed node-pty spawn by immediately re-running spawnSync; annotates the thrown error with errno= for actionable wire-level diagnostics. Minor: returns the string ok on success, which reads strangely inside an error message.

Sequence Diagram

sequenceDiagram
    participant HS as DaemonSupervisor
    participant SH as /bin/sh wrapper
    participant D as pty-daemon (node)
    participant NT as node-pty native
    participant RE as reprobeErrno

    HS->>SH: spawn /bin/sh with ulimit script
    SH->>SH: ulimit -n 1048576 (raise RLIMIT_NOFILE)
    SH->>D: exec node script --socket (PID preserved, limit inherited)
    D->>NT: nodePty.spawn(shell, argv, cwd)
    alt spawn succeeds
        NT-->>D: IPty handle
    else spawn fails e.g. EMFILE
        NT-->>D: Error posix_spawnp failed, errno undefined
        D->>RE: spawnSync(shell, cwd, timeout 1000)
        RE-->>D: "probe.error.code = EMFILE"
        D-->>HS: "Error spawn failed errno=EMFILE"
    end
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
packages/pty-daemon/src/Pty/Pty.ts:152
**`"ok"` in error message is confusing**

When the re-probe succeeds (the EMFILE condition resolved between node-pty's failure and `spawnSync`), the thrown error message reads `errno=ok`, which is contradictory — the caller sees an exception but with a result string that implies success. A label like `"non-sticky"` or `"resolved"` would make it immediately clear that the condition was transient rather than suggesting the probe found no error.

Reviews (1): Last reviewed commit: "fix(pty-daemon): raise daemon RLIMIT_NOF..." | Re-trigger Greptile

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/pty-daemon/src/Pty/Pty.ts`:
- Around line 147-151: The reprobeErrno probe currently calls
childProcess.spawnSync(meta.shell, ["-c", ":"], { cwd: meta.cwd, timeout: 1000,
stdio: "ignore" }) without passing environment, so update the options to include
env: meta.env to match nodePty.spawn behavior; specifically modify the spawnSync
call inside reprobeErrno to pass the meta.env property (env: meta.env) alongside
cwd, timeout, and stdio so environment-driven errno results are reproduced
correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e067bcb7-4917-4aa7-be25-74fd54ae9fa3

📥 Commits

Reviewing files that changed from the base of the PR and between de57341 and f6536a1.

📒 Files selected for processing (2)
  • packages/host-service/src/daemon/DaemonSupervisor.ts
  • packages/pty-daemon/src/Pty/Pty.ts

Comment on lines +147 to +151
const probe = childProcess.spawnSync(meta.shell, ["-c", ":"], {
cwd: meta.cwd,
timeout: 1000,
stdio: "ignore",
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify parity between nodePty.spawn and reprobeErrno spawnSync options in Pty.ts
fd -i 'Pty.ts' | xargs -I{} sh -c '
  echo "== {} ==";
  rg -n -C3 "nodePty\\.spawn\\(|spawnSync\\(" "{}";
  rg -n -C2 "env:\\s*meta\\.env" "{}";
'

Repository: superset-sh/superset

Length of output: 958


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Matched files by fd -i 'Pty.ts':"
fd -i 'Pty\.ts$' .

echo
echo "---- reprobeErrno block ----"
sed -n '120,220p' packages/pty-daemon/src/Pty/Pty.ts | cat -n

echo
echo "---- Search for spawnSync options containing env: meta.env ----"
rg -n --context 2 "spawnSync\\(" packages/pty-daemon/src/Pty/Pty.ts
rg -n --context 2 "env:\\s*meta\\.env" packages/pty-daemon/src/Pty/Pty.ts

Repository: superset-sh/superset

Length of output: 4552


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "type SessionMeta|interface SessionMeta|SessionMeta" packages/pty-daemon/src/Pty/Pty.ts packages/pty-daemon/src -S

Repository: superset-sh/superset

Length of output: 1959


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1,120p' packages/pty-daemon/src/protocol/messages.ts | nl -ba

Repository: superset-sh/superset

Length of output: 107


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1,120p' packages/pty-daemon/src/protocol/messages.ts | cat -n

Repository: superset-sh/superset

Length of output: 3608


Preserve spawn parity in reprobeErrno by passing meta.env to spawnSync.

reprobeErrno probes with cwd only, while the real nodePty.spawn uses env: meta.env; dropping env can misreport the errno for environment-driven failures.

Suggested patch
 function reprobeErrno(meta: SessionMeta): string {
 	try {
 		const probe = childProcess.spawnSync(meta.shell, ["-c", ":"], {
 			cwd: meta.cwd,
+			env: meta.env,
 			timeout: 1000,
 			stdio: "ignore",
 		});
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const probe = childProcess.spawnSync(meta.shell, ["-c", ":"], {
cwd: meta.cwd,
timeout: 1000,
stdio: "ignore",
});
const probe = childProcess.spawnSync(meta.shell, ["-c", ":"], {
cwd: meta.cwd,
env: meta.env,
timeout: 1000,
stdio: "ignore",
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/pty-daemon/src/Pty/Pty.ts` around lines 147 - 151, The reprobeErrno
probe currently calls childProcess.spawnSync(meta.shell, ["-c", ":"], { cwd:
meta.cwd, timeout: 1000, stdio: "ignore" }) without passing environment, so
update the options to include env: meta.env to match nodePty.spawn behavior;
specifically modify the spawnSync call inside reprobeErrno to pass the meta.env
property (env: meta.env) alongside cwd, timeout, and stdio so environment-driven
errno results are reproduced correctly.

timeout: 1000,
stdio: "ignore",
});
if (!probe.error) return "ok";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 "ok" in error message is confusing

When the re-probe succeeds (the EMFILE condition resolved between node-pty's failure and spawnSync), the thrown error message reads errno=ok, which is contradictory — the caller sees an exception but with a result string that implies success. A label like "non-sticky" or "resolved" would make it immediately clear that the condition was transient rather than suggesting the probe found no error.

Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/pty-daemon/src/Pty/Pty.ts
Line: 152

Comment:
**`"ok"` in error message is confusing**

When the re-probe succeeds (the EMFILE condition resolved between node-pty's failure and `spawnSync`), the thrown error message reads `errno=ok`, which is contradictory — the caller sees an exception but with a result string that implies success. A label like `"non-sticky"` or `"resolved"` would make it immediately clear that the condition was transient rather than suggesting the probe found no error.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 2 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/pty-daemon/src/Pty/Pty.ts">

<violation number="1" location="packages/pty-daemon/src/Pty/Pty.ts:148">
P2: Errno reprobe does not use the same environment as the failing spawn, so reported errno can be incorrect/misleading.</violation>

<violation number="2" location="packages/pty-daemon/src/Pty/Pty.ts:152">
P3: Returning `"ok"` here produces a confusing error message like `errno=ok` in the thrown exception — the caller sees an error but the embedded errno label suggests success. Use a label like `"transient"` or `"non-reproducible"` to clarify that the condition was not replicable at reprobe time.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

function reprobeErrno(meta: SessionMeta): string {
try {
const probe = childProcess.spawnSync(meta.shell, ["-c", ":"], {
cwd: meta.cwd,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Errno reprobe does not use the same environment as the failing spawn, so reported errno can be incorrect/misleading.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/pty-daemon/src/Pty/Pty.ts, line 148:

<comment>Errno reprobe does not use the same environment as the failing spawn, so reported errno can be incorrect/misleading.</comment>

<file context>
@@ -142,6 +142,21 @@ function validateDims(cols: number, rows: number): void {
+function reprobeErrno(meta: SessionMeta): string {
+	try {
+		const probe = childProcess.spawnSync(meta.shell, ["-c", ":"], {
+			cwd: meta.cwd,
+			timeout: 1000,
+			stdio: "ignore",
</file context>
Suggested change
cwd: meta.cwd,
cwd: meta.cwd,
env: meta.env,

timeout: 1000,
stdio: "ignore",
});
if (!probe.error) return "ok";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: Returning "ok" here produces a confusing error message like errno=ok in the thrown exception — the caller sees an error but the embedded errno label suggests success. Use a label like "transient" or "non-reproducible" to clarify that the condition was not replicable at reprobe time.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/pty-daemon/src/Pty/Pty.ts, line 152:

<comment>Returning `"ok"` here produces a confusing error message like `errno=ok` in the thrown exception — the caller sees an error but the embedded errno label suggests success. Use a label like `"transient"` or `"non-reproducible"` to clarify that the condition was not replicable at reprobe time.</comment>

<file context>
@@ -142,6 +142,21 @@ function validateDims(cols: number, rows: number): void {
+			timeout: 1000,
+			stdio: "ignore",
+		});
+		if (!probe.error) return "ok";
+		const e = probe.error as NodeJS.ErrnoException;
+		return e.code ?? e.message;
</file context>
Suggested change
if (!probe.error) return "ok";
if (!probe.error) return "non-reproducible";

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

  • ✅ Neon database branch

Thank you for your contribution! 🎉

@saddlepaddle saddlepaddle merged commit b1a6e4d into main Jun 3, 2026
16 of 17 checks passed
saddlepaddle added a commit that referenced this pull request Jun 3, 2026
Bundled fixes since 0.2.21 (no command-surface changes):
- #5067 pty-daemon: raise RLIMIT_NOFILE + surface real spawn errno
- #5066 projects: restrict deletion to organization owners
- #5055 route organization creation through auth hooks

Also syncs bun.lock with the desktop 1.12.2 / host-service 0.8.18 bump
that landed in e173c89 without a lockfile update.
saddlepaddle added a commit that referenced this pull request Jun 3, 2026
* chore(desktop): bump version to 1.12.2 (host-service 0.8.17 -> 0.8.18)

* release(cli): cut cli v0.2.22

Bundled fixes since 0.2.21 (no command-surface changes):
- #5067 pty-daemon: raise RLIMIT_NOFILE + surface real spawn errno
- #5066 projects: restrict deletion to organization owners
- #5055 route organization creation through auth hooks

Also syncs bun.lock with the desktop 1.12.2 / host-service 0.8.18 bump
that landed in e173c89 without a lockfile update.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant