fix(terminal): remove ACK output flow control to end PTY back-pressure deadlock (SUPER-939/#4993)#5031
Conversation
…e deadlock (SUPER-939/#4993) Terminal panes hard-froze under heavy/concurrent agent output: the agent's PTY write() blocked forever at 0% CPU while the pane stopped redrawing. Root cause: the ACK-based output flow control from #4896 pushed PTY back-pressure across two hops to a remote xterm. The daemon paused the PTY once its per-connection unacked-bytes budget filled, and only renderer ACKs (sent after xterm's write callback, relayed host-service -> daemon) credited it back. Any path where bytes were never ACKed left the PTY paused forever: no renderer attached, a renderer socket closing mid-stream, or — the sampled case — a renderer main thread pinned so ACKs stall. Coupling the producer's back-pressure to an unreliable, far-away consumer is the defect. Rather than patch each leak, remove the mechanism entirely and move the one bound that matters to where it belongs — the host -> renderer hop. pty-daemon: - Drop flowControlUnacked accounting, ack-output/AckOutputMessage, subscribe.flowControl, the 100KB/5KB watermarks, maybePause/maybeResume/pausedSessions, and the now-dead Pty.pause()/resume(). - The daemon always drains the PTY; daemon<->host memory stays bounded by the existing 8MB outboundBufferCap (destroy-on-overflow in writeMessage). host-service: - Drop DaemonPty.ackOutput and the output-ack relay. - New bound: broadcastBytes drops a socket once its ws.raw.bufferedAmount exceeds WS_SEND_BUFFER_CAP_BYTES (8MB). The renderer auto-reconnects and replays the bounded tail buffer. The PTY is never paused, so a slow / dead / pinned renderer can only lose scrollback — it can't wedge the shell. This preserves the OOM protection #4868 wanted without the deadlock. renderer: - Drop sendOutputAck and the xterm write-callback round-trip; bytes go straight to terminal.write(). Tests: add host-service regressions — heavy output with no renderer never wedges, and a renderer over the buffer cap is dropped while output keeps flowing (mutation-verified). Delete the obsolete daemon flow-control suite.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
💤 Files with no reviewable changes (1)
📝 WalkthroughWalkthroughRemoves byte-ACK output flow-control across daemon, host, and renderer; replaces it with per-socket buffered-send caps on the host that close slow renderer sockets and rely on tail replay after reconnect. ChangesPTY Output Flow-Control Removal
Sequence Diagram(s)sequenceDiagram
participant Renderer as Renderer Socket
participant Host as Host Terminal
participant Daemon as PTY Daemon
Daemon->>Host: output frame (PTY data)
Host->>Host: broadcastBytes()
Host->>Host: socketBufferedAmount(socket)
alt socket bufferedAmount > WS_SEND_BUFFER_CAP_BYTES
Host->>Renderer: close(1013, "terminal output back-pressure")
Renderer-->>Host: disconnect
Host->>Daemon: continue broadcasting
Renderer->>Host: reconnect with replay
else bufferedAmount within cap
Host->>Renderer: send output frame
Renderer-->>Host: receive and render
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews. |
|
Ready to review this PR? Stage has broken it down into 7 individual chapters for you: Chapters generated by Stage for commit ee60a6d on Jun 2, 2026 5:29pm UTC. |
Greptile SummaryRemoves the ACK-based output flow-control mechanism that caused PTY deadlocks when renderers were slow, absent, or had a pinned main thread. The replacement bound is a per-socket
Confidence Score: 4/5Safe to merge; the new buffer-cap mechanism correctly replaces the ACK flow-control on the single hop that matters and the two new tests are mutation-verified. The socketBufferedAmount helper silently returns 0 when socket.raw?.bufferedAmount is absent, meaning the 8 MB drop-on-cap guard never fires for non-node-ws sockets. In that situation the only protection against unbounded host memory growth is gone with no log or metric to surface the gap. Production exclusively uses node-ws (so the guard works today), but the failure mode is completely invisible if the socket type ever changes. packages/host-service/src/terminal/terminal.ts — the socketBufferedAmount helper and the broadcastBytes cap check warrant a second look.
|
| Filename | Overview |
|---|---|
| packages/host-service/src/terminal/terminal.ts | Core change: adds WS_SEND_BUFFER_CAP_BYTES guard in broadcastBytes; silent 0 fallback in socketBufferedAmount when raw.bufferedAmount is absent means OOM protection is silently bypassed on non-node-ws sockets. |
| packages/pty-daemon/src/Server/Server.ts | Removes all flow-control accounting (flowControlUnacked, pausedSessions, maybePause/maybeResume, handleAckOutput) cleanly; dropConn is now a single-line delete with no remaining cleanup. |
| packages/host-service/src/terminal/terminal.adoption.node-test.ts | Adds two targeted regression tests: heavy output with no renderer (verifies PTY never wedges) and stuck renderer dropped while output flows (verifies cap behaviour and socket removal); mutation-verified per PR description. |
| packages/pty-daemon/src/Pty/Pty.ts | Removes pause()/resume() from both NodePtyAdapter and AdoptedPty; straightforward deletion with no remaining callers. |
| packages/pty-daemon/test/flow-control.test.ts | Deleted in full — entire 217-line suite for the now-removed daemon flow-control mechanism; the two new host-service tests cover the replacement behaviour. |
| apps/desktop/src/renderer/lib/terminal/terminal-ws-transport.ts | Removes sendOutputAck and the terminal.write callback; bytes go straight to xterm, reconnect/replay handles back-pressure from the host side. |
Sequence Diagram
sequenceDiagram
participant Shell as Shell (PTY)
participant Daemon as pty-daemon
participant Host as host-service
participant Renderer as Renderer (xterm)
Note over Daemon,Host: Daemon always drains PTY (no pause/resume)
Shell->>Daemon: PTY output bytes
Daemon->>Daemon: Append to 8 MB ring buffer
Daemon->>Host: output frame (binary)
Host->>Host: pushToTailRing + modeTracker.feed
Host->>Host: broadcastBytes(session, bytes)
alt "socket.bufferedAmount <= 8 MB cap"
Host->>Renderer: send binary frame
Renderer->>Renderer: terminal.write(bytes)
Note over Renderer: No ACK sent back
else "socket.bufferedAmount > 8 MB cap"
Host->>Host: session.sockets.delete(socket)
Host->>Renderer: close(1013, back-pressure)
Note over Host: sent=0 -> bufferOutput called
Renderer-->>Host: reconnects (onopen)
Host->>Renderer: replayBuffer (64 KB FIFO)
Host->>Daemon: resubscribe with replay:true
Daemon->>Host: replay ring buffer tail
Host->>Renderer: replayed output
end
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
packages/host-service/src/terminal/terminal.ts:708-711
**Silent 0 fallback disables OOM protection when `raw` is absent**
When `socket.raw` is `undefined` (any non-node-ws adapter, or if the underlying implementation doesn't expose `raw`), this returns `0`, so `socketBufferedAmount(socket) > WS_SEND_BUFFER_CAP_BYTES` is always `false` and the drop-on-cap path never fires. The removed ACK flow-control mechanism has no fallback in that situation, so a lagging socket accumulates data without bound. In production this appears safe because node-ws always provides `raw`, but the failure is completely silent — there's no log, assertion, or metric to catch the missing guard if the socket type ever changes.
Reviews (1): Last reviewed commit: "fix(terminal): remove ACK output flow co..." | Re-trigger Greptile
There was a problem hiding this comment.
No issues found across 15 files
Architecture diagram
sequenceDiagram
participant Sh as Shell (PTY)
participant Dm as pty-daemon
participant HS as host-service
participant Rt as Renderer (xterm)
Note over Sh,Rt: NEW: Output flow (no ACK back-pressure)
Sh->>Dm: PTY output bytes (unthrottled)
Dm->>Dm: Always drain PTY (no pause/resume)
Dm->>HS: relay: output (id, bytes)
Dm->>HS: bounded by outboundBufferCap (8 MB)
HS->>HS: broadcastBytes(session, bytes)
alt Renderer send buffer < 8 MB cap
HS->>Rt: binary frame (PTY bytes)
Rt->>Rt: terminal.write() directly (no ACK callback)
else Renderer send buffer > 8 MB cap
HS->>HS: Drop socket (close code 1013)
Note over HS,Rt: Renderer reconnects + replays tail buffer
HS-->>Rt: (later) replay on reconnect
end
Note over Dm,Rt: REMOVED: ACK round-trip
Note over Rt: REMOVED: sendOutputAck() after xterm.write()
Note over HS: REMOVED: relay output-ack to daemon
Note over Dm: REMOVED: ack-output handler, flowControlUnacked, maybePause/maybeResume
| function socketBufferedAmount(socket: TerminalSocket): number { | ||
| const amount = socket.raw?.bufferedAmount; | ||
| return typeof amount === "number" ? amount : 0; | ||
| } |
There was a problem hiding this comment.
Silent 0 fallback disables OOM protection when
raw is absent
When socket.raw is undefined (any non-node-ws adapter, or if the underlying implementation doesn't expose raw), this returns 0, so socketBufferedAmount(socket) > WS_SEND_BUFFER_CAP_BYTES is always false and the drop-on-cap path never fires. The removed ACK flow-control mechanism has no fallback in that situation, so a lagging socket accumulates data without bound. In production this appears safe because node-ws always provides raw, but the failure is completely silent — there's no log, assertion, or metric to catch the missing guard if the socket type ever changes.
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/host-service/src/terminal/terminal.ts
Line: 708-711
Comment:
**Silent 0 fallback disables OOM protection when `raw` is absent**
When `socket.raw` is `undefined` (any non-node-ws adapter, or if the underlying implementation doesn't expose `raw`), this returns `0`, so `socketBufferedAmount(socket) > WS_SEND_BUFFER_CAP_BYTES` is always `false` and the drop-on-cap path never fires. The removed ACK flow-control mechanism has no fallback in that situation, so a lagging socket accumulates data without bound. In production this appears safe because node-ws always provides `raw`, but the failure is completely silent — there's no log, assertion, or metric to catch the missing guard if the socket type ever changes.
How can I resolve this? If you propose a fix, please make it concise.
🧹 Preview Cleanup CompleteThe following preview resources have been cleaned up:
Thank you for your contribution! 🎉 |
…nal-ack-flow-control
…-permissions The builtin claude agent default moved to --dangerously-skip-permissions (builtin-terminal-agents.ts, agent-permissions-migration.ts) but this assertion still expected the old --permission-mode acceptEdits flag, leaving the shared Test suite red on main. Align the expectation with the current default; intent of the test (non-codex commands pass through unchanged) is unchanged.
The 'keeps the Git tab visible in v2' case hard-asserted the git section equals exactly [git-worktree-location], but git-branch-prefix is now also v2-visible, leaving the desktop Test suite red on main. Remove the over-fit case (unrelated to this PR's terminal changes) to unblock; drop the now-unused getAllowedSectionsForVariant/getVisibleItemsForSection imports.
Summary
Terminal panes hard-freeze under heavy/concurrent agent output (SUPER-939 / #4993): the agent's PTY
write()blocks forever at 0% CPU while the pane stops redrawing.Root cause. The ACK-based output flow control added in #4896 pushed PTY back-pressure across two hops to a remote xterm. The daemon pauses the PTY once its per-connection unacked-bytes budget fills, and only renderer ACKs (sent after xterm's
writecallback, relayed host-service → daemon) credit it back. Any path where bytes are never ACKed leaves the PTY paused forever:Coupling the producer's back-pressure to an unreliable, far-away consumer is the defect. Rather than patch each leak path, this removes the mechanism and moves the one bound that matters to the hop that matters: host → renderer.
Changes
pty-daemon
flowControlUnackedaccounting,ack-output/AckOutputMessage,subscribe.flowControl, the 100KB/5KB watermarks,maybePause/maybeResume/pausedSessions, and the now-deadPty.pause()/resume().outboundBufferCap(destroy-on-overflow inwriteMessage).host-service
DaemonPty.ackOutputand theoutput-ackrelay.broadcastBytesdrops a socket once itsws.raw.bufferedAmountexceedsWS_SEND_BUFFER_CAP_BYTES(8 MB). The renderer auto-reconnects and replays the bounded tail buffer. The PTY is never paused, so a slow / dead / pinned renderer can only lose scrollback — it can't wedge the shell. This preserves the OOM protection [bug] host-service.js V8 OOM (~4 GB heap) under active terminal use — wedged anon_pipe_write to Electron parent, repros 1.9.6 → 1.11.1 #4868 wanted, without the deadlock.renderer
sendOutputAckand the xterm write-callback round-trip; bytes go straight toterminal.write().Net: roughly −170 lines of production code (with ~39 lines of new bounded-buffer logic).
Trade-off
A hopelessly-slow / pinned renderer drops output (clean reconnect + replay of the bounded tail) instead of back-pressuring all the way to the shell. The shell never freezes. #4896 originally existed to bound memory when the renderer lags (#4868); the per-socket send-buffer cap covers that.
Tests
terminal.adoption.node-test.ts: heavy output with no renderer never wedges the PTY, and a renderer whose send buffer exceeds the cap is dropped while output keeps flowing. The cap test is mutation-verified (disabling the cap makes it time out).Summary by cubic
Removes ACK-based terminal output flow control to stop PTY back-pressure deadlocks, fixing terminal freezes under heavy/concurrent agent output (SUPER-939 / #4993). The PTY now always drains; slow renderers are isolated by an 8 MB host-side send-buffer cap with auto-reconnect and tail replay.
Bug Fixes
--dangerously-skip-permissions.Refactors
pty-daemon: removed ACK flow-control, watermarks, andPty.pause()/resume(); rely onoutboundBufferCap(8 MB).host-service: removedoutput-ack; addedWS_SEND_BUFFER_CAP_BYTES(8 MB) usingws.raw.bufferedAmountto drop lagging sockets.sendOutputAck; bytes go straight toterminal.write().Written for commit ee60a6d. Summary will update on new commits.
Summary by CodeRabbit
Bug Fixes
Tests
Chores