http: arm idle timer on open so a stalled TLS handshake times out#30376
Conversation
|
Updated 8:05 PM PT - May 7th, 2026
❌ @robobun, your commit da6eeb0 has 1 failures in
🧪 To try this PR locally: bunx bun-pr 30376That installs a local version of the PR into your bun-30376 --bun |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds configurable HTTP idle timeout: exported idle_timeout_seconds can be set via BUN_CONFIG_HTTP_IDLE_TIMEOUT at thread start; sockets are armed on open and session rearm uses the configured timeout; Handler.onTimeout forwards to onLongTimeout. Adds regression test for stalled TLS handshake. ChangesHTTP idle timeout configuration and event handling
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/http/HTTPThread.zig`:
- Around line 228-232: Add a typed env-var accessor named
BUN_CONFIG_HTTP_IDLE_TIMEOUT in src/bun_core/env_var.zig following the same
pattern as BUN_CONFIG_DNS_TIME_TO_LIVE_SECONDS (an unsigned integer env_var),
then replace the manual bun.getenvZ + std.fmt.parseInt usage in HTTPThread.zig
with a call to that accessor's .get() method (i.e., use
BUN_CONFIG_HTTP_IDLE_TIMEOUT.get() via the bun.env_var) and assign the returned
value to bun.http.idle_timeout_seconds when present. Ensure the new accessor's
type and default/optional handling match the repo pattern and import/namespace
usage so HTTPThread.zig can call it directly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e4fc9cdb-52b3-4db6-8d20-7811b4024b48
📒 Files selected for processing (5)
src/http/HTTPContext.zigsrc/http/HTTPThread.zigsrc/http/h2_client/ClientSession.zigsrc/http/http.zigtest/cli/install/bun-install-stalled-tls.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/http/http.zig (2)
1857-2007:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSuccessful header reads still consume the old idle budget.
handleOnDataHeaders()only callssetTimeout()on the short-read path and before chunked-body parsing. If a complete header block arrives with no body bytes in the same packet, or we return after a 1xx response, the timer keeps counting from the previous write/open event and can fire earlier than the configured idle window.Suggested fix
pub fn handleOnDataHeaders( this: *HTTPClient, comptime is_ssl: bool, incoming_data: []const u8, ctx: *NewHTTPContext(is_ssl), socket: NewHTTPContext(is_ssl).HTTPSocket, ) void { log("handleOnDataHeader data: {s}", .{incoming_data}); + this.setTimeout(socket); var to_read = incoming_data; var needs_move = true;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/http/http.zig` around lines 1857 - 2007, handleOnDataHeaders() currently only resets the idle timer in the short-read and chunked-body branches, so successful header-only reads (e.g., a 1xx with no more bytes or a complete header block with no body) leave the previous timeout intact; call this.setTimeout(socket) before any early returns where no body processing will follow: specifically add this.setTimeout(socket) in the 1xx branch before the `return` when to_read.len == 0, and add this.setTimeout(socket) before the `return` in the block after cloneMetadata() when to_read.len == 0 (the header-only path that may call progressUpdate), and also ensure setTimeout is invoked before returning from the proxy-handshake path if startProxyHandshake() does not already do so; update handleOnDataHeaders to place these calls near the existing setTimeout usages so the idle timer is consistently refreshed on header-only reads.
1720-1740:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRe-arm the idle timer for proxied streaming uploads too.
.proxy_bodyonly refreshes the timeout in the.bytesbranch. A streamed request body goes through.stream => this.flushStream(...), so proxied streaming uploads can still time out mid-transfer even while bytes are being drained.Suggested fix
.proxy_body => { log("send proxy body", .{}); if (this.proxy_tunnel) |proxy| { + this.setTimeout(socket); switch (this.state.original_request_body) { .bytes => { - this.setTimeout(socket); - const to_send = this.state.request_body; const sent = proxy.write(to_send) catch return; // just wait and retry when onWritable! if closed internally will call proxy.onClose🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/http/http.zig` around lines 1720 - 1740, The proxy_body handler only re-arms the idle timeout in the .bytes branch, so streamed proxied uploads can time out; update the .proxy_body case for when this.proxy_tunnel and this.state.original_request_body == .stream to call this.setTimeout(socket) (the same timeout refresh used in the .bytes branch) immediately before invoking this.flushStream(is_ssl, socket) so streaming writes also renew the idle timer; reference the .proxy_body branch, this.setTimeout, this.flushStream, and this.proxy_tunnel to locate and apply the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/http/http.zig`:
- Around line 2216-2227: The ceil-to-minute math in setTimeout (function
setTimeout on HTTPClient) can overflow because idle_timeout_seconds is a c_uint
and (secs + 59) may wrap; fix by preventing overflow: either clamp
idle_timeout_seconds earlier to at most maxInt(c_uint) - 59 (so ((secs + 59) /
60) * 60 cannot overflow) or perform a widening cast/bounds check in setTimeout
(e.g., cast secs to a wider integer, check if secs > maxInt(c_uint) - 59 and
handle separately, or compute the minute-rounded value using safe arithmetic)
and then call socket.setTimeout with the safe result; update the code paths that
reference idle_timeout_seconds to use the chosen safe bound.
---
Outside diff comments:
In `@src/http/http.zig`:
- Around line 1857-2007: handleOnDataHeaders() currently only resets the idle
timer in the short-read and chunked-body branches, so successful header-only
reads (e.g., a 1xx with no more bytes or a complete header block with no body)
leave the previous timeout intact; call this.setTimeout(socket) before any early
returns where no body processing will follow: specifically add
this.setTimeout(socket) in the 1xx branch before the `return` when to_read.len
== 0, and add this.setTimeout(socket) before the `return` in the block after
cloneMetadata() when to_read.len == 0 (the header-only path that may call
progressUpdate), and also ensure setTimeout is invoked before returning from the
proxy-handshake path if startProxyHandshake() does not already do so; update
handleOnDataHeaders to place these calls near the existing setTimeout usages so
the idle timer is consistently refreshed on header-only reads.
- Around line 1720-1740: The proxy_body handler only re-arms the idle timeout in
the .bytes branch, so streamed proxied uploads can time out; update the
.proxy_body case for when this.proxy_tunnel and this.state.original_request_body
== .stream to call this.setTimeout(socket) (the same timeout refresh used in the
.bytes branch) immediately before invoking this.flushStream(is_ssl, socket) so
streaming writes also renew the idle timer; reference the .proxy_body branch,
this.setTimeout, this.flushStream, and this.proxy_tunnel to locate and apply the
change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 63176774-5534-4e2a-ace0-9ddaba83040e
📒 Files selected for processing (1)
src/http/http.zig
77b08a6 to
a8c2bdb
Compare
CI statusBuild 52708 (da6eeb0, rebased on main@9ed6e89b)
Only remaining hard error:
Pre-existing on
This PR doesn't touch abort/shutdown dispatch — for the default config it arms the same 5-minute long timer as before ( All other 52708 annotations are warnings (passed on retry): Earlier build 52690 (58696b0)
This PR's new test
|
There was a problem hiding this comment.
Thanks for addressing all the earlier feedback — the normalisation-at-load-time approach in a8c2bdb is clean and I have nothing further. Deferring to a human since this changes core HTTP client timer behaviour (new short-tick onTimeout handler, setTimeout refactor, new BUN_CONFIG_HTTP_IDLE_TIMEOUT knob) that runs on every fetch/install request.
Extended reasoning...
Overview
This PR fixes a real bun install hang (#30325) where a stalled TLS handshake leaves the request blocked forever because the idle timer wasn't armed until onWritable (post-handshake). It touches:
src/http/http.zig— arms the idle timer inonOpen, refactorsHTTPClient.setTimeoutto read a globalidle_timeout_secondsinstead of a per-callminutesarg, updates 10 call sites.src/http/HTTPContext.zig— adds aHandler.onTimeout(short-tick) that forwards toonLongTimeout, sincesocket.setTimeoutmay pick either timer depending on duration.src/http/HTTPThread.zig— readsBUN_CONFIG_HTTP_IDLE_TIMEOUTonce at thread start, clamps to the uSockets long-timer wrap bound (239 min), and rounds >240s up to a whole minute.src/http/h2_client/ClientSession.zig— routesrearmTimeoutthrough the same value.src/bun_core/env_var.zig— registers the new typed env var.- New regression test
bun-install-stalled-tls.test.ts.
Security risks
None identified. The change is timer-arming logic; no auth, crypto, or input parsing. The new env var only shortens/lengthens an idle timeout and is clamped to a safe range.
Level of scrutiny
High. This is the production HTTP client used by every fetch() and bun install. The default behaviour is preserved (300s ≡ the old hard-coded 5-minute long timer), and the only delta for unconfigured environments is that the handshake phase is now covered — strictly an improvement. But the refactor swaps socket.timeout(0); socket.setTimeoutMinutes(5) for socket.setTimeout(seconds) and adds a brand-new onTimeout short-tick handler to HTTPContext.Handler that previously didn't exist. Any subtle interaction between the short-tick and long-tick timers across the request lifecycle would affect every HTTP request, so a human familiar with the uSockets timer wheel should sign off.
Other factors
- All three of my earlier inline comments (floor-division quirk above 240s, dead
minutesparameter, h1/h2 rounding divergence) were addressed; the final approach centralises the clamp + ceil-to-minute inHTTPThread.onStarton au64intermediate, which also resolved CodeRabbit's overflow concern. - A regression test is included and the PR description shows it failing before / passing after.
- The robobun CI comment references build failures on an intermediate commit (77b08a6); I can't confirm green CI on the latest a8c2bdb from here.
- Introduces new public-ish configuration surface (
BUN_CONFIG_HTTP_IDLE_TIMEOUT) that a maintainer should ack.
bun install (and fetch) could hang forever when an HTTPS connection's TCP connect succeeded but the TLS handshake stalled — the HTTP client only armed its idle timer inside onWritable(), which runs after the handshake completes. A server (or middlebox/NAT under load) that accepted the TCP connection but never answered ClientHello left the socket ESTABLISHED with no timer, and bun install's pendingTaskCount()==0 loop waited forever in epoll_wait. Arm the idle timer in onOpen() so the handshake phase is covered, wire the short-tick onTimeout handler alongside the existing onLongTimeout, and read the duration from a new BUN_CONFIG_HTTP_IDLE_TIMEOUT env var (default 300s — unchanged from the previous hard-coded 5 minutes) so tests can exercise the path in seconds. Fixes #30325.
a8c2bdb to
da6eeb0
Compare
There was a problem hiding this comment.
No issues found and all prior review feedback has been addressed — deferring only because this changes socket-timer lifecycle in the core HTTP client (backs every fetch() and bun install), which is worth a human glance before merge.
Extended reasoning...
Overview
This PR fixes #30325 by arming the HTTP client's idle timer in onOpen() (so a stalled TLS handshake eventually fails instead of hanging forever) and makes the timeout configurable via BUN_CONFIG_HTTP_IDLE_TIMEOUT. It touches src/http/http.zig (timer arming + setTimeout refactor), src/http/HTTPContext.zig (wires the short-tick onTimeout handler), src/http/HTTPThread.zig (env-var load + clamp/round), src/http/h2_client/ClientSession.zig (routes h2 rearmTimeout through the same value), src/bun_core/env_var.zig (typed accessor), and adds a regression test.
Security risks
None identified. The change adds an idle timeout where previously there was none during handshake; it does not weaken TLS verification, certificate handling, or auth. The new env var only controls a timeout duration and is clamped/normalised before use.
Level of scrutiny
High. The HTTP client backs every fetch() call and bun install network operation, so a regression here has very wide blast radius. The change is conceptually small but alters socket-timer lifecycle (armed earlier, new short-tick onTimeout dispatch path) and switches the default 5-minute timeout from setTimeoutMinutes(5) to socket.setTimeout(300) — equivalent in effect (300 > 240 → setLongTimeout(5)) but a different code path. That's exactly the kind of subtle networking change that benefits from a maintainer's eye.
Other factors
- All four prior review threads (CodeRabbit's typed-env-var and overflow concerns; my floor-division, dead-parameter, and h2-rounding nits) have been addressed in follow-up commits, and the final shape — clamping + ceil-to-minute normalisation done once in
HTTPThread.onStarton au64— is clean. - The new regression test passed on every CI lane; the two CI failures are documented as pre-existing on
mainand unrelated (Windows AFD-poll race, ASAN worker panic). - Default behaviour is unchanged for unconfigured environments except that the handshake phase is now covered by the timer, which is the bug fix itself.
- I have not previously posted a top-level review on this PR (only inline comments, all resolved), so this is not redundant.
…ow-update-backpressure Conflict in src/http/http.zig with fe735f8 (#30376 idle-timer-on-open): setTimeout(socket, minutes) became setTimeout(socket) reading idle_timeout_seconds. Kept the receive_paused guards in .body/.body_chunk onData, updated consumeResponseBody's resume to the new signature, and switched maybePauseReceive's timer-clear from setTimeoutMinutes(0)+timeout(0) to setTimeout(0) which clears both the short-tick and long-minute timers.
…en-sh#30376) Fixes the `bun install` hang reported in oven-sh#30325 (latest comment — still reproducing on 1.3.13). ## Repro Point `bun install` at an HTTPS registry that accepts TCP but never answers the TLS ClientHello: ```ts // raw TCP server that swallows the ClientHello and never replies net.createServer(s => s.on("data", () => {})).listen(0); ``` ```toml [install] registry = "https://127.0.0.1:<port>/" ``` `bun install` connects, the socket goes ESTABLISHED, and the process blocks in `epoll_wait` forever with no timer armed. This is the state the reporter captured in their Gitea/Kubernetes CI: three ESTABLISHED sockets to the npm CDN, zero rx/tx, 14+ minutes and counting. ## Root cause `HTTPClient.onOpen()` starts the TLS handshake but does not arm the socket's idle timer — the first `setTimeout(socket, 5)` call is in `onWritable()`, which only runs *after* the handshake completes. Freshly-connected sockets inherit `long_timeout = 255` (disabled) from the connecting socket, so a stall anywhere between TCP-connect and handshake-done has no timer at all. The `bun install` main loop then waits forever on `pendingTaskCount() == 0` because the `NetworkTask` callback never fires. The earlier fixes in oven-sh#29611 / oven-sh#29649 covered a different hang (4xx/5xx tarball responses not releasing the task slot); they didn't touch this path. ## Fix - Arm the idle timer in `onOpen()` so it covers the TLS handshake. - Wire the short-tick `onTimeout` handler in `HTTPContext.Handler` alongside the existing `onLongTimeout` — `socket.setTimeout(seconds)` picks whichever timer fits the duration, so both must dispatch. - Read the idle-timeout duration from a new `BUN_CONFIG_HTTP_IDLE_TIMEOUT` env var (seconds). Default is 300 — the previous hard-coded 5 minutes — so nothing changes for unconfigured environments except that the handshake phase is now covered. `0` disables the timer (same as `disable_timeout = true`). - Route the experimental h2 client session's `rearmTimeout` through the same value for consistency. ## Verification New test `test/cli/install/bun-install-stalled-tls.test.ts` starts a raw TCP server that accepts connections and never replies, points `bun install` at it over `https://`, sets `BUN_CONFIG_HTTP_IDLE_TIMEOUT=3` / `BUN_CONFIG_HTTP_RETRY_COUNT=0`, and asserts the install fails with a timeout error. ``` # without this change (fail) bun install times out when the registry accepts TCP but never completes the TLS handshake [60004.48ms] ^ this test timed out after 60000ms. # with this change (pass) bun install times out when the registry accepts TCP but never completes the TLS handshake [4483.87ms] ``` `fetch-http2-client.test.ts` (58 tests) and `bun-install-retry.test.ts` still pass. Fixes oven-sh#30325 --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Signed-off-by: Sisyphus <sisyphus@ohos-bun.dev>
### What does this PR do? Enables TCP keepalive (`SO_KEEPALIVE` + `TCP_KEEPIDLE=60s`) on `fetch()` client sockets. Without this, when a connection becomes half-open — the peer is gone but the FIN/RST never reached us (NAT timeout, wifi/cellular handoff, middlebox state eviction, VPN disconnect) — the kernel never discovers it. A streaming `reader.read()` on such a socket blocks forever (or until an application-level timeout). Node's fetch (undici) sets `SO_KEEPALIVE` with `TCP_KEEPIDLE=60s`, so a half-open connection is detected at ~70s (60s idle + 10 probes × 1s). This makes Bun match that behavior via the existing `socket.setKeepAlive()` → `bsd_socket_keepalive()` path, which already hardcodes `TCP_KEEPINTVL=1` and `TCP_KEEPCNT=10`. The call is placed in `onOpen()` next to `client.setTimeout(socket)` (#30376) — socket-level, fires once per connection, inherited by keep-alive-reused requests. ### How did you verify your code works? Added `test/js/web/fetch/fetch-tcp-keepalive.test.ts` (Linux-only) that: - starts a streaming server, opens a `fetch()` to it - reads `/proc/self/net/tcp` and finds the client socket (ESTABLISHED, remote port = server port) - asserts the timer field is not `00:00000000` — i.e. the kernel's `sk_timer` (keepalive) is armed (`timer_active=02`) Without this patch the timer field is `00`; with it, `02:<jiffies>`. --------- Co-authored-by: robobun <117481402+robobun@users.noreply.github.com> Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Fixes the
bun installhang reported in #30325 (latest comment — still reproducing on 1.3.13).Repro
Point
bun installat an HTTPS registry that accepts TCP but never answers the TLS ClientHello:bun installconnects, the socket goes ESTABLISHED, and the process blocks inepoll_waitforever with no timer armed. This is the state the reporter captured in their Gitea/Kubernetes CI: three ESTABLISHED sockets to the npm CDN, zero rx/tx, 14+ minutes and counting.Root cause
HTTPClient.onOpen()starts the TLS handshake but does not arm the socket's idle timer — the firstsetTimeout(socket, 5)call is inonWritable(), which only runs after the handshake completes. Freshly-connected sockets inheritlong_timeout = 255(disabled) from the connecting socket, so a stall anywhere between TCP-connect and handshake-done has no timer at all. Thebun installmain loop then waits forever onpendingTaskCount() == 0because theNetworkTaskcallback never fires.The earlier fixes in #29611 / #29649 covered a different hang (4xx/5xx tarball responses not releasing the task slot); they didn't touch this path.
Fix
onOpen()so it covers the TLS handshake.onTimeouthandler inHTTPContext.Handleralongside the existingonLongTimeout—socket.setTimeout(seconds)picks whichever timer fits the duration, so both must dispatch.BUN_CONFIG_HTTP_IDLE_TIMEOUTenv var (seconds). Default is 300 — the previous hard-coded 5 minutes — so nothing changes for unconfigured environments except that the handshake phase is now covered.0disables the timer (same asdisable_timeout = true).rearmTimeoutthrough the same value for consistency.Verification
New test
test/cli/install/bun-install-stalled-tls.test.tsstarts a raw TCP server that accepts connections and never replies, pointsbun installat it overhttps://, setsBUN_CONFIG_HTTP_IDLE_TIMEOUT=3/BUN_CONFIG_HTTP_RETRY_COUNT=0, and asserts the install fails with a timeout error.fetch-http2-client.test.ts(58 tests) andbun-install-retry.test.tsstill pass.Fixes #30325