http: arm idle timer on open so a stalled TLS handshake times out #30376

Merged: cirospaciari merged 5 commits into main from farm/ce1925a6/http-handshake-idle-timeout on May 9, 2026

@robobun

@robobun robobun commented May 7, 2026

Collaborator

Fixes the bun install hang reported in #30325 (latest comment — still reproducing on 1.3.13).

Repro

Point bun install at an HTTPS registry that accepts TCP but never answers the TLS ClientHello:

// raw TCP server that swallows the ClientHello and never replies
import net from "node:net";
net.createServer(s => s.on("data", () => {})).listen(0);

# bunfig.toml
[install]
registry = "https://127.0.0.1:<port>/"

bun install connects, the socket goes ESTABLISHED, and the process blocks in epoll_wait forever with no timer armed. This is the state the reporter captured in their Gitea/Kubernetes CI: three ESTABLISHED sockets to the npm CDN, zero rx/tx, 14+ minutes and counting.

Root cause

HTTPClient.onOpen() starts the TLS handshake but does not arm the socket's idle timer — the first setTimeout(socket, 5) call is in onWritable(), which only runs after the handshake completes. Freshly-connected sockets inherit long_timeout = 255 (disabled) from the connecting socket, so a stall anywhere between TCP-connect and handshake-done has no timer at all. The bun install main loop then waits forever on pendingTaskCount() == 0 because the NetworkTask callback never fires.
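
The gap can be pictured with a toy model. The sketch below is not Bun's code: `SocketModel`, `openAndStall`, and the `armOnOpen` flag are hypothetical names that only mirror the sequence described above, where onOpen() runs, the peer never answers, and onWritable() is never reached.

```typescript
// Toy model of the handshake window; none of these names exist in Bun.
type Phase = "connecting" | "handshaking" | "ready";

interface SocketModel {
  phase: Phase;
  // null models the inherited "disabled" state (long_timeout = 255 in the PR text).
  idleTimerSeconds: number | null;
}

function openAndStall(armOnOpen: boolean, idleSeconds: number): SocketModel {
  const sock: SocketModel = { phase: "connecting", idleTimerSeconds: null };

  // onOpen(): TCP is connected and the TLS handshake starts.
  sock.phase = "handshaking";
  if (armOnOpen) {
    sock.idleTimerSeconds = idleSeconds;
  }

  // The peer swallows the ClientHello, so the handshake never completes and
  // onWritable(), the old first arming point, is never reached.
  return sock;
}
```

With `armOnOpen = false` the stalled socket ends with no timer at all, which matches the hang; with `armOnOpen = true` the same stall is bounded by the idle window.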

The earlier fixes in #29611 / #29649 covered a different hang (4xx/5xx tarball responses not releasing the task slot); they didn't touch this path.

Fix

  • Arm the idle timer in onOpen() so it covers the TLS handshake.
  • Wire the short-tick onTimeout handler in HTTPContext.Handler alongside the existing onLongTimeout: socket.setTimeout(seconds) picks whichever timer fits the duration, so both must dispatch.
  • Read the idle-timeout duration from a new BUN_CONFIG_HTTP_IDLE_TIMEOUT env var (seconds). Default is 300 — the previous hard-coded 5 minutes — so nothing changes for unconfigured environments except that the handshake phase is now covered. 0 disables the timer (same as disable_timeout = true).
  • Route the experimental h2 client session's rearmTimeout through the same value for consistency.
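
Why both handlers must dispatch can be sketched as a small selection function. This is an illustration only, not the uSockets implementation: `pickTimer` and `ArmedTimer` are hypothetical names, and the 240-second boundary and the floor division to minutes are taken from the review discussion on this PR ("300 > 240 → setLongTimeout(5)"), so treat them as assumptions.

```typescript
// Hypothetical model of how a seconds value maps onto one of two timers.
type ArmedTimer =
  | { kind: "disabled" }
  | { kind: "short"; seconds: number } // would be reported through onTimeout
  | { kind: "long"; minutes: number }; // would be reported through onLongTimeout

// Assumed boundary, based on the review comments rather than the uSockets source.
const SHORT_TICK_MAX_SECONDS = 240;

function pickTimer(seconds: number): ArmedTimer {
  if (seconds <= 0) {
    return { kind: "disabled" };
  }
  if (seconds <= SHORT_TICK_MAX_SECONDS) {
    return { kind: "short", seconds };
  }
  // Floor division: durations above the boundary should already be whole minutes.
  return { kind: "long", minutes: Math.floor(seconds / 60) };
}
```

Under these assumptions the default of 300 lands on the long timer with 5 minutes, while the test value of 3 lands on the short-tick timer, which only fires usefully if an onTimeout handler is wired.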

Verification

New test test/cli/install/bun-install-stalled-tls.test.ts starts a raw TCP server that accepts connections and never replies, points bun install at it over https://, sets BUN_CONFIG_HTTP_IDLE_TIMEOUT=3 / BUN_CONFIG_HTTP_RETRY_COUNT=0, and asserts the install fails with a timeout error.

# without this change
(fail) bun install times out when the registry accepts TCP but never completes the TLS handshake [60004.48ms]
  ^ this test timed out after 60000ms.

# with this change
(pass) bun install times out when the registry accepts TCP but never completes the TLS handshake [4483.87ms]

fetch-http2-client.test.ts (58 tests) and bun-install-retry.test.ts still pass.

Fixes #30325

@robobun

robobun commented May 7, 2026

Collaborator Author
Updated 8:05 PM PT - May 7th, 2026

@robobun, your commit da6eeb0 has 1 failure in Build #52708 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 30376

That installs a local version of the PR into your bun-30376 executable, so you can run:

bun-30376 --bun

@coderabbitai

coderabbitai Bot commented May 7, 2026

Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds configurable HTTP idle timeout: exported idle_timeout_seconds can be set via BUN_CONFIG_HTTP_IDLE_TIMEOUT at thread start; sockets are armed on open and session rearm uses the configured timeout; Handler.onTimeout forwards to onLongTimeout. Adds regression test for stalled TLS handshake.

Changes

HTTP idle timeout configuration and event handling

| Layer | File(s) | Summary |
|---|---|---|
| Configuration contract | src/http/http.zig | Exported idle_timeout_seconds variable defaults to 300 seconds and controls idle timeout duration for HTTP client sockets. |
| Environment variable declaration | src/bun_core/env_var.zig | Adds BUN_CONFIG_HTTP_IDLE_TIMEOUT env var with default 300 and documentation (0 disables). |
| Environment loading at startup | src/http/HTTPThread.zig | onStart reads BUN_CONFIG_HTTP_IDLE_TIMEOUT, clamps/casts it, and assigns it to bun.http.idle_timeout_seconds. |
| Socket arming on open | src/http/http.zig | onOpen arms the socket timeout immediately after connection open to cover TLS handshake. |
| setTimeout implementation | src/http/http.zig | setTimeout now ignores the minutes arg and applies idle_timeout_seconds, disabling when configured to 0 or when disable_timeout is set. |
| Timeout refresh during request flow | src/http/http.zig | Multiple writable/short-read/header/body/data handling paths call setTimeout(socket) to refresh the idle timer throughout request/response progression. |
| Session rearm | src/http/h2_client/ClientSession.zig | ClientSession.rearmTimeout sets socket timeout using HTTPClient.idle_timeout_seconds when timeouts are desired, otherwise 0. |
| Timeout event dispatch | src/http/HTTPContext.zig | Handler.onTimeout forwards short-tick idle timeout events directly to onLongTimeout. |
| Regression test | test/cli/install/bun-install-stalled-tls.test.ts | Regression test verifies bun install times out when an HTTPS registry accepts TCP but never completes TLS handshake; sets BUN_CONFIG_HTTP_IDLE_TIMEOUT=3 and cleans up sockets/server. |
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: arming the idle timer on socket open to fix TLS handshake timeouts. |
| Description check | ✅ Passed | The PR description exceeds the template requirements, providing comprehensive context: root cause analysis, detailed fix explanation, verification approach with test results, and issue reference. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/http/HTTPThread.zig`:
- Around line 228-232: Add a typed env-var accessor named
BUN_CONFIG_HTTP_IDLE_TIMEOUT in src/bun_core/env_var.zig following the same
pattern as BUN_CONFIG_DNS_TIME_TO_LIVE_SECONDS (an unsigned integer env_var),
then replace the manual bun.getenvZ + std.fmt.parseInt usage in HTTPThread.zig
with a call to that accessor's .get() method (i.e., use
BUN_CONFIG_HTTP_IDLE_TIMEOUT.get() via the bun.env_var) and assign the returned
value to bun.http.idle_timeout_seconds when present. Ensure the new accessor's
type and default/optional handling match the repo pattern and import/namespace
usage so HTTPThread.zig can call it directly.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e4fc9cdb-52b3-4db6-8d20-7811b4024b48

📥 Commits

Reviewing files that changed from the base of the PR and between 9ed6e89 and 5e007bc.

📒 Files selected for processing (5)
  • src/http/HTTPContext.zig
  • src/http/HTTPThread.zig
  • src/http/h2_client/ClientSession.zig
  • src/http/http.zig
  • test/cli/install/bun-install-stalled-tls.test.ts

Comment thread src/http/HTTPThread.zig Outdated
Comment thread src/http/http.zig
Comment thread src/http/http.zig Outdated

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/http/http.zig (2)

1857-2007: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Successful header reads still consume the old idle budget.

handleOnDataHeaders() only calls setTimeout() on the short-read path and before chunked-body parsing. If a complete header block arrives with no body bytes in the same packet, or we return after a 1xx response, the timer keeps counting from the previous write/open event and can fire earlier than the configured idle window.

Suggested fix
 pub fn handleOnDataHeaders(
     this: *HTTPClient,
     comptime is_ssl: bool,
     incoming_data: []const u8,
     ctx: *NewHTTPContext(is_ssl),
     socket: NewHTTPContext(is_ssl).HTTPSocket,
 ) void {
     log("handleOnDataHeader data: {s}", .{incoming_data});
+    this.setTimeout(socket);
     var to_read = incoming_data;
     var needs_move = true;
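
The effect of a missed refresh can be worked through with a small timeline model. This is a hedged sketch, not code from http.zig: `Activity` and `idleDeadline` are hypothetical names, and times are plain seconds since the socket opened.

```typescript
// Hypothetical timeline model: each socket event either re-arms the idle timer or not.
interface Activity {
  atSeconds: number;
  refreshesTimer: boolean;
}

// Returns the time at which the idle timer would fire, or null if never armed.
function idleDeadline(activity: Activity[], windowSeconds: number): number | null {
  let deadline: number | null = null;
  for (const event of activity) {
    if (event.refreshesTimer) {
      deadline = event.atSeconds + windowSeconds;
    }
  }
  return deadline;
}

// Timer armed at t=0; a complete header block arrives at t=200.
const withoutRefresh = idleDeadline(
  [{ atSeconds: 0, refreshesTimer: true }, { atSeconds: 200, refreshesTimer: false }],
  300,
);
const withRefresh = idleDeadline(
  [{ atSeconds: 0, refreshesTimer: true }, { atSeconds: 200, refreshesTimer: true }],
  300,
);
```

In this model `withoutRefresh` is 300, so the connection gets only 100 seconds of quiet after the headers, while `withRefresh` is 500, the full configured window measured from the last activity.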
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/http/http.zig` around lines 1857 - 2007, handleOnDataHeaders() currently
only resets the idle timer in the short-read and chunked-body branches, so
successful header-only reads (e.g., a 1xx with no more bytes or a complete
header block with no body) leave the previous timeout intact; call
this.setTimeout(socket) before any early returns where no body processing will
follow: specifically add this.setTimeout(socket) in the 1xx branch before the
`return` when to_read.len == 0, and add this.setTimeout(socket) before the
`return` in the block after cloneMetadata() when to_read.len == 0 (the
header-only path that may call progressUpdate), and also ensure setTimeout is
invoked before returning from the proxy-handshake path if startProxyHandshake()
does not already do so; update handleOnDataHeaders to place these calls near the
existing setTimeout usages so the idle timer is consistently refreshed on
header-only reads.

1720-1740: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Re-arm the idle timer for proxied streaming uploads too.

.proxy_body only refreshes the timeout in the .bytes branch. A streamed request body goes through .stream => this.flushStream(...), so proxied streaming uploads can still time out mid-transfer even while bytes are being drained.

Suggested fix
         .proxy_body => {
             log("send proxy body", .{});
             if (this.proxy_tunnel) |proxy| {
+                this.setTimeout(socket);
                 switch (this.state.original_request_body) {
                     .bytes => {
-                        this.setTimeout(socket);
-
                         const to_send = this.state.request_body;
                         const sent = proxy.write(to_send) catch return; // just wait and retry when onWritable! if closed internally will call proxy.onClose
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/http/http.zig` around lines 1720 - 1740, The proxy_body handler only
re-arms the idle timeout in the .bytes branch, so streamed proxied uploads can
time out; update the .proxy_body case for when this.proxy_tunnel and
this.state.original_request_body == .stream to call this.setTimeout(socket) (the
same timeout refresh used in the .bytes branch) immediately before invoking
this.flushStream(is_ssl, socket) so streaming writes also renew the idle timer;
reference the .proxy_body branch, this.setTimeout, this.flushStream, and
this.proxy_tunnel to locate and apply the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/http/http.zig`:
- Around line 2216-2227: The ceil-to-minute math in setTimeout (function
setTimeout on HTTPClient) can overflow because idle_timeout_seconds is a c_uint
and (secs + 59) may wrap; fix by preventing overflow: either clamp
idle_timeout_seconds earlier to at most maxInt(c_uint) - 59 (so ((secs + 59) /
60) * 60 cannot overflow) or perform a widening cast/bounds check in setTimeout
(e.g., cast secs to a wider integer, check if secs > maxInt(c_uint) - 59 and
handle separately, or compute the minute-rounded value using safe arithmetic)
and then call socket.setTimeout with the safe result; update the code paths that
reference idle_timeout_seconds to use the chosen safe bound.
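
The wrap described in this finding can be reproduced numerically. The sketch below emulates unsigned 32-bit addition in TypeScript; the 32-bit width of c_uint is an assumption (it holds on common platforms but is not stated in this thread), and both function names are hypothetical.

```typescript
// 2^32, the modulus of an unsigned 32-bit integer (assumed width of c_uint).
const U32_MODULUS = 4294967296;

// Ceil-to-minute with the addition wrapping as it would in 32-bit unsigned math.
function ceilToMinuteWrapping(secs: number): number {
  const sum = (secs + 59) % U32_MODULUS;
  return Math.floor(sum / 60) * 60;
}

// Same rounding on a wider intermediate, so the addition cannot wrap.
function ceilToMinuteWidened(secs: number): number {
  return Math.floor((secs + 59) / 60) * 60;
}
```

For ordinary values the two agree (250 rounds to 300 in both). At the maximum 32-bit value, 4294967295, the wrapping version yields 0 because the sum wraps to 58, whereas the widened version yields 4294967340; that result no longer fits in 32 bits, which is why the finding also suggests clamping the configured value first.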


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 63176774-5534-4e2a-ace0-9ddaba83040e

📥 Commits

Reviewing files that changed from the base of the PR and between 48d2fd5 and 58696b0.

📒 Files selected for processing (1)
  • src/http/http.zig

Comment thread src/http/http.zig Outdated
Comment thread src/http/h2_client/ClientSession.zig
@robobun robobun force-pushed the farm/ce1925a6/http-handshake-idle-timeout branch from 77b08a6 to a8c2bdb Compare May 8, 2026 00:07
@robobun

robobun commented May 8, 2026

Collaborator Author

CI status

Build 52708 (da6eeb0, rebased on main@9ed6e89b)

node-http-backpressure-max.test.ts passed on darwin — the 4/4 timeout on the previous build (52698) was a runner-load flake, not caused by this change.

Only remaining hard error:

test-http-should-emit-close-when-connection-is-aborted.ts — Windows timeout (all 3 lanes)

Pre-existing on main. Same failure appears on unrelated PR builds 52688 (farm/26bff18f/serve-listen-errno), 52687/52679 (farm/ed443810/fix-s3-path-double-free), 52680, 52678 (farm/39a49e68/wasm-esm-integration). It's the known Windows AFD-poll race described in the unmerged fix branches farm/ffec12ba/fix-http-abort-close-windows-timeout / farm/36f07dd7:

fetch() abort hops to the HTTP client thread before the socket is closed, and on Windows that cross-thread close races the server socket's AFD poll re-submission — disconnect isn't observed and the test times out.

This PR doesn't touch abort/shutdown dispatch; for the default config it arms the same 5-minute long timer as before (socket.setTimeout(300) → setLongTimeout(5)), just earlier (in onOpen() instead of onWritable()).

All other 52708 annotations are warnings (passed on retry): spawn.test.ts, terminal-spawn.test.ts, hot.test.ts, update_interactive_install.test.ts, test-http-server-capture-rejections.js.

Earlier build 52690 (58696b0)

test-worker-nested-uncaught.js panicked with EventLoop.enqueueTaskConcurrent: VM has terminated on debian-13 x64-asan — worker-teardown race that uses no HTTP; reproduces ~4/15 runs on main locally under the debug ASAN build. Did not recur on 52708.

This PR's new test

test/cli/install/bun-install-stalled-tls.test.ts passed on every lane that ran it on both builds.

@claude claude Bot left a comment

Contributor

Thanks for addressing all the earlier feedback — the normalisation-at-load-time approach in a8c2bdb is clean and I have nothing further. Deferring to a human since this changes core HTTP client timer behaviour (new short-tick onTimeout handler, setTimeout refactor, new BUN_CONFIG_HTTP_IDLE_TIMEOUT knob) that runs on every fetch/install request.

Extended reasoning...

Overview

This PR fixes a real bun install hang (#30325) where a stalled TLS handshake leaves the request blocked forever because the idle timer wasn't armed until onWritable (post-handshake). It touches:

  • src/http/http.zig — arms the idle timer in onOpen, refactors HTTPClient.setTimeout to read a global idle_timeout_seconds instead of a per-call minutes arg, updates 10 call sites.
  • src/http/HTTPContext.zig — adds a Handler.onTimeout (short-tick) that forwards to onLongTimeout, since socket.setTimeout may pick either timer depending on duration.
  • src/http/HTTPThread.zig — reads BUN_CONFIG_HTTP_IDLE_TIMEOUT once at thread start, clamps to the uSockets long-timer wrap bound (239 min), and rounds >240s up to a whole minute.
  • src/http/h2_client/ClientSession.zig — routes rearmTimeout through the same value.
  • src/bun_core/env_var.zig — registers the new typed env var.
  • New regression test bun-install-stalled-tls.test.ts.
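
The load-time normalisation summarised for HTTPThread.zig can be sketched as follows. This is an approximation in TypeScript rather than the Zig that was merged: `normalizeIdleTimeout` is a hypothetical name, falling back to the default on unparsable input is an assumption, and the two bounds (239 minutes, 240 seconds) are the figures quoted in this review.

```typescript
// Bounds quoted in the review; the surrounding logic is an illustrative guess.
const LONG_TIMER_MAX_MINUTES = 239;
const SHORT_TICK_MAX_SECONDS = 240;

function normalizeIdleTimeout(raw: string | undefined, fallbackSeconds = 300): number {
  if (raw === undefined) {
    return fallbackSeconds;
  }
  const parsed = Number.parseInt(raw, 10);
  if (!Number.isFinite(parsed) || parsed < 0) {
    return fallbackSeconds; // assumption: invalid input keeps the default
  }
  // Clamp so the value can never exceed what the long timer can represent.
  const clamped = Math.min(parsed, LONG_TIMER_MAX_MINUTES * 60);
  if (clamped <= SHORT_TICK_MAX_SECONDS) {
    return clamped; // short-tick range, kept at second granularity
  }
  // Above the short-tick range, round up to a whole minute.
  return Math.ceil(clamped / 60) * 60;
}
```

Doing this once at startup means every later call can pass the stored value straight through, so the HTTP/1 and h2 paths cannot round differently.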

Security risks

None identified. The change is timer-arming logic; no auth, crypto, or input parsing. The new env var only shortens/lengthens an idle timeout and is clamped to a safe range.

Level of scrutiny

High. This is the production HTTP client used by every fetch() and bun install. The default behaviour is preserved (300s ≡ the old hard-coded 5-minute long timer), and the only delta for unconfigured environments is that the handshake phase is now covered — strictly an improvement. But the refactor swaps socket.timeout(0); socket.setTimeoutMinutes(5) for socket.setTimeout(seconds) and adds a brand-new onTimeout short-tick handler to HTTPContext.Handler that previously didn't exist. Any subtle interaction between the short-tick and long-tick timers across the request lifecycle would affect every HTTP request, so a human familiar with the uSockets timer wheel should sign off.

Other factors

  • All three of my earlier inline comments (floor-division quirk above 240s, dead minutes parameter, h1/h2 rounding divergence) were addressed; the final approach centralises the clamp + ceil-to-minute in HTTPThread.onStart on a u64 intermediate, which also resolved CodeRabbit's overflow concern.
  • A regression test is included and the PR description shows it failing before / passing after.
  • The robobun CI comment references build failures on an intermediate commit (77b08a6); I can't confirm green CI on the latest a8c2bdb from here.
  • Introduces new public-ish configuration surface (BUN_CONFIG_HTTP_IDLE_TIMEOUT) that a maintainer should ack.

robobun and others added 5 commits May 8, 2026 01:18
bun install (and fetch) could hang forever when an HTTPS connection's
TCP connect succeeded but the TLS handshake stalled — the HTTP client
only armed its idle timer inside onWritable(), which runs after the
handshake completes. A server (or middlebox/NAT under load) that
accepted the TCP connection but never answered ClientHello left the
socket ESTABLISHED with no timer, and bun install's
pendingTaskCount()==0 loop waited forever in epoll_wait.

Arm the idle timer in onOpen() so the handshake phase is covered, wire
the short-tick onTimeout handler alongside the existing onLongTimeout,
and read the duration from a new BUN_CONFIG_HTTP_IDLE_TIMEOUT env var
(default 300s — unchanged from the previous hard-coded 5 minutes) so
tests can exercise the path in seconds.

Fixes #30325.
@robobun robobun force-pushed the farm/ce1925a6/http-handshake-idle-timeout branch from a8c2bdb to da6eeb0 Compare May 8, 2026 01:21

@claude claude Bot left a comment

Contributor

No issues found and all prior review feedback has been addressed — deferring only because this changes socket-timer lifecycle in the core HTTP client (backs every fetch() and bun install), which is worth a human glance before merge.

Extended reasoning...

Overview

This PR fixes #30325 by arming the HTTP client's idle timer in onOpen() (so a stalled TLS handshake eventually fails instead of hanging forever) and makes the timeout configurable via BUN_CONFIG_HTTP_IDLE_TIMEOUT. It touches src/http/http.zig (timer arming + setTimeout refactor), src/http/HTTPContext.zig (wires the short-tick onTimeout handler), src/http/HTTPThread.zig (env-var load + clamp/round), src/http/h2_client/ClientSession.zig (routes h2 rearmTimeout through the same value), src/bun_core/env_var.zig (typed accessor), and adds a regression test.

Security risks

None identified. The change adds an idle timeout where previously there was none during handshake; it does not weaken TLS verification, certificate handling, or auth. The new env var only controls a timeout duration and is clamped/normalised before use.

Level of scrutiny

High. The HTTP client backs every fetch() call and bun install network operation, so a regression here has very wide blast radius. The change is conceptually small but alters socket-timer lifecycle (armed earlier, new short-tick onTimeout dispatch path) and switches the default 5-minute timeout from setTimeoutMinutes(5) to socket.setTimeout(300) — equivalent in effect (300 > 240 → setLongTimeout(5)) but a different code path. That's exactly the kind of subtle networking change that benefits from a maintainer's eye.

Other factors

  • All four prior review threads (CodeRabbit's typed-env-var and overflow concerns; my floor-division, dead-parameter, and h2-rounding nits) have been addressed in follow-up commits, and the final shape — clamping + ceil-to-minute normalisation done once in HTTPThread.onStart on a u64 — is clean.
  • The new regression test passed on every CI lane; the two CI failures are documented as pre-existing on main and unrelated (Windows AFD-poll race, ASAN worker panic).
  • Default behaviour is unchanged for unconfigured environments except that the handshake phase is now covered by the timer, which is the bug fix itself.
  • I have not previously posted a top-level review on this PR (only inline comments, all resolved), so this is not redundant.

@cirospaciari cirospaciari merged commit fe735f8 into main May 9, 2026
76 of 77 checks passed
@cirospaciari cirospaciari deleted the farm/ce1925a6/http-handshake-idle-timeout branch May 9, 2026 21:49
robobun added a commit that referenced this pull request May 9, 2026
…ow-update-backpressure

Conflict in src/http/http.zig with fe735f8 (#30376 idle-timer-on-open):
setTimeout(socket, minutes) became setTimeout(socket) reading
idle_timeout_seconds. Kept the receive_paused guards in
.body/.body_chunk onData, updated consumeResponseBody's resume to the
new signature, and switched maybePauseReceive's timer-clear from
setTimeoutMinutes(0)+timeout(0) to setTimeout(0) which clears both
the short-tick and long-minute timers.
springmin pushed a commit to springmin/bun that referenced this pull request May 10, 2026
…en-sh#30376)

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Signed-off-by: Sisyphus <sisyphus@ohos-bun.dev>
cirospaciari added a commit that referenced this pull request May 13, 2026
### What does this PR do?

Enables TCP keepalive (`SO_KEEPALIVE` + `TCP_KEEPIDLE=60s`) on `fetch()`
client sockets.

Without this, when a connection becomes half-open — the peer is gone but
the FIN/RST never reached us (NAT timeout, wifi/cellular handoff,
middlebox state eviction, VPN disconnect) — the kernel never discovers
it. A streaming `reader.read()` on such a socket blocks forever (or
until an application-level timeout).

Node's fetch (undici) sets `SO_KEEPALIVE` with `TCP_KEEPIDLE=60s`, so a
half-open connection is detected at ~70s (60s idle + 10 probes × 1s).
This makes Bun match that behavior via the existing
`socket.setKeepAlive()` → `bsd_socket_keepalive()` path, which already
hardcodes `TCP_KEEPINTVL=1` and `TCP_KEEPCNT=10`.
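
The ~70s figure follows from the three keepalive parameters. A minimal worked version, with `KeepaliveParams` and `halfOpenDetectionSeconds` as illustrative names rather than anything in Bun:

```typescript
// Worst-case time for the kernel to declare a silent peer dead:
// idle time before the first probe, plus one interval per unanswered probe.
interface KeepaliveParams {
  idleSeconds: number; // TCP_KEEPIDLE
  probeIntervalSeconds: number; // TCP_KEEPINTVL
  probeCount: number; // TCP_KEEPCNT
}

function halfOpenDetectionSeconds(p: KeepaliveParams): number {
  return p.idleSeconds + p.probeIntervalSeconds * p.probeCount;
}

// Values quoted above: 60s idle, 10 probes, 1s apart.
const detection = halfOpenDetectionSeconds({
  idleSeconds: 60,
  probeIntervalSeconds: 1,
  probeCount: 10,
});
```

Here `detection` evaluates to 70, matching the estimate in the description.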

The call is placed in `onOpen()` next to `client.setTimeout(socket)`
(#30376) — socket-level, fires once per connection, inherited by
keep-alive-reused requests.

### How did you verify your code works?

Added `test/js/web/fetch/fetch-tcp-keepalive.test.ts` (Linux-only) that:
- starts a streaming server, opens a `fetch()` to it
- reads `/proc/self/net/tcp` and finds the client socket (ESTABLISHED,
remote port = server port)
- asserts the timer field is not `00:00000000` — i.e. the kernel's
`sk_timer` (keepalive) is armed (`timer_active=02`)

Without this patch the timer field is `00`; with it, `02:<jiffies>`.
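
The check can be sketched as a small parser over the `/proc/net/tcp` text format. This is not the test from the PR: `parseTcpRow` and `hasKeepaliveTimer` are hypothetical helpers, and the column positions assume the usual Linux layout (`sl local_address rem_address st tx_queue:rx_queue tr:tm->when ...`).

```typescript
// State and timer codes as printed by the kernel in hexadecimal.
const TCP_ESTABLISHED = "01";
const TIMER_KEEPALIVE = "02";

interface TcpRow {
  remotePort: number;
  state: string;
  timerActive: string;
}

function parseTcpRow(line: string): TcpRow | null {
  const fields = line.trim().split(/\s+/);
  if (fields.length < 6) return null;
  const remote = fields[2].split(":"); // "ADDR:PORT", both in hex
  const timer = fields[5].split(":"); // "timer_active:jiffies"
  if (remote.length !== 2 || timer.length !== 2) return null; // also skips the header row
  const remotePort = Number.parseInt(remote[1], 16);
  if (Number.isNaN(remotePort)) return null;
  return { remotePort, state: fields[3], timerActive: timer[0] };
}

// True if any ESTABLISHED socket to serverPort has the keepalive timer armed.
function hasKeepaliveTimer(table: string, serverPort: number): boolean {
  return table.split("\n").some((line) => {
    const row = parseTcpRow(line);
    return (
      row !== null &&
      row.state === TCP_ESTABLISHED &&
      row.remotePort === serverPort &&
      row.timerActive === TIMER_KEEPALIVE
    );
  });
}
```

In a real test the table would come from reading `/proc/self/net/tcp`; the helper is kept pure here so it can be exercised with sample rows.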

---------

Co-authored-by: robobun <117481402+robobun@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

Bug: bun install hangs indefinitely in Docker/CI environments with stalled HTTP/2 connections

3 participants