Skip to content

http: enable TCP keepalive on fetch sockets#30627

Merged
cirospaciari merged 3 commits into
mainfrom
ciro/fetch-tcp-keepalive
May 13, 2026
Merged

http: enable TCP keepalive on fetch sockets#30627
cirospaciari merged 3 commits into
mainfrom
ciro/fetch-tcp-keepalive

Conversation

@cirospaciari

Copy link
Copy Markdown
Member

What does this PR do?

Enables TCP keepalive (SO_KEEPALIVE + TCP_KEEPIDLE=60s) on fetch() client sockets.

Without this, when a connection becomes half-open — the peer is gone but the FIN/RST never reached us (NAT timeout, wifi/cellular handoff, middlebox state eviction, VPN disconnect) — the kernel never discovers it. A streaming reader.read() on such a socket blocks forever (or until an application-level timeout).

Node's fetch (undici) sets SO_KEEPALIVE with TCP_KEEPIDLE=60s, so a half-open connection is detected at ~70s (60s idle + 10 probes × 1s). This makes Bun match that behavior via the existing socket.setKeepAlive()bsd_socket_keepalive() path, which already hardcodes TCP_KEEPINTVL=1 and TCP_KEEPCNT=10.

The call is placed in onOpen() next to client.setTimeout(socket) (#30376) — socket-level, fires once per connection, inherited by keep-alive-reused requests.

How did you verify your code works?

Added test/js/web/fetch/fetch-tcp-keepalive.test.ts (Linux-only) that:

  • starts a streaming server, opens a fetch() to it
  • reads /proc/self/net/tcp and finds the client socket (ESTABLISHED, remote port = server port)
  • asserts the timer field is not 00:00000000 — i.e. the kernel's sk_timer (keepalive) is armed (timer_active=02)

Without this patch the timer field is 00; with it, 02:<jiffies>.

Bun's fetch() HTTP client does not set SO_KEEPALIVE on its sockets. When
a connection becomes half-open — the peer closed but the FIN/RST never
reached us (NAT timeout, wifi/cellular handoff, middlebox state eviction,
VPN disconnect) — the kernel has no way to discover it without keepalive
probes. A streaming reader.read() on such a socket blocks until an
application-level timeout, if any.

Node's fetch (undici) sets SO_KEEPALIVE with TCP_KEEPIDLE=60s, so a
half-open connection is detected at ~70s (60s idle + 10 probes × 1s).
This change makes Bun's fetch do the same, via the existing
socket.setKeepAlive() → bsd_socket_keepalive() plumbing that already sets
TCP_KEEPINTVL=1 and TCP_KEEPCNT=10.

Placed in onOpen() alongside the existing client.setTimeout(socket)
(#30376) — socket-level, fires once per connection, inherited by
keep-alive-reused requests.

The test reads /proc/self/net/tcp to verify the kernel's keepalive timer
is armed on the client socket (timer_active=02). Linux-only.
@robobun

robobun commented May 13, 2026

Copy link
Copy Markdown
Collaborator
Updated 9:57 AM PT - May 13th, 2026

@autofix-ci[bot], your commit d4fb0d0 has 2 failures in Build #54044 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 30627

That installs a local version of the PR into your bun-30627 executable, so you can run:

bun-30627 --bun

@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0a5e2a86-499e-460f-b6a9-6a438a0fbced

📥 Commits

Reviewing files that changed from the base of the PR and between 6d5e33c and d4fb0d0.

📒 Files selected for processing (1)
  • test/js/web/fetch/fetch-tcp-keepalive.test.ts

Walkthrough

TCP keepalive is enabled on fetch client sockets with a 60-second idle timeout to detect stalled connections. A socket configuration change in onOpen() arms keepalive, and a Linux-only test validates the keepalive timer is active on established client sockets.

Changes

TCP Keepalive for Fetch

Layer / File(s) Summary
TCP keepalive socket configuration
src/http/http.zig
onOpen() calls socket.setKeepAlive(true, 60) after arming the idle timeout, enabling TCP keepalive detection on fetch client sockets.
TCP keepalive verification test
test/js/web/fetch/fetch-tcp-keepalive.test.ts
Linux-only test spawns a streaming HTTP server, performs fetch(), inspects /proc/self/net/tcp for the client-side ESTABLISHED entry, and asserts the keepalive timer active field is "02" (armed).
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main change: enabling TCP keepalive on fetch sockets.
Description check ✅ Passed The PR description covers both required sections: it clearly explains what the PR does (enables TCP keepalive with specific settings and reasoning) and how it was verified (with a detailed Linux test).
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/js/web/fetch/fetch-tcp-keepalive.test.ts`:
- Around line 63-69: The assertion is too broad: instead of checking timerActive
is not "00", update the test to assert that timerActive specifically indicates
the keepalive timer code "02" (the PR notes "02:<jiffies>"). Locate the
timerActive check in fetch-tcp-keepalive.test.ts and replace the negative check
using timerActive with a positive check that timerActive begins with or matches
the "02" timer code (e.g., startsWith "02" or a regex matching the "02" prefix)
so the test only passes for the keepalive timer.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a0d3c677-476b-43fe-afce-edcd04cfab0e

📥 Commits

Reviewing files that changed from the base of the PR and between 3bf4b33 and fbb25f7.

📒 Files selected for processing (2)
  • src/http/http.zig
  • test/js/web/fetch/fetch-tcp-keepalive.test.ts

Comment thread test/js/web/fetch/fetch-tcp-keepalive.test.ts Outdated
@github-actions

Copy link
Copy Markdown
Contributor

Found 1 issue this PR may fix:

  1. HTTPS requests hanging (regression 1.2.23 -> 1.3.x) #26066 - HTTPS requests hanging after sustained use (400-500 requests over minutes), consistent with half-open connections on pooled sockets where the peer disappeared without sending FIN/RST. TCP keepalive probes would detect these dead connections.

If this is helpful, copy the block below into the PR description to auto-close this issue on merge.

Fixes #26066

🤖 Generated with Claude Code

@cirospaciari

Copy link
Copy Markdown
Member Author

@robobun adopt

Comment thread test/js/web/fetch/fetch-tcp-keepalive.test.ts Outdated
@robobun

robobun commented May 13, 2026

Copy link
Copy Markdown
Collaborator

✅ Adopted. Built locally, test passes with the fix and fails without it (timerActive is "00" on main, "02" with setKeepAlive(true, 60)).

Pushed 6d5e33c:

  • Tightened the assertion to toBe("02") per review (keepalive timer specifically, not just "any timer").
  • Replaced the sleep(50) race with await fetch() + reader.read() so the socket is guaranteed ESTABLISHED and the GET is ACKed before reading /proc/self/net/tcp.

CI status: the diff is green. The remaining red on build 54044 is pre-existing and unrelated:

  • test-http-should-emit-close-when-connection-is-aborted.ts — times out on Windows 2019 x64/x64-baseline. This same failure appears on 30+ unrelated PR builds (53900–54038: avif-linux, fake-timers, cluster-reuseport, Intl, etc.), none of which touch HTTP/fetch. main builds don't run the Windows 2019 lane at all, so it only surfaces on PRs. Not introduced here.
  • jsc-stress.test.ts (mprotect failed: 487) — JSC Wasm JIT failure on Windows, unrelated to HTTP.
  • dev-and-prod.test.ts — annotated flaky, HMR timing.

Skipping a retrigger since the Windows 2019 timeout is systematic across all PR builds and would recur. Ready for review/merge.

robobun and others added 2 commits May 13, 2026 16:15
- Tighten assertion from `not.toBe("00")` to `toBe("02")` so only the
  keepalive timer (sk_timer) satisfies it, not a retransmit (01) or
  zero-window probe (04).
- Await response headers + first chunk instead of `sleep(50)` so the
  socket is guaranteed ESTABLISHED and the outbound GET is ACKed before
  reading /proc/self/net/tcp — avoids a race where a retransmit timer
  would mask the keepalive timer.
- Cancel the reader instead of draining the full body.
@cirospaciari cirospaciari enabled auto-merge (squash) May 13, 2026 16:57
@cirospaciari cirospaciari disabled auto-merge May 13, 2026 16:57
@cirospaciari cirospaciari merged commit b9c757b into main May 13, 2026
75 of 77 checks passed
@cirospaciari cirospaciari deleted the ciro/fetch-tcp-keepalive branch May 13, 2026 16:57
dylan-conway added a commit that referenced this pull request May 14, 2026
### What does this PR do?

Ports the TCP keepalive call from `src/http/http.zig` `onOpen` (added in
#30627, gated in #30640) into the Rust `on_open` in `src/http/lib.rs`.
The Rust HTTP client landed in #30412 from a branch cut before #30627
merged, so the `set_keepalive` call was never carried over.

This is the cause of `test/js/web/fetch/fetch-tcp-keepalive.test.ts`
failing on every Linux test job since #30412 merged. The test reads
`/proc/self/net/tcp` and expects `timer_active=02` (keepalive armed) but
gets `00` because the Rust client never calls `set_keepalive`. Build
#54331 (and every Linux job since): 2 pass / 2 fail; build #54099
(#30640's PR, Zig build): 4 pass / 0 fail; build #54202 (#30412's PR):
test file not in tree, so it never ran.

The change is the same shape as the Zig reference: after
`self.set_timeout(socket)`, guarded by `!self.flags.disable_keepalive`
so `fetch(url, { keepalive: false })` and `node:http` non-keepalive
agents skip it (matching undici's `buildConnector`). `set_keep_alive`
already exists on `NewSocketHandler` (`src/uws_sys/socket.rs:466`) and
the `disable_keepalive` flag is already wired through.

### How did you verify your code works?

I have not built or run this — opening it directly per request so CI
validates. The covering test is
`test/js/web/fetch/fetch-tcp-keepalive.test.ts` (Linux-only), which is
the test currently failing on every PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants