feat(relay): graceful tunnel drain on SIGTERM#4594
Conversation
|
Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a TunnelManager WS close-code constant (4001) and an async drain() that closes tunnels with that code; reduces TunnelClient reconnect ceiling to 5s and treats WS close code 4001 as a drain signal that resets reconnectAttempts. ChangesTunnel drain & reconnect
Sequence Diagram(s)sequenceDiagram
participant Signal
participant RelayServer
participant TunnelManager
participant TunnelClient
Signal->>RelayServer: SIGINT/SIGTERM
RelayServer->>TunnelManager: drain("Server draining for deploy")
TunnelManager->>TunnelClient: close WS (code=4001, reason)
TunnelClient-->>TunnelClient: onclose(code=4001) -> reconnectAttempts = 0
TunnelClient->>TunnelClient: schedule reconnect (backoff ≤ 5000ms)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR adds a graceful tunnel-drain path to the relay so that rolling deploys no longer leave host-services in exponential-backoff limbo (~60 s) after a SIGTERM. The relay closes every open WebSocket with code 1001 before exiting, and the host-service now recognises code 1001 as a clean drain signal and resets its reconnect counter so it retries within ~1 s.
Confidence Score: 3/5The deploy-drain path works correctly in the happy path, but the close-code collision with the existing ping-timeout path means the host-service cannot distinguish the two scenarios and may trigger reconnect storms under load. The relay's existing ping-timeout handler already closes host sockets with code 1001. The new client-side logic resets backoff for any code-1001 close, so a wave of ping timeouts during relay overload would produce the same thundering-herd that the exponential backoff was designed to suppress.
|
| Filename | Overview |
|---|---|
| apps/relay/src/index.ts | Adds SIGTERM handler with draining guard that calls tunnelManager.drain() then exits; HTTP server is not stopped before the 250 ms flush window, leaving a brief reconnect-then-immediate-exit gap. |
| apps/relay/src/tunnel.ts | New drain() method closes all tunnels with close code 1001; the same code 1001 is already used in the existing ping-timeout path, creating a semantic collision that the host-service cannot resolve. |
| packages/host-service/src/tunnel/tunnel-client.ts | Resets reconnectAttempts to 0 on close code 1001; because ping-timeout closures also use 1001, this fast-reconnect path can be triggered unintentionally under load, potentially causing reconnect storms. |
Sequence Diagram
sequenceDiagram
participant Fly as Fly Platform
participant Relay as Relay (index.ts)
participant TM as TunnelManager
participant Host as Host Service
Fly->>Relay: SIGTERM
Relay->>Relay: "draining = true"
Relay->>TM: drain()
loop each open tunnel
TM->>Host: ws.close(1001, "Server draining for deploy")
end
Note over TM: await 250ms (flush close frames)
Host->>Host: "onclose(event.code = 1001)"
Host->>Host: "reconnectAttempts = 0"
TM-->>Relay: drain() resolved
Relay->>Relay: process.exit(0)
Note over Host: scheduleReconnect() ~500-1000ms delay
Host->>Relay: reconnect (new relay instance)
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
packages/host-service/src/tunnel/tunnel-client.ts:133-141
**Close code 1001 collision with ping-timeout path**
The relay already closes tunnels with code 1001 in the ping-timeout path (`tunnel.ts:112`: `ws.close(1001, "Ping timeout")`). Since the new host-service handler treats *any* 1001 as a clean deploy drain and resets `reconnectAttempts = 0`, a mass ping-timeout event (e.g., relay under load, latency spike causing 3+ missed pings across many hosts) would cause all affected hosts to fire reconnects within ~500–1000 ms simultaneously — exactly the thundering-herd the backoff is designed to prevent. The two close-code paths need to be disambiguated: use an application-level code (e.g., 4001) for the deploy-drain signal and keep 1001 only for RFC-conformant "going away" closures, or change the ping-timeout path to use a distinct code so the host can tell the two apart.
### Issue 2 of 2
apps/relay/src/index.ts:55-65
**HTTP server keeps accepting connections during drain window**
After SIGTERM fires, `drain()` closes all existing tunnel WebSockets but leaves the HTTP server running for the full 250 ms flush window. Any host that reconnects very quickly (possible given the 1001 reset-backoff change on the client side) can successfully open a new tunnel that immediately gets torn down when `process.exit(0)` fires. Calling `server.close()` before awaiting `tunnelManager.drain()` would stop the listener from accepting new connections while still allowing in-flight sockets to drain.
Reviews (1): Last reviewed commit: "feat(relay): graceful tunnel drain on SI..." | Re-trigger Greptile
| // Clean close from the relay (1001 = "Going Away", typically a | ||
| // deploy drain). Reset backoff so we don't sit in 30s retry | ||
| // mode while the relay is restarting in <2s. | ||
| if (event.code === 1001) { | ||
| this.reconnectAttempts = 0; | ||
| console.log( | ||
| "[host-service:tunnel] relay draining; reconnecting immediately", | ||
| ); | ||
| } |
There was a problem hiding this comment.
Close code 1001 collision with ping-timeout path
The relay already closes tunnels with code 1001 in the ping-timeout path (tunnel.ts:112: ws.close(1001, "Ping timeout")). Since the new host-service handler treats any 1001 as a clean deploy drain and resets reconnectAttempts = 0, a mass ping-timeout event (e.g., relay under load, latency spike causing 3+ missed pings across many hosts) would cause all affected hosts to fire reconnects within ~500–1000 ms simultaneously — exactly the thundering-herd the backoff is designed to prevent. The two close-code paths need to be disambiguated: use an application-level code (e.g., 4001) for the deploy-drain signal and keep 1001 only for RFC-conformant "going away" closures, or change the ping-timeout path to use a distinct code so the host can tell the two apart.
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/host-service/src/tunnel/tunnel-client.ts
Line: 133-141
Comment:
**Close code 1001 collision with ping-timeout path**
The relay already closes tunnels with code 1001 in the ping-timeout path (`tunnel.ts:112`: `ws.close(1001, "Ping timeout")`). Since the new host-service handler treats *any* 1001 as a clean deploy drain and resets `reconnectAttempts = 0`, a mass ping-timeout event (e.g., relay under load, latency spike causing 3+ missed pings across many hosts) would cause all affected hosts to fire reconnects within ~500–1000 ms simultaneously — exactly the thundering-herd the backoff is designed to prevent. The two close-code paths need to be disambiguated: use an application-level code (e.g., 4001) for the deploy-drain signal and keep 1001 only for RFC-conformant "going away" closures, or change the ping-timeout path to use a distinct code so the host can tell the two apart.
How can I resolve this? If you propose a fix, please make it concise.| process.on("SIGTERM", async () => { | ||
| if (draining) return; | ||
| draining = true; | ||
| console.log("[relay] SIGTERM received, draining tunnels"); | ||
| try { | ||
| await tunnelManager.drain(); | ||
| } catch (err) { | ||
| console.error("[relay] drain failed", err); | ||
| } | ||
| process.exit(0); | ||
| }); |
There was a problem hiding this comment.
HTTP server keeps accepting connections during drain window
After SIGTERM fires, drain() closes all existing tunnel WebSockets but leaves the HTTP server running for the full 250 ms flush window. Any host that reconnects very quickly (possible given the 1001 reset-backoff change on the client side) can successfully open a new tunnel that immediately gets torn down when process.exit(0) fires. Calling server.close() before awaiting tunnelManager.drain() would stop the listener from accepting new connections while still allowing in-flight sockets to drain.
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/relay/src/index.ts
Line: 55-65
Comment:
**HTTP server keeps accepting connections during drain window**
After SIGTERM fires, `drain()` closes all existing tunnel WebSockets but leaves the HTTP server running for the full 250 ms flush window. Any host that reconnects very quickly (possible given the 1001 reset-backoff change on the client side) can successfully open a new tunnel that immediately gets torn down when `process.exit(0)` fires. Calling `server.close()` before awaiting `tunnelManager.drain()` would stop the listener from accepting new connections while still allowing in-flight sockets to drain.
How can I resolve this? If you propose a fix, please make it concise.
🧹 Preview Cleanup CompleteThe following preview resources have been cleaned up:
Thank you for your contribution! 🎉 |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/relay/src/directory.ts`:
- Around line 144-149: The startup cleanup currently loads OWNER_KEY into
allOwners then calls unregister(hostId, region, machineId) serially, causing one
Redis RTT per host; change this to batch the cleanup: either maintain an
owner-indexed set (e.g., OWNER_INDEX_<owner>) and retrieve that set to
pipeline/publish a bounded cleanup, or implement a single Redis Lua script that
accepts myOwner, region and machineId and removes/unregisters all matching
hostIds server-side. Update the logic around OWNER_KEY, allOwners, and
unregister so you collect hostIds locally and perform a pipelined/multi or
single EVAL that performs the equivalent unregister operations atomically,
avoiding awaiting unregister in a loop. Ensure region and machineId parameters
are passed into the batch script/pipeline so the same validation/side-effects
occur as the original unregister function.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: aed582de-fc34-4f99-b147-02dd38fb8abb
📒 Files selected for processing (2)
apps/relay/src/directory.tsapps/relay/src/index.ts
There was a problem hiding this comment.
2 issues found across 14 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/relay/scripts/smoke-test.sh">
<violation number="1" location="apps/relay/scripts/smoke-test.sh:27">
P2: The smoke test never verifies HTTP status 200, so non-200 `/health` responses can be treated as success if the body still matches the region.</violation>
</file>
<file name=".github/workflows/release-cli.yml">
<violation number="1" location=".github/workflows/release-cli.yml:68">
P1: The prerelease guard is overly broad and causes this step to be skipped for all `cli-v*` tags, including stable releases.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Re-trigger cubic
| fail=0 | ||
| for region in "${REGIONS[@]}"; do | ||
| printf " %-4s " "$region" | ||
| body=$(curl -sS --max-time 8 -H "fly-prefer-region: $region" "https://${HOSTNAME}/health" 2>&1) || { |
There was a problem hiding this comment.
P2: The smoke test never verifies HTTP status 200, so non-200 /health responses can be treated as success if the body still matches the region.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/relay/scripts/smoke-test.sh, line 27:
<comment>The smoke test never verifies HTTP status 200, so non-200 `/health` responses can be treated as success if the body still matches the region.</comment>
<file context>
@@ -0,0 +1,45 @@
+fail=0
+for region in "${REGIONS[@]}"; do
+ printf " %-4s " "$region"
+ body=$(curl -sS --max-time 8 -H "fly-prefer-region: $region" "https://${HOSTNAME}/health" 2>&1) || {
+ printf " ✗ curl failed: %s\n" "$body"
+ fail=$((fail + 1))
</file context>
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".github/workflows/release-cli.yml">
<violation number="1" location=".github/workflows/release-cli.yml:70">
P2: The release gate is too narrow: it excludes only `-alpha/-beta/-rc`, so other prerelease tags can still publish `cli-latest`.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Re-trigger cubic
Discovered during multi-region rolling-deploy game-day: every relay
machine restart left host tunnels disconnected for ~60s because:
1. Fly sends SIGTERM, relay process exits without closing WS, host
sees a TCP-RST'd socket as a generic error.
2. Host's exponential reconnect backoff (1s → 30s cap) climbs to the
30s/attempt tail after ~6 failed attempts while the relay is down.
3. By the time the relay is back up, the host is in deep backoff and
takes another 15-30s to retry.
Result: ~60s user-visible outage per rolling-restart cycle, multiplied
across regions during multi-region deploys.
Fix is two-sided:
- Relay (apps/relay/src/tunnel.ts + index.ts): add `drain()` on
TunnelManager that walks every open WS and closes it with code 1001
("Going Away"), waits 250ms for frames to flush, then exits. Wired
to SIGTERM in index.ts so Fly's pre-stop signal does the right thing.
- Host (packages/host-service/src/tunnel/tunnel-client.ts): when the
WS onClose event reports code 1001, reset reconnectAttempts to 0 so
the next reconnect fires at the 0.5-1s base delay instead of the
saturated 30s/attempt tail.
Combined, deploy outages should drop from ~60s to ~2-5s per affected
host. Pre-requisite for safe multi-region prod rollout — multi-region
amortizes the impact across regions but doesn't fix the per-host
window without graceful drain.
Verified via game-day on superset-relay-staging (sjc, iad, fra).
Verified during game-day: Fly's init sends SIGINT (not SIGTERM) to the main process during rolling deploys, AND its default ~5s grace period kills the process before the JS handler runs. Two fixes: - Listen on both SIGINT and SIGTERM in apps/relay/src/index.ts so the drain handler intercepts Fly's actual signal (also covers `fly machine stop` and local Ctrl-C). - Add `kill_signal = "SIGINT"` + `kill_timeout = "10s"` to both fly.toml and fly.staging.toml so the handler actually has time to drain. Verified on staging deploy 4: SJC SIGINT now fires "[relay] SIGINT received, draining tunnels", unregisters open WSs cleanly, then exits with code 0. Host re-registration happens ~12s after WS close (down from 64s pre-fix).
Game-day #14 exposed a ~2min user-visible glitch when a relay machine gets killed and auto-restarted by Fly. The killed process never gets to unregister its tunnels (SIGKILL leaves no opportunity), so the directory keeps entries pointing at "{region}:{machineId}". Fly auto-restarts the same machineId, but the new process has no in-memory tunnels — meanwhile, cross-region requests still fly-replay traffic at that stale machineId and the reconciler reports `is_online=yes` based on the stale directory. Fix: on startup, scan the OWNER hash for entries matching our own encoded owner string and delete them via the existing compare-and- delete `unregister` helper. Runs synchronously before serve() via top-level await — the relay won't accept connections until cleanup completes, so we can't race a fresh registration. Verified on staging: kill SJC machine → 6s VM cycle → "cleared 2 stale directory entries on startup" log → sandbox re-registered 13s after kill (down from ~2min recovery without this fix).
Game-day #15 (WS storm) showed daemons stranded for minutes after consecutive disconnects. With the previous 30s cap, retries land 15-30s apart in the saturated tail of the exponential curve, and the post-recovery retry frequently misses the relay-back-online window. 5s cap means slightly more retry traffic under prolonged outages but ensures hosts reconnect promptly when the relay is back. Pairs with the 1001-reset (clean close → reset attempts) to keep deploy-time disconnects to a few seconds.
…h cleanup Four fixes from PR review on #4594: 1. Drain uses app-defined close code 4001 instead of 1001. The ping-timeout path in tunnel.ts already uses 1001, and the host-side reset-on-clean-close would have fired for those too — risking a thundering-herd reconnect storm under a mass ping-timeout event. Now the host only resets on 4001 (deliberate deploy drain), and 1001 keeps its RFC-conformant "going away" semantics elsewhere. (Greptile P1) 2. Stop accepting new TCP connections during drain. Call server.close() before tunnelManager.drain() so a host that reconnects during the 250ms flush window doesn't open a tunnel that immediately dies on process.exit. (Greptile P2) 3. Startup directory cleanup batched into a single Lua EVAL. Previous loop did one Redis round-trip per stale entry — at prod scale (~500 tunnels per machine) that would have eaten 25s of startup time on typical Upstash latency. Single EVAL keeps cleanup at one RTT regardless of directory size. (CodeRabbit Major) 4. Update host-side comment from "30s retry mode" to "5s ceiling" (matches the MAX_MS=5s change earlier in this PR). (CodeRabbit Minor)
Adds smoke-test.sh that pings /health on every configured region via `fly-prefer-region` and verifies the response body's `region` field matches the requested region. Wired into both deploy.sh and deploy-staging.sh so deploys halt if any region failed to come up. Catches partial-deploy failures, missing regions, and machines that booted but aren't actually serving — exactly the multi-region failure modes we couldn't detect with the existing single-region setup. Verified manually against superset-relay-staging.fly.dev (sjc, iad, fra) — all 3 returned correct region in /health response.
062be1c to
ac4f410
Compare
|
Ready to review this PR? Stage has broken it down into 6 individual chapters for you: Chapters generated by Stage for commit 75994e0 on May 16, 2026 7:26pm UTC. |
Game-day testing on multi-region staging revealed the WS close frame
doesn't reliably reach the host within Fly's kill_timeout window. The
host's TCP socket stayed ESTABLISHED with no onclose event firing until
the 75s inbound-silence watchdog tripped — i.e. ~110s of user-visible
outage on every deploy.
Send a JSON {type:"drain"} message on the WS data channel first, wait
500ms for it to flush, then close. The message channel already carries
ping/pong every 30s, so we know it traverses the path cleanly. Host's
onmessage handler resets backoff and triggers an immediate reconnect.
The subsequent close() is belt-and-suspenders.
Verified end-to-end on staging: host log shows
"[host-service:tunnel] relay drain notice received (Server draining
for deploy); reconnecting immediately"
followed by reconnect within ~900ms.
ac4f410 to
3aee590
Compare
…ile draining drain() previously relied on WS onClose → unregister → fire-and-forget directory.unregister for directory cleanup. process.exit(0) runs right after drain() returns, so those async Redis writes routinely didn't land before the process died — leaving the dead machine's directory entries live for ~90s (TTL), during which other relay machines fly-replay traffic to the corpse. Invisible single-region; a real per-deploy misrouting window multi-region. - drain() now awaits an explicit clearDirectory() callback (clearStaleEntriesForMachine — the batch Lua script already used on startup) before the process exits. - A draining flag rejects new tunnel registrations during the drain window, in register() and the /tunnel onOpen handler, re-checked after the async auth calls so a host can't slip in while draining flips. - drain() routes each tunnel through disposeTunnel and removes it from the map, so the trailing onClose callbacks are no-ops rather than racing the explicit directory clear.
Summary
Discovered during a multi-region rolling-deploy game-day on `superset-relay-staging`: every relay-machine restart left host tunnels disconnected for ~60s before they reconnected. That outage compounds across regions during multi-region deploys, so it's a blocker for rolling multi-region to prod cleanly.
Root cause
Game-day timings:
Fix (two-sided)
Relay (`apps/relay/src/tunnel.ts` + `apps/relay/src/index.ts`):
Host (`packages/host-service/src/tunnel/tunnel-client.ts`):
Together: deploy outages should drop from ~60s to ~2-5s per affected host.
Test plan
Out of scope (separate concerns to track)
Summary by cubic
Gracefully drains relay tunnels on SIGINT/SIGTERM: stop new connections, send an in-band “drain” message, close with WS code 4001 so hosts reconnect immediately, and await directory cleanup before exit. Cuts deploy-time outages to ~1–5s, clears stale directory entries, and adds Fly lifecycle settings plus region-aware smoke tests for safer multi‑region deploys.
New Features
apps/relay): on SIGINT/SIGTERM callserver.close(), broadcast{type:"drain"}, close with code 4001, reject new registers while draining, await batched directory cleanup, then exit. Setkill_signal = "SIGINT"andkill_timeout = "10s"infly.tomlandfly.staging.toml.packages/host-service): on{type:"drain"}or close code 4001, resetreconnectAttemptsand reconnect immediately; capRECONNECT_MAX_MSat 5s. Protocol addsdrainmessage (packages/shared).apps/relay/scripts/deploy-staging.sh; add region-awaresmoke-test.shand run it from both deploy scripts to verify/healthper region.Bug Fixes
is_online.Written for commit 75994e0. Summary will update on new commits. Review in cubic
Summary by CodeRabbit
New Features
Chores