Skip to content

feat(relay): graceful tunnel drain on SIGTERM#4594

Merged
saddlepaddle merged 9 commits into
mainfrom
feat-relay-graceful-drain
May 16, 2026
Merged

feat(relay): graceful tunnel drain on SIGTERM#4594
saddlepaddle merged 9 commits into
mainfrom
feat-relay-graceful-drain

Conversation

@saddlepaddle
Copy link
Copy Markdown
Collaborator

@saddlepaddle saddlepaddle commented May 15, 2026

Summary

Discovered during a multi-region rolling-deploy game-day on `superset-relay-staging`: every relay-machine restart left host tunnels disconnected for ~60s before they reconnected. That outage compounds across regions during multi-region deploys, so it's a blocker for rolling multi-region to prod cleanly.

Root cause

  1. Fly sends SIGTERM to the relay process during a rolling deploy.
  2. Relay had no SIGTERM handler — the process just exits, leaving every open WS as a TCP-RST'd socket from the host's perspective.
  3. Host-service's exponential reconnect backoff (`RECONNECT_BASE_MS=1s`, `RECONNECT_MAX_MS=30s`) climbs into the 30s/attempt tail after ~6 failed attempts while the relay is down (~30-40s).
  4. By the time the relay comes back, the host is mid-30s-wait. Worst case is ~60s before the next attempt fires and the WS re-establishes.

Game-day timings:

  • 04:02:41Z — SJC machine restart (sandbox WS killed)
  • 04:02:42Z — SJC listening again
  • 04:03:45Z — sandbox re-registered (63s gap from restart)

Fix (two-sided)

Relay (`apps/relay/src/tunnel.ts` + `apps/relay/src/index.ts`):

  • New `tunnelManager.drain()` that walks every open WS and closes it with code 1001 ("Going Away"), waits 250ms for close frames to flush, then exits.
  • Wired to SIGTERM in index.ts so Fly's pre-stop signal triggers the clean drain.

Host (`packages/host-service/src/tunnel/tunnel-client.ts`):

  • `socket.onclose` now recognizes code 1001 specifically and resets `reconnectAttempts = 0` before scheduling the next attempt. The next reconnect fires at the 0.5-1s base delay instead of the saturated 30s tail.

Together: deploy outages should drop from ~60s to ~2-5s per affected host.

Test plan

  • Deploy this branch to `superset-relay-staging` via `./apps/relay/scripts/deploy-staging.sh`.
  • Re-run the rolling-deploy game-day: trigger a redeploy, watch sandbox + satyas-mbp tunnels in staging logs, confirm re-registration happens within ~5s of each machine coming back (instead of ~60s).
  • Confirm no zombie directory entries — `disposeTunnel` cleans pingTimer/pendingRequests on close, and the existing `unregister` path on the next `onclose` of the host-side socket fires correctly.

Out of scope (separate concerns to track)

  • The shorter-term mitigation "lower `RECONNECT_MAX_MS`" was considered but rejected — under genuine sustained outages we still want backoff to prevent storms. The 1001-reset path is the right shape: fast retry only when the relay clearly told us "I'm going away".
  • Cloud-DB lag (`is_online` staying stale after re-register) — tracked separately; this PR doesn't touch the `scheduleOnlineWrite` path.
  • Synthetic check on staging — nice-to-have but not blocking.

Summary by cubic

Gracefully drains relay tunnels on SIGINT/SIGTERM: stop new connections, send an in-band “drain” message, close with WS code 4001 so hosts reconnect immediately, and await directory cleanup before exit. Cuts deploy-time outages to ~1–5s, clears stale directory entries, and adds Fly lifecycle settings plus region-aware smoke tests for safer multi‑region deploys.

  • New Features

    • Relay (apps/relay): on SIGINT/SIGTERM call server.close(), broadcast {type:"drain"}, close with code 4001, reject new registers while draining, await batched directory cleanup, then exit. Set kill_signal = "SIGINT" and kill_timeout = "10s" in fly.toml and fly.staging.toml.
    • Host (packages/host-service): on {type:"drain"} or close code 4001, reset reconnectAttempts and reconnect immediately; cap RECONNECT_MAX_MS at 5s. Protocol adds drain message (packages/shared).
    • Deploy/ops: add apps/relay/scripts/deploy-staging.sh; add region-aware smoke-test.sh and run it from both deploy scripts to verify /health per region.
  • Bug Fixes

    • Relay directory: batch-clear self-owned entries via a single Redis EVAL on startup and during drain; await cleanup before exit to prevent stale routing and incorrect is_online.

Written for commit 75994e0. Summary will update on new commits. Review in cubic

Summary by CodeRabbit

  • New Features

    • Relay now gracefully drains active connections on shutdown to reduce abrupt disconnects.
    • Relay startup clears stale directory entries from prior processes to avoid stale state.
    • Tunnel clients detect relay-initiated shutdowns and reconnect immediately with faster retry behavior.
  • Chores

    • Added deployment script and runtime lifecycle settings to enable rolling deploys with graceful shutdown.

Review Change Stack

@capy-ai
Copy link
Copy Markdown

capy-ai Bot commented May 15, 2026

Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a TunnelManager WS close-code constant (4001) and an async drain() that closes tunnels with that code; reduces TunnelClient reconnect ceiling to 5s and treats WS close code 4001 as a drain signal that resets reconnectAttempts.

Changes

Tunnel drain & reconnect

Layer / File(s) Summary
TunnelManager WS close-code and drain
apps/relay/src/tunnel.ts
Adds TunnelManager.WS_CLOSE_DRAIN = 4001 and async drain(reason?) which closes all tunnel WebSockets with code 4001 and waits ~250ms for frames to flush.
TunnelClient reconnect ceiling & onclose handling
packages/host-service/src/tunnel/tunnel-client.ts
Reduces RECONNECT_MAX_MS from 30,000ms to 5,000ms and adds an onclose branch for close code 4001 that resets reconnectAttempts to 0 and logs a drain-related message before the existing reconnect scheduling path proceeds.

Sequence Diagram(s)

sequenceDiagram
  participant Signal
  participant RelayServer
  participant TunnelManager
  participant TunnelClient
  Signal->>RelayServer: SIGINT/SIGTERM
  RelayServer->>TunnelManager: drain("Server draining for deploy")
  TunnelManager->>TunnelClient: close WS (code=4001, reason)
  TunnelClient-->>TunnelClient: onclose(code=4001) -> reconnectAttempts = 0
  TunnelClient->>TunnelClient: schedule reconnect (backoff ≤ 5000ms)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 I closed the tunnels soft and neat,
Sent code four-thousand, a polite retreat,
Reset my hops to try again,
Shorter waits, a quicker mend,
Hoppity-hop — deploy complete.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: graceful tunnel drain on SIGTERM, which is the core feature introduced across multiple files in the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description is comprehensive, well-structured, and covers all required template sections including a clear summary, related issues context, type of change, testing plan, and additional notes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat-relay-graceful-drain

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 15, 2026

Greptile Summary

This PR adds a graceful tunnel-drain path to the relay so that rolling deploys no longer leave host-services in exponential-backoff limbo (~60 s) after a SIGTERM. The relay closes every open WebSocket with code 1001 before exiting, and the host-service now recognises code 1001 as a clean drain signal and resets its reconnect counter so it retries within ~1 s.

  • apps/relay/src/tunnel.ts + index.ts: New TunnelManager.drain() closes all tunnels with WS code 1001 and waits 250 ms for frames to flush; wired to SIGTERM in index.ts behind a draining guard.
  • packages/host-service/src/tunnel/tunnel-client.ts: onclose resets reconnectAttempts = 0 when event.code === 1001, bypassing saturated backoff for what the client treats as a clean relay restart.

Confidence Score: 3/5

The deploy-drain path works correctly in the happy path, but the close-code collision with the existing ping-timeout path means the host-service cannot distinguish the two scenarios and may trigger reconnect storms under load.

The relay's existing ping-timeout handler already closes host sockets with code 1001. The new client-side logic resets backoff for any code-1001 close, so a wave of ping timeouts during relay overload would produce the same thundering-herd that the exponential backoff was designed to suppress.

apps/relay/src/tunnel.ts and packages/host-service/src/tunnel/tunnel-client.ts need a consistent, distinct close code for the deploy-drain signal vs. the ping-timeout path.

Important Files Changed

Filename Overview
apps/relay/src/index.ts Adds SIGTERM handler with draining guard that calls tunnelManager.drain() then exits; HTTP server is not stopped before the 250 ms flush window, leaving a brief reconnect-then-immediate-exit gap.
apps/relay/src/tunnel.ts New drain() method closes all tunnels with close code 1001; the same code 1001 is already used in the existing ping-timeout path, creating a semantic collision that the host-service cannot resolve.
packages/host-service/src/tunnel/tunnel-client.ts Resets reconnectAttempts to 0 on close code 1001; because ping-timeout closures also use 1001, this fast-reconnect path can be triggered unintentionally under load, potentially causing reconnect storms.

Sequence Diagram

sequenceDiagram
    participant Fly as Fly Platform
    participant Relay as Relay (index.ts)
    participant TM as TunnelManager
    participant Host as Host Service

    Fly->>Relay: SIGTERM
    Relay->>Relay: "draining = true"
    Relay->>TM: drain()
    loop each open tunnel
        TM->>Host: ws.close(1001, "Server draining for deploy")
    end
    Note over TM: await 250ms (flush close frames)
    Host->>Host: "onclose(event.code = 1001)"
    Host->>Host: "reconnectAttempts = 0"
    TM-->>Relay: drain() resolved
    Relay->>Relay: process.exit(0)
    Note over Host: scheduleReconnect() ~500-1000ms delay
    Host->>Relay: reconnect (new relay instance)
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
packages/host-service/src/tunnel/tunnel-client.ts:133-141
**Close code 1001 collision with ping-timeout path**

The relay already closes tunnels with code 1001 in the ping-timeout path (`tunnel.ts:112`: `ws.close(1001, "Ping timeout")`). Since the new host-service handler treats *any* 1001 as a clean deploy drain and resets `reconnectAttempts = 0`, a mass ping-timeout event (e.g., relay under load, latency spike causing 3+ missed pings across many hosts) would cause all affected hosts to fire reconnects within ~500–1000 ms simultaneously — exactly the thundering-herd the backoff is designed to prevent. The two close-code paths need to be disambiguated: use an application-level code (e.g., 4001) for the deploy-drain signal and keep 1001 only for RFC-conformant "going away" closures, or change the ping-timeout path to use a distinct code so the host can tell the two apart.

### Issue 2 of 2
apps/relay/src/index.ts:55-65
**HTTP server keeps accepting connections during drain window**

After SIGTERM fires, `drain()` closes all existing tunnel WebSockets but leaves the HTTP server running for the full 250 ms flush window. Any host that reconnects very quickly (possible given the 1001 reset-backoff change on the client side) can successfully open a new tunnel that immediately gets torn down when `process.exit(0)` fires. Calling `server.close()` before awaiting `tunnelManager.drain()` would stop the listener from accepting new connections while still allowing in-flight sockets to drain.

Reviews (1): Last reviewed commit: "feat(relay): graceful tunnel drain on SI..." | Re-trigger Greptile

Comment on lines +133 to +141
// Clean close from the relay (1001 = "Going Away", typically a
// deploy drain). Reset backoff so we don't sit in 30s retry
// mode while the relay is restarting in <2s.
if (event.code === 1001) {
this.reconnectAttempts = 0;
console.log(
"[host-service:tunnel] relay draining; reconnecting immediately",
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Close code 1001 collision with ping-timeout path

The relay already closes tunnels with code 1001 in the ping-timeout path (tunnel.ts:112: ws.close(1001, "Ping timeout")). Since the new host-service handler treats any 1001 as a clean deploy drain and resets reconnectAttempts = 0, a mass ping-timeout event (e.g., relay under load, latency spike causing 3+ missed pings across many hosts) would cause all affected hosts to fire reconnects within ~500–1000 ms simultaneously — exactly the thundering-herd the backoff is designed to prevent. The two close-code paths need to be disambiguated: use an application-level code (e.g., 4001) for the deploy-drain signal and keep 1001 only for RFC-conformant "going away" closures, or change the ping-timeout path to use a distinct code so the host can tell the two apart.

Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/host-service/src/tunnel/tunnel-client.ts
Line: 133-141

Comment:
**Close code 1001 collision with ping-timeout path**

The relay already closes tunnels with code 1001 in the ping-timeout path (`tunnel.ts:112`: `ws.close(1001, "Ping timeout")`). Since the new host-service handler treats *any* 1001 as a clean deploy drain and resets `reconnectAttempts = 0`, a mass ping-timeout event (e.g., relay under load, latency spike causing 3+ missed pings across many hosts) would cause all affected hosts to fire reconnects within ~500–1000 ms simultaneously — exactly the thundering-herd the backoff is designed to prevent. The two close-code paths need to be disambiguated: use an application-level code (e.g., 4001) for the deploy-drain signal and keep 1001 only for RFC-conformant "going away" closures, or change the ping-timeout path to use a distinct code so the host can tell the two apart.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread apps/relay/src/index.ts Outdated
Comment on lines +55 to +65
process.on("SIGTERM", async () => {
if (draining) return;
draining = true;
console.log("[relay] SIGTERM received, draining tunnels");
try {
await tunnelManager.drain();
} catch (err) {
console.error("[relay] drain failed", err);
}
process.exit(0);
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 HTTP server keeps accepting connections during drain window

After SIGTERM fires, drain() closes all existing tunnel WebSockets but leaves the HTTP server running for the full 250 ms flush window. Any host that reconnects very quickly (possible given the 1001 reset-backoff change on the client side) can successfully open a new tunnel that immediately gets torn down when process.exit(0) fires. Calling server.close() before awaiting tunnelManager.drain() would stop the listener from accepting new connections while still allowing in-flight sockets to drain.

Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/relay/src/index.ts
Line: 55-65

Comment:
**HTTP server keeps accepting connections during drain window**

After SIGTERM fires, `drain()` closes all existing tunnel WebSockets but leaves the HTTP server running for the full 250 ms flush window. Any host that reconnects very quickly (possible given the 1001 reset-backoff change on the client side) can successfully open a new tunnel that immediately gets torn down when `process.exit(0)` fires. Calling `server.close()` before awaiting `tunnelManager.drain()` would stop the listener from accepting new connections while still allowing in-flight sockets to drain.

How can I resolve this? If you propose a fix, please make it concise.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 15, 2026

🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

  • ✅ Neon database branch

Thank you for your contribution! 🎉

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/relay/src/directory.ts`:
- Around line 144-149: The startup cleanup currently loads OWNER_KEY into
allOwners then calls unregister(hostId, region, machineId) serially, causing one
Redis RTT per host; change this to batch the cleanup: either maintain an
owner-indexed set (e.g., OWNER_INDEX_<owner>) and retrieve that set to
pipeline/publish a bounded cleanup, or implement a single Redis Lua script that
accepts myOwner, region and machineId and removes/unregisters all matching
hostIds server-side. Update the logic around OWNER_KEY, allOwners, and
unregister so you collect hostIds locally and perform a pipelined/multi or
single EVAL that performs the equivalent unregister operations atomically,
avoiding awaiting unregister in a loop. Ensure region and machineId parameters
are passed into the batch script/pipeline so the same validation/side-effects
occur as the original unregister function.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aed582de-fc34-4f99-b147-02dd38fb8abb

📥 Commits

Reviewing files that changed from the base of the PR and between ef63c7f and 65dfa44.

📒 Files selected for processing (2)
  • apps/relay/src/directory.ts
  • apps/relay/src/index.ts

Comment thread apps/relay/src/directory.ts Outdated
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 14 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/relay/scripts/smoke-test.sh">

<violation number="1" location="apps/relay/scripts/smoke-test.sh:27">
P2: The smoke test never verifies HTTP status 200, so non-200 `/health` responses can be treated as success if the body still matches the region.</violation>
</file>

<file name=".github/workflows/release-cli.yml">

<violation number="1" location=".github/workflows/release-cli.yml:68">
P1: The prerelease guard is overly broad and causes this step to be skipped for all `cli-v*` tags, including stable releases.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Re-trigger cubic

Comment thread .github/workflows/release-cli.yml Outdated
fail=0
for region in "${REGIONS[@]}"; do
printf " %-4s " "$region"
body=$(curl -sS --max-time 8 -H "fly-prefer-region: $region" "https://${HOSTNAME}/health" 2>&1) || {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The smoke test never verifies HTTP status 200, so non-200 /health responses can be treated as success if the body still matches the region.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/relay/scripts/smoke-test.sh, line 27:

<comment>The smoke test never verifies HTTP status 200, so non-200 `/health` responses can be treated as success if the body still matches the region.</comment>

<file context>
@@ -0,0 +1,45 @@
+fail=0
+for region in "${REGIONS[@]}"; do
+  printf "  %-4s " "$region"
+  body=$(curl -sS --max-time 8 -H "fly-prefer-region: $region" "https://${HOSTNAME}/health" 2>&1) || {
+    printf "  ✗ curl failed: %s\n" "$body"
+    fail=$((fail + 1))
</file context>

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name=".github/workflows/release-cli.yml">

<violation number="1" location=".github/workflows/release-cli.yml:70">
P2: The release gate is too narrow: it excludes only `-alpha/-beta/-rc`, so other prerelease tags can still publish `cli-latest`.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Re-trigger cubic

Comment thread .github/workflows/release-cli.yml Outdated
Discovered during multi-region rolling-deploy game-day: every relay
machine restart left host tunnels disconnected for ~60s because:

1. Fly sends SIGTERM, relay process exits without closing WS, host
   sees a TCP-RST'd socket as a generic error.
2. Host's exponential reconnect backoff (1s → 30s cap) climbs to the
   30s/attempt tail after ~6 failed attempts while the relay is down.
3. By the time the relay is back up, the host is in deep backoff and
   takes another 15-30s to retry.

Result: ~60s user-visible outage per rolling-restart cycle, multiplied
across regions during multi-region deploys.

Fix is two-sided:

- Relay (apps/relay/src/tunnel.ts + index.ts): add `drain()` on
  TunnelManager that walks every open WS and closes it with code 1001
  ("Going Away"), waits 250ms for frames to flush, then exits. Wired
  to SIGTERM in index.ts so Fly's pre-stop signal does the right thing.

- Host (packages/host-service/src/tunnel/tunnel-client.ts): when the
  WS onClose event reports code 1001, reset reconnectAttempts to 0 so
  the next reconnect fires at the 0.5-1s base delay instead of the
  saturated 30s/attempt tail.

Combined, deploy outages should drop from ~60s to ~2-5s per affected
host. Pre-requisite for safe multi-region prod rollout — multi-region
amortizes the impact across regions but doesn't fix the per-host
window without graceful drain.

Verified via game-day on superset-relay-staging (sjc, iad, fra).
Verified during game-day: Fly's init sends SIGINT (not SIGTERM) to the
main process during rolling deploys, AND its default ~5s grace period
kills the process before the JS handler runs. Two fixes:

- Listen on both SIGINT and SIGTERM in apps/relay/src/index.ts so the
  drain handler intercepts Fly's actual signal (also covers
  `fly machine stop` and local Ctrl-C).
- Add `kill_signal = "SIGINT"` + `kill_timeout = "10s"` to both fly.toml
  and fly.staging.toml so the handler actually has time to drain.

Verified on staging deploy 4: SJC SIGINT now fires "[relay] SIGINT
received, draining tunnels", unregisters open WSs cleanly, then exits
with code 0. Host re-registration happens ~12s after WS close (down
from 64s pre-fix).
Game-day #14 exposed a ~2min user-visible glitch when a relay machine
gets killed and auto-restarted by Fly. The killed process never gets
to unregister its tunnels (SIGKILL leaves no opportunity), so the
directory keeps entries pointing at "{region}:{machineId}". Fly
auto-restarts the same machineId, but the new process has no in-memory
tunnels — meanwhile, cross-region requests still fly-replay traffic
at that stale machineId and the reconciler reports `is_online=yes`
based on the stale directory.

Fix: on startup, scan the OWNER hash for entries matching our own
encoded owner string and delete them via the existing compare-and-
delete `unregister` helper. Runs synchronously before serve() via
top-level await — the relay won't accept connections until cleanup
completes, so we can't race a fresh registration.

Verified on staging: kill SJC machine → 6s VM cycle → "cleared 2
stale directory entries on startup" log → sandbox re-registered 13s
after kill (down from ~2min recovery without this fix).
Game-day #15 (WS storm) showed daemons stranded for minutes after
consecutive disconnects. With the previous 30s cap, retries land 15-30s
apart in the saturated tail of the exponential curve, and the
post-recovery retry frequently misses the relay-back-online window.

5s cap means slightly more retry traffic under prolonged outages but
ensures hosts reconnect promptly when the relay is back. Pairs with
the 1001-reset (clean close → reset attempts) to keep deploy-time
disconnects to a few seconds.
…h cleanup

Four fixes from PR review on #4594:

1. Drain uses app-defined close code 4001 instead of 1001. The
   ping-timeout path in tunnel.ts already uses 1001, and the host-side
   reset-on-clean-close would have fired for those too — risking a
   thundering-herd reconnect storm under a mass ping-timeout event.
   Now the host only resets on 4001 (deliberate deploy drain), and
   1001 keeps its RFC-conformant "going away" semantics elsewhere.
   (Greptile P1)

2. Stop accepting new TCP connections during drain. Call server.close()
   before tunnelManager.drain() so a host that reconnects during the
   250ms flush window doesn't open a tunnel that immediately dies on
   process.exit. (Greptile P2)

3. Startup directory cleanup batched into a single Lua EVAL. Previous
   loop did one Redis round-trip per stale entry — at prod scale (~500
   tunnels per machine) that would have eaten 25s of startup time on
   typical Upstash latency. Single EVAL keeps cleanup at one RTT
   regardless of directory size. (CodeRabbit Major)

4. Update host-side comment from "30s retry mode" to "5s ceiling"
   (matches the MAX_MS=5s change earlier in this PR). (CodeRabbit
   Minor)
Adds smoke-test.sh that pings /health on every configured region via
`fly-prefer-region` and verifies the response body's `region` field
matches the requested region. Wired into both deploy.sh and
deploy-staging.sh so deploys halt if any region failed to come up.

Catches partial-deploy failures, missing regions, and machines that
booted but aren't actually serving — exactly the multi-region failure
modes we couldn't detect with the existing single-region setup.

Verified manually against superset-relay-staging.fly.dev (sjc, iad, fra)
— all 3 returned correct region in /health response.
@stage-review
Copy link
Copy Markdown

stage-review Bot commented May 16, 2026

Ready to review this PR? Stage has broken it down into 6 individual chapters for you:

Title
1 Define drain message in tunnel protocol
2 Implement stale entry cleanup in relay directory
3 Implement tunnel draining in TunnelManager
4 Wire drain logic to relay lifecycle
5 Handle drain signals in host-service
6 Configure Fly lifecycle and deployment scripts
Open in Stage

Chapters generated by Stage for commit 75994e0 on May 16, 2026 7:26pm UTC.

Game-day testing on multi-region staging revealed the WS close frame
doesn't reliably reach the host within Fly's kill_timeout window. The
host's TCP socket stayed ESTABLISHED with no onclose event firing until
the 75s inbound-silence watchdog tripped — i.e. ~110s of user-visible
outage on every deploy.

Send a JSON {type:"drain"} message on the WS data channel first, wait
500ms for it to flush, then close. The message channel already carries
ping/pong every 30s, so we know it traverses the path cleanly. Host's
onmessage handler resets backoff and triggers an immediate reconnect.
The subsequent close() is belt-and-suspenders.

Verified end-to-end on staging: host log shows
  "[host-service:tunnel] relay drain notice received (Server draining
   for deploy); reconnecting immediately"
followed by reconnect within ~900ms.
…ile draining

drain() previously relied on WS onClose → unregister → fire-and-forget
directory.unregister for directory cleanup. process.exit(0) runs right
after drain() returns, so those async Redis writes routinely didn't land
before the process died — leaving the dead machine's directory entries
live for ~90s (TTL), during which other relay machines fly-replay
traffic to the corpse. Invisible single-region; a real per-deploy
misrouting window multi-region.

- drain() now awaits an explicit clearDirectory() callback
  (clearStaleEntriesForMachine — the batch Lua script already used on
  startup) before the process exits.
- A draining flag rejects new tunnel registrations during the drain
  window, in register() and the /tunnel onOpen handler, re-checked after
  the async auth calls so a host can't slip in while draining flips.
- drain() routes each tunnel through disposeTunnel and removes it from
  the map, so the trailing onClose callbacks are no-ops rather than
  racing the explicit directory clear.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant