feat(relay): graceful tunnel drain on SIGTERM by saddlepaddle · Pull Request #4594 · superset-sh/superset

saddlepaddle · 2026-05-15T04:10:16Z

Summary

Discovered during a multi-region rolling-deploy game-day on `superset-relay-staging`: every relay-machine restart left host tunnels disconnected for ~60s before they reconnected. That outage compounds across regions during multi-region deploys, so it's a blocker for rolling multi-region to prod cleanly.

Root cause

Fly sends SIGTERM to the relay process during a rolling deploy.
Relay had no SIGTERM handler — the process just exits, leaving every open WS as a TCP-RST'd socket from the host's perspective.
Host-service's exponential reconnect backoff (`RECONNECT_BASE_MS=1s`, `RECONNECT_MAX_MS=30s`) climbs into the 30s/attempt tail after ~6 failed attempts while the relay is down (~30-40s).
By the time the relay comes back, the host is mid-30s-wait. Worst case is ~60s before the next attempt fires and the WS re-establishes.

Game-day timings:

04:02:41Z — SJC machine restart (sandbox WS killed)
04:02:42Z — SJC listening again
04:03:45Z — sandbox re-registered (63s gap from restart)

Fix (two-sided)

Relay (`apps/relay/src/tunnel.ts` + `apps/relay/src/index.ts`):

New `tunnelManager.drain()` that walks every open WS and closes it with code 1001 ("Going Away"), waits 250ms for close frames to flush, then exits.
Wired to SIGTERM in index.ts so Fly's pre-stop signal triggers the clean drain.

Host (`packages/host-service/src/tunnel/tunnel-client.ts`):

`socket.onclose` now recognizes code 1001 specifically and resets `reconnectAttempts = 0` before scheduling the next attempt. The next reconnect fires at the 0.5-1s base delay instead of the saturated 30s tail.

Together: deploy outages should drop from ~60s to ~2-5s per affected host.

Test plan

Deploy this branch to `superset-relay-staging` via `./apps/relay/scripts/deploy-staging.sh`.
Re-run the rolling-deploy game-day: trigger a redeploy, watch sandbox + satyas-mbp tunnels in staging logs, confirm re-registration happens within ~5s of each machine coming back (instead of ~60s).
Confirm no zombie directory entries — `disposeTunnel` cleans pingTimer/pendingRequests on close, and the existing `unregister` path on the next `onclose` of the host-side socket fires correctly.

Out of scope (separate concerns to track)

The shorter-term mitigation "lower `RECONNECT_MAX_MS`" was considered but rejected — under genuine sustained outages we still want backoff to prevent storms. The 1001-reset path is the right shape: fast retry only when the relay clearly told us "I'm going away".
Cloud-DB lag (`is_online` staying stale after re-register) — tracked separately; this PR doesn't touch the `scheduleOnlineWrite` path.
Synthetic check on staging — nice-to-have but not blocking.

Summary by cubic

Gracefully drains relay tunnels on SIGINT/SIGTERM: stop new connections, send an in-band “drain” message, close with WS code 4001 so hosts reconnect immediately, and await directory cleanup before exit. Cuts deploy-time outages to ~1–5s, clears stale directory entries, and adds Fly lifecycle settings plus region-aware smoke tests for safer multi‑region deploys.

New Features
- Relay (apps/relay): on SIGINT/SIGTERM call server.close(), broadcast {type:"drain"}, close with code 4001, reject new registers while draining, await batched directory cleanup, then exit. Set kill_signal = "SIGINT" and kill_timeout = "10s" in fly.toml and fly.staging.toml.
- Host (packages/host-service): on {type:"drain"} or close code 4001, reset reconnectAttempts and reconnect immediately; cap RECONNECT_MAX_MS at 5s. Protocol adds drain message (packages/shared).
- Deploy/ops: add apps/relay/scripts/deploy-staging.sh; add region-aware smoke-test.sh and run it from both deploy scripts to verify /health per region.
Bug Fixes
- Relay directory: batch-clear self-owned entries via a single Redis EVAL on startup and during drain; await cleanup before exit to prevent stale routing and incorrect is_online.

^{Written for commit 75994e0. Summary will update on new commits. Review in cubic}

Summary by CodeRabbit

New Features
- Relay now gracefully drains active connections on shutdown to reduce abrupt disconnects.
- Relay startup clears stale directory entries from prior processes to avoid stale state.
- Tunnel clients detect relay-initiated shutdowns and reconnect immediately with faster retry behavior.
Chores
- Added deployment script and runtime lifecycle settings to enable rolling deploys with graceful shutdown.

capy-ai · 2026-05-15T04:10:30Z

Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews.

coderabbitai · 2026-05-15T04:10:33Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a TunnelManager WS close-code constant (4001) and an async drain() that closes tunnels with that code; reduces TunnelClient reconnect ceiling to 5s and treats WS close code 4001 as a drain signal that resets reconnectAttempts.

Changes

Tunnel drain & reconnect

Layer / File(s)	Summary
TunnelManager WS close-code and drain `apps/relay/src/tunnel.ts`	Adds `TunnelManager.WS_CLOSE_DRAIN = 4001` and `async drain(reason?)` which closes all tunnel WebSockets with code `4001` and waits ~250ms for frames to flush.
TunnelClient reconnect ceiling & onclose handling `packages/host-service/src/tunnel/tunnel-client.ts`	Reduces `RECONNECT_MAX_MS` from 30,000ms to 5,000ms and adds an `onclose` branch for close code `4001` that resets `reconnectAttempts` to `0` and logs a drain-related message before the existing reconnect scheduling path proceeds.

Sequence Diagram(s)

sequenceDiagram
  participant Signal
  participant RelayServer
  participant TunnelManager
  participant TunnelClient
  Signal->>RelayServer: SIGINT/SIGTERM
  RelayServer->>TunnelManager: drain("Server draining for deploy")
  TunnelManager->>TunnelClient: close WS (code=4001, reason)
  TunnelClient-->>TunnelClient: onclose(code=4001) -> reconnectAttempts = 0
  TunnelClient->>TunnelClient: schedule reconnect (backoff ≤ 5000ms)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

superset-sh/superset#4539: Modifies tunnel-client reconnect/close handling; related area of the repo (tunneling).

Poem

🐰 I closed the tunnels soft and neat,
Sent code four-thousand, a polite retreat,
Reset my hops to try again,
Shorter waits, a quicker mend,
Hoppity-hop — deploy complete.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: graceful tunnel drain on SIGTERM, which is the core feature introduced across multiple files in the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request description is comprehensive, well-structured, and covers all required template sections including a clear summary, related issues context, type of change, testing plan, and additional notes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat-relay-graceful-drain

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

greptile-apps · 2026-05-15T04:13:13Z

Greptile Summary

This PR adds a graceful tunnel-drain path to the relay so that rolling deploys no longer leave host-services in exponential-backoff limbo (~60 s) after a SIGTERM. The relay closes every open WebSocket with code 1001 before exiting, and the host-service now recognises code 1001 as a clean drain signal and resets its reconnect counter so it retries within ~1 s.

apps/relay/src/tunnel.ts + index.ts: New TunnelManager.drain() closes all tunnels with WS code 1001 and waits 250 ms for frames to flush; wired to SIGTERM in index.ts behind a draining guard.
packages/host-service/src/tunnel/tunnel-client.ts: onclose resets reconnectAttempts = 0 when event.code === 1001, bypassing saturated backoff for what the client treats as a clean relay restart.

Confidence Score: 3/5

The deploy-drain path works correctly in the happy path, but the close-code collision with the existing ping-timeout path means the host-service cannot distinguish the two scenarios and may trigger reconnect storms under load.

The relay's existing ping-timeout handler already closes host sockets with code 1001. The new client-side logic resets backoff for any code-1001 close, so a wave of ping timeouts during relay overload would produce the same thundering-herd that the exponential backoff was designed to suppress.

apps/relay/src/tunnel.ts and packages/host-service/src/tunnel/tunnel-client.ts need a consistent, distinct close code for the deploy-drain signal vs. the ping-timeout path.

Important Files Changed

Filename	Overview
apps/relay/src/index.ts	Adds SIGTERM handler with `draining` guard that calls `tunnelManager.drain()` then exits; HTTP server is not stopped before the 250 ms flush window, leaving a brief reconnect-then-immediate-exit gap.
apps/relay/src/tunnel.ts	New `drain()` method closes all tunnels with close code 1001; the same code 1001 is already used in the existing ping-timeout path, creating a semantic collision that the host-service cannot resolve.
packages/host-service/src/tunnel/tunnel-client.ts	Resets `reconnectAttempts` to 0 on close code 1001; because ping-timeout closures also use 1001, this fast-reconnect path can be triggered unintentionally under load, potentially causing reconnect storms.

Sequence Diagram

sequenceDiagram
    participant Fly as Fly Platform
    participant Relay as Relay (index.ts)
    participant TM as TunnelManager
    participant Host as Host Service

    Fly->>Relay: SIGTERM
    Relay->>Relay: "draining = true"
    Relay->>TM: drain()
    loop each open tunnel
        TM->>Host: ws.close(1001, "Server draining for deploy")
    end
    Note over TM: await 250ms (flush close frames)
    Host->>Host: "onclose(event.code = 1001)"
    Host->>Host: "reconnectAttempts = 0"
    TM-->>Relay: drain() resolved
    Relay->>Relay: process.exit(0)
    Note over Host: scheduleReconnect() ~500-1000ms delay
    Host->>Relay: reconnect (new relay instance)

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
packages/host-service/src/tunnel/tunnel-client.ts:133-141
**Close code 1001 collision with ping-timeout path**

The relay already closes tunnels with code 1001 in the ping-timeout path (`tunnel.ts:112`: `ws.close(1001, "Ping timeout")`). Since the new host-service handler treats *any* 1001 as a clean deploy drain and resets `reconnectAttempts = 0`, a mass ping-timeout event (e.g., relay under load, latency spike causing 3+ missed pings across many hosts) would cause all affected hosts to fire reconnects within ~500–1000 ms simultaneously — exactly the thundering-herd the backoff is designed to prevent. The two close-code paths need to be disambiguated: use an application-level code (e.g., 4001) for the deploy-drain signal and keep 1001 only for RFC-conformant "going away" closures, or change the ping-timeout path to use a distinct code so the host can tell the two apart.

### Issue 2 of 2
apps/relay/src/index.ts:55-65
**HTTP server keeps accepting connections during drain window**

After SIGTERM fires, `drain()` closes all existing tunnel WebSockets but leaves the HTTP server running for the full 250 ms flush window. Any host that reconnects very quickly (possible given the 1001 reset-backoff change on the client side) can successfully open a new tunnel that immediately gets torn down when `process.exit(0)` fires. Calling `server.close()` before awaiting `tunnelManager.drain()` would stop the listener from accepting new connections while still allowing in-flight sockets to drain.

_{Reviews (1): Last reviewed commit: "feat(relay): graceful tunnel drain on SI..." | Re-trigger Greptile}

greptile-apps · 2026-05-15T04:13:16Z

+					// Clean close from the relay (1001 = "Going Away", typically a
+					// deploy drain). Reset backoff so we don't sit in 30s retry
+					// mode while the relay is restarting in <2s.
+					if (event.code === 1001) {
+						this.reconnectAttempts = 0;
+						console.log(
+							"[host-service:tunnel] relay draining; reconnecting immediately",
+						);
+					}


Close code 1001 collision with ping-timeout path

The relay already closes tunnels with code 1001 in the ping-timeout path (tunnel.ts:112: ws.close(1001, "Ping timeout")). Since the new host-service handler treats any 1001 as a clean deploy drain and resets reconnectAttempts = 0, a mass ping-timeout event (e.g., relay under load, latency spike causing 3+ missed pings across many hosts) would cause all affected hosts to fire reconnects within ~500–1000 ms simultaneously — exactly the thundering-herd the backoff is designed to prevent. The two close-code paths need to be disambiguated: use an application-level code (e.g., 4001) for the deploy-drain signal and keep 1001 only for RFC-conformant "going away" closures, or change the ping-timeout path to use a distinct code so the host can tell the two apart.

Prompt To Fix With AI

This is a comment left during a code review. Path: packages/host-service/src/tunnel/tunnel-client.ts Line: 133-141 Comment: **Close code 1001 collision with ping-timeout path** The relay already closes tunnels with code 1001 in the ping-timeout path (`tunnel.ts:112`: `ws.close(1001, "Ping timeout")`). Since the new host-service handler treats *any* 1001 as a clean deploy drain and resets `reconnectAttempts = 0`, a mass ping-timeout event (e.g., relay under load, latency spike causing 3+ missed pings across many hosts) would cause all affected hosts to fire reconnects within ~500–1000 ms simultaneously — exactly the thundering-herd the backoff is designed to prevent. The two close-code paths need to be disambiguated: use an application-level code (e.g., 4001) for the deploy-drain signal and keep 1001 only for RFC-conformant "going away" closures, or change the ping-timeout path to use a distinct code so the host can tell the two apart. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-15T04:13:17Z

+process.on("SIGTERM", async () => {
+	if (draining) return;
+	draining = true;
+	console.log("[relay] SIGTERM received, draining tunnels");
+	try {
+		await tunnelManager.drain();
+	} catch (err) {
+		console.error("[relay] drain failed", err);
+	}
+	process.exit(0);
+});


HTTP server keeps accepting connections during drain window

After SIGTERM fires, drain() closes all existing tunnel WebSockets but leaves the HTTP server running for the full 250 ms flush window. Any host that reconnects very quickly (possible given the 1001 reset-backoff change on the client side) can successfully open a new tunnel that immediately gets torn down when process.exit(0) fires. Calling server.close() before awaiting tunnelManager.drain() would stop the listener from accepting new connections while still allowing in-flight sockets to drain.

Prompt To Fix With AI

This is a comment left during a code review. Path: apps/relay/src/index.ts Line: 55-65 Comment: **HTTP server keeps accepting connections during drain window** After SIGTERM fires, `drain()` closes all existing tunnel WebSockets but leaves the HTTP server running for the full 250 ms flush window. Any host that reconnects very quickly (possible given the 1001 reset-backoff change on the client side) can successfully open a new tunnel that immediately gets torn down when `process.exit(0)` fires. Calling `server.close()` before awaiting `tunnelManager.drain()` would stop the listener from accepting new connections while still allowing in-flight sockets to drain. How can I resolve this? If you propose a fix, please make it concise.

github-actions · 2026-05-15T04:16:53Z

🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

✅ Neon database branch

Thank you for your contribution! 🎉

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/relay/src/directory.ts`:
- Around line 144-149: The startup cleanup currently loads OWNER_KEY into
allOwners then calls unregister(hostId, region, machineId) serially, causing one
Redis RTT per host; change this to batch the cleanup: either maintain an
owner-indexed set (e.g., OWNER_INDEX_<owner>) and retrieve that set to
pipeline/publish a bounded cleanup, or implement a single Redis Lua script that
accepts myOwner, region and machineId and removes/unregisters all matching
hostIds server-side. Update the logic around OWNER_KEY, allOwners, and
unregister so you collect hostIds locally and perform a pipelined/multi or
single EVAL that performs the equivalent unregister operations atomically,
avoiding awaiting unregister in a loop. Ensure region and machineId parameters
are passed into the batch script/pipeline so the same validation/side-effects
occur as the original unregister function.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aed582de-fc34-4f99-b147-02dd38fb8abb

📥 Commits

Reviewing files that changed from the base of the PR and between ef63c7f and 65dfa44.

📒 Files selected for processing (2)

apps/relay/src/directory.ts
apps/relay/src/index.ts

cubic-dev-ai

2 issues found across 14 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/relay/scripts/smoke-test.sh">

<violation number="1" location="apps/relay/scripts/smoke-test.sh:27">
P2: The smoke test never verifies HTTP status 200, so non-200 `/health` responses can be treated as success if the body still matches the region.</violation>
</file>

<file name=".github/workflows/release-cli.yml">

<violation number="1" location=".github/workflows/release-cli.yml:68">
P1: The prerelease guard is overly broad and causes this step to be skipped for all `cli-v*` tags, including stable releases.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Re-trigger cubic}

cubic-dev-ai · 2026-05-15T15:02:36Z

+fail=0
+for region in "${REGIONS[@]}"; do
+  printf "  %-4s " "$region"
+  body=$(curl -sS --max-time 8 -H "fly-prefer-region: $region" "https://${HOSTNAME}/health" 2>&1) || {


P2: The smoke test never verifies HTTP status 200, so non-200 /health responses can be treated as success if the body still matches the region.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At apps/relay/scripts/smoke-test.sh, line 27: <comment>The smoke test never verifies HTTP status 200, so non-200 `/health` responses can be treated as success if the body still matches the region.</comment> <file context> @@ -0,0 +1,45 @@ +fail=0 +for region in "${REGIONS[@]}"; do + printf " %-4s " "$region" + body=$(curl -sS --max-time 8 -H "fly-prefer-region: $region" "https://${HOSTNAME}/health" 2>&1) || { + printf " ✗ curl failed: %s\n" "$body" + fail=$((fail + 1)) </file context>

cubic-dev-ai

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name=".github/workflows/release-cli.yml">

<violation number="1" location=".github/workflows/release-cli.yml:70">
P2: The release gate is too narrow: it excludes only `-alpha/-beta/-rc`, so other prerelease tags can still publish `cli-latest`.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Re-trigger cubic}

Discovered during multi-region rolling-deploy game-day: every relay machine restart left host tunnels disconnected for ~60s because: 1. Fly sends SIGTERM, relay process exits without closing WS, host sees a TCP-RST'd socket as a generic error. 2. Host's exponential reconnect backoff (1s → 30s cap) climbs to the 30s/attempt tail after ~6 failed attempts while the relay is down. 3. By the time the relay is back up, the host is in deep backoff and takes another 15-30s to retry. Result: ~60s user-visible outage per rolling-restart cycle, multiplied across regions during multi-region deploys. Fix is two-sided: - Relay (apps/relay/src/tunnel.ts + index.ts): add `drain()` on TunnelManager that walks every open WS and closes it with code 1001 ("Going Away"), waits 250ms for frames to flush, then exits. Wired to SIGTERM in index.ts so Fly's pre-stop signal does the right thing. - Host (packages/host-service/src/tunnel/tunnel-client.ts): when the WS onClose event reports code 1001, reset reconnectAttempts to 0 so the next reconnect fires at the 0.5-1s base delay instead of the saturated 30s/attempt tail. Combined, deploy outages should drop from ~60s to ~2-5s per affected host. Pre-requisite for safe multi-region prod rollout — multi-region amortizes the impact across regions but doesn't fix the per-host window without graceful drain. Verified via game-day on superset-relay-staging (sjc, iad, fra).

Verified during game-day: Fly's init sends SIGINT (not SIGTERM) to the main process during rolling deploys, AND its default ~5s grace period kills the process before the JS handler runs. Two fixes: - Listen on both SIGINT and SIGTERM in apps/relay/src/index.ts so the drain handler intercepts Fly's actual signal (also covers `fly machine stop` and local Ctrl-C). - Add `kill_signal = "SIGINT"` + `kill_timeout = "10s"` to both fly.toml and fly.staging.toml so the handler actually has time to drain. Verified on staging deploy 4: SJC SIGINT now fires "[relay] SIGINT received, draining tunnels", unregisters open WSs cleanly, then exits with code 0. Host re-registration happens ~12s after WS close (down from 64s pre-fix).

Game-day #14 exposed a ~2min user-visible glitch when a relay machine gets killed and auto-restarted by Fly. The killed process never gets to unregister its tunnels (SIGKILL leaves no opportunity), so the directory keeps entries pointing at "{region}:{machineId}". Fly auto-restarts the same machineId, but the new process has no in-memory tunnels — meanwhile, cross-region requests still fly-replay traffic at that stale machineId and the reconciler reports `is_online=yes` based on the stale directory. Fix: on startup, scan the OWNER hash for entries matching our own encoded owner string and delete them via the existing compare-and- delete `unregister` helper. Runs synchronously before serve() via top-level await — the relay won't accept connections until cleanup completes, so we can't race a fresh registration. Verified on staging: kill SJC machine → 6s VM cycle → "cleared 2 stale directory entries on startup" log → sandbox re-registered 13s after kill (down from ~2min recovery without this fix).

Game-day #15 (WS storm) showed daemons stranded for minutes after consecutive disconnects. With the previous 30s cap, retries land 15-30s apart in the saturated tail of the exponential curve, and the post-recovery retry frequently misses the relay-back-online window. 5s cap means slightly more retry traffic under prolonged outages but ensures hosts reconnect promptly when the relay is back. Pairs with the 1001-reset (clean close → reset attempts) to keep deploy-time disconnects to a few seconds.

…h cleanup Four fixes from PR review on #4594: 1. Drain uses app-defined close code 4001 instead of 1001. The ping-timeout path in tunnel.ts already uses 1001, and the host-side reset-on-clean-close would have fired for those too — risking a thundering-herd reconnect storm under a mass ping-timeout event. Now the host only resets on 4001 (deliberate deploy drain), and 1001 keeps its RFC-conformant "going away" semantics elsewhere. (Greptile P1) 2. Stop accepting new TCP connections during drain. Call server.close() before tunnelManager.drain() so a host that reconnects during the 250ms flush window doesn't open a tunnel that immediately dies on process.exit. (Greptile P2) 3. Startup directory cleanup batched into a single Lua EVAL. Previous loop did one Redis round-trip per stale entry — at prod scale (~500 tunnels per machine) that would have eaten 25s of startup time on typical Upstash latency. Single EVAL keeps cleanup at one RTT regardless of directory size. (CodeRabbit Major) 4. Update host-side comment from "30s retry mode" to "5s ceiling" (matches the MAX_MS=5s change earlier in this PR). (CodeRabbit Minor)

Adds smoke-test.sh that pings /health on every configured region via `fly-prefer-region` and verifies the response body's `region` field matches the requested region. Wired into both deploy.sh and deploy-staging.sh so deploys halt if any region failed to come up. Catches partial-deploy failures, missing regions, and machines that booted but aren't actually serving — exactly the multi-region failure modes we couldn't detect with the existing single-region setup. Verified manually against superset-relay-staging.fly.dev (sjc, iad, fra) — all 3 returned correct region in /health response.

stage-review · 2026-05-16T19:03:46Z

Ready to review this PR? Stage has broken it down into 6 individual chapters for you:

	Title
1	Define drain message in tunnel protocol
2	Implement stale entry cleanup in relay directory
3	Implement tunnel draining in TunnelManager
4	Wire drain logic to relay lifecycle
5	Handle drain signals in host-service
6	Configure Fly lifecycle and deployment scripts

_{Chapters generated by Stage for commit 75994e0 on May 16, 2026 7:26pm UTC.}

Game-day testing on multi-region staging revealed the WS close frame doesn't reliably reach the host within Fly's kill_timeout window. The host's TCP socket stayed ESTABLISHED with no onclose event firing until the 75s inbound-silence watchdog tripped — i.e. ~110s of user-visible outage on every deploy. Send a JSON {type:"drain"} message on the WS data channel first, wait 500ms for it to flush, then close. The message channel already carries ping/pong every 30s, so we know it traverses the path cleanly. Host's onmessage handler resets backoff and triggers an immediate reconnect. The subsequent close() is belt-and-suspenders. Verified end-to-end on staging: host log shows "[host-service:tunnel] relay drain notice received (Server draining for deploy); reconnecting immediately" followed by reconnect within ~900ms.

…ile draining drain() previously relied on WS onClose → unregister → fire-and-forget directory.unregister for directory cleanup. process.exit(0) runs right after drain() returns, so those async Redis writes routinely didn't land before the process died — leaving the dead machine's directory entries live for ~90s (TTL), during which other relay machines fly-replay traffic to the corpse. Invisible single-region; a real per-deploy misrouting window multi-region. - drain() now awaits an explicit clearDirectory() callback (clearStaleEntriesForMachine — the batch Lua script already used on startup) before the process exits. - A draining flag rejects new tunnel registrations during the drain window, in register() and the /tunnel onOpen handler, re-checked after the async auth calls so a host can't slip in while draining flips. - drain() routes each tunnel through disposeTunnel and removes it from the map, so the trailing onClose callbacks are no-ops rather than racing the explicit directory clear.

saddlepaddle temporarily deployed to preview May 15, 2026 04:10 — with GitHub Actions Inactive

saddlepaddle temporarily deployed to preview May 15, 2026 04:11 — with GitHub Actions Inactive

greptile-apps Bot reviewed May 15, 2026

View reviewed changes

saddlepaddle had a problem deploying to preview May 15, 2026 04:29 — with GitHub Actions Error

saddlepaddle temporarily deployed to preview May 15, 2026 04:31 — with GitHub Actions Inactive

saddlepaddle temporarily deployed to preview May 15, 2026 04:43 — with GitHub Actions Inactive

saddlepaddle temporarily deployed to preview May 15, 2026 04:44 — with GitHub Actions Inactive

saddlepaddle had a problem deploying to preview May 15, 2026 04:44 — with GitHub Actions Error

saddlepaddle temporarily deployed to preview May 15, 2026 04:44 — with GitHub Actions Inactive

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

Comment thread apps/relay/src/directory.ts Outdated

saddlepaddle temporarily deployed to preview May 15, 2026 04:49 — with GitHub Actions Inactive

saddlepaddle temporarily deployed to preview May 15, 2026 04:50 — with GitHub Actions Inactive

saddlepaddle temporarily deployed to preview May 15, 2026 15:01 — with GitHub Actions Inactive

cubic-dev-ai Bot reviewed May 15, 2026

View reviewed changes

Comment thread .github/workflows/release-cli.yml Outdated

saddlepaddle added 7 commits May 16, 2026 12:00

chore(relay): add deploy-staging.sh for parametrized staging deploys

96bae9b

saddlepaddle force-pushed the feat-relay-graceful-drain branch from 062be1c to ac4f410 Compare May 16, 2026 19:03

saddlepaddle temporarily deployed to preview May 16, 2026 19:03 — with GitHub Actions Inactive

saddlepaddle had a problem deploying to preview May 16, 2026 19:04 — with GitHub Actions Error

saddlepaddle force-pushed the feat-relay-graceful-drain branch from ac4f410 to 3aee590 Compare May 16, 2026 19:06

saddlepaddle temporarily deployed to preview May 16, 2026 19:07 — with GitHub Actions Inactive

This was referenced May 16, 2026

chore(cli): bump version to 0.2.19 #4642

Merged

chore(relay): prod deploy script — 6 regions, scale before deploy #4678

Merged

Kitenite mentioned this pull request May 18, 2026

docs: weekly changelog — 2026-05-18 #4695

Merged

5 tasks

coderabbitai Bot mentioned this pull request May 26, 2026

fix(automation): create workspace + launch agent in a single relay call (SUPER-783) #4936

Draft

4 tasks

Conversation

saddlepaddle commented May 15, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix (two-sided)

Test plan

Out of scope (separate concerns to track)

Summary by cubic

Summary by CodeRabbit

Uh oh!

capy-ai Bot commented May 15, 2026

Uh oh!

coderabbitai Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

greptile-apps Bot commented May 15, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🧹 Preview Cleanup Complete

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubic-dev-ai Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

stage-review Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

saddlepaddle commented May 15, 2026 •

edited by cubic-dev-ai Bot

Loading

coderabbitai Bot commented May 15, 2026 •

edited

Loading

github-actions Bot commented May 15, 2026 •

edited

Loading

stage-review Bot commented May 16, 2026 •

edited

Loading