padding fix by AviPeltz · Pull Request #15 · superset-sh/superset

AviPeltz · 2025-11-03T21:53:10Z

Description

Related Issues

Type of Change

Testing

Screenshots (if applicable)

Additional Notes

vercel · 2025-11-03T21:53:15Z

@AviPeltz is attempting to deploy a commit to the Personal team on Vercel, but is not a member of this team. To resolve this issue, you can:

Make your repository public. Collaboration is free for open source and public repositories.
Upgrade to pro and add @AviPeltz as a member. A Pro subscription is required to access Vercel's collaborative features.
- If you're the owner of the team, click here to upgrade and add @AviPeltz as a member.
- If you're the user who initiated this build request, click here to request access.
- If you're already a member of the Personal team, make sure that your Vercel account is connected to your GitHub account.

To read more about collaboration on Vercel, click here.

Game-day #15 (WS storm) showed daemons stranded for minutes after consecutive disconnects. With the previous 30s cap, retries land 15-30s apart in the saturated tail of the exponential curve, and the post-recovery retry frequently misses the relay-back-online window. 5s cap means slightly more retry traffic under prolonged outages but ensures hosts reconnect promptly when the relay is back. Pairs with the 1001-reset (clean close → reset attempts) to keep deploy-time disconnects to a few seconds.

* feat(relay): graceful tunnel drain on SIGTERM Discovered during multi-region rolling-deploy game-day: every relay machine restart left host tunnels disconnected for ~60s because: 1. Fly sends SIGTERM, relay process exits without closing WS, host sees a TCP-RST'd socket as a generic error. 2. Host's exponential reconnect backoff (1s → 30s cap) climbs to the 30s/attempt tail after ~6 failed attempts while the relay is down. 3. By the time the relay is back up, the host is in deep backoff and takes another 15-30s to retry. Result: ~60s user-visible outage per rolling-restart cycle, multiplied across regions during multi-region deploys. Fix is two-sided: - Relay (apps/relay/src/tunnel.ts + index.ts): add `drain()` on TunnelManager that walks every open WS and closes it with code 1001 ("Going Away"), waits 250ms for frames to flush, then exits. Wired to SIGTERM in index.ts so Fly's pre-stop signal does the right thing. - Host (packages/host-service/src/tunnel/tunnel-client.ts): when the WS onClose event reports code 1001, reset reconnectAttempts to 0 so the next reconnect fires at the 0.5-1s base delay instead of the saturated 30s/attempt tail. Combined, deploy outages should drop from ~60s to ~2-5s per affected host. Pre-requisite for safe multi-region prod rollout — multi-region amortizes the impact across regions but doesn't fix the per-host window without graceful drain. Verified via game-day on superset-relay-staging (sjc, iad, fra). * fix(relay): use SIGINT for drain + kill_timeout=10s on Fly Verified during game-day: Fly's init sends SIGINT (not SIGTERM) to the main process during rolling deploys, AND its default ~5s grace period kills the process before the JS handler runs. Two fixes: - Listen on both SIGINT and SIGTERM in apps/relay/src/index.ts so the drain handler intercepts Fly's actual signal (also covers `fly machine stop` and local Ctrl-C). - Add `kill_signal = "SIGINT"` + `kill_timeout = "10s"` to both fly.toml and fly.staging.toml so the handler actually has time to drain. Verified on staging deploy 4: SJC SIGINT now fires "[relay] SIGINT received, draining tunnels", unregisters open WSs cleanly, then exits with code 0. Host re-registration happens ~12s after WS close (down from 64s pre-fix). * chore(relay): add deploy-staging.sh for parametrized staging deploys * fix(relay): clear self-owned directory entries on startup Game-day #14 exposed a ~2min user-visible glitch when a relay machine gets killed and auto-restarted by Fly. The killed process never gets to unregister its tunnels (SIGKILL leaves no opportunity), so the directory keeps entries pointing at "{region}:{machineId}". Fly auto-restarts the same machineId, but the new process has no in-memory tunnels — meanwhile, cross-region requests still fly-replay traffic at that stale machineId and the reconciler reports `is_online=yes` based on the stale directory. Fix: on startup, scan the OWNER hash for entries matching our own encoded owner string and delete them via the existing compare-and- delete `unregister` helper. Runs synchronously before serve() via top-level await — the relay won't accept connections until cleanup completes, so we can't race a fresh registration. Verified on staging: kill SJC machine → 6s VM cycle → "cleared 2 stale directory entries on startup" log → sandbox re-registered 13s after kill (down from ~2min recovery without this fix). * fix(host-service): lower reconnect MAX_MS from 30s to 5s Game-day #15 (WS storm) showed daemons stranded for minutes after consecutive disconnects. With the previous 30s cap, retries land 15-30s apart in the saturated tail of the exponential curve, and the post-recovery retry frequently misses the relay-back-online window. 5s cap means slightly more retry traffic under prolonged outages but ensures hosts reconnect promptly when the relay is back. Pairs with the 1001-reset (clean close → reset attempts) to keep deploy-time disconnects to a few seconds. * fix(relay): address PR review — close-code, accept-during-drain, batch cleanup Four fixes from PR review on #4594: 1. Drain uses app-defined close code 4001 instead of 1001. The ping-timeout path in tunnel.ts already uses 1001, and the host-side reset-on-clean-close would have fired for those too — risking a thundering-herd reconnect storm under a mass ping-timeout event. Now the host only resets on 4001 (deliberate deploy drain), and 1001 keeps its RFC-conformant "going away" semantics elsewhere. (Greptile P1) 2. Stop accepting new TCP connections during drain. Call server.close() before tunnelManager.drain() so a host that reconnects during the 250ms flush window doesn't open a tunnel that immediately dies on process.exit. (Greptile P2) 3. Startup directory cleanup batched into a single Lua EVAL. Previous loop did one Redis round-trip per stale entry — at prod scale (~500 tunnels per machine) that would have eaten 25s of startup time on typical Upstash latency. Single EVAL keeps cleanup at one RTT regardless of directory size. (CodeRabbit Major) 4. Update host-side comment from "30s retry mode" to "5s ceiling" (matches the MAX_MS=5s change earlier in this PR). (CodeRabbit Minor) * chore(relay): post-deploy smoke test across all regions Adds smoke-test.sh that pings /health on every configured region via `fly-prefer-region` and verifies the response body's `region` field matches the requested region. Wired into both deploy.sh and deploy-staging.sh so deploys halt if any region failed to come up. Catches partial-deploy failures, missing regions, and machines that booted but aren't actually serving — exactly the multi-region failure modes we couldn't detect with the existing single-region setup. Verified manually against superset-relay-staging.fly.dev (sjc, iad, fra) — all 3 returned correct region in /health response. * fix(relay): in-band drain signal before WS close on shutdown Game-day testing on multi-region staging revealed the WS close frame doesn't reliably reach the host within Fly's kill_timeout window. The host's TCP socket stayed ESTABLISHED with no onclose event firing until the 75s inbound-silence watchdog tripped — i.e. ~110s of user-visible outage on every deploy. Send a JSON {type:"drain"} message on the WS data channel first, wait 500ms for it to flush, then close. The message channel already carries ping/pong every 30s, so we know it traverses the path cleanly. Host's onmessage handler resets backoff and triggers an immediate reconnect. The subsequent close() is belt-and-suspenders. Verified end-to-end on staging: host log shows "[host-service:tunnel] relay drain notice received (Server draining for deploy); reconnecting immediately" followed by reconnect within ~900ms. * fix(relay): await directory cleanup during drain; reject registers while draining drain() previously relied on WS onClose → unregister → fire-and-forget directory.unregister for directory cleanup. process.exit(0) runs right after drain() returns, so those async Redis writes routinely didn't land before the process died — leaving the dead machine's directory entries live for ~90s (TTL), during which other relay machines fly-replay traffic to the corpse. Invisible single-region; a real per-deploy misrouting window multi-region. - drain() now awaits an explicit clearDirectory() callback (clearStaleEntriesForMachine — the batch Lua script already used on startup) before the process exits. - A draining flag rejects new tunnel registrations during the drain window, in register() and the /tunnel onOpen handler, re-checked after the async auth calls so a host can't slip in while draining flips. - drain() routes each tunnel through disposeTunnel and removes it from the map, so the trailing onClose callbacks are no-ops rather than racing the explicit directory clear.

padding fix

5b70785

Kitenite merged commit fbe18a3 into main Nov 3, 2025
2 of 6 checks passed

Kitenite deleted the padding-fix branch November 3, 2025 21:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

padding fix#15

padding fix#15
Kitenite merged 1 commit into
mainfrom
padding-fix

AviPeltz commented Nov 3, 2025

Uh oh!

vercel Bot commented Nov 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AviPeltz commented Nov 3, 2025

Description

Related Issues

Type of Change

Testing

Screenshots (if applicable)

Additional Notes

Uh oh!

vercel Bot commented Nov 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants