padding fix#15
Merged
Merged
Conversation
|
@AviPeltz is attempting to deploy a commit to the Personal team on Vercel, but is not a member of this team. To resolve this issue, you can:
To read more about collaboration on Vercel, click here. |
saddlepaddle
added a commit
that referenced
this pull request
May 15, 2026
Game-day #15 (WS storm) showed daemons stranded for minutes after consecutive disconnects. With the previous 30s cap, retries land 15-30s apart in the saturated tail of the exponential curve, and the post-recovery retry frequently misses the relay-back-online window. 5s cap means slightly more retry traffic under prolonged outages but ensures hosts reconnect promptly when the relay is back. Pairs with the 1001-reset (clean close → reset attempts) to keep deploy-time disconnects to a few seconds.
saddlepaddle
added a commit
that referenced
this pull request
May 16, 2026
Game-day #15 (WS storm) showed daemons stranded for minutes after consecutive disconnects. With the previous 30s cap, retries land 15-30s apart in the saturated tail of the exponential curve, and the post-recovery retry frequently misses the relay-back-online window. 5s cap means slightly more retry traffic under prolonged outages but ensures hosts reconnect promptly when the relay is back. Pairs with the 1001-reset (clean close → reset attempts) to keep deploy-time disconnects to a few seconds.
saddlepaddle
added a commit
that referenced
this pull request
May 16, 2026
* feat(relay): graceful tunnel drain on SIGTERM
Discovered during multi-region rolling-deploy game-day: every relay
machine restart left host tunnels disconnected for ~60s because:
1. Fly sends SIGTERM, relay process exits without closing WS, host
sees a TCP-RST'd socket as a generic error.
2. Host's exponential reconnect backoff (1s → 30s cap) climbs to the
30s/attempt tail after ~6 failed attempts while the relay is down.
3. By the time the relay is back up, the host is in deep backoff and
takes another 15-30s to retry.
Result: ~60s user-visible outage per rolling-restart cycle, multiplied
across regions during multi-region deploys.
Fix is two-sided:
- Relay (apps/relay/src/tunnel.ts + index.ts): add `drain()` on
TunnelManager that walks every open WS and closes it with code 1001
("Going Away"), waits 250ms for frames to flush, then exits. Wired
to SIGTERM in index.ts so Fly's pre-stop signal does the right thing.
- Host (packages/host-service/src/tunnel/tunnel-client.ts): when the
WS onClose event reports code 1001, reset reconnectAttempts to 0 so
the next reconnect fires at the 0.5-1s base delay instead of the
saturated 30s/attempt tail.
Combined, deploy outages should drop from ~60s to ~2-5s per affected
host. Pre-requisite for safe multi-region prod rollout — multi-region
amortizes the impact across regions but doesn't fix the per-host
window without graceful drain.
Verified via game-day on superset-relay-staging (sjc, iad, fra).
* fix(relay): use SIGINT for drain + kill_timeout=10s on Fly
Verified during game-day: Fly's init sends SIGINT (not SIGTERM) to the
main process during rolling deploys, AND its default ~5s grace period
kills the process before the JS handler runs. Two fixes:
- Listen on both SIGINT and SIGTERM in apps/relay/src/index.ts so the
drain handler intercepts Fly's actual signal (also covers
`fly machine stop` and local Ctrl-C).
- Add `kill_signal = "SIGINT"` + `kill_timeout = "10s"` to both fly.toml
and fly.staging.toml so the handler actually has time to drain.
Verified on staging deploy 4: SJC SIGINT now fires "[relay] SIGINT
received, draining tunnels", unregisters open WSs cleanly, then exits
with code 0. Host re-registration happens ~12s after WS close (down
from 64s pre-fix).
* chore(relay): add deploy-staging.sh for parametrized staging deploys
* fix(relay): clear self-owned directory entries on startup
Game-day #14 exposed a ~2min user-visible glitch when a relay machine
gets killed and auto-restarted by Fly. The killed process never gets
to unregister its tunnels (SIGKILL leaves no opportunity), so the
directory keeps entries pointing at "{region}:{machineId}". Fly
auto-restarts the same machineId, but the new process has no in-memory
tunnels — meanwhile, cross-region requests still fly-replay traffic
at that stale machineId and the reconciler reports `is_online=yes`
based on the stale directory.
Fix: on startup, scan the OWNER hash for entries matching our own
encoded owner string and delete them via the existing compare-and-
delete `unregister` helper. Runs synchronously before serve() via
top-level await — the relay won't accept connections until cleanup
completes, so we can't race a fresh registration.
Verified on staging: kill SJC machine → 6s VM cycle → "cleared 2
stale directory entries on startup" log → sandbox re-registered 13s
after kill (down from ~2min recovery without this fix).
* fix(host-service): lower reconnect MAX_MS from 30s to 5s
Game-day #15 (WS storm) showed daemons stranded for minutes after
consecutive disconnects. With the previous 30s cap, retries land 15-30s
apart in the saturated tail of the exponential curve, and the
post-recovery retry frequently misses the relay-back-online window.
5s cap means slightly more retry traffic under prolonged outages but
ensures hosts reconnect promptly when the relay is back. Pairs with
the 1001-reset (clean close → reset attempts) to keep deploy-time
disconnects to a few seconds.
* fix(relay): address PR review — close-code, accept-during-drain, batch cleanup
Four fixes from PR review on #4594:
1. Drain uses app-defined close code 4001 instead of 1001. The
ping-timeout path in tunnel.ts already uses 1001, and the host-side
reset-on-clean-close would have fired for those too — risking a
thundering-herd reconnect storm under a mass ping-timeout event.
Now the host only resets on 4001 (deliberate deploy drain), and
1001 keeps its RFC-conformant "going away" semantics elsewhere.
(Greptile P1)
2. Stop accepting new TCP connections during drain. Call server.close()
before tunnelManager.drain() so a host that reconnects during the
250ms flush window doesn't open a tunnel that immediately dies on
process.exit. (Greptile P2)
3. Startup directory cleanup batched into a single Lua EVAL. Previous
loop did one Redis round-trip per stale entry — at prod scale (~500
tunnels per machine) that would have eaten 25s of startup time on
typical Upstash latency. Single EVAL keeps cleanup at one RTT
regardless of directory size. (CodeRabbit Major)
4. Update host-side comment from "30s retry mode" to "5s ceiling"
(matches the MAX_MS=5s change earlier in this PR). (CodeRabbit
Minor)
* chore(relay): post-deploy smoke test across all regions
Adds smoke-test.sh that pings /health on every configured region via
`fly-prefer-region` and verifies the response body's `region` field
matches the requested region. Wired into both deploy.sh and
deploy-staging.sh so deploys halt if any region failed to come up.
Catches partial-deploy failures, missing regions, and machines that
booted but aren't actually serving — exactly the multi-region failure
modes we couldn't detect with the existing single-region setup.
Verified manually against superset-relay-staging.fly.dev (sjc, iad, fra)
— all 3 returned correct region in /health response.
* fix(relay): in-band drain signal before WS close on shutdown
Game-day testing on multi-region staging revealed the WS close frame
doesn't reliably reach the host within Fly's kill_timeout window. The
host's TCP socket stayed ESTABLISHED with no onclose event firing until
the 75s inbound-silence watchdog tripped — i.e. ~110s of user-visible
outage on every deploy.
Send a JSON {type:"drain"} message on the WS data channel first, wait
500ms for it to flush, then close. The message channel already carries
ping/pong every 30s, so we know it traverses the path cleanly. Host's
onmessage handler resets backoff and triggers an immediate reconnect.
The subsequent close() is belt-and-suspenders.
Verified end-to-end on staging: host log shows
"[host-service:tunnel] relay drain notice received (Server draining
for deploy); reconnecting immediately"
followed by reconnect within ~900ms.
* fix(relay): await directory cleanup during drain; reject registers while draining
drain() previously relied on WS onClose → unregister → fire-and-forget
directory.unregister for directory cleanup. process.exit(0) runs right
after drain() returns, so those async Redis writes routinely didn't land
before the process died — leaving the dead machine's directory entries
live for ~90s (TTL), during which other relay machines fly-replay
traffic to the corpse. Invisible single-region; a real per-deploy
misrouting window multi-region.
- drain() now awaits an explicit clearDirectory() callback
(clearStaleEntriesForMachine — the batch Lua script already used on
startup) before the process exits.
- A draining flag rejects new tunnel registrations during the drain
window, in register() and the /tunnel onOpen handler, re-checked after
the async auth calls so a host can't slip in while draining flips.
- drain() routes each tunnel through disposeTunnel and removes it from
the map, so the trailing onClose callbacks are no-ops rather than
racing the explicit directory clear.
sazabi Bot
pushed a commit
that referenced
this pull request
May 20, 2026
* feat(relay): graceful tunnel drain on SIGTERM
Discovered during multi-region rolling-deploy game-day: every relay
machine restart left host tunnels disconnected for ~60s because:
1. Fly sends SIGTERM, relay process exits without closing WS, host
sees a TCP-RST'd socket as a generic error.
2. Host's exponential reconnect backoff (1s → 30s cap) climbs to the
30s/attempt tail after ~6 failed attempts while the relay is down.
3. By the time the relay is back up, the host is in deep backoff and
takes another 15-30s to retry.
Result: ~60s user-visible outage per rolling-restart cycle, multiplied
across regions during multi-region deploys.
Fix is two-sided:
- Relay (apps/relay/src/tunnel.ts + index.ts): add `drain()` on
TunnelManager that walks every open WS and closes it with code 1001
("Going Away"), waits 250ms for frames to flush, then exits. Wired
to SIGTERM in index.ts so Fly's pre-stop signal does the right thing.
- Host (packages/host-service/src/tunnel/tunnel-client.ts): when the
WS onClose event reports code 1001, reset reconnectAttempts to 0 so
the next reconnect fires at the 0.5-1s base delay instead of the
saturated 30s/attempt tail.
Combined, deploy outages should drop from ~60s to ~2-5s per affected
host. Pre-requisite for safe multi-region prod rollout — multi-region
amortizes the impact across regions but doesn't fix the per-host
window without graceful drain.
Verified via game-day on superset-relay-staging (sjc, iad, fra).
* fix(relay): use SIGINT for drain + kill_timeout=10s on Fly
Verified during game-day: Fly's init sends SIGINT (not SIGTERM) to the
main process during rolling deploys, AND its default ~5s grace period
kills the process before the JS handler runs. Two fixes:
- Listen on both SIGINT and SIGTERM in apps/relay/src/index.ts so the
drain handler intercepts Fly's actual signal (also covers
`fly machine stop` and local Ctrl-C).
- Add `kill_signal = "SIGINT"` + `kill_timeout = "10s"` to both fly.toml
and fly.staging.toml so the handler actually has time to drain.
Verified on staging deploy 4: SJC SIGINT now fires "[relay] SIGINT
received, draining tunnels", unregisters open WSs cleanly, then exits
with code 0. Host re-registration happens ~12s after WS close (down
from 64s pre-fix).
* chore(relay): add deploy-staging.sh for parametrized staging deploys
* fix(relay): clear self-owned directory entries on startup
Game-day #14 exposed a ~2min user-visible glitch when a relay machine
gets killed and auto-restarted by Fly. The killed process never gets
to unregister its tunnels (SIGKILL leaves no opportunity), so the
directory keeps entries pointing at "{region}:{machineId}". Fly
auto-restarts the same machineId, but the new process has no in-memory
tunnels — meanwhile, cross-region requests still fly-replay traffic
at that stale machineId and the reconciler reports `is_online=yes`
based on the stale directory.
Fix: on startup, scan the OWNER hash for entries matching our own
encoded owner string and delete them via the existing compare-and-
delete `unregister` helper. Runs synchronously before serve() via
top-level await — the relay won't accept connections until cleanup
completes, so we can't race a fresh registration.
Verified on staging: kill SJC machine → 6s VM cycle → "cleared 2
stale directory entries on startup" log → sandbox re-registered 13s
after kill (down from ~2min recovery without this fix).
* fix(host-service): lower reconnect MAX_MS from 30s to 5s
Game-day #15 (WS storm) showed daemons stranded for minutes after
consecutive disconnects. With the previous 30s cap, retries land 15-30s
apart in the saturated tail of the exponential curve, and the
post-recovery retry frequently misses the relay-back-online window.
5s cap means slightly more retry traffic under prolonged outages but
ensures hosts reconnect promptly when the relay is back. Pairs with
the 1001-reset (clean close → reset attempts) to keep deploy-time
disconnects to a few seconds.
* fix(relay): address PR review — close-code, accept-during-drain, batch cleanup
Four fixes from PR review on #4594:
1. Drain uses app-defined close code 4001 instead of 1001. The
ping-timeout path in tunnel.ts already uses 1001, and the host-side
reset-on-clean-close would have fired for those too — risking a
thundering-herd reconnect storm under a mass ping-timeout event.
Now the host only resets on 4001 (deliberate deploy drain), and
1001 keeps its RFC-conformant "going away" semantics elsewhere.
(Greptile P1)
2. Stop accepting new TCP connections during drain. Call server.close()
before tunnelManager.drain() so a host that reconnects during the
250ms flush window doesn't open a tunnel that immediately dies on
process.exit. (Greptile P2)
3. Startup directory cleanup batched into a single Lua EVAL. Previous
loop did one Redis round-trip per stale entry — at prod scale (~500
tunnels per machine) that would have eaten 25s of startup time on
typical Upstash latency. Single EVAL keeps cleanup at one RTT
regardless of directory size. (CodeRabbit Major)
4. Update host-side comment from "30s retry mode" to "5s ceiling"
(matches the MAX_MS=5s change earlier in this PR). (CodeRabbit
Minor)
* chore(relay): post-deploy smoke test across all regions
Adds smoke-test.sh that pings /health on every configured region via
`fly-prefer-region` and verifies the response body's `region` field
matches the requested region. Wired into both deploy.sh and
deploy-staging.sh so deploys halt if any region failed to come up.
Catches partial-deploy failures, missing regions, and machines that
booted but aren't actually serving — exactly the multi-region failure
modes we couldn't detect with the existing single-region setup.
Verified manually against superset-relay-staging.fly.dev (sjc, iad, fra)
— all 3 returned correct region in /health response.
* fix(relay): in-band drain signal before WS close on shutdown
Game-day testing on multi-region staging revealed the WS close frame
doesn't reliably reach the host within Fly's kill_timeout window. The
host's TCP socket stayed ESTABLISHED with no onclose event firing until
the 75s inbound-silence watchdog tripped — i.e. ~110s of user-visible
outage on every deploy.
Send a JSON {type:"drain"} message on the WS data channel first, wait
500ms for it to flush, then close. The message channel already carries
ping/pong every 30s, so we know it traverses the path cleanly. Host's
onmessage handler resets backoff and triggers an immediate reconnect.
The subsequent close() is belt-and-suspenders.
Verified end-to-end on staging: host log shows
"[host-service:tunnel] relay drain notice received (Server draining
for deploy); reconnecting immediately"
followed by reconnect within ~900ms.
* fix(relay): await directory cleanup during drain; reject registers while draining
drain() previously relied on WS onClose → unregister → fire-and-forget
directory.unregister for directory cleanup. process.exit(0) runs right
after drain() returns, so those async Redis writes routinely didn't land
before the process died — leaving the dead machine's directory entries
live for ~90s (TTL), during which other relay machines fly-replay
traffic to the corpse. Invisible single-region; a real per-deploy
misrouting window multi-region.
- drain() now awaits an explicit clearDirectory() callback
(clearStaleEntriesForMachine — the batch Lua script already used on
startup) before the process exits.
- A draining flag rejects new tunnel registrations during the drain
window, in register() and the /tunnel onOpen handler, re-checked after
the async auth calls so a host can't slip in while draining flips.
- drain() routes each tunnel through disposeTunnel and removes it from
the map, so the trailing onClose callbacks are no-ops rather than
racing the explicit directory clear.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Related Issues
Type of Change
Testing
Screenshots (if applicable)
Additional Notes