Skip to content

padding fix#15

Merged
Kitenite merged 1 commit into
mainfrom
padding-fix
Nov 3, 2025
Merged

padding fix#15
Kitenite merged 1 commit into
mainfrom
padding-fix

Conversation

@AviPeltz
Copy link
Copy Markdown
Collaborator

@AviPeltz AviPeltz commented Nov 3, 2025

Description

Related Issues

Type of Change

  • Bug fix
  • New feature
  • Documentation
  • Refactor
  • Other (please describe):

Testing

Screenshots (if applicable)

Additional Notes

@vercel
Copy link
Copy Markdown

vercel Bot commented Nov 3, 2025

@AviPeltz is attempting to deploy a commit to the Personal team on Vercel, but is not a member of this team. To resolve this issue, you can:

  • Make your repository public. Collaboration is free for open source and public repositories.
  • Upgrade to pro and add @AviPeltz as a member. A Pro subscription is required to access Vercel's collaborative features.
    • If you're the owner of the team, click here to upgrade and add @AviPeltz as a member.
    • If you're the user who initiated this build request, click here to request access.
    • If you're already a member of the Personal team, make sure that your Vercel account is connected to your GitHub account.

To read more about collaboration on Vercel, click here.

@Kitenite Kitenite merged commit fbe18a3 into main Nov 3, 2025
2 of 6 checks passed
@Kitenite Kitenite deleted the padding-fix branch November 3, 2025 21:54
saddlepaddle added a commit that referenced this pull request May 15, 2026
Game-day #15 (WS storm) showed daemons stranded for minutes after
consecutive disconnects. With the previous 30s cap, retries land 15-30s
apart in the saturated tail of the exponential curve, and the
post-recovery retry frequently misses the relay-back-online window.

5s cap means slightly more retry traffic under prolonged outages but
ensures hosts reconnect promptly when the relay is back. Pairs with
the 1001-reset (clean close → reset attempts) to keep deploy-time
disconnects to a few seconds.
saddlepaddle added a commit that referenced this pull request May 16, 2026
Game-day #15 (WS storm) showed daemons stranded for minutes after
consecutive disconnects. With the previous 30s cap, retries land 15-30s
apart in the saturated tail of the exponential curve, and the
post-recovery retry frequently misses the relay-back-online window.

5s cap means slightly more retry traffic under prolonged outages but
ensures hosts reconnect promptly when the relay is back. Pairs with
the 1001-reset (clean close → reset attempts) to keep deploy-time
disconnects to a few seconds.
saddlepaddle added a commit that referenced this pull request May 16, 2026
* feat(relay): graceful tunnel drain on SIGTERM

Discovered during multi-region rolling-deploy game-day: every relay
machine restart left host tunnels disconnected for ~60s because:

1. Fly sends SIGTERM, relay process exits without closing WS, host
   sees a TCP-RST'd socket as a generic error.
2. Host's exponential reconnect backoff (1s → 30s cap) climbs to the
   30s/attempt tail after ~6 failed attempts while the relay is down.
3. By the time the relay is back up, the host is in deep backoff and
   takes another 15-30s to retry.

Result: ~60s user-visible outage per rolling-restart cycle, multiplied
across regions during multi-region deploys.

Fix is two-sided:

- Relay (apps/relay/src/tunnel.ts + index.ts): add `drain()` on
  TunnelManager that walks every open WS and closes it with code 1001
  ("Going Away"), waits 250ms for frames to flush, then exits. Wired
  to SIGTERM in index.ts so Fly's pre-stop signal does the right thing.

- Host (packages/host-service/src/tunnel/tunnel-client.ts): when the
  WS onClose event reports code 1001, reset reconnectAttempts to 0 so
  the next reconnect fires at the 0.5-1s base delay instead of the
  saturated 30s/attempt tail.

Combined, deploy outages should drop from ~60s to ~2-5s per affected
host. Pre-requisite for safe multi-region prod rollout — multi-region
amortizes the impact across regions but doesn't fix the per-host
window without graceful drain.

Verified via game-day on superset-relay-staging (sjc, iad, fra).

* fix(relay): use SIGINT for drain + kill_timeout=10s on Fly

Verified during game-day: Fly's init sends SIGINT (not SIGTERM) to the
main process during rolling deploys, AND its default ~5s grace period
kills the process before the JS handler runs. Two fixes:

- Listen on both SIGINT and SIGTERM in apps/relay/src/index.ts so the
  drain handler intercepts Fly's actual signal (also covers
  `fly machine stop` and local Ctrl-C).
- Add `kill_signal = "SIGINT"` + `kill_timeout = "10s"` to both fly.toml
  and fly.staging.toml so the handler actually has time to drain.

Verified on staging deploy 4: SJC SIGINT now fires "[relay] SIGINT
received, draining tunnels", unregisters open WSs cleanly, then exits
with code 0. Host re-registration happens ~12s after WS close (down
from 64s pre-fix).

* chore(relay): add deploy-staging.sh for parametrized staging deploys

* fix(relay): clear self-owned directory entries on startup

Game-day #14 exposed a ~2min user-visible glitch when a relay machine
gets killed and auto-restarted by Fly. The killed process never gets
to unregister its tunnels (SIGKILL leaves no opportunity), so the
directory keeps entries pointing at "{region}:{machineId}". Fly
auto-restarts the same machineId, but the new process has no in-memory
tunnels — meanwhile, cross-region requests still fly-replay traffic
at that stale machineId and the reconciler reports `is_online=yes`
based on the stale directory.

Fix: on startup, scan the OWNER hash for entries matching our own
encoded owner string and delete them via the existing compare-and-
delete `unregister` helper. Runs synchronously before serve() via
top-level await — the relay won't accept connections until cleanup
completes, so we can't race a fresh registration.

Verified on staging: kill SJC machine → 6s VM cycle → "cleared 2
stale directory entries on startup" log → sandbox re-registered 13s
after kill (down from ~2min recovery without this fix).

* fix(host-service): lower reconnect MAX_MS from 30s to 5s

Game-day #15 (WS storm) showed daemons stranded for minutes after
consecutive disconnects. With the previous 30s cap, retries land 15-30s
apart in the saturated tail of the exponential curve, and the
post-recovery retry frequently misses the relay-back-online window.

5s cap means slightly more retry traffic under prolonged outages but
ensures hosts reconnect promptly when the relay is back. Pairs with
the 1001-reset (clean close → reset attempts) to keep deploy-time
disconnects to a few seconds.

* fix(relay): address PR review — close-code, accept-during-drain, batch cleanup

Four fixes from PR review on #4594:

1. Drain uses app-defined close code 4001 instead of 1001. The
   ping-timeout path in tunnel.ts already uses 1001, and the host-side
   reset-on-clean-close would have fired for those too — risking a
   thundering-herd reconnect storm under a mass ping-timeout event.
   Now the host only resets on 4001 (deliberate deploy drain), and
   1001 keeps its RFC-conformant "going away" semantics elsewhere.
   (Greptile P1)

2. Stop accepting new TCP connections during drain. Call server.close()
   before tunnelManager.drain() so a host that reconnects during the
   250ms flush window doesn't open a tunnel that immediately dies on
   process.exit. (Greptile P2)

3. Startup directory cleanup batched into a single Lua EVAL. Previous
   loop did one Redis round-trip per stale entry — at prod scale (~500
   tunnels per machine) that would have eaten 25s of startup time on
   typical Upstash latency. Single EVAL keeps cleanup at one RTT
   regardless of directory size. (CodeRabbit Major)

4. Update host-side comment from "30s retry mode" to "5s ceiling"
   (matches the MAX_MS=5s change earlier in this PR). (CodeRabbit
   Minor)

* chore(relay): post-deploy smoke test across all regions

Adds smoke-test.sh that pings /health on every configured region via
`fly-prefer-region` and verifies the response body's `region` field
matches the requested region. Wired into both deploy.sh and
deploy-staging.sh so deploys halt if any region failed to come up.

Catches partial-deploy failures, missing regions, and machines that
booted but aren't actually serving — exactly the multi-region failure
modes we couldn't detect with the existing single-region setup.

Verified manually against superset-relay-staging.fly.dev (sjc, iad, fra)
— all 3 returned correct region in /health response.

* fix(relay): in-band drain signal before WS close on shutdown

Game-day testing on multi-region staging revealed the WS close frame
doesn't reliably reach the host within Fly's kill_timeout window. The
host's TCP socket stayed ESTABLISHED with no onclose event firing until
the 75s inbound-silence watchdog tripped — i.e. ~110s of user-visible
outage on every deploy.

Send a JSON {type:"drain"} message on the WS data channel first, wait
500ms for it to flush, then close. The message channel already carries
ping/pong every 30s, so we know it traverses the path cleanly. Host's
onmessage handler resets backoff and triggers an immediate reconnect.
The subsequent close() is belt-and-suspenders.

Verified end-to-end on staging: host log shows
  "[host-service:tunnel] relay drain notice received (Server draining
   for deploy); reconnecting immediately"
followed by reconnect within ~900ms.

* fix(relay): await directory cleanup during drain; reject registers while draining

drain() previously relied on WS onClose → unregister → fire-and-forget
directory.unregister for directory cleanup. process.exit(0) runs right
after drain() returns, so those async Redis writes routinely didn't land
before the process died — leaving the dead machine's directory entries
live for ~90s (TTL), during which other relay machines fly-replay
traffic to the corpse. Invisible single-region; a real per-deploy
misrouting window multi-region.

- drain() now awaits an explicit clearDirectory() callback
  (clearStaleEntriesForMachine — the batch Lua script already used on
  startup) before the process exits.
- A draining flag rejects new tunnel registrations during the drain
  window, in register() and the /tunnel onOpen handler, re-checked after
  the async auth calls so a host can't slip in while draining flips.
- drain() routes each tunnel through disposeTunnel and removes it from
  the map, so the trailing onClose callbacks are no-ops rather than
  racing the explicit directory clear.
sazabi Bot pushed a commit that referenced this pull request May 20, 2026
* feat(relay): graceful tunnel drain on SIGTERM

Discovered during multi-region rolling-deploy game-day: every relay
machine restart left host tunnels disconnected for ~60s because:

1. Fly sends SIGTERM, relay process exits without closing WS, host
   sees a TCP-RST'd socket as a generic error.
2. Host's exponential reconnect backoff (1s → 30s cap) climbs to the
   30s/attempt tail after ~6 failed attempts while the relay is down.
3. By the time the relay is back up, the host is in deep backoff and
   takes another 15-30s to retry.

Result: ~60s user-visible outage per rolling-restart cycle, multiplied
across regions during multi-region deploys.

Fix is two-sided:

- Relay (apps/relay/src/tunnel.ts + index.ts): add `drain()` on
  TunnelManager that walks every open WS and closes it with code 1001
  ("Going Away"), waits 250ms for frames to flush, then exits. Wired
  to SIGTERM in index.ts so Fly's pre-stop signal does the right thing.

- Host (packages/host-service/src/tunnel/tunnel-client.ts): when the
  WS onClose event reports code 1001, reset reconnectAttempts to 0 so
  the next reconnect fires at the 0.5-1s base delay instead of the
  saturated 30s/attempt tail.

Combined, deploy outages should drop from ~60s to ~2-5s per affected
host. Pre-requisite for safe multi-region prod rollout — multi-region
amortizes the impact across regions but doesn't fix the per-host
window without graceful drain.

Verified via game-day on superset-relay-staging (sjc, iad, fra).

* fix(relay): use SIGINT for drain + kill_timeout=10s on Fly

Verified during game-day: Fly's init sends SIGINT (not SIGTERM) to the
main process during rolling deploys, AND its default ~5s grace period
kills the process before the JS handler runs. Two fixes:

- Listen on both SIGINT and SIGTERM in apps/relay/src/index.ts so the
  drain handler intercepts Fly's actual signal (also covers
  `fly machine stop` and local Ctrl-C).
- Add `kill_signal = "SIGINT"` + `kill_timeout = "10s"` to both fly.toml
  and fly.staging.toml so the handler actually has time to drain.

Verified on staging deploy 4: SJC SIGINT now fires "[relay] SIGINT
received, draining tunnels", unregisters open WSs cleanly, then exits
with code 0. Host re-registration happens ~12s after WS close (down
from 64s pre-fix).

* chore(relay): add deploy-staging.sh for parametrized staging deploys

* fix(relay): clear self-owned directory entries on startup

Game-day #14 exposed a ~2min user-visible glitch when a relay machine
gets killed and auto-restarted by Fly. The killed process never gets
to unregister its tunnels (SIGKILL leaves no opportunity), so the
directory keeps entries pointing at "{region}:{machineId}". Fly
auto-restarts the same machineId, but the new process has no in-memory
tunnels — meanwhile, cross-region requests still fly-replay traffic
at that stale machineId and the reconciler reports `is_online=yes`
based on the stale directory.

Fix: on startup, scan the OWNER hash for entries matching our own
encoded owner string and delete them via the existing compare-and-
delete `unregister` helper. Runs synchronously before serve() via
top-level await — the relay won't accept connections until cleanup
completes, so we can't race a fresh registration.

Verified on staging: kill SJC machine → 6s VM cycle → "cleared 2
stale directory entries on startup" log → sandbox re-registered 13s
after kill (down from ~2min recovery without this fix).

* fix(host-service): lower reconnect MAX_MS from 30s to 5s

Game-day #15 (WS storm) showed daemons stranded for minutes after
consecutive disconnects. With the previous 30s cap, retries land 15-30s
apart in the saturated tail of the exponential curve, and the
post-recovery retry frequently misses the relay-back-online window.

5s cap means slightly more retry traffic under prolonged outages but
ensures hosts reconnect promptly when the relay is back. Pairs with
the 1001-reset (clean close → reset attempts) to keep deploy-time
disconnects to a few seconds.

* fix(relay): address PR review — close-code, accept-during-drain, batch cleanup

Four fixes from PR review on #4594:

1. Drain uses app-defined close code 4001 instead of 1001. The
   ping-timeout path in tunnel.ts already uses 1001, and the host-side
   reset-on-clean-close would have fired for those too — risking a
   thundering-herd reconnect storm under a mass ping-timeout event.
   Now the host only resets on 4001 (deliberate deploy drain), and
   1001 keeps its RFC-conformant "going away" semantics elsewhere.
   (Greptile P1)

2. Stop accepting new TCP connections during drain. Call server.close()
   before tunnelManager.drain() so a host that reconnects during the
   250ms flush window doesn't open a tunnel that immediately dies on
   process.exit. (Greptile P2)

3. Startup directory cleanup batched into a single Lua EVAL. Previous
   loop did one Redis round-trip per stale entry — at prod scale (~500
   tunnels per machine) that would have eaten 25s of startup time on
   typical Upstash latency. Single EVAL keeps cleanup at one RTT
   regardless of directory size. (CodeRabbit Major)

4. Update host-side comment from "30s retry mode" to "5s ceiling"
   (matches the MAX_MS=5s change earlier in this PR). (CodeRabbit
   Minor)

* chore(relay): post-deploy smoke test across all regions

Adds smoke-test.sh that pings /health on every configured region via
`fly-prefer-region` and verifies the response body's `region` field
matches the requested region. Wired into both deploy.sh and
deploy-staging.sh so deploys halt if any region failed to come up.

Catches partial-deploy failures, missing regions, and machines that
booted but aren't actually serving — exactly the multi-region failure
modes we couldn't detect with the existing single-region setup.

Verified manually against superset-relay-staging.fly.dev (sjc, iad, fra)
— all 3 returned correct region in /health response.

* fix(relay): in-band drain signal before WS close on shutdown

Game-day testing on multi-region staging revealed the WS close frame
doesn't reliably reach the host within Fly's kill_timeout window. The
host's TCP socket stayed ESTABLISHED with no onclose event firing until
the 75s inbound-silence watchdog tripped — i.e. ~110s of user-visible
outage on every deploy.

Send a JSON {type:"drain"} message on the WS data channel first, wait
500ms for it to flush, then close. The message channel already carries
ping/pong every 30s, so we know it traverses the path cleanly. Host's
onmessage handler resets backoff and triggers an immediate reconnect.
The subsequent close() is belt-and-suspenders.

Verified end-to-end on staging: host log shows
  "[host-service:tunnel] relay drain notice received (Server draining
   for deploy); reconnecting immediately"
followed by reconnect within ~900ms.

* fix(relay): await directory cleanup during drain; reject registers while draining

drain() previously relied on WS onClose → unregister → fire-and-forget
directory.unregister for directory cleanup. process.exit(0) runs right
after drain() returns, so those async Redis writes routinely didn't land
before the process died — leaving the dead machine's directory entries
live for ~90s (TTL), during which other relay machines fly-replay
traffic to the corpse. Invisible single-region; a real per-deploy
misrouting window multi-region.

- drain() now awaits an explicit clearDirectory() callback
  (clearStaleEntriesForMachine — the batch Lua script already used on
  startup) before the process exits.
- A draining flag rejects new tunnel registrations during the drain
  window, in register() and the /tunnel onOpen handler, re-checked after
  the async auth calls so a host can't slip in while draining flips.
- drain() routes each tunnel through disposeTunnel and removes it from
  the map, so the trailing onClose callbacks are no-ops rather than
  racing the explicit directory clear.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants