docs(relay): hardening + horizontal scale-out plan #3636
Conversation
Captures the architecture plan for SUPER-427 (relay hardening) so phases can ship independently. Documents the fly-replay + Upstash directory approach (reusing the existing rate-limit instance), the truthful is_online story, streaming-proxy fix, graceful shutdown, observability, and data-model cleanups — with rejected alternatives recorded so we don't relitigate them.
📝 Walkthrough

A new design and implementation plan document was added for the Superset Relay service, outlining a multi-phase effort to harden and horizontally scale the service through architecture changes, control-flow modifications, correctness fixes, reliability improvements, and observability enhancements.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ Passed checks (4) | ❌ Failed checks (1 inconclusive)
Actionable comments posted: 2
🧹 Nitpick comments (1)
apps/relay/plans/20260420-relay-hardening.md (1)
139-144: Resolve the v2_machines vs device_presence question before Phase 6 implementation.

The open question at line 154 notes uncertainty about whether to create `v2_machines` or fold the data onto the existing `device_presence` table. Since Phase 6 involves a data model migration, resolving this architectural decision early will prevent rework.

The PR objectives mention this is pending review by Avi/Kiet — consider blocking Phase 6 implementation until that alignment is complete.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/relay/plans/20260420-relay-hardening.md` around lines 139 - 144, The PR calls out an unresolved architectural decision: whether to introduce v2_machines or reuse device_presence for the lifted machine-level attributes before implementing Phase 6; stop and decide the canonical model, coordinate with Avi/Kiet, and then update the Phase 6 plan and migrations accordingly (ensure v2_hosts.machine_id FK targets the chosen table, remove name from v2_hosts only if moving to v2_machines, and if keeping device_presence adjust its schema to contain name/os/cpu_count/total_memory/agent_version and telemetry columns last_connected_at/last_disconnected_at/connected_machine_id); also audit register paths (device.setHostOnline and packages/trpc/src/router/device/device.ts) to ensure name is not written on register once the final model is chosen and block Phase 6 work until alignment is confirmed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0f9951e3-1e48-4b43-a07c-6814185b408a
📒 Files selected for processing (1)
apps/relay/plans/20260420-relay-hardening.md
```
HSET tunnel-owner <hostId> <machineId> # set on register, DEL on unregister
HSET tunnel-meta <hostId> <json-blob> # registeredAt, lastPongAt
ZADD tunnel-ttl <expireAt-ms> <hostId> # for stale-owner cleanup
```
Add language specification to the code block.
The fenced code block showing the Redis schema should specify a language identifier for proper syntax highlighting and linting compliance.
📝 Proposed fix
````diff
-```
+```text
 HSET tunnel-owner <hostId> <machineId> # set on register, DEL on unregister
 HSET tunnel-meta <hostId> <json-blob> # registeredAt, lastPongAt
 ZADD tunnel-ttl <expireAt-ms> <hostId> # for stale-owner cleanup
````
🧰 Tools

🪛 markdownlint-cli2 (0.22.0)

[warning] 50-50: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @apps/relay/plans/20260420-relay-hardening.md around lines 50 - 54, Add a language identifier to the fenced code block containing the Redis commands so syntax highlighting and linting pass; update the opening fence from ``` to ```text for the block that contains the lines "HSET tunnel-owner <hostId> <machineId>", "HSET tunnel-meta <hostId> <json-blob>", and "ZADD tunnel-ttl <expireAt-ms> <hostId>" in the markdown so the block is explicitly marked as plain text.
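For orientation, the write path behind those three directory keys could look roughly like the sketch below. It assumes the `@upstash/redis` client and the existing `KV_REST_API_*` secrets the plan mentions; the TTL constant and function names are illustrative, not taken from the plan.

```typescript
import { Redis } from "@upstash/redis";

const redis = new Redis({
  url: process.env.KV_REST_API_URL!,
  token: process.env.KV_REST_API_TOKEN!,
});

// Illustrative TTL; the plan ties staleness to the ping cadence.
const TUNNEL_TTL_MS = 30_000;

// Called when a host's tunnel registers on this machine.
export async function recordRegistration(hostId: string, machineId: string) {
  const now = Date.now();
  await Promise.all([
    redis.hset("tunnel-owner", { [hostId]: machineId }),
    redis.hset("tunnel-meta", {
      [hostId]: JSON.stringify({ registeredAt: now, lastPongAt: now }),
    }),
    redis.zadd("tunnel-ttl", { score: now + TUNNEL_TTL_MS, member: hostId }),
  ]);
}

// Called on clean unregister; the sweeper handles the unclean case.
export async function recordUnregistration(hostId: string) {
  await Promise.all([
    redis.hdel("tunnel-owner", hostId),
    redis.hdel("tunnel-meta", hostId),
    redis.zrem("tunnel-ttl", hostId),
  ]);
}
```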
> ### Phase 4 — reliability & graceful shutdown
>
> - **SIGTERM handling on relay.** Intercept SIGTERM (fly sends this on scale-down/deploy), stop accepting new tunnels, send a close frame `{code: 4001, reason: "server-drain"}` to every tunneled host, wait up to 10s for graceful close, then exit. Hosts treat code 4001 as "reconnect immediately, no backoff."
> - **Client-side missed-pong detection.** Today the client only reconnects on `onclose`. Add a dead-socket detector: if no `onmessage` (including pings) for 45s, forcibly close and reconnect. Server pings every 30s (`tunnel.ts:55`); three missed pings = 90s to detect, too slow. Also have the client send an app-level ping every 15s so the server can detect dead clients sooner (currently the server only knows about client liveness via TCP).
> - **Tighter ping cadence.** 10s ping interval, 3 missed = 30s to detect. Tradeoff: ~10 ping frames/minute/host × 1k hosts = 166 msg/s, trivial.
> - **Request timeout policy.** `REQUEST_TIMEOUT_MS = 30_000` at `tunnel.ts:33` is too short for some tRPC calls (chat.sendMessage saw 11s already). Raise to 120s, and mark streaming requests (phase 3) as never-timing-out at this layer.
> - **Backoff cap lowered.** `RECONNECT_MAX_MS = 30_000` is fine but combined with TCP FIN timing a dead server isn't noticed for up to ~90s. Not a backoff issue; the missed-pong detector above fixes it.
Clarify the ping interval and detection math.
The ping/pong detection strategy has some inconsistencies:
- Line 117 mentions detecting dead sockets after 45s with no message, but with a 30s server ping interval, 45s is only 1.5 intervals.
- Line 118 first proposes client pings every 15s, then proposes "10s ping interval" without specifying whether that's client-only, server-only, or both.
- The calculation "three missed pings = 90s" doesn't align with the earlier 45s threshold.
Recommend clarifying:
- Final server ping interval (10s?)
- Final client ping interval (10s or 15s?)
- Detection threshold formula (e.g., "3 × ping_interval" = 30s if both ping at 10s)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/relay/plans/20260420-relay-hardening.md` around lines 114 - 121, The
doc's ping/pong section is inconsistent; update the text to explicitly state the
final server and client ping intervals and the detection formula, e.g., set
Server ping interval = 10s (tunnel.ts ping sender at line referenced by
tunnel.ts:55), Client ping interval = 10s (client app-level ping), and Detection
threshold = 3 × ping_interval = 30s (so three missed pings = 30s); also
reconcile the earlier 45s and 90s references and mention REQUEST_TIMEOUT_MS and
RECONNECT_MAX_MS only where relevant to timeouts/backoff (REQUEST_TIMEOUT_MS ->
120s as noted, RECONNECT_MAX_MS unchanged) so the math and behavior are
unambiguous.
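To make the recommendation concrete, a minimal client-side detector under the "10s ping, 3 missed" cadence could look like the sketch below. The constants and `attachDeadSocketDetector` are illustrative, not taken from `tunnel.ts` or the client code.

```typescript
// Assumed final cadence: both sides ping every 10s, 3 misses = 30s to detect.
const PING_INTERVAL_MS = 10_000;
const MISSED_PING_LIMIT = 3;

export function attachDeadSocketDetector(ws: WebSocket, reconnect: () => void) {
  let lastMessageAt = Date.now();

  // Any inbound frame (including server pings) counts as liveness.
  ws.addEventListener("message", () => {
    lastMessageAt = Date.now();
  });

  const timer = setInterval(() => {
    if (Date.now() - lastMessageAt > PING_INTERVAL_MS * MISSED_PING_LIMIT) {
      clearInterval(timer);
      ws.close(); // force the reconnect path instead of waiting for TCP to notice
      reconnect();
    }
  }, PING_INTERVAL_MS);

  ws.addEventListener("close", () => clearInterval(timer));
}
```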
Greptile Summary

This PR adds a detailed architecture and phasing plan (`apps/relay/plans/20260420-relay-hardening.md`). The document is well-grounded, with line references that match the existing codebase.
Confidence Score: 4/5

Safe to merge as a planning document; no production code is changed, and the identified gaps should be resolved during Phase 1 implementation rather than blocking this PR. This is a documentation-only PR, so there is no direct production risk. The plan is thorough, well-structured, and accurately references the existing codebase. The issues found are design-level concerns that need to be addressed before or during implementation of Phase 1, not blockers to merging the plan itself. The score is 4 rather than 5 because the Redis-as-SPOF gap and the concurrent-sweeper race are meaningful architectural issues that should be acknowledged and resolved in the document before the plan is used as an implementation spec.

Key file: `apps/relay/plans/20260420-relay-hardening.md`, specifically the Redis failure-mode section, the sweeper design, and the Phase 2 dependency note.
| Filename | Overview |
|---|---|
| apps/relay/plans/20260420-relay-hardening.md | New architecture planning doc for relay hardening and horizontal scale-out via fly-replay + Redis ownership directory; well-structured and grounded in the existing codebase, with a few design gaps around Redis failure modes, concurrent sweeper races, WS-upgrade replay verification, and a Phase 2 boot-cleanup race that depends on Phase 1 landing first. |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Incoming Request to Fly LB] --> B[Routed to Relay Machine N]
    B --> C{Machine N owns tunnel?}
    C -- Yes --> D[Serve request in-process via WS tunnel]
    C -- No --> E[HGET tunnel-owner from Upstash Redis]
    E --> F{Owner found?}
    F -- No --> G[Return 503 Host not connected]
    F -- Yes --> H[Reply with fly-replay header pointing to owner machine]
    H --> I[Fly LB re-routes to owning machine]
    I --> D
    D --> J[Response returned to client]
    subgraph Registration
        R1[Host opens WS on register] --> R2[TunnelManager.register in-process]
        R2 --> R3[HSET tunnel-owner in Redis]
        R3 --> R4[ZADD tunnel-ttl in Redis]
    end
    subgraph Shutdown
        S1[SIGTERM received] --> S2[Stop accepting new tunnels]
        S2 --> S3[Send close frame to all hosts]
        S3 --> S4[DEL tunnel-owner from Redis]
        S4 --> S5[Mark hosts offline in DB]
    end
```
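Read as code, the request-path branch of the flowchart might look roughly like this. The handler shape, `routeOrReplay`, and `localTunnels` are assumptions for illustration; the `fly-replay: instance=<machine-id>` header format follows Fly's replay documentation.

```typescript
import { Redis } from "@upstash/redis";

const redis = new Redis({
  url: process.env.KV_REST_API_URL!,
  token: process.env.KV_REST_API_TOKEN!,
});
const MY_MACHINE_ID = process.env.FLY_MACHINE_ID!;

type Decision =
  | { kind: "serve-local" }
  | { kind: "replay"; headers: Record<string, string> }
  | { kind: "unavailable"; status: 503; body: string };

export async function routeOrReplay(
  hostId: string,
  localTunnels: Set<string>,
): Promise<Decision> {
  // Fast path: this machine owns the tunnel, serve it in-process.
  if (localTunnels.has(hostId)) return { kind: "serve-local" };

  // Cross-machine path: consult the ownership directory.
  const owner = await redis.hget<string>("tunnel-owner", hostId);
  if (!owner || owner === MY_MACHINE_ID) {
    // No owner anywhere, or a stale entry pointing at us: host isn't connected.
    return { kind: "unavailable", status: 503, body: "Host not connected" };
  }

  // Ask Fly's proxy to replay this request on the machine that owns the tunnel.
  return { kind: "replay", headers: { "fly-replay": `instance=${owner}` } };
}
```

This assumes the directory stores Fly machine IDs (as the `HSET tunnel-owner <hostId> <machineId>` schema suggests), since that is what the replay header targets.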
| - **Fly's "Managed Redis" add-on.** It is just Upstash under the hood (reseller relationship; product has since been de-emphasized). Zero latency or feature win over using Upstash directly. | ||
| - **Self-hosted Redis on a fly machine behind flycast.** Sub-ms latency via the wireguard network, but we own replication, backups, HA, and deploy sequencing. Not worth it for ~few-hundred ops/sec. | ||
| - **Putting Upstash behind fly's private network.** Upstash's REST endpoint is public-internet HTTPS; they don't expose a flycast-routable endpoint. Can't keep it inside the VPN. |
Redis becomes a new hot-path single point of failure
The plan states "No new failure mode to monitor" for the Upstash choice, but that's not quite right. Today the relay serves every request entirely in-process — no external dependencies on the hot path. After Phase 1, every cross-machine proxy request requires a successful HGET against Upstash before the fly-replay header can be emitted. If Upstash is unavailable (even for a brief outage or REST endpoint blip), all cross-machine requests degrade to 503.
The plan should explicitly state the intended fallback behavior:
- Fail-open: if the Redis lookup throws, return 503 (same as today — acceptable degradation).
- Fail-open with local-only serving: if Redis is unreachable, serve only tunnels owned by the current machine and return 503 for others. This at least preserves partial availability.
Recommend adding a "Failure modes" section, or at minimum a note under "### Redis itself", documenting the chosen behavior. This also affects the Phase 5 alerts — a Redis error-rate alert should be added alongside the proxy-latency ones.
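A sketch of the fail-open, local-only option plus the suggested error-rate metric, assuming `prom-client` for metrics; the metric name and the `lookupOwner` helper are illustrative:

```typescript
import { Counter } from "prom-client";
import { Redis } from "@upstash/redis";

const redis = new Redis({
  url: process.env.KV_REST_API_URL!,
  token: process.env.KV_REST_API_TOKEN!,
});

const redisErrors = new Counter({
  name: "relay_redis_errors_total",
  help: "Upstash lookups that failed on the proxy hot path",
});

// Returns the owning machine id, or null when the host is unknown *or* Redis
// is unreachable. On null the caller serves local tunnels and 503s the rest,
// so a Redis blip degrades to "local-only" instead of taking everything down.
export async function lookupOwner(hostId: string): Promise<string | null> {
  try {
    return await redis.hget<string>("tunnel-owner", hostId);
  } catch (err) {
    redisErrors.inc();
    console.error("tunnel-owner lookup failed; serving local tunnels only", err);
    return null;
  }
}
```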
> ### Redis itself
>
> **Reuse the existing Upstash instance** (`KV_REST_API_URL` / `KV_REST_API_TOKEN`, already used by `packages/auth/src/lib/rate-limit.ts` and a dozen api routes). No new infra, no new secrets, no new failure mode to monitor. Write through the `@upstash/redis` REST client we already depend on.
Concurrent sweepers can race-delete live entries

The plan has every machine independently running a sweeper that scans the `tunnel-ttl` ZSET and deletes entries whose `expireAt-ms` has passed. With N machines, N sweepers are running simultaneously, and there's a window where:

1. Machine A's sweeper identifies `hostId` as stale (heartbeat > 2× ping interval).
2. Machine B's sweeper does the same simultaneously.
3. Machine A also owns the tunnel — it just hasn't heartbeated yet because it was processing a burst of frames.
4. Both sweepers delete the Redis entry; the tunnel's ownership row is gone even though it's still alive.

A few approaches to address this:

- Use a Redis Lua script or `WATCH`/`MULTI` to atomically check-and-delete only if the caller's machine owns the entry (i.e., `if HGET tunnel-owner hostId == myMachineId then DEL`).
- Elect a single sweeper leader via a short-TTL Redis lock (`SET sweeper-lock <machineId> NX PX 25000`).
- Drop the ZSET approach and rely solely on native Redis key expiry: `SET tunnel-owner:<hostId> <machineId> PX <2×ping>` refreshed on each heartbeat. No sweeper process needed.

The native-TTL approach is the simplest and eliminates the concurrency issue entirely — worth considering over the current ZSET design.
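If the native-TTL option is chosen, a minimal sketch (same `@upstash/redis` client; the key layout, TTL constant, and function names are illustrative) could be:

```typescript
import { Redis } from "@upstash/redis";

const redis = new Redis({
  url: process.env.KV_REST_API_URL!,
  token: process.env.KV_REST_API_TOKEN!,
});

const PING_INTERVAL_MS = 10_000;
const OWNER_TTL_MS = 2 * PING_INTERVAL_MS; // entry expires after two missed heartbeats

const keyFor = (hostId: string) => `tunnel-owner:${hostId}`;

// On register: claim ownership with a TTL instead of a ZSET entry.
export async function claimOwnership(hostId: string, machineId: string) {
  await redis.set(keyFor(hostId), machineId, { px: OWNER_TTL_MS });
}

// On each pong: refresh the TTL, but only if the key still exists (XX),
// so a machine that lost ownership can't silently re-create it.
export async function refreshOwnership(hostId: string, machineId: string) {
  await redis.set(keyFor(hostId), machineId, { px: OWNER_TTL_MS, xx: true });
}

// On clean unregister: compare-and-delete so a late shutdown on the old
// machine can't clobber a newer owner's entry.
export async function releaseOwnership(hostId: string, machineId: string) {
  const script = `
    if redis.call("GET", KEYS[1]) == ARGV[1] then
      return redis.call("DEL", KEYS[1])
    end
    return 0`;
  await redis.eval(script, [keyFor(hostId)], [machineId]);
}
```

Note the request-path lookup then becomes a `GET tunnel-owner:<hostId>` rather than an `HGET` on a single hash.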
> - Periodic reconciliation: every 30s, reconcile DB state against the Redis directory. Drift → DB loses, directory wins.
> - Move the is-online write off the hot path: today we fire `device.setHostOnline.mutate(...)` against the API on every register. With thrash (flaky clients reconnecting) this is a lot of writes. Batch or debounce.
>
> **Exit criteria:** Kill a relay machine with `fly machine stop`; within 60s, every host that was on it is flipped to `is_online=false` in the DB and the UI reflects it.
fly-replay on WS upgrades should be a hard go/no-go gate, not a "verify" step
The plan says "Verify fly honors the header on 101 upgrade. If not, we do an HTTP pre-flight first." The pre-flight fallback is a meaningful protocol change — clients would need to add an extra round-trip before every WS open. The fallback also adds complexity: the pre-flight response needs to be interpreted by the client, which doesn't exist today.
This should be verified against Fly's infrastructure before committing to the Phase 1 design (ideally a small spike: two machines, one tunnel, one cross-machine WS upgrade, check the logs). Recommend making this an explicit prerequisite task for Phase 1 rather than an inline note. If fly-replay doesn't work for WS, the pre-flight approach needs its own design phase with client changes included.
> - **Tighter ping cadence.** 10s ping interval, 3 missed = 30s to detect. Tradeoff: ~10 ping frames/minute/host × 1k hosts = 166 msg/s, trivial.
> - **Request timeout policy.** `REQUEST_TIMEOUT_MS = 30_000` at `tunnel.ts:33` is too short for some tRPC calls (chat.sendMessage saw 11s already). Raise to 120s, and mark streaming requests (phase 3) as never-timing-out at this layer.
> - **Backoff cap lowered.** `RECONNECT_MAX_MS = 30_000` is fine but combined with TCP FIN timing a dead server isn't noticed for up to ~90s. Not a backoff issue; the missed-pong detector above fixes it.
>
> ### Phase 5 — observability
Phase 2 boot-time cleanup has a startup race before Phase 1 lands

The plan says:

> Boot-time cleanup: on relay start, `UPDATE v2_hosts SET is_online = false WHERE is_online = true` filtered to rows not present in the local in-memory map

During the initial reconnection window after a restart (before all hosts have re-registered), the in-memory map is empty. Any host that would normally reconnect within the first N seconds will be incorrectly flipped to `is_online = false` and then immediately set back to true on reconnect — causing a visible flicker in the UI.

The plan itself acknowledges the fix ("once phase 1 lands, filter to rows not in the Redis directory"), but Phase 2 is described as "independently valuable" and could ship before Phase 1. If Phase 2 ships first, the boot-time cleanup should either:

- Be delayed by the expected max-reconnect window (e.g., run after 60s, not at boot).
- Be skipped entirely until Phase 1's Redis directory is available.

Worth adding a sequencing note in the Phase 2 description clarifying this dependency on Phase 1 for the boot-cleanup to be safe.
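A sketch of the "delay the cleanup by the reconnect window" option; the 60s window, the `db.markOfflineExcept` helper, and `localTunnels` are assumptions, not from the plan:

```typescript
// Wait out the reconnect storm after a restart before flipping anyone offline,
// so hosts that come right back don't flicker in the UI.
const RECONNECT_WINDOW_MS = 60_000;

export function scheduleBootCleanup(
  db: { markOfflineExcept(onlineHostIds: string[]): Promise<number> },
  localTunnels: Map<string, unknown>,
) {
  setTimeout(async () => {
    // Only hosts that still haven't re-registered after the window are flipped.
    const flipped = await db.markOfflineExcept([...localTunnels.keys()]);
    console.log(`boot cleanup: marked ${flipped} stale hosts offline`);
  }, RECONNECT_WINDOW_MS);
}
```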
🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

Thank you for your contribution! 🎉
Summary
- Adds the plan at `apps/relay/plans/20260420-relay-hardening.md` so the work can ship in independent phases.
- Uses the fly-replay + Upstash ownership-directory approach (reusing the instance `packages/auth/src/lib/rate-limit.ts` already uses). Rejected alternatives (pub/sub bus, per-tunnel subdomains, vertical-only, edge consistent-hash) recorded inline so we don't relitigate.
- Phase order: shared-state scale-out → truthful `is_online` → streaming proxy fix → graceful shutdown / dead-socket detection → observability → `v2_machines` data-model split → load testing.

Test plan

- Doc-only change; pending review by Avi/Kiet to settle the `v2_machines` vs `device_presence` open question in §6.

Summary by CodeRabbit
Summary by cubic
Adds the SUPER-427 relay hardening and horizontal scale-out plan in `apps/relay/plans/20260420-relay-hardening.md`. Proposes fly-replay sticky routing with a Redis-backed tunnel-ownership directory via `@upstash/redis`, and outlines phased work for shared-state scale-out, accurate `is_online`, streaming proxy fixes, graceful shutdown, observability, data-model cleanups, and load testing, with alternatives and tradeoffs documented.

Written for commit 159e4ac. Summary will update on new commits.