
docs(relay): hardening + horizontal scale-out plan #3636

Merged: saddlepaddle merged 1 commit into main from satya/relay-hardening-plan, Apr 22, 2026
Conversation

@saddlepaddle (Collaborator) commented Apr 22, 2026

Summary

  • Captures the SUPER-427 plan in apps/relay/plans/20260420-relay-hardening.md so the work can ship in independent phases.
  • Architecture decision: fly-replay sticky routing + a tunnel-ownership directory in our existing Upstash Redis (the one packages/auth/src/lib/rate-limit.ts already uses). Rejected alternatives (pub/sub bus, per-tunnel subdomains, vertical-only, edge consistent-hash) recorded inline so we don't relitigate.
  • Phased backlog: shared-state scale-out → truthful is_online → streaming proxy fix → graceful shutdown / dead-socket detection → observability → v2_machines data-model split → load testing.

Test plan

  • Plan reviewed by Avi/Kiet (esp. the v2_machines vs device_presence open question in §6).
  • Linear SUPER-427 mirrors the phase list once approved.

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive implementation plan document outlining planned architectural and operational improvements to the Relay service. This is an internal planning document with no immediate product changes.

Summary by cubic

Adds the SUPER-427 relay hardening and horizontal scale-out plan in apps/relay/plans/20260420-relay-hardening.md. Proposes fly-replay sticky routing with a Redis-backed tunnel-ownership directory via @upstash/redis, and outlines phased work for shared-state scale-out, accurate is_online, streaming proxy fixes, graceful shutdown, observability, data-model cleanups, and load testing, with alternatives and tradeoffs documented.

Written for commit 159e4ac. Summary will update on new commits.
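
For concreteness, the routing decision this summary describes could look roughly like the sketch below. It is an illustration only: `tunnels` and `serveThroughTunnel` are hypothetical stand-ins, while the `tunnel-owner` hash, the `KV_REST_API_*` variables, and the fly-replay header come from the plan itself.

```typescript
import { Redis } from "@upstash/redis";

// Hypothetical local state: tunnels this machine owns (hostId -> socket).
declare const tunnels: Map<string, unknown>;
declare function serveThroughTunnel(hostId: string): Promise<Response>;

// Reuse the existing Upstash instance, as the plan proposes.
const redis = new Redis({
  url: process.env.KV_REST_API_URL!,
  token: process.env.KV_REST_API_TOKEN!,
});

async function routeRequest(hostId: string): Promise<Response> {
  // Fast path: we own the tunnel, serve in-process as today.
  if (tunnels.has(hostId)) return serveThroughTunnel(hostId);

  // Cross-machine path: look up the owner and ask Fly to replay the request
  // on the owning machine.
  const owner = await redis.hget<string>("tunnel-owner", hostId);
  if (!owner) return new Response("Host not connected", { status: 503 });
  return new Response(null, {
    headers: { "fly-replay": `instance=${owner}` },
  });
}
```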

Captures the architecture plan for SUPER-427 (relay hardening) so
phases can ship independently. Documents the fly-replay + Upstash
directory approach (reusing the existing rate-limit instance), the
truthful is_online story, streaming-proxy fix, graceful shutdown,
observability, and data-model cleanups — with rejected alternatives
recorded so we don't relitigate them.
@coderabbitai Bot (Contributor) commented Apr 22, 2026

📝 Walkthrough

A new design and implementation plan document was added for the Superset Relay service, outlining a multi-phase effort to harden and horizontally scale the service through architecture changes, control-flow modifications, correctness fixes, reliability improvements, and observability enhancements.

Changes

Cohort / File(s): Relay Hardening Plan — apps/relay/plans/20260420-relay-hardening.md
Summary: New planning document detailing a multi-phase hardening and scaling strategy, including cross-machine tunnel ownership via fly-replay sticky routing, a Redis-backed directory, control-flow changes for proxying and WebSocket upgrades, correctness fixes, operational scaling steps, and planned improvements for protocol streaming, reliability, observability, and data modeling.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A relay strengthened, scaled with care,
Through sticky routes and Redis air,
Tunnels owned, directories bright,
WebSocket whispers, streaming light,
Plans laid out for phases grand—
A hardened service, firmly planned! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check: ❓ Inconclusive. The description covers the key aspects (summary, architecture decision, phased backlog) but omits required template sections like Related Issues, Type of Change, and Testing status. Resolution: complete the remaining template sections by explicitly listing Related Issues (SUPER-427), confirming the Type of Change (Documentation), and clarifying test/review status with checkboxes or completion notes.

✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The title accurately summarizes the main change: a plan document for relay hardening and horizontal scale-out.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; check skipped.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


@coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
apps/relay/plans/20260420-relay-hardening.md (1)

139-144: Resolve the v2_machines vs device_presence question before Phase 6 implementation.

The open question at line 154 notes uncertainty about whether to create v2_machines or fold the data onto the existing device_presence table. Since Phase 6 involves a data model migration, resolving this architectural decision early will prevent rework.

The PR objectives mention this is pending review by Avi/Kiet—consider blocking Phase 6 implementation until that alignment is complete.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/relay/plans/20260420-relay-hardening.md` around lines 139 - 144, The PR
calls out an unresolved architectural decision: whether to introduce v2_machines
or reuse device_presence for the lifted machine-level attributes before
implementing Phase 6; stop and decide the canonical model, coordinate with
Avi/Kiet, and then update the Phase 6 plan and migrations accordingly (ensure
v2_hosts.machine_id FK targets the chosen table, remove name from v2_hosts only
if moving to v2_machines, and if keeping device_presence adjust its schema to
contain name/os/cpu_count/total_memory/agent_version and telemetry columns
last_connected_at/last_disconnected_at/connected_machine_id); also audit
register paths (device.setHostOnline and
packages/trpc/src/router/device/device.ts) to ensure name is not written on
register once the final model is chosen and block Phase 6 work until alignment
is confirmed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0f9951e3-1e48-4b43-a07c-6814185b408a

📥 Commits

Reviewing files that changed from the base of the PR and between 2c6d6eb and 159e4ac.

📒 Files selected for processing (1)
  • apps/relay/plans/20260420-relay-hardening.md

Comment on lines +50 to +54
```
HSET tunnel-owner <hostId> <machineId> # set on register, DEL on unregister
HSET tunnel-meta <hostId> <json-blob> # registeredAt, lastPongAt
ZADD tunnel-ttl <expireAt-ms> <hostId> # for stale-owner cleanup
```
⚠️ Potential issue | 🟡 Minor

Add language specification to the code block.

The fenced code block showing the Redis schema should specify a language identifier for proper syntax highlighting and linting compliance.

📝 Proposed fix

````diff
-```
+```text
 HSET tunnel-owner <hostId> <machineId>            # set on register, DEL on unregister
 HSET tunnel-meta  <hostId> <json-blob>            # registeredAt, lastPongAt
 ZADD tunnel-ttl   <expireAt-ms> <hostId>          # for stale-owner cleanup
````

🧰 Tools: 🪛 markdownlint-cli2 (0.22.0)

[warning] 50-50: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/relay/plans/20260420-relay-hardening.md` around lines 50 - 54, Add a
language identifier to the fenced code block containing the Redis commands so
syntax highlighting and linting pass; update the opening fence from ``` to
```text for the block that contains the lines "HSET tunnel-owner <hostId>
<machineId>", "HSET tunnel-meta  <hostId> <json-blob>", and "ZADD tunnel-ttl
<expireAt-ms> <hostId>" in the markdown so the block is explicitly marked as
plain text.

Comment on lines +114 to +121
### Phase 4 — reliability & graceful shutdown

- **SIGTERM handling on relay.** Intercept SIGTERM (fly sends this on scale-down/deploy), stop accepting new tunnels, send a close frame `{code: 4001, reason: "server-drain"}` to every tunneled host, wait up to 10s for graceful close, then exit. Hosts treat code 4001 as "reconnect immediately, no backoff."
- **Client-side missed-pong detection.** Today the client only reconnects on `onclose`. Add a dead-socket detector: if no `onmessage` (including pings) for 45s, forcibly close and reconnect. Server pings every 30s (`tunnel.ts:55`); three missed pings = 90s to detect, too slow. Also have the client send an app-level ping every 15s so the server can detect dead clients sooner (currently the server only knows about client liveness via TCP).
- **Tighter ping cadence.** 10s ping interval, 3 missed = 30s to detect. Tradeoff: ~10 ping frames/minute/host × 1k hosts = 166 msg/s, trivial.
- **Request timeout policy.** `REQUEST_TIMEOUT_MS = 30_000` at `tunnel.ts:33` is too short for some tRPC calls (chat.sendMessage saw 11s already). Raise to 120s, and mark streaming requests (phase 3) as never-timing-out at this layer.
- **Backoff cap lowered.** `RECONNECT_MAX_MS = 30_000` is fine but combined with TCP FIN timing a dead server isn't noticed for up to ~90s. Not a backoff issue; the missed-pong detector above fixes it.

⚠️ Potential issue | 🟡 Minor

Clarify the ping interval and detection math.

The ping/pong detection strategy has some inconsistencies:

  • Line 117 mentions detecting dead sockets after 45s with no message, but with a 30s server ping interval, 45s is only 1.5 intervals.
  • Line 118 first proposes client pings every 15s, then proposes "10s ping interval" without specifying whether that's client-only, server-only, or both.
  • The calculation "three missed pings = 90s" doesn't align with the earlier 45s threshold.

Recommend clarifying:

  1. Final server ping interval (10s?)
  2. Final client ping interval (10s or 15s?)
  3. Detection threshold formula (e.g., "3 × ping_interval" = 30s if both ping at 10s)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/relay/plans/20260420-relay-hardening.md` around lines 114 - 121, The
doc's ping/pong section is inconsistent; update the text to explicitly state the
final server and client ping intervals and the detection formula, e.g., set
Server ping interval = 10s (tunnel.ts ping sender at line referenced by
tunnel.ts:55), Client ping interval = 10s (client app-level ping), and Detection
threshold = 3 × ping_interval = 30s (so three missed pings = 30s); also
reconcile the earlier 45s and 90s references and mention REQUEST_TIMEOUT_MS and
RECONNECT_MAX_MS only where relevant to timeouts/backoff (REQUEST_TIMEOUT_MS ->
120s as noted, RECONNECT_MAX_MS unchanged) so the math and behavior are
unambiguous.
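
Under the clarified numbers (both sides pinging every 10s, detection after 3 missed pings = 30s), a minimal client-side watchdog might look like the sketch below. This is an illustration only; `watchSocket`, `reconnect`, and message-based liveness tracking are assumptions, not the relay's actual code.

```typescript
// Minimal sketch of a client-side dead-socket detector. Assumes liveness is
// observable as inbound messages (app-level pings included).
const PING_INTERVAL_MS = 10_000;            // proposed interval for both sides
const DEAD_AFTER_MS = 3 * PING_INTERVAL_MS; // 3 missed pings = 30s

function watchSocket(ws: WebSocket, reconnect: () => void): void {
  let lastSeen = Date.now();
  ws.addEventListener("message", () => {
    lastSeen = Date.now(); // any inbound traffic counts as liveness
  });

  const timer = setInterval(() => {
    if (Date.now() - lastSeen > DEAD_AFTER_MS) {
      clearInterval(timer);
      ws.close(); // force-close the dead socket; normal reconnect takes over
      reconnect();
    }
  }, PING_INTERVAL_MS);

  ws.addEventListener("close", () => clearInterval(timer));
}
```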

@cubic-dev-ai Bot (Contributor) left a comment

No issues found across 1 file

@greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR adds a detailed architecture and phasing plan (apps/relay/plans/20260420-relay-hardening.md) for hardening the relay service and enabling horizontal scale-out via fly-replay sticky routing backed by an Upstash Redis ownership directory. No production code is changed; the plan is intended for team review and to mirror into Linear SUPER-427.

The document is well-grounded — line references to tunnel.ts and index.ts are accurate against the current codebase, rejected alternatives are recorded with clear reasoning, and the phases are independently shippable with explicit exit criteria. A few design gaps were found that are worth addressing before Phase 1 implementation begins:

  • Redis failure mode unaddressed: The plan says "no new failure mode to monitor" but Upstash becomes a required dependency on the cross-machine hot path. The plan should document the intended behavior when Redis is unavailable (fail-open returning 503 is the safe default).
  • Concurrent sweeper race: The ZSET-based TTL sweeper running independently on every machine can race-delete entries that are still live. Using native Redis key expiry (refreshed on each heartbeat) is simpler and eliminates the concurrency issue entirely.
  • WS upgrade replay verification needs to be a hard gate: Whether fly-replay works for WS upgrades should be confirmed via a small spike before the Phase 1 design is locked in, since the fallback (client-side pre-flight HTTP) requires protocol changes that aren't yet scoped.
  • Phase 2 boot-cleanup races during startup before Phase 1 lands: The boot-time is_online = false sweep is unsafe before the Redis directory is available, because the in-memory map is empty at startup; hosts that would reconnect within seconds get incorrectly flipped offline. The Phase 2 description should note this dependency on Phase 1.

Confidence Score: 4/5

Safe to merge as a planning document; no production code is changed, and the identified gaps should be resolved during Phase 1 implementation rather than blocking this PR.

This is a documentation-only PR with no code changes, so there is no direct production risk. The plan is thorough, well-structured, and accurately references the existing codebase. The issues found are design-level concerns that need to be addressed before or during implementation of Phase 1, not blockers to merging the plan itself. Score is 4 rather than 5 because the Redis-as-SPOF gap and concurrent-sweeper race are meaningful architectural issues that should be acknowledged and resolved in the document before the plan is used as an implementation spec.

apps/relay/plans/20260420-relay-hardening.md — specifically the Redis failure-mode section, the sweeper design, and the Phase 2 dependency note.

Important Files Changed

Filename: apps/relay/plans/20260420-relay-hardening.md
Overview: New architecture planning doc for relay hardening and horizontal scale-out via fly-replay + Redis ownership directory; well-structured and grounded in the existing codebase, with a few design gaps around Redis failure modes, concurrent sweeper races, WS-upgrade replay verification, and a Phase 2 boot-cleanup race that depends on Phase 1 landing first.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Incoming Request to Fly LB] --> B[Routed to Relay Machine N]
    B --> C{Machine N owns tunnel?}
    C -- Yes --> D[Serve request in-process via WS tunnel]
    C -- No --> E[HGET tunnel-owner from Upstash Redis]
    E --> F{Owner found?}
    F -- No --> G[Return 503 Host not connected]
    F -- Yes --> H[Reply with fly-replay header pointing to owner machine]
    H --> I[Fly LB re-routes to owning machine]
    I --> D
    D --> J[Response returned to client]

    subgraph Registration
        R1[Host opens WS on register] --> R2[TunnelManager.register in-process]
        R2 --> R3[HSET tunnel-owner in Redis]
        R3 --> R4[ZADD tunnel-ttl in Redis]
    end

    subgraph Shutdown
        S1[SIGTERM received] --> S2[Stop accepting new tunnels]
        S2 --> S3[Send close frame to all hosts]
        S3 --> S4[DEL tunnel-owner from Redis]
        S4 --> S5[Mark hosts offline in DB]
    end
```
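
As a rough illustration of the Shutdown path above: a SIGTERM drain in a Node relay using the `ws` package might look like this sketch. The close code 4001 and the 10s grace window come from the plan's Phase 4; `tunnels` and everything else here are placeholders.

```typescript
import type { WebSocket } from "ws";

// Local tunnels owned by this machine (hostId -> socket); placeholder shape.
const tunnels = new Map<string, WebSocket>();
let draining = false; // the register handler would refuse new tunnels once set

process.on("SIGTERM", () => {
  draining = true;
  for (const ws of tunnels.values()) {
    // Hosts treat close code 4001 as "reconnect immediately, no backoff".
    ws.close(4001, "server-drain");
  }
  // The Redis cleanup (per-host HDEL on tunnel-owner) and the DB offline
  // writes from the flowchart would happen here as well.
  setTimeout(() => process.exit(0), 10_000); // wait up to 10s, then exit
});
```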

Comment on lines +63 to +65
- **Fly's "Managed Redis" add-on.** It is just Upstash under the hood (reseller relationship; product has since been de-emphasized). Zero latency or feature win over using Upstash directly.
- **Self-hosted Redis on a fly machine behind flycast.** Sub-ms latency via the wireguard network, but we own replication, backups, HA, and deploy sequencing. Not worth it for ~few-hundred ops/sec.
- **Putting Upstash behind fly's private network.** Upstash's REST endpoint is public-internet HTTPS; they don't expose a flycast-routable endpoint. Can't keep it inside the VPN.

P1 Redis becomes a new hot-path single point of failure

The plan states "No new failure mode to monitor" for the Upstash choice, but that's not quite right. Today the relay serves every request entirely in-process — no external dependencies on the hot path. After Phase 1, every cross-machine proxy request requires a successful HGET against Upstash before the fly-replay header can be emitted. If Upstash is unavailable (even for a brief outage or REST endpoint blip), all cross-machine requests degrade to 503.

The plan should explicitly state the intended fallback behavior:

  • Fail-open: if the Redis lookup throws, return 503 (same as today — acceptable degradation).
  • Fail-open with local-only serving: if Redis is unreachable, serve only tunnels owned by the current machine and return 503 for others. This at least preserves partial availability.

Recommending adding a "Failure modes" section or at minimum a note under "### Redis itself" documenting the chosen behavior. This also affects the Phase 5 alerts — a Redis error rate alert should be added alongside the proxy-latency ones.
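
One way the fail-open behavior could be written, sketched under the same assumptions as the plan's Redis client (`lookupOwner` is a hypothetical helper, not existing relay code):

```typescript
import { Redis } from "@upstash/redis";

const redis = new Redis({
  url: process.env.KV_REST_API_URL!,
  token: process.env.KV_REST_API_TOKEN!,
});

// Fail-open lookup: if Upstash is unreachable, return null so the caller can
// still serve locally-owned tunnels and 503 the rest (partial availability).
async function lookupOwner(hostId: string): Promise<string | null> {
  try {
    return await redis.hget<string>("tunnel-owner", hostId);
  } catch (err) {
    // Count this for the suggested Phase 5 Redis error-rate alert.
    console.error("tunnel-owner lookup failed; degrading to local-only", err);
    return null;
  }
}
```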


Comment on lines +58 to +61
### Redis itself

**Reuse the existing Upstash instance** (`KV_REST_API_URL` / `KV_REST_API_TOKEN`, already used by `packages/auth/src/lib/rate-limit.ts` and a dozen api routes). No new infra, no new secrets, no new failure mode to monitor. Write through the `@upstash/redis` REST client we already depend on.


P1 Concurrent sweepers can race-delete live entries

The plan has every machine independently running a sweeper that scans tunnel-ttl ZSET and deletes entries whose expireAt-ms has passed. With N machines, N sweepers are running simultaneously, and there's a window where:

  1. Machine A's sweeper identifies hostId as stale (heartbeat > 2× ping interval).
  2. Machine B's sweeper does the same simultaneously.
  3. Machine A also owns the tunnel — it just hasn't heartbeated yet because it was processing a burst of frames.
  4. Both sweepers delete the Redis entry; the tunnel's ownership row is gone even though it's still alive.

A few approaches to address this:

  • Use a Redis Lua script or WATCH/MULTI to atomically check-and-delete only if the caller's machine owns the entry (i.e., if HGET tunnel-owner hostId == myMachineId then DEL).
  • Elect a single sweeper leader via a short-TTL Redis lock (SET sweeper-lock <machineId> NX PX 25000).
  • Drop the ZSET approach and rely solely on native Redis key expiry: SET tunnel-owner:<hostId> <machineId> PX <2×ping> refreshed on each heartbeat. No sweeper process needed.

The native-TTL approach is the simplest and eliminates the concurrency issue entirely — worth considering over the current ZSET design.
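
A sketch of that native-TTL variant (interval value and key shape are the review's proposal, not settled design):

```typescript
import { Redis } from "@upstash/redis";

const redis = new Redis({
  url: process.env.KV_REST_API_URL!,
  token: process.env.KV_REST_API_TOKEN!,
});

const PING_INTERVAL_MS = 10_000;

// Called on register and refreshed on every heartbeat/pong. If heartbeats
// stop, the key expires on its own: no sweeper process, no cross-machine race.
async function refreshOwnership(hostId: string, machineId: string): Promise<void> {
  await redis.set(`tunnel-owner:${hostId}`, machineId, {
    px: 2 * PING_INTERVAL_MS,
  });
}
```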


Comment on lines +99 to +102
- Periodic reconciliation: every 30s, reconcile DB state against the Redis directory. Drift → DB loses, directory wins.
- Move the is-online write off the hot path: today we fire `device.setHostOnline.mutate(...)` against the API on every register. With thrash (flaky clients reconnecting) this is a lot of writes. Batch or debounce.

**Exit criteria:** Kill a relay machine with `fly machine stop`; within 60s, every host that was on it is flipped to `is_online=false` in the DB and the UI reflects it.

P2 fly-replay on WS upgrades should be a hard go/no-go gate, not a "verify" step

The plan says "Verify fly honors the header on 101 upgrade. If not, we do an HTTP pre-flight first." The pre-flight fallback is a meaningful protocol change — clients would need to add an extra round-trip before every WS open. The fallback also adds complexity: the pre-flight response needs to be interpreted by the client, which doesn't exist today.

This should be verified against Fly's infrastructure before committing to the Phase 1 design (ideally a small spike: two machines, one tunnel, one cross-machine WS upgrade, check the logs). Recommend making this an explicit prerequisite task for Phase 1 rather than an inline note. If fly-replay doesn't work for WS, the pre-flight approach needs its own design phase with client changes included.


Comment on lines +118 to +122
- **Tighter ping cadence.** 10s ping interval, 3 missed = 30s to detect. Tradeoff: ~10 ping frames/minute/host × 1k hosts = 166 msg/s, trivial.
- **Request timeout policy.** `REQUEST_TIMEOUT_MS = 30_000` at `tunnel.ts:33` is too short for some tRPC calls (chat.sendMessage saw 11s already). Raise to 120s, and mark streaming requests (phase 3) as never-timing-out at this layer.
- **Backoff cap lowered.** `RECONNECT_MAX_MS = 30_000` is fine but combined with TCP FIN timing a dead server isn't noticed for up to ~90s. Not a backoff issue; the missed-pong detector above fixes it.

### Phase 5 — observability

P2 Phase 2 boot-time cleanup has a startup race before Phase 1 lands

The plan says:

Boot-time cleanup: on relay start, UPDATE v2_hosts SET is_online = false WHERE is_online = true filtered to rows not present in the local in-memory map

During the initial reconnection window after a restart (before all hosts have re-registered), the in-memory map is empty. Any host that would normally reconnect within the first N seconds will be incorrectly flipped to is_online = false and then immediately set back to true on reconnect — causing a visible flicker in the UI.

The plan itself acknowledges the fix ("once phase 1 lands, filter to rows not in the Redis directory"), but Phase 2 is described as "independently valuable" and could ship before Phase 1. If Phase 2 ships first, the boot-time cleanup should either:

  • Be delayed by the expected max-reconnect window (e.g., run after 60s, not at boot).
  • Be skipped entirely until Phase 1's Redis directory is available.

Worth adding a sequencing note in the Phase 2 description clarifying this dependency on Phase 1 for the boot-cleanup to be safe.
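
The first option might be as simple as deferring the sweep; a sketch, where the 60s window and `sweepOfflineHosts` are placeholders from the review rather than the plan's design:

```typescript
// Hypothetical boot-time cleanup, delayed past the max-reconnect window so
// hosts that reconnect within seconds are not flipped offline and back.
const MAX_RECONNECT_WINDOW_MS = 60_000;

declare function sweepOfflineHosts(): Promise<void>; // the UPDATE v2_hosts sweep

setTimeout(() => {
  void sweepOfflineHosts();
}, MAX_RECONNECT_WINDOW_MS);
```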


@github-actions Bot (Contributor) commented Apr 22, 2026

🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

  • ✅ Neon database branch

Thank you for your contribution! 🎉

@saddlepaddle merged commit 400989f into main on Apr 22, 2026
14 checks passed
@Kitenite deleted the satya/relay-hardening-plan branch on May 6, 2026 at 04:52