fix(relay): read tunnel directory from regional replicas#5019
Conversation
…ites) @upstash/redis defaults readYourWrites to true, which stamps an upstash-sync-token on every request and forces each read to block until the nearest replica has caught up to this client's latest write. Because every relay writes continuously (register/heartbeat/sweep), that token keeps advancing, so directory.lookup never gets a fast local replica read — it pays cross-region replication lag on every cross-region routing decision, defeating the per-region read replicas. The directory is eventually-consistent by design (90s TTL + sweepStale + the maybeReplay self-owner guard), so read-your-writes is not needed. Disable it so lookups serve from the nearest regional replica.
|
Ready to review this PR? Stage has broken it down into 1 individual chapter for you:
Chapters generated by Stage for commit 5a9ccea on Jun 1, 2026 4:42am UTC. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe Redis client initialization in the directory service now explicitly sets ChangesRedis Client Configuration
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR fixes cross-region read latency in the relay by setting
Confidence Score: 4/5Safe to merge — the one-line change is correct and well-scoped, with routing reads now served from the nearest regional replica as intended. The fix is technically sound and the PR description thoroughly documents the consistency trade-offs. The one area worth watching is No files require special attention; the change is confined to the Redis client constructor in
|
| Filename | Overview |
|---|---|
| apps/relay/src/directory.ts | Adds readYourWrites: false to the Upstash Redis client, enabling reads to be served from the nearest regional replica instead of always waiting for the primary; subtly widens the stale-read window for heartbeat's own-write visibility, but this is within the system's documented eventual-consistency design. |
Sequence Diagram
sequenceDiagram
participant EU_Relay as EU Relay (fra)
participant Regional_Replica as Upstash EU Replica
participant Primary as Upstash Primary (US)
Note over EU_Relay,Primary: Before fix (readYourWrites: true)
EU_Relay->>Primary: register/heartbeat writes (eval/hset/zadd)
Primary-->>EU_Relay: upstash-sync-token advanced
EU_Relay->>Regional_Replica: lookup (hget) + sync-token header
Regional_Replica->>Primary: wait for replication catch-up
Primary-->>Regional_Replica: replicated
Regional_Replica-->>EU_Relay: result (cross-region lag paid)
Note over EU_Relay,Primary: After fix (readYourWrites: false)
EU_Relay->>Primary: register/heartbeat writes (eval/hset/zadd)
Note right of Primary: no sync-token issued
EU_Relay->>Regional_Replica: lookup (hget) — no sync-token
Regional_Replica-->>EU_Relay: result served immediately (local replica)
Comments Outside Diff (1)
-
apps/relay/src/directory.ts, line 99-114 (link)heartbeatmay now skip TTL extension when it immediately followsregisterWith
readYourWrites: false, the client's ownregisterwrite is no longer guaranteed to be visible to the subsequenthexistscall inheartbeat. If a pong arrives very quickly after connect and the EU replica hasn't replicated theOWNER_KEYentry yet,hexistsreturns false and the entire heartbeat — including theZADD TTL_KEYextension — is silently skipped. The tunnel stays alive on its initial 90 s TTL, so missing one or two early heartbeats is safe in practice given typical Upstash replica lag (<1 ms within a region). This is an expected trade-off of the eventual-consistency design, but it's worth keeping in mind if theregisteredAtfield inTunnelMetais ever used for age-gating logic, since a stale replica returningnullforMETA_KEYcauses heartbeat to overwrite it withnowrather than preserving the original value.Prompt To Fix With AI
This is a comment left during a code review. Path: apps/relay/src/directory.ts Line: 99-114 Comment: **`heartbeat` may now skip TTL extension when it immediately follows `register`** With `readYourWrites: false`, the client's own `register` write is no longer guaranteed to be visible to the subsequent `hexists` call in `heartbeat`. If a pong arrives very quickly after connect and the EU replica hasn't replicated the `OWNER_KEY` entry yet, `hexists` returns false and the entire heartbeat — including the `ZADD TTL_KEY` extension — is silently skipped. The tunnel stays alive on its initial 90 s TTL, so missing one or two early heartbeats is safe in practice given typical Upstash replica lag (<1 ms within a region). This is an expected trade-off of the eventual-consistency design, but it's worth keeping in mind if the `registeredAt` field in `TunnelMeta` is ever used for age-gating logic, since a stale replica returning `null` for `META_KEY` causes heartbeat to overwrite it with `now` rather than preserving the original value. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
apps/relay/src/directory.ts:99-114
**`heartbeat` may now skip TTL extension when it immediately follows `register`**
With `readYourWrites: false`, the client's own `register` write is no longer guaranteed to be visible to the subsequent `hexists` call in `heartbeat`. If a pong arrives very quickly after connect and the EU replica hasn't replicated the `OWNER_KEY` entry yet, `hexists` returns false and the entire heartbeat — including the `ZADD TTL_KEY` extension — is silently skipped. The tunnel stays alive on its initial 90 s TTL, so missing one or two early heartbeats is safe in practice given typical Upstash replica lag (<1 ms within a region). This is an expected trade-off of the eventual-consistency design, but it's worth keeping in mind if the `registeredAt` field in `TunnelMeta` is ever used for age-gating logic, since a stale replica returning `null` for `META_KEY` causes heartbeat to overwrite it with `now` rather than preserving the original value.
Reviews (1): Last reviewed commit: "fix(relay): read directory from regional..." | Re-trigger Greptile
🧹 Preview Cleanup CompleteThe following preview resources have been cleaned up:
Thank you for your contribution! 🎉 |
Problem
Users report poor latency in EU (and other non-primary regions). The relay is already multi-region on Fly (sjc/iad/fra/nrt/sin/gru, 1 machine each) and the Upstash KV directory has per-region read replicas — but the relay was not actually reading from them.
Root cause
apps/relay/src/directory.tsconstructed the Upstash client asnew Redis({ url, token }). In@upstash/redis@1.37.0,readYourWritesdefaults totrue(this.readYourWrites = config.readYourWrites ?? true). With it on, the client stamps anupstash-sync-tokenon every request and each read blocks until the nearest replica has replicated up to this client's latest write.Every relay writes continuously —
registeron connect,heartbeaton every pong (30s/tunnel),sweepStaleevery 30s — so the sync token is always advancing. Net effect:directory.lookup(the hot read inmaybeReplay, used on every cross-region routing decision) never gets a fast local replica read; it pays cross-region replication lag, defeating the per-region read replicas.Fix
Set
readYourWrites: false. The directory is eventually-consistent by design (90sTTL_GRACE_MS+sweepStale+ the self-owner guard inmaybeReplay), so read-your-writes consistency buys nothing here. Lookups now serve from the nearest regional replica. One-line change; writes (Luaeval) still go to the primary as before.Validation
bun run typecheck --filter=@superset/relay✅biome check apps/relay/src/directory.ts✅agent_session_launchlatency (PostHog, 30d) shows EU p50 ~330ms vs NA ~297ms — moderately elevated. This addresses one clear gap in the cross-region control path; it is not expected to be a silver bullet on its own (that metric is partly host-processing-dominated).Out of scope (flagged for follow-up)
evalis never replica-routed). Worth confirming the primary region is sensibly located vs. the traffic centroid.checkHostAccess/ JWKS /setOnlineviaNEXT_PUBLIC_API_URL) — cached, but cold-cache first interactions for EU pay a trans-region RTT.--max-per-region 1) — no EU headroom; saturation/deploy spills EU traffic toiad.RELAY_SYNTHETIC_JWTisn't set onsuperset-relay, sorelay_synthetic_check/latency_msis never emitted. Setting it would give per-region latency series to measure changes like this one.Summary by cubic
Read the tunnel directory from regional replicas by disabling read-your-writes on the
@upstash/redisclient. This avoids sync-token blocking and reduces cross‑region latency, especially in EU and other non-primary regions.Written for commit 5a9ccea. Summary will update on new commits.
Summary by CodeRabbit