fix(relay): read tunnel directory from regional replicas#5019

Merged: saddlepaddle merged 1 commit into main from relay-latency-audit on Jun 1, 2026
@saddlepaddle (Collaborator) commented Jun 1, 2026

Problem

Users report poor latency in EU (and other non-primary regions). The relay is already multi-region on Fly (sjc/iad/fra/nrt/sin/gru, 1 machine each) and the Upstash KV directory has per-region read replicas — but the relay was not actually reading from them.

Root cause

apps/relay/src/directory.ts constructed the Upstash client as new Redis({ url, token }). In @upstash/redis@1.37.0, readYourWrites defaults to true (this.readYourWrites = config.readYourWrites ?? true). With it on, the client stamps an upstash-sync-token on every request and each read blocks until the nearest replica has replicated up to this client's latest write.

Every relay writes continuously — register on connect, heartbeat on every pong (30s/tunnel), sweepStale every 30s — so the sync token is always advancing. Net effect: directory.lookup (the hot read in maybeReplay, used on every cross-region routing decision) never gets a fast local replica read; it pays cross-region replication lag, defeating the per-region read replicas.
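To make the blocking behaviour concrete, here is a toy model of a replica read with and without a sync token. It is only a sketch: `Replica`, `readWaitMs`, and the numbers are hypothetical and do not reflect how Upstash implements replication internally.

```typescript
// Toy model only: not Upstash's implementation. All names are hypothetical.
interface Replica {
  /** Highest write offset this replica has applied so far. */
  appliedOffset: number;
  /** Milliseconds the replica needs to apply one more write from the primary. */
  msPerWrite: number;
}

/**
 * Extra wait a read pays before the replica may answer.
 * With a sync token, the replica must first catch up to `syncToken`;
 * without one, it answers from whatever it has already applied.
 */
function readWaitMs(replica: Replica, syncToken: number | undefined): number {
  if (syncToken === undefined) return 0; // models readYourWrites: false
  const behind = Math.max(0, syncToken - replica.appliedOffset);
  return behind * replica.msPerWrite;
}

const fra: Replica = { appliedOffset: 100, msPerWrite: 80 };
const latestWrite = 101; // the relay just sent a heartbeat, so its token moved ahead

const withToken = readWaitMs(fra, latestWrite); // 80: blocked on replication
const withoutToken = readWaitMs(fra, undefined); // 0: served locally, possibly stale
```

Because heartbeats keep `latestWrite` ahead of the replica, the `withToken` case is the common one in this model, which matches the behaviour described above.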

Fix

Set readYourWrites: false. The directory is eventually-consistent by design (90s TTL_GRACE_MS + sweepStale + the self-owner guard in maybeReplay), so read-your-writes consistency buys nothing here. Lookups now serve from the nearest regional replica. One-line change; writes (Lua eval) still go to the primary as before.
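A minimal sketch of the before/after configuration follows. `DirectoryRedisConfig` is a local stand-in for the subset of client options discussed here, and `directoryRedisConfig` and the placeholder URL/token are hypothetical; in the relay the object would be passed to `new Redis(...)` from `@upstash/redis`.

```typescript
// Local stand-in for the relevant subset of the client's config (assumption);
// in the relay this object is passed to `new Redis(...)` from @upstash/redis.
interface DirectoryRedisConfig {
  url: string;
  token: string;
  readYourWrites?: boolean;
}

// Mirrors the default quoted in the root cause: config.readYourWrites ?? true
function effectiveReadYourWrites(config: DirectoryRedisConfig): boolean {
  return config.readYourWrites ?? true;
}

// Hypothetical helper: builds the config with the explicit opt-out.
function directoryRedisConfig(url: string, token: string): DirectoryRedisConfig {
  return { url, token, readYourWrites: false };
}

// Placeholder values stand in for the real env-provided URL and token.
const beforeFix: DirectoryRedisConfig = { url: "https://example.invalid", token: "placeholder" };
const afterFix = directoryRedisConfig("https://example.invalid", "placeholder");

const beforeFlag = effectiveReadYourWrites(beforeFix); // true (implicit default)
const afterFlag = effectiveReadYourWrites(afterFix); // false (explicit opt-out)
```

The point of the sketch is that omitting the option is not neutral: an unset `readYourWrites` resolves to `true`, so the opt-out has to be explicit.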

Validation

  • bun run typecheck --filter=@superset/relay
  • biome check apps/relay/src/directory.ts
  • Real-world agent_session_launch latency (PostHog, 30d) shows EU p50 ~330ms vs NA ~297ms — moderately elevated. This addresses one clear gap in the cross-region control path; it is not expected to be a silver bullet on its own (that metric is partly host-processing-dominated).

Out of scope (flagged for follow-up)

  • All directory writes go to the single Upstash primary (Lua eval is never replica-routed). Worth confirming the primary region is sensibly located vs. the traffic centroid.
  • Control-plane API is single-region (checkHostAccess / JWKS / setOnline via NEXT_PUBLIC_API_URL) — cached, but cold-cache first interactions for EU pay a trans-region RTT.
  • 1 machine per region (--max-per-region 1) — no EU headroom; saturation/deploy spills EU traffic to iad.
  • Synthetic latency telemetry is disabled in prod: RELAY_SYNTHETIC_JWT isn't set on superset-relay, so relay_synthetic_check / latency_ms is never emitted. Setting it would give per-region latency series to measure changes like this one.

Summary by cubic

Read the tunnel directory from regional replicas by disabling read-your-writes on the @upstash/redis client. This avoids sync-token blocking and reduces cross‑region latency, especially in EU and other non-primary regions.

Written for commit 5a9ccea. Summary will update on new commits.

Summary by CodeRabbit

  • Chores
    • Updated internal system configuration settings.

…ites)

@upstash/redis defaults readYourWrites to true, which stamps an
upstash-sync-token on every request and forces each read to block until
the nearest replica has caught up to this client's latest write. Because
every relay writes continuously (register/heartbeat/sweep), that token
keeps advancing, so directory.lookup never gets a fast local replica
read — it pays cross-region replication lag on every cross-region
routing decision, defeating the per-region read replicas.

The directory is eventually-consistent by design (90s TTL + sweepStale +
the maybeReplay self-owner guard), so read-your-writes is not needed.
Disable it so lookups serve from the nearest regional replica.

stage-review Bot commented Jun 1, 2026

Ready to review this PR? Stage has broken it down into 1 individual chapter for you:

Title
1 Enable regional replica reads for tunnel directory

Chapters generated by Stage for commit 5a9ccea on Jun 1, 2026 4:42am UTC.

coderabbitai Bot (Contributor) commented Jun 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e6e30807-e256-45f8-8e02-d4968b2f3530

📥 Commits

Reviewing files that changed from the base of the PR and between 71062f5 and 5a9ccea.

📒 Files selected for processing (1)
  • apps/relay/src/directory.ts

📝 Walkthrough

The Redis client initialization in the directory service now explicitly sets readYourWrites: false to control read consistency behavior. This single configuration parameter change modifies how the Redis client handles eventual consistency in the relay application.

Changes

Redis Client Configuration

Layer / File(s) | Summary
Redis readYourWrites Configuration (apps/relay/src/directory.ts) | Redis client initialization explicitly sets readYourWrites: false to manage read consistency semantics in the directory service.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

  • superset-sh/superset#4594: Also modifies apps/relay/src/directory.ts; the Redis client configuration change may affect consistency guarantees for Redis operations used by that PR's cleanup logic.

Poem

🐰 A whisper to Redis, so swift and so true,
"Stop reading what's old, let consistency brew,"
One flag, one line, a rabbit's delight,
The writes find their way to eventually right.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly summarizes the main change: enabling regional replica reads in the directory lookup by fixing the Redis client configuration.
Description check | ✅ Passed | The PR description is comprehensive and follows the template structure with Problem, Root cause, Fix, Validation sections, and includes implementation details and out-of-scope considerations.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

cubic-dev-ai Bot (Contributor) left a comment

No issues found across 1 file

greptile-apps Bot (Contributor) commented Jun 1, 2026

Greptile Summary

This PR fixes cross-region read latency in the relay by setting readYourWrites: false on the Upstash Redis client, allowing lookup (the hot path in every routing decision) to be served from the nearest regional replica instead of stalling behind the sync-token replication handshake triggered by continuous heartbeat/register writes.

  • One-line config change in apps/relay/src/directory.ts disables the upstash-sync-token header on reads; all writes (Lua eval, hset, zadd) continue to target the Upstash primary unchanged.
  • Eventual-consistency tradeoff is intentional: the existing 90 s TTL_GRACE_MS, sweepStale, and the self-owner guard in maybeReplay already make the directory safe to read with replica lag; the fix correctly identifies that read-your-writes consistency buys nothing here.
  • Subtle heartbeat behavior change: reads in heartbeat (hexists / hget) may now momentarily see stale state for this client's own writes, which can cause an early heartbeat to skip its TTL extension or overwrite registeredAt. Given Upstash intra-region replica lag and the 90 s grace window this is practically harmless, but worth noting if registeredAt gains business significance later.
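The safety argument in the bullets above can be sketched as a routing decision that tolerates a lagging replica. This is illustrative only: `DirectoryEntry`, `decide`, and the decision labels are hypothetical, and the real logic in `maybeReplay` may differ; only the 90 s `TTL_GRACE_MS` value and the existence of a self-owner guard come from the PR description.

```typescript
// Illustrative only; the real logic lives in apps/relay/src/directory.ts and maybeReplay.
const TTL_GRACE_MS = 90 * 1000; // 90 s grace window, per the PR description

// Hypothetical shape of what a directory lookup returns.
interface DirectoryEntry {
  ownerRegion: string; // e.g. "fra"
  lastHeartbeatAt: number; // epoch ms of the last heartbeat the replica has seen
}

type RoutingDecision = "handle-here" | "replay-to-owner";

function decide(entry: DirectoryEntry | null, selfRegion: string, now: number): RoutingDecision {
  if (entry === null) return "handle-here"; // replica has not seen the tunnel (yet)
  if (now - entry.lastHeartbeatAt > TTL_GRACE_MS) return "handle-here"; // stale entry: ignore it
  if (entry.ownerRegion === selfRegion) return "handle-here"; // self-owner guard: never replay to ourselves
  return "replay-to-owner";
}

const fresh: DirectoryEntry = { ownerRegion: "iad", lastHeartbeatAt: 0 };
const crossRegion = decide(fresh, "fra", 30 * 1000); // "replay-to-owner"
const selfOwned = decide({ ownerRegion: "fra", lastHeartbeatAt: 0 }, "fra", 30 * 1000); // "handle-here"
const expired = decide(fresh, "fra", 91 * 1000); // "handle-here"
```

In this sketch a slightly stale read can only produce a conservative decision, which is the property the PR relies on when it drops read-your-writes.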

Confidence Score: 4/5

Safe to merge — the one-line change is correct and well-scoped, with routing reads now served from the nearest regional replica as intended.

The fix is technically sound and the PR description thoroughly documents the consistency trade-offs. The one area worth watching is heartbeat: with the client no longer stamping a sync-token, the hexists guard and the hget that preserves registeredAt may transiently see stale replica state for writes this same client just made. In practice, Upstash intra-region replica lag is sub-millisecond, the system has a 90 s TTL grace window, and the worst-case outcome is a slightly inaccurate registeredAt timestamp — none of which affects routing correctness. No routing logic or TTL invariant is broken by the change.

No files require special attention; the change is confined to the Redis client constructor in apps/relay/src/directory.ts.

Important Files Changed

Filename | Overview
apps/relay/src/directory.ts | Adds readYourWrites: false to the Upstash Redis client, enabling reads to be served from the nearest regional replica instead of always waiting for the primary; subtly widens the stale-read window for heartbeat's own-write visibility, but this is within the system's documented eventual-consistency design.

Sequence Diagram

sequenceDiagram
    participant EU_Relay as EU Relay (fra)
    participant Regional_Replica as Upstash EU Replica
    participant Primary as Upstash Primary (US)

    Note over EU_Relay,Primary: Before fix (readYourWrites: true)
    EU_Relay->>Primary: register/heartbeat writes (eval/hset/zadd)
    Primary-->>EU_Relay: upstash-sync-token advanced
    EU_Relay->>Regional_Replica: lookup (hget) + sync-token header
    Regional_Replica->>Primary: wait for replication catch-up
    Primary-->>Regional_Replica: replicated
    Regional_Replica-->>EU_Relay: result (cross-region lag paid)

    Note over EU_Relay,Primary: After fix (readYourWrites: false)
    EU_Relay->>Primary: register/heartbeat writes (eval/hset/zadd)
    Note right of Primary: no sync-token issued
    EU_Relay->>Regional_Replica: lookup (hget) — no sync-token
    Regional_Replica-->>EU_Relay: result served immediately (local replica)

Comments Outside Diff (1)

  1. apps/relay/src/directory.ts, lines 99-114

    P2 heartbeat may now skip TTL extension when it immediately follows register

    With readYourWrites: false, the client's own register write is no longer guaranteed to be visible to the subsequent hexists call in heartbeat. If a pong arrives very quickly after connect and the EU replica hasn't replicated the OWNER_KEY entry yet, hexists returns false and the entire heartbeat — including the ZADD TTL_KEY extension — is silently skipped. The tunnel stays alive on its initial 90 s TTL, so missing one or two early heartbeats is safe in practice given typical Upstash replica lag (<1 ms within a region). This is an expected trade-off of the eventual-consistency design, but it's worth keeping in mind if the registeredAt field in TunnelMeta is ever used for age-gating logic, since a stale replica returning null for META_KEY causes heartbeat to overwrite it with now rather than preserving the original value.
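The skipped-heartbeat scenario described in this comment can be reproduced with an in-memory stand-in for a primary and one lagging replica. This is a sketch under assumptions: `Store`, `register`, `heartbeat`, and `replicate` are hypothetical simplifications, not the code at lines 99-114.

```typescript
// In-memory stand-in for a primary plus one lagging replica; not the real directory code.
interface Store {
  owners: Map<string, string>; // models OWNER_KEY: tunnelId -> region
  ttl: Map<string, number>; // models TTL_KEY: tunnelId -> expiry (epoch ms)
}

const TTL_GRACE_MS = 90 * 1000;

function register(primary: Store, tunnelId: string, region: string, now: number): void {
  primary.owners.set(tunnelId, region);
  primary.ttl.set(tunnelId, now + TTL_GRACE_MS);
}

// Reads hit the replica, writes hit the primary. Returns false when the heartbeat was skipped.
function heartbeat(primary: Store, replica: Store, tunnelId: string, now: number): boolean {
  if (!replica.owners.has(tunnelId)) return false; // models hexists returning false on a stale replica
  primary.ttl.set(tunnelId, now + TTL_GRACE_MS); // models the ZADD TTL extension
  return true;
}

// Copies the primary's state to the replica, i.e. replication catching up.
function replicate(primary: Store, replica: Store): void {
  replica.owners = new Map(primary.owners);
  replica.ttl = new Map(primary.ttl);
}

const primary: Store = { owners: new Map(), ttl: new Map() };
const replica: Store = { owners: new Map(), ttl: new Map() };

register(primary, "tunnel-1", "fra", 0);
const early = heartbeat(primary, replica, "tunnel-1", 10); // false: replica has not caught up
const expiryAfterSkip = primary.ttl.get("tunnel-1"); // 90000: still the initial TTL
replicate(primary, replica);
const later = heartbeat(primary, replica, "tunnel-1", 30 * 1000); // true: TTL extended
const expiry = primary.ttl.get("tunnel-1"); // 120000
```

In this model the skipped heartbeat leaves the initial 90 s expiry untouched, and the next heartbeat after replication extends it as usual.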


Reviews (1): Last reviewed commit: "fix(relay): read directory from regional..."

github-actions Bot (Contributor) commented Jun 1, 2026

🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

  • ✅ Neon database branch

Thank you for your contribution! 🎉

@saddlepaddle saddlepaddle merged commit 9bf4052 into main Jun 1, 2026
16 of 17 checks passed