
fix(relay): cache host.checkAccess by userId, not token #4491

Merged
saddlepaddle merged 1 commit into main from saddlepaddle/relay-checkaccess-cache
May 13, 2026

Conversation


@saddlepaddle saddlepaddle commented May 13, 2026

Summary

host.checkAccess was accounting for ~29% of /api log volume on Vercel. The relay caches results in an LRU, but the cache key was `${token}:${hostId}`, and JWTs rotate every hour or so. Every refresh invalidated the cache entry even though the underlying user→host authorization hadn't changed, so active desktops were re-hitting the API roughly every JWT lifetime regardless of the 5-minute LRU TTL.
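
A minimal sketch of the keying problem described above, assuming a plain Map in place of the relay's LRU. The `Auth` shape, the sample tokens, and the helper names `tokenKey` and `userKey` are made up for illustration and are not taken from the relay code.

```typescript
// Illustrative only: a plain Map stands in for the relay's LRU.
type Auth = { sub: string; token: string };

// Old scheme: the key changes whenever the token changes.
const tokenKey = (a: Auth, hostId: string): string => `${a.token}:${hostId}`;
// New scheme: the key depends only on the user and the host.
const userKey = (a: Auth, hostId: string): string => `${a.sub}:${hostId}`;

// Same user before and after a JWT refresh: `sub` is unchanged, the token is not.
const beforeRefresh: Auth = { sub: "user_1", token: "jwt-aaa" };
const afterRefresh: Auth = { sub: "user_1", token: "jwt-bbb" };
const hostId = "host_1";

const byToken = new Map<string, boolean>();
const byUser = new Map<string, boolean>();
byToken.set(tokenKey(beforeRefresh, hostId), true);
byUser.set(userKey(beforeRefresh, hostId), true);

// After the refresh, the token-keyed lookup misses and the user-keyed one hits.
const tokenKeyedHit = byToken.has(tokenKey(afterRefresh, hostId));
const userKeyedHit = byUser.has(userKey(afterRefresh, hostId));
console.log({ tokenKeyedHit, userKeyedHit });
```

With a token-keyed cache, the miss after each refresh turns into a fresh API call per user and host, which is the traffic this PR targets.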

Changes

  • Cache key: (userId, hostId) via verified auth.sub. Stable across token refreshes.
  • Allowed TTL: 5m → 15m. Org/host membership changes maybe a few times a day per user; a 15m worst-case stale-allow window is fine.
  • Negative cache: 30s for denied. Prevents a misconfigured client with a bad token from hammering the API.
  • Local short-circuit: if the host's organizationId isn't in auth.organizationIds, return false without an API call. The API does this exact check from the JWT before its DB query, so the round trip was wasted.
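
The two TTLs above can be sketched with a small expiry map. This is an illustration under assumptions, not the relay's implementation: `TtlCache` is a hypothetical stand-in for the LRU actually used, and the clock is injected so the expiry behaviour can be followed without waiting.

```typescript
// Hypothetical stand-in for the relay's LRU: stores only an expiry time per key.
class TtlCache {
  private readonly expiries = new Map<string, number>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string): void {
    this.expiries.set(key, this.now() + this.ttlMs);
  }

  has(key: string): boolean {
    const expiresAt = this.expiries.get(key);
    if (expiresAt === undefined) return false;
    if (this.now() >= expiresAt) {
      this.expiries.delete(key); // lazily evict expired entries
      return false;
    }
    return true;
  }
}

const ALLOWED_TTL_MS = 15 * 60 * 1000; // 15m, from the PR description
const DENIED_TTL_MS = 30 * 1000; // 30s, from the PR description

let clock = 0; // fake clock, in milliseconds
const allowedCache = new TtlCache(ALLOWED_TTL_MS, () => clock);
const deniedCache = new TtlCache(DENIED_TTL_MS, () => clock);

allowedCache.set("user_1:host_1");
deniedCache.set("user_2:host_1");

clock = 31 * 1000; // 31 seconds later
const allowedAt31s = allowedCache.has("user_1:host_1"); // still cached
const deniedAt31s = deniedCache.has("user_2:host_1"); // expired, next request re-checks

clock = 15 * 60 * 1000; // 15 minutes after the writes
const allowedAt15m = allowedCache.has("user_1:host_1"); // expired, next request re-checks
console.log({ allowedAt31s, deniedAt31s, allowedAt15m });
```

The asymmetry is the point: a stale allow can last up to 15 minutes after a revocation, while a stale deny clears within 30 seconds of access being granted.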

Expected impact

~80% drop in host.checkAccess calls = ~23% of total /api log volume.
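
As a quick consistency check against the ~29% share quoted in the Summary (both inputs are the PR's own estimates, not measurements):

```latex
\[
\underbrace{0.29}_{\text{share of /api log volume}} \times \underbrace{0.80}_{\text{expected drop}} = 0.232 \approx 23\%
\]
```

On these estimates, host.checkAccess would be left at roughly 6% of the original /api log volume.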

Follow-up to #4490 (deleted device.heartbeat, ~10% of volume).

Test plan

  • CI green
  • Watch Vercel /api host.checkAccess request count drop within an hour of staging deploy
  • Spot-check that revoking a user from an org still bounces them within 15 minutes (the new allowed-TTL)
  • Confirm no spike in 403 Forbidden after deploy (would indicate the negative cache is being too aggressive)

Summary by cubic

Cache host.checkAccess by userId instead of token to stop cache busting on JWT refresh. Adds a local org check and tuned TTLs to cut API calls by ~80% (~23% of /api volume).

  • Refactors
    • Cache key: auth.sub + hostId (stable across token rotation).
    • TTLs: allowed 15m; denied 30s.
    • Local short-circuit: return false if host org not in auth.organizationIds (no API call).
    • Signature change: checkHostAccess(auth, token, hostId); call sites updated.

Written for commit 2d5c787. Summary will update on new commits.

Summary by CodeRabbit

  • Refactor
    • Improved access authorization efficiency with enhanced caching strategy for host access validation.
    • Optimized authorization flow to reduce repeated API checks and provide faster access decisions.


The relay caches host.checkAccess results in an LRU to avoid hitting
the API on every tunneled request. The cache was keyed by
`${token}:${hostId}`, which meant every JWT refresh invalidated the
entry — even though the underlying user→host authorization hadn't
changed. With active desktops cycling tokens roughly hourly, the
endpoint accounted for ~29% of /api log volume.

Changes:
- Key by (userId, hostId) — userId comes from verified JWT.sub and is
  stable across token refreshes.
- Bump allowed TTL from 5m to 15m. Org/host membership changes a few
  times a day per user, not every five minutes; worst-case stale-allow
  window of 15m is acceptable.
- Add 30s negative cache for denied results so a misconfigured client
  with a bad token can't hammer the API.
- Short-circuit locally when the host's organizationId isn't in
  auth.organizationIds — the API does the same check before the DB
  query, so the round trip is wasted.

Together these should drop checkAccess volume ~80% on the API side.


coderabbitai Bot commented May 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5860e85b-62f9-4823-b744-c78a1f211c86

📥 Commits

Reviewing files that changed from the base of the PR and between d1aee09 and 2d5c787.

📒 Files selected for processing (2)
  • apps/relay/src/access.ts
  • apps/relay/src/index.ts

📝 Walkthrough


The relay's host access authorization is enhanced to accept authenticated user context, enabling local organization-based access checks and separate result caching. The checkHostAccess function now derives its cache key from the authenticated user and host, with distinct TTLs for allowed (15m) and denied (30s) results, while callers in both HTTP middleware and WebSocket handlers provide the auth context.

Changes

Authorization Cache and Access Control

| Layer / File(s) | Summary |
|---|---|
| Cache infrastructure and authorization check (apps/relay/src/access.ts) | checkHostAccess accepts AuthContext to extract user identity and organization memberships, imports routing-key parsing, and establishes two separate caches (allowedCache, deniedCache) keyed by auth.sub:hostId with distinct TTLs. The function performs local organization verification and short-circuits to false if the host's organization is not in auth.organizationIds, then caches the API result into the appropriate cache based on success or failure. |
| Middleware and WebSocket caller updates (apps/relay/src/index.ts) | Both authMiddleware and the /tunnel WebSocket onOpen handler invoke checkHostAccess(auth, token, hostId) with the verified authentication context alongside token and host ID, preserving existing denial responses. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🐰 Cache keys now remember the user, not just the token's fate,
Org checks happen quick, before we call and wait,
Denied results get thirty seconds to breathe,
While allowed ones rest for fifteen, I believe—
Two caches, one flow, authorization's more neat!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: caching host.checkAccess by userId instead of token, which directly addresses the core issue of cache invalidation on JWT refresh. |
| Description check | ✅ Passed | The description includes a clear summary of the problem, detailed explanation of changes, expected impact metrics, and a comprehensive test plan with specific metrics to validate the fix. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.




greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR fixes a caching inefficiency in the relay's checkHostAccess path: the old LRU key included the JWT token, which meant every token rotation (hourly) invalidated the cache entry and forced a live API call. The new key is (userId, hostId) derived from the verified auth.sub, which is stable across rotations.

  • Cache key fix (access.ts): key is now auth.sub + hostId, TTL extended to 15 min for allowed results; a new deniedCache (30s TTL, 10k entries) prevents hammering on denials.
  • Local short-circuit (access.ts): org membership is now validated directly from JWT claims before touching either cache or the API, eliminating a network round-trip that the API itself would have made anyway.
  • Call-site update (index.ts): both the HTTP authMiddleware and the WebSocket onOpen handler pass the AuthContext through to checkHostAccess.

Confidence Score: 4/5

Safe to merge — the cache-key change is correct and stable, the short-circuit is a faithful reimplementation of the server-side org check, and both call sites are updated consistently.

The core logic change is well-reasoned and the trade-offs (15m stale-allow, 30s stale-deny) are explicitly documented. Two minor observations: API exceptions bypass the negative cache, so a consistently failing API call path will still hit the API on every attempt; and a stale deniedCache entry becomes unreachable (harmless but occupies a slot) after a user is removed from an org. Neither affects correctness in the normal path.

apps/relay/src/access.ts — the caching logic is the heart of this change; the error-path and short-circuit ordering warrant a second read.

Important Files Changed

| Filename | Overview |
|---|---|
| apps/relay/src/access.ts | Replaces token-based cache key with userId (auth.sub), adds a 30s negative cache for denied results, extends allowed TTL to 15m, and introduces a local org-membership short-circuit. Logic is sound; one edge case where API errors bypass the negative cache protection. |
| apps/relay/src/index.ts | Two call sites updated to pass the new auth parameter to checkHostAccess. Both HTTP middleware and WebSocket tunnel paths are correctly updated with no logic changes beyond the parameter addition. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[checkHostAccess called] --> B{parseHostRoutingKey}
    B -- invalid --> C[return false]
    B -- valid --> D{auth.organizationIds\nincludes parsed.organizationId?}
    D -- No --> C
    D -- Yes --> E{allowedCache.has\nuserId:hostId?}
    E -- Hit --> F[return true]
    E -- Miss --> G{deniedCache.has\nuserId:hostId?}
    G -- Hit --> C
    G -- Miss --> H[API: host.checkAccess.query]
    H -- throws --> I[return false\nnot cached]
    H -- ok=true --> J[allowedCache.set\nTTL 15m\nreturn true]
    H -- ok=false --> K[deniedCache.set\nTTL 30s\nreturn false]
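
The flowchart above can be read as the following sketch. It is an approximation under stated assumptions, not the code in apps/relay/src/access.ts: the `AuthContext` shape is inferred from the fields mentioned in this PR, `parseHostRoutingKey` is given a made-up "<organizationId>/<name>" format, `callApi` stands in for client.host.checkAccess.query, and simple expiry maps replace the bounded LRU caches.

```typescript
// Hypothetical shapes; the real types live in the relay and are not shown here.
type AuthContext = { sub: string; organizationIds: string[] };
type CheckAccessApi = (token: string, hostId: string) => Promise<boolean>;

const ALLOWED_TTL_MS = 15 * 60 * 1000; // 15m
const DENIED_TTL_MS = 30 * 1000; // 30s

// Assumed routing-key format "<organizationId>/<name>", for illustration only.
function parseHostRoutingKey(hostId: string): { organizationId: string } | null {
  const [organizationId, name] = hostId.split("/");
  return organizationId && name ? { organizationId } : null;
}

function createCheckHostAccess(
  callApi: CheckAccessApi,
  now: () => number = () => Date.now(),
) {
  const allowedCache = new Map<string, number>(); // key -> expiry timestamp (ms)
  const deniedCache = new Map<string, number>();
  const fresh = (cache: Map<string, number>, key: string): boolean => {
    const expiresAt = cache.get(key);
    return expiresAt !== undefined && now() < expiresAt;
  };

  return async function checkHostAccess(
    auth: AuthContext,
    token: string,
    hostId: string,
  ): Promise<boolean> {
    // Local short-circuit: no cache lookup and no API call.
    const parsed = parseHostRoutingKey(hostId);
    if (!parsed) return false;
    if (!auth.organizationIds.includes(parsed.organizationId)) return false;

    // The key is stable across token refreshes because it ignores `token`.
    const key = `${auth.sub}:${hostId}`;
    if (fresh(allowedCache, key)) return true;
    if (fresh(deniedCache, key)) return false;

    try {
      const ok = await callApi(token, hostId);
      if (ok) allowedCache.set(key, now() + ALLOWED_TTL_MS);
      else deniedCache.set(key, now() + DENIED_TTL_MS);
      return ok;
    } catch {
      return false; // thrown errors are not cached, matching the flowchart
    }
  };
}
```

The token is still passed through because the API call needs it; it simply no longer takes part in the cache key.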

Comments Outside Diff (1)

  1. apps/relay/src/access.ts, lines 47-49

    P2 Exception path bypasses the negative cache

    When client.host.checkAccess.query throws (e.g., network error, unexpected 5xx), the function returns false without populating deniedCache. A client that consistently triggers this path — say, the API router is broken for a specific hostId — will make a live API call on every request with no rate-limiting via the cache. The PR description frames the negative cache as protection against hammering, but that protection only kicks in for HTTP-level denials, not for thrown exceptions. For the "bad token" scenario specifically this is fine (JWT failure happens before reaching here), but it is worth noting for infrastructure-failure cases.
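
One possible mitigation, sketched here only to make the comment concrete; it is not part of this PR. The helper `withFailureBackoff` and the 5-second `ERROR_TTL_MS` are hypothetical, and `query` stands in for the call to client.host.checkAccess.query.

```typescript
// Hypothetical helper, not part of the PR: remember recent thrown failures per key
// so that repeated requests during an outage do not each reach the API.
type AccessQuery = (key: string) => Promise<boolean>;

const ERROR_TTL_MS = 5000; // assumed value, chosen only for illustration

function withFailureBackoff(
  query: AccessQuery,
  now: () => number = () => Date.now(),
): AccessQuery {
  const failedUntil = new Map<string, number>(); // key -> time until which we fail fast

  return async (key: string): Promise<boolean> => {
    const until = failedUntil.get(key);
    if (until !== undefined && now() < until) return false; // fail closed, skip the API

    try {
      const ok = await query(key);
      failedUntil.delete(key); // a completed call clears the backoff
      return ok;
    } catch {
      failedUntil.set(key, now() + ERROR_TTL_MS);
      return false; // still fail closed, as the reviewed code does
    }
  };
}
```

The trade-off is that a single transient error would deny an otherwise authorized user for up to `ERROR_TTL_MS`, so any such window would need to stay short.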


Reviews (1): Last reviewed commit: "fix(relay): cache host.checkAccess by us..."

Comment thread apps/relay/src/access.ts
Comment on lines +27 to +35
// Short-circuit "not in org" locally: the API does this same check from
// the JWT before hitting the DB, so the round trip is wasted.
const parsed = parseHostRoutingKey(hostId);
if (!parsed) return false;
if (!auth.organizationIds.includes(parsed.organizationId)) return false;

const key = `${auth.sub}:${hostId}`;
if (allowedCache.has(key)) return true;
if (deniedCache.has(key)) return false;

P2 Short-circuit fires before the deniedCache check, but order is reversed from what you might expect

The org short-circuit (!auth.organizationIds.includes(parsed.organizationId)) is evaluated before the deniedCache lookup. Consider the transition: user is in the org + gets API-denied → deniedCache["userId:hostId"] is populated → user is later removed from the org → their new JWT no longer contains the org → subsequent requests hit the short-circuit and return false before ever reaching deniedCache. The stale deniedCache entry is then unreachable for that user+host pair and sits there until its TTL expires. This is harmless (the short-circuit is cheap and correct), but the dead deniedCache slot occupies space in the 10k-entry LRU for up to 30 seconds. Low impact, but worth being aware of if the org-removal→re-add cycle is frequent.



github-actions Bot commented May 13, 2026

🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

  • ✅ Neon database branch

Thank you for your contribution! 🎉


@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 2 files

@saddlepaddle saddlepaddle merged commit f3e3e93 into main May 13, 2026
17 checks passed
@saddlepaddle saddlepaddle mentioned this pull request May 14, 2026
3 tasks