fix(relay): cache host.checkAccess by userId, not token#4491
Conversation
The relay caches host.checkAccess results in an LRU to avoid hitting
the API on every tunneled request. The cache was keyed by
`${token}:${hostId}`, which meant every JWT refresh invalidated the
entry — even though the underlying user→host authorization hadn't
changed. With active desktops cycling tokens roughly hourly, the
endpoint accounted for ~29% of /api log volume.
Changes:
- Key by (userId, hostId) — userId comes from verified JWT.sub and is
stable across token refreshes.
- Bump allowed TTL from 5m to 15m. Org/host membership changes a few
times a day per user, not every five minutes; worst-case stale-allow
window of 15m is acceptable.
- Add 30s negative cache for denied results so a misconfigured client
with a bad token can't hammer the API.
- Short-circuit locally when the host's organizationId isn't in
auth.organizationIds — the API does the same check before the DB
query, so the round trip is wasted.
Together these should drop checkAccess volume ~80% on the API side.
|
Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe relay's host access authorization is enhanced to accept authenticated user context, enabling local organization-based access checks and separate result caching. The ChangesAuthorization Cache and Access Control
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR fixes a caching inefficiency in the relay's
Confidence Score: 4/5Safe to merge — the cache-key change is correct and stable, the short-circuit is a faithful reimplementation of the server-side org check, and both call sites are updated consistently. The core logic change is well-reasoned and the trade-offs (15m stale-allow, 30s stale-deny) are explicitly documented. Two minor observations: API exceptions bypass the negative cache, so a consistently failing API call path will still hit the API on every attempt; and a stale deniedCache entry becomes unreachable (harmless but occupies a slot) after a user is removed from an org. Neither affects correctness in the normal path. apps/relay/src/access.ts — the caching logic is the heart of this change; the error-path and short-circuit ordering warrant a second read.
|
| Filename | Overview |
|---|---|
| apps/relay/src/access.ts | Replaces token-based cache key with userId (auth.sub), adds a 30s negative cache for denied results, extends allowed TTL to 15m, and introduces a local org-membership short-circuit. Logic is sound; one edge case where API errors bypass the negative cache protection. |
| apps/relay/src/index.ts | Two call sites updated to pass the new auth parameter to checkHostAccess. Both HTTP middleware and WebSocket tunnel paths are correctly updated with no logic changes beyond the parameter addition. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[checkHostAccess called] --> B{parseHostRoutingKey}
B -- invalid --> C[return false]
B -- valid --> D{auth.organizationIds\nincludes parsed.organizationId?}
D -- No --> C
D -- Yes --> E{allowedCache.has\nuserId:hostId?}
E -- Hit --> F[return true]
E -- Miss --> G{deniedCache.has\nuserId:hostId?}
G -- Hit --> C
G -- Miss --> H[API: host.checkAccess.query]
H -- throws --> I[return false\nnot cached]
H -- ok=true --> J[allowedCache.set\nTTL 15m\nreturn true]
H -- ok=false --> K[deniedCache.set\nTTL 30s\nreturn false]
Comments Outside Diff (1)
-
apps/relay/src/access.ts, line 47-49 (link)Exception path bypasses the negative cache
When
client.host.checkAccess.querythrows (e.g., network error, unexpected 5xx), the function returnsfalsewithout populatingdeniedCache. A client that consistently triggers this path — say, the API router is broken for a specifichostId— will make a live API call on every request with no rate-limiting via the cache. The PR description frames the negative cache as protection against hammering, but that protection only kicks in for HTTP-level denials, not for thrown exceptions. For the "bad token" scenario specifically this is fine (JWT failure happens before reaching here), but it is worth noting for infrastructure-failure cases.Prompt To Fix With AI
This is a comment left during a code review. Path: apps/relay/src/access.ts Line: 47-49 Comment: **Exception path bypasses the negative cache** When `client.host.checkAccess.query` throws (e.g., network error, unexpected 5xx), the function returns `false` without populating `deniedCache`. A client that consistently triggers this path — say, the API router is broken for a specific `hostId` — will make a live API call on every request with no rate-limiting via the cache. The PR description frames the negative cache as protection against hammering, but that protection only kicks in for HTTP-level denials, not for thrown exceptions. For the "bad token" scenario specifically this is fine (JWT failure happens before reaching here), but it is worth noting for infrastructure-failure cases. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
apps/relay/src/access.ts:47-49
**Exception path bypasses the negative cache**
When `client.host.checkAccess.query` throws (e.g., network error, unexpected 5xx), the function returns `false` without populating `deniedCache`. A client that consistently triggers this path — say, the API router is broken for a specific `hostId` — will make a live API call on every request with no rate-limiting via the cache. The PR description frames the negative cache as protection against hammering, but that protection only kicks in for HTTP-level denials, not for thrown exceptions. For the "bad token" scenario specifically this is fine (JWT failure happens before reaching here), but it is worth noting for infrastructure-failure cases.
### Issue 2 of 2
apps/relay/src/access.ts:27-35
**Short-circuit fires before the `deniedCache` check, but order is reversed from what you might expect**
The org short-circuit (`!auth.organizationIds.includes(parsed.organizationId)`) is evaluated before the `deniedCache` lookup. Consider the transition: user is in the org + gets API-denied → `deniedCache["userId:hostId"]` is populated → user is later removed from the org → their new JWT no longer contains the org → subsequent requests hit the short-circuit and return `false` before ever reaching `deniedCache`. The stale `deniedCache` entry becomes permanently unreachable for that user+host pair and will sit there until TTL expires. This is harmless (the short-circuit is cheap and correct), but the dead `deniedCache` slot wastes memory in the 10k-entry LRU during those 30 seconds. Low impact, but worth being aware of if the org-removal→re-add cycle is frequent.
Reviews (1): Last reviewed commit: "fix(relay): cache host.checkAccess by us..." | Re-trigger Greptile
| // Short-circuit "not in org" locally: the API does this same check from | ||
| // the JWT before hitting the DB, so the round trip is wasted. | ||
| const parsed = parseHostRoutingKey(hostId); | ||
| if (!parsed) return false; | ||
| if (!auth.organizationIds.includes(parsed.organizationId)) return false; | ||
|
|
||
| const key = `${auth.sub}:${hostId}`; | ||
| if (allowedCache.has(key)) return true; | ||
| if (deniedCache.has(key)) return false; |
There was a problem hiding this comment.
Short-circuit fires before the
deniedCache check, but order is reversed from what you might expect
The org short-circuit (!auth.organizationIds.includes(parsed.organizationId)) is evaluated before the deniedCache lookup. Consider the transition: user is in the org + gets API-denied → deniedCache["userId:hostId"] is populated → user is later removed from the org → their new JWT no longer contains the org → subsequent requests hit the short-circuit and return false before ever reaching deniedCache. The stale deniedCache entry becomes permanently unreachable for that user+host pair and will sit there until TTL expires. This is harmless (the short-circuit is cheap and correct), but the dead deniedCache slot wastes memory in the 10k-entry LRU during those 30 seconds. Low impact, but worth being aware of if the org-removal→re-add cycle is frequent.
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/relay/src/access.ts
Line: 27-35
Comment:
**Short-circuit fires before the `deniedCache` check, but order is reversed from what you might expect**
The org short-circuit (`!auth.organizationIds.includes(parsed.organizationId)`) is evaluated before the `deniedCache` lookup. Consider the transition: user is in the org + gets API-denied → `deniedCache["userId:hostId"]` is populated → user is later removed from the org → their new JWT no longer contains the org → subsequent requests hit the short-circuit and return `false` before ever reaching `deniedCache`. The stale `deniedCache` entry becomes permanently unreachable for that user+host pair and will sit there until TTL expires. This is harmless (the short-circuit is cheap and correct), but the dead `deniedCache` slot wastes memory in the 10k-entry LRU during those 30 seconds. Low impact, but worth being aware of if the org-removal→re-add cycle is frequent.
How can I resolve this? If you propose a fix, please make it concise.
🧹 Preview Cleanup CompleteThe following preview resources have been cleaned up:
Thank you for your contribution! 🎉 |
Summary
host.checkAccesswas accounting for ~29% of/apilog volume on Vercel. The relay caches results in an LRU, but the cache key was\${token}:${hostId}`` — and JWTs rotate every hour or so. Every refresh invalidated the cache entry even though the underlying user→host authorization hadn't changed, so active desktops were re-hitting the API roughly every JWT lifetime regardless of the 5-minute LRU TTL.Changes
(userId, hostId)via verifiedauth.sub. Stable across token refreshes.organizationIdisn't inauth.organizationIds, return false without an API call. The API does this exact check from the JWT before its DB query, so the round trip was wasted.Expected impact
~80% drop in
host.checkAccesscalls = ~23% of total/apilog volume.Follow-up to #4490 (deleted
device.heartbeat, ~10% of volume).Test plan
/apihost.checkAccessrequest count drop within an hour of staging deploySummary by cubic
Cache
host.checkAccessby userId instead of token to stop cache busting on JWT refresh. Adds a local org check and tuned TTLs to cut API calls by ~80% (~23% of/apivolume).auth.sub + hostId(stable across token rotation).auth.organizationIds(no API call).checkHostAccess(auth, token, hostId); call sites updated.Written for commit 2d5c787. Summary will update on new commits.
Summary by CodeRabbit