fix(auth-profiles): bound agentOwner / agentOrg caches at 1024 entries#855
Conversation
Both caches were "lazy refresh on read" — they update an entry's expiresAt when the same agentId is looked up again, but never delete entries for agentIds that are never re-queried. Net: cache size grows monotonically with distinct agentIds the gateway has ever seen, bounded only by the pod's lifetime. In practice the growth rate is small (~200 bytes per distinct agentId, hundreds-to-thousands of agents per day) and almost certainly NOT the cause of the 1 Gi OOM that prompted #782 — but it's still a genuine bound-less Map that ideally cleans up after itself. Adds a tiny cacheSet helper with a 1024-entry cap that evicts the oldest insertion (Maps iterate in insertion order, so size-1 peek-and-delete is O(1)). Test exercises 2048 distinct lookups and asserts both caches stay <= 1024. Refs #782 — hardening, not root-cause fix. SSE keepalive teardown (#833) and in-memory pending-interactions removal (#834) remain the most likely actual OOM fixes.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR adds bounded caching to ChangesBounded Agent Cache for AuthProfilesManager
🎯 2 (Simple) | ⏱️ ~10 minutes
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Summary
Hardening pass on the auth-profile resolver caches in
AuthProfilesManager, motivated by #782's leak hunt.Root cause
AuthProfilesManagerhas two short-lived caches:Both use "lazy refresh on read" — when an entry is looked up and expired, the resolver is called again and
set()updates the entry. But for agentIds that are looked up once and never re-queried, the entry stays in the Map forever. The 60-second TTL only refreshes values, not map size.In practice this is small (~200 bytes per distinct agentId, hundreds-to-thousands of agents per day). Almost certainly not the cause of the 1 Gi OOM in #782 — but a real Map that should clean up after itself.
Fix
Adds a tiny
cacheSet(map, key, value)helper that enforces a 1024-entry hard cap. Whenset()would push the size over the cap, evict the oldest insertion first (Maps iterate in insertion order, somap.keys().next().value+map.delete()is O(1)).Both call sites updated to use the helper.
Reproducer (e2e gate per AGENTS.md)
Before fix (RED)
The test inserts 2048 distinct lookups and asserts
cache.size <= 1024. Without the cap, size hits 2048.After fix (GREEN)
Refs
Summary by CodeRabbit
Tests
Chores