atomic updates to budgets and ratelimits#2844
Conversation
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughSummary by CodeRabbit
WalkthroughReplaces blind store operations with CAS-backed upsert/reset/bump primitives for budgets and rate limits; in-memory update and expiration flows now use these atomic helpers. Adds concurrency tests validating no lost increments and single-reset semantics under contention. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.4)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Comment |
|
|
This stack of pull requests is managed by Graphite. Learn more about stacking. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/governance/store.go`:
- Around line 344-363: The reset currently checks only that
old.LastReset.Before(newLastReset), which allows an intervening BumpBudgetUsage
to roll the window and increment CurrentUsage before the CAS, losing that spend;
modify ResetBudgetAt so it only succeeds if the stored budget equals the exact
snapshot observed by the sweep (i.e., CAS the original raw snapshot you loaded)
or compare against the exact LastReset value from that snapshot rather than any
older timestamp. Concretely, use the original raw from budgets.Load (or the
snapshot's LastReset value) in the budgets.CompareAndSwap call/condition so
ResetBudgetAt (and the same logic around ResetExpiredBudgetsInMemory) will only
zero CurrentUsage and update LastReset when the stored TableBudget matches the
exact observed snapshot, protecting recent BumpBudgetUsage updates.
- Around line 376-405: ResetRateLimitAt is resetting counters whenever stored
TokenLastReset/RequestLastReset is before the new reset time, which can
overwrite concurrent updates from BumpRateLimitUsage; change the logic to only
zero a counter if the stored *LastReset equals the exact timestamp you observed
when iterating (tie the reset to the observed
old.TokenLastReset/old.RequestLastReset snapshot), use CompareAndSwap with that
exact snapshot values, and on success return which specific counters were reset
(e.g., booleans for token and request) along with the exact LastReset values so
the caller can clear DB baseline markers for the counters that truly changed;
apply the same fix pattern to the duplicate block at the other location (lines
referenced) and ensure you check/compare TokenLastReset and RequestLastReset
separately so a reset of one counter doesn’t zero the other due to a stale
snapshot.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7ea2e5fa-9b46-4f37-a78f-29ef7f0c7c9d
📒 Files selected for processing (1)
plugins/governance/store.go
Confidence Score: 5/5Safe to merge; the CAS mechanics are sound and all remaining findings are P2 style/quality suggestions. The core CAS-retry pattern is correct and the tests directly validate the no-lost-increment and single-winner-reset guarantees. The only new finding is a P2 baseline-clearing condition that affects exported gossip maps but has no impact on single-node correctness or the primary dump path (called with nil baselines). All P0/P1-level concerns are covered in prior review threads. The baseline-clearing block in ResetExpiredRateLimitsInMemory (store.go lines 1431–1439) is the one spot worth a second look before merge. Important Files Changed
Reviews (4): Last reviewed commit: "atomic updates to budgets and ratelimits" | Re-trigger Greptile |
e8a2d06 to
de9ab68
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
plugins/governance/store.go (1)
376-405:⚠️ Potential issue | 🔴 CriticalTrack which counter actually reset before clearing DB baselines.
ResetRateLimitAtcan succeed after resetting only one side. If a concurrentBumpRateLimitUsagehas already advanced, say, the token window but the request window is still stale, this returnstruefor the request reset only.ResetExpiredRateLimitsInMemorythen clears both baseline maps whenever both resets were requested, which can erase fresh token baseline state and under-count until the next dump. Return per-counter reset results (or exact old/new reset timestamps) and only clear the matching baseline map.Also applies to: 1363-1376
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/governance/store.go` around lines 376 - 405, ResetRateLimitAt currently returns a single bool even when only one of Token or Request windows was reset, which causes ResetExpiredRateLimitsInMemory to indiscriminately clear both baseline maps and can erase a fresh baseline; update ResetRateLimitAt (and its callers such as ResetExpiredRateLimitsInMemory) to return per-counter results (e.g., two bools or the old/new reset timestamps for TokenLastReset and RequestLastReset) so the caller can detect exactly which counter was reset and only clear the corresponding baseline map(s); locate ResetRateLimitAt and the baseline-clearing logic in ResetExpiredRateLimitsInMemory and change the return type/handling to use these per-counter indicators and conditionally clear only the matching baseline map(s).
🧹 Nitpick comments (1)
plugins/governance/store_concurrency_test.go (1)
58-123: Add regressions for calendar-aligned bumps and partial rate-limit resets.The current coverage proves the lost-update story, but it still leaves the two easiest-to-regress semantics unpinned:
CalendarAlignedreset timestamps inBumpBudgetUsage/BumpRateLimitUsage, andResetRateLimitAtwhere only one of token/request actually resets under contention.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/governance/store_concurrency_test.go` around lines 58 - 123, Add targeted concurrency tests to lock in calendar-aligned reset behavior and prevent partial resets for rate-limits: create new tests similar to TestBumpRateLimitUsage_NoLostIncrements and TestResetBudgetAt_ConcurrentResettersCollapse that (1) exercise BumpBudgetUsage and BumpRateLimitUsage for CalendarAligned budgets/rate-limits so concurrent bumps preserve and correctly compute calendar-aligned LastReset timestamps, and (2) exercise ResetRateLimitAt under heavy contention ensuring both TokenCurrentUsage and RequestCurrentUsage are reset together (no partial reset) and that only one goroutine wins the CAS when resetting to the same timestamp; locate and mirror patterns around store.BumpBudgetUsage, store.BumpRateLimitUsage, store.ResetRateLimitAt, and store.ResetBudgetAt to add these regression tests and assert final counters and LastReset equality.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/governance/store.go`:
- Around line 276-283: The reset logic unconditionally stamps
LastReset/TokenLastReset/RequestLastReset with now and compares elapsed since
that value, which breaks calendar-aligned quotas; change the helpers (the blocks
using clone.ResetDuration, configstoreTables.ParseDuration and now) to detect
clone.CalendarAligned and, when true, compute and use the calendar-aligned
boundary (e.g., start of day/week/month depending on the parsed duration) as the
reset timestamp and comparison baseline instead of now; update both the
CurrentUsage-reset branches and any places that assign
LastReset/TokenLastReset/RequestLastReset so they set the aligned boundary when
CalendarAligned is true (and keep existing behavior when CalendarAligned is
false), and apply the same change to the other analogous block referenced (lines
309-324).
---
Duplicate comments:
In `@plugins/governance/store.go`:
- Around line 376-405: ResetRateLimitAt currently returns a single bool even
when only one of Token or Request windows was reset, which causes
ResetExpiredRateLimitsInMemory to indiscriminately clear both baseline maps and
can erase a fresh baseline; update ResetRateLimitAt (and its callers such as
ResetExpiredRateLimitsInMemory) to return per-counter results (e.g., two bools
or the old/new reset timestamps for TokenLastReset and RequestLastReset) so the
caller can detect exactly which counter was reset and only clear the
corresponding baseline map(s); locate ResetRateLimitAt and the baseline-clearing
logic in ResetExpiredRateLimitsInMemory and change the return type/handling to
use these per-counter indicators and conditionally clear only the matching
baseline map(s).
---
Nitpick comments:
In `@plugins/governance/store_concurrency_test.go`:
- Around line 58-123: Add targeted concurrency tests to lock in calendar-aligned
reset behavior and prevent partial resets for rate-limits: create new tests
similar to TestBumpRateLimitUsage_NoLostIncrements and
TestResetBudgetAt_ConcurrentResettersCollapse that (1) exercise BumpBudgetUsage
and BumpRateLimitUsage for CalendarAligned budgets/rate-limits so concurrent
bumps preserve and correctly compute calendar-aligned LastReset timestamps, and
(2) exercise ResetRateLimitAt under heavy contention ensuring both
TokenCurrentUsage and RequestCurrentUsage are reset together (no partial reset)
and that only one goroutine wins the CAS when resetting to the same timestamp;
locate and mirror patterns around store.BumpBudgetUsage,
store.BumpRateLimitUsage, store.ResetRateLimitAt, and store.ResetBudgetAt to add
these regression tests and assert final counters and LastReset equality.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f03124c2-3ed8-4048-a1eb-fecf29caf5aa
📒 Files selected for processing (2)
plugins/governance/store.goplugins/governance/store_concurrency_test.go
de9ab68 to
36a5b0c
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/governance/store.go (1)
2749-2756:⚠️ Potential issue | 🟡 MinorMissing
LastResetpreservation in ModelConfig and Provider updates.Unlike other
Update*InMemorymethods that preserve bothCurrentUsageandLastReset, these two only copyCurrentUsage. If an admin update races with a reset,LastResetcould revert to its original value, causing premature or delayed resets.🔧 Proposed fix to also preserve LastReset
For
UpdateModelConfigInMemory:if eb, ok := existingBudgetValue.(*configstoreTables.TableBudget); ok && eb != nil { clone.Budget.CurrentUsage = eb.CurrentUsage + clone.Budget.LastReset = eb.LastReset }For
UpdateProviderInMemory:if eb, ok := existingBudgetValue.(*configstoreTables.TableBudget); ok && eb != nil { clone.Budget.CurrentUsage = eb.CurrentUsage + clone.Budget.LastReset = eb.LastReset }Also applies to: 2828-2835
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/governance/store.go` around lines 2749 - 2756, The budget update logic only copies CurrentUsage and omits LastReset, which can cause races; in both UpdateModelConfigInMemory and UpdateProviderInMemory locate the block that loads from gs.budgets with gs.budgets.Load(clone.Budget.ID) and, after confirming the loaded value is a *configstoreTables.TableBudget (existingBudgetValue -> eb), also assign clone.Budget.LastReset = eb.LastReset (in addition to clone.Budget.CurrentUsage = eb.CurrentUsage) before calling gs.budgets.Store(clone.Budget.ID, clone.Budget); apply the same change to the other occurrence referenced (the block at the other location).
🧹 Nitpick comments (1)
plugins/governance/store.go (1)
2332-2339: TOCTOU race when config update races with usage increment.This Load→Store pattern can clobber a concurrent
BumpBudgetUsage:
UpdateVirtualKeyInMemoryloads budget withCurrentUsage=X- Concurrent request:
BumpBudgetUsageCAS succeeds,CurrentUsage=X+costUpdateVirtualKeyInMemorystores budget withCurrentUsage=X→ increment lostConsider using
UpsertBudgetConfighere (pass the new config and let it preserve usage atomically), or at minimum wrap the Load+Store in a CAS loop. The same pattern appears in:
UpdateTeamInMemory(lines 2513-2522)UpdateCustomerInMemory(lines 2623-2632)UpdateModelConfigInMemory(lines 2749-2756)UpdateProviderInMemory(lines 2828-2835)While admin updates are rare compared to request processing, this still violates the PR's stated goal of eliminating unsafe Load→Store patterns.
♻️ Suggested direction using UpsertBudgetConfig
for i := range clone.Budgets { - // Preserve existing usage from memory - if existingBudgetValue, exists := gs.budgets.Load(clone.Budgets[i].ID); exists && existingBudgetValue != nil { - if existingBudget, ok := existingBudgetValue.(*configstoreTables.TableBudget); ok && existingBudget != nil { - clone.Budgets[i].CurrentUsage = existingBudget.CurrentUsage - clone.Budgets[i].LastReset = existingBudget.LastReset - } - } - gs.budgets.Store(clone.Budgets[i].ID, &clone.Budgets[i]) + gs.UpsertBudgetConfig(ctx, clone.Budgets[i].ID, &clone.Budgets[i]) }This delegates usage-preservation to the CAS-based helper.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/governance/store.go` around lines 2332 - 2339, The Load→Store in UpdateVirtualKeyInMemory can race with BumpBudgetUsage and lose increments; replace the manual budgets.Load + budgets.Store sequence for each clone.Budgets[i] with a call to UpsertBudgetConfig passing the new budget config so the helper preserves CurrentUsage atomically, or implement a CAS loop on gs.budgets (Load, merge CurrentUsage from existing, attempt Store via CompareAndSwap) to ensure updates are atomic; apply the same change pattern to the analogous places UpdateTeamInMemory, UpdateCustomerInMemory, UpdateModelConfigInMemory, and UpdateProviderInMemory.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@plugins/governance/store.go`:
- Around line 2749-2756: The budget update logic only copies CurrentUsage and
omits LastReset, which can cause races; in both UpdateModelConfigInMemory and
UpdateProviderInMemory locate the block that loads from gs.budgets with
gs.budgets.Load(clone.Budget.ID) and, after confirming the loaded value is a
*configstoreTables.TableBudget (existingBudgetValue -> eb), also assign
clone.Budget.LastReset = eb.LastReset (in addition to clone.Budget.CurrentUsage
= eb.CurrentUsage) before calling gs.budgets.Store(clone.Budget.ID,
clone.Budget); apply the same change to the other occurrence referenced (the
block at the other location).
---
Nitpick comments:
In `@plugins/governance/store.go`:
- Around line 2332-2339: The Load→Store in UpdateVirtualKeyInMemory can race
with BumpBudgetUsage and lose increments; replace the manual budgets.Load +
budgets.Store sequence for each clone.Budgets[i] with a call to
UpsertBudgetConfig passing the new budget config so the helper preserves
CurrentUsage atomically, or implement a CAS loop on gs.budgets (Load, merge
CurrentUsage from existing, attempt Store via CompareAndSwap) to ensure updates
are atomic; apply the same change pattern to the analogous places
UpdateTeamInMemory, UpdateCustomerInMemory, UpdateModelConfigInMemory, and
UpdateProviderInMemory.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a549fe6c-3dab-4cd0-aa07-d70e3994d9b3
📒 Files selected for processing (2)
plugins/governance/store.goplugins/governance/store_concurrency_test.go
✅ Files skipped from review due to trivial changes (1)
- plugins/governance/store_concurrency_test.go
Merge activity
|
The base branch was changed.
36a5b0c to
0c86605
Compare
## Summary In-memory budget and rate-limit counters were updated via a plain Load → clone → mutate → Store pattern, which is not safe under concurrent callers: two goroutines racing on the same entry could silently drop one of the increments. This PR replaces that pattern with CAS-retry loops that guarantee every increment is applied exactly once. ## Changes - Introduced `BumpBudgetUsage` and `BumpRateLimitUsage` as the single authoritative write path for incrementing budget and rate-limit counters. Both use `sync.Map.CompareAndSwap` with a retry loop so concurrent callers never lose an update. Both also handle rolling-window resets inline (zeroing usage and advancing `LastReset` when the reset duration has elapsed). - Introduced `ResetBudgetAt` and `ResetRateLimitAt` as the single authoritative write path for periodic resets. Each is conditional on the stored `LastReset` being strictly older than the target, so concurrent resetters collapse into a single successful write and the loser is a clean no-op. - Refactored `UpdateVirtualKeyBudgetUsageInMemory`, `UpdateProviderAndModelBudgetUsageInMemory`, `UpdateProviderAndModelRateLimitUsageInMemory`, and `UpdateVirtualKeyRateLimitUsageInMemory` to delegate to `BumpBudgetUsage` / `BumpRateLimitUsage` instead of duplicating the clone-and-store logic. - Refactored `ResetExpiredBudgetsInMemory` and `ResetExpiredRateLimitsInMemory` to delegate to `ResetBudgetAt` / `ResetRateLimitAt`. The Range loop now only decides *whether* a reset is due; the CAS methods handle the actual write and skip redundant work when another writer has already advanced the timestamp. - Extracted a `resolvePeriodStart` helper inside `ResetExpiredRateLimitsInMemory` to unify calendar-aligned and rolling-duration reset logic for both token and request counters, eliminating the previous duplicated branches. ## Type of change - [ ] Bug fix - [x] Refactor - [ ] Feature - [ ] Documentation - [ ] Chore/CI ## Affected areas - [ ] Core (Go) - [ ] Transports (HTTP) - [ ] Providers/Integrations - [x] Plugins - [ ] UI (React) - [ ] Docs ## How to test ```sh go test ./plugins/governance/... ``` Concurrent correctness can be validated by running the governance store tests under the race detector: ```sh go test -race ./plugins/governance/... ``` ## Breaking changes - [ ] Yes - [x] No ## Security considerations No auth, secrets, or PII changes. The fix closes a race window where concurrent requests could under-count usage against a budget or rate limit, which previously could allow spend or request volume to exceed configured limits. ## Checklist - [ ] I read `docs/contributing/README.md` and followed the guidelines - [ ] I added/updated tests where appropriate - [ ] I updated documentation where needed - [ ] I verified builds succeed (Go and UI) - [ ] I verified the CI pipeline passes locally if applicable

Summary
In-memory budget and rate-limit counters were updated via a plain Load → clone → mutate → Store pattern, which is not safe under concurrent callers: two goroutines racing on the same entry could silently drop one of the increments. This PR replaces that pattern with CAS-retry loops that guarantee every increment is applied exactly once.
Changes
BumpBudgetUsageandBumpRateLimitUsageas the single authoritative write path for incrementing budget and rate-limit counters. Both usesync.Map.CompareAndSwapwith a retry loop so concurrent callers never lose an update. Both also handle rolling-window resets inline (zeroing usage and advancingLastResetwhen the reset duration has elapsed).ResetBudgetAtandResetRateLimitAtas the single authoritative write path for periodic resets. Each is conditional on the storedLastResetbeing strictly older than the target, so concurrent resetters collapse into a single successful write and the loser is a clean no-op.UpdateVirtualKeyBudgetUsageInMemory,UpdateProviderAndModelBudgetUsageInMemory,UpdateProviderAndModelRateLimitUsageInMemory, andUpdateVirtualKeyRateLimitUsageInMemoryto delegate toBumpBudgetUsage/BumpRateLimitUsageinstead of duplicating the clone-and-store logic.ResetExpiredBudgetsInMemoryandResetExpiredRateLimitsInMemoryto delegate toResetBudgetAt/ResetRateLimitAt. The Range loop now only decides whether a reset is due; the CAS methods handle the actual write and skip redundant work when another writer has already advanced the timestamp.resolvePeriodStarthelper insideResetExpiredRateLimitsInMemoryto unify calendar-aligned and rolling-duration reset logic for both token and request counters, eliminating the previous duplicated branches.Type of change
Affected areas
How to test
go test ./plugins/governance/...Concurrent correctness can be validated by running the governance store tests under the race detector:
go test -race ./plugins/governance/...Breaking changes
Security considerations
No auth, secrets, or PII changes. The fix closes a race window where concurrent requests could under-count usage against a budget or rate limit, which previously could allow spend or request volume to exceed configured limits.
Checklist
docs/contributing/README.mdand followed the guidelines