Skip to content

atomic updates to budgets and ratelimits#2844

Merged
akshaydeo merged 1 commit intomainfrom
04-20-atomic_updates_to_budgets_and_ratelimits
Apr 20, 2026
Merged

atomic updates to budgets and ratelimits#2844
akshaydeo merged 1 commit intomainfrom
04-20-atomic_updates_to_budgets_and_ratelimits

Conversation

@akshaydeo
Copy link
Copy Markdown
Contributor

@akshaydeo akshaydeo commented Apr 20, 2026

Summary

In-memory budget and rate-limit counters were updated via a plain Load → clone → mutate → Store pattern, which is not safe under concurrent callers: two goroutines racing on the same entry could silently drop one of the increments. This PR replaces that pattern with CAS-retry loops that guarantee every increment is applied exactly once.

Changes

  • Introduced BumpBudgetUsage and BumpRateLimitUsage as the single authoritative write path for incrementing budget and rate-limit counters. Both use sync.Map.CompareAndSwap with a retry loop so concurrent callers never lose an update. Both also handle rolling-window resets inline (zeroing usage and advancing LastReset when the reset duration has elapsed).
  • Introduced ResetBudgetAt and ResetRateLimitAt as the single authoritative write path for periodic resets. Each is conditional on the stored LastReset being strictly older than the target, so concurrent resetters collapse into a single successful write and the loser is a clean no-op.
  • Refactored UpdateVirtualKeyBudgetUsageInMemory, UpdateProviderAndModelBudgetUsageInMemory, UpdateProviderAndModelRateLimitUsageInMemory, and UpdateVirtualKeyRateLimitUsageInMemory to delegate to BumpBudgetUsage / BumpRateLimitUsage instead of duplicating the clone-and-store logic.
  • Refactored ResetExpiredBudgetsInMemory and ResetExpiredRateLimitsInMemory to delegate to ResetBudgetAt / ResetRateLimitAt. The Range loop now only decides whether a reset is due; the CAS methods handle the actual write and skip redundant work when another writer has already advanced the timestamp.
  • Extracted a resolvePeriodStart helper inside ResetExpiredRateLimitsInMemory to unify calendar-aligned and rolling-duration reset logic for both token and request counters, eliminating the previous duplicated branches.

Type of change

  • Bug fix
  • Refactor
  • Feature
  • Documentation
  • Chore/CI

Affected areas

  • Core (Go)
  • Transports (HTTP)
  • Providers/Integrations
  • Plugins
  • UI (React)
  • Docs

How to test

go test ./plugins/governance/...

Concurrent correctness can be validated by running the governance store tests under the race detector:

go test -race ./plugins/governance/...

Breaking changes

  • Yes
  • No

Security considerations

No auth, secrets, or PII changes. The fix closes a race window where concurrent requests could under-count usage against a budget or rate limit, which previously could allow spend or request volume to exceed configured limits.

Checklist

  • I read docs/contributing/README.md and followed the guidelines
  • I added/updated tests where appropriate
  • I updated documentation where needed
  • I verified builds succeed (Go and UI)
  • I verified the CI pipeline passes locally if applicable

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 20, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Summary by CodeRabbit

  • Internal Improvements
    • More robust budget and rate-limit handling with atomic, conflict-resistant updates that preserve usage during config changes, prevent lost increments, reduce unnecessary writes, and ensure consistent reset behavior.
  • Tests
    • Added concurrency tests to verify exact accounting under heavy contention and that concurrent resets collapse to a single successful reset.

Walkthrough

Replaces blind store operations with CAS-backed upsert/reset/bump primitives for budgets and rate limits; in-memory update and expiration flows now use these atomic helpers. Adds concurrency tests validating no lost increments and single-reset semantics under contention.

Changes

Cohort / File(s) Summary
Governance store implementation & API
plugins/governance/store.go
Replaced StoreBudget/StoreRateLimit with UpsertBudgetConfig/UpsertRateLimitConfig; added atomic CAS primitives: BumpBudgetUsage, BumpRateLimitUsage, ResetBudgetAt, ResetRateLimitAt. Refactored in-memory update/reset flows to use these helpers and preserve usage/last-reset during config upserts.
Concurrency tests
plugins/governance/store_concurrency_test.go
Added newStandaloneStore(t) test helper and tests exercising concurrent CAS behavior: BumpBudgetUsage and BumpRateLimitUsage no lost increments, and ResetBudgetAt collapsing concurrent resetters to a single successful CAS.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hop and nudge with careful paws,

CAS keeps counts without a pause.
Many hops race, none drop away,
One reset wins the bustling fray.
I nibble carrots — orderly days.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'atomic updates to budgets and ratelimits' clearly summarizes the main change: replacing unsafe Load → clone → store patterns with atomic CAS operations for concurrent safety.
Description check ✅ Passed The description is comprehensive, covering summary, changes, type, affected areas, testing instructions, breaking changes, security considerations, and checklist—all required template sections are addressed.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 04-20-atomic_updates_to_budgets_and_ratelimits

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Comment @coderabbitai help to get the list of available commands and usage tips.

@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

This was referenced Apr 20, 2026
Copy link
Copy Markdown
Contributor Author

akshaydeo commented Apr 20, 2026

@akshaydeo akshaydeo marked this pull request as ready for review April 20, 2026 03:25
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/governance/store.go`:
- Around line 344-363: The reset currently checks only that
old.LastReset.Before(newLastReset), which allows an intervening BumpBudgetUsage
to roll the window and increment CurrentUsage before the CAS, losing that spend;
modify ResetBudgetAt so it only succeeds if the stored budget equals the exact
snapshot observed by the sweep (i.e., CAS the original raw snapshot you loaded)
or compare against the exact LastReset value from that snapshot rather than any
older timestamp. Concretely, use the original raw from budgets.Load (or the
snapshot's LastReset value) in the budgets.CompareAndSwap call/condition so
ResetBudgetAt (and the same logic around ResetExpiredBudgetsInMemory) will only
zero CurrentUsage and update LastReset when the stored TableBudget matches the
exact observed snapshot, protecting recent BumpBudgetUsage updates.
- Around line 376-405: ResetRateLimitAt is resetting counters whenever stored
TokenLastReset/RequestLastReset is before the new reset time, which can
overwrite concurrent updates from BumpRateLimitUsage; change the logic to only
zero a counter if the stored *LastReset equals the exact timestamp you observed
when iterating (tie the reset to the observed
old.TokenLastReset/old.RequestLastReset snapshot), use CompareAndSwap with that
exact snapshot values, and on success return which specific counters were reset
(e.g., booleans for token and request) along with the exact LastReset values so
the caller can clear DB baseline markers for the counters that truly changed;
apply the same fix pattern to the duplicate block at the other location (lines
referenced) and ensure you check/compare TokenLastReset and RequestLastReset
separately so a reset of one counter doesn’t zero the other due to a stale
snapshot.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7ea2e5fa-9b46-4f37-a78f-29ef7f0c7c9d

📥 Commits

Reviewing files that changed from the base of the PR and between c8bf1f6 and e8a2d06.

📒 Files selected for processing (1)
  • plugins/governance/store.go

Comment thread plugins/governance/store.go
Comment thread plugins/governance/store.go
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 20, 2026

Confidence Score: 5/5

Safe to merge; the CAS mechanics are sound and all remaining findings are P2 style/quality suggestions.

The core CAS-retry pattern is correct and the tests directly validate the no-lost-increment and single-winner-reset guarantees. The only new finding is a P2 baseline-clearing condition that affects exported gossip maps but has no impact on single-node correctness or the primary dump path (called with nil baselines). All P0/P1-level concerns are covered in prior review threads.

The baseline-clearing block in ResetExpiredRateLimitsInMemory (store.go lines 1431–1439) is the one spot worth a second look before merge.

Important Files Changed

Filename Overview
plugins/governance/store.go Introduces CAS-retry BumpBudgetUsage, BumpRateLimitUsage, ResetBudgetAt, and ResetRateLimitAt; ResetExpiredRateLimitsInMemory baseline-clearing condition guards on intent rather than actual CAS outcome for each counter.
plugins/governance/store_concurrency_test.go Adds targeted concurrency tests for budget/rate-limit CAS loops and budget-resetter collapse; lacks a symmetric collapse test for the more complex dual-counter ResetRateLimitAt.

Reviews (4): Last reviewed commit: "atomic updates to budgets and ratelimits" | Re-trigger Greptile

Comment thread plugins/governance/store.go
Comment thread plugins/governance/store.go
Comment thread plugins/governance/store.go
@akshaydeo akshaydeo force-pushed the 04-20-atomic_updates_to_budgets_and_ratelimits branch from e8a2d06 to de9ab68 Compare April 20, 2026 03:35
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
plugins/governance/store.go (1)

376-405: ⚠️ Potential issue | 🔴 Critical

Track which counter actually reset before clearing DB baselines.

ResetRateLimitAt can succeed after resetting only one side. If a concurrent BumpRateLimitUsage has already advanced, say, the token window but the request window is still stale, this returns true for the request reset only. ResetExpiredRateLimitsInMemory then clears both baseline maps whenever both resets were requested, which can erase fresh token baseline state and under-count until the next dump. Return per-counter reset results (or exact old/new reset timestamps) and only clear the matching baseline map.

Also applies to: 1363-1376

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/governance/store.go` around lines 376 - 405, ResetRateLimitAt
currently returns a single bool even when only one of Token or Request windows
was reset, which causes ResetExpiredRateLimitsInMemory to indiscriminately clear
both baseline maps and can erase a fresh baseline; update ResetRateLimitAt (and
its callers such as ResetExpiredRateLimitsInMemory) to return per-counter
results (e.g., two bools or the old/new reset timestamps for TokenLastReset and
RequestLastReset) so the caller can detect exactly which counter was reset and
only clear the corresponding baseline map(s); locate ResetRateLimitAt and the
baseline-clearing logic in ResetExpiredRateLimitsInMemory and change the return
type/handling to use these per-counter indicators and conditionally clear only
the matching baseline map(s).
🧹 Nitpick comments (1)
plugins/governance/store_concurrency_test.go (1)

58-123: Add regressions for calendar-aligned bumps and partial rate-limit resets.

The current coverage proves the lost-update story, but it still leaves the two easiest-to-regress semantics unpinned: CalendarAligned reset timestamps in BumpBudgetUsage/BumpRateLimitUsage, and ResetRateLimitAt where only one of token/request actually resets under contention.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/governance/store_concurrency_test.go` around lines 58 - 123, Add
targeted concurrency tests to lock in calendar-aligned reset behavior and
prevent partial resets for rate-limits: create new tests similar to
TestBumpRateLimitUsage_NoLostIncrements and
TestResetBudgetAt_ConcurrentResettersCollapse that (1) exercise BumpBudgetUsage
and BumpRateLimitUsage for CalendarAligned budgets/rate-limits so concurrent
bumps preserve and correctly compute calendar-aligned LastReset timestamps, and
(2) exercise ResetRateLimitAt under heavy contention ensuring both
TokenCurrentUsage and RequestCurrentUsage are reset together (no partial reset)
and that only one goroutine wins the CAS when resetting to the same timestamp;
locate and mirror patterns around store.BumpBudgetUsage,
store.BumpRateLimitUsage, store.ResetRateLimitAt, and store.ResetBudgetAt to add
these regression tests and assert final counters and LastReset equality.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/governance/store.go`:
- Around line 276-283: The reset logic unconditionally stamps
LastReset/TokenLastReset/RequestLastReset with now and compares elapsed since
that value, which breaks calendar-aligned quotas; change the helpers (the blocks
using clone.ResetDuration, configstoreTables.ParseDuration and now) to detect
clone.CalendarAligned and, when true, compute and use the calendar-aligned
boundary (e.g., start of day/week/month depending on the parsed duration) as the
reset timestamp and comparison baseline instead of now; update both the
CurrentUsage-reset branches and any places that assign
LastReset/TokenLastReset/RequestLastReset so they set the aligned boundary when
CalendarAligned is true (and keep existing behavior when CalendarAligned is
false), and apply the same change to the other analogous block referenced (lines
309-324).

---

Duplicate comments:
In `@plugins/governance/store.go`:
- Around line 376-405: ResetRateLimitAt currently returns a single bool even
when only one of Token or Request windows was reset, which causes
ResetExpiredRateLimitsInMemory to indiscriminately clear both baseline maps and
can erase a fresh baseline; update ResetRateLimitAt (and its callers such as
ResetExpiredRateLimitsInMemory) to return per-counter results (e.g., two bools
or the old/new reset timestamps for TokenLastReset and RequestLastReset) so the
caller can detect exactly which counter was reset and only clear the
corresponding baseline map(s); locate ResetRateLimitAt and the baseline-clearing
logic in ResetExpiredRateLimitsInMemory and change the return type/handling to
use these per-counter indicators and conditionally clear only the matching
baseline map(s).

---

Nitpick comments:
In `@plugins/governance/store_concurrency_test.go`:
- Around line 58-123: Add targeted concurrency tests to lock in calendar-aligned
reset behavior and prevent partial resets for rate-limits: create new tests
similar to TestBumpRateLimitUsage_NoLostIncrements and
TestResetBudgetAt_ConcurrentResettersCollapse that (1) exercise BumpBudgetUsage
and BumpRateLimitUsage for CalendarAligned budgets/rate-limits so concurrent
bumps preserve and correctly compute calendar-aligned LastReset timestamps, and
(2) exercise ResetRateLimitAt under heavy contention ensuring both
TokenCurrentUsage and RequestCurrentUsage are reset together (no partial reset)
and that only one goroutine wins the CAS when resetting to the same timestamp;
locate and mirror patterns around store.BumpBudgetUsage,
store.BumpRateLimitUsage, store.ResetRateLimitAt, and store.ResetBudgetAt to add
these regression tests and assert final counters and LastReset equality.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f03124c2-3ed8-4048-a1eb-fecf29caf5aa

📥 Commits

Reviewing files that changed from the base of the PR and between e8a2d06 and de9ab68.

📒 Files selected for processing (2)
  • plugins/governance/store.go
  • plugins/governance/store_concurrency_test.go

Comment thread plugins/governance/store.go
@akshaydeo akshaydeo force-pushed the 04-20-atomic_updates_to_budgets_and_ratelimits branch from de9ab68 to 36a5b0c Compare April 20, 2026 03:48
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/governance/store.go (1)

2749-2756: ⚠️ Potential issue | 🟡 Minor

Missing LastReset preservation in ModelConfig and Provider updates.

Unlike other Update*InMemory methods that preserve both CurrentUsage and LastReset, these two only copy CurrentUsage. If an admin update races with a reset, LastReset could revert to its original value, causing premature or delayed resets.

🔧 Proposed fix to also preserve LastReset

For UpdateModelConfigInMemory:

 if eb, ok := existingBudgetValue.(*configstoreTables.TableBudget); ok && eb != nil {
     clone.Budget.CurrentUsage = eb.CurrentUsage
+    clone.Budget.LastReset = eb.LastReset
 }

For UpdateProviderInMemory:

 if eb, ok := existingBudgetValue.(*configstoreTables.TableBudget); ok && eb != nil {
     clone.Budget.CurrentUsage = eb.CurrentUsage
+    clone.Budget.LastReset = eb.LastReset
 }

Also applies to: 2828-2835

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/governance/store.go` around lines 2749 - 2756, The budget update
logic only copies CurrentUsage and omits LastReset, which can cause races; in
both UpdateModelConfigInMemory and UpdateProviderInMemory locate the block that
loads from gs.budgets with gs.budgets.Load(clone.Budget.ID) and, after
confirming the loaded value is a *configstoreTables.TableBudget
(existingBudgetValue -> eb), also assign clone.Budget.LastReset = eb.LastReset
(in addition to clone.Budget.CurrentUsage = eb.CurrentUsage) before calling
gs.budgets.Store(clone.Budget.ID, clone.Budget); apply the same change to the
other occurrence referenced (the block at the other location).
🧹 Nitpick comments (1)
plugins/governance/store.go (1)

2332-2339: TOCTOU race when config update races with usage increment.

This Load→Store pattern can clobber a concurrent BumpBudgetUsage:

  1. UpdateVirtualKeyInMemory loads budget with CurrentUsage=X
  2. Concurrent request: BumpBudgetUsage CAS succeeds, CurrentUsage=X+cost
  3. UpdateVirtualKeyInMemory stores budget with CurrentUsage=X → increment lost

Consider using UpsertBudgetConfig here (pass the new config and let it preserve usage atomically), or at minimum wrap the Load+Store in a CAS loop. The same pattern appears in:

  • UpdateTeamInMemory (lines 2513-2522)
  • UpdateCustomerInMemory (lines 2623-2632)
  • UpdateModelConfigInMemory (lines 2749-2756)
  • UpdateProviderInMemory (lines 2828-2835)

While admin updates are rare compared to request processing, this still violates the PR's stated goal of eliminating unsafe Load→Store patterns.

♻️ Suggested direction using UpsertBudgetConfig
 for i := range clone.Budgets {
-    // Preserve existing usage from memory
-    if existingBudgetValue, exists := gs.budgets.Load(clone.Budgets[i].ID); exists && existingBudgetValue != nil {
-        if existingBudget, ok := existingBudgetValue.(*configstoreTables.TableBudget); ok && existingBudget != nil {
-            clone.Budgets[i].CurrentUsage = existingBudget.CurrentUsage
-            clone.Budgets[i].LastReset = existingBudget.LastReset
-        }
-    }
-    gs.budgets.Store(clone.Budgets[i].ID, &clone.Budgets[i])
+    gs.UpsertBudgetConfig(ctx, clone.Budgets[i].ID, &clone.Budgets[i])
 }

This delegates usage-preservation to the CAS-based helper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/governance/store.go` around lines 2332 - 2339, The Load→Store in
UpdateVirtualKeyInMemory can race with BumpBudgetUsage and lose increments;
replace the manual budgets.Load + budgets.Store sequence for each
clone.Budgets[i] with a call to UpsertBudgetConfig passing the new budget config
so the helper preserves CurrentUsage atomically, or implement a CAS loop on
gs.budgets (Load, merge CurrentUsage from existing, attempt Store via
CompareAndSwap) to ensure updates are atomic; apply the same change pattern to
the analogous places UpdateTeamInMemory, UpdateCustomerInMemory,
UpdateModelConfigInMemory, and UpdateProviderInMemory.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@plugins/governance/store.go`:
- Around line 2749-2756: The budget update logic only copies CurrentUsage and
omits LastReset, which can cause races; in both UpdateModelConfigInMemory and
UpdateProviderInMemory locate the block that loads from gs.budgets with
gs.budgets.Load(clone.Budget.ID) and, after confirming the loaded value is a
*configstoreTables.TableBudget (existingBudgetValue -> eb), also assign
clone.Budget.LastReset = eb.LastReset (in addition to clone.Budget.CurrentUsage
= eb.CurrentUsage) before calling gs.budgets.Store(clone.Budget.ID,
clone.Budget); apply the same change to the other occurrence referenced (the
block at the other location).

---

Nitpick comments:
In `@plugins/governance/store.go`:
- Around line 2332-2339: The Load→Store in UpdateVirtualKeyInMemory can race
with BumpBudgetUsage and lose increments; replace the manual budgets.Load +
budgets.Store sequence for each clone.Budgets[i] with a call to
UpsertBudgetConfig passing the new budget config so the helper preserves
CurrentUsage atomically, or implement a CAS loop on gs.budgets (Load, merge
CurrentUsage from existing, attempt Store via CompareAndSwap) to ensure updates
are atomic; apply the same change pattern to the analogous places
UpdateTeamInMemory, UpdateCustomerInMemory, UpdateModelConfigInMemory, and
UpdateProviderInMemory.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a549fe6c-3dab-4cd0-aa07-d70e3994d9b3

📥 Commits

Reviewing files that changed from the base of the PR and between de9ab68 and 36a5b0c.

📒 Files selected for processing (2)
  • plugins/governance/store.go
  • plugins/governance/store_concurrency_test.go
✅ Files skipped from review due to trivial changes (1)
  • plugins/governance/store_concurrency_test.go

coderabbitai[bot]
coderabbitai Bot previously approved these changes Apr 20, 2026
Copy link
Copy Markdown
Contributor Author

akshaydeo commented Apr 20, 2026

Merge activity

  • Apr 20, 6:39 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Apr 20, 6:47 AM UTC: Graphite rebased this pull request as part of a merge.
  • Apr 20, 6:48 AM UTC: @akshaydeo merged this pull request with Graphite.

@akshaydeo akshaydeo changed the base branch from 04-20-team_budget_db_structure_refactor to graphite-base/2844 April 20, 2026 06:43
@akshaydeo akshaydeo changed the base branch from graphite-base/2844 to main April 20, 2026 06:45
@akshaydeo akshaydeo dismissed coderabbitai[bot]’s stale review April 20, 2026 06:45

The base branch was changed.

@akshaydeo akshaydeo force-pushed the 04-20-atomic_updates_to_budgets_and_ratelimits branch from 36a5b0c to 0c86605 Compare April 20, 2026 06:47
@akshaydeo akshaydeo merged commit 441601f into main Apr 20, 2026
11 of 17 checks passed
@akshaydeo akshaydeo deleted the 04-20-atomic_updates_to_budgets_and_ratelimits branch April 20, 2026 06:48
dominictayloruk pushed a commit to dominictayloruk/bifrost that referenced this pull request Apr 21, 2026
## Summary

In-memory budget and rate-limit counters were updated via a plain Load → clone → mutate → Store pattern, which is not safe under concurrent callers: two goroutines racing on the same entry could silently drop one of the increments. This PR replaces that pattern with CAS-retry loops that guarantee every increment is applied exactly once.

## Changes

- Introduced `BumpBudgetUsage` and `BumpRateLimitUsage` as the single authoritative write path for incrementing budget and rate-limit counters. Both use `sync.Map.CompareAndSwap` with a retry loop so concurrent callers never lose an update. Both also handle rolling-window resets inline (zeroing usage and advancing `LastReset` when the reset duration has elapsed).
- Introduced `ResetBudgetAt` and `ResetRateLimitAt` as the single authoritative write path for periodic resets. Each is conditional on the stored `LastReset` being strictly older than the target, so concurrent resetters collapse into a single successful write and the loser is a clean no-op.
- Refactored `UpdateVirtualKeyBudgetUsageInMemory`, `UpdateProviderAndModelBudgetUsageInMemory`, `UpdateProviderAndModelRateLimitUsageInMemory`, and `UpdateVirtualKeyRateLimitUsageInMemory` to delegate to `BumpBudgetUsage` / `BumpRateLimitUsage` instead of duplicating the clone-and-store logic.
- Refactored `ResetExpiredBudgetsInMemory` and `ResetExpiredRateLimitsInMemory` to delegate to `ResetBudgetAt` / `ResetRateLimitAt`. The Range loop now only decides *whether* a reset is due; the CAS methods handle the actual write and skip redundant work when another writer has already advanced the timestamp.
- Extracted a `resolvePeriodStart` helper inside `ResetExpiredRateLimitsInMemory` to unify calendar-aligned and rolling-duration reset logic for both token and request counters, eliminating the previous duplicated branches.

## Type of change

- [ ] Bug fix
- [x] Refactor
- [ ] Feature
- [ ] Documentation
- [ ] Chore/CI

## Affected areas

- [ ] Core (Go)
- [ ] Transports (HTTP)
- [ ] Providers/Integrations
- [x] Plugins
- [ ] UI (React)
- [ ] Docs

## How to test

```sh
go test ./plugins/governance/...
```

Concurrent correctness can be validated by running the governance store tests under the race detector:

```sh
go test -race ./plugins/governance/...
```

## Breaking changes

- [ ] Yes
- [x] No

## Security considerations

No auth, secrets, or PII changes. The fix closes a race window where concurrent requests could under-count usage against a budget or rate limit, which previously could allow spend or request volume to exceed configured limits.

## Checklist

- [ ] I read `docs/contributing/README.md` and followed the guidelines
- [ ] I added/updated tests where appropriate
- [ ] I updated documentation where needed
- [ ] I verified builds succeed (Go and UI)
- [ ] I verified the CI pipeline passes locally if applicable
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants