[Cosmos] Fix excess GetDatabaseAccount calls and eliminate data-plane fallback to customer endpoint (#25468) by tvaron3 · Pull Request #26815 · Azure/azure-sdk-for-go

tvaron3 · 2026-05-19T22:15:37Z

Fix excess `GetDatabaseAccount` calls and trim data-plane fallback to default endpoint

Customers using preferred regions reported thousands of GetDatabaseAccount (account topology) HTTP calls per 5-minute window where the contract is one. Investigation surfaced multiple compounding root causes plus a related issue: data-plane retries could trail into the customer-supplied (default) endpoint even when full regional metadata was available.

Root causes

`GetDatabaseAccount` storm

globalEndpointManager.Update only advanced lastUpdateTime on success. Any failed refresh left ShouldRefresh()==true, so every subsequent caller re-issued the HTTP call immediately.
globalEndpointManagerPolicy.Do spawned a goroutine per request whenever ShouldRefresh() was true, with no in-flight coalescing.
A failed initial sync.Once bootstrap latched done=true, pinning the client into the async-refresh herd path forever.
locationCache.readEndpoints/writeEndpoints could self-deadlock by acquiring mapMutex.RLock and then calling update(), which takes mapMutex.Lock via refreshStaleEndpoints (Go's sync.RWMutex cannot upgrade RLock to Lock).
Async refresh failures were silently discarded — post-bootstrap topology drift was unobservable.

Data-plane retry trailing into the default endpoint

getPrefAvailableEndpointsLocked appended the customer-supplied default endpoint as a trailing entry in every preferred route list, so retry traversal eventually issued data-plane requests there even with full regional metadata.

Changes

`cosmos_global_endpoint_manager.go`

Add lastAttemptTime; shouldRefresh() honours max(lastUpdateTime, lastAttemptTime). Failures throttle just like successes.
Replace the simple mutex in Update with a per-flight *updateFlight{done, err} for true single-flight: concurrent callers coalesce to one HTTP call and share its error.
Leader's HTTP call runs under context.WithoutCancel(ctx) so an unrelated caller-side cancellation does not poison the shared flight result for other coalesced waiters. Waiters still respect their own ctx.Done().
Add invalidationGen: leader snapshots it before the HTTP call and declines to commit timestamps if it changed during the flight, so an invalidation during an in-flight refresh is not lost.
Add lastUpdateErr: throttled Update returns the cached error while the GEM has never been successfully populated, so a chronic bootstrap failure surfaces on every request rather than silently swallowing.
Add populated() / invalidate() helpers and everPopulated (atomic), so a transient post-invalidation refresh failure does not stall the data plane.
MarkEndpointUnavailableFor{Read,Write} invalidate once per newly-unavailable endpoint (snapshot wasUnavailable atomically with the mark inside the location cache).

`cosmos_global_endpoint_manager_policy.go`

Drop sync.Once. Synchronous bootstrap is gated on !gem.populated() and coalesced via gem.Update's single-flight. A failed bootstrap is naturally retried on the next request, throttled by lastAttemptTime.
Async refresh goroutine: gated by asyncRefreshPending atomic.Bool so concurrent callers don't each spawn a goroutine; panic-safe recover with logging; failures are now logged so chronic post-bootstrap topology drift is observable; refreshCtx captured before the goroutine launches so it doesn't depend on req's lifetime.

`cosmos_location_cache.go`

Split into Locked variants (updateLocked, refreshStaleEndpointsLocked, isEndpointUnavailableLocked, getPrefAvailableEndpointsLocked).
readEndpoints / writeEndpoints now snapshot the staleness flag under RLock, release, then call update which acquires the write lock itself. Fixes the self-deadlock.
update() picks regional fallbacks: writeFallback = availWriteEndpointsByLocation[availWriteLocations[0]] and readFallback = availReadEndpointsByLocation[availReadLocations[0]]. Only the degenerate "zero regions" case falls back to lc.defaultEndpoint.
getPrefAvailableEndpointsLocked no longer appends the fallback as a trailing entry in every list; it only appears via the empty-list guard.
resolveServiceEndpoint is unchanged behaviorally: with enableCrossRegionRetries=false single-master writes still route to the default endpoint, preserving long-standing semantics.
markEndpointUnavailable fast-path: skips the full updateLocked recompute when wasAlreadyUnavailable==true (the route lists cannot change in that case).
Public CanUseMultipleWriteLocs (locked) added; internal canUseMultipleWriteLocs keeps the lock-free contract for already-locked callers.

`cosmos_client_retry_policy.go`

Minor: write-retry path keeps gem.Update(ctx, true) so a regional 403/WriteForbidden always force-refreshes (preserved intentionally).

CHANGELOG

Single tightened "Bugs Fixed" bullet under 1.5.0-beta.7 summarizing both fixes.

Tests

cosmos_dbaccount_refresh_test.go (new) — regression tests, all passing under -race:

TestFix1_FailedUpdateIsThrottled
TestFix1b_InvalidateDuringInflightRefreshIsHonored
TestFix2_ConcurrentUpdateCallersCoalesce
TestFix3_WriteRetryForceRefreshesGEM
TestFix3b_BootstrapFailureIsSurfacedOnEveryRequestUntilThrottleExpires
TestFix3c_ConcurrentSameEndpointMarksAreBounded
TestFix4_InitialBootstrapFailureIsRetriedAndThrottled
TestFix5_ReadEndpointsDoesNotDeadlock
TestFix7_InvalidateThenRefreshFailureDoesNotStallDataPlane
TestFix8_UpdateWaiterRespectsContextCancellation
TestRegression_HealthyHighConcurrencyStaysAtOneGEMCall (500 concurrent → 1 HTTP call)
TestRegression_FailingGEMHighConcurrencyStaysAtOneGEMCall (500 concurrent against failing GEM → 1 HTTP call)
TestDefaultEndpointElim_DataPlaneNeverHitsDefaultWhenPopulated
TestDefaultEndpointElim_SessionRetryOnSingleMasterRoutesToRegionalWrite
TestDefaultEndpointElim_UnmatchedPreferredRegionRoutesToRegional
TestDefaultEndpointElim_ZeroWriteRegionsRetainsDefaultFallback (documented degenerate exception)
TestDefaultEndpointElim_CrossRegionRetriesDisabledUsesDefaultEndpoint (documents preserved behavior)
TestDefaultEndpointElim_ZeroWriteRegionsReadGoesToReadRegion

Existing tests: 4 expectations in cosmos_location_cache_test.go updated to drop trailing default-endpoint entries from route lists; TestAddedAllowTentativeHeaderGEMPolicy updated to reset lastAttemptTime and wait deterministically for the async refresh.

Full sdk/data/azcosmos suite: green with -race (~32 s).

Reviews

Three deep-review passes completed (PR Deep Reviewer agent from sdk-copilot-toolkit). All actionable findings addressed:

Async refresh error swallow → now logged
Leader ctx cancellation contaminating waiters → leader runs under context.WithoutCancel
markEndpointUnavailable no-op redundant updateLocked → fast-path skip
Goroutine req lifetime dependency → refreshCtx captured before go

Out of scope (filed separately / future PRs)

Pre-existing data race on locationCache.locationInfo reads in resolveServiceEndpoint/getLocation (broader refactor; this PR fixes the adjacent enableMultipleWriteLocations read race surfaced by TestAddedAllowTentativeHeaderGEMPolicy).
Multi-master with N distinct newly-unavailable endpoints in one retry burst can still cause up to N refreshes (bounded by TestFix3c).
Exponential throttle backoff on consecutive GEM failures (currently a flat refreshTimeInterval).
Lock-order documentation banners across the GEM/LC files.

Validation matrix

Check	Result
`go build ./...`	✅
`go test -race -count=1 ./...`	✅ ~32 s
`go vet ./...`	✅
`gofmt -l .`	✅ clean
Deep-review passes	3× completed; all actionable findings addressed

…back (Azure#25468) Fixes issue Azure#25468 (excess GetDatabaseAccount HTTP calls with preferred regions) plus the related default-endpoint fallback in data-plane routing. Root causes addressed: * GEM Update only advanced lastUpdateTime on success, so any failed refresh caused every subsequent caller to re-issue the HTTP call immediately (no 5-min throttle on failures). * globalEndpointManagerPolicy.Do spawned a goroutine per request whenever ShouldRefresh() was true, with no in-flight coalescing. * attemptRetryOnEndpointFailure passed isWriteOperation as forceRefresh, so every retried write 403 force-refreshed GEM regardless of the refresh interval. * A failed initial sync.Once bootstrap latched done=true, pinning the client into the async-refresh herd path forever. * locationCache.readEndpoints/writeEndpoints could self-deadlock by acquiring RLock and then calling update() which takes Lock(). * getPrefAvailableEndpointsLocked appended the customer-supplied default endpoint as a trailing fallback in every route list, so retry traversal eventually issued data-plane requests there. * resolveServiceEndpoint fell back to the default endpoint whenever enableCrossRegionRetries was false even with non-empty availWriteLocations. Changes: * globalEndpointManager: add lastAttemptTime, per-flight singleflight, invalidationGen, lastUpdateErr, populated(), invalidate(); MarkEndpoint Unavailable* now invalidates once per newly-unavailable endpoint. * globalEndpointManagerPolicy: drop sync.Once; bootstrap is gated on populated() and coalesced via gem.Update's singleflight. * clientRetryPolicy: write retries pass forceRefresh=false. * locationCache: split Locked variants; readEndpoints/writeEndpoints drop the RLock before calling update; update() chooses regional write/read fallbacks; resolveServiceEndpoint always picks a regional endpoint when availWriteLocations is non-empty (cross-region-retries flag now only gates retry walking); getPrefAvailableEndpointsLocked no longer appends a trailing fallback. * Only remaining data-plane -> default-endpoint route is the degenerate 'zero write regions on a write request' case. Tests: * cosmos_dbaccount_refresh_test.go (new): 12 regression tests covering throttle-on-failure, single-flight coalescing, write-retry single refresh, in-flight invalidation, cached bootstrap error, deadlock-free readEndpoints, default-endpoint elimination across resource types and configurations, plus 500-concurrency soak tests on healthy and failing GEMs. * cosmos_location_cache_test.go: update 4 expectations to drop trailing default-endpoint entries from route lists. All tests pass with -race. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…st hardening) Follow-up to PR Azure#26815 / issue Azure#25468 addressing findings from the deep-review pass: 1. Panic-safe gem.Update: wrap refreshOnce in a deferred cleanup so a panic in the HTTP pipeline (or any code refreshOnce transitively calls) cannot leak gem.inflight / leave flight.done unclosed -- which would have permanently wedged every subsequent Update caller. The panic is re-raised after cleanup. 2. Atomic check-and-mark in locationCache.markEndpointUnavailable: the helper now returns wasAlreadyUnavailable from inside the same mapMutex critical section that performs the mark. This eliminates the check-then-act race between MarkEndpointUnavailableFor* and isEndpointUnavailable that previously let multiple concurrent markers each call invalidate(). Tightens TestFix3c bound from <=3 to a provable <=2. 3. TestFix4 hardening: assert require.Error on each follow-up request, so the test cannot pass if a regression made populated() return true and silently skip the bootstrap. All tests pass with -race; gofmt and vet clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… ctx, async panic) Follow-up to PR Azure#26815 / issue Azure#25468 addressing findings from the PR Deep Reviewer pass. BLOCKING finding addressed: * invalidate() zeroed lastUpdateTime, which made populated() return false. If the post-invalidate refresh then failed, every subsequent request was blocked for refreshTimeInterval (5 min default) by populated()==false + cached error -- even though the locationCache still held a valid regional topology. This was a regression introduced by this PR. Fix: add everPopulated bool, set true on first successful refresh and never reset. populated() reads everPopulated instead of the timestamp. lastUpdateErr is now only surfaced to throttled callers the cached topology while a refresh attempt is throttled after a failure. New regression test TestFix7_InvalidateThenRefreshFailure DoesNotStallDataPlane locks in the behaviour. Recommendations addressed: * gem.Update waiters now select between flight completion and ctx.Done(), so a caller-side timeout cannot be exceeded by an unrelated stuck refresh on a different goroutine. New regression test TestFix8_UpdateWaiterRespectsContextCancellation. * Async refresh goroutine in globalEndpointManagerPolicy.Do now recovers panics, so a panic re-raised by gem.Update's panic-safe defer (added in the previous commit) cannot bring down the host process via the detached goroutine. Test: * Tightened TestFix3c bound from <=2 to ==1 -- the combination of the atomic check-and-mark and the in-flight singleflight makes the upper bound provably 1 HTTP call for the tested fixture. All 14 fix tests pass with -race; gofmt and vet clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Idiomatic improvements identified by the PR Deep Reviewer's Go-craft pass. No functional behaviour changes. cosmos_global_endpoint_manager.go: * everPopulated is now atomic.Bool, so the policy's hot-path populated() check is lock-free instead of contending on gemMutex against Update's critical section. * Hoisted the panic-safe defer out of an IIFE in Update -- a plain defer at function scope has identical semantics with less ceremony. * Wrap GetAccountProperties / locationCache.update errors with %w so callers can errors.As / errors.Is the underlying response error when the cached error is surfaced via lastUpdateErr. * Added hasInflight() test helper so tests don't peek at gemMutex/ inflight directly; keeps the test surface stable across future refactors of the internal sync primitives. * One-line comment documenting that lastUpdateErr is intentionally shared across force=true and force=false callers. cosmos_global_endpoint_manager_policy.go: * Gate the async refresh goroutine spawn with an atomic.Bool CAS. Without this, every request that observed ShouldRefresh()==true during a burst spawned its own goroutine that then queued as a waiter inside gem.Update. Now N→1 per refresh-window. * Recovered panic is logged via the azcore log facility instead of being silently dropped, so production crashes remain triageable. Captures the stack trace too. cosmos_dbaccount_refresh_test.go: * TestFix1b: replaced racy 30ms time.Sleep with require.Eventually polling the new hasInflight() helper. * TestFix8: replaced manual mutex poll loop with require.Eventually + hasInflight(). * TestFix2: include goroutine index in the failure message so a sporadic single-goroutine failure is easier to debug. * Gated the 500-goroutine soak tests and the 2-second TestFix8 behind testing.Short() so the dev-loop cost is opt-in. All tests pass with -race; -short also passes (skips three slow tests as designed). gofmt and vet clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…iew #1,#2,Azure#4,Azure#13, comment cleanup Behavioral reverts (preserve long-standing semantics): - cosmos_client_retry_policy.go: write-retry path uses gem.Update(ctx, true) to force-refresh per attempt (intentional; restores prior behavior). - cosmos_location_cache.go: resolveServiceEndpoint with enableCrossRegionRetries=false returns lc.defaultEndpoint for single-master writes (avoids a breaking change). Fixes addressing prior deep-review findings: - cosmos_global_endpoint_manager_policy.go: * Log async refresh failures (was discarded via '_ ='); chronic post-bootstrap topology drift is now observable. * Capture refreshCtx via context.WithoutCancel BEFORE launching the goroutine so it does not depend on req's lifetime. - cosmos_global_endpoint_manager.go: leader's HTTP call runs under context.WithoutCancel(ctx); waiters still respect their own ctx so caller-side cancellation no longer poisons coalesced waiters. - cosmos_location_cache.go markEndpointUnavailable: fast-path skips the full updateLocked recompute when wasAlreadyUnavailable==true (route lists cannot change in that case). Tests and docs: - TestFix3 renamed/inverted to TestFix3_WriteRetryForceRefreshesGEM. - TestDefaultEndpointElim_CrossRegionRetriesDisabled* renamed/inverted to assert the restored default-endpoint behavior. - TestRegression25468_* renamed to TestRegression_*. - Comments stripped of Azure#25468 references and tightened. - CHANGELOG updated to link to PR 26815 and reflect the reverted cross-region-retries=false behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Resolved conflict in sdk/data/azcosmos/CHANGELOG.md: moved the PR 26815 entry into the unreleased 1.5.0-beta.7 section and tightened the two prior bullets into a single 'Bugs Fixed' summary. Test fixups for upstream test that broke against the new throttle: - TestAddedAllowTentativeHeaderGEMPolicy now also resets lastAttemptTime (shouldRefresh now honours max(lastUpdateTime, lastAttemptTime)). - Added require.Eventually wait on CanUseMultipleWriteLocs so the async refresh deterministically lands before the next request. Fixed a pre-existing data race surfaced by that test: - locationCache.canUseMultipleWriteLocs reads enableMultipleWriteLocations without the lock. Added CanUseMultipleWriteLocs which acquires mapMutex.RLock; lowercase remains for already-locked callers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Under some schedulers (observed on windows_go_1261 CI), every concurrent gem.Update can complete observing the still-fresh lastUpdateTime before any of the concurrent MarkEndpointUnavailableForWrite goroutines fires invalidate(), leaving the post-invalidation refresh un-triggered and the HTTP call count at 0 instead of the expected 1. Add an explicit gem.Update after wg.Wait() to deterministically trigger the post-invalidation refresh. The singleflight + invalidationGen bound still guarantees the burst-plus-trailing-Update collapses to exactly one HTTP round-trip, so the assertion (transport.count == 1) stays tight. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR addresses excessive Cosmos DB GetDatabaseAccount (topology) calls under preferred regions and prevents data-plane retry traversal from falling back to the customer-supplied default endpoint when regional metadata is available.

Changes:

Implemented throttled, coalesced GEM refresh with in-flight single-flight coordination and improved invalidation semantics.
Refactored locationCache locking to remove a self-deadlock risk and adjusted preferred endpoint selection to avoid trailing into the default endpoint.
Added a comprehensive regression test suite covering high-concurrency refresh behavior, invalidation timing, and default-endpoint elimination.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
sdk/data/azcosmos/cosmos_location_cache.go	Refactors locking, removes self-deadlock, and updates endpoint selection/fallback behavior.
sdk/data/azcosmos/cosmos_location_cache_test.go	Updates expectations to match new route list behavior and new method signatures.
sdk/data/azcosmos/cosmos_global_endpoint_manager.go	Adds single-flight Update coalescing, attempt throttling, invalidation generation tracking, and bootstrap error surfacing.
sdk/data/azcosmos/cosmos_global_endpoint_manager_test.go	Adjusts tests for new throttling fields and async refresh determinism.
sdk/data/azcosmos/cosmos_global_endpoint_manager_policy.go	Removes `sync.Once` bootstrap, gates async refresh goroutine spawning, and logs async refresh failures.
sdk/data/azcosmos/cosmos_dbaccount_refresh_test.go	New regression suite validating throttling/coalescing, deadlock avoidance, and default-endpoint elimination.
sdk/data/azcosmos/cosmos_client_retry_policy.go	Ensures endpoint-failure retries force-refresh the GEM.
sdk/data/azcosmos/CHANGELOG.md	Documents the fixes under `1.5.0-beta.7`.

- getPrefAvailableEndpointsLocked now takes prefLocations as a parameter rather than reading lc.locationInfo.prefLocations. updateLocked passes nextLoc.prefLocations so the route lists are computed from a single consistent in-progress snapshot, not the still-committed lc.locationInfo. - countingTransport.Do now sets Response.Status (e.g. '200 OK') and ContentLength so simulated responses match net/http behavior and SDK helpers like NewResponseErrorWithErrorCode produce useful error text. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Resolved CHANGELOG.md conflict by keeping both PR 26815 and PR 26855 'Bugs Fixed' bullets under 1.5.0-beta.7. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kushagraThapar

LGTM, thanks @tvaron3

Includes Azure#26815 (GEM coalescing + invalidation). With that in place, the ambiguous-write / single-master-write give-up branch in attemptRetryOnNetworkError can rely on MarkEndpointUnavailable*'s implicit invalidation instead of a synchronous gem.Update(ctx, true). Added an inline comment explaining the choice (per Kushagra's review on Azure#26858).

github-actions Bot added the Cosmos label May 19, 2026

Tomas Varon and others added 6 commits May 19, 2026 19:58

tvaron3 marked this pull request as ready for review May 22, 2026 02:33

tvaron3 requested a review from a team as a code owner May 22, 2026 02:33

Copilot AI review requested due to automatic review settings May 22, 2026 02:33

Copilot started reviewing on behalf of tvaron3 May 22, 2026 02:34 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread sdk/data/azcosmos/cosmos_dbaccount_refresh_test.go

Comment thread sdk/data/azcosmos/cosmos_location_cache.go Outdated

tvaron3 and others added 3 commits May 21, 2026 19:42

azcosmos: consolidate CHANGELOG entry to one line

0da5418

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Merge upstream/main into tvaron3/25468-dbaccount-rootcause

7a5c1ac

Resolved CHANGELOG.md conflict by keeping both PR 26815 and PR 26855 'Bugs Fixed' bullets under 1.5.0-beta.7. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kushagraThapar mentioned this pull request May 22, 2026

azcosmos: cross-region retry policy for connection errors and 408s #26858

Merged

kushagraThapar approved these changes May 22, 2026

View reviewed changes

tvaron3 merged commit 2c530d7 into Azure:main May 22, 2026
20 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Cosmos] Fix excess GetDatabaseAccount calls and eliminate data-plane fallback to customer endpoint (#25468)#26815

[Cosmos] Fix excess GetDatabaseAccount calls and eliminate data-plane fallback to customer endpoint (#25468)#26815
tvaron3 merged 10 commits into
Azure:mainfrom
tvaron3:tvaron3/25468-dbaccount-rootcause

tvaron3 commented May 19, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

kushagraThapar left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tvaron3 commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix excess GetDatabaseAccount calls and trim data-plane fallback to default endpoint

Root causes

GetDatabaseAccount storm

Data-plane retry trailing into the default endpoint

Changes

cosmos_global_endpoint_manager.go

cosmos_global_endpoint_manager_policy.go

cosmos_location_cache.go

cosmos_client_retry_policy.go

CHANGELOG

Tests

Reviews

Out of scope (filed separately / future PRs)

Validation matrix

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

kushagraThapar left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tvaron3 commented May 19, 2026 •

edited

Loading

Fix excess `GetDatabaseAccount` calls and trim data-plane fallback to default endpoint

`GetDatabaseAccount` storm

`cosmos_global_endpoint_manager.go`

`cosmos_global_endpoint_manager_policy.go`

`cosmos_location_cache.go`

`cosmos_client_retry_policy.go`