[azcosmos] unblock cross-region failover from synchronous gem.Update by tvaron3 · Pull Request #26889 · Azure/azure-sdk-for-go

tvaron3 · 2026-05-26T21:56:19Z

What & Why

Three retry paths in the Cosmos client retry policy (cosmos_client_retry_policy.go) used to gate the cross-region retry on a successful synchronous gem.Update(ctx, true) against the account's global endpoint. During a regional outage the global FQDN typically resolves to the same regional FE pool we just marked unavailable (Cosmos accounts in a region share FEs), so the forced refresh stalls on a connect timeout and surfaces that error instead of attempting the failover.

This PR also fixes two interlocking bugs in attemptRetryOnNetworkError that prevented the failover from actually routing to another region even when the refresh succeeded, and tightens a few related correctness issues.

Changes

attemptRetryOnNetworkError (connection error) — dropped the forced gem.Update. MarkEndpointUnavailable* already calls locationCache.updateLocked which recomputes readEndpoints / writeEndpoints with the unavailable endpoint demoted; invalidate() arms the next non-force Update for real topology changes. Also left retryCount at 0 after the failover — getPrefAvailableEndpointsLocked demotes unavailable endpoints to the tail of the route list rather than removing them, so retryCount=1 (after retryCount += 1) re-indexed back to the just-marked unhealthy endpoint via ResolveServiceEndpoint(1 % 2).
attemptRetryOnEndpointFailure (HTTP 403 / WriteForbidden) — replaced the blocking gem.Update with a fire-and-forget asyncForceRefreshGEM helper. The helper uses a CompareAndSwap gate so a retry storm cannot leak a goroutine per attempt, runs on context.Background() so a near-expired caller deadline cannot abort the refresh, and has panic recovery + error logging.
attemptRetryOnRequestTimeout (HTTP 408) — removed both MarkEndpointUnavailableForRead and the forced gem.Update. A 408 is a per-request signal (the gateway accepted the request but did not finish in time), not regional unhealthiness; demoting the whole region for unavailableLocationExpirationTime based on one slow request would penalize concurrent reads. The outer loop's retryCount++ already advances the location index for the cross-region retry.
URL-key normalization in locationCache — MarkEndpointUnavailable* was called with the full request URL (path/query/RawPath), but the unavailability map was keyed by url.URL struct equality while availReadEndpointsByLocation / availWriteEndpointsByLocation only stored scheme+host. Marks were silently written under a key nothing else looked up, so isEndpointUnavailableLocked always returned false and the demote was a no-op. Now normalized to scheme+host via a new internal endpointKey helper.
locationCache.resolveServiceEndpoint — now takes mapMutex.RLock. Reads of locationInfo / enableMultipleWriteLocations race the writes in update/updateLocked, which the new async refresh paths surface deterministically under go test -race.

Repro that motivated the fix

Reproduced live by running azcosmos_perf in AKS with iptables OUTPUT DROP rules applied via kubectl debug --profile=netadmin to block the workload account's westus2 FE IPs (20.42.169.199, 20.42.170.147, 20.9.156.133).

Account topology:

Workload account: write=westus2, reads=[westus2, westus]
Client args: --application-region westus2 --preferred-regions westus

For 10 minutes:

ss -tan inside the pod showed every goroutine in SYN_SENT to a blocked westus2 IP. Zero TCP attempts to 40.112.241.99 (the westus replica).
Per-op error rate ≈ 15/min/op across concurrency=50 → ~18–22 s wall-clock per failed op, matching the budgeted same-region retries (3 × (5s + 1s backoff)) plus one failover attempt whose gem.Update itself timed out.
Effective throughput went to zero and stayed there for the full window.

After the fix, the cache mutation from MarkEndpointUnavailable* is enough for the next iteration's ResolveServiceEndpoint to return westus, and the cross-region retry actually executes.

Regression tests

TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable — pins the GEM mock to net.DNSError so any forced refresh would fail; asserts the 5th-attempt 200 from the failover server is returned. Pre-fix the request surfaces a DNS error instead.
TestFix3a_WriteRetrySucceedsWhenGEMRefreshFails — 403 retry returns (true, nil) even when GEM transport returns 500 on every call.
TestFix3b_RequestTimeoutDoesNotMarkEndpointUnavailable — after a 408 retry, locationUnavailabilityInfoMap is empty and the GEM transport was never invoked.
TestFix3c_RequestTimeoutDoesNotBlockOnStalledGEM — with a 10-minute hanging GEM transport, attemptRetryOnRequestTimeout returns within 5 s.
TestFix3d_AsyncRefreshCASGateCoalescesRetryStorm — 50 concurrent asyncForceRefreshGEM() calls produce far fewer than concurrency/2 HTTP refreshes.
TestFix3e_AsyncRefreshIgnoresCallerContextCancellation — an already-cancelled caller context still triggers a successful background refresh (proves context.Background()).
TestFix3_WriteRetryForceRefreshesGEM updated to use require.Eventually since the refresh is now async and the singleflight + CAS gate legitimately coalesce concurrent forced refreshes.

Backwards compatibility

No public API changes.
The single-master-write branch of attemptRetryOnNetworkError already had a comment explicitly noting that no gem.Update is needed here (added in [Cosmos] Fix excess GetDatabaseAccount calls and eliminate data-plane fallback to customer endpoint (#25468) #26815). This change extends the same reasoning to the read / multi-master-write failover branch and to the 403 / 408 paths.

CHANGELOG

Entry added under 1.5.0-beta.7 → Bugs Fixed.

Testing

go test -count=1 -timeout 300s -race ./...
ok  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos    71.863s

…ta refresh attemptRetryOnNetworkError used to call gem.Update(ctx, true) after MarkEndpointUnavailable* to force a topology refresh before the cross-region retry. When the regional gateway outage that triggered the failover also takes out the global endpoint's address -- the common case, since the global account FQDN typically resolves to the same regional FE pool as the write region -- the forced refresh dials a blocked address, hits the 5s connect timeout, and returns an error. attemptRetryOnNetworkError then returns (false, err) and surfaces the original connection failure to the caller, never attempting the actual cross-region retry against the next preferred region. The forced refresh added no routing information: MarkEndpointUnavailable* already invokes locationCache.updateLocked, which recomputes readEndpoints / writeEndpoints with the unavailable endpoint demoted, so the next ResolveServiceEndpoint(retryCount+1, ...) picks the failover region on its own. A connection error indicates regional unhealthiness, not a topology change, so the synchronous refresh is unnecessary here. The invalidate() inside MarkEndpointUnavailable* still ensures the next non-force Update will pick up any real topology changes once the global endpoint is reachable again. Adds TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable that pins the gemServer to net.DNSError so any forced refresh would fail. Pre-fix this test fails (the request surfaces a DNS error after 3 same-region retries instead of the 5th-attempt 200); post-fix the failover succeeds. Reproduced live by running azcosmos_perf in AKS with iptables OUTPUT DROP rules against the workload account's westus2 FE IPs. Without this fix, no goroutine ever attempted a TCP connection to the westus replica even though westus was a configured preferred read region. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR adjusts the azcosmos client retry policy so that cross-region failover for connection errors is no longer blocked by a forced global-endpoint metadata refresh (which can fail during regional gateway outages), ensuring the failover retry actually occurs.

Changes:

Removed the synchronous gem.Update(ctx, true) gating from attemptRetryOnNetworkError after marking an endpoint unavailable.
Added a regression test covering failover behavior when the global endpoint is unreachable.
Added a CHANGELOG entry under 1.5.0-beta.7 describing the fix.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`sdk/data/azcosmos/cosmos_client_retry_policy.go`	Stops forcing a GEM refresh on the connection-error failover path so failover isn’t blocked by unreachable metadata endpoint.
`sdk/data/azcosmos/cosmos_client_retry_policy_test.go`	Adds a regression test ensuring read failover proceeds even when the global endpoint can’t be reached.
`sdk/data/azcosmos/CHANGELOG.md`	Documents the bug fix in the unreleased changelog.

…vailability map Repro on a real two-region account showed the original fix wasn't enough: with iptables OUTPUT DROP rules blocking the workload account's westus2 FE IPs, the SDK still kept every goroutine pinned to a blocked westus2 IP and never attempted the westus replica. ss -tan inside the pod showed zero TCP attempts to the configured preferred-region replica. Investigation surfaced two more bugs in the same code path: 1. After MarkEndpointUnavailable* the policy did retryCount += 1. getPrefAvailableEndpointsLocked demotes unavailable endpoints to the TAIL of readEndpoints / writeEndpoints rather than removing them, so readEndpoints becomes [healthy, unhealthy]. ResolveServiceEndpoint uses retryCount as an index, so the bumped retryCount=1 made ResolveServiceEndpoint(1 % 2) return the just-marked unhealthy endpoint at the tail -- the 'failover' attempt would hit the same dead region again. 2. MarkEndpointUnavailable* was called with the full request URL (path, query, RawPath included). The unavailability map was keyed by url.URL struct equality, but the cache's availReadEndpointsByLocation / availWriteEndpointsByLocation stored only scheme+host base URLs. The marks were therefore written under keys nothing else looked up, so isEndpointUnavailableLocked always returned false and the demote step silently did nothing. The previous test suite did not catch this because CreateMockLC pointed every region at the same defaultEndpoint URL, masking the routing decision entirely. Fixes: * cosmos_client_retry_policy.go: drop the retryCount += 1 in attemptRetryOnNetworkError after the mark, with a comment explaining why. retryCount stays at 0 so the next ResolveServiceEndpoint returns readEndpoints[0] -- the just-promoted preferred region. * cosmos_location_cache.go: introduce endpointKey(u url.URL) url.URL that returns {Scheme, Host}. markEndpointUnavailableFor{Read,Write} and isEndpointUnavailableLocked use it so mark and check operations agree on identity regardless of the input URL's path/query. * cosmos_client_retry_policy_test.go: - Flip the retryCount assertion from 1 to 0 in five existing connection-error tests (the two 408 tests are deliberately not touched -- the 408 retry path is separate and gets bumped in the Do loop). - Replace the regression test body with a version that wires TWO distinct mock servers (badSrv = original/unhealthy region, goodSrv = failover region) via a small routingMockTransport that dispatches by URL host. Asserts exactly 4 requests hit badSrv and exactly 1 hits goodSrv. Catches all three pre-fix conditions. * CHANGELOG.md: rewrite the entry to describe all three fixes together under a single bullet, including the 'demote-not-remove' index math and the URL-key normalization. Reproduced live by re-running the same iptables outage scenario in AKS with a new image of azcosmos_perf built against this branch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Three retry paths in the Cosmos client retry policy used to gate the cross-region retry on a successful synchronous gem.Update(ctx, true) against the account's global endpoint. During a regional outage the global FQDN typically resolves to the same regional FE pool we just marked unavailable, so the forced refresh stalled on a connect timeout and surfaced that error instead of attempting the failover. - attemptRetryOnEndpointFailure (403/WriteForbidden): switched from blocking gem.Update to a fire-and-forget asyncForceRefreshGEM helper. The helper uses a CompareAndSwap gate so a retry storm cannot leak a goroutine per attempt, runs on context.Background() so a near-expired caller deadline cannot abort the refresh, and has panic recovery + error logging. - attemptRetryOnRequestTimeout (HTTP 408): no longer calls MarkEndpointUnavailableForRead or gem.Update. A 408 is a per-request signal (the gateway accepted the request but did not finish in time), not regional unhealthiness; demoting the whole region for unavailableLocationExpirationTime based on one slow request would penalize concurrent reads. The outer loop's retryCount++ already advances the location index for the cross-region retry. Also took mapMutex.RLock around locationCache.resolveServiceEndpoint: its reads (locationInfo, enableMultipleWriteLocations) race the writes in update/updateLocked, which the new async refresh paths surface deterministically under go test -race. Tests added in cosmos_dbaccount_refresh_test.go: - TestFix3a_WriteRetrySucceedsWhenGEMRefreshFails - TestFix3b_RequestTimeoutDoesNotMarkEndpointUnavailable - TestFix3c_RequestTimeoutDoesNotBlockOnStalledGEM - TestFix3d_AsyncRefreshCASGateCoalescesRetryStorm - TestFix3e_AsyncRefreshIgnoresCallerContextCancellation TestFix3_WriteRetryForceRefreshesGEM updated to use require.Eventually since the refresh is now async and the singleflight + CAS gate legitimately coalesce concurrent forced refreshes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…dback) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

tvaron3

PR Deep Review summary

I found three blocking correctness risks and several targeted test/robustness gaps. The main concern is that the new demote-to-tail routing only works when the next retry indexes 0; the 403 path still increments retryCount, and the connection-error path does not reset a pre-existing non-zero retryCount. I also found remaining unlocked location-cache readers exposed by the new async refresh path.

I could not run local Go tests because go is not installed in this environment (go: command not found).

_{⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.}

kushagraThapar

Thanks for the rapid turnaround on top of #26815 and the excellent live-AKS repro narrative — iptables OUTPUT DROP + ss -tan showing zero TCP attempts to the replica IP is exactly the level of forensic detail that made the three interlocking bugs visible. The URL-key normalization catch was particularly nice. Core fix (MarkEndpointUnavailable* cache mutation is sufficient; synchronous refresh is the trap) reads cleanly and aligns with .NET on the demote-to-tail routing model. LGTM 🚀

Follow-up changes addressing four rounds of deep-reviewer feedback on top of the original PR. All changes preserve the surface fix (no synchronous gem.Update blocking cross-region retries) and add hardening around the supporting machinery. Routing / retry-state - Add retryContext.resolveFromHead as a one-shot signal so retry paths that mark-then-demote can force the next ResolveServiceEndpoint to use locationIndex 0 instead of an inherited (possibly non-zero) retryCount. Without this, the outer Do loop's retryCount += 1 routed the failover right back to the just-demoted endpoint at the tail of readEndpoints / writeEndpoints. Applied in both attemptRetryOnNetworkError and attemptRetryOnEndpointFailure. Forced refresh - Replace asyncRefreshPending atomic.Bool with asyncRefreshState atomic.Int32 (Idle / Pending / Failed). - asyncForceRefreshGEM no longer re-panics from its detached goroutine (an unrecovered panic in any goroutine terminates the process). Panics are recorded as failures for the state machine. - Add lastForcedRefreshUnixNano + forcedRefreshMinInterval to rate-limit repeat forced refreshes against an already-unavailable endpoint. Covers both stale-success and stale-failure recovery in one path; single-master writes can re-force after the window elapses, but a tight 403 loop cannot storm GetDatabaseAccount. gem.Update - Add updateFlight.genAtStart so a forceRefresh waiter that wakes on an in-flight refresh whose flight predates the latest invalidation loops to lead a fresh refresh. The waiter re-reads invalidationGen AFTER <-waitFlight.done to catch invalidations that fire while it is blocked. - Add a leader-side loop: a forceRefresh leader whose own flight sees a mid-flight invalidate() also loops to lead a fresh refresh. - Bound both loops with a shared maxForceRefreshRetries budget so sustained invalidations cannot keep a single goroutine spinning indefinitely. Location-cache locking - resolveServiceEndpoint, getLocation, and the new sessionRetrySnapshot helper now take mapMutex.RLock. The async refresh paths can rewrite locationInfo concurrently, which go test -race surfaced. sessionRetrySnapshot returns (multiWrite, readN, writeN) under a single RLock so attemptRetryOnSessionUnavailable cannot mix pre- and post-refresh topology state. MarkEndpointUnavailableFor{Read,Write} signature - Now returns (wasAlreadyUnavailable bool, err error) so callers can distinguish first-transition from repeat marks. The retry policy uses this to gate the forced refresh. Tests added in cosmos_dbaccount_refresh_test.go: TestFix1c_ForceRefreshWaiterReReadsInvalidationGenAfterWait TestFix1d_ForceRefreshLeaderLoopsOnMidFlightInvalidation TestFix1e_ForceRefreshLeaderLoopIsBounded TestFix1f_ForceRefreshWaiterLoopIsBounded TestFix3a_WriteRetrySucceedsWhenGEMRefreshFails TestFix3b_RequestTimeoutDoesNotMarkEndpointUnavailable TestFix3c_RequestTimeoutDoesNotBlockOnStalledGEM TestFix3d_AsyncRefreshCASGateCoalescesRetryStorm TestFix3e_AsyncRefreshIgnoresCallerContextCancellation TestFix3f_FailedAsyncRefreshIsRetriedOnNextSameEndpoint403 TestFix3g_SessionUnavailableSnapshotIsAtomic TestFix3h_RepeatWriteForbiddenForcesRefreshAfterRateWindow Tests added in cosmos_client_retry_policy_test.go: TestWriteForbiddenFailsOverToHealthyRegion TestConnectionErrorFailoverResetsNonZeroRetryCount Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kushagraThapar

LGTM, thanks @tvaron3

The Build Analyze stage of the azcosmos CI pipeline runs cspell, which treated these identifiers/comment words as unknown and failed the build. azlog/Writef are existing azcore log API names referenced in cosmos_client_retry_policy.go, Retriable is part of the existing errorinfo.NonRetriableError API, and unrecovered appears in a comment. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… dictionary" This reverts commit b934b51.

The new identifiers/comment words used in the azcosmos retry-policy changes (azlog, Writef, Retriable, unrecovered) were previously added to the global words array. Per the cspell layout this repo uses, overrides[].filename lets us scope words to a specific module's filetree, which avoids polluting the global namespace. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

This reverts commit 32c7cc3.

Follow the repo convention (cosmos_container_query_engine.go uses the same pattern) of declaring file-local cspell exceptions via an inline // cSpell:ignore comment rather than modifying the global cspell.json. The flagged identifiers (azlog, Writef, Retriable) are existing azcore API names; 'unrecovered' is a comment word. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

cspell only scans changed files and flagged the new occurrences of 'azcosmosgemtest' / 'azcosmostest' in the modified test file (these strings are pre-existing throughout the test suite, just not in this particular file before). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add the directive to cosmos_dbaccount_refresh_test.go (azcosmosgemtest / azcosmostest pre-existed in this file but new occurrences trip cspell on changed-file scans) and add 'retriable' (lowercase comment word) to both test files' directives. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

cspell scans the full content of every changed file. Now adding the directive to cosmos_global_endpoint_manager.go (azlog, ctxt - latter is a pre-existing typo'd local var name) and cosmos_global_endpoint_manager_test.go (azcosmostest - pipeline name string pre-existing in many test files). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Found the real cause by running cspell locally with the repo's pinned 1. cosmos_global_endpoint_manager_test.go: 4 more azcosmosgemtest + 2 westus occurrences (only added azcosmostest previously). 2. CHANGELOG.md: two pre-existing words (documentdb / unmarshalling) in older release-history entries the file now gets cspell-scanned end-to-end because we added one line to it. Verified locally: npx cspell --config .vscode/cspell.json <changed files> CSpell: Files checked: 7, Issues found: 0 in 0 files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

A leftover whitespace in 'int64(1 + elapsed/forcedRefreshMinInterval)' made gofmt unhappy and that's what was actually failing the Analyze stage now that all the cspell flags are gone. Verified locally: - go vet, gofmt: clean - golangci-lint v2.11.2 (pipeline pin): 0 issues - cspell ^9.6.3: 0 issues Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 26, 2026 21:56

tvaron3 requested a review from a team as a code owner May 26, 2026 21:56

Copilot started reviewing on behalf of tvaron3 May 26, 2026 21:56 View session

github-actions Bot added the Cosmos label May 26, 2026

Copilot AI reviewed May 26, 2026

View reviewed changes

Comment thread sdk/data/azcosmos/cosmos_client_retry_policy_test.go Outdated

Comment thread sdk/data/azcosmos/cosmos_client_retry_policy_test.go Outdated

tvaron3 and others added 4 commits May 26, 2026 15:33

azcosmos: shorten CHANGELOG entry for PR 26889

09419e6

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

azcosmos: add CHANGELOG entry for PR 26889

0fabe5f

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

tvaron3 changed the title ~~[azcosmos] don't gate cross-region connection-error failover on metadata refresh~~ [azcosmos] unblock cross-region failover from synchronous gem.Update May 27, 2026

azcosmos: clarify test comment to match injected error (PR review fee…

4c19925

…dback) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>