[azcosmos] unblock cross-region failover from synchronous gem.Update#26889
[azcosmos] unblock cross-region failover from synchronous gem.Update#26889tvaron3 wants to merge 17 commits into
Conversation
…ta refresh attemptRetryOnNetworkError used to call gem.Update(ctx, true) after MarkEndpointUnavailable* to force a topology refresh before the cross-region retry. When the regional gateway outage that triggered the failover also takes out the global endpoint's address -- the common case, since the global account FQDN typically resolves to the same regional FE pool as the write region -- the forced refresh dials a blocked address, hits the 5s connect timeout, and returns an error. attemptRetryOnNetworkError then returns (false, err) and surfaces the original connection failure to the caller, never attempting the actual cross-region retry against the next preferred region. The forced refresh added no routing information: MarkEndpointUnavailable* already invokes locationCache.updateLocked, which recomputes readEndpoints / writeEndpoints with the unavailable endpoint demoted, so the next ResolveServiceEndpoint(retryCount+1, ...) picks the failover region on its own. A connection error indicates regional unhealthiness, not a topology change, so the synchronous refresh is unnecessary here. The invalidate() inside MarkEndpointUnavailable* still ensures the next non-force Update will pick up any real topology changes once the global endpoint is reachable again. Adds TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable that pins the gemServer to net.DNSError so any forced refresh would fail. Pre-fix this test fails (the request surfaces a DNS error after 3 same-region retries instead of the 5th-attempt 200); post-fix the failover succeeds. Reproduced live by running azcosmos_perf in AKS with iptables OUTPUT DROP rules against the workload account's westus2 FE IPs. Without this fix, no goroutine ever attempted a TCP connection to the westus replica even though westus was a configured preferred read region. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR adjusts the azcosmos client retry policy so that cross-region failover for connection errors is no longer blocked by a forced global-endpoint metadata refresh (which can fail during regional gateway outages), ensuring the failover retry actually occurs.
Changes:
- Removed the synchronous
gem.Update(ctx, true)gating fromattemptRetryOnNetworkErrorafter marking an endpoint unavailable. - Added a regression test covering failover behavior when the global endpoint is unreachable.
- Added a CHANGELOG entry under
1.5.0-beta.7describing the fix.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
sdk/data/azcosmos/cosmos_client_retry_policy.go |
Stops forcing a GEM refresh on the connection-error failover path so failover isn’t blocked by unreachable metadata endpoint. |
sdk/data/azcosmos/cosmos_client_retry_policy_test.go |
Adds a regression test ensuring read failover proceeds even when the global endpoint can’t be reached. |
sdk/data/azcosmos/CHANGELOG.md |
Documents the bug fix in the unreleased changelog. |
…vailability map
Repro on a real two-region account showed the original fix wasn't
enough: with iptables OUTPUT DROP rules blocking the workload account's
westus2 FE IPs, the SDK still kept every goroutine pinned to a blocked
westus2 IP and never attempted the westus replica. ss -tan inside the
pod showed zero TCP attempts to the configured preferred-region replica.
Investigation surfaced two more bugs in the same code path:
1. After MarkEndpointUnavailable* the policy did retryCount += 1.
getPrefAvailableEndpointsLocked demotes unavailable endpoints to the
TAIL of readEndpoints / writeEndpoints rather than removing them, so
readEndpoints becomes [healthy, unhealthy]. ResolveServiceEndpoint
uses retryCount as an index, so the bumped retryCount=1 made
ResolveServiceEndpoint(1 % 2) return the just-marked unhealthy
endpoint at the tail -- the 'failover' attempt would hit the same
dead region again.
2. MarkEndpointUnavailable* was called with the full request URL (path,
query, RawPath included). The unavailability map was keyed by url.URL
struct equality, but the cache's availReadEndpointsByLocation /
availWriteEndpointsByLocation stored only scheme+host base URLs. The
marks were therefore written under keys nothing else looked up, so
isEndpointUnavailableLocked always returned false and the demote step
silently did nothing. The previous test suite did not catch this
because CreateMockLC pointed every region at the same defaultEndpoint
URL, masking the routing decision entirely.
Fixes:
* cosmos_client_retry_policy.go: drop the retryCount += 1 in
attemptRetryOnNetworkError after the mark, with a comment explaining
why. retryCount stays at 0 so the next ResolveServiceEndpoint returns
readEndpoints[0] -- the just-promoted preferred region.
* cosmos_location_cache.go: introduce endpointKey(u url.URL) url.URL
that returns {Scheme, Host}. markEndpointUnavailableFor{Read,Write}
and isEndpointUnavailableLocked use it so mark and check operations
agree on identity regardless of the input URL's path/query.
* cosmos_client_retry_policy_test.go:
- Flip the retryCount assertion from 1 to 0 in five existing
connection-error tests (the two 408 tests are deliberately not
touched -- the 408 retry path is separate and gets bumped in the
Do loop).
- Replace the regression test body with a version that wires TWO
distinct mock servers (badSrv = original/unhealthy region, goodSrv
= failover region) via a small routingMockTransport that dispatches
by URL host. Asserts exactly 4 requests hit badSrv and exactly 1
hits goodSrv. Catches all three pre-fix conditions.
* CHANGELOG.md: rewrite the entry to describe all three fixes together
under a single bullet, including the 'demote-not-remove' index math
and the URL-key normalization.
Reproduced live by re-running the same iptables outage scenario in AKS
with a new image of azcosmos_perf built against this branch.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three retry paths in the Cosmos client retry policy used to gate the
cross-region retry on a successful synchronous gem.Update(ctx, true)
against the account's global endpoint. During a regional outage the
global FQDN typically resolves to the same regional FE pool we just
marked unavailable, so the forced refresh stalled on a connect timeout
and surfaced that error instead of attempting the failover.
- attemptRetryOnEndpointFailure (403/WriteForbidden): switched from
blocking gem.Update to a fire-and-forget asyncForceRefreshGEM
helper. The helper uses a CompareAndSwap gate so a retry storm
cannot leak a goroutine per attempt, runs on context.Background()
so a near-expired caller deadline cannot abort the refresh, and
has panic recovery + error logging.
- attemptRetryOnRequestTimeout (HTTP 408): no longer calls
MarkEndpointUnavailableForRead or gem.Update. A 408 is a
per-request signal (the gateway accepted the request but did not
finish in time), not regional unhealthiness; demoting the whole
region for unavailableLocationExpirationTime based on one slow
request would penalize concurrent reads. The outer loop's
retryCount++ already advances the location index for the
cross-region retry.
Also took mapMutex.RLock around locationCache.resolveServiceEndpoint:
its reads (locationInfo, enableMultipleWriteLocations) race the writes
in update/updateLocked, which the new async refresh paths surface
deterministically under go test -race.
Tests added in cosmos_dbaccount_refresh_test.go:
- TestFix3a_WriteRetrySucceedsWhenGEMRefreshFails
- TestFix3b_RequestTimeoutDoesNotMarkEndpointUnavailable
- TestFix3c_RequestTimeoutDoesNotBlockOnStalledGEM
- TestFix3d_AsyncRefreshCASGateCoalescesRetryStorm
- TestFix3e_AsyncRefreshIgnoresCallerContextCancellation
TestFix3_WriteRetryForceRefreshesGEM updated to use require.Eventually
since the refresh is now async and the singleflight + CAS gate
legitimately coalesce concurrent forced refreshes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dback) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3
left a comment
There was a problem hiding this comment.
PR Deep Review summary
I found three blocking correctness risks and several targeted test/robustness gaps. The main concern is that the new demote-to-tail routing only works when the next retry indexes 0; the 403 path still increments retryCount, and the connection-error path does not reset a pre-existing non-zero retryCount. I also found remaining unlocked location-cache readers exposed by the new async refresh path.
I could not run local Go tests because go is not installed in this environment (go: command not found).
kushagraThapar
left a comment
There was a problem hiding this comment.
Thanks for the rapid turnaround on top of #26815 and the excellent live-AKS repro narrative — iptables OUTPUT DROP + ss -tan showing zero TCP attempts to the replica IP is exactly the level of forensic detail that made the three interlocking bugs visible. The URL-key normalization catch was particularly nice. Core fix (MarkEndpointUnavailable* cache mutation is sufficient; synchronous refresh is the trap) reads cleanly and aligns with .NET on the demote-to-tail routing model. LGTM 🚀
Follow-up changes addressing four rounds of deep-reviewer feedback on
top of the original PR. All changes preserve the surface fix (no
synchronous gem.Update blocking cross-region retries) and add hardening
around the supporting machinery.
Routing / retry-state
- Add retryContext.resolveFromHead as a one-shot signal so retry
paths that mark-then-demote can force the next ResolveServiceEndpoint
to use locationIndex 0 instead of an inherited (possibly non-zero)
retryCount. Without this, the outer Do loop's retryCount += 1 routed
the failover right back to the just-demoted endpoint at the tail
of readEndpoints / writeEndpoints. Applied in both
attemptRetryOnNetworkError and attemptRetryOnEndpointFailure.
Forced refresh
- Replace asyncRefreshPending atomic.Bool with asyncRefreshState
atomic.Int32 (Idle / Pending / Failed).
- asyncForceRefreshGEM no longer re-panics from its detached
goroutine (an unrecovered panic in any goroutine terminates the
process). Panics are recorded as failures for the state machine.
- Add lastForcedRefreshUnixNano + forcedRefreshMinInterval to
rate-limit repeat forced refreshes against an already-unavailable
endpoint. Covers both stale-success and stale-failure recovery in
one path; single-master writes can re-force after the window
elapses, but a tight 403 loop cannot storm GetDatabaseAccount.
gem.Update
- Add updateFlight.genAtStart so a forceRefresh waiter that wakes
on an in-flight refresh whose flight predates the latest
invalidation loops to lead a fresh refresh. The waiter re-reads
invalidationGen AFTER <-waitFlight.done to catch invalidations
that fire while it is blocked.
- Add a leader-side loop: a forceRefresh leader whose own flight
sees a mid-flight invalidate() also loops to lead a fresh refresh.
- Bound both loops with a shared maxForceRefreshRetries budget so
sustained invalidations cannot keep a single goroutine spinning
indefinitely.
Location-cache locking
- resolveServiceEndpoint, getLocation, and the new sessionRetrySnapshot
helper now take mapMutex.RLock. The async refresh paths can rewrite
locationInfo concurrently, which go test -race surfaced.
sessionRetrySnapshot returns (multiWrite, readN, writeN) under a
single RLock so attemptRetryOnSessionUnavailable cannot mix
pre- and post-refresh topology state.
MarkEndpointUnavailableFor{Read,Write} signature
- Now returns (wasAlreadyUnavailable bool, err error) so callers can
distinguish first-transition from repeat marks. The retry policy
uses this to gate the forced refresh.
Tests added in cosmos_dbaccount_refresh_test.go:
TestFix1c_ForceRefreshWaiterReReadsInvalidationGenAfterWait
TestFix1d_ForceRefreshLeaderLoopsOnMidFlightInvalidation
TestFix1e_ForceRefreshLeaderLoopIsBounded
TestFix1f_ForceRefreshWaiterLoopIsBounded
TestFix3a_WriteRetrySucceedsWhenGEMRefreshFails
TestFix3b_RequestTimeoutDoesNotMarkEndpointUnavailable
TestFix3c_RequestTimeoutDoesNotBlockOnStalledGEM
TestFix3d_AsyncRefreshCASGateCoalescesRetryStorm
TestFix3e_AsyncRefreshIgnoresCallerContextCancellation
TestFix3f_FailedAsyncRefreshIsRetriedOnNextSameEndpoint403
TestFix3g_SessionUnavailableSnapshotIsAtomic
TestFix3h_RepeatWriteForbiddenForcesRefreshAfterRateWindow
Tests added in cosmos_client_retry_policy_test.go:
TestWriteForbiddenFailsOverToHealthyRegion
TestConnectionErrorFailoverResetsNonZeroRetryCount
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Build Analyze stage of the azcosmos CI pipeline runs cspell, which treated these identifiers/comment words as unknown and failed the build. azlog/Writef are existing azcore log API names referenced in cosmos_client_retry_policy.go, Retriable is part of the existing errorinfo.NonRetriableError API, and unrecovered appears in a comment. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… dictionary" This reverts commit b934b51.
The new identifiers/comment words used in the azcosmos retry-policy changes (azlog, Writef, Retriable, unrecovered) were previously added to the global words array. Per the cspell layout this repo uses, overrides[].filename lets us scope words to a specific module's filetree, which avoids polluting the global namespace. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This reverts commit 32c7cc3.
Follow the repo convention (cosmos_container_query_engine.go uses the same pattern) of declaring file-local cspell exceptions via an inline // cSpell:ignore comment rather than modifying the global cspell.json. The flagged identifiers (azlog, Writef, Retriable) are existing azcore API names; 'unrecovered' is a comment word. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cspell only scans changed files and flagged the new occurrences of 'azcosmosgemtest' / 'azcosmostest' in the modified test file (these strings are pre-existing throughout the test suite, just not in this particular file before). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add the directive to cosmos_dbaccount_refresh_test.go (azcosmosgemtest / azcosmostest pre-existed in this file but new occurrences trip cspell on changed-file scans) and add 'retriable' (lowercase comment word) to both test files' directives. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cspell scans the full content of every changed file. Now adding the directive to cosmos_global_endpoint_manager.go (azlog, ctxt - latter is a pre-existing typo'd local var name) and cosmos_global_endpoint_manager_test.go (azcosmostest - pipeline name string pre-existing in many test files). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Found the real cause by running cspell locally with the repo's pinned
1. cosmos_global_endpoint_manager_test.go: 4 more azcosmosgemtest +
2 westus occurrences (only added azcosmostest previously).
2. CHANGELOG.md: two pre-existing words (documentdb / unmarshalling)
in older release-history entries the file now gets cspell-scanned
end-to-end because we added one line to it.
Verified locally:
npx cspell --config .vscode/cspell.json <changed files>
CSpell: Files checked: 7, Issues found: 0 in 0 files.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A leftover whitespace in 'int64(1 + elapsed/forcedRefreshMinInterval)' made gofmt unhappy and that's what was actually failing the Analyze stage now that all the cspell flags are gone. Verified locally: - go vet, gofmt: clean - golangci-lint v2.11.2 (pipeline pin): 0 issues - cspell ^9.6.3: 0 issues Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What & Why
Three retry paths in the Cosmos client retry policy (
cosmos_client_retry_policy.go) used to gate the cross-region retry on a successful synchronousgem.Update(ctx, true)against the account's global endpoint. During a regional outage the global FQDN typically resolves to the same regional FE pool we just marked unavailable (Cosmos accounts in a region share FEs), so the forced refresh stalls on a connect timeout and surfaces that error instead of attempting the failover.This PR also fixes two interlocking bugs in
attemptRetryOnNetworkErrorthat prevented the failover from actually routing to another region even when the refresh succeeded, and tightens a few related correctness issues.Changes
attemptRetryOnNetworkError(connection error) — dropped the forcedgem.Update.MarkEndpointUnavailable*already callslocationCache.updateLockedwhich recomputesreadEndpoints/writeEndpointswith the unavailable endpoint demoted;invalidate()arms the next non-forceUpdatefor real topology changes. Also leftretryCountat 0 after the failover —getPrefAvailableEndpointsLockeddemotes unavailable endpoints to the tail of the route list rather than removing them, soretryCount=1(afterretryCount += 1) re-indexed back to the just-marked unhealthy endpoint viaResolveServiceEndpoint(1 % 2).attemptRetryOnEndpointFailure(HTTP 403 /WriteForbidden) — replaced the blockinggem.Updatewith a fire-and-forgetasyncForceRefreshGEMhelper. The helper uses aCompareAndSwapgate so a retry storm cannot leak a goroutine per attempt, runs oncontext.Background()so a near-expired caller deadline cannot abort the refresh, and has panic recovery + error logging.attemptRetryOnRequestTimeout(HTTP 408) — removed bothMarkEndpointUnavailableForReadand the forcedgem.Update. A 408 is a per-request signal (the gateway accepted the request but did not finish in time), not regional unhealthiness; demoting the whole region forunavailableLocationExpirationTimebased on one slow request would penalize concurrent reads. The outer loop'sretryCount++already advances the location index for the cross-region retry.locationCache—MarkEndpointUnavailable*was called with the full request URL (path/query/RawPath), but the unavailability map was keyed byurl.URLstruct equality whileavailReadEndpointsByLocation/availWriteEndpointsByLocationonly stored scheme+host. Marks were silently written under a key nothing else looked up, soisEndpointUnavailableLockedalways returned false and the demote was a no-op. Now normalized to scheme+host via a new internalendpointKeyhelper.locationCache.resolveServiceEndpoint— now takesmapMutex.RLock. Reads oflocationInfo/enableMultipleWriteLocationsrace the writes inupdate/updateLocked, which the new async refresh paths surface deterministically undergo test -race.Repro that motivated the fix
Reproduced live by running
azcosmos_perfin AKS withiptables OUTPUT DROPrules applied viakubectl debug --profile=netadminto block the workload account'swestus2FE IPs (20.42.169.199,20.42.170.147,20.9.156.133).Account topology:
westus2, reads=[westus2, westus]--application-region westus2 --preferred-regions westusFor 10 minutes:
ss -taninside the pod showed every goroutine inSYN_SENTto a blockedwestus2IP. Zero TCP attempts to40.112.241.99(thewestusreplica).concurrency=50→ ~18–22 s wall-clock per failed op, matching the budgeted same-region retries (3 × (5s + 1s backoff)) plus one failover attempt whosegem.Updateitself timed out.After the fix, the cache mutation from
MarkEndpointUnavailable*is enough for the next iteration'sResolveServiceEndpointto returnwestus, and the cross-region retry actually executes.Regression tests
TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable— pins the GEM mock tonet.DNSErrorso any forced refresh would fail; asserts the 5th-attempt 200 from the failover server is returned. Pre-fix the request surfaces a DNS error instead.TestFix3a_WriteRetrySucceedsWhenGEMRefreshFails— 403 retry returns(true, nil)even when GEM transport returns 500 on every call.TestFix3b_RequestTimeoutDoesNotMarkEndpointUnavailable— after a 408 retry,locationUnavailabilityInfoMapis empty and the GEM transport was never invoked.TestFix3c_RequestTimeoutDoesNotBlockOnStalledGEM— with a 10-minute hanging GEM transport,attemptRetryOnRequestTimeoutreturns within 5 s.TestFix3d_AsyncRefreshCASGateCoalescesRetryStorm— 50 concurrentasyncForceRefreshGEM()calls produce far fewer thanconcurrency/2HTTP refreshes.TestFix3e_AsyncRefreshIgnoresCallerContextCancellation— an already-cancelled caller context still triggers a successful background refresh (provescontext.Background()).TestFix3_WriteRetryForceRefreshesGEMupdated to userequire.Eventuallysince the refresh is now async and the singleflight + CAS gate legitimately coalesce concurrent forced refreshes.Backwards compatibility
attemptRetryOnNetworkErroralready had a comment explicitly noting that nogem.Updateis needed here (added in [Cosmos] Fix excess GetDatabaseAccount calls and eliminate data-plane fallback to customer endpoint (#25468) #26815). This change extends the same reasoning to the read / multi-master-write failover branch and to the 403 / 408 paths.CHANGELOG
Entry added under
1.5.0-beta.7→Bugs Fixed.Testing