Skip to content

[azcosmos] unblock cross-region failover from synchronous gem.Update#26889

Open
tvaron3 wants to merge 17 commits into
Azure:mainfrom
tvaron3:tvaron3/cosmos-failover-no-gem-update
Open

[azcosmos] unblock cross-region failover from synchronous gem.Update#26889
tvaron3 wants to merge 17 commits into
Azure:mainfrom
tvaron3:tvaron3/cosmos-failover-no-gem-update

Conversation

@tvaron3
Copy link
Copy Markdown
Member

@tvaron3 tvaron3 commented May 26, 2026

What & Why

Three retry paths in the Cosmos client retry policy (cosmos_client_retry_policy.go) used to gate the cross-region retry on a successful synchronous gem.Update(ctx, true) against the account's global endpoint. During a regional outage the global FQDN typically resolves to the same regional FE pool we just marked unavailable (Cosmos accounts in a region share FEs), so the forced refresh stalls on a connect timeout and surfaces that error instead of attempting the failover.

This PR also fixes two interlocking bugs in attemptRetryOnNetworkError that prevented the failover from actually routing to another region even when the refresh succeeded, and tightens a few related correctness issues.

Changes

  • attemptRetryOnNetworkError (connection error) — dropped the forced gem.Update. MarkEndpointUnavailable* already calls locationCache.updateLocked which recomputes readEndpoints / writeEndpoints with the unavailable endpoint demoted; invalidate() arms the next non-force Update for real topology changes. Also left retryCount at 0 after the failover — getPrefAvailableEndpointsLocked demotes unavailable endpoints to the tail of the route list rather than removing them, so retryCount=1 (after retryCount += 1) re-indexed back to the just-marked unhealthy endpoint via ResolveServiceEndpoint(1 % 2).
  • attemptRetryOnEndpointFailure (HTTP 403 / WriteForbidden) — replaced the blocking gem.Update with a fire-and-forget asyncForceRefreshGEM helper. The helper uses a CompareAndSwap gate so a retry storm cannot leak a goroutine per attempt, runs on context.Background() so a near-expired caller deadline cannot abort the refresh, and has panic recovery + error logging.
  • attemptRetryOnRequestTimeout (HTTP 408) — removed both MarkEndpointUnavailableForRead and the forced gem.Update. A 408 is a per-request signal (the gateway accepted the request but did not finish in time), not regional unhealthiness; demoting the whole region for unavailableLocationExpirationTime based on one slow request would penalize concurrent reads. The outer loop's retryCount++ already advances the location index for the cross-region retry.
  • URL-key normalization in locationCacheMarkEndpointUnavailable* was called with the full request URL (path/query/RawPath), but the unavailability map was keyed by url.URL struct equality while availReadEndpointsByLocation / availWriteEndpointsByLocation only stored scheme+host. Marks were silently written under a key nothing else looked up, so isEndpointUnavailableLocked always returned false and the demote was a no-op. Now normalized to scheme+host via a new internal endpointKey helper.
  • locationCache.resolveServiceEndpoint — now takes mapMutex.RLock. Reads of locationInfo / enableMultipleWriteLocations race the writes in update/updateLocked, which the new async refresh paths surface deterministically under go test -race.

Repro that motivated the fix

Reproduced live by running azcosmos_perf in AKS with iptables OUTPUT DROP rules applied via kubectl debug --profile=netadmin to block the workload account's westus2 FE IPs (20.42.169.199, 20.42.170.147, 20.9.156.133).

Account topology:

  • Workload account: write=westus2, reads=[westus2, westus]
  • Client args: --application-region westus2 --preferred-regions westus

For 10 minutes:

  • ss -tan inside the pod showed every goroutine in SYN_SENT to a blocked westus2 IP. Zero TCP attempts to 40.112.241.99 (the westus replica).
  • Per-op error rate ≈ 15/min/op across concurrency=50 → ~18–22 s wall-clock per failed op, matching the budgeted same-region retries (3 × (5s + 1s backoff)) plus one failover attempt whose gem.Update itself timed out.
  • Effective throughput went to zero and stayed there for the full window.

After the fix, the cache mutation from MarkEndpointUnavailable* is enough for the next iteration's ResolveServiceEndpoint to return westus, and the cross-region retry actually executes.

Regression tests

  • TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable — pins the GEM mock to net.DNSError so any forced refresh would fail; asserts the 5th-attempt 200 from the failover server is returned. Pre-fix the request surfaces a DNS error instead.
  • TestFix3a_WriteRetrySucceedsWhenGEMRefreshFails — 403 retry returns (true, nil) even when GEM transport returns 500 on every call.
  • TestFix3b_RequestTimeoutDoesNotMarkEndpointUnavailable — after a 408 retry, locationUnavailabilityInfoMap is empty and the GEM transport was never invoked.
  • TestFix3c_RequestTimeoutDoesNotBlockOnStalledGEM — with a 10-minute hanging GEM transport, attemptRetryOnRequestTimeout returns within 5 s.
  • TestFix3d_AsyncRefreshCASGateCoalescesRetryStorm — 50 concurrent asyncForceRefreshGEM() calls produce far fewer than concurrency/2 HTTP refreshes.
  • TestFix3e_AsyncRefreshIgnoresCallerContextCancellation — an already-cancelled caller context still triggers a successful background refresh (proves context.Background()).
  • TestFix3_WriteRetryForceRefreshesGEM updated to use require.Eventually since the refresh is now async and the singleflight + CAS gate legitimately coalesce concurrent forced refreshes.

Backwards compatibility

CHANGELOG

Entry added under 1.5.0-beta.7Bugs Fixed.

Testing

go test -count=1 -timeout 300s -race ./...
ok  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos    71.863s

…ta refresh

attemptRetryOnNetworkError used to call gem.Update(ctx, true) after
MarkEndpointUnavailable* to force a topology refresh before the
cross-region retry. When the regional gateway outage that triggered the
failover also takes out the global endpoint's address -- the common
case, since the global account FQDN typically resolves to the same
regional FE pool as the write region -- the forced refresh dials a
blocked address, hits the 5s connect timeout, and returns an error.
attemptRetryOnNetworkError then returns (false, err) and surfaces the
original connection failure to the caller, never attempting the actual
cross-region retry against the next preferred region.

The forced refresh added no routing information: MarkEndpointUnavailable*
already invokes locationCache.updateLocked, which recomputes
readEndpoints / writeEndpoints with the unavailable endpoint demoted,
so the next ResolveServiceEndpoint(retryCount+1, ...) picks the
failover region on its own. A connection error indicates regional
unhealthiness, not a topology change, so the synchronous refresh is
unnecessary here. The invalidate() inside MarkEndpointUnavailable*
still ensures the next non-force Update will pick up any real topology
changes once the global endpoint is reachable again.

Adds TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable
that pins the gemServer to net.DNSError so any forced refresh would
fail. Pre-fix this test fails (the request surfaces a DNS error after
3 same-region retries instead of the 5th-attempt 200); post-fix the
failover succeeds.

Reproduced live by running azcosmos_perf in AKS with iptables OUTPUT
DROP rules against the workload account's westus2 FE IPs. Without
this fix, no goroutine ever attempted a TCP connection to the westus
replica even though westus was a configured preferred read region.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 26, 2026 21:56
@tvaron3 tvaron3 requested a review from a team as a code owner May 26, 2026 21:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts the azcosmos client retry policy so that cross-region failover for connection errors is no longer blocked by a forced global-endpoint metadata refresh (which can fail during regional gateway outages), ensuring the failover retry actually occurs.

Changes:

  • Removed the synchronous gem.Update(ctx, true) gating from attemptRetryOnNetworkError after marking an endpoint unavailable.
  • Added a regression test covering failover behavior when the global endpoint is unreachable.
  • Added a CHANGELOG entry under 1.5.0-beta.7 describing the fix.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
sdk/data/azcosmos/cosmos_client_retry_policy.go Stops forcing a GEM refresh on the connection-error failover path so failover isn’t blocked by unreachable metadata endpoint.
sdk/data/azcosmos/cosmos_client_retry_policy_test.go Adds a regression test ensuring read failover proceeds even when the global endpoint can’t be reached.
sdk/data/azcosmos/CHANGELOG.md Documents the bug fix in the unreleased changelog.

Comment thread sdk/data/azcosmos/cosmos_client_retry_policy_test.go Outdated
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy_test.go Outdated
tvaron3 and others added 4 commits May 26, 2026 15:33
…vailability map

Repro on a real two-region account showed the original fix wasn't
enough: with iptables OUTPUT DROP rules blocking the workload account's
westus2 FE IPs, the SDK still kept every goroutine pinned to a blocked
westus2 IP and never attempted the westus replica. ss -tan inside the
pod showed zero TCP attempts to the configured preferred-region replica.

Investigation surfaced two more bugs in the same code path:

1. After MarkEndpointUnavailable* the policy did retryCount += 1.
   getPrefAvailableEndpointsLocked demotes unavailable endpoints to the
   TAIL of readEndpoints / writeEndpoints rather than removing them, so
   readEndpoints becomes [healthy, unhealthy]. ResolveServiceEndpoint
   uses retryCount as an index, so the bumped retryCount=1 made
   ResolveServiceEndpoint(1 % 2) return the just-marked unhealthy
   endpoint at the tail -- the 'failover' attempt would hit the same
   dead region again.

2. MarkEndpointUnavailable* was called with the full request URL (path,
   query, RawPath included). The unavailability map was keyed by url.URL
   struct equality, but the cache's availReadEndpointsByLocation /
   availWriteEndpointsByLocation stored only scheme+host base URLs. The
   marks were therefore written under keys nothing else looked up, so
   isEndpointUnavailableLocked always returned false and the demote step
   silently did nothing. The previous test suite did not catch this
   because CreateMockLC pointed every region at the same defaultEndpoint
   URL, masking the routing decision entirely.

Fixes:
* cosmos_client_retry_policy.go: drop the retryCount += 1 in
  attemptRetryOnNetworkError after the mark, with a comment explaining
  why. retryCount stays at 0 so the next ResolveServiceEndpoint returns
  readEndpoints[0] -- the just-promoted preferred region.
* cosmos_location_cache.go: introduce endpointKey(u url.URL) url.URL
  that returns {Scheme, Host}. markEndpointUnavailableFor{Read,Write}
  and isEndpointUnavailableLocked use it so mark and check operations
  agree on identity regardless of the input URL's path/query.
* cosmos_client_retry_policy_test.go:
  - Flip the retryCount assertion from 1 to 0 in five existing
    connection-error tests (the two 408 tests are deliberately not
    touched -- the 408 retry path is separate and gets bumped in the
    Do loop).
  - Replace the regression test body with a version that wires TWO
    distinct mock servers (badSrv = original/unhealthy region, goodSrv
    = failover region) via a small routingMockTransport that dispatches
    by URL host. Asserts exactly 4 requests hit badSrv and exactly 1
    hits goodSrv. Catches all three pre-fix conditions.
* CHANGELOG.md: rewrite the entry to describe all three fixes together
  under a single bullet, including the 'demote-not-remove' index math
  and the URL-key normalization.

Reproduced live by re-running the same iptables outage scenario in AKS
with a new image of azcosmos_perf built against this branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three retry paths in the Cosmos client retry policy used to gate the
cross-region retry on a successful synchronous gem.Update(ctx, true)
against the account's global endpoint. During a regional outage the
global FQDN typically resolves to the same regional FE pool we just
marked unavailable, so the forced refresh stalled on a connect timeout
and surfaced that error instead of attempting the failover.

  - attemptRetryOnEndpointFailure (403/WriteForbidden): switched from
    blocking gem.Update to a fire-and-forget asyncForceRefreshGEM
    helper. The helper uses a CompareAndSwap gate so a retry storm
    cannot leak a goroutine per attempt, runs on context.Background()
    so a near-expired caller deadline cannot abort the refresh, and
    has panic recovery + error logging.

  - attemptRetryOnRequestTimeout (HTTP 408): no longer calls
    MarkEndpointUnavailableForRead or gem.Update. A 408 is a
    per-request signal (the gateway accepted the request but did not
    finish in time), not regional unhealthiness; demoting the whole
    region for unavailableLocationExpirationTime based on one slow
    request would penalize concurrent reads. The outer loop's
    retryCount++ already advances the location index for the
    cross-region retry.

Also took mapMutex.RLock around locationCache.resolveServiceEndpoint:
its reads (locationInfo, enableMultipleWriteLocations) race the writes
in update/updateLocked, which the new async refresh paths surface
deterministically under go test -race.

Tests added in cosmos_dbaccount_refresh_test.go:
  - TestFix3a_WriteRetrySucceedsWhenGEMRefreshFails
  - TestFix3b_RequestTimeoutDoesNotMarkEndpointUnavailable
  - TestFix3c_RequestTimeoutDoesNotBlockOnStalledGEM
  - TestFix3d_AsyncRefreshCASGateCoalescesRetryStorm
  - TestFix3e_AsyncRefreshIgnoresCallerContextCancellation

TestFix3_WriteRetryForceRefreshesGEM updated to use require.Eventually
since the refresh is now async and the singleflight + CAS gate
legitimately coalesce concurrent forced refreshes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3 tvaron3 changed the title [azcosmos] don't gate cross-region connection-error failover on metadata refresh [azcosmos] unblock cross-region failover from synchronous gem.Update May 27, 2026
…dback)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Member Author

@tvaron3 tvaron3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Deep Review summary

I found three blocking correctness risks and several targeted test/robustness gaps. The main concern is that the new demote-to-tail routing only works when the next retry indexes 0; the 403 path still increments retryCount, and the connection-error path does not reset a pre-existing non-zero retryCount. I also found remaining unlocked location-cache readers exposed by the new async refresh path.

I could not run local Go tests because go is not installed in this environment (go: command not found).


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go Outdated
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go
Comment thread sdk/data/azcosmos/cosmos_location_cache.go
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go Outdated
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go Outdated
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go Outdated
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go Outdated
Comment thread sdk/data/azcosmos/cosmos_dbaccount_refresh_test.go
Comment thread sdk/data/azcosmos/cosmos_dbaccount_refresh_test.go Outdated
Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the rapid turnaround on top of #26815 and the excellent live-AKS repro narrative — iptables OUTPUT DROP + ss -tan showing zero TCP attempts to the replica IP is exactly the level of forensic detail that made the three interlocking bugs visible. The URL-key normalization catch was particularly nice. Core fix (MarkEndpointUnavailable* cache mutation is sufficient; synchronous refresh is the trap) reads cleanly and aligns with .NET on the demote-to-tail routing model. LGTM 🚀

Follow-up changes addressing four rounds of deep-reviewer feedback on
top of the original PR. All changes preserve the surface fix (no
synchronous gem.Update blocking cross-region retries) and add hardening
around the supporting machinery.

Routing / retry-state
  - Add retryContext.resolveFromHead as a one-shot signal so retry
    paths that mark-then-demote can force the next ResolveServiceEndpoint
    to use locationIndex 0 instead of an inherited (possibly non-zero)
    retryCount. Without this, the outer Do loop's retryCount += 1 routed
    the failover right back to the just-demoted endpoint at the tail
    of readEndpoints / writeEndpoints. Applied in both
    attemptRetryOnNetworkError and attemptRetryOnEndpointFailure.

Forced refresh
  - Replace asyncRefreshPending atomic.Bool with asyncRefreshState
    atomic.Int32 (Idle / Pending / Failed).
  - asyncForceRefreshGEM no longer re-panics from its detached
    goroutine (an unrecovered panic in any goroutine terminates the
    process). Panics are recorded as failures for the state machine.
  - Add lastForcedRefreshUnixNano + forcedRefreshMinInterval to
    rate-limit repeat forced refreshes against an already-unavailable
    endpoint. Covers both stale-success and stale-failure recovery in
    one path; single-master writes can re-force after the window
    elapses, but a tight 403 loop cannot storm GetDatabaseAccount.

gem.Update
  - Add updateFlight.genAtStart so a forceRefresh waiter that wakes
    on an in-flight refresh whose flight predates the latest
    invalidation loops to lead a fresh refresh. The waiter re-reads
    invalidationGen AFTER <-waitFlight.done to catch invalidations
    that fire while it is blocked.
  - Add a leader-side loop: a forceRefresh leader whose own flight
    sees a mid-flight invalidate() also loops to lead a fresh refresh.
  - Bound both loops with a shared maxForceRefreshRetries budget so
    sustained invalidations cannot keep a single goroutine spinning
    indefinitely.

Location-cache locking
  - resolveServiceEndpoint, getLocation, and the new sessionRetrySnapshot
    helper now take mapMutex.RLock. The async refresh paths can rewrite
    locationInfo concurrently, which go test -race surfaced.
    sessionRetrySnapshot returns (multiWrite, readN, writeN) under a
    single RLock so attemptRetryOnSessionUnavailable cannot mix
    pre- and post-refresh topology state.

MarkEndpointUnavailableFor{Read,Write} signature
  - Now returns (wasAlreadyUnavailable bool, err error) so callers can
    distinguish first-transition from repeat marks. The retry policy
    uses this to gate the forced refresh.

Tests added in cosmos_dbaccount_refresh_test.go:
  TestFix1c_ForceRefreshWaiterReReadsInvalidationGenAfterWait
  TestFix1d_ForceRefreshLeaderLoopsOnMidFlightInvalidation
  TestFix1e_ForceRefreshLeaderLoopIsBounded
  TestFix1f_ForceRefreshWaiterLoopIsBounded
  TestFix3a_WriteRetrySucceedsWhenGEMRefreshFails
  TestFix3b_RequestTimeoutDoesNotMarkEndpointUnavailable
  TestFix3c_RequestTimeoutDoesNotBlockOnStalledGEM
  TestFix3d_AsyncRefreshCASGateCoalescesRetryStorm
  TestFix3e_AsyncRefreshIgnoresCallerContextCancellation
  TestFix3f_FailedAsyncRefreshIsRetriedOnNextSameEndpoint403
  TestFix3g_SessionUnavailableSnapshotIsAtomic
  TestFix3h_RepeatWriteForbiddenForcesRefreshAfterRateWindow

Tests added in cosmos_client_retry_policy_test.go:
  TestWriteForbiddenFailsOverToHealthyRegion
  TestConnectionErrorFailoverResetsNonZeroRetryCount

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @tvaron3

The Build Analyze stage of the azcosmos CI pipeline runs cspell, which
treated these identifiers/comment words as unknown and failed the
build. azlog/Writef are existing azcore log API names referenced in
cosmos_client_retry_policy.go, Retriable is part of the existing
errorinfo.NonRetriableError API, and unrecovered appears in a comment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 and others added 2 commits May 27, 2026 20:30
The new identifiers/comment words used in the azcosmos retry-policy
changes (azlog, Writef, Retriable, unrecovered) were previously
added to the global words array. Per the cspell layout this repo uses,
overrides[].filename lets us scope words to a specific module's
filetree, which avoids polluting the global namespace.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 and others added 7 commits May 27, 2026 20:40
Follow the repo convention (cosmos_container_query_engine.go uses the
same pattern) of declaring file-local cspell exceptions via an inline
// cSpell:ignore comment rather than modifying the global cspell.json.
The flagged identifiers (azlog, Writef, Retriable) are existing
azcore API names; 'unrecovered' is a comment word.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cspell only scans changed files and flagged the new occurrences of
'azcosmosgemtest' / 'azcosmostest' in the modified test file (these
strings are pre-existing throughout the test suite, just not in this
particular file before).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add the directive to cosmos_dbaccount_refresh_test.go (azcosmosgemtest /
azcosmostest pre-existed in this file but new occurrences trip cspell
on changed-file scans) and add 'retriable' (lowercase comment word) to
both test files' directives.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cspell scans the full content of every changed file. Now adding the
directive to cosmos_global_endpoint_manager.go (azlog, ctxt - latter
is a pre-existing typo'd local var name) and cosmos_global_endpoint_manager_test.go
(azcosmostest - pipeline name string pre-existing in many test files).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Found the real cause by running cspell locally with the repo's pinned
  1. cosmos_global_endpoint_manager_test.go: 4 more azcosmosgemtest +
     2 westus occurrences (only added azcosmostest previously).
  2. CHANGELOG.md: two pre-existing words (documentdb / unmarshalling)
     in older release-history entries the file now gets cspell-scanned
     end-to-end because we added one line to it.

Verified locally:
  npx cspell --config .vscode/cspell.json <changed files>
  CSpell: Files checked: 7, Issues found: 0 in 0 files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A leftover whitespace in 'int64(1 + elapsed/forcedRefreshMinInterval)'
made gofmt unhappy and that's what was actually failing the Analyze
stage now that all the cspell flags are gone.

Verified locally:
  - go vet, gofmt: clean
  - golangci-lint v2.11.2 (pipeline pin): 0 issues
  - cspell ^9.6.3: 0 issues

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants