azcosmos: only mark region unavailable on NotSent network errors #26915
Merged
Previously, attemptRetryOnNetworkError called MarkEndpointUnavailable* on both NotSent (DNS/TCP connect/TLS handshake failures) and ambiguous (EOF/RST/transport timeout mid-exchange) transport errors. An ambiguous mid-exchange failure is too weak a signal to declare a whole region unavailable for every concurrent and future request on the client, and the request may even have been processed server-side.

Now MarkEndpointUnavailable* is only invoked when the error is connectionErrorNotSent. The ambiguous-read cross-region failover can no longer rely on demote-to-tail, so it bumps retryContext.retryCount instead, mirroring how the 503/408 paths fail over without marking. The 403 endpoint-failure path is unchanged.

Adds a two-server routing test for the ambiguous-read failover that asserts the retry actually reaches the second region without touching the location cache, and inverts the ambiguous-write test to assert no endpoint is marked unavailable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
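For readers skimming the description, the decision it lays out can be sketched roughly as below. This is a minimal illustration, not the SDK's code: only `connectionErrorNotSent` and `retryContext.retryCount` are names taken from the PR, while `connectionErrorAmbiguous`, `locationCache`, `markUnavailable`, and `onNetworkError` are hypothetical. The sketch also assumes ambiguous writes are simply not retried; the PR only states that they no longer mark an endpoint.

```go
package main

import "fmt"

// connectionErrorClass is a hypothetical stand-in for the SDK's internal
// classification. Only the name connectionErrorNotSent comes from the PR.
type connectionErrorClass int

const (
	connectionErrorNotSent   connectionErrorClass = iota
	connectionErrorAmbiguous                      // hypothetical name
)

// retryContext models only the retryCount field mentioned in the PR.
type retryContext struct {
	retryCount int
}

// locationCache is a hypothetical, simplified unavailability map.
type locationCache struct {
	unavailable map[string]bool
}

func (c *locationCache) markUnavailable(endpoint string) {
	c.unavailable[endpoint] = true
}

// onNetworkError reports whether the request should be retried in another
// region, and applies the side effects described in the PR text.
func onNetworkError(class connectionErrorClass, isRead bool, endpoint string,
	rc *retryContext, cache *locationCache) bool {
	if class == connectionErrorNotSent {
		// The request never left the client: safe to record the region
		// as unavailable for all traffic and fail over.
		cache.markUnavailable(endpoint)
		return true
	}
	if !isRead {
		// Assumption: an ambiguous write may already have been processed,
		// so this sketch neither marks the endpoint nor retries.
		return false
	}
	// Ambiguous read: fail over by advancing the retry counter only,
	// leaving the shared cache untouched.
	rc.retryCount++
	return true
}

func main() {
	cache := &locationCache{unavailable: map[string]bool{}}
	rc := &retryContext{}
	retry := onNetworkError(connectionErrorAmbiguous, true, "region-a", rc, cache)
	fmt.Println(retry, rc.retryCount, len(cache.unavailable))
}
```

The point of the split is that the cache is shared client-wide state, whereas `retryCount` is scoped to a single request, so an ambiguous failure only affects the request that observed it.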
Contributor
Pull request overview
Refines the azcosmos client retry policy so that regional unavailability is only recorded for transport errors classified as connectionErrorNotSent (i.e., the request definitely never reached the service). For ambiguous mid-exchange transport failures, reads can still fail over cross-region, but without marking/demoting the endpoint in the shared location cache.
Changes:
- Updated `attemptRetryOnNetworkError` to call `MarkEndpointUnavailableForRead`/`Write` only for `connectionErrorNotSent`, and to route ambiguous-read failover by bumping `retryContext.retryCount`.
- Adjusted retry-policy tests to validate ambiguous-read failover reaches a different region without mutating the location-cache unavailability map.
- Renamed/inverted the ambiguous-write test to assert that ambiguous write failures do not mark any endpoint unavailable.
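The two-server routing idea behind the ambiguous-read test can be sketched as follows. This is a simplified, assumption-laden illustration rather than the test in the PR: `routingTransport`, `readWithFailover`, `newCountingServer`, and `runScenario` are hypothetical names, the retry loop is a toy stand-in for the real policy, and the location cache is not modeled at all.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// routingTransport is a hypothetical stand-in for the routing mock transport
// named in the PR: requests to badHost fail with a mid-exchange style error,
// everything else is forwarded to the inner transport.
type routingTransport struct {
	badHost string
	inner   http.RoundTripper
}

func (t *routingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if req.URL.Host == t.badHost {
		return nil, io.ErrUnexpectedEOF // simulated ambiguous failure
	}
	return t.inner.RoundTrip(req)
}

// readWithFailover is a toy retry loop: the retry count selects the endpoint,
// so bumping it moves the next attempt to the next region.
func readWithFailover(client *http.Client, endpoints []string) (int, error) {
	var lastErr error
	for retryCount := 0; retryCount < len(endpoints); retryCount++ {
		resp, err := client.Get(endpoints[retryCount])
		if err != nil {
			lastErr = err
			continue
		}
		resp.Body.Close()
		return retryCount, nil
	}
	return len(endpoints), lastErr
}

// newCountingServer starts a test server that counts the requests it serves.
func newCountingServer(hits *int64) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt64(hits, 1)
		w.WriteHeader(http.StatusOK)
	}))
}

// runScenario wires up one failing and one healthy region and performs a read.
func runScenario() (badHits, goodHits int64, retryCount int, err error) {
	badSrv := newCountingServer(&badHits)
	defer badSrv.Close()
	goodSrv := newCountingServer(&goodHits)
	defer goodSrv.Close()

	client := &http.Client{Transport: &routingTransport{
		badHost: badSrv.Listener.Addr().String(),
		inner:   http.DefaultTransport,
	}}
	retryCount, err = readWithFailover(client, []string{badSrv.URL, goodSrv.URL})
	return atomic.LoadInt64(&badHits), atomic.LoadInt64(&goodHits), retryCount, err
}

func main() {
	bad, good, retries, err := runScenario()
	fmt.Println(bad, good, retries, err)
}
```

Counting requests on the healthy server is what lets a test assert that the retry actually reached the second region, rather than only checking that a retry was requested.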
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| sdk/data/azcosmos/cosmos_client_retry_policy.go | Restricts endpoint-unavailability marking to NotSent transport errors and changes ambiguous-read failover routing to avoid cache mutation. |
| sdk/data/azcosmos/cosmos_client_retry_policy_test.go | Updates/extends tests to verify ambiguous-read failover routes cross-region without marking endpoints unavailable; adjusts ambiguous-write expectations accordingly. |
simorenoh
approved these changes
Jun 1, 2026
Member
simorenoh
left a comment
Only missing changelog, LGTM!
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
simorenoh
approved these changes
Jun 1, 2026
Summary
Changes `attemptRetryOnNetworkError` in `sdk/data/azcosmos/cosmos_client_retry_policy.go` so that `MarkEndpointUnavailableForRead`/`ForWrite` is only invoked when the transport error is classified as `connectionErrorNotSent`, meaning we are sure the request never reached the service (DNS failure, TCP connect refused/unreachable, TLS handshake failure, etc.).

For ambiguous transport errors (EOF, RST, transport-level timeout mid-exchange, where the request may have been processed server-side), the region is no longer marked unavailable. A single mid-exchange failure is too weak a signal to declare a regional outage for all concurrent and future traffic on the client.
The `403` endpoint-failure path (`attemptRetryOnEndpointFailure`) is intentionally unchanged: that path is a service response, not a transport error.

Behavior change
… `resolveFromHead` … `resolveFromHead` … `retryContext.retryCount` so the next iteration of the `Do` loop resolves a different `locationIndex`. Mirrors how the 503/408 paths already fail over without marking.

Tests
- `TestAmbiguousConnectionErrorReadFailsOver`: rewritten to use the two-server `routingMockTransport` pattern (same as `TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable`). Asserts the ambiguous-read retry actually reaches the second region (`goodSrv.Requests() == 1`), `retryCount == 1`, and the location-cache unavailability map stays empty.
- `TestAmbiguousWriteMarksEndpointUnavailableForRead` → renamed `TestAmbiguousWriteDoesNotMarkEndpointUnavailable` and inverted to assert the location cache stays empty.
- Existing tests (`TestNotSentConnectionErrorMultiMasterWriteFailsOver`, `TestConnectionErrorGivesUpAfterSingleCrossRegionFailover`, `TestDnsErrorRetry`, `TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable`, etc.) pass unchanged.

Notes
… `attemptRetryOnNetworkError` were also significantly shortened.