azcosmos: only mark region unavailable on NotSent network errors #4
Closed
tvaron3 wants to merge 1 commit into
Previously, attemptRetryOnNetworkError called MarkEndpointUnavailable* on both NotSent (DNS/TCP connect/TLS handshake failures) and ambiguous (EOF/RST/transport timeout mid-exchange) transport errors. An ambiguous mid-exchange failure is too weak a signal to declare a whole region unavailable for every concurrent and future request on the client, and the request may even have been processed server-side.

Now MarkEndpointUnavailable* is only invoked when the error is connectionErrorNotSent. The ambiguous-read cross-region failover can no longer rely on demote-to-tail, so it bumps retryContext.retryCount instead, mirroring how the 503/408 paths fail over without marking. The 403 endpoint-failure path is unchanged.

Adds a two-server routing test for the ambiguous-read failover that asserts the retry actually reaches the second region without touching the location cache, and inverts the ambiguous-write test to assert no endpoint is marked unavailable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
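The retryCount-based failover described above can be illustrated with a minimal sketch. The types below (retryContext, locationCache, resolve) are hypothetical simplifications written for this example, and indexing the preference list with retryCount modulo its length is an assumption about how a different region gets selected, not the SDK's actual code.

```go
package main

import "fmt"

// retryContext is a hypothetical stand-in for the policy's per-request state.
type retryContext struct {
	retryCount int
}

// locationCache is a hypothetical, simplified location cache: an ordered
// preference list plus a set of endpoints marked unavailable.
type locationCache struct {
	readEndpoints []string
	unavailable   map[string]bool
}

// resolve picks the endpoint for the current attempt. Assumption: the attempt
// number indexes into the preference list, so bumping retryCount alone is
// enough to move to the next region.
func (lc *locationCache) resolve(rc *retryContext) string {
	idx := rc.retryCount % len(lc.readEndpoints)
	return lc.readEndpoints[idx]
}

func main() {
	lc := &locationCache{
		readEndpoints: []string{"region-a", "region-b"},
		unavailable:   map[string]bool{},
	}
	rc := &retryContext{}

	first := lc.resolve(rc)

	// Ambiguous read failure on the first region: do NOT mark it unavailable,
	// just bump the retry count so the next attempt resolves elsewhere.
	rc.retryCount++
	second := lc.resolve(rc)

	fmt.Println(first, second, len(lc.unavailable))
}
```

In this sketch the second attempt lands on region-b while the unavailability set stays empty, so other concurrent requests on the same client are not steered away from region-a by one ambiguous failure.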
Superseded by upstream PR Azure#26915
Summary
Changes attemptRetryOnNetworkError in sdk/data/azcosmos/cosmos_client_retry_policy.go so that MarkEndpointUnavailableForRead/ForWrite is only invoked when the transport error is classified as connectionErrorNotSent, meaning we are sure the request never reached the service (DNS failure, TCP connect refused/unreachable, TLS handshake failure, etc.).

For ambiguous transport errors (EOF, RST, or a transport-level timeout mid-exchange, where the request may have been processed server-side), the region is no longer marked unavailable. A single mid-exchange failure is too weak a signal to declare a regional outage for all concurrent and future traffic on the client.
The 403 endpoint-failure path (attemptRetryOnEndpointFailure) is intentionally unchanged: that path is a service response, not a transport error.
Behavior change
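With the endpoint no longer marked unavailable, the ambiguous-read failover cannot rely on demote-to-tail (resolveFromHead), so it bumps retryContext.retryCount so the next iteration of the Do loop resolves a different locationIndex. This mirrors how the 503/408 paths already fail over without marking.
Tests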
TestAmbiguousConnectionErrorReadFailsOver: rewritten to use the two-server routingMockTransport pattern (same as TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable). Asserts that the ambiguous-read retry actually reaches the second region (goodSrv.Requests() == 1), that retryCount == 1, and that the location-cache unavailability map stays empty.
TestAmbiguousWriteMarksEndpointUnavailableForRead → renamed TestAmbiguousWriteDoesNotMarkEndpointUnavailable and inverted to assert the location cache stays empty.
Other existing tests (TestNotSentConnectionErrorMultiMasterWriteFailsOver, TestConnectionErrorGivesUpAfterSingleCrossRegionFailover, TestDnsErrorRetry, TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable, etc.) pass unchanged.
Notes
… attemptRetryOnNetworkError were also significantly shortened.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>