Skip to content

azcosmos: only mark region unavailable on NotSent network errors#4

Closed
tvaron3 wants to merge 1 commit into
mainfrom
tvaron3/special-tribble
Closed

azcosmos: only mark region unavailable on NotSent network errors#4
tvaron3 wants to merge 1 commit into
mainfrom
tvaron3/special-tribble

Conversation

@tvaron3
Copy link
Copy Markdown
Owner

@tvaron3 tvaron3 commented Jun 1, 2026

Summary

Changes attemptRetryOnNetworkError in sdk/data/azcosmos/cosmos_client_retry_policy.go so that MarkEndpointUnavailableForRead/ForWrite is only invoked when the transport error is classified as connectionErrorNotSent — meaning we are sure the request never reached the service (DNS failure, TCP connect refused/unreachable, TLS handshake failure, etc.).

For ambiguous transport errors (EOF, RST, transport-level timeout mid-exchange — the request may have been processed server-side), the region is no longer marked unavailable. A single mid-exchange failure is too weak a signal to declare a regional outage for all concurrent and future traffic on the client.

The 403 endpoint-failure path (attemptRetryOnEndpointFailure) is intentionally unchanged — that path is a service response, not a transport error.

Behavior change

Scenario Before After
Read, NotSent Mark read, demote-to-tail, failover via resolveFromHead Unchanged
Read, ambiguous Mark read, demote-to-tail, failover via resolveFromHead No mark. Failover by bumping retryContext.retryCount so the next iteration of the Do loop resolves a different locationIndex. Mirrors how the 503/408 paths already failover without marking.
Write (multi-master), NotSent Mark read+write, demote, failover Unchanged
Write (multi-master), ambiguous Mark read+write, no retry No mark, no retry
Write (single-master), NotSent Mark read only, no retry Unchanged
Write (single-master), ambiguous Mark read only, no retry No mark, no retry
403 WriteForbidden / DatabaseAccountNotFound Mark + forced async refresh + failover Unchanged

Tests

  • TestAmbiguousConnectionErrorReadFailsOver — rewritten to use the two-server routingMockTransport pattern (same as TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable). Asserts the ambiguous-read retry actually reaches the second region (goodSrv.Requests() == 1), retryCount == 1, and the location-cache unavailability map stays empty.
  • TestAmbiguousWriteMarksEndpointUnavailableForRead → renamed TestAmbiguousWriteDoesNotMarkEndpointUnavailable and inverted to assert the location cache stays empty.
  • All NotSent-path tests (TestNotSentConnectionErrorMultiMasterWriteFailsOver, TestConnectionErrorGivesUpAfterSingleCrossRegionFailover, TestDnsErrorRetry, TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable, etc.) pass unchanged.

Notes

  • Comments in attemptRetryOnNetworkError were also significantly shortened.
  • The change is internal to the retry policy; no public API change.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

Previously, attemptRetryOnNetworkError called MarkEndpointUnavailable*
on both NotSent (DNS/TCP connect/TLS handshake failures) and ambiguous
(EOF/RST/transport timeout mid-exchange) transport errors. An ambiguous
mid-exchange failure is too weak a signal to declare a whole region
unavailable for every concurrent and future request on the client, and
the request may even have been processed server-side.

Now MarkEndpointUnavailable* is only invoked when the error is
connectionErrorNotSent. The ambiguous-read cross-region failover can no
longer rely on demote-to-tail, so it bumps retryContext.retryCount
instead, mirroring how the 503/408 paths failover without marking. The
403 endpoint-failure path is unchanged.

Adds a two-server routing test for the ambiguous-read failover that
asserts the retry actually reaches the second region without touching
the location cache, and inverts the ambiguous-write test to assert no
endpoint is marked unavailable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions Bot added the Cosmos label Jun 1, 2026
@tvaron3
Copy link
Copy Markdown
Owner Author

tvaron3 commented Jun 1, 2026

Superseded by upstream PR Azure#26915

@tvaron3 tvaron3 closed this Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant