Skip to content

azcosmos: only mark region unavailable on NotSent network errors#26915

Merged
tvaron3 merged 2 commits into
Azure:mainfrom
tvaron3:tvaron3/special-tribble
Jun 1, 2026
Merged

azcosmos: only mark region unavailable on NotSent network errors#26915
tvaron3 merged 2 commits into
Azure:mainfrom
tvaron3:tvaron3/special-tribble

Conversation

@tvaron3
Copy link
Copy Markdown
Member

@tvaron3 tvaron3 commented Jun 1, 2026

Summary

Changes attemptRetryOnNetworkError in sdk/data/azcosmos/cosmos_client_retry_policy.go so that MarkEndpointUnavailableForRead/ForWrite is only invoked when the transport error is classified as connectionErrorNotSent — meaning we are sure the request never reached the service (DNS failure, TCP connect refused/unreachable, TLS handshake failure, etc.).

For ambiguous transport errors (EOF, RST, transport-level timeout mid-exchange — the request may have been processed server-side), the region is no longer marked unavailable. A single mid-exchange failure is too weak a signal to declare a regional outage for all concurrent and future traffic on the client.

The 403 endpoint-failure path (attemptRetryOnEndpointFailure) is intentionally unchanged — that path is a service response, not a transport error.

Behavior change

Scenario Before After
Read, NotSent Mark read, demote-to-tail, failover via resolveFromHead Unchanged
Read, ambiguous Mark read, demote-to-tail, failover via resolveFromHead No mark. Failover by bumping retryContext.retryCount so the next iteration of the Do loop resolves a different locationIndex. Mirrors how the 503/408 paths already failover without marking.
Write (multi-master), NotSent Mark read+write, demote, failover Unchanged
Write (multi-master), ambiguous Mark read+write, no retry No mark, no retry
Write (single-master), NotSent Mark read only, no retry Unchanged
Write (single-master), ambiguous Mark read only, no retry No mark, no retry
403 WriteForbidden / DatabaseAccountNotFound Mark + forced async refresh + failover Unchanged

Tests

  • TestAmbiguousConnectionErrorReadFailsOver — rewritten to use the two-server routingMockTransport pattern (same as TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable). Asserts the ambiguous-read retry actually reaches the second region (goodSrv.Requests() == 1), retryCount == 1, and the location-cache unavailability map stays empty.
  • TestAmbiguousWriteMarksEndpointUnavailableForRead → renamed TestAmbiguousWriteDoesNotMarkEndpointUnavailable and inverted to assert the location cache stays empty.
  • All NotSent-path tests (TestNotSentConnectionErrorMultiMasterWriteFailsOver, TestConnectionErrorGivesUpAfterSingleCrossRegionFailover, TestDnsErrorRetry, TestConnectionErrorReadFailsOverWhenGlobalEndpointIsUnreachable, etc.) pass unchanged.

Notes

  • Comments in attemptRetryOnNetworkError were also significantly shortened.
  • The change is internal to the retry policy; no public API change.

Previously, attemptRetryOnNetworkError called MarkEndpointUnavailable*
on both NotSent (DNS/TCP connect/TLS handshake failures) and ambiguous
(EOF/RST/transport timeout mid-exchange) transport errors. An ambiguous
mid-exchange failure is too weak a signal to declare a whole region
unavailable for every concurrent and future request on the client, and
the request may even have been processed server-side.

Now MarkEndpointUnavailable* is only invoked when the error is
connectionErrorNotSent. The ambiguous-read cross-region failover can no
longer rely on demote-to-tail, so it bumps retryContext.retryCount
instead, mirroring how the 503/408 paths failover without marking. The
403 endpoint-failure path is unchanged.

Adds a two-server routing test for the ambiguous-read failover that
asserts the retry actually reaches the second region without touching
the location cache, and inverts the ambiguous-write test to assert no
endpoint is marked unavailable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions Bot added the Cosmos label Jun 1, 2026
@tvaron3 tvaron3 marked this pull request as ready for review June 1, 2026 21:58
@tvaron3 tvaron3 requested a review from a team as a code owner June 1, 2026 21:58
Copilot AI review requested due to automatic review settings June 1, 2026 21:58
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refines the azcosmos client retry policy so that regional unavailability is only recorded for transport errors classified as connectionErrorNotSent (i.e., the request definitely never reached the service). For ambiguous mid-exchange transport failures, reads can still fail over cross-region, but without marking/demoting the endpoint in the shared location cache.

Changes:

  • Updated attemptRetryOnNetworkError to call MarkEndpointUnavailableForRead/Write only for connectionErrorNotSent, and to route ambiguous-read failover by bumping retryContext.retryCount.
  • Adjusted retry-policy tests to validate ambiguous-read failover reaches a different region without mutating the location-cache unavailability map.
  • Renamed/inverted the ambiguous-write test to assert that ambiguous write failures do not mark any endpoint unavailable.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
sdk/data/azcosmos/cosmos_client_retry_policy.go Restricts endpoint-unavailability marking to NotSent transport errors and changes ambiguous-read failover routing to avoid cache mutation.
sdk/data/azcosmos/cosmos_client_retry_policy_test.go Updates/extends tests to verify ambiguous-read failover routes cross-region without marking endpoints unavailable; adjusts ambiguous-write expectations accordingly.

Copy link
Copy Markdown
Member

@simorenoh simorenoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only missing changelog, LGTM!

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3 tvaron3 enabled auto-merge (squash) June 1, 2026 22:40
@tvaron3 tvaron3 merged commit 70b713e into Azure:main Jun 1, 2026
18 of 20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants