azcosmos: cross-region retry policy for connection errors and 408s by tvaron3 · Pull Request #26858 · Azure/azure-sdk-for-go

tvaron3 · 2026-05-20T23:15:06Z

Summary

Tightens the Azure Cosmos DB Go SDK's client retry policy so writes are not silently duplicated across regions when the outcome of a failed request is ambiguous. Also adds new handling for HTTP 408 Request Timeout.

Connection-error behavior

Classifies transport-level failures as either not-sent (DNS, dial, TLS handshake, ECONNREFUSED/EHOSTUNREACH/ENETUNREACH) or ambiguous (EOF, ECONNRESET, EPIPE, transport-level timeouts, generic net.OpError).
Retries up to 3 times against the current region without marking the endpoint unavailable.
After the same-region budget is exhausted, performs at most one cross-region failover per call:
- Reads → always.
- Writes → only when the error is not-sent (we are sure the request never reached the service). Ambiguous-error writes give up rather than risk a duplicate.
The policy short-circuits if the caller's context deadline expired or was cancelled, so we don't consume the customer's e2e timeout budget with retries.

HTTP 408 handling

408 is now handled by the client retry policy (previously fell through to azcore's outer retry, which has no region awareness).
Reads are retried exactly once against another region.
Writes are returned to the caller immediately as non-retriable (avoids duplicate writes).

azcore interaction

All give-up paths wrap the returned error with errorinfo.NonRetriableError so the outer azcore retry policy does not re-loop our policy and inflate the per-call request count.

Per-call request budget for persistent connection errors

1 initial + 3 same-region retries + 1 cross-region failover = 5 HTTP requests max, then a NonRetriableError is surfaced to the caller.

Tests added

Classifier unit test (TestClassifyNetworkError).
TestConnectionErrorReadFailsOverAfterThreeSameRegionAttempts
TestNotSentConnectionErrorWriteFailsOver (single-master)
TestNotSentConnectionErrorMultiMasterWriteFailsOver
TestAmbiguousConnectionErrorWriteDoesNotFailOver
TestAmbiguousConnectionErrorReadFailsOver
TestConnectionErrorGivesUpAfterSingleCrossRegionFailover
TestCallerDeadlineExceededDoesNotRetry
TestRequestTimeoutReadRetriesCrossRegion
TestRequestTimeoutReadGivesUpAfterOneCrossRegionRetry
TestRequestTimeoutWriteDoesNotRetry
TestDnsErrorRetry (existing, expectation updated to assert sameRegionRetryCount)

All sdk/data/azcosmos package tests pass (go test -short ./..., ~48s).

CHANGELOG

Entry added under 1.5.0-beta.7 → Bugs Fixed.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

Tightens client retry behavior so writes are not silently duplicated across regions when the outcome of a failed request is ambiguous. Connection errors: - Classify transport-level failures as either 'not-sent' (DNS, dial, TLS handshake, ECONNREFUSED-family) or 'ambiguous' (EOF, ECONNRESET, EPIPE, transport timeouts, generic net.OpError). - Retry up to 3 times on the current region without marking the endpoint unavailable. - After the same-region budget, perform at most one cross-region failover per call: always for reads, only for 'not-sent' writes. Writes on ambiguous errors give up rather than risk a duplicate. - Honor caller-set context deadlines/cancellations by short-circuiting without any retries. HTTP 408 (Request Timeout): - Newly handled by the client retry policy. Reads are retried exactly once against another region; writes are returned immediately as non-retriable to avoid duplicates. All give-up paths wrap the returned error with errorinfo.NonRetriableError so the outer azcore retry policy does not re-loop. Adds focused unit tests for the classifier and end-to-end tests for each branch (same-region success, multi-master write failover, ambiguous read failover, ambiguous write give-up, single-failover cap, caller deadline, 408 read failover, 408 read cap, 408 write no-retry). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR updates sdk/data/azcosmos retry behavior to be region-aware for transport errors and HTTP 408s, with the goal of preventing ambiguous write retries from silently duplicating across regions.

Changes:

Added connection-error classification (not-sent vs ambiguous) and updated retry behavior to do limited same-region retries before a single cross-region failover (writes only when “not-sent”).
Added explicit handling for HTTP 408 Request Timeout (read: single cross-region retry; write: no retry) and wrapped “give up” paths with errorinfo.NonRetriableError to prevent outer azcore retries.
Added/updated unit tests and a changelog entry describing the new retry semantics.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File	Description
sdk/data/azcosmos/cosmos_client_retry_policy.go	Implements new connection-error classification, same-region retry budget, single cross-region failover rules, and HTTP 408 handling.
sdk/data/azcosmos/cosmos_client_retry_policy_test.go	Adds tests for the new retry behaviors and updates existing DNS retry expectations.
sdk/data/azcosmos/CHANGELOG.md	Documents the updated retry behavior and HTTP 408 handling under Bugs Fixed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Connection-error retry policy (cosmos_client_retry_policy.go): - Wrap caller-context error (ctx.Err()) when short-circuiting instead of the underlying transport error, so errors.Is(returned, context.DeadlineExceeded) works and the real cancellation reason is preserved alongside the transport error. - Honor enableCrossRegionRetries=false as 'fail fast' across all branches — no same-region retries when the user has explicitly opted out. - When giving up on an ambiguous write or a single-master write that cannot meaningfully fail over, still mark the endpoint unavailable for read so concurrent requests learn about the regional outage. Skip marking write-unavailable on single-master accounts (no other write region exists). - Short-circuit single-master writes before attempting a 'cross-region failover' that would resolve back to the same write region (wasted attempt + degraded cache). - Replace bare time.Sleep with a context-aware sleepWithContext helper so backoffs honor caller-set deadlines instead of consuming the caller's budget asleep. - Force a GEM refresh (Update(ctx, true)) after marking an endpoint unavailable so the new readEndpoints/writeEndpoints take effect for concurrent requests too — without the force, refresh is gated on a 5-minute interval and the unavailability signal is lost. - 408 cross-region read retry now also forces a refresh and respects caller context during its backoff. - Restructure classifyNetworkError: fold the trailing OpError check into the OpError branch for clarity, add syscall.ETIMEDOUT as not-sent (covers TCP connect timeouts), and document the cross-platform syscall normalization. Tests (cosmos_client_retry_policy_test.go): - Parameterize setupRetryPolicyTestClient via setupRetryPolicyTestClientOpts so tests can opt into single-master and disabled-cross-region configurations. - Add TestSingleMasterWriteDoesNotFailoverOnConnectionError covering the production-default single-master write topology. - Add TestAmbiguousWriteMarksEndpointUnavailableForRead asserting the shared-state hygiene fix. - Add TestConnectionErrorWithCrossRegionRetriesDisabledFailsFast. - Add TestCallerDeadlineDuringBackoffShortCircuits verifying the context-aware sleep returns early when ctx fires. - Add classifier coverage for syscall.ETIMEDOUT and a real *net.OpError{Op: read, Err: syscall.ECONNRESET}. - Rename fakeNetOpError → fakeAmbiguousNetError (more accurate — it is not an *net.OpError). - Replace ignored json.Marshal errors with require.NoError in all 4 affected tests (Copilot review feedback). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kushagraThapar

Thanks @tvaron3 for this — it's a substantively good improvement and directionally aligns Go with the more conservative classifier patterns used by the Java, Python, and Rust Cosmos SDKs (Go was previously closer to the more permissive .NET-style "any HttpRequestException → mark + failover"). Moving away from that is a real win for write safety, and the connectionErrorNotSent vs connectionErrorAmbiguous taxonomy is exactly the right shape.

One supportability angle I'd love to flag for discussion before this lands: this PR locks in transport-error retry semantics that we'll inherit as part of the public client contract — customers will start writing code (or just operating production workloads) that assumes "at most N attempts on connection error" and "ambiguous writes don't duplicate". Once those expectations form, tightening or loosening them later is hard to do without a breaking-change conversation. So I'd love to make sure we get the safety story right the first time.

With that lens, I have one headline finding I'd really love your input on, plus five smaller things — all inline.

Headline (B1): I think the same-region 3-retry budget may unintentionally retry ambiguous-error writes against the same endpoint, which would contradict the PR's own stated safety guarantee and open a duplicate-mutation surface for non-idempotent operations like PatchItem(AppendIncrement) / AppendAdd and ExecuteTransactionalBatch containing such ops. The cross-region branch correctly gates on kind == connectionErrorNotSent, but the same-region branch above it doesn't seem to. Java / Python / Rust all do 0 same-region retries for ambiguous writes; I think Go would be the outlier here. Detail (with the trace + peer-SDK evidence) inline at line 226.

Others (M1–M5), inline:

M1 — three new sleepWithContext call sites that swallow ctx.Err() (lines 228, 284, 371), inconsistent with the working pattern at line 115.
M2 — errRetry returned unwrapped at line 121 (and err at line 159 for 408) — looks like it bypasses the NonRetriableError contract the PR description establishes.
M3 — gem.Update(ctx, true) force-refresh at lines 277 + 367 could create a thundering-herd serialized under gemMutex.
M4 — the new internal continue paths interact with the pre-existing pipeline ordering at cosmos_client.go:198-200 in a way that might silently affect short-lived AAD/MSI tokens.
M5 — the ambiguous-write give-up branch at line 263 skips the Update(forceRefresh=true) step that the cross-region failover branch does at line 277 — likely intertwined with M3.

Two things worth calling out as already-good:

The cross-region failover budget = 1 is dramatically tighter than the peers (.NET / Rust = 120, Java large, Python iterates all regions). I think that conservative stance is actually the right call for write safety — wanted to flag that this divergence reads as deliberate and is probably worth a quick line in the PR description so future readers know.
All 6 prior Copilot inline findings appear addressed in a0dff927e7 — nice cleanup.

Happy to discuss any of this — totally defer to you on which to take. Could you please share your thinking on B1 in particular?

- attemptRetryOnNetworkError: ambiguous-error writes now skip the same-region retry budget. Previously the same-region branch was only gated on crossRegionFailoverDone/sameRegionRetryCount, so a non-idempotent mutation (PatchItem(Increment), TransactionalBatch with non-idempotent ops) that returned ECONNRESET mid-response could be silently applied up to 4 times. Aligns with Java/Python/ Rust SDKs and with the PR description / CHANGELOG. - Preserve caller-context cancellation cause across backoff: sleepWithContext failures in attemptRetryOnNetworkError now return fmt.Errorf("%w: underlying transport error: %v", sleepErr, transportErr), and the 408 path returns the bare sleep error. The outer loop wraps these as NonRetriableError, so errors.Is(returned, context.DeadlineExceeded) now works consistently whether the deadline fires before req.Next() or during backoff. - Wrap errRetry from attemptRetryOnNetworkError and attemptRetryOnRequestTimeout in NonRetriableError. Without this wrap, gem.MarkEndpointUnavailable* / gem.Update errors propagate to azcore's outer retry policy, which can re-loop us up to its MaxRetries=3 default and blow past the documented 5-request budget. - cosmos_client.go: swap PerRetry order to [clientRetryPolicy, authPolicy] so the internal retry loop's req.Next() naturally re-invokes authPolicy on every attempt. Short-lived AAD/MSI tokens could otherwise expire across the cumulative same-region + cross-region retry window and surface as a misleading 401 on a late retry. - Test updates: TestAmbiguousConnectionErrorWriteDoesNotFailOver now asserts srv.Requests() == 1 and no retries, encoding the new ambiguous-write contract instead of the buggy prior behavior. TestCallerDeadlineDuringBackoffShortCircuits adds an ErrorIs(context.DeadlineExceeded) assertion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…n-retry-policy # Conflicts: # sdk/data/azcosmos/CHANGELOG.md

kushagraThapar

Re-reviewed against caaea52f55c3.

B1, M1, M4 ✅ — verified. B1 gate at :240-248 short-circuits ambiguous writes before the budget increment and sleep, and TestAmbiguousConnectionErrorWriteDoesNotFailOver asserts srv.Requests()==1. M1's bare 408 sleepErr is correctly wrapped by the outer Do()'s NonRetriableError (verified Unwrap() chain preserves errors.Is(_, context.DeadlineExceeded)). M4: confirmed SharedKey is safe — sharedKeyCredPolicy.Do only sets x-ms-date when empty (line 109-111), so the internal-retry signature stays identical; the 1+3+1 budget is single-digit seconds.

M2 — ✅ for the two sites I called out. One sibling-consistency item: the 403 path at cosmos_client_retry_policy.go:145 (shouldRetry, err := p.attemptRetryOnEndpointFailure(...); if err != nil { return nil, err }) is the only switch-arm where a helper error still escapes unwrapped. Pre-existing, not a regression from this PR — but the M2 fix made it the lone outlier. Mind tightening it for consistency with the same errorinfo.NonRetriableError(err) wrap?

M3 / M5 ⚠️ — Deferral is structurally correct: I verified #26815 contains exactly what you described (inflight *updateFlight, lastAttemptTime, invalidationGen, everPopulated, invalidate(), context.WithoutCancel). One gating question + one small ask:

Merge-order: #26815 is still OPEN and PR 26858's base is main. Can we merge #26815 first, or add a "blocked by #26815" marker here, so we don't ship the force-refresh against the current gem.Update (which still holds gemMutex.Lock() for the duration of GetAccountProperties)?
Independent of #26815: would you mind adding a 1-2 line comment at each gem.Update(req.Raw().Context(), true) call site (lines 290 and 380) explaining why the force-refresh is safe — referencing the GEM-side singleflight/throttling/invalidation in #26815? Future readers shouldn't have to cross-PR to learn the thundering-herd is mitigated downstream.

Happy to approve once those are sorted (or once #26815 lands).

Includes Azure#26815 (GEM coalescing + invalidation). With that in place, the ambiguous-write / single-master-write give-up branch in attemptRetryOnNetworkError can rely on MarkEndpointUnavailable*'s implicit invalidation instead of a synchronous gem.Update(ctx, true). Added an inline comment explaining the choice (per Kushagra's review on Azure#26858).

Copilot AI review requested due to automatic review settings May 20, 2026 23:15

tvaron3 requested a review from a team as a code owner May 20, 2026 23:15

Copilot started reviewing on behalf of tvaron3 May 20, 2026 23:15 View session

github-actions Bot added the Cosmos label May 20, 2026

Copilot AI reviewed May 20, 2026

View reviewed changes

tvaron3 and others added 2 commits May 20, 2026 16:41

azcosmos: add PR link to CHANGELOG entries

1266d25

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kushagraThapar reviewed May 22, 2026

View reviewed changes

tvaron3 and others added 2 commits May 22, 2026 09:43

Merge remote-tracking branch 'upstream/main' into tvaron3/cross-regio…

ceae009

…n-retry-policy # Conflicts: # sdk/data/azcosmos/CHANGELOG.md

kushagraThapar approved these changes May 22, 2026

View reviewed changes

kushagraThapar reviewed May 22, 2026

View reviewed changes

kushagraThapar approved these changes May 22, 2026

View reviewed changes

tvaron3 merged commit 662195f into Azure:main May 22, 2026
20 checks passed

tvaron3 mentioned this pull request May 26, 2026

[azcosmos] unblock cross-region failover from synchronous gem.Update #26889

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

azcosmos: cross-region retry policy for connection errors and 408s#26858

azcosmos: cross-region retry policy for connection errors and 408s#26858
tvaron3 merged 6 commits into
Azure:mainfrom
tvaron3:tvaron3/cross-region-retry-policy

tvaron3 commented May 20, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kushagraThapar left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kushagraThapar left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tvaron3 commented May 20, 2026

Summary

Connection-error behavior

HTTP 408 handling

azcore interaction

Per-call request budget for persistent connection errors

Tests added

CHANGELOG

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kushagraThapar left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kushagraThapar left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants