azcosmos: cross-region retry policy for connection errors and 408s#26858
Conversation
Tightens client retry behavior so writes are not silently duplicated across regions when the outcome of a failed request is ambiguous. Connection errors: - Classify transport-level failures as either 'not-sent' (DNS, dial, TLS handshake, ECONNREFUSED-family) or 'ambiguous' (EOF, ECONNRESET, EPIPE, transport timeouts, generic net.OpError). - Retry up to 3 times on the current region without marking the endpoint unavailable. - After the same-region budget, perform at most one cross-region failover per call: always for reads, only for 'not-sent' writes. Writes on ambiguous errors give up rather than risk a duplicate. - Honor caller-set context deadlines/cancellations by short-circuiting without any retries. HTTP 408 (Request Timeout): - Newly handled by the client retry policy. Reads are retried exactly once against another region; writes are returned immediately as non-retriable to avoid duplicates. All give-up paths wrap the returned error with errorinfo.NonRetriableError so the outer azcore retry policy does not re-loop. Adds focused unit tests for the classifier and end-to-end tests for each branch (same-region success, multi-master write failover, ambiguous read failover, ambiguous write give-up, single-failover cap, caller deadline, 408 read failover, 408 read cap, 408 write no-retry). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR updates sdk/data/azcosmos retry behavior to be region-aware for transport errors and HTTP 408s, with the goal of preventing ambiguous write retries from silently duplicating across regions.
Changes:
- Added connection-error classification (not-sent vs ambiguous) and updated retry behavior to do limited same-region retries before a single cross-region failover (writes only when “not-sent”).
- Added explicit handling for HTTP
408 Request Timeout(read: single cross-region retry; write: no retry) and wrapped “give up” paths witherrorinfo.NonRetriableErrorto prevent outer azcore retries. - Added/updated unit tests and a changelog entry describing the new retry semantics.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| sdk/data/azcosmos/cosmos_client_retry_policy.go | Implements new connection-error classification, same-region retry budget, single cross-region failover rules, and HTTP 408 handling. |
| sdk/data/azcosmos/cosmos_client_retry_policy_test.go | Adds tests for the new retry behaviors and updates existing DNS retry expectations. |
| sdk/data/azcosmos/CHANGELOG.md | Documents the updated retry behavior and HTTP 408 handling under Bugs Fixed. |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Connection-error retry policy (cosmos_client_retry_policy.go):
- Wrap caller-context error (ctx.Err()) when short-circuiting instead
of the underlying transport error, so errors.Is(returned,
context.DeadlineExceeded) works and the real cancellation reason is
preserved alongside the transport error.
- Honor enableCrossRegionRetries=false as 'fail fast' across all
branches — no same-region retries when the user has explicitly
opted out.
- When giving up on an ambiguous write or a single-master write that
cannot meaningfully fail over, still mark the endpoint unavailable
for read so concurrent requests learn about the regional outage.
Skip marking write-unavailable on single-master accounts (no other
write region exists).
- Short-circuit single-master writes before attempting a
'cross-region failover' that would resolve back to the same write
region (wasted attempt + degraded cache).
- Replace bare time.Sleep with a context-aware sleepWithContext
helper so backoffs honor caller-set deadlines instead of consuming
the caller's budget asleep.
- Force a GEM refresh (Update(ctx, true)) after marking an endpoint
unavailable so the new readEndpoints/writeEndpoints take effect
for concurrent requests too — without the force, refresh is gated
on a 5-minute interval and the unavailability signal is lost.
- 408 cross-region read retry now also forces a refresh and respects
caller context during its backoff.
- Restructure classifyNetworkError: fold the trailing OpError check
into the OpError branch for clarity, add syscall.ETIMEDOUT as
not-sent (covers TCP connect timeouts), and document the
cross-platform syscall normalization.
Tests (cosmos_client_retry_policy_test.go):
- Parameterize setupRetryPolicyTestClient via setupRetryPolicyTestClientOpts
so tests can opt into single-master and disabled-cross-region
configurations.
- Add TestSingleMasterWriteDoesNotFailoverOnConnectionError covering
the production-default single-master write topology.
- Add TestAmbiguousWriteMarksEndpointUnavailableForRead asserting the
shared-state hygiene fix.
- Add TestConnectionErrorWithCrossRegionRetriesDisabledFailsFast.
- Add TestCallerDeadlineDuringBackoffShortCircuits verifying the
context-aware sleep returns early when ctx fires.
- Add classifier coverage for syscall.ETIMEDOUT and a real
*net.OpError{Op: read, Err: syscall.ECONNRESET}.
- Rename fakeNetOpError → fakeAmbiguousNetError (more accurate — it
is not an *net.OpError).
- Replace ignored json.Marshal errors with require.NoError in all 4
affected tests (Copilot review feedback).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kushagraThapar
left a comment
There was a problem hiding this comment.
Thanks @tvaron3 for this — it's a substantively good improvement and directionally aligns Go with the more conservative classifier patterns used by the Java, Python, and Rust Cosmos SDKs (Go was previously closer to the more permissive .NET-style "any HttpRequestException → mark + failover"). Moving away from that is a real win for write safety, and the connectionErrorNotSent vs connectionErrorAmbiguous taxonomy is exactly the right shape.
One supportability angle I'd love to flag for discussion before this lands: this PR locks in transport-error retry semantics that we'll inherit as part of the public client contract — customers will start writing code (or just operating production workloads) that assumes "at most N attempts on connection error" and "ambiguous writes don't duplicate". Once those expectations form, tightening or loosening them later is hard to do without a breaking-change conversation. So I'd love to make sure we get the safety story right the first time.
With that lens, I have one headline finding I'd really love your input on, plus five smaller things — all inline.
Headline (B1): I think the same-region 3-retry budget may unintentionally retry ambiguous-error writes against the same endpoint, which would contradict the PR's own stated safety guarantee and open a duplicate-mutation surface for non-idempotent operations like PatchItem(AppendIncrement) / AppendAdd and ExecuteTransactionalBatch containing such ops. The cross-region branch correctly gates on kind == connectionErrorNotSent, but the same-region branch above it doesn't seem to. Java / Python / Rust all do 0 same-region retries for ambiguous writes; I think Go would be the outlier here. Detail (with the trace + peer-SDK evidence) inline at line 226.
Others (M1–M5), inline:
- M1 — three new
sleepWithContextcall sites that swallowctx.Err()(lines 228, 284, 371), inconsistent with the working pattern at line 115. - M2 —
errRetryreturned unwrapped at line 121 (anderrat line 159 for 408) — looks like it bypasses theNonRetriableErrorcontract the PR description establishes. - M3 —
gem.Update(ctx, true)force-refresh at lines 277 + 367 could create a thundering-herd serialized undergemMutex. - M4 — the new internal
continuepaths interact with the pre-existing pipeline ordering atcosmos_client.go:198-200in a way that might silently affect short-lived AAD/MSI tokens. - M5 — the ambiguous-write give-up branch at line 263 skips the
Update(forceRefresh=true)step that the cross-region failover branch does at line 277 — likely intertwined with M3.
Two things worth calling out as already-good:
- The cross-region failover budget = 1 is dramatically tighter than the peers (.NET / Rust = 120, Java large, Python iterates all regions). I think that conservative stance is actually the right call for write safety — wanted to flag that this divergence reads as deliberate and is probably worth a quick line in the PR description so future readers know.
- All 6 prior Copilot inline findings appear addressed in
a0dff927e7— nice cleanup.
Happy to discuss any of this — totally defer to you on which to take. Could you please share your thinking on B1 in particular?
- attemptRetryOnNetworkError: ambiguous-error writes now skip the
same-region retry budget. Previously the same-region branch was
only gated on crossRegionFailoverDone/sameRegionRetryCount, so a
non-idempotent mutation (PatchItem(Increment), TransactionalBatch
with non-idempotent ops) that returned ECONNRESET mid-response
could be silently applied up to 4 times. Aligns with Java/Python/
Rust SDKs and with the PR description / CHANGELOG.
- Preserve caller-context cancellation cause across backoff:
sleepWithContext failures in attemptRetryOnNetworkError now
return fmt.Errorf("%w: underlying transport error: %v", sleepErr,
transportErr), and the 408 path returns the bare sleep error.
The outer loop wraps these as NonRetriableError, so
errors.Is(returned, context.DeadlineExceeded) now works
consistently whether the deadline fires before req.Next() or
during backoff.
- Wrap errRetry from attemptRetryOnNetworkError and
attemptRetryOnRequestTimeout in NonRetriableError. Without this
wrap, gem.MarkEndpointUnavailable* / gem.Update errors propagate
to azcore's outer retry policy, which can re-loop us up to its
MaxRetries=3 default and blow past the documented 5-request
budget.
- cosmos_client.go: swap PerRetry order to [clientRetryPolicy,
authPolicy] so the internal retry loop's req.Next() naturally
re-invokes authPolicy on every attempt. Short-lived AAD/MSI tokens
could otherwise expire across the cumulative same-region +
cross-region retry window and surface as a misleading 401 on a
late retry.
- Test updates: TestAmbiguousConnectionErrorWriteDoesNotFailOver
now asserts srv.Requests() == 1 and no retries, encoding the new
ambiguous-write contract instead of the buggy prior behavior.
TestCallerDeadlineDuringBackoffShortCircuits adds an
ErrorIs(context.DeadlineExceeded) assertion.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n-retry-policy # Conflicts: # sdk/data/azcosmos/CHANGELOG.md
kushagraThapar
left a comment
There was a problem hiding this comment.
Re-reviewed against caaea52f55c3.
B1, M1, M4 ✅ — verified. B1 gate at :240-248 short-circuits ambiguous writes before the budget increment and sleep, and TestAmbiguousConnectionErrorWriteDoesNotFailOver asserts srv.Requests()==1. M1's bare 408 sleepErr is correctly wrapped by the outer Do()'s NonRetriableError (verified Unwrap() chain preserves errors.Is(_, context.DeadlineExceeded)). M4: confirmed SharedKey is safe — sharedKeyCredPolicy.Do only sets x-ms-date when empty (line 109-111), so the internal-retry signature stays identical; the 1+3+1 budget is single-digit seconds.
M2 — ✅ for the two sites I called out. One sibling-consistency item: the 403 path at cosmos_client_retry_policy.go:145 (shouldRetry, err := p.attemptRetryOnEndpointFailure(...); if err != nil { return nil, err }) is the only switch-arm where a helper error still escapes unwrapped. Pre-existing, not a regression from this PR — but the M2 fix made it the lone outlier. Mind tightening it for consistency with the same errorinfo.NonRetriableError(err) wrap?
M3 / M5 inflight *updateFlight, lastAttemptTime, invalidationGen, everPopulated, invalidate(), context.WithoutCancel). One gating question + one small ask:
- Merge-order: #26815 is still OPEN and PR 26858's base is
main. Can we merge #26815 first, or add a "blocked by #26815" marker here, so we don't ship the force-refresh against the currentgem.Update(which still holdsgemMutex.Lock()for the duration ofGetAccountProperties)? - Independent of #26815: would you mind adding a 1-2 line comment at each
gem.Update(req.Raw().Context(), true)call site (lines 290 and 380) explaining why the force-refresh is safe — referencing the GEM-side singleflight/throttling/invalidation in #26815? Future readers shouldn't have to cross-PR to learn the thundering-herd is mitigated downstream.
Happy to approve once those are sorted (or once #26815 lands).
Includes Azure#26815 (GEM coalescing + invalidation). With that in place, the ambiguous-write / single-master-write give-up branch in attemptRetryOnNetworkError can rely on MarkEndpointUnavailable*'s implicit invalidation instead of a synchronous gem.Update(ctx, true). Added an inline comment explaining the choice (per Kushagra's review on Azure#26858).
Summary
Tightens the Azure Cosmos DB Go SDK's client retry policy so writes are not silently duplicated across regions when the outcome of a failed request is ambiguous. Also adds new handling for HTTP
408 Request Timeout.Connection-error behavior
ECONNREFUSED/EHOSTUNREACH/ENETUNREACH) or ambiguous (EOF,ECONNRESET,EPIPE, transport-level timeouts, genericnet.OpError).not-sent(we are sure the request never reached the service). Ambiguous-error writes give up rather than risk a duplicate.contextdeadline expired or was cancelled, so we don't consume the customer's e2e timeout budget with retries.HTTP 408 handling
azcore interaction
All give-up paths wrap the returned error with
errorinfo.NonRetriableErrorso the outer azcore retry policy does not re-loop our policy and inflate the per-call request count.Per-call request budget for persistent connection errors
1 initial + 3 same-region retries + 1 cross-region failover = 5 HTTP requests max, then aNonRetriableErroris surfaced to the caller.Tests added
TestClassifyNetworkError).TestConnectionErrorReadFailsOverAfterThreeSameRegionAttemptsTestNotSentConnectionErrorWriteFailsOver(single-master)TestNotSentConnectionErrorMultiMasterWriteFailsOverTestAmbiguousConnectionErrorWriteDoesNotFailOverTestAmbiguousConnectionErrorReadFailsOverTestConnectionErrorGivesUpAfterSingleCrossRegionFailoverTestCallerDeadlineExceededDoesNotRetryTestRequestTimeoutReadRetriesCrossRegionTestRequestTimeoutReadGivesUpAfterOneCrossRegionRetryTestRequestTimeoutWriteDoesNotRetryTestDnsErrorRetry(existing, expectation updated to assertsameRegionRetryCount)All
sdk/data/azcosmospackage tests pass (go test -short ./..., ~48s).CHANGELOG
Entry added under
1.5.0-beta.7→ Bugs Fixed.Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com