Skip to content

azcosmos: cross-region retry policy for connection errors and 408s#26858

Merged
tvaron3 merged 6 commits into
Azure:mainfrom
tvaron3:tvaron3/cross-region-retry-policy
May 22, 2026
Merged

azcosmos: cross-region retry policy for connection errors and 408s#26858
tvaron3 merged 6 commits into
Azure:mainfrom
tvaron3:tvaron3/cross-region-retry-policy

Conversation

@tvaron3
Copy link
Copy Markdown
Member

@tvaron3 tvaron3 commented May 20, 2026

Summary

Tightens the Azure Cosmos DB Go SDK's client retry policy so writes are not silently duplicated across regions when the outcome of a failed request is ambiguous. Also adds new handling for HTTP 408 Request Timeout.

Connection-error behavior

  • Classifies transport-level failures as either not-sent (DNS, dial, TLS handshake, ECONNREFUSED/EHOSTUNREACH/ENETUNREACH) or ambiguous (EOF, ECONNRESET, EPIPE, transport-level timeouts, generic net.OpError).
  • Retries up to 3 times against the current region without marking the endpoint unavailable.
  • After the same-region budget is exhausted, performs at most one cross-region failover per call:
    • Reads → always.
    • Writes → only when the error is not-sent (we are sure the request never reached the service). Ambiguous-error writes give up rather than risk a duplicate.
  • The policy short-circuits if the caller's context deadline expired or was cancelled, so we don't consume the customer's e2e timeout budget with retries.

HTTP 408 handling

  • 408 is now handled by the client retry policy (previously fell through to azcore's outer retry, which has no region awareness).
  • Reads are retried exactly once against another region.
  • Writes are returned to the caller immediately as non-retriable (avoids duplicate writes).

azcore interaction

All give-up paths wrap the returned error with errorinfo.NonRetriableError so the outer azcore retry policy does not re-loop our policy and inflate the per-call request count.

Per-call request budget for persistent connection errors

1 initial + 3 same-region retries + 1 cross-region failover = 5 HTTP requests max, then a NonRetriableError is surfaced to the caller.

Tests added

  • Classifier unit test (TestClassifyNetworkError).
  • TestConnectionErrorReadFailsOverAfterThreeSameRegionAttempts
  • TestNotSentConnectionErrorWriteFailsOver (single-master)
  • TestNotSentConnectionErrorMultiMasterWriteFailsOver
  • TestAmbiguousConnectionErrorWriteDoesNotFailOver
  • TestAmbiguousConnectionErrorReadFailsOver
  • TestConnectionErrorGivesUpAfterSingleCrossRegionFailover
  • TestCallerDeadlineExceededDoesNotRetry
  • TestRequestTimeoutReadRetriesCrossRegion
  • TestRequestTimeoutReadGivesUpAfterOneCrossRegionRetry
  • TestRequestTimeoutWriteDoesNotRetry
  • TestDnsErrorRetry (existing, expectation updated to assert sameRegionRetryCount)

All sdk/data/azcosmos package tests pass (go test -short ./..., ~48s).

CHANGELOG

Entry added under 1.5.0-beta.7 → Bugs Fixed.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

Tightens client retry behavior so writes are not silently duplicated
across regions when the outcome of a failed request is ambiguous.

Connection errors:
- Classify transport-level failures as either 'not-sent' (DNS, dial,
  TLS handshake, ECONNREFUSED-family) or 'ambiguous' (EOF, ECONNRESET,
  EPIPE, transport timeouts, generic net.OpError).
- Retry up to 3 times on the current region without marking the
  endpoint unavailable.
- After the same-region budget, perform at most one cross-region
  failover per call: always for reads, only for 'not-sent' writes.
  Writes on ambiguous errors give up rather than risk a duplicate.
- Honor caller-set context deadlines/cancellations by short-circuiting
  without any retries.

HTTP 408 (Request Timeout):
- Newly handled by the client retry policy. Reads are retried exactly
  once against another region; writes are returned immediately as
  non-retriable to avoid duplicates.

All give-up paths wrap the returned error with errorinfo.NonRetriableError
so the outer azcore retry policy does not re-loop.

Adds focused unit tests for the classifier and end-to-end tests for
each branch (same-region success, multi-master write failover, ambiguous
read failover, ambiguous write give-up, single-failover cap, caller
deadline, 408 read failover, 408 read cap, 408 write no-retry).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 20, 2026 23:15
@tvaron3 tvaron3 requested a review from a team as a code owner May 20, 2026 23:15
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates sdk/data/azcosmos retry behavior to be region-aware for transport errors and HTTP 408s, with the goal of preventing ambiguous write retries from silently duplicating across regions.

Changes:

  • Added connection-error classification (not-sent vs ambiguous) and updated retry behavior to do limited same-region retries before a single cross-region failover (writes only when “not-sent”).
  • Added explicit handling for HTTP 408 Request Timeout (read: single cross-region retry; write: no retry) and wrapped “give up” paths with errorinfo.NonRetriableError to prevent outer azcore retries.
  • Added/updated unit tests and a changelog entry describing the new retry semantics.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
sdk/data/azcosmos/cosmos_client_retry_policy.go Implements new connection-error classification, same-region retry budget, single cross-region failover rules, and HTTP 408 handling.
sdk/data/azcosmos/cosmos_client_retry_policy_test.go Adds tests for the new retry behaviors and updates existing DNS retry expectations.
sdk/data/azcosmos/CHANGELOG.md Documents the updated retry behavior and HTTP 408 handling under Bugs Fixed.

Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go Outdated
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy_test.go
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy_test.go Outdated
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy_test.go Outdated
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy_test.go
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy_test.go Outdated
tvaron3 and others added 2 commits May 20, 2026 16:41
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Connection-error retry policy (cosmos_client_retry_policy.go):
- Wrap caller-context error (ctx.Err()) when short-circuiting instead
  of the underlying transport error, so errors.Is(returned,
  context.DeadlineExceeded) works and the real cancellation reason is
  preserved alongside the transport error.
- Honor enableCrossRegionRetries=false as 'fail fast' across all
  branches — no same-region retries when the user has explicitly
  opted out.
- When giving up on an ambiguous write or a single-master write that
  cannot meaningfully fail over, still mark the endpoint unavailable
  for read so concurrent requests learn about the regional outage.
  Skip marking write-unavailable on single-master accounts (no other
  write region exists).
- Short-circuit single-master writes before attempting a
  'cross-region failover' that would resolve back to the same write
  region (wasted attempt + degraded cache).
- Replace bare time.Sleep with a context-aware sleepWithContext
  helper so backoffs honor caller-set deadlines instead of consuming
  the caller's budget asleep.
- Force a GEM refresh (Update(ctx, true)) after marking an endpoint
  unavailable so the new readEndpoints/writeEndpoints take effect
  for concurrent requests too — without the force, refresh is gated
  on a 5-minute interval and the unavailability signal is lost.
- 408 cross-region read retry now also forces a refresh and respects
  caller context during its backoff.
- Restructure classifyNetworkError: fold the trailing OpError check
  into the OpError branch for clarity, add syscall.ETIMEDOUT as
  not-sent (covers TCP connect timeouts), and document the
  cross-platform syscall normalization.

Tests (cosmos_client_retry_policy_test.go):
- Parameterize setupRetryPolicyTestClient via setupRetryPolicyTestClientOpts
  so tests can opt into single-master and disabled-cross-region
  configurations.
- Add TestSingleMasterWriteDoesNotFailoverOnConnectionError covering
  the production-default single-master write topology.
- Add TestAmbiguousWriteMarksEndpointUnavailableForRead asserting the
  shared-state hygiene fix.
- Add TestConnectionErrorWithCrossRegionRetriesDisabledFailsFast.
- Add TestCallerDeadlineDuringBackoffShortCircuits verifying the
  context-aware sleep returns early when ctx fires.
- Add classifier coverage for syscall.ETIMEDOUT and a real
  *net.OpError{Op: read, Err: syscall.ECONNRESET}.
- Rename fakeNetOpError → fakeAmbiguousNetError (more accurate — it
  is not an *net.OpError).
- Replace ignored json.Marshal errors with require.NoError in all 4
  affected tests (Copilot review feedback).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tvaron3 for this — it's a substantively good improvement and directionally aligns Go with the more conservative classifier patterns used by the Java, Python, and Rust Cosmos SDKs (Go was previously closer to the more permissive .NET-style "any HttpRequestException → mark + failover"). Moving away from that is a real win for write safety, and the connectionErrorNotSent vs connectionErrorAmbiguous taxonomy is exactly the right shape.

One supportability angle I'd love to flag for discussion before this lands: this PR locks in transport-error retry semantics that we'll inherit as part of the public client contract — customers will start writing code (or just operating production workloads) that assumes "at most N attempts on connection error" and "ambiguous writes don't duplicate". Once those expectations form, tightening or loosening them later is hard to do without a breaking-change conversation. So I'd love to make sure we get the safety story right the first time.

With that lens, I have one headline finding I'd really love your input on, plus five smaller things — all inline.

Headline (B1): I think the same-region 3-retry budget may unintentionally retry ambiguous-error writes against the same endpoint, which would contradict the PR's own stated safety guarantee and open a duplicate-mutation surface for non-idempotent operations like PatchItem(AppendIncrement) / AppendAdd and ExecuteTransactionalBatch containing such ops. The cross-region branch correctly gates on kind == connectionErrorNotSent, but the same-region branch above it doesn't seem to. Java / Python / Rust all do 0 same-region retries for ambiguous writes; I think Go would be the outlier here. Detail (with the trace + peer-SDK evidence) inline at line 226.

Others (M1–M5), inline:

  • M1 — three new sleepWithContext call sites that swallow ctx.Err() (lines 228, 284, 371), inconsistent with the working pattern at line 115.
  • M2errRetry returned unwrapped at line 121 (and err at line 159 for 408) — looks like it bypasses the NonRetriableError contract the PR description establishes.
  • M3gem.Update(ctx, true) force-refresh at lines 277 + 367 could create a thundering-herd serialized under gemMutex.
  • M4 — the new internal continue paths interact with the pre-existing pipeline ordering at cosmos_client.go:198-200 in a way that might silently affect short-lived AAD/MSI tokens.
  • M5 — the ambiguous-write give-up branch at line 263 skips the Update(forceRefresh=true) step that the cross-region failover branch does at line 277 — likely intertwined with M3.

Two things worth calling out as already-good:

  • The cross-region failover budget = 1 is dramatically tighter than the peers (.NET / Rust = 120, Java large, Python iterates all regions). I think that conservative stance is actually the right call for write safety — wanted to flag that this divergence reads as deliberate and is probably worth a quick line in the PR description so future readers know.
  • All 6 prior Copilot inline findings appear addressed in a0dff927e7 — nice cleanup.

Happy to discuss any of this — totally defer to you on which to take. Could you please share your thinking on B1 in particular?

Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go Outdated
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go Outdated
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go
tvaron3 and others added 2 commits May 22, 2026 09:43
- attemptRetryOnNetworkError: ambiguous-error writes now skip the
  same-region retry budget. Previously the same-region branch was
  only gated on crossRegionFailoverDone/sameRegionRetryCount, so a
  non-idempotent mutation (PatchItem(Increment), TransactionalBatch
  with non-idempotent ops) that returned ECONNRESET mid-response
  could be silently applied up to 4 times. Aligns with Java/Python/
  Rust SDKs and with the PR description / CHANGELOG.

- Preserve caller-context cancellation cause across backoff:
  sleepWithContext failures in attemptRetryOnNetworkError now
  return fmt.Errorf("%w: underlying transport error: %v", sleepErr,
  transportErr), and the 408 path returns the bare sleep error.
  The outer loop wraps these as NonRetriableError, so
  errors.Is(returned, context.DeadlineExceeded) now works
  consistently whether the deadline fires before req.Next() or
  during backoff.

- Wrap errRetry from attemptRetryOnNetworkError and
  attemptRetryOnRequestTimeout in NonRetriableError. Without this
  wrap, gem.MarkEndpointUnavailable* / gem.Update errors propagate
  to azcore's outer retry policy, which can re-loop us up to its
  MaxRetries=3 default and blow past the documented 5-request
  budget.

- cosmos_client.go: swap PerRetry order to [clientRetryPolicy,
  authPolicy] so the internal retry loop's req.Next() naturally
  re-invokes authPolicy on every attempt. Short-lived AAD/MSI tokens
  could otherwise expire across the cumulative same-region +
  cross-region retry window and surface as a misleading 401 on a
  late retry.

- Test updates: TestAmbiguousConnectionErrorWriteDoesNotFailOver
  now asserts srv.Requests() == 1 and no retries, encoding the new
  ambiguous-write contract instead of the buggy prior behavior.
  TestCallerDeadlineDuringBackoffShortCircuits adds an
  ErrorIs(context.DeadlineExceeded) assertion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n-retry-policy

# Conflicts:
#	sdk/data/azcosmos/CHANGELOG.md
Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed against caaea52f55c3.

B1, M1, M4 ✅ — verified. B1 gate at :240-248 short-circuits ambiguous writes before the budget increment and sleep, and TestAmbiguousConnectionErrorWriteDoesNotFailOver asserts srv.Requests()==1. M1's bare 408 sleepErr is correctly wrapped by the outer Do()'s NonRetriableError (verified Unwrap() chain preserves errors.Is(_, context.DeadlineExceeded)). M4: confirmed SharedKey is safe — sharedKeyCredPolicy.Do only sets x-ms-date when empty (line 109-111), so the internal-retry signature stays identical; the 1+3+1 budget is single-digit seconds.

M2 — ✅ for the two sites I called out. One sibling-consistency item: the 403 path at cosmos_client_retry_policy.go:145 (shouldRetry, err := p.attemptRetryOnEndpointFailure(...); if err != nil { return nil, err }) is the only switch-arm where a helper error still escapes unwrapped. Pre-existing, not a regression from this PR — but the M2 fix made it the lone outlier. Mind tightening it for consistency with the same errorinfo.NonRetriableError(err) wrap?

M3 / M5 ⚠️ — Deferral is structurally correct: I verified #26815 contains exactly what you described (inflight *updateFlight, lastAttemptTime, invalidationGen, everPopulated, invalidate(), context.WithoutCancel). One gating question + one small ask:

  1. Merge-order: #26815 is still OPEN and PR 26858's base is main. Can we merge #26815 first, or add a "blocked by #26815" marker here, so we don't ship the force-refresh against the current gem.Update (which still holds gemMutex.Lock() for the duration of GetAccountProperties)?
  2. Independent of #26815: would you mind adding a 1-2 line comment at each gem.Update(req.Raw().Context(), true) call site (lines 290 and 380) explaining why the force-refresh is safe — referencing the GEM-side singleflight/throttling/invalidation in #26815? Future readers shouldn't have to cross-PR to learn the thundering-herd is mitigated downstream.

Happy to approve once those are sorted (or once #26815 lands).

Includes Azure#26815 (GEM coalescing + invalidation). With that in place,
the ambiguous-write / single-master-write give-up branch in
attemptRetryOnNetworkError can rely on MarkEndpointUnavailable*'s
implicit invalidation instead of a synchronous gem.Update(ctx, true).
Added an inline comment explaining the choice (per Kushagra's review
on Azure#26858).
@tvaron3 tvaron3 merged commit 662195f into Azure:main May 22, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants