azcosmos: cross-region retry policy for connection errors and 408s by tvaron3 · Pull Request #3 · tvaron3/azure-sdk-for-go

tvaron3 · 2026-05-20T23:14:40Z

Summary

Tightens the Azure Cosmos DB Go SDK's client retry policy so writes are not silently duplicated across regions when the outcome of a failed request is ambiguous. Also adds new handling for HTTP 408 Request Timeout.

Connection-error behavior

Classifies transport-level failures as either not-sent (DNS, dial, TLS handshake, ECONNREFUSED/EHOSTUNREACH/ENETUNREACH) or ambiguous (EOF, ECONNRESET, EPIPE, transport-level timeouts, generic net.OpError).
Retries up to 3 times against the current region without marking the endpoint unavailable.
After the same-region budget is exhausted, performs at most one cross-region failover per call:
- Reads → always.
- Writes → only when the error is not-sent (we are sure the request never reached the service). Ambiguous-error writes give up rather than risk a duplicate.
The policy short-circuits if the caller's context deadline expired or was cancelled, so we don't consume the customer's e2e timeout budget with retries.

HTTP 408 handling

408 is now handled by the client retry policy (previously fell through to azcore's outer retry, which has no region awareness).
Reads are retried exactly once against another region.
Writes are returned to the caller immediately as non-retriable (avoids duplicate writes).

azcore interaction

All give-up paths wrap the returned error with errorinfo.NonRetriableError so the outer azcore retry policy does not re-loop our policy and inflate the per-call request count.

Per-call request budget for persistent connection errors

1 initial + 3 same-region retries + 1 cross-region failover = 5 HTTP requests max, then a NonRetriableError is surfaced to the caller.

Tests added

Classifier unit test (TestClassifyNetworkError).
TestConnectionErrorReadFailsOverAfterThreeSameRegionAttempts
TestNotSentConnectionErrorWriteFailsOver (single-master)
TestNotSentConnectionErrorMultiMasterWriteFailsOver
TestAmbiguousConnectionErrorWriteDoesNotFailOver
TestAmbiguousConnectionErrorReadFailsOver
TestConnectionErrorGivesUpAfterSingleCrossRegionFailover
TestCallerDeadlineExceededDoesNotRetry
TestRequestTimeoutReadRetriesCrossRegion
TestRequestTimeoutReadGivesUpAfterOneCrossRegionRetry
TestRequestTimeoutWriteDoesNotRetry
TestDnsErrorRetry (existing, expectation updated to assert sameRegionRetryCount)

All sdk/data/azcosmos package tests pass (go test -short ./..., ~48s).

CHANGELOG

Entry added under 1.5.0-beta.7 → Bugs Fixed.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

Tightens client retry behavior so writes are not silently duplicated across regions when the outcome of a failed request is ambiguous. Connection errors: - Classify transport-level failures as either 'not-sent' (DNS, dial, TLS handshake, ECONNREFUSED-family) or 'ambiguous' (EOF, ECONNRESET, EPIPE, transport timeouts, generic net.OpError). - Retry up to 3 times on the current region without marking the endpoint unavailable. - After the same-region budget, perform at most one cross-region failover per call: always for reads, only for 'not-sent' writes. Writes on ambiguous errors give up rather than risk a duplicate. - Honor caller-set context deadlines/cancellations by short-circuiting without any retries. HTTP 408 (Request Timeout): - Newly handled by the client retry policy. Reads are retried exactly once against another region; writes are returned immediately as non-retriable to avoid duplicates. All give-up paths wrap the returned error with errorinfo.NonRetriableError so the outer azcore retry policy does not re-loop. Adds focused unit tests for the classifier and end-to-end tests for each branch (same-region success, multi-master write failover, ambiguous read failover, ambiguous write give-up, single-failover cap, caller deadline, 408 read failover, 408 read cap, 408 write no-retry). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ntinel wrap Major changes: 1. invalidate() now also cancels the in-flight refresh via a context.CancelFunc stored on refreshOp. Combined with the generation bump, this lets invalidate() actually unwedge a stalled refresh (e.g. sustained 429), not just discard its eventual result. 2. runRefresh wraps its body in a deferred cleanup that recovers from panics and ALWAYS clears entry.inFlight + closes op.done. A panic in fetch/decode/merge can no longer leak the in-flight slot. A new errPKRangeCacheRefreshPanic surfaces the panic value to the awaiter. 3. The internal sentinel errPKRangeCacheInvalidatedDuringRefresh no longer leaks to callers: when the retry loop in getRoutingMap/forceRefresh exhausts (3 attempts), it returns a new public-ish error errPKRangeCacheInvalidatedRepeatedly instead of the unexported sentinel. 4. The 410/PartitionKeyRangeGone caller in refreshPKRangeCache now invalidates the pkrange cache before calling forceRefresh, so the join semantics of forceRefresh produce a fresh post-invalidate snapshot rather than whatever pre-trigger op may already be running. 5. refreshEntryDetached gets a 'safe to call ONLY from runRefresh' warning docstring since its safety depends on the generation check upstream. 6. fetchAllChangeFeedPages comment documents that page-preservation across 429s is intentional cross-SDK divergence (.NET/Java/Python/Rust restart from page 1 on transient failures). 7. The retry-backoff multiplier is pre-capped to avoid integer overflow if a test or future contributor shrinks the base delay to nanoseconds. 8. gatePolicy.started buffer grown from 16 to 64 (named constant with a comment about silent drops past the bound). 9. TestMain comment about global mutation rewritten for accuracy. 10. Three new tests: - panicInRefresh_recoversAndAllowsRetry (Finding #2) - invalidateDuringThrottleStorm_unwedges (Finding #3) - forceRefreshAfterInvalidate_doesNotJoinPreInvalidateOp (Finding Azure#5) 11. forceRefresh docstring restructured with a prominent warning that callers needing strict post-call freshness MUST call invalidate() first. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…zure#26855) * azcosmos: single-pending-I/O pkrange cache and mid-pagination retry Rewrite partitionKeyRangeCache so concurrent callers share a single in-flight refresh per container, and the cached routing map remains readable while a refresh is in flight. The refresh runs on a detached context.Background() so a single caller's context cancellation no longer aborts the shared fetch for other waiters; each caller still honors its own ctx.Done(). forceRefresh now takes the routing map pointer the caller observed. When the entry already holds a fresher map than the caller's view, the refresh is suppressed and the fresher map is returned immediately (pointer-identity dedup of stale-view triggers). Make change-feed pagination resilient to mid-loop transient failures: each page is retried up to changeFeedPageMaxAttempts times with linear backoff on 5xx / 408 / 429 / network errors, preserving the pages already accumulated. Non-transient errors (other 4xx, ctx errors) fail fast. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: add PR links to CHANGELOG entries Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: point CHANGELOG entries at upstream PR Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: address PR + deep-review feedback Cache concurrency: - Bound the detached refresh with a 60s timeout so a wedged transport cannot wedge the cache slot indefinitely. - Move close(op.done) inside the entry.mu critical section so a caller racing the completion edge cannot observe inFlight=nil before the result is published, eliminating the redundant-refresh window. - Skip starting a new background refresh when the caller's ctx is already canceled and no op is in flight (returns ctx.Err() early without spawning). - Drop the unused `previous` parameter from forceRefresh and its pointer-identity dedup; it was only ever called with nil in production. Stale-view dedup will be added in a follow-up that wires the routing-map pointer through every 410-retry call site. invalidate semantics: - Add an entry.generation counter; runRefresh checks the generation at completion and discards its result if invalidate() bumped it in the meantime. Prevents a refresh that pre-dates an invalidate from silently overwriting the invalidation. Transient predicate: - Tighten isTransientPKRangeFetchError: only retry actual transport- level errors (net.Error, io.ErrUnexpectedEOF, syscall.ECONNRESET, syscall.ETIMEDOUT) and 408. Drop 5xx and 429 from this layer — the pipeline retry policy already handles them with Retry-After honoring, so retrying here was double-retry. Drop the blanket "non-ResponseError ⇒ transient" branch so JSON/body-read errors fail fast. - Add jitter to the linear backoff so a fleet doesn't synchronize retries across containers. Observability: - Log refresh start/finish with container link, elapsed, and error so detached HTTP spans can be correlated to the triggering caller. Tests: - Update invalidate-during-refresh test to assert the new result-discarded semantics. - Replace 503 with 408 in mid-pagination tests (5xx no longer retried by this layer). - Add tests for canceled-caller-no-background-fetch on both getRoutingMap and forceRefresh, and for second-wave-after-completion not triggering redundant network requests. - Drop the stale-view-suppression test that tested the removed pointer-identity dedup. CHANGELOG: - Move entries from "Features Added" to "Bugs Fixed" (these are reliability fixes, not API additions). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: address re-review feedback - Remove duplicated/stale Concurrency-model doc block on partitionKeyRangeCache that still described the removed previous parameter. - Drop the no-op callerCtx parameter from ensureInFlightLocked / runRefresh / newDetachedRefreshContext along with the optimistic tracing-span propagation comments. - Fix the invalidate-during-refresh asymmetry: on generation mismatch op.rm is now nil and op.err is errPKRangeCacheInvalidatedDuringRefresh; awaiters re-enter and start a fresh post-invalidate refresh instead of routing against the stale view that motivated the invalidate. - Wrap context.DeadlineExceeded from the detached refresh timeout with errPKRangeCacheRefreshTimeout so upstream errors.Is checks for the caller's own deadline don't misclassify it. - Make pkRangeBackgroundRefreshTimeout a var so tests can override it. - Reduce per-refresh log volume from two lines (start + finish) to one (finish), embedding elapsed and invalidated. - Tests: update invalidate-during-refresh to assert the new sentinel semantics; add a new test that the cache's own refresh timeout is wrapped while preserving DeadlineExceeded as the cause. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: retry CI (transient go1.26.1 download flake) * Broaden isTransientPKRangeFetchError to retry 429, 5xx, and 408 The per-page retry at the cache layer preserves accumulated change-feed pages rather than restarting the drain. Retrying these statuses here is intentional even if the pipeline already retried -- aligns with how .NET and Java SDKs handle metadata request retries. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address kushagraThapar follow-ups: retry budget, forceRefresh isolation, sentinel handling - Bump changeFeedPageMaxAttempts 3 -> 6 and honor Retry-After on 429. Brings the per-page retry budget closer to .NET's MetadataRequestThrottleRetryPolicy / ResourceThrottleRetryPolicy stack (which retries 429s up to 9x with exponential backoff). Matters most on cold-start where there is no warm routing map to fall back on. (#1) - forceRefresh now displaces an in-flight non-forced op (spawned by an earlier getRoutingMap) so its result never escapes to a caller that explicitly asked for fresh data. The displaced goroutine's install is rejected by the generation check. Concurrent forceRefresh callers still share a single forced fetch. (#3) - errPKRangeCacheInvalidatedDuringRefresh is now handled internally by getRoutingMap / forceRefresh in a bounded retry loop (3 attempts). Callers never see the sentinel, so consumers do not have to know about this implementation detail when invalidate() starts being wired up to 410/PartitionKeyRangeGone detection. (Azure#4) - runRefresh only clears entry.inFlight if the slot still points to its own op, so a displaced goroutine's completion cannot clobber a successor op installed by forceRefresh. Tests: invalidateDuringRefresh_discardsResult updated to assert the caller observes the post-invalidate fresh result instead of the swallowed sentinel. TestMain shrinks per-page retry delays so the bumped attempt cap does not bloat test runtime. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Remove detached refresh timeout; retry transient page failures indefinitely Per design discussion: the refresh is background-only, so a wedged transport cannot impact callers that already have a cached map. Cold start callers can bail via their own ctx, and forceRefresh self-heals by displacing the in-flight op. Trade-off: bounded goroutine leak proportional to forceRefresh-during-wedge calls. - runRefresh now uses context.Background() unconditionally (no deadline) - Drop pkRangeBackgroundRefreshTimeout, errPKRangeCacheRefreshTimeout, and newDetachedRefreshContext entirely - fetchOneChangeFeedPageWithRetry loops on transient errors with no attempt cap; returns only on success, non-transient error, or ctx cancellation - Drop Retry-After honoring (azcore pipeline already handles it; retrying here would just compound the wait) - Simpler backoff: capped linear + jitter Tests: replace exhausted-retries test with infinite-retry-eventually- succeeds test (5 transient 408s followed by success). Remove refreshTimeout_isWrapped (no timeout to wrap anymore). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Revert forceRefresh displacement; concurrent callers share in-flight refresh The displacement logic added complexity (forced flag, slot-ownership guard, wasted goroutine on every displacement) to enforce a strict 'fetch started after my call' horizon that practical callers don't need. The only production caller (refreshPKRangeCache on 410 retries) is happy with eventual consistency: the in-flight /pkranges drain is already pulling fresh change-feed pages, and the next 410 retriggers another refresh if needed. Callers that need strict post-call freshness must call invalidate() first — the generation counter then guarantees the in-flight op's result is discarded and a fresh post-invalidate fetch runs. The internal sentinel retry handles the join+invalidate dance so callers never see errPKRangeCacheInvalidatedDuringRefresh. - refreshOp.forced field removed - forceRefresh joins in-flight op (same as getRoutingMap) - ensureInFlightLocked no longer takes a forced parameter - runRefresh always clears entry.inFlight (no displacement to defend against) - Drop the forceRefresh CHANGELOG entry; the join behavior is now the documented contract on forceRefresh Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Simplify PK range cache retry: 429-only, delegate others to pipeline Only 429 throttling is retried at the per-page layer (indefinitely with capped linear backoff + jitter) so accumulated change-feed pages aren't discarded when the service explicitly asks the client to slow down. All other errors — 5xx, 408, transport — are surfaced immediately and delegated to the azcore pipeline's retry policy, avoiding double-retry and keeping responsibility for those classes in one place. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Condense PK range cache comments Trim verbose multi-line comments added across this PR while preserving the key 'why' notes (concurrency model, generation check, 429-only retry rationale). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address deep-reviewer findings: panic recovery, invalidate cancel, sentinel wrap Major changes: 1. invalidate() now also cancels the in-flight refresh via a context.CancelFunc stored on refreshOp. Combined with the generation bump, this lets invalidate() actually unwedge a stalled refresh (e.g. sustained 429), not just discard its eventual result. 2. runRefresh wraps its body in a deferred cleanup that recovers from panics and ALWAYS clears entry.inFlight + closes op.done. A panic in fetch/decode/merge can no longer leak the in-flight slot. A new errPKRangeCacheRefreshPanic surfaces the panic value to the awaiter. 3. The internal sentinel errPKRangeCacheInvalidatedDuringRefresh no longer leaks to callers: when the retry loop in getRoutingMap/forceRefresh exhausts (3 attempts), it returns a new public-ish error errPKRangeCacheInvalidatedRepeatedly instead of the unexported sentinel. 4. The 410/PartitionKeyRangeGone caller in refreshPKRangeCache now invalidates the pkrange cache before calling forceRefresh, so the join semantics of forceRefresh produce a fresh post-invalidate snapshot rather than whatever pre-trigger op may already be running. 5. refreshEntryDetached gets a 'safe to call ONLY from runRefresh' warning docstring since its safety depends on the generation check upstream. 6. fetchAllChangeFeedPages comment documents that page-preservation across 429s is intentional cross-SDK divergence (.NET/Java/Python/Rust restart from page 1 on transient failures). 7. The retry-backoff multiplier is pre-capped to avoid integer overflow if a test or future contributor shrinks the base delay to nanoseconds. 8. gatePolicy.started buffer grown from 16 to 64 (named constant with a comment about silent drops past the bound). 9. TestMain comment about global mutation rewritten for accuracy. 10. Three new tests: - panicInRefresh_recoversAndAllowsRetry (Finding #2) - invalidateDuringThrottleStorm_unwedges (Finding #3) - forceRefreshAfterInvalidate_doesNotJoinPreInvalidateOp (Finding Azure#5) 11. forceRefresh docstring restructured with a prominent warning that callers needing strict post-call freshness MUST call invalidate() first. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Remove accidentally-committed PR diff dump Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Drop CHANGELOG entry for invalidate-cancel/panic-recovery internals Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * gofmt: re-indent appended test fixtures Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

tvaron3 closed this May 20, 2026

github-actions Bot added the Cosmos label May 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

azcosmos: cross-region retry policy for connection errors and 408s#3

azcosmos: cross-region retry policy for connection errors and 408s#3
tvaron3 wants to merge 1 commit into
mainfrom
tvaron3/cross-region-retry-policy

tvaron3 commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tvaron3 commented May 20, 2026

Summary

Connection-error behavior

HTTP 408 handling

azcore interaction

Per-call request budget for persistent connection errors

Tests added

CHANGELOG

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant