azcosmos: single-pending-I/O pkrange cache and mid-pagination retry by tvaron3 · Pull Request #1 · tvaron3/azure-sdk-for-go

tvaron3 · 2026-05-20T07:47:50Z

Today partitionKeyRangeCache.getRoutingMap / forceRefresh hold the per-entry mutex across the entire change-feed pagination. Two consequences hurt under load: a caller that already has a usable cached routing map can block behind an unrelated refresh, and concurrent 410-triggered forced refreshes serialize instead of sharing one network fetch. Also, any HTTP failure mid-pagination discards all already-accumulated pages and forces the next refresh to start from page 1.

This PR adopts a single-pending-I/O model for the cache and makes the change-feed loop resilient to transient mid-pagination errors.

Cache contract

At most one refresh per container at any time. Concurrent first-fetch callers, and concurrent forced-refresh callers, share one in-flight op.
While a refresh is in flight, callers that already see a usable routing map return immediately. Only first-fetch and forced-refresh paths await.
The refresh goroutine runs on context.Background(), so a single caller's ctx cancellation does not abort the shared fetch. Each waiter still honors its own ctx.Done() via select.
forceRefresh takes a previous *collectionRoutingMap pointer. If the entry already holds a fresher map than the caller's view, the call is short-circuited and the fresher map is returned without a new fetch (pointer-identity dedup of stale-view triggers).
invalidate no longer races refreshes: an in-flight result simply becomes the next observed map.

Mid-pagination retry

fetchAllChangeFeedPages now delegates each page to fetchOneChangeFeedPageWithRetry, which retries changeFeedPageMaxAttempts = 3 times with linear backoff (100 ms, 200 ms) on transient errors:

Retried: 5xx, 408, 429, and any non-HTTP/network error.
Failed fast: other 4xx and context.Canceled / DeadlineExceeded.

The accumulator and last-good ETag survive the retry, so a single bad page no longer discards the pages already collected.

Out of scope

ReadMany retry behaviour is unchanged (still full-restart on 410). The query engine path and GetFeedRanges are untouched.

Notes

No new public API surface.
New tests cover: cached-during-refresh fast path, concurrent shared fetch, stale-view dedup, caller-cancel-doesn't-abort-shared-fetch, invalidate-during-refresh, error-clears-inflight-slot, transient-retry-succeeds, exhausted-retries-fails, non-transient-fails-fast.
go test -count=1 -short -race ./sdk/data/azcosmos/... passes.
CHANGELOG updated under 1.5.0-beta.7 (Unreleased).

Checklist

The purpose of this PR is explained in this or a referenced issue.
The PR does not update generated files.
Tests are included and/or updated for code changes.
Updates to module CHANGELOG.md are included.
MIT license headers are included in each file.

Rewrite partitionKeyRangeCache so concurrent callers share a single in-flight refresh per container, and the cached routing map remains readable while a refresh is in flight. The refresh runs on a detached context.Background() so a single caller's context cancellation no longer aborts the shared fetch for other waiters; each caller still honors its own ctx.Done(). forceRefresh now takes the routing map pointer the caller observed. When the entry already holds a fresher map than the caller's view, the refresh is suppressed and the fresher map is returned immediately (pointer-identity dedup of stale-view triggers). Make change-feed pagination resilient to mid-loop transient failures: each page is retried up to changeFeedPageMaxAttempts times with linear backoff on 5xx / 408 / 429 / network errors, preserving the pages already accumulated. Non-transient errors (other 4xx, ctx errors) fail fast. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…on, sentinel handling - Bump changeFeedPageMaxAttempts 3 -> 6 and honor Retry-After on 429. Brings the per-page retry budget closer to .NET's MetadataRequestThrottleRetryPolicy / ResourceThrottleRetryPolicy stack (which retries 429s up to 9x with exponential backoff). Matters most on cold-start where there is no warm routing map to fall back on. (#1) - forceRefresh now displaces an in-flight non-forced op (spawned by an earlier getRoutingMap) so its result never escapes to a caller that explicitly asked for fresh data. The displaced goroutine's install is rejected by the generation check. Concurrent forceRefresh callers still share a single forced fetch. (#3) - errPKRangeCacheInvalidatedDuringRefresh is now handled internally by getRoutingMap / forceRefresh in a bounded retry loop (3 attempts). Callers never see the sentinel, so consumers do not have to know about this implementation detail when invalidate() starts being wired up to 410/PartitionKeyRangeGone detection. (Azure#4) - runRefresh only clears entry.inFlight if the slot still points to its own op, so a displaced goroutine's completion cannot clobber a successor op installed by forceRefresh. Tests: invalidateDuringRefresh_discardsResult updated to assert the caller observes the post-invalidate fresh result instead of the swallowed sentinel. TestMain shrinks per-page retry delays so the bumped attempt cap does not bloat test runtime. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…iew #1,#2,Azure#4,Azure#13, comment cleanup Behavioral reverts (preserve long-standing semantics): - cosmos_client_retry_policy.go: write-retry path uses gem.Update(ctx, true) to force-refresh per attempt (intentional; restores prior behavior). - cosmos_location_cache.go: resolveServiceEndpoint with enableCrossRegionRetries=false returns lc.defaultEndpoint for single-master writes (avoids a breaking change). Fixes addressing prior deep-review findings: - cosmos_global_endpoint_manager_policy.go: * Log async refresh failures (was discarded via '_ ='); chronic post-bootstrap topology drift is now observable. * Capture refreshCtx via context.WithoutCancel BEFORE launching the goroutine so it does not depend on req's lifetime. - cosmos_global_endpoint_manager.go: leader's HTTP call runs under context.WithoutCancel(ctx); waiters still respect their own ctx so caller-side cancellation no longer poisons coalesced waiters. - cosmos_location_cache.go markEndpointUnavailable: fast-path skips the full updateLocked recompute when wasAlreadyUnavailable==true (route lists cannot change in that case). Tests and docs: - TestFix3 renamed/inverted to TestFix3_WriteRetryForceRefreshesGEM. - TestDefaultEndpointElim_CrossRegionRetriesDisabled* renamed/inverted to assert the restored default-endpoint behavior. - TestRegression25468_* renamed to TestRegression_*. - Comments stripped of Azure#25468 references and tightened. - CHANGELOG updated to link to PR 26815 and reflect the reverted cross-region-retries=false behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…zure#26855) * azcosmos: single-pending-I/O pkrange cache and mid-pagination retry Rewrite partitionKeyRangeCache so concurrent callers share a single in-flight refresh per container, and the cached routing map remains readable while a refresh is in flight. The refresh runs on a detached context.Background() so a single caller's context cancellation no longer aborts the shared fetch for other waiters; each caller still honors its own ctx.Done(). forceRefresh now takes the routing map pointer the caller observed. When the entry already holds a fresher map than the caller's view, the refresh is suppressed and the fresher map is returned immediately (pointer-identity dedup of stale-view triggers). Make change-feed pagination resilient to mid-loop transient failures: each page is retried up to changeFeedPageMaxAttempts times with linear backoff on 5xx / 408 / 429 / network errors, preserving the pages already accumulated. Non-transient errors (other 4xx, ctx errors) fail fast. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: add PR links to CHANGELOG entries Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: point CHANGELOG entries at upstream PR Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: address PR + deep-review feedback Cache concurrency: - Bound the detached refresh with a 60s timeout so a wedged transport cannot wedge the cache slot indefinitely. - Move close(op.done) inside the entry.mu critical section so a caller racing the completion edge cannot observe inFlight=nil before the result is published, eliminating the redundant-refresh window. - Skip starting a new background refresh when the caller's ctx is already canceled and no op is in flight (returns ctx.Err() early without spawning). - Drop the unused `previous` parameter from forceRefresh and its pointer-identity dedup; it was only ever called with nil in production. Stale-view dedup will be added in a follow-up that wires the routing-map pointer through every 410-retry call site. invalidate semantics: - Add an entry.generation counter; runRefresh checks the generation at completion and discards its result if invalidate() bumped it in the meantime. Prevents a refresh that pre-dates an invalidate from silently overwriting the invalidation. Transient predicate: - Tighten isTransientPKRangeFetchError: only retry actual transport- level errors (net.Error, io.ErrUnexpectedEOF, syscall.ECONNRESET, syscall.ETIMEDOUT) and 408. Drop 5xx and 429 from this layer — the pipeline retry policy already handles them with Retry-After honoring, so retrying here was double-retry. Drop the blanket "non-ResponseError ⇒ transient" branch so JSON/body-read errors fail fast. - Add jitter to the linear backoff so a fleet doesn't synchronize retries across containers. Observability: - Log refresh start/finish with container link, elapsed, and error so detached HTTP spans can be correlated to the triggering caller. Tests: - Update invalidate-during-refresh test to assert the new result-discarded semantics. - Replace 503 with 408 in mid-pagination tests (5xx no longer retried by this layer). - Add tests for canceled-caller-no-background-fetch on both getRoutingMap and forceRefresh, and for second-wave-after-completion not triggering redundant network requests. - Drop the stale-view-suppression test that tested the removed pointer-identity dedup. CHANGELOG: - Move entries from "Features Added" to "Bugs Fixed" (these are reliability fixes, not API additions). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: address re-review feedback - Remove duplicated/stale Concurrency-model doc block on partitionKeyRangeCache that still described the removed previous parameter. - Drop the no-op callerCtx parameter from ensureInFlightLocked / runRefresh / newDetachedRefreshContext along with the optimistic tracing-span propagation comments. - Fix the invalidate-during-refresh asymmetry: on generation mismatch op.rm is now nil and op.err is errPKRangeCacheInvalidatedDuringRefresh; awaiters re-enter and start a fresh post-invalidate refresh instead of routing against the stale view that motivated the invalidate. - Wrap context.DeadlineExceeded from the detached refresh timeout with errPKRangeCacheRefreshTimeout so upstream errors.Is checks for the caller's own deadline don't misclassify it. - Make pkRangeBackgroundRefreshTimeout a var so tests can override it. - Reduce per-refresh log volume from two lines (start + finish) to one (finish), embedding elapsed and invalidated. - Tests: update invalidate-during-refresh to assert the new sentinel semantics; add a new test that the cache's own refresh timeout is wrapped while preserving DeadlineExceeded as the cause. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: retry CI (transient go1.26.1 download flake) * Broaden isTransientPKRangeFetchError to retry 429, 5xx, and 408 The per-page retry at the cache layer preserves accumulated change-feed pages rather than restarting the drain. Retrying these statuses here is intentional even if the pipeline already retried -- aligns with how .NET and Java SDKs handle metadata request retries. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address kushagraThapar follow-ups: retry budget, forceRefresh isolation, sentinel handling - Bump changeFeedPageMaxAttempts 3 -> 6 and honor Retry-After on 429. Brings the per-page retry budget closer to .NET's MetadataRequestThrottleRetryPolicy / ResourceThrottleRetryPolicy stack (which retries 429s up to 9x with exponential backoff). Matters most on cold-start where there is no warm routing map to fall back on. (#1) - forceRefresh now displaces an in-flight non-forced op (spawned by an earlier getRoutingMap) so its result never escapes to a caller that explicitly asked for fresh data. The displaced goroutine's install is rejected by the generation check. Concurrent forceRefresh callers still share a single forced fetch. (#3) - errPKRangeCacheInvalidatedDuringRefresh is now handled internally by getRoutingMap / forceRefresh in a bounded retry loop (3 attempts). Callers never see the sentinel, so consumers do not have to know about this implementation detail when invalidate() starts being wired up to 410/PartitionKeyRangeGone detection. (Azure#4) - runRefresh only clears entry.inFlight if the slot still points to its own op, so a displaced goroutine's completion cannot clobber a successor op installed by forceRefresh. Tests: invalidateDuringRefresh_discardsResult updated to assert the caller observes the post-invalidate fresh result instead of the swallowed sentinel. TestMain shrinks per-page retry delays so the bumped attempt cap does not bloat test runtime. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Remove detached refresh timeout; retry transient page failures indefinitely Per design discussion: the refresh is background-only, so a wedged transport cannot impact callers that already have a cached map. Cold start callers can bail via their own ctx, and forceRefresh self-heals by displacing the in-flight op. Trade-off: bounded goroutine leak proportional to forceRefresh-during-wedge calls. - runRefresh now uses context.Background() unconditionally (no deadline) - Drop pkRangeBackgroundRefreshTimeout, errPKRangeCacheRefreshTimeout, and newDetachedRefreshContext entirely - fetchOneChangeFeedPageWithRetry loops on transient errors with no attempt cap; returns only on success, non-transient error, or ctx cancellation - Drop Retry-After honoring (azcore pipeline already handles it; retrying here would just compound the wait) - Simpler backoff: capped linear + jitter Tests: replace exhausted-retries test with infinite-retry-eventually- succeeds test (5 transient 408s followed by success). Remove refreshTimeout_isWrapped (no timeout to wrap anymore). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Revert forceRefresh displacement; concurrent callers share in-flight refresh The displacement logic added complexity (forced flag, slot-ownership guard, wasted goroutine on every displacement) to enforce a strict 'fetch started after my call' horizon that practical callers don't need. The only production caller (refreshPKRangeCache on 410 retries) is happy with eventual consistency: the in-flight /pkranges drain is already pulling fresh change-feed pages, and the next 410 retriggers another refresh if needed. Callers that need strict post-call freshness must call invalidate() first — the generation counter then guarantees the in-flight op's result is discarded and a fresh post-invalidate fetch runs. The internal sentinel retry handles the join+invalidate dance so callers never see errPKRangeCacheInvalidatedDuringRefresh. - refreshOp.forced field removed - forceRefresh joins in-flight op (same as getRoutingMap) - ensureInFlightLocked no longer takes a forced parameter - runRefresh always clears entry.inFlight (no displacement to defend against) - Drop the forceRefresh CHANGELOG entry; the join behavior is now the documented contract on forceRefresh Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Simplify PK range cache retry: 429-only, delegate others to pipeline Only 429 throttling is retried at the per-page layer (indefinitely with capped linear backoff + jitter) so accumulated change-feed pages aren't discarded when the service explicitly asks the client to slow down. All other errors — 5xx, 408, transport — are surfaced immediately and delegated to the azcore pipeline's retry policy, avoiding double-retry and keeping responsibility for those classes in one place. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Condense PK range cache comments Trim verbose multi-line comments added across this PR while preserving the key 'why' notes (concurrency model, generation check, 429-only retry rationale). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address deep-reviewer findings: panic recovery, invalidate cancel, sentinel wrap Major changes: 1. invalidate() now also cancels the in-flight refresh via a context.CancelFunc stored on refreshOp. Combined with the generation bump, this lets invalidate() actually unwedge a stalled refresh (e.g. sustained 429), not just discard its eventual result. 2. runRefresh wraps its body in a deferred cleanup that recovers from panics and ALWAYS clears entry.inFlight + closes op.done. A panic in fetch/decode/merge can no longer leak the in-flight slot. A new errPKRangeCacheRefreshPanic surfaces the panic value to the awaiter. 3. The internal sentinel errPKRangeCacheInvalidatedDuringRefresh no longer leaks to callers: when the retry loop in getRoutingMap/forceRefresh exhausts (3 attempts), it returns a new public-ish error errPKRangeCacheInvalidatedRepeatedly instead of the unexported sentinel. 4. The 410/PartitionKeyRangeGone caller in refreshPKRangeCache now invalidates the pkrange cache before calling forceRefresh, so the join semantics of forceRefresh produce a fresh post-invalidate snapshot rather than whatever pre-trigger op may already be running. 5. refreshEntryDetached gets a 'safe to call ONLY from runRefresh' warning docstring since its safety depends on the generation check upstream. 6. fetchAllChangeFeedPages comment documents that page-preservation across 429s is intentional cross-SDK divergence (.NET/Java/Python/Rust restart from page 1 on transient failures). 7. The retry-backoff multiplier is pre-capped to avoid integer overflow if a test or future contributor shrinks the base delay to nanoseconds. 8. gatePolicy.started buffer grown from 16 to 64 (named constant with a comment about silent drops past the bound). 9. TestMain comment about global mutation rewritten for accuracy. 10. Three new tests: - panicInRefresh_recoversAndAllowsRetry (Finding #2) - invalidateDuringThrottleStorm_unwedges (Finding #3) - forceRefreshAfterInvalidate_doesNotJoinPreInvalidateOp (Finding Azure#5) 11. forceRefresh docstring restructured with a prominent warning that callers needing strict post-call freshness MUST call invalidate() first. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Remove accidentally-committed PR diff dump Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Drop CHANGELOG entry for invalidate-cancel/panic-recovery internals Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * gofmt: re-indent appended test fixtures Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… fallback to customer endpoint (Azure#25468) (Azure#26815) * Cosmos: fix excess GetDatabaseAccount calls and default-endpoint fallback (Azure#25468) Fixes issue Azure#25468 (excess GetDatabaseAccount HTTP calls with preferred regions) plus the related default-endpoint fallback in data-plane routing. Root causes addressed: * GEM Update only advanced lastUpdateTime on success, so any failed refresh caused every subsequent caller to re-issue the HTTP call immediately (no 5-min throttle on failures). * globalEndpointManagerPolicy.Do spawned a goroutine per request whenever ShouldRefresh() was true, with no in-flight coalescing. * attemptRetryOnEndpointFailure passed isWriteOperation as forceRefresh, so every retried write 403 force-refreshed GEM regardless of the refresh interval. * A failed initial sync.Once bootstrap latched done=true, pinning the client into the async-refresh herd path forever. * locationCache.readEndpoints/writeEndpoints could self-deadlock by acquiring RLock and then calling update() which takes Lock(). * getPrefAvailableEndpointsLocked appended the customer-supplied default endpoint as a trailing fallback in every route list, so retry traversal eventually issued data-plane requests there. * resolveServiceEndpoint fell back to the default endpoint whenever enableCrossRegionRetries was false even with non-empty availWriteLocations. Changes: * globalEndpointManager: add lastAttemptTime, per-flight singleflight, invalidationGen, lastUpdateErr, populated(), invalidate(); MarkEndpoint Unavailable* now invalidates once per newly-unavailable endpoint. * globalEndpointManagerPolicy: drop sync.Once; bootstrap is gated on populated() and coalesced via gem.Update's singleflight. * clientRetryPolicy: write retries pass forceRefresh=false. * locationCache: split Locked variants; readEndpoints/writeEndpoints drop the RLock before calling update; update() chooses regional write/read fallbacks; resolveServiceEndpoint always picks a regional endpoint when availWriteLocations is non-empty (cross-region-retries flag now only gates retry walking); getPrefAvailableEndpointsLocked no longer appends a trailing fallback. * Only remaining data-plane -> default-endpoint route is the degenerate 'zero write regions on a write request' case. Tests: * cosmos_dbaccount_refresh_test.go (new): 12 regression tests covering throttle-on-failure, single-flight coalescing, write-retry single refresh, in-flight invalidation, cached bootstrap error, deadlock-free readEndpoints, default-endpoint elimination across resource types and configurations, plus 500-concurrency soak tests on healthy and failing GEMs. * cosmos_location_cache_test.go: update 4 expectations to drop trailing default-endpoint entries from route lists. All tests pass with -race. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: address deep-review findings (panic-safety, atomic mark, test hardening) Follow-up to PR Azure#26815 / issue Azure#25468 addressing findings from the deep-review pass: 1. Panic-safe gem.Update: wrap refreshOnce in a deferred cleanup so a panic in the HTTP pipeline (or any code refreshOnce transitively calls) cannot leak gem.inflight / leave flight.done unclosed -- which would have permanently wedged every subsequent Update caller. The panic is re-raised after cleanup. 2. Atomic check-and-mark in locationCache.markEndpointUnavailable: the helper now returns wasAlreadyUnavailable from inside the same mapMutex critical section that performs the mark. This eliminates the check-then-act race between MarkEndpointUnavailableFor* and isEndpointUnavailable that previously let multiple concurrent markers each call invalidate(). Tightens TestFix3c bound from <=3 to a provable <=2. 3. TestFix4 hardening: assert require.Error on each follow-up request, so the test cannot pass if a regression made populated() return true and silently skip the bootstrap. All tests pass with -race; gofmt and vet clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: address PR Deep Reviewer findings (data-plane stall, waiter ctx, async panic) Follow-up to PR Azure#26815 / issue Azure#25468 addressing findings from the PR Deep Reviewer pass. BLOCKING finding addressed: * invalidate() zeroed lastUpdateTime, which made populated() return false. If the post-invalidate refresh then failed, every subsequent request was blocked for refreshTimeInterval (5 min default) by populated()==false + cached error -- even though the locationCache still held a valid regional topology. This was a regression introduced by this PR. Fix: add everPopulated bool, set true on first successful refresh and never reset. populated() reads everPopulated instead of the timestamp. lastUpdateErr is now only surfaced to throttled callers the cached topology while a refresh attempt is throttled after a failure. New regression test TestFix7_InvalidateThenRefreshFailure DoesNotStallDataPlane locks in the behaviour. Recommendations addressed: * gem.Update waiters now select between flight completion and ctx.Done(), so a caller-side timeout cannot be exceeded by an unrelated stuck refresh on a different goroutine. New regression test TestFix8_UpdateWaiterRespectsContextCancellation. * Async refresh goroutine in globalEndpointManagerPolicy.Do now recovers panics, so a panic re-raised by gem.Update's panic-safe defer (added in the previous commit) cannot bring down the host process via the detached goroutine. Test: * Tightened TestFix3c bound from <=2 to ==1 -- the combination of the atomic check-and-mark and the in-flight singleflight makes the upper bound provably 1 HTTP call for the tested fixture. All 14 fix tests pass with -race; gofmt and vet clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: Go best-practices polish from second deep-review pass Idiomatic improvements identified by the PR Deep Reviewer's Go-craft pass. No functional behaviour changes. cosmos_global_endpoint_manager.go: * everPopulated is now atomic.Bool, so the policy's hot-path populated() check is lock-free instead of contending on gemMutex against Update's critical section. * Hoisted the panic-safe defer out of an IIFE in Update -- a plain defer at function scope has identical semantics with less ceremony. * Wrap GetAccountProperties / locationCache.update errors with %w so callers can errors.As / errors.Is the underlying response error when the cached error is surfaced via lastUpdateErr. * Added hasInflight() test helper so tests don't peek at gemMutex/ inflight directly; keeps the test surface stable across future refactors of the internal sync primitives. * One-line comment documenting that lastUpdateErr is intentionally shared across force=true and force=false callers. cosmos_global_endpoint_manager_policy.go: * Gate the async refresh goroutine spawn with an atomic.Bool CAS. Without this, every request that observed ShouldRefresh()==true during a burst spawned its own goroutine that then queued as a waiter inside gem.Update. Now N→1 per refresh-window. * Recovered panic is logged via the azcore log facility instead of being silently dropped, so production crashes remain triageable. Captures the stack trace too. cosmos_dbaccount_refresh_test.go: * TestFix1b: replaced racy 30ms time.Sleep with require.Eventually polling the new hasInflight() helper. * TestFix8: replaced manual mutex poll loop with require.Eventually + hasInflight(). * TestFix2: include goroutine index in the failure message so a sporadic single-goroutine failure is easier to debug. * Gated the 500-goroutine soak tests and the 2-second TestFix8 behind testing.Short() so the dev-loop cost is opt-in. All tests pass with -race; -short also passes (skips three slow tests as designed). gofmt and vet clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: revert cross-region/forceRefresh behavior, address deep-review #1,#2,Azure#4,Azure#13, comment cleanup Behavioral reverts (preserve long-standing semantics): - cosmos_client_retry_policy.go: write-retry path uses gem.Update(ctx, true) to force-refresh per attempt (intentional; restores prior behavior). - cosmos_location_cache.go: resolveServiceEndpoint with enableCrossRegionRetries=false returns lc.defaultEndpoint for single-master writes (avoids a breaking change). Fixes addressing prior deep-review findings: - cosmos_global_endpoint_manager_policy.go: * Log async refresh failures (was discarded via '_ ='); chronic post-bootstrap topology drift is now observable. * Capture refreshCtx via context.WithoutCancel BEFORE launching the goroutine so it does not depend on req's lifetime. - cosmos_global_endpoint_manager.go: leader's HTTP call runs under context.WithoutCancel(ctx); waiters still respect their own ctx so caller-side cancellation no longer poisons coalesced waiters. - cosmos_location_cache.go markEndpointUnavailable: fast-path skips the full updateLocked recompute when wasAlreadyUnavailable==true (route lists cannot change in that case). Tests and docs: - TestFix3 renamed/inverted to TestFix3_WriteRetryForceRefreshesGEM. - TestDefaultEndpointElim_CrossRegionRetriesDisabled* renamed/inverted to assert the restored default-endpoint behavior. - TestRegression25468_* renamed to TestRegression_*. - Comments stripped of Azure#25468 references and tightened. - CHANGELOG updated to link to PR 26815 and reflect the reverted cross-region-retries=false behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: deflake TestFix3c on Windows go1.26.1 Under some schedulers (observed on windows_go_1261 CI), every concurrent gem.Update can complete observing the still-fresh lastUpdateTime before any of the concurrent MarkEndpointUnavailableForWrite goroutines fires invalidate(), leaving the post-invalidation refresh un-triggered and the HTTP call count at 0 instead of the expected 1. Add an explicit gem.Update after wg.Wait() to deterministically trigger the post-invalidation refresh. The singleflight + invalidationGen bound still guarantees the burst-plus-trailing-Update collapses to exactly one HTTP round-trip, so the assertion (transport.count == 1) stays tight. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: address PR Copilot review comments - getPrefAvailableEndpointsLocked now takes prefLocations as a parameter rather than reading lc.locationInfo.prefLocations. updateLocked passes nextLoc.prefLocations so the route lists are computed from a single consistent in-progress snapshot, not the still-committed lc.locationInfo. - countingTransport.Do now sets Response.Status (e.g. '200 OK') and ContentLength so simulated responses match net/http behavior and SDK helpers like NewResponseErrorWithErrorCode produce useful error text. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * azcosmos: consolidate CHANGELOG entry to one line Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Tomas Varon <tvaron@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot added the Cosmos label May 20, 2026

azcosmos: add PR links to CHANGELOG entries

1aa5d57

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

tvaron3 closed this May 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

azcosmos: single-pending-I/O pkrange cache and mid-pagination retry#1

azcosmos: single-pending-I/O pkrange cache and mid-pagination retry#1
tvaron3 wants to merge 2 commits into
mainfrom
tvaron3/background-pkrange-cache

tvaron3 commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tvaron3 commented May 20, 2026

Cache contract

Mid-pagination retry

Out of scope

Notes

Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant