Skip to content

azcosmos: single-pending-I/O pkrange cache and mid-pagination retry#1

Closed
tvaron3 wants to merge 2 commits into
mainfrom
tvaron3/background-pkrange-cache
Closed

azcosmos: single-pending-I/O pkrange cache and mid-pagination retry#1
tvaron3 wants to merge 2 commits into
mainfrom
tvaron3/background-pkrange-cache

Conversation

@tvaron3
Copy link
Copy Markdown
Owner

@tvaron3 tvaron3 commented May 20, 2026

Today partitionKeyRangeCache.getRoutingMap / forceRefresh hold the per-entry mutex across the entire change-feed pagination. Two consequences hurt under load: a caller that already has a usable cached routing map can block behind an unrelated refresh, and concurrent 410-triggered forced refreshes serialize instead of sharing one network fetch. Also, any HTTP failure mid-pagination discards all already-accumulated pages and forces the next refresh to start from page 1.

This PR adopts a single-pending-I/O model for the cache and makes the change-feed loop resilient to transient mid-pagination errors.

Cache contract

  • At most one refresh per container at any time. Concurrent first-fetch callers, and concurrent forced-refresh callers, share one in-flight op.
  • While a refresh is in flight, callers that already see a usable routing map return immediately. Only first-fetch and forced-refresh paths await.
  • The refresh goroutine runs on context.Background(), so a single caller's ctx cancellation does not abort the shared fetch. Each waiter still honors its own ctx.Done() via select.
  • forceRefresh takes a previous *collectionRoutingMap pointer. If the entry already holds a fresher map than the caller's view, the call is short-circuited and the fresher map is returned without a new fetch (pointer-identity dedup of stale-view triggers).
  • invalidate no longer races refreshes: an in-flight result simply becomes the next observed map.

Mid-pagination retry

fetchAllChangeFeedPages now delegates each page to fetchOneChangeFeedPageWithRetry, which retries changeFeedPageMaxAttempts = 3 times with linear backoff (100 ms, 200 ms) on transient errors:

  • Retried: 5xx, 408, 429, and any non-HTTP/network error.
  • Failed fast: other 4xx and context.Canceled / DeadlineExceeded.

The accumulator and last-good ETag survive the retry, so a single bad page no longer discards the pages already collected.

Out of scope

ReadMany retry behaviour is unchanged (still full-restart on 410). The query engine path and GetFeedRanges are untouched.

Notes

  • No new public API surface.
  • New tests cover: cached-during-refresh fast path, concurrent shared fetch, stale-view dedup, caller-cancel-doesn't-abort-shared-fetch, invalidate-during-refresh, error-clears-inflight-slot, transient-retry-succeeds, exhausted-retries-fails, non-transient-fails-fast.
  • go test -count=1 -short -race ./sdk/data/azcosmos/... passes.
  • CHANGELOG updated under 1.5.0-beta.7 (Unreleased).

Checklist

  • The purpose of this PR is explained in this or a referenced issue.
  • The PR does not update generated files.
  • Tests are included and/or updated for code changes.
  • Updates to module CHANGELOG.md are included.
  • MIT license headers are included in each file.

Rewrite partitionKeyRangeCache so concurrent callers share a single
in-flight refresh per container, and the cached routing map remains
readable while a refresh is in flight. The refresh runs on a detached
context.Background() so a single caller's context cancellation no
longer aborts the shared fetch for other waiters; each caller still
honors its own ctx.Done().

forceRefresh now takes the routing map pointer the caller observed.
When the entry already holds a fresher map than the caller's view, the
refresh is suppressed and the fresher map is returned immediately
(pointer-identity dedup of stale-view triggers).

Make change-feed pagination resilient to mid-loop transient failures:
each page is retried up to changeFeedPageMaxAttempts times with linear
backoff on 5xx / 408 / 429 / network errors, preserving the pages
already accumulated. Non-transient errors (other 4xx, ctx errors) fail
fast.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3 tvaron3 closed this May 20, 2026
tvaron3 added a commit that referenced this pull request May 20, 2026
…on, sentinel handling

- Bump changeFeedPageMaxAttempts 3 -> 6 and honor Retry-After on 429.
  Brings the per-page retry budget closer to .NET's
  MetadataRequestThrottleRetryPolicy / ResourceThrottleRetryPolicy stack
  (which retries 429s up to 9x with exponential backoff). Matters most on
  cold-start where there is no warm routing map to fall back on. (#1)

- forceRefresh now displaces an in-flight non-forced op (spawned by an
  earlier getRoutingMap) so its result never escapes to a caller that
  explicitly asked for fresh data. The displaced goroutine's install is
  rejected by the generation check. Concurrent forceRefresh callers
  still share a single forced fetch. (#3)

- errPKRangeCacheInvalidatedDuringRefresh is now handled internally by
  getRoutingMap / forceRefresh in a bounded retry loop (3 attempts).
  Callers never see the sentinel, so consumers do not have to know about
  this implementation detail when invalidate() starts being wired up to
  410/PartitionKeyRangeGone detection. (Azure#4)

- runRefresh only clears entry.inFlight if the slot still points to its
  own op, so a displaced goroutine's completion cannot clobber a
  successor op installed by forceRefresh.

Tests: invalidateDuringRefresh_discardsResult updated to assert the
caller observes the post-invalidate fresh result instead of the swallowed
sentinel. TestMain shrinks per-page retry delays so the bumped attempt
cap does not bloat test runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 added a commit that referenced this pull request May 22, 2026
…iew #1,#2,Azure#4,Azure#13, comment cleanup

Behavioral reverts (preserve long-standing semantics):
- cosmos_client_retry_policy.go: write-retry path uses gem.Update(ctx, true)
  to force-refresh per attempt (intentional; restores prior behavior).
- cosmos_location_cache.go: resolveServiceEndpoint with
  enableCrossRegionRetries=false returns lc.defaultEndpoint for
  single-master writes (avoids a breaking change).

Fixes addressing prior deep-review findings:
- cosmos_global_endpoint_manager_policy.go:
  * Log async refresh failures (was discarded via '_ ='); chronic
    post-bootstrap topology drift is now observable.
  * Capture refreshCtx via context.WithoutCancel BEFORE launching the
    goroutine so it does not depend on req's lifetime.
- cosmos_global_endpoint_manager.go: leader's HTTP call runs under
  context.WithoutCancel(ctx); waiters still respect their own ctx so
  caller-side cancellation no longer poisons coalesced waiters.
- cosmos_location_cache.go markEndpointUnavailable: fast-path skips the
  full updateLocked recompute when wasAlreadyUnavailable==true (route
  lists cannot change in that case).

Tests and docs:
- TestFix3 renamed/inverted to TestFix3_WriteRetryForceRefreshesGEM.
- TestDefaultEndpointElim_CrossRegionRetriesDisabled* renamed/inverted
  to assert the restored default-endpoint behavior.
- TestRegression25468_* renamed to TestRegression_*.
- Comments stripped of Azure#25468 references and tightened.
- CHANGELOG updated to link to PR 26815 and reflect the reverted
  cross-region-retries=false behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 added a commit that referenced this pull request May 22, 2026
…zure#26855)

* azcosmos: single-pending-I/O pkrange cache and mid-pagination retry

Rewrite partitionKeyRangeCache so concurrent callers share a single
in-flight refresh per container, and the cached routing map remains
readable while a refresh is in flight. The refresh runs on a detached
context.Background() so a single caller's context cancellation no
longer aborts the shared fetch for other waiters; each caller still
honors its own ctx.Done().

forceRefresh now takes the routing map pointer the caller observed.
When the entry already holds a fresher map than the caller's view, the
refresh is suppressed and the fresher map is returned immediately
(pointer-identity dedup of stale-view triggers).

Make change-feed pagination resilient to mid-loop transient failures:
each page is retried up to changeFeedPageMaxAttempts times with linear
backoff on 5xx / 408 / 429 / network errors, preserving the pages
already accumulated. Non-transient errors (other 4xx, ctx errors) fail
fast.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: add PR links to CHANGELOG entries

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: point CHANGELOG entries at upstream PR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: address PR + deep-review feedback

Cache concurrency:
- Bound the detached refresh with a 60s timeout so a wedged transport
  cannot wedge the cache slot indefinitely.
- Move close(op.done) inside the entry.mu critical section so a caller
  racing the completion edge cannot observe inFlight=nil before the
  result is published, eliminating the redundant-refresh window.
- Skip starting a new background refresh when the caller's ctx is
  already canceled and no op is in flight (returns ctx.Err() early
  without spawning).
- Drop the unused `previous` parameter from forceRefresh and its
  pointer-identity dedup; it was only ever called with nil in
  production. Stale-view dedup will be added in a follow-up that wires
  the routing-map pointer through every 410-retry call site.

invalidate semantics:
- Add an entry.generation counter; runRefresh checks the generation at
  completion and discards its result if invalidate() bumped it in the
  meantime. Prevents a refresh that pre-dates an invalidate from
  silently overwriting the invalidation.

Transient predicate:
- Tighten isTransientPKRangeFetchError: only retry actual transport-
  level errors (net.Error, io.ErrUnexpectedEOF, syscall.ECONNRESET,
  syscall.ETIMEDOUT) and 408. Drop 5xx and 429 from this layer — the
  pipeline retry policy already handles them with Retry-After honoring,
  so retrying here was double-retry. Drop the blanket
  "non-ResponseError ⇒ transient" branch so JSON/body-read errors fail
  fast.
- Add jitter to the linear backoff so a fleet doesn't synchronize
  retries across containers.

Observability:
- Log refresh start/finish with container link, elapsed, and error so
  detached HTTP spans can be correlated to the triggering caller.

Tests:
- Update invalidate-during-refresh test to assert the new
  result-discarded semantics.
- Replace 503 with 408 in mid-pagination tests (5xx no longer retried
  by this layer).
- Add tests for canceled-caller-no-background-fetch on both
  getRoutingMap and forceRefresh, and for second-wave-after-completion
  not triggering redundant network requests.
- Drop the stale-view-suppression test that tested the removed
  pointer-identity dedup.

CHANGELOG:
- Move entries from "Features Added" to "Bugs Fixed" (these are
  reliability fixes, not API additions).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: address re-review feedback

- Remove duplicated/stale Concurrency-model doc block on
  partitionKeyRangeCache that still described the removed previous
  parameter.
- Drop the no-op callerCtx parameter from ensureInFlightLocked /
  runRefresh / newDetachedRefreshContext along with the optimistic
  tracing-span propagation comments.
- Fix the invalidate-during-refresh asymmetry: on generation mismatch
  op.rm is now nil and op.err is errPKRangeCacheInvalidatedDuringRefresh;
  awaiters re-enter and start a fresh post-invalidate refresh instead
  of routing against the stale view that motivated the invalidate.
- Wrap context.DeadlineExceeded from the detached refresh timeout with
  errPKRangeCacheRefreshTimeout so upstream errors.Is checks for the
  caller's own deadline don't misclassify it.
- Make pkRangeBackgroundRefreshTimeout a var so tests can override it.
- Reduce per-refresh log volume from two lines (start + finish) to one
  (finish), embedding elapsed and invalidated.
- Tests: update invalidate-during-refresh to assert the new sentinel
  semantics; add a new test that the cache's own refresh timeout is
  wrapped while preserving DeadlineExceeded as the cause.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: retry CI (transient go1.26.1 download flake)

* Broaden isTransientPKRangeFetchError to retry 429, 5xx, and 408

The per-page retry at the cache layer preserves accumulated change-feed
pages rather than restarting the drain. Retrying these statuses here is
intentional even if the pipeline already retried -- aligns with how .NET
and Java SDKs handle metadata request retries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address kushagraThapar follow-ups: retry budget, forceRefresh isolation, sentinel handling

- Bump changeFeedPageMaxAttempts 3 -> 6 and honor Retry-After on 429.
  Brings the per-page retry budget closer to .NET's
  MetadataRequestThrottleRetryPolicy / ResourceThrottleRetryPolicy stack
  (which retries 429s up to 9x with exponential backoff). Matters most on
  cold-start where there is no warm routing map to fall back on. (#1)

- forceRefresh now displaces an in-flight non-forced op (spawned by an
  earlier getRoutingMap) so its result never escapes to a caller that
  explicitly asked for fresh data. The displaced goroutine's install is
  rejected by the generation check. Concurrent forceRefresh callers
  still share a single forced fetch. (#3)

- errPKRangeCacheInvalidatedDuringRefresh is now handled internally by
  getRoutingMap / forceRefresh in a bounded retry loop (3 attempts).
  Callers never see the sentinel, so consumers do not have to know about
  this implementation detail when invalidate() starts being wired up to
  410/PartitionKeyRangeGone detection. (Azure#4)

- runRefresh only clears entry.inFlight if the slot still points to its
  own op, so a displaced goroutine's completion cannot clobber a
  successor op installed by forceRefresh.

Tests: invalidateDuringRefresh_discardsResult updated to assert the
caller observes the post-invalidate fresh result instead of the swallowed
sentinel. TestMain shrinks per-page retry delays so the bumped attempt
cap does not bloat test runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Remove detached refresh timeout; retry transient page failures indefinitely

Per design discussion: the refresh is background-only, so a wedged
transport cannot impact callers that already have a cached map. Cold
start callers can bail via their own ctx, and forceRefresh self-heals
by displacing the in-flight op. Trade-off: bounded goroutine leak
proportional to forceRefresh-during-wedge calls.

- runRefresh now uses context.Background() unconditionally (no deadline)
- Drop pkRangeBackgroundRefreshTimeout, errPKRangeCacheRefreshTimeout,
  and newDetachedRefreshContext entirely
- fetchOneChangeFeedPageWithRetry loops on transient errors with no
  attempt cap; returns only on success, non-transient error, or ctx
  cancellation
- Drop Retry-After honoring (azcore pipeline already handles it; retrying
  here would just compound the wait)
- Simpler backoff: capped linear + jitter

Tests: replace exhausted-retries test with infinite-retry-eventually-
succeeds test (5 transient 408s followed by success). Remove
refreshTimeout_isWrapped (no timeout to wrap anymore).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert forceRefresh displacement; concurrent callers share in-flight refresh

The displacement logic added complexity (forced flag, slot-ownership
guard, wasted goroutine on every displacement) to enforce a strict
'fetch started after my call' horizon that practical callers don't
need. The only production caller (refreshPKRangeCache on 410 retries)
is happy with eventual consistency: the in-flight /pkranges drain is
already pulling fresh change-feed pages, and the next 410 retriggers
another refresh if needed.

Callers that need strict post-call freshness must call invalidate()
first — the generation counter then guarantees the in-flight op's
result is discarded and a fresh post-invalidate fetch runs. The
internal sentinel retry handles the join+invalidate dance so callers
never see errPKRangeCacheInvalidatedDuringRefresh.

- refreshOp.forced field removed
- forceRefresh joins in-flight op (same as getRoutingMap)
- ensureInFlightLocked no longer takes a forced parameter
- runRefresh always clears entry.inFlight (no displacement to defend against)
- Drop the forceRefresh CHANGELOG entry; the join behavior is now the documented contract on forceRefresh

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Simplify PK range cache retry: 429-only, delegate others to pipeline

Only 429 throttling is retried at the per-page layer (indefinitely with
capped linear backoff + jitter) so accumulated change-feed pages aren't
discarded when the service explicitly asks the client to slow down.
All other errors — 5xx, 408, transport — are surfaced immediately and
delegated to the azcore pipeline's retry policy, avoiding double-retry
and keeping responsibility for those classes in one place.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Condense PK range cache comments

Trim verbose multi-line comments added across this PR while preserving
the key 'why' notes (concurrency model, generation check, 429-only
retry rationale).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address deep-reviewer findings: panic recovery, invalidate cancel, sentinel wrap

Major changes:

1. invalidate() now also cancels the in-flight refresh via a context.CancelFunc
   stored on refreshOp. Combined with the generation bump, this lets
   invalidate() actually unwedge a stalled refresh (e.g. sustained 429),
   not just discard its eventual result.

2. runRefresh wraps its body in a deferred cleanup that recovers from panics
   and ALWAYS clears entry.inFlight + closes op.done. A panic in
   fetch/decode/merge can no longer leak the in-flight slot. A new
   errPKRangeCacheRefreshPanic surfaces the panic value to the awaiter.

3. The internal sentinel errPKRangeCacheInvalidatedDuringRefresh no longer
   leaks to callers: when the retry loop in getRoutingMap/forceRefresh
   exhausts (3 attempts), it returns a new public-ish error
   errPKRangeCacheInvalidatedRepeatedly instead of the unexported sentinel.

4. The 410/PartitionKeyRangeGone caller in refreshPKRangeCache now invalidates
   the pkrange cache before calling forceRefresh, so the join semantics of
   forceRefresh produce a fresh post-invalidate snapshot rather than
   whatever pre-trigger op may already be running.

5. refreshEntryDetached gets a 'safe to call ONLY from runRefresh' warning
   docstring since its safety depends on the generation check upstream.

6. fetchAllChangeFeedPages comment documents that page-preservation across
   429s is intentional cross-SDK divergence (.NET/Java/Python/Rust restart
   from page 1 on transient failures).

7. The retry-backoff multiplier is pre-capped to avoid integer overflow if a
   test or future contributor shrinks the base delay to nanoseconds.

8. gatePolicy.started buffer grown from 16 to 64 (named constant with a
   comment about silent drops past the bound).

9. TestMain comment about global mutation rewritten for accuracy.

10. Three new tests:
    - panicInRefresh_recoversAndAllowsRetry (Finding #2)
    - invalidateDuringThrottleStorm_unwedges (Finding #3)
    - forceRefreshAfterInvalidate_doesNotJoinPreInvalidateOp (Finding Azure#5)

11. forceRefresh docstring restructured with a prominent warning that callers
    needing strict post-call freshness MUST call invalidate() first.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Remove accidentally-committed PR diff dump

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Drop CHANGELOG entry for invalidate-cancel/panic-recovery internals

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* gofmt: re-indent appended test fixtures

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 added a commit that referenced this pull request May 22, 2026
… fallback to customer endpoint (Azure#25468) (Azure#26815)

* Cosmos: fix excess GetDatabaseAccount calls and default-endpoint fallback (Azure#25468)

Fixes issue Azure#25468 (excess GetDatabaseAccount HTTP calls with preferred
regions) plus the related default-endpoint fallback in data-plane routing.

Root causes addressed:
  * GEM Update only advanced lastUpdateTime on success, so any failed
    refresh caused every subsequent caller to re-issue the HTTP call
    immediately (no 5-min throttle on failures).
  * globalEndpointManagerPolicy.Do spawned a goroutine per request whenever
    ShouldRefresh() was true, with no in-flight coalescing.
  * attemptRetryOnEndpointFailure passed isWriteOperation as forceRefresh,
    so every retried write 403 force-refreshed GEM regardless of the
    refresh interval.
  * A failed initial sync.Once bootstrap latched done=true, pinning the
    client into the async-refresh herd path forever.
  * locationCache.readEndpoints/writeEndpoints could self-deadlock by
    acquiring RLock and then calling update() which takes Lock().
  * getPrefAvailableEndpointsLocked appended the customer-supplied default
    endpoint as a trailing fallback in every route list, so retry
    traversal eventually issued data-plane requests there.
  * resolveServiceEndpoint fell back to the default endpoint whenever
    enableCrossRegionRetries was false even with non-empty availWriteLocations.

Changes:
  * globalEndpointManager: add lastAttemptTime, per-flight singleflight,
    invalidationGen, lastUpdateErr, populated(), invalidate(); MarkEndpoint
    Unavailable* now invalidates once per newly-unavailable endpoint.
  * globalEndpointManagerPolicy: drop sync.Once; bootstrap is gated on
    populated() and coalesced via gem.Update's singleflight.
  * clientRetryPolicy: write retries pass forceRefresh=false.
  * locationCache: split Locked variants; readEndpoints/writeEndpoints
    drop the RLock before calling update; update() chooses regional
    write/read fallbacks; resolveServiceEndpoint always picks a regional
    endpoint when availWriteLocations is non-empty (cross-region-retries
    flag now only gates retry walking); getPrefAvailableEndpointsLocked
    no longer appends a trailing fallback.
  * Only remaining data-plane -> default-endpoint route is the degenerate
    'zero write regions on a write request' case.

Tests:
  * cosmos_dbaccount_refresh_test.go (new): 12 regression tests covering
    throttle-on-failure, single-flight coalescing, write-retry single
    refresh, in-flight invalidation, cached bootstrap error, deadlock-free
    readEndpoints, default-endpoint elimination across resource types
    and configurations, plus 500-concurrency soak tests on healthy and
    failing GEMs.
  * cosmos_location_cache_test.go: update 4 expectations to drop trailing
    default-endpoint entries from route lists.

All tests pass with -race.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: address deep-review findings (panic-safety, atomic mark, test hardening)

Follow-up to PR Azure#26815 / issue Azure#25468 addressing findings from the
deep-review pass:

1. Panic-safe gem.Update: wrap refreshOnce in a deferred cleanup so a
   panic in the HTTP pipeline (or any code refreshOnce transitively
   calls) cannot leak gem.inflight / leave flight.done unclosed --
   which would have permanently wedged every subsequent Update caller.
   The panic is re-raised after cleanup.

2. Atomic check-and-mark in locationCache.markEndpointUnavailable: the
   helper now returns wasAlreadyUnavailable from inside the same
   mapMutex critical section that performs the mark. This eliminates
   the check-then-act race between MarkEndpointUnavailableFor* and
   isEndpointUnavailable that previously let multiple concurrent
   markers each call invalidate(). Tightens TestFix3c bound from
   <=3 to a provable <=2.

3. TestFix4 hardening: assert require.Error on each follow-up request,
   so the test cannot pass if a regression made populated() return
   true and silently skip the bootstrap.

All tests pass with -race; gofmt and vet clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: address PR Deep Reviewer findings (data-plane stall, waiter ctx, async panic)

Follow-up to PR Azure#26815 / issue Azure#25468 addressing findings from the PR
Deep Reviewer pass.

BLOCKING finding addressed:
  * invalidate() zeroed lastUpdateTime, which made populated() return
    false. If the post-invalidate refresh then failed, every subsequent
    request was blocked for refreshTimeInterval (5 min default) by
    populated()==false + cached error -- even though the locationCache
    still held a valid regional topology. This was a regression
    introduced by this PR.

    Fix: add everPopulated bool, set true on first successful refresh
    and never reset. populated() reads everPopulated instead of the
    timestamp. lastUpdateErr is now only surfaced to throttled callers
    the cached topology while a refresh attempt is throttled after a
    failure. New regression test TestFix7_InvalidateThenRefreshFailure
    DoesNotStallDataPlane locks in the behaviour.

Recommendations addressed:
  * gem.Update waiters now select between flight completion and
    ctx.Done(), so a caller-side timeout cannot be exceeded by an
    unrelated stuck refresh on a different goroutine. New regression
    test TestFix8_UpdateWaiterRespectsContextCancellation.
  * Async refresh goroutine in globalEndpointManagerPolicy.Do now
    recovers panics, so a panic re-raised by gem.Update's panic-safe
    defer (added in the previous commit) cannot bring down the host
    process via the detached goroutine.

Test:
  * Tightened TestFix3c bound from <=2 to ==1 -- the combination of the
    atomic check-and-mark and the in-flight singleflight makes the
    upper bound provably 1 HTTP call for the tested fixture.

All 14 fix tests pass with -race; gofmt and vet clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: Go best-practices polish from second deep-review pass

Idiomatic improvements identified by the PR Deep Reviewer's Go-craft
pass. No functional behaviour changes.

cosmos_global_endpoint_manager.go:
  * everPopulated is now atomic.Bool, so the policy's hot-path
    populated() check is lock-free instead of contending on gemMutex
    against Update's critical section.
  * Hoisted the panic-safe defer out of an IIFE in Update -- a plain
    defer at function scope has identical semantics with less ceremony.
  * Wrap GetAccountProperties / locationCache.update errors with %w
    so callers can errors.As / errors.Is the underlying response error
    when the cached error is surfaced via lastUpdateErr.
  * Added hasInflight() test helper so tests don't peek at gemMutex/
    inflight directly; keeps the test surface stable across future
    refactors of the internal sync primitives.
  * One-line comment documenting that lastUpdateErr is intentionally
    shared across force=true and force=false callers.

cosmos_global_endpoint_manager_policy.go:
  * Gate the async refresh goroutine spawn with an atomic.Bool CAS.
    Without this, every request that observed ShouldRefresh()==true
    during a burst spawned its own goroutine that then queued as a
    waiter inside gem.Update. Now N→1 per refresh-window.
  * Recovered panic is logged via the azcore log facility instead of
    being silently dropped, so production crashes remain triageable.
    Captures the stack trace too.

cosmos_dbaccount_refresh_test.go:
  * TestFix1b: replaced racy 30ms time.Sleep with require.Eventually
    polling the new hasInflight() helper.
  * TestFix8: replaced manual mutex poll loop with require.Eventually
    + hasInflight().
  * TestFix2: include goroutine index in the failure message so a
    sporadic single-goroutine failure is easier to debug.
  * Gated the 500-goroutine soak tests and the 2-second TestFix8
    behind testing.Short() so the dev-loop cost is opt-in.

All tests pass with -race; -short also passes (skips three slow tests
as designed). gofmt and vet clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: revert cross-region/forceRefresh behavior, address deep-review #1,#2,Azure#4,Azure#13, comment cleanup

Behavioral reverts (preserve long-standing semantics):
- cosmos_client_retry_policy.go: write-retry path uses gem.Update(ctx, true)
  to force-refresh per attempt (intentional; restores prior behavior).
- cosmos_location_cache.go: resolveServiceEndpoint with
  enableCrossRegionRetries=false returns lc.defaultEndpoint for
  single-master writes (avoids a breaking change).

Fixes addressing prior deep-review findings:
- cosmos_global_endpoint_manager_policy.go:
  * Log async refresh failures (was discarded via '_ ='); chronic
    post-bootstrap topology drift is now observable.
  * Capture refreshCtx via context.WithoutCancel BEFORE launching the
    goroutine so it does not depend on req's lifetime.
- cosmos_global_endpoint_manager.go: leader's HTTP call runs under
  context.WithoutCancel(ctx); waiters still respect their own ctx so
  caller-side cancellation no longer poisons coalesced waiters.
- cosmos_location_cache.go markEndpointUnavailable: fast-path skips the
  full updateLocked recompute when wasAlreadyUnavailable==true (route
  lists cannot change in that case).

Tests and docs:
- TestFix3 renamed/inverted to TestFix3_WriteRetryForceRefreshesGEM.
- TestDefaultEndpointElim_CrossRegionRetriesDisabled* renamed/inverted
  to assert the restored default-endpoint behavior.
- TestRegression25468_* renamed to TestRegression_*.
- Comments stripped of Azure#25468 references and tightened.
- CHANGELOG updated to link to PR 26815 and reflect the reverted
  cross-region-retries=false behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: deflake TestFix3c on Windows go1.26.1

Under some schedulers (observed on windows_go_1261 CI), every concurrent
gem.Update can complete observing the still-fresh lastUpdateTime before
any of the concurrent MarkEndpointUnavailableForWrite goroutines fires
invalidate(), leaving the post-invalidation refresh un-triggered and the
HTTP call count at 0 instead of the expected 1.

Add an explicit gem.Update after wg.Wait() to deterministically trigger
the post-invalidation refresh. The singleflight + invalidationGen bound
still guarantees the burst-plus-trailing-Update collapses to exactly
one HTTP round-trip, so the assertion (transport.count == 1) stays
tight.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: address PR Copilot review comments

- getPrefAvailableEndpointsLocked now takes prefLocations as a parameter
  rather than reading lc.locationInfo.prefLocations. updateLocked passes
  nextLoc.prefLocations so the route lists are computed from a single
  consistent in-progress snapshot, not the still-committed lc.locationInfo.
- countingTransport.Do now sets Response.Status (e.g. '200 OK') and
  ContentLength so simulated responses match net/http behavior and SDK
  helpers like NewResponseErrorWithErrorCode produce useful error text.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: consolidate CHANGELOG entry to one line

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Tomas Varon <tvaron@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant