Skip to content

[Cosmos] add container cache and pk range cache#26723

Merged
simorenoh merged 30 commits into
Azure:mainfrom
simorenoh:container-client-caches
May 13, 2026
Merged

[Cosmos] add container cache and pk range cache#26723
simorenoh merged 30 commits into
Azure:mainfrom
simorenoh:container-client-caches

Conversation

@simorenoh
Copy link
Copy Markdown
Member

@simorenoh simorenoh commented May 5, 2026

Resolves #26704

Adds client-level caching for partition key ranges and container properties, and fixes partition key routing for hierarchical partition keys (HPK). These changes eliminate redundant HTTP calls in the ReadManyItems path where the uncached getPartitionKeyRanges call was causing ~1.5s latency on large containers.

Partition Key Range Cache

  • Read-through cache keyed by container ResourceID (RID), shared across all ContainerClient instances
  • Purely event-driven invalidation (410/Gone triggers refresh, no TTL)
  • Per-entry sync.Mutex for single-flight pattern (concurrent cache misses result in a single network call)
  • Incremental refresh via change-feed ETag (A-IM: Incremental Feed + If-None-Match headers), looping until 304 Not Modified (capped at 10 iterations)
  • Falls back to full refresh if incremental merge produces incomplete covering

Container Properties Cache

  • Dual-index cache: entries accessible by both name-based link and ResourceID for O(1) lookups from either key
  • getProperties() (by link) and getPropertiesByRID() (by RID) — set() cross-populates both indices
  • container.Read() and container.Replace() populate the cache on success (always go to service, never read from cache)

Collection Routing Map

  • Immutable sorted PK range snapshot with tryCombine() for incremental split/merge handling
  • Gone-range tracking: parent range IDs stored in a set, filtered on merge
  • getOverlappingRanges(min, max) — binary search for all ranges overlapping an EPK range (used for HPK prefix fan-out)
  • isCompleteSetOfRanges() validation: checks contiguous coverage from "" to "FF"

Hierarchical Partition Key (HPK) Routing

  • Length-aware EPK comparison (CompareEPK): handles mixed-length EPK boundaries from HPK containers where the service returns 32-char partial and 64-char fully-specified boundaries
  • Prefix EPK range computation (computeEPKRange): for partial MultiHash keys (fewer components than paths), computes [prefix_epk, prefix_epk + "FF")
  • HPK prefix fan-out: groupItemsByPhysicalRange now uses getOverlappingRanges to find all physical partitions overlapping a prefix EPK range, distributing items across multiple partitions

Bug Fix: V2 EPK Top-Bit Masking

Fixed a pre-existing bug in the Go SDK where the top 2 bits of the first EPK byte were not masked (hash[0] &= 0x3F). This caused items whose V2 hash started with a byte >= 0x40 to fail routing in ReadMany because the EPK lexicographically exceeded the "FF" range sentinel. All other SDKs (Python, .NET, Java, Rust) apply this masking.

Behavioral Changes

  • getPartitionKeyRanges() now routes through the PK range cache (with nil-fallback for backward compat)
  • executeReadManyWithQueries() uses the container properties cache instead of calling container.Read() every time
  • 410/Gone responses in ReadManyItems trigger cache invalidation + retry (up to 3 attempts)

Testing

  • 50+ new unit tests covering routing map operations, cache hit/miss/invalidation, single-flight concurrency, tryCombine split merging, completeness validation, 410/Gone error detection, HPK prefix fan-out, length-aware EPK comparison, overlapping range queries, and dual-index container cache
  • All existing tests pass without modification

@github-actions github-actions Bot added the Cosmos label May 5, 2026
@Azure Azure deleted a comment from azure-pipelines Bot May 6, 2026
@simorenoh
Copy link
Copy Markdown
Member Author

/azp run go - azcosmos

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@simorenoh simorenoh marked this pull request as ready for review May 8, 2026 17:08
Copilot AI review requested due to automatic review settings May 8, 2026 17:08
@simorenoh simorenoh requested a review from a team as a code owner May 8, 2026 17:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves azcosmos ReadMany performance and correctness by introducing client-level caching for container metadata and partition key ranges, and by fixing hierarchical partition key (HPK) routing behavior via length-aware EPK comparisons and prefix fan-out.

Changes:

  • Added client-scoped caches for container properties (dual-index by link and RID) and partition key ranges (read-through by RID with incremental refresh).
  • Updated ReadMany routing to support HPK mixed-length EPK boundaries and MultiHash prefix fan-out, plus retries on PK-range-gone (410) responses.
  • Fixed V2 EPK routing by masking the top 2 bits of the first hash byte (aligning with other SDKs).

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
sdk/data/azcosmos/partition_key.go Switches V2 EPK computation to routing-safe masking and adds EPK range computation for prefix keys.
sdk/data/azcosmos/partition_key_test.go Updates expected EPK values to reflect top-bit masking.
sdk/data/azcosmos/internal/epk/epk.go Adds routing-safe V2 hash helpers, top-bit masking, and length-aware CompareEPK.
sdk/data/azcosmos/internal/epk/epk_test.go Tests raw V2 hashes (internal) and adds CompareEPK coverage.
sdk/data/azcosmos/cosmos_pk_range_gone_test.go Adds tests for PK-range-gone (410 + substatus) detection helpers.
sdk/data/azcosmos/cosmos_partition_key_range_cache.go Implements client-level PK range cache with incremental refresh and single-flight per RID.
sdk/data/azcosmos/cosmos_partition_key_range_cache_test.go Adds initial tests for the PK range cache (currently mostly structural).
sdk/data/azcosmos/cosmos_http_constants.go Adds substatus constants and helper to detect PK-range-gone conditions.
sdk/data/azcosmos/cosmos_container.go Populates container cache on Read/Replace and routes PK range fetches through the cache.
sdk/data/azcosmos/cosmos_container_read_many.go Uses caches in ReadMany, adds HPK-aware routing, MultiHash prefix fan-out, and 410 retry/refresh handling.
sdk/data/azcosmos/cosmos_container_read_many_test.go Adds tests for EPK range computation and mixed-length boundary routing behavior.
sdk/data/azcosmos/cosmos_container_properties_cache.go Implements client-level container properties cache (dual-index by link and RID).
sdk/data/azcosmos/cosmos_container_properties_cache_test.go Adds tests for the container properties cache indexing/invalidation behavior.
sdk/data/azcosmos/cosmos_collection_routing_map.go Adds immutable routing map snapshot with incremental merge and overlap queries.
sdk/data/azcosmos/cosmos_collection_routing_map_test.go Adds tests for routing map sorting/merge/completeness/overlap behavior.
sdk/data/azcosmos/cosmos_client.go Initializes the new caches on client construction.
sdk/data/azcosmos/CHANGELOG.md Documents the new caches and the V2 routing fix.

Comment thread sdk/data/azcosmos/cosmos_collection_routing_map.go Outdated
Comment thread sdk/data/azcosmos/cosmos_container_read_many.go
Comment thread sdk/data/azcosmos/cosmos_container_properties_cache.go
Comment thread sdk/data/azcosmos/cosmos_partition_key_range_cache_test.go Outdated
Comment thread sdk/data/azcosmos/cosmos_partition_key_range_cache_test.go Outdated
Comment thread sdk/data/azcosmos/cosmos_container_properties_cache.go Outdated
simorenoh and others added 4 commits May 8, 2026 15:19
- Fix lock-order inversion in containerPropertiesCache: extract
  updateRIDIndex() helper, ensure c.mu is never acquired while
  holding entry.mu (fixes Azure#3 and Azure#6)
- Rename misleading test names to match actual behavior (Azure#4 and Azure#5)
- Use epk.CompareEPK in isCompleteSetOfRanges for length-aware
  HPK boundary comparison (Azure#1)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Member

@tvaron3 tvaron3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread sdk/data/azcosmos/internal/epk/epk.go
@xinlian12
Copy link
Copy Markdown
Member

@sdkReviewAgent

Comment thread sdk/data/azcosmos/CHANGELOG.md
Comment thread sdk/data/azcosmos/cosmos_container.go Outdated
Comment thread sdk/data/azcosmos/cosmos_partition_key_range_cache.go Outdated
Comment thread sdk/data/azcosmos/cosmos_container_read_many.go
Copy link
Copy Markdown
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PKRange cache is still not wired up to cosmos_container.go line 514 and following - that is from my understanding where the app in question is hammering master partition today. Please include it - by default GetFeedRanges should come from cache - and if force refresh is specified (opt-in), it should just result in draining updates until 304 - then use the cache - never getting all partitions again.

Comment thread sdk/data/azcosmos/cosmos_client.go Outdated
Comment thread sdk/data/azcosmos/cosmos_container.go Outdated
Comment thread sdk/data/azcosmos/cosmos_container.go
Comment thread sdk/data/azcosmos/cosmos_container_properties_cache.go
Comment thread sdk/data/azcosmos/partition_key_test.go
simorenoh and others added 6 commits May 12, 2026 18:03
…cache

Move PartitionKeyDefinition resolution inside the 410 retry loop so that
after a cache invalidation (container deleted and recreated with a
different partition key schema), the new pkDef is picked up — not just
the new partition key ranges.

Add nil guard in containerPropertiesCache.refreshEntry() to return an
error instead of caching nil properties, which would cause a nil pointer
panic in callers that dereference ContainerProperties.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for not enforcing singleton caches per Cosmos DB account - independent of the number of CosmosClients.

@kushagraThapar kushagraThapar dismissed FabianMeiswinkel’s stale review May 13, 2026 19:19

The change Fabian has requested is already implemented. We can merge this PR in.

@simorenoh simorenoh merged commit 4125cab into Azure:main May 13, 2026
21 checks passed
Copy link
Copy Markdown
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now

tvaron3 added a commit to tvaron3/azure-sdk-for-go that referenced this pull request May 13, 2026
…one retry, multi-range continuation queue (Azure#26767)

GetChangeFeed previously failed (silently or loudly) when a physical
partition split occurred under a customer-supplied FeedRange:

  * toHeaders did exact-equality matching against the cached PK ranges,
    so a parent FeedRange that no longer matched any current child fell
    through to a request issued WITHOUT the partition-key-range-id
    header (silent miss).
  * 410/Gone with the PartitionKeyRangeGone substatus surfaced directly
    to the caller; the underlying maxPKRangeGoneRetries / forceRefresh
    plumbing added in Azure#26723 was never wired into the change feed.
  * The composite continuation token only encoded a single EPK range,
    so resuming after a split lost events on the boundary.

This change rewires GetChangeFeed onto a queue-driven drain loop:

  * buildChangeFeedInitialQueue resolves the customer FeedRange (or
    inbound continuation token) by overlap-match against the routing
    map, expanding a parent FeedRange into N children inheriting the
    parent ETag.
  * getChangeFeedForQueue drains the queue FIFO. On each iteration it
    re-resolves the head against the cached snapshot, detects mid-drain
    splits and bumps the rotation budget so newly-inserted children
    get queried in the same call.
  * On 410/Gone with PK-range substatus, the cache is refreshed, the
    routing snapshot is re-fetched, and the head is retried; capped at
    maxPKRangeGoneRetries.
  * The composite continuation token now carries the full multi-range
    queue, with each entry's ETag advanced FIFO on every response.
  * Cross-container token reuse is rejected loudly: if a token's
    ResourceID does not match the current container's _rid, the call
    fails with a diagnostic error rather than silently misrouting
    against the wrong routing map.

Behavior preserved for callers without the cache infrastructure wired:
when no PK-range cache is present and the direct fetch returns no
ranges (test/mock case), a single passthrough queue entry is used so
the legacy issue-without-overlap-match flow still runs.

Tests added in cosmos_change_feed_split_test.go cover: 410-retry, retry
cap surfacing the final 410, cross-container ResourceID rejection,
no-overlap-after-refresh -> ErrFeedRangeUnresolved, split expansion
routing the first child, multi-range continuation round-trip, and the
mid-drain split-during-drain budget bump regression guard. All three
existing change-feed tests continue to pass.

Closes Azure#26767

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
simorenoh pushed a commit that referenced this pull request May 15, 2026
…one retry, multi-range continuation queue (#26767) (#26768)

* fix(azcosmos): GetChangeFeed split-resilient via overlap match, 410/Gone retry, multi-range continuation queue (#26767)

GetChangeFeed previously failed (silently or loudly) when a physical
partition split occurred under a customer-supplied FeedRange:

  * toHeaders did exact-equality matching against the cached PK ranges,
    so a parent FeedRange that no longer matched any current child fell
    through to a request issued WITHOUT the partition-key-range-id
    header (silent miss).
  * 410/Gone with the PartitionKeyRangeGone substatus surfaced directly
    to the caller; the underlying maxPKRangeGoneRetries / forceRefresh
    plumbing added in #26723 was never wired into the change feed.
  * The composite continuation token only encoded a single EPK range,
    so resuming after a split lost events on the boundary.

This change rewires GetChangeFeed onto a queue-driven drain loop:

  * buildChangeFeedInitialQueue resolves the customer FeedRange (or
    inbound continuation token) by overlap-match against the routing
    map, expanding a parent FeedRange into N children inheriting the
    parent ETag.
  * getChangeFeedForQueue drains the queue FIFO. On each iteration it
    re-resolves the head against the cached snapshot, detects mid-drain
    splits and bumps the rotation budget so newly-inserted children
    get queried in the same call.
  * On 410/Gone with PK-range substatus, the cache is refreshed, the
    routing snapshot is re-fetched, and the head is retried; capped at
    maxPKRangeGoneRetries.
  * The composite continuation token now carries the full multi-range
    queue, with each entry's ETag advanced FIFO on every response.
  * Cross-container token reuse is rejected loudly: if a token's
    ResourceID does not match the current container's _rid, the call
    fails with a diagnostic error rather than silently misrouting
    against the wrong routing map.

Behavior preserved for callers without the cache infrastructure wired:
when no PK-range cache is present and the direct fetch returns no
ranges (test/mock case), a single passthrough queue entry is used so
the legacy issue-without-overlap-match flow still runs.

Tests added in cosmos_change_feed_split_test.go cover: 410-retry, retry
cap surfacing the final 410, cross-container ResourceID rejection,
no-overlap-after-refresh -> ErrFeedRangeUnresolved, split expansion
routing the first child, multi-range continuation round-trip, and the
mid-drain split-during-drain budget bump regression guard. All three
existing change-feed tests continue to pass.

Closes #26767

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: address reviewer round-1 findings on F1 change-feed split handling

- Capture drain-loop error in GetChangeFeed so deferred span observes failures (#1)
- Reset 410/Gone budget on both split-expansion and head-advance (#2)
- Move resp.Body.Close() above 304 short-circuit in newChangeFeedResponse (#7)
- Drop dead empty-feedMax branch in cosmos_feed_range.go (#8)
- Remove redundant token.head() reload in drain loop (#9)
- Reduce buildChangeFeedInitialQueue to 3 returns (#10)
- buildRequestHeaders surfaces PartitionKey serialization errors (#11)
- Cache-nil + no-routing degraded fallback: log warning, drop continuation
  token, and read fresh with passthrough range instead of erroring; preserves
  back-compat for tests built without caches wired up
- CHANGELOG link points to PR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: drop redundant head reload after dropHeadContinuation

Round-2 reviewer cleanup. dropHeadContinuation mutates Continuation[0]
in place; the existing head pointer reflects the change without reload.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: tighten GetChangeFeed doc + CHANGELOG entry

Condenses the GetChangeFeed numbered behavior list to a short paragraph
and trims the CHANGELOG entry. No behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: relocate overlap helper, delete legacy toHeaders + findPartitionKeyRangeID

- Move overlappingPartitionKeyRanges into cosmos_feed_range.go next to its
  sibling findOverlappingPartitionKeyRangeIDs and simplify to a single
  pass (no map round-trip)
- Delete the legacy ChangeFeedOptions.toHeaders builder and its test file;
  the new GetChangeFeed path uses buildRequestHeaders exclusively
- Delete the now-unused findPartitionKeyRangeID exact-match helper (the
  original silent-mismatch source); overlap-resolution lives upstream now
- Drop the small toHeaders-tail test in TestContainerGetChangeFeedWithStartFrom
  that only verified the legacy header builder

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: split GetChangeFeed into its own file + restore options test coverage

- Move GetChangeFeed, buildChangeFeedInitialQueue, resolveFeedRangeToChildren,
  buildChildQueueEntries, getChangeFeedForQueue, and serializeCompositeContinuationToken
  out of cosmos_container.go into cosmos_container_change_feed.go (mirrors the
  cosmos_container_read_many.go split). Pure relocation, no behavior change.
- Add cosmos_change_feed_request_options_test.go covering buildRequestHeaders:
  defaults (only AIM header), all-fields population, empty-ETag continuation
  is omitted, and PartitionKey serialization errors are surfaced (the new
  error-returning contract the legacy toHeaders silently swallowed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: add emulator change-feed tests for F1 split-handling scenarios

Three new emulator tests targeting F1-specific paths not covered by the
existing TestEmulatorContainerChangeFeed:

- SubRangeOfPhysicalRange — original-bug repro: a customer FeedRange that
  is a strict sub-range of a physical pkrange (no exact-match) must
  resolve via overlap-match and emit a request.
- WideRangeMultiDrain — wide FeedRange overlapping every physical range;
  paginates via the composite continuation token until all inserted items
  are drained, exercising the multi-range queue rotation.
- CrossContainerTokenRejected — token from container-A handed to
  container-B must be rejected by the ResourceID guard.

All three skip when EMULATOR is unset, matching repo convention.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: fix lint failures from CI

- Drop ineffectual headFeedRange reassignment in degraded fallback (it was
  never read after the assignment)
- Delete unused findOverlappingPartitionKeyRangeIDs (overlappingPartitionKeyRanges
  superseded it; no remaining callers)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: keep GetChangeFeed entrypoint in cosmos_container.go

Helpers (buildChangeFeedInitialQueue, resolveFeedRangeToChildren,
buildChildQueueEntries, getChangeFeedForQueue, serializeCompositeContinuationToken)
remain in cosmos_container_change_feed.go; only the public method moves back
so the canonical container file continues to host the public API surface.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* azcosmos: address PR review comments on F1 change-feed

Round-2 reviewer findings from kushagraThapar plus a CHANGELOG link revert:

- (#2) Revert Copilot autofix that changed the CHANGELOG PR link from
  26768 (this PR) to 26767 (the issue it closes).
- (#4) Clamp child ranges to the parent token's bounds during split-
  expansion (matches Java FeedRangeCompositeContinuationImpl.createChildRanges).
  Children that fall entirely outside the parent are dropped. New
  clampChildrenToParent helper + TestClampChildrenToParent.
- (#5) Populate token.ResourceID at queue-construction time (both Path A
  and Path B) so the cross-container guard remains meaningful even when
  the very first response is a 304 (no body parsed).
- (#6) Replace misleading 'legacy ETag-only continuation handled by
  buildRequestHeaders' comment with an accurate one stating that only
  composite tokens issued by GetChangeFeed are honored on resume; legacy
  raw-ETag continuation strings are not supported.
- (#7) Surface partial drain state alongside the 410-budget-exhausted
  error so callers can resume from the failed head instead of re-querying
  drained heads. New TestGetChangeFeed_410BudgetExhaustedMidDrain_SurfacesPartialState.
- (#8) Aggregate RequestCharge across drain iterations so the returned
  response reports total RU consumed (matches cosmos_container_read_many
  pattern).
- (#9) Remove the cache-nil degraded fallback (production always wires
  caches via acquireCaches; the fallback existed only for legacy tests
  built without caches). Rewire the 3 legacy TestContainerGetChangeFeed*
  tests to use createChangeFeedTestClient with pre-seeded caches.
  Removes now-unused dropHeadContinuation helper.
- (#10) Belt-and-suspenders: also accept len(response.Documents) > 0 in
  the early-return condition so a response missing _count but carrying
  docs isn't silently re-drained.
- (#11) Add explicit ctx.Done() check at the top of the drain loop so a
  cancelled context exits between sub-requests rather than waiting for
  transport-level cancellation. New
  TestGetChangeFeed_ContextCancelled_HonoredBetweenSubRequests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* azcosmos: address xinlian12 review comments + thread priority/throughput options post-merge

- (X1) Add hard cap on drain-loop rotation budget (4*(initialLen+1))
  matching Java's FeedRangeCompositeContinuationImpl defense. Prevents
  unbounded loops if overlap-resolution ever produces duplicate ranges.
- (X2) Add focused table-driven unit tests for the overlap helpers:
  TestNormalizeMaxBoundary, TestRangesOverlap_TableDriven (10 cases),
  TestOverlappingPartitionKeyRanges_TableDriven (11 cases covering exact-
  match, multi-overlap, sub-range, straddling, empty/inverted/equal
  feed ranges, boundary touches).
- (X3) PopulateCompositeContinuationToken (public API) is now a no-op
  when ContinuationToken is already populated, preventing the multi-range
  composite token written by the drain loop from being overwritten with
  a stale single-range token rebuilt from the per-head FeedRange.

Post-merge with upstream/main (#26750):
- Thread options.PriorityLevel and options.ThroughputBucket through the
  F1 drain loop's pipelineRequestOptions.headerOptionsOverride so the
  new priority/throughput-bucket support added in #26750 reaches the
  change-feed request.
- Rewire TestChangeFeedPriorityAndThroughputBucketHeaders to use
  createChangeFeedTestClient (caches pre-wired) since the cache-nil
  fallback was removed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Slow ReadManyItems calls due to recent query-based approach

7 participants