[Cosmos] add container cache and pk range cache #26723
Conversation
/azp run go - azcosmos
Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
This PR improves azcosmos ReadMany performance and correctness by introducing client-level caching for container metadata and partition key ranges, and by fixing hierarchical partition key (HPK) routing behavior via length-aware EPK comparisons and prefix fan-out.
Changes:
- Added client-scoped caches for container properties (dual-index by link and RID) and partition key ranges (read-through by RID with incremental refresh).
- Updated ReadMany routing to support HPK mixed-length EPK boundaries and MultiHash prefix fan-out, plus retries on PK-range-gone (410) responses.
- Fixed V2 EPK routing by masking the top 2 bits of the first hash byte (aligning with other SDKs).
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| sdk/data/azcosmos/partition_key.go | Switches V2 EPK computation to routing-safe masking and adds EPK range computation for prefix keys. |
| sdk/data/azcosmos/partition_key_test.go | Updates expected EPK values to reflect top-bit masking. |
| sdk/data/azcosmos/internal/epk/epk.go | Adds routing-safe V2 hash helpers, top-bit masking, and length-aware CompareEPK. |
| sdk/data/azcosmos/internal/epk/epk_test.go | Tests raw V2 hashes (internal) and adds CompareEPK coverage. |
| sdk/data/azcosmos/cosmos_pk_range_gone_test.go | Adds tests for PK-range-gone (410 + substatus) detection helpers. |
| sdk/data/azcosmos/cosmos_partition_key_range_cache.go | Implements client-level PK range cache with incremental refresh and single-flight per RID. |
| sdk/data/azcosmos/cosmos_partition_key_range_cache_test.go | Adds initial tests for the PK range cache (currently mostly structural). |
| sdk/data/azcosmos/cosmos_http_constants.go | Adds substatus constants and helper to detect PK-range-gone conditions. |
| sdk/data/azcosmos/cosmos_container.go | Populates container cache on Read/Replace and routes PK range fetches through the cache. |
| sdk/data/azcosmos/cosmos_container_read_many.go | Uses caches in ReadMany, adds HPK-aware routing, MultiHash prefix fan-out, and 410 retry/refresh handling. |
| sdk/data/azcosmos/cosmos_container_read_many_test.go | Adds tests for EPK range computation and mixed-length boundary routing behavior. |
| sdk/data/azcosmos/cosmos_container_properties_cache.go | Implements client-level container properties cache (dual-index by link and RID). |
| sdk/data/azcosmos/cosmos_container_properties_cache_test.go | Adds tests for the container properties cache indexing/invalidation behavior. |
| sdk/data/azcosmos/cosmos_collection_routing_map.go | Adds immutable routing map snapshot with incremental merge and overlap queries. |
| sdk/data/azcosmos/cosmos_collection_routing_map_test.go | Adds tests for routing map sorting/merge/completeness/overlap behavior. |
| sdk/data/azcosmos/cosmos_client.go | Initializes the new caches on client construction. |
| sdk/data/azcosmos/CHANGELOG.md | Documents the new caches and the V2 routing fix. |
…h/azure-sdk-for-go into container-client-caches
- Fix lock-order inversion in containerPropertiesCache: extract updateRIDIndex() helper; ensure c.mu is never acquired while holding entry.mu (fixes Azure#3 and Azure#6)
- Rename misleading test names to match actual behavior (Azure#4 and Azure#5)
- Use epk.CompareEPK in isCompleteSetOfRanges for length-aware HPK boundary comparison (Azure#1)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@sdkReviewAgent
FabianMeiswinkel
left a comment
The PKRange cache is still not wired up to cosmos_container.go line 514 and following - from my understanding, that is where the app in question is hammering the master partition today. Please include it: by default, GetFeedRanges should come from the cache, and if force refresh is specified (opt-in), it should just drain incremental updates until a 304, then use the cache - never fetching all partitions again.
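The cache-first flow the reviewer describes can be sketched as follows. This is a hypothetical illustration, not the azcosmos implementation: pkRangeCache, getFeedRanges, and rangeFetcher are illustrative names, and the 10-iteration cap mirrors the limit described in this PR.

```go
package main

import "fmt"

// rangeFetcher stands in for the incremental PK-range feed call
// (A-IM: Incremental Feed + If-None-Match in the real SDK).
type rangeFetcher func(etag string) (ranges []string, newETag string, status int)

type pkRangeCache struct {
	ranges []string
	etag   string
}

// getFeedRanges serves from the cache by default. On opt-in force refresh it
// drains incremental updates until a 304 Not Modified, then serves the
// refreshed cache; it never re-fetches all partitions.
func (c *pkRangeCache) getFeedRanges(fetch rangeFetcher, forceRefresh bool) []string {
	if !forceRefresh && c.ranges != nil {
		return c.ranges // default path: no hit on the master partition
	}
	for i := 0; i < 10; i++ { // cap mirrors the 10-iteration limit in the PR
		ranges, etag, status := fetch(c.etag)
		if status == 304 {
			break // caught up; cache is current
		}
		c.ranges = append(c.ranges, ranges...) // merge the incremental delta
		c.etag = etag
	}
	return c.ranges
}

func main() {
	calls := 0
	fetch := func(etag string) ([]string, string, int) {
		calls++
		if etag == "v1" {
			return nil, etag, 304
		}
		return []string{"0", "1"}, "v1", 200
	}
	c := &pkRangeCache{}
	fmt.Println(c.getFeedRanges(fetch, true))  // drains until 304
	fmt.Println(c.getFeedRanges(fetch, false)) // served from cache
	fmt.Println(calls)                         // 2 fetches total: one 200 + one 304
}
```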
…cache
- Move PartitionKeyDefinition resolution inside the 410 retry loop so that after a cache invalidation (container deleted and recreated with a different partition key schema), the new pkDef is picked up, not just the new partition key ranges.
- Add a nil guard in containerPropertiesCache.refreshEntry() to return an error instead of caching nil properties, which would cause a nil pointer panic in callers that dereference ContainerProperties.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
FabianMeiswinkel
left a comment
LGTM, except that the caches are not enforced as singletons per Cosmos DB account, independent of the number of CosmosClients.
The change Fabian requested is already implemented. We can merge this PR.
…one retry, multi-range continuation queue (Azure#26767)

GetChangeFeed previously failed (silently or loudly) when a physical partition split occurred under a customer-supplied FeedRange:
- toHeaders did exact-equality matching against the cached PK ranges, so a parent FeedRange that no longer matched any current child fell through to a request issued WITHOUT the partition-key-range-id header (silent miss).
- 410/Gone with the PartitionKeyRangeGone substatus surfaced directly to the caller; the underlying maxPKRangeGoneRetries / forceRefresh plumbing added in Azure#26723 was never wired into the change feed.
- The composite continuation token only encoded a single EPK range, so resuming after a split lost events on the boundary.

This change rewires GetChangeFeed onto a queue-driven drain loop:
- buildChangeFeedInitialQueue resolves the customer FeedRange (or inbound continuation token) by overlap-match against the routing map, expanding a parent FeedRange into N children inheriting the parent ETag.
- getChangeFeedForQueue drains the queue FIFO. On each iteration it re-resolves the head against the cached snapshot, detects mid-drain splits, and bumps the rotation budget so newly inserted children get queried in the same call.
- On 410/Gone with the PK-range substatus, the cache is refreshed, the routing snapshot is re-fetched, and the head is retried, capped at maxPKRangeGoneRetries.
- The composite continuation token now carries the full multi-range queue, with each entry's ETag advanced FIFO on every response.
- Cross-container token reuse is rejected loudly: if a token's ResourceID does not match the current container's _rid, the call fails with a diagnostic error rather than silently misrouting against the wrong routing map.

Behavior is preserved for callers without the cache infrastructure wired: when no PK-range cache is present and the direct fetch returns no ranges (test/mock case), a single passthrough queue entry is used so the legacy issue-without-overlap-match flow still runs.

Tests added in cosmos_change_feed_split_test.go cover: 410 retry, the retry cap surfacing the final 410, cross-container ResourceID rejection, no-overlap-after-refresh -> ErrFeedRangeUnresolved, split expansion routing the first child, multi-range continuation round-trip, and the mid-drain split-during-drain budget-bump regression guard. All three existing change-feed tests continue to pass.

Closes Azure#26767

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
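The queue-driven drain loop with a capped 410/Gone retry can be sketched roughly as below. This is an illustration under assumed names (queueEntry, drain, and the query/refresh callbacks are hypothetical), not the azcosmos code; the real loop also re-resolves the head against the routing snapshot and rotates the queue.

```go
package main

import "fmt"

// queueEntry is a stand-in for one per-range continuation entry
// (range identity + the ETag advanced FIFO on every response).
type queueEntry struct {
	rangeID string
	etag    string
}

const maxPKRangeGoneRetries = 3

// drain processes entries FIFO. query returns (newETag, gone); on a simulated
// 410/Gone it calls refresh (cache + routing snapshot in the real SDK) and
// retries the same head, capped at maxPKRangeGoneRetries.
func drain(queue []queueEntry, query func(queueEntry) (string, bool), refresh func()) ([]queueEntry, error) {
	for i := 0; i < len(queue); {
		retries := 0
		for {
			etag, gone := query(queue[i])
			if !gone {
				queue[i].etag = etag // advance the head's ETag
				break
			}
			if retries++; retries > maxPKRangeGoneRetries {
				return queue, fmt.Errorf("410 budget exhausted on range %s", queue[i].rangeID)
			}
			refresh() // refresh, then retry the head
		}
		i++
	}
	return queue, nil
}

func main() {
	goneOnce := true
	q, err := drain(
		[]queueEntry{{"0", ""}, {"1", ""}},
		func(e queueEntry) (string, bool) {
			if e.rangeID == "1" && goneOnce {
				goneOnce = false
				return "", true // first query of range 1 hits 410/Gone
			}
			return "etag-" + e.rangeID, false
		},
		func() {},
	)
	fmt.Println(q, err)
}
```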
…one retry, multi-range continuation queue (#26767) (#26768)

* fix(azcosmos): GetChangeFeed split-resilient via overlap match, 410/Gone retry, multi-range continuation queue (#26767). The commit message is identical to the entry above; see it for the full details. Closes #26767.

* azcosmos: address reviewer round-1 findings on F1 change-feed split handling
  - Capture the drain-loop error in GetChangeFeed so the deferred span observes failures (#1)
  - Reset the 410/Gone budget on both split-expansion and head-advance (#2)
  - Move resp.Body.Close() above the 304 short-circuit in newChangeFeedResponse (#7)
  - Drop the dead empty-feedMax branch in cosmos_feed_range.go (#8)
  - Remove the redundant token.head() reload in the drain loop (#9)
  - Reduce buildChangeFeedInitialQueue to 3 returns (#10)
  - buildRequestHeaders surfaces PartitionKey serialization errors (#11)
  - Cache-nil + no-routing degraded fallback: log a warning, drop the continuation token, and read fresh with a passthrough range instead of erroring; preserves back-compat for tests built without caches wired up
  - CHANGELOG link points to the PR

* azcosmos: drop the redundant head reload after dropHeadContinuation. Round-2 reviewer cleanup: dropHeadContinuation mutates Continuation[0] in place, so the existing head pointer reflects the change without a reload.

* azcosmos: tighten GetChangeFeed doc + CHANGELOG entry. Condenses the GetChangeFeed numbered behavior list to a short paragraph and trims the CHANGELOG entry. No behavior change.

* azcosmos: relocate the overlap helper; delete legacy toHeaders + findPartitionKeyRangeID
  - Move overlappingPartitionKeyRanges into cosmos_feed_range.go next to its sibling findOverlappingPartitionKeyRangeIDs and simplify it to a single pass (no map round-trip)
  - Delete the legacy ChangeFeedOptions.toHeaders builder and its test file; the new GetChangeFeed path uses buildRequestHeaders exclusively
  - Delete the now-unused findPartitionKeyRangeID exact-match helper (the original silent-mismatch source); overlap resolution lives upstream now
  - Drop the small toHeaders-tail test in TestContainerGetChangeFeedWithStartFrom that only verified the legacy header builder

* azcosmos: split GetChangeFeed into its own file + restore options test coverage
  - Move GetChangeFeed, buildChangeFeedInitialQueue, resolveFeedRangeToChildren, buildChildQueueEntries, getChangeFeedForQueue, and serializeCompositeContinuationToken out of cosmos_container.go into cosmos_container_change_feed.go (mirrors the cosmos_container_read_many.go split). Pure relocation, no behavior change.
  - Add cosmos_change_feed_request_options_test.go covering buildRequestHeaders: defaults (only the AIM header), all-fields population, empty-ETag continuation omitted, and PartitionKey serialization errors surfaced (the new error-returning contract the legacy toHeaders silently swallowed).

* azcosmos: add emulator change-feed tests for F1 split-handling scenarios. Three new emulator tests targeting F1-specific paths not covered by the existing TestEmulatorContainerChangeFeed:
  - SubRangeOfPhysicalRange: original-bug repro; a customer FeedRange that is a strict sub-range of a physical pkrange (no exact match) must resolve via overlap-match and emit a request.
  - WideRangeMultiDrain: a wide FeedRange overlapping every physical range; paginates via the composite continuation token until all inserted items are drained, exercising the multi-range queue rotation.
  - CrossContainerTokenRejected: a token from container-A handed to container-B must be rejected by the ResourceID guard.
  All three skip when EMULATOR is unset, matching repo convention.

* azcosmos: fix lint failures from CI
  - Drop the ineffectual headFeedRange reassignment in the degraded fallback (it was never read after the assignment)
  - Delete the unused findOverlappingPartitionKeyRangeIDs (overlappingPartitionKeyRanges superseded it; no remaining callers)

* azcosmos: keep the GetChangeFeed entrypoint in cosmos_container.go. Helpers (buildChangeFeedInitialQueue, resolveFeedRangeToChildren, buildChildQueueEntries, getChangeFeedForQueue, serializeCompositeContinuationToken) remain in cosmos_container_change_feed.go; only the public method moves back so the canonical container file continues to host the public API surface.

* Potential fix for pull request finding

* azcosmos: address PR review comments on F1 change-feed. Round-2 reviewer findings from kushagraThapar plus a CHANGELOG link revert:
  - (#2) Revert the Copilot autofix that changed the CHANGELOG PR link from 26768 (this PR) to 26767 (the issue it closes).
  - (#4) Clamp child ranges to the parent token's bounds during split expansion (matches Java FeedRangeCompositeContinuationImpl.createChildRanges). Children that fall entirely outside the parent are dropped. New clampChildrenToParent helper + TestClampChildrenToParent.
  - (#5) Populate token.ResourceID at queue-construction time (both Path A and Path B) so the cross-container guard remains meaningful even when the very first response is a 304 (no body parsed).
  - (#6) Replace the misleading "legacy ETag-only continuation handled by buildRequestHeaders" comment with an accurate one stating that only composite tokens issued by GetChangeFeed are honored on resume; legacy raw-ETag continuation strings are not supported.
  - (#7) Surface partial drain state alongside the 410-budget-exhausted error so callers can resume from the failed head instead of re-querying drained heads. New TestGetChangeFeed_410BudgetExhaustedMidDrain_SurfacesPartialState.
  - (#8) Aggregate RequestCharge across drain iterations so the returned response reports the total RU consumed (matches the cosmos_container_read_many pattern).
  - (#9) Remove the cache-nil degraded fallback (production always wires caches via acquireCaches; the fallback existed only for legacy tests built without caches). Rewire the 3 legacy TestContainerGetChangeFeed* tests to use createChangeFeedTestClient with pre-seeded caches. Removes the now-unused dropHeadContinuation helper.
  - (#10) Belt-and-suspenders: also accept len(response.Documents) > 0 in the early-return condition so a response missing _count but carrying docs isn't silently re-drained.
  - (#11) Add an explicit ctx.Done() check at the top of the drain loop so a cancelled context exits between sub-requests rather than waiting for transport-level cancellation. New TestGetChangeFeed_ContextCancelled_HonoredBetweenSubRequests.

* azcosmos: address xinlian12 review comments + thread priority/throughput options post-merge
  - (X1) Add a hard cap on the drain-loop rotation budget (4*(initialLen+1)) matching Java's FeedRangeCompositeContinuationImpl defense. Prevents unbounded loops if overlap resolution ever produces duplicate ranges.
  - (X2) Add focused table-driven unit tests for the overlap helpers: TestNormalizeMaxBoundary, TestRangesOverlap_TableDriven (10 cases), TestOverlappingPartitionKeyRanges_TableDriven (11 cases covering exact match, multi-overlap, sub-range, straddling, empty/inverted/equal feed ranges, boundary touches).
  - (X3) PopulateCompositeContinuationToken (public API) is now a no-op when ContinuationToken is already populated, preventing the multi-range composite token written by the drain loop from being overwritten with a stale single-range token rebuilt from the per-head FeedRange.
  Post-merge with upstream/main (#26750):
  - Thread options.PriorityLevel and options.ThroughputBucket through the F1 drain loop's pipelineRequestOptions.headerOptionsOverride so the new priority/throughput-bucket support added in #26750 reaches the change-feed request.
  - Rewire TestChangeFeedPriorityAndThroughputBucketHeaders to use createChangeFeedTestClient (caches pre-wired) since the cache-nil fallback was removed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
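The child-range clamping described in review finding (#4) can be sketched as below. The epkRange type and clampChildrenToParent signature are illustrative stand-ins for the SDK's internals; the behavior shown (clamp to parent bounds, drop children entirely outside) is what the commit describes.

```go
package main

import "fmt"

// epkRange represents a half-open [min, max) interval in EPK hex space.
type epkRange struct{ min, max string }

// clampChildrenToParent clamps split-expansion children to the parent token's
// bounds; children that fall entirely outside the parent are dropped
// (mirrors Java's FeedRangeCompositeContinuationImpl.createChildRanges).
func clampChildrenToParent(parent epkRange, children []epkRange) []epkRange {
	var out []epkRange
	for _, c := range children {
		if c.max <= parent.min || c.min >= parent.max {
			continue // entirely outside the parent: drop
		}
		if c.min < parent.min {
			c.min = parent.min
		}
		if c.max > parent.max {
			c.max = parent.max
		}
		out = append(out, c)
	}
	return out
}

func main() {
	parent := epkRange{"40", "C0"}
	children := []epkRange{{"", "80"}, {"80", "FF"}}
	// children are clamped to [40,80) and [80,C0)
	fmt.Println(clampChildrenToParent(parent, children))
}
```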
Resolves #26704
Adds client-level caching for partition key ranges and container properties, and fixes partition key routing for hierarchical partition keys (HPK). These changes eliminate redundant HTTP calls in the ReadManyItems path, where the uncached getPartitionKeyRanges call was causing ~1.5s latency on large containers.
Partition Key Range Cache
- Shared across ContainerClient instances
- sync.Mutex for a single-flight pattern (concurrent cache misses result in a single network call)
- Incremental refresh (A-IM: Incremental Feed + If-None-Match headers), looping until 304 Not Modified (capped at 10 iterations)
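A minimal Go sketch of the single-flight read-through pattern described above, using illustrative names (pkRangeCache, get) rather than the SDK's actual types: holding the mutex across the fetch means concurrent misses for the same RID collapse into one network call.

```go
package main

import (
	"fmt"
	"sync"
)

// pkRangeCache is an illustrative read-through cache keyed by container RID.
type pkRangeCache struct {
	mu      sync.Mutex
	byRID   map[string][]string
	fetches int // counts how many times we actually hit the network
}

// get returns cached ranges, or fetches once on a miss. Holding mu across the
// fetch serializes concurrent misses (single-flight): only the first caller
// fetches; the rest find the populated entry.
func (c *pkRangeCache) get(rid string, fetch func() []string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if r, ok := c.byRID[rid]; ok {
		return r
	}
	c.fetches++
	r := fetch()
	c.byRID[rid] = r
	return r
}

func main() {
	c := &pkRangeCache{byRID: map[string][]string{}}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.get("rid1", func() []string { return []string{"0", "1"} })
		}()
	}
	wg.Wait()
	fmt.Println(c.fetches) // all 8 concurrent misses resulted in 1 fetch
}
```

A per-RID lock (as in the real cache's single-flight per RID) would avoid serializing unrelated containers; a global mutex keeps the sketch short.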
Container Properties Cache
- getProperties() (by link) and getPropertiesByRID() (by RID); set() cross-populates both indices
- container.Read() and container.Replace() populate the cache on success (they always go to the service, never read from the cache)
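The dual-index idea can be shown in a few lines. This is a hedged sketch with illustrative types (containerProps, propsCache), not the azcosmos implementation: set() writes the same entry into both the by-link and by-RID maps so either lookup path hits it.

```go
package main

import "fmt"

// containerProps is a stand-in for the cached container metadata.
type containerProps struct {
	Link, RID string
}

// propsCache indexes the same entries by container link and by resource ID.
type propsCache struct {
	byLink map[string]*containerProps
	byRID  map[string]*containerProps
}

func newPropsCache() *propsCache {
	return &propsCache{
		byLink: map[string]*containerProps{},
		byRID:  map[string]*containerProps{},
	}
}

// set cross-populates both indices with the same entry.
func (c *propsCache) set(p *containerProps) {
	c.byLink[p.Link] = p
	c.byRID[p.RID] = p
}

func main() {
	c := newPropsCache()
	c.set(&containerProps{Link: "dbs/d/colls/c", RID: "abc=="})
	// a lookup by link and a lookup by RID resolve to the same entry
	fmt.Println(c.byLink["dbs/d/colls/c"] == c.byRID["abc=="])
}
```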
Collection Routing Map
- tryCombine() for incremental split/merge handling
- getOverlappingRanges(min, max): binary search for all ranges overlapping an EPK range (used for HPK prefix fan-out)
- isCompleteSetOfRanges() validation: checks contiguous coverage from "" to "FF"
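One plausible shape for the binary-search overlap query, assuming a routing map sorted by range min (the pkRange type and getOverlappingRanges signature here are illustrative, not the SDK's): binary search to the first candidate whose exclusive max is above the query min, then scan forward while ranges still start below the query max.

```go
package main

import (
	"fmt"
	"sort"
)

// pkRange is an illustrative half-open [min, max) interval in EPK hex space.
type pkRange struct{ min, max string }

// getOverlappingRanges assumes ranges are sorted by min and contiguous,
// and returns every range overlapping the half-open query [min, max).
func getOverlappingRanges(ranges []pkRange, min, max string) []pkRange {
	// binary search: first range whose exclusive max is above the query min
	i := sort.Search(len(ranges), func(i int) bool { return ranges[i].max > min })
	var out []pkRange
	// linear scan while ranges still begin before the query max
	for ; i < len(ranges) && ranges[i].min < max; i++ {
		out = append(out, ranges[i])
	}
	return out
}

func main() {
	ranges := []pkRange{{"", "55"}, {"55", "AA"}, {"AA", "FF"}}
	// a query straddling the first boundary returns both touched ranges
	fmt.Println(getOverlappingRanges(ranges, "50", "60"))
}
```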
Hierarchical Partition Key (HPK) Routing
- Length-aware comparison (CompareEPK): handles mixed-length EPK boundaries from HPK containers, where the service returns 32-char partial and 64-char fully-specified boundaries
- Prefix EPK range computation (computeEPKRange): for partial MultiHash keys (fewer components than paths), computes [prefix_epk, prefix_epk + "FF")
- groupItemsByPhysicalRange now uses getOverlappingRanges to find all physical partitions overlapping a prefix EPK range, distributing items across multiple partitions
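The prefix range computation stated above, [prefix_epk, prefix_epk + "FF"), can be sketched as follows. hashComponent is a placeholder hash purely for illustration; the real SDK hashes each MultiHash component with Murmur3 and emits 32 hex chars per component.

```go
package main

import "fmt"

// hashComponent is a toy stand-in for the per-component V2 hash.
// It always emits 8 hex chars so the prefix length is deterministic here
// (the real SDK emits 32 hex chars per component).
func hashComponent(s string) string {
	return fmt.Sprintf("%08x", uint32(len(s))*2654435761)
}

// computeEPKRange returns the half-open [min, max) EPK range covering every
// document whose hierarchical key starts with the given prefix components:
// the concatenated component hashes, capped by the "FF" sentinel.
func computeEPKRange(components []string) (min, max string) {
	prefix := ""
	for _, c := range components {
		prefix += hashComponent(c)
	}
	return prefix, prefix + "FF"
}

func main() {
	min, max := computeEPKRange([]string{"tenantA"})
	fmt.Println(min)
	fmt.Println(max)
	fmt.Println(max == min+"FF")
}
```

Routing then fans this range out with an overlap query, so items under one tenant prefix can land on several physical partitions.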
Bug Fix: V2 EPK Top-Bit Masking
Fixed a pre-existing bug in the Go SDK where the top 2 bits of the first EPK byte were not masked (hash[0] &= 0x3F). This caused items whose V2 hash started with a byte >= 0x40 to fail routing in ReadMany, because the EPK lexicographically exceeded the "FF" range sentinel. All other SDKs (Python, .NET, Java, Rust) apply this masking.
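The masking itself is a one-liner; the sketch below just makes the effect concrete (maskTopBits is an illustrative helper name, not the SDK's): after the mask, the first hash byte is at most 0x3F, matching the other Cosmos SDKs.

```go
package main

import "fmt"

// maskTopBits clears the top 2 bits of the first hash byte (hash[0] &= 0x3F),
// the fix described above, so the resulting routing EPK always starts at or
// below 0x3F.
func maskTopBits(hash []byte) []byte {
	out := append([]byte(nil), hash...)
	out[0] &= 0x3F // clear the top 2 bits of the first byte only
	return out
}

func main() {
	raw := []byte{0xC7, 0x12} // first byte >= 0x40: previously misrouted in ReadMany
	fmt.Printf("%X\n", maskTopBits(raw)) // prints "0712"
}
```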
Behavioral Changes
- getPartitionKeyRanges() now routes through the PK range cache (with a nil fallback for backward compatibility)
- executeReadManyWithQueries() uses the container properties cache instead of calling container.Read() every time
Testing
Covers tryCombine split merging, completeness validation, 410/Gone error detection, HPK prefix fan-out, length-aware EPK comparison, overlapping range queries, and the dual-index container cache.