Change Feed Processor: Fixes duplicate lease docs by using id as partition-key value #5799
Closed
NaluTripician wants to merge 1 commit into
Member
|
@sdkReviewAgent |
xinlian12
reviewed
Apr 22, 2026
Member
|
✅ Review complete (38:33) Posted 3 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
…ition-key value
When a Cosmos lease container is partitioned by /partitionKey, CreateLeaseIfNotExistAsync previously generated a fresh Guid for the partition-key value on every call. Because Cosmos's per-partition-key id-uniqueness check only catches duplicates with the same partition key, any retry or concurrent split-handler invocation rolled a new Guid and silently persisted a second document with identical id/LeaseToken but a different partitionKey. Once duplicates existed, EqualPartitionsBalancingStrategy.CategorizeLeases threw ArgumentException on every balance tick and the feed stalled indefinitely.
This change sets the partition-key value to the lease document id (deterministic per LeaseToken) in both overloads of DocumentServiceLeaseManagerCosmos.CreateLeaseIfNotExistAsync. Concurrent/retry creates now collide on the same (id, partitionKey) tuple and Cosmos returns a real 409 Conflict, preventing the duplicate at the source.
Pre-existing lease documents with Guid-based partitionKey values remain fully readable: the stored value is round-tripped through lease.PartitionKey.
Closes IcM 768856224.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
|
Duplicate #5807 |
Change Feed Processor: Fixes duplicate lease docs by using id as partition-key value
Closes IcM 768856224.
The bug
What the customer saw
The customer's Change Feed Processor (CFP) silently stopped processing changes. Restarting hosts did not help. Their lease container contained documents like:
id | LeaseToken | partitionKey | Owner
MyProcessorhost1_abc_xyz..0 | 0 | d4e2… (Guid) | null
MyProcessorhost1_abc_xyz..0 | 0 | 7a9f… (different Guid) | null
Every lease had a duplicate entry: same id, same LeaseToken, different Guid partitionKey, Owner: null, timestamps frozen at the moment the processor died.
Root cause
The CFP lease container was partitioned by /partitionKey. On a partition split, PartitionSynchronizerCore.HandlePartitionGoneAsync delegates child-lease creation to DocumentServiceLeaseManagerCosmos.CreateLeaseIfNotExistAsync, which generated the partition-key value with Guid.NewGuid().ToString().
TryCreateItemAsync relies on Cosmos's per-partition-key id uniqueness check to turn concurrent creates into a 409 Conflict. That check only catches duplicates with the same partition key. Every retry of split handling (host restart mid-split, transient error retry, or two hosts racing on the same parent lease) rolled a new random Guid, bypassing the uniqueness check and silently persisting a second document with identical id/LeaseToken but a different partitionKey.
Once duplicates exist, every balance tick (~13 s) lands in EqualPartitionsBalancingStrategy.CategorizeLeases, which throws an ArgumentException.
The ArgumentException propagates out of CalculateLeasesToTake, the balancer catches it, logs, and returns zero leases on every tick, indefinitely. No host ever claims a lease; Owner stays null; the feed permanently stalls. There is no automatic recovery: the duplicate documents must be deleted manually.
The fix
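As a rough illustration of why the choice of partition-key value matters here, the Python sketch below models a container that enforces id uniqueness only within a partition key, then replays the same lease creation several times. ToyContainer, ConflictError, create_lease_if_not_exists and simulate are hypothetical names invented for this illustration; none of this is SDK code, and the real service and TryCreateItemAsync are considerably more involved.

```python
import uuid


class ConflictError(Exception):
    """Stands in for a 409 Conflict response (hypothetical, not an SDK type)."""


class ToyContainer:
    """In-memory stand-in for a container that enforces id uniqueness per partition key."""

    def __init__(self):
        self.docs = {}  # (partition_key, id) -> document

    def create(self, doc):
        key = (doc["partitionKey"], doc["id"])
        if key in self.docs:
            # Only an identical (partition key, id) pair is rejected.
            raise ConflictError(key)
        self.docs[key] = doc


def create_lease_if_not_exists(container, lease_id, partition_key):
    """Return True if the lease was created, False if it already existed."""
    try:
        container.create({"id": lease_id, "partitionKey": partition_key, "Owner": None})
        return True
    except ConflictError:
        return False


def simulate(retries, make_partition_key):
    """Replay the same lease creation `retries` times; return how many documents were stored."""
    container = ToyContainer()
    lease_id = "MyProcessorhost1_abc_xyz..0"
    for _ in range(retries):
        create_lease_if_not_exists(container, lease_id, make_partition_key(lease_id))
    return len(container.docs)


# Random value per attempt, as before the change.
random_pk_docs = simulate(3, lambda lease_id: str(uuid.uuid4()))
# Lease id reused as the partition-key value, as after the change.
deterministic_pk_docs = simulate(3, lambda lease_id: lease_id)
print(random_pk_docs, deterministic_pk_docs)
```

In this model, a fresh random value per attempt puts every retry in a new logical partition, so all three creates are accepted. Reusing the id as the partition-key value makes the second and third attempts conflict, and only one document is stored.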
Set the partition-key value on each new lease document to the lease's own deterministic id (already computed as this.GetDocumentId(leaseToken)), in both overloads of DocumentServiceLeaseManagerCosmos.CreateLeaseIfNotExistAsync.
With this change, concurrent or retried creates of the same lease resolve to the same (id, partitionKey) tuple, so Cosmos's per-partition-key id-uniqueness check fires normally and returns 409 Conflict. TryCreateItemAsync turns that into a benign "already exists" outcome, so the duplicate is prevented at the source. This is the pattern already used by DocumentServiceLeaseStoreCosmos for its marker and lock documents (which use markerDocId / lockId as both id and pk).
Backward compatibility
Fully compatible with existing lease containers. The read path (TryGetLeaseAsync, ReleaseAsync, UpdateLeaseAsync, CheckpointAsync, DeleteAsync) pulls the partition-key value off the deserialized lease document via requestOptionsFactory.GetPartitionKey(lease.Id, lease.PartitionKey), which for /partitionKey-partitioned containers returns new PartitionKey(partitionKey), i.e. whatever was stored. Pre-existing lease documents that have Guid-based partitionKey values continue to load, refresh, and be released normally. New lease documents created after this change are written with partitionKey == id; once all parent leases have been split / replaced, the container naturally converges to the new scheme with no manual migration.
Affected containers
Only /partitionKey-partitioned lease containers were exposed to the bug. Lease containers partitioned by /id or on single-partition (fixed) collections never exercised the affected code path (AddPartitionKeyIfNeeded is a no-op on those request-options factories; see PartitionedByIdCollectionRequestOptionsFactory and SinglePartitionRequestOptionsFactory) and require no action.
Testing
Unit tests
Extended the existing ValidateRequestOptionsFactory helper used by DocumentServiceLeaseManagerCosmosTests.CreatesEPKBasedLease and CreatesPartitionKeyBasedLease (both [DataTestMethod] with rows for each of the three RequestOptionsFactory implementations). For the PartitionedByPartitionKeyCollectionRequestOptionsFactory case the helper now asserts lease.PartitionKey == lease.Id, proving the new deterministic wiring for both the PK-range overload and the EPK overload.
Full ChangeFeed unit-test slice: 261 / 261 pass.
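The shape of that assertion can be sketched as a toy model in Python. This is not the MSTest code from the PR: create_lease and validate_lease are hypothetical helpers, the three partitioning labels stand in for the three RequestOptionsFactory implementations, and the assumption that a lease carries no partition-key value under the other two schemes is an inference from the no-op behaviour of AddPartitionKeyIfNeeded described above.

```python
def create_lease(lease_token, partitioning, prefix="MyProcessorhost1_abc_xyz.."):
    """Build a toy lease document; only '/partitionKey' containers get a partition-key value."""
    lease_id = prefix + lease_token
    lease = {"id": lease_id, "LeaseToken": lease_token, "partitionKey": None}
    if partitioning == "/partitionKey":
        # Deterministic wiring: the partition-key value is the lease id itself.
        lease["partitionKey"] = lease_id
    return lease


def validate_lease(lease, partitioning):
    """Assert the expected partition-key wiring for the given partitioning scheme."""
    if partitioning == "/partitionKey":
        assert lease["partitionKey"] == lease["id"]
    else:
        # Assumed for this sketch: other schemes leave the value unset.
        assert lease["partitionKey"] is None


# One row per partitioning scheme, mirroring a data-driven test.
for scheme in ("/partitionKey", "/id", "single-partition"):
    validate_lease(create_lease("0", scheme), scheme)

# Repeated creation of the same lease yields the same (id, partitionKey) pair.
first = create_lease("0", "/partitionKey")
second = create_lease("0", "/partitionKey")
same_tuple = (first["id"], first["partitionKey"]) == (second["id"], second["partitionKey"])
```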
Manual verification
dotnet build Microsoft.Azure.Cosmos.sln -c Debug: 0 warnings, 0 errors.
Performance impact
None. The change replaces one Guid.NewGuid().ToString() call with a variable reference (leaseDocId, already computed one line earlier in the same method). No extra IO.
Customer remediation (for customers already in this state)
Delete the duplicate lease documents (same id, different partitionKey) from the lease container, then restart the processors. A one-time query over the lease container is sufficient to find them.
Group by id; for each group with more than one document, keep one and delete the rest (using the matching partitionKey as the partition-key value on the delete).
Once this fix rolls out, new child leases created during a split/merge are immune to the duplicate-creation race: concurrent or retried creates for the same lease collide on (id, partitionKey) and Cosmos enforces uniqueness.
Type of change
Checklist
ValidateRequestOptionsFactory to assert deterministic pk == id on /partitionKey-partitioned containers)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
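For the customer remediation steps described earlier, the group-by-id logic can be sketched in Python over lease documents that have already been read from the container. plan_duplicate_deletes is a hypothetical helper and the sample documents are made up. Keeping the first document of each group is an arbitrary choice for this sketch; the description only says to keep one.

```python
from collections import defaultdict


def plan_duplicate_deletes(lease_docs):
    """Group lease documents by id and list every document after the first for deletion.

    Returns (id, partitionKey) pairs, because each delete has to target the
    partition-key value stored on the duplicate itself.
    """
    by_id = defaultdict(list)
    for doc in lease_docs:
        by_id[doc["id"]].append(doc)

    to_delete = []
    for docs in by_id.values():
        for extra in docs[1:]:  # groups of size one contribute nothing
            to_delete.append((extra["id"], extra["partitionKey"]))
    return to_delete


# Made-up sample: lease "..0" is duplicated, lease "..1" is not.
docs = [
    {"id": "MyProcessorhost1_abc_xyz..0", "partitionKey": "pk-a"},
    {"id": "MyProcessorhost1_abc_xyz..0", "partitionKey": "pk-b"},
    {"id": "MyProcessorhost1_abc_xyz..1", "partitionKey": "pk-c"},
]
deletes = plan_duplicate_deletes(docs)
print(deletes)
```

The processors should be stopped while such a cleanup runs and restarted afterwards, in line with the remediation steps.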