Closed
Conversation
* Assemblyinfo clean-up (#104) * Signed build assembly info fix * Assembly info clean-up * Including test projects internal visible to (#107) * Stand By Feed and ChangeFeed pull model (#105) * Adding initial files * Using Etag for continuation * Removing unused * Refactoring to reduce variables * Refactoring to use CompositeToken * Adding feed test * Refactor through Options * Adding public methods and comments * Routing through the point transport handler * Moving to outer if * Adding split logic * Adding unit tests * Adding logic to detect invalid continuation tokens * Adding JSON validation * Routing based on PKRangeId * Renaming and adding more tests * Moving logic into the token * Forcing refresh on split * Addressing final coments * Addressing feedback * Added test to cover CT passing * Refactoring and adding pkrangedelegate * Argument checks * Moving contract to CosmosRequestMessage * Refactoring make EnsureInitialized async * Moving tests to a new file * Adding PKrange assert * Refactored back to parameters outside Options * UT split * Adding Start* checks * Adding new tests and renames * Addressing comments * Refactoring for cache tests * Adding comments to tests * Adding factory method * Addressing comments * Refactoring PKRange outside options * Addressing comments * Removing StartFromBeginning * Removing extra lines * Removing unnecessary ToList * Raising version to 22-preview (#113)
kirankumarkolli
added a commit
that referenced
this pull request
Apr 23, 2026
…titioned lease containers (#5807) **Research report (gist):** https://gist.github.com/kirankumarkolli/8bf1cbb9c3ee45dfa64511c4a0cb35ad ## Problem statement In V3 SDK 3.20.0+, when the lease container is partitioned on `/partitionKey` (used for Gremlin/Graph API accounts and any customer-chosen layout), `DocumentServiceLeaseManagerCosmos.CreateLeaseIfNotExistAsync` sets the new lease's partition-key **value** to `Guid.NewGuid().ToString()`. Because Cosmos DB's `id` uniqueness is scoped per logical partition, two racing hosts that compute the same deterministic lease `id` land in different logical partitions and both succeed with `201 Created` instead of the `409 Conflict` the dedup logic relies on. **Root cause:** V3 PR #2491 (2021-05-27) and its V2 predecessor PR #158 silently violated the original V2 PR #105 (2018) invariant that *"lease collection PK must be a function of monitored collection PK Range. We already have that function as lease id."* ## Fix (minimal, env-gated with safe default) ### New env var | Variable | Default | Purpose | |---|---|---| | `AZURE_COSMOS_CHANGE_FEED_LEASE_ID_AS_PARTITION_KEY_ENABLED` | `true` | When `true`, lease docs in a `/partitionKey`-partitioned lease container use the deterministic lease `id` as the partition-key value (restores the 409-based dedup invariant). When `false`, falls back to legacy `Guid.NewGuid()` behavior. | ### Code changes **`ConfigurationManager.cs`** — New env-var constant and accessor `IsChangeFeedLeaseIdAsPartitionKeyEnabled()`, mirroring the established `AZURE_COSMOS_*_ENABLED` pattern. **`DocumentServiceLeaseManagerCosmos.cs`** — New `useDeterministicPartitionKey` field snapshotted in the constructor (ensures consistency for the manager's lifetime). New `GetLeasePartitionKeyValue(leaseId)` helper used at both `CreateLeaseIfNotExistAsync` call sites, replacing the `Guid.NewGuid().ToString()` calls. ### Why this is the right fix | Property | Status after fix | |---|---| | PK path | Unchanged (`/partitionKey`) — Gremlin still works | | Public API surface | Unchanged | | Default behavior | **New deterministic PK** (fix is on by default) | | Operational escape hatch | `AZURE_COSMOS_CHANGE_FEED_LEASE_ID_AS_PARTITION_KEY_ENABLED=false` reverts per-process | | Dedup invariant (PR #105) | **Restored** for all three lease-container layouts | | Back-compat read path | Unchanged — `TryGetLeaseAsync` uses stored PK value | ### Not required (explicitly) - No schema migration — old GUID-PK leases coexist with new `id`-PK leases; both read/update paths route through the stored `PartitionKey` field. - No change to `PartitionedByPartitionKeyCollectionRequestOptionsFactory` (already routes by stored PK value). ## Risks and mitigations | # | Risk | Likelihood | Mitigation | |---|---|---|---| | 1 | Existing customers carry a mix of GUID-PK (old) and `id`-PK (new) leases | Certain | Safe by design — `TryGetLeaseAsync` reads stored `lease.PartitionKey`; old leases remain readable/updatable | | 2 | Duplicate leases already written under old code persist | Certain | Out of scope — not self-clearing; manual cleanup needed | | 3 | Mixed-fleet window during rolling upgrade | Certain during rollout | Dedup fully restored only after all hosts upgraded. Not a new failure mode. | | 4 | Hot-partition risk | Very low | Each lease has a different PK value (per-leaseToken), so distribution is similar to `/id` layout | ### Upgrade semantics — mixed-fleet behavior During rolling upgrade (old + new SDK hosts coexist): - Read/update/delete/checkpoint of **existing** leases: works on both sides (routes through stored `lease.PartitionKey`). - Concurrent **create** of the same missing lease: still produces duplicates if one host is old SDK (GUID PK) and another is new (id PK). **Dedup is fully restored only after rollout completes.** ### Known limitation — pre-existing duplicates not auto-cleaned Duplicate leases written by older SDK versions persist until the corresponding PKRange undergoes a split/merge. Manual cleanup (`SELECT * FROM c WHERE c.id = @id` + `DELETE` extras) is needed for existing duplicates. ## Test coverage ### Unit tests — `DocumentServiceLeaseManagerCosmosTests.cs` | Test | What it proves | |---|---| | `CreateLeaseIfNotExistAsync_PartitionKeyBehavior` (4 DataRows) | Parameterized across PKRange/EPK × default/opt-out: captures the `PartitionKey` sent in the create request and asserts deterministic (PK = lease.Id) or legacy (PK = GUID ≠ lease.Id) behavior | | `CreateLeaseIfNotExistAsync_DuplicatePkId_Returns409` (2 DataRows) | Simulates Cosmos DB's (PK, id) uniqueness constraint via `ConcurrentDictionary` — first create succeeds, second with same (PK, id) gets 409 | | `AcquireCompletes_WithPreExistingGuidPartitionKey` | Back-compat: lease with old random-GUID PK can still be acquired (stored PK preserved) | | `RenewCompletes_WithPreExistingGuidPartitionKey` | Back-compat: lease with old random-GUID PK can still be renewed (stored PK preserved) | | Existing `CreatesEPKBasedLease` / `CreatesPartitionKeyBasedLease` | **Strengthened:** added `Assert.AreEqual(lease.Id, lease.PartitionKey)` for `/partitionKey` factory | Infrastructure: `[DoNotParallelize]` + `[TestInitialize]`/`[TestCleanup]` for env var safety. Shared helpers eliminate duplication across overload types. ### Emulator integration test — `GremlinSmokeTests.cs` | Test | What changed | |---|---| | `Schema_DefaultsToHavingPartitionKey` | Converted to `[DataTestMethod]` with two rows: **(a)** default env → asserts `partitionKey == lease.id` (dedup invariant), **(b)** env var `false` → asserts `partitionKey` is a GUID ≠ `lease.id` (legacy behavior) | Infrastructure: `[DoNotParallelize]` added (env var mutation is process-global). --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: kirankumarkolli <6880899+kirankumarkolli@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
3.0.0.10 release