Skip to content

3.0.0.10 release#158

Closed
kirankumarkolli wants to merge 2 commits intomasterfrom
release
Closed

3.0.0.10 release#158
kirankumarkolli wants to merge 2 commits intomasterfrom
release

Conversation

@kirankumarkolli
Copy link
Copy Markdown
Member

3.0.0.10 release

* Assemblyinfo clean-up (#104)

* Signed build assembly info fix

* Assembly info clean-up

* Including test projects internal visible to (#107)

* Stand By Feed and ChangeFeed pull model (#105)

* Adding initial files

* Using Etag for continuation

* Removing unused

* Refactoring to reduce variables

* Refactoring to use CompositeToken

* Adding feed test

* Refactor through Options

* Adding public methods and comments

* Routing through the point transport handler

* Moving to outer if

* Adding split logic

* Adding unit tests

* Adding logic to detect invalid continuation tokens

* Adding JSON validation

* Routing based on PKRangeId

* Renaming and adding more tests

* Moving logic into the token

* Forcing refresh on split

* Addressing final coments

* Addressing feedback

* Added test to cover CT passing

* Refactoring and adding pkrangedelegate

* Argument checks

* Moving contract to CosmosRequestMessage

* Refactoring make EnsureInitialized async

* Moving tests to a new file

* Adding PKrange assert

* Refactored back to parameters outside Options

* UT split

* Adding Start* checks

* Adding new tests and renames

* Addressing comments

* Refactoring for cache tests

* Adding comments to tests

* Adding factory method

* Addressing comments

* Refactoring PKRange outside options

* Addressing comments

* Removing StartFromBeginning

* Removing extra lines

* Removing unnecessary ToList

* Raising version to 22-preview (#113)
kirankumarkolli added a commit that referenced this pull request Apr 23, 2026
…titioned lease containers (#5807)

**Research report (gist):**
https://gist.github.com/kirankumarkolli/8bf1cbb9c3ee45dfa64511c4a0cb35ad

## Problem statement

In V3 SDK 3.20.0+, when the lease container is partitioned on
`/partitionKey` (used for Gremlin/Graph API accounts and any
customer-chosen layout),
`DocumentServiceLeaseManagerCosmos.CreateLeaseIfNotExistAsync` sets the
new lease's partition-key **value** to `Guid.NewGuid().ToString()`.
Because Cosmos DB's `id` uniqueness is scoped per logical partition, two
racing hosts that compute the same deterministic lease `id` land in
different logical partitions and both succeed with `201 Created` instead
of the `409 Conflict` the dedup logic relies on.

**Root cause:** V3 PR #2491 (2021-05-27) and its V2 predecessor PR #158
silently violated the original V2 PR #105 (2018) invariant that *"lease
collection PK must be a function of monitored collection PK Range. We
already have that function as lease id."*

## Fix (minimal, env-gated with safe default)

### New env var

| Variable | Default | Purpose |
|---|---|---|
| `AZURE_COSMOS_CHANGE_FEED_LEASE_ID_AS_PARTITION_KEY_ENABLED` | `true`
| When `true`, lease docs in a `/partitionKey`-partitioned lease
container use the deterministic lease `id` as the partition-key value
(restores the 409-based dedup invariant). When `false`, falls back to
legacy `Guid.NewGuid()` behavior. |

### Code changes

**`ConfigurationManager.cs`** — New env-var constant and accessor
`IsChangeFeedLeaseIdAsPartitionKeyEnabled()`, mirroring the established
`AZURE_COSMOS_*_ENABLED` pattern.

**`DocumentServiceLeaseManagerCosmos.cs`** — New
`useDeterministicPartitionKey` field snapshotted in the constructor
(ensures consistency for the manager's lifetime). New
`GetLeasePartitionKeyValue(leaseId)` helper used at both
`CreateLeaseIfNotExistAsync` call sites, replacing the
`Guid.NewGuid().ToString()` calls.

### Why this is the right fix

| Property | Status after fix |
|---|---|
| PK path | Unchanged (`/partitionKey`) — Gremlin still works |
| Public API surface | Unchanged |
| Default behavior | **New deterministic PK** (fix is on by default) |
| Operational escape hatch |
`AZURE_COSMOS_CHANGE_FEED_LEASE_ID_AS_PARTITION_KEY_ENABLED=false`
reverts per-process |
| Dedup invariant (PR #105) | **Restored** for all three lease-container
layouts |
| Back-compat read path | Unchanged — `TryGetLeaseAsync` uses stored PK
value |

### Not required (explicitly)

- No schema migration — old GUID-PK leases coexist with new `id`-PK
leases; both read/update paths route through the stored `PartitionKey`
field.
- No change to
`PartitionedByPartitionKeyCollectionRequestOptionsFactory` (already
routes by stored PK value).

## Risks and mitigations

| # | Risk | Likelihood | Mitigation |
|---|---|---|---|
| 1 | Existing customers carry a mix of GUID-PK (old) and `id`-PK (new)
leases | Certain | Safe by design — `TryGetLeaseAsync` reads stored
`lease.PartitionKey`; old leases remain readable/updatable |
| 2 | Duplicate leases already written under old code persist | Certain
| Out of scope — not self-clearing; manual cleanup needed |
| 3 | Mixed-fleet window during rolling upgrade | Certain during rollout
| Dedup fully restored only after all hosts upgraded. Not a new failure
mode. |
| 4 | Hot-partition risk | Very low | Each lease has a different PK
value (per-leaseToken), so distribution is similar to `/id` layout |

### Upgrade semantics — mixed-fleet behavior

During rolling upgrade (old + new SDK hosts coexist):
- Read/update/delete/checkpoint of **existing** leases: works on both
sides (routes through stored `lease.PartitionKey`).
- Concurrent **create** of the same missing lease: still produces
duplicates if one host is old SDK (GUID PK) and another is new (id PK).
**Dedup is fully restored only after rollout completes.**

### Known limitation — pre-existing duplicates not auto-cleaned

Duplicate leases written by older SDK versions persist until the
corresponding PKRange undergoes a split/merge. Manual cleanup (`SELECT *
FROM c WHERE c.id = @id` + `DELETE` extras) is needed for existing
duplicates.

## Test coverage

### Unit tests — `DocumentServiceLeaseManagerCosmosTests.cs`

| Test | What it proves |
|---|---|
| `CreateLeaseIfNotExistAsync_PartitionKeyBehavior` (4 DataRows) |
Parameterized across PKRange/EPK × default/opt-out: captures the
`PartitionKey` sent in the create request and asserts deterministic (PK
= lease.Id) or legacy (PK = GUID ≠ lease.Id) behavior |
| `CreateLeaseIfNotExistAsync_DuplicatePkId_Returns409` (2 DataRows) |
Simulates Cosmos DB's (PK, id) uniqueness constraint via
`ConcurrentDictionary` — first create succeeds, second with same (PK,
id) gets 409 |
| `AcquireCompletes_WithPreExistingGuidPartitionKey` | Back-compat:
lease with old random-GUID PK can still be acquired (stored PK
preserved) |
| `RenewCompletes_WithPreExistingGuidPartitionKey` | Back-compat: lease
with old random-GUID PK can still be renewed (stored PK preserved) |
| Existing `CreatesEPKBasedLease` / `CreatesPartitionKeyBasedLease` |
**Strengthened:** added `Assert.AreEqual(lease.Id, lease.PartitionKey)`
for `/partitionKey` factory |

Infrastructure: `[DoNotParallelize]` +
`[TestInitialize]`/`[TestCleanup]` for env var safety. Shared helpers
eliminate duplication across overload types.

### Emulator integration test — `GremlinSmokeTests.cs`

| Test | What changed |
|---|---|
| `Schema_DefaultsToHavingPartitionKey` | Converted to
`[DataTestMethod]` with two rows: **(a)** default env → asserts
`partitionKey == lease.id` (dedup invariant), **(b)** env var `false` →
asserts `partitionKey` is a GUID ≠ `lease.id` (legacy behavior) |

Infrastructure: `[DoNotParallelize]` added (env var mutation is
process-global).

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: kirankumarkolli <6880899+kirankumarkolli@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant