ChangeFeedProcessor: Fixes lease de-duplication for /partitionKey-partitioned lease containers#5807
Conversation
…titioned lease containers Draft branch opened for plan review. No code changes yet; see PR description for the full fix plan. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: 1 pipeline(s) were filtered out due to trigger conditions. |
|
@sdkReviewAgent |
|
@copilot follow issue agent fix instrucitons and implement it. |
…havior Agent-Logs-Url: https://github.com/Azure/azure-cosmos-dotnet-v3/sessions/2ecc1c73-658a-4358-ad20-9ecc87fb25fe Co-authored-by: kirankumarkolli <6880899+kirankumarkolli@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
…erloads Agent-Logs-Url: https://github.com/Azure/azure-cosmos-dotnet-v3/sessions/2ecc1c73-658a-4358-ad20-9ecc87fb25fe Co-authored-by: kirankumarkolli <6880899+kirankumarkolli@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
Agent-Logs-Url: https://github.com/Azure/azure-cosmos-dotnet-v3/sessions/2ecc1c73-658a-4358-ad20-9ecc87fb25fe Co-authored-by: kirankumarkolli <6880899+kirankumarkolli@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
Agent-Logs-Url: https://github.com/Azure/azure-cosmos-dotnet-v3/sessions/2ecc1c73-658a-4358-ad20-9ecc87fb25fe Co-authored-by: kirankumarkolli <6880899+kirankumarkolli@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
Agent-Logs-Url: https://github.com/Azure/azure-cosmos-dotnet-v3/sessions/2ecc1c73-658a-4358-ad20-9ecc87fb25fe Co-authored-by: kirankumarkolli <6880899+kirankumarkolli@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
Implemented the fix and tests per the issue-agent plan, including deterministic lease-id partition-key behavior with env-var opt-out and expanded coverage for both lease-creation overloads and both env-var branches. Latest addressing commit: |
Plan Review ChangeFeed Lease De-duplication FixOverall assessment: The root-cause analysis is excellent and the proposed code fix is correct. The bug at lines 131 and 156 of However, the test plan has significant design issues that should be addressed before implementation begins. Finding 1: Proposed emulator test will not trigger the duplicate lease bugThe proposed emulator test starts two Only one host ever calls Suggested alternative: Test the invariant directly at the lease-manager level construct two Finding 2: No unit test proves the full 409 dedup chain with the new PK valueThe proposed tests verify that the partition key value is deterministic and equals the lease ID. This is necessary but not sufficient. No proposed test mocks a second concurrent create returning 409 to verify the dedup chain end-to-end. The existing tests at lines 68154 only exercise the happy path (200 OK). Suggested addition: Add a unit test where the mock returns OK on the first Finding 3: Missing back-compat test for pre-existing GUID-PK leasesThe plan claims old leases with GUID partition keys "remain readable/updatable" because all downstream operations use the stored Suggested addition: A unit test that creates a lease with a known GUID PK value (simulating a pre-fix lease), then verifies Finding 4: Strengthen existing
|
|
✅ Review complete (33:23) No new comments — existing review coverage is sufficient. Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
Minor readability/refactor nit on the two private string GetLeasePartitionKeyValue(string leaseId)
{
return this.useDeterministicPartitionKey
? leaseId
: Guid.NewGuid().ToString();
}Each call site then collapses to: this.requestOptionsFactory.AddPartitionKeyIfNeeded(
(string pk) => documentServiceLease.LeasePartitionKey = pk,
this.GetLeasePartitionKeyValue(documentServiceLease.LeaseId));Keeps the snapshotted-flag guarantee intact and gives a single source of truth for the "PK value = lease id when deterministic" invariant. Optional bigger refactor (probably out of scope for this PR): push the policy onto |
…t tests, Gremlin assertion - Extracts GetLeasePartitionKeyValue helper to deduplicate PK-value selection logic across both CreateLeaseIfNotExistAsync overloads (NaluTripician review) - Adds 409 Conflict dedup chain unit tests for both PKRange and EPK overloads (xinlian12 Finding 2) - Adds back-compat tests verifying pre-existing GUID-PK leases still work through Acquire and Renew paths (xinlian12 Finding 3) - Strengthens GremlinSmokeTests assertion to validate dedup invariant: lease.id == lease.partitionKey (xinlian12 Finding 4) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
…mlinSmokeTests - Updates 409 dedup tests to perform two calls: first succeeds (200), second returns 409 Conflict and null (proves end-to-end dedup chain) - Parameterizes GremlinSmokeTests.Schema_DefaultsToHavingPartitionKey with DataTestMethod to run both default (deterministic PK = lease id) and opt-out (legacy GUID PK) modes via env var Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
…sses Both DocumentServiceLeaseManagerCosmosTests and GremlinSmokeTests mutate the AZURE_COSMOS_CHANGE_FEED_LEASE_ID_AS_PARTITION_KEY_ENABLED environment variable, which is process-wide state. Without DoNotParallelize, concurrent test classes could overwrite each other's env var values, causing flaky failures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
Replaces order-dependent callCount-based mock with ConcurrentDictionary tracking (PartitionKey, id) composite keys. The mock now returns 409 only when the same (PK, id) pair is seen again, faithfully simulating Cosmos DB's per-logical-partition id uniqueness constraint regardless of call order. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
@sdkReviewAgent |
|
✅ Review complete (31:09) Posted 3 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
kushagraThapar
left a comment
There was a problem hiding this comment.
Overall looks good, requesting changes on missing changelog entry.
I am also curious, is there any way to fix the existing leases for existing customers? Maybe some guidance?
.NET changelog is included part of release not with PR. |
Agent-Logs-Url: https://github.com/Azure/azure-cosmos-dotnet-v3/sessions/33ea4a37-65e9-4440-9775-2a107d55919e Co-authored-by: kirankumarkolli <6880899+kirankumarkolli@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
Done — the two |
|
@copilot resolve the merge conflicts in this pull request |
Co-authored-by: kirankumarkolli <6880899+kirankumarkolli@users.noreply.github.com>
I resolved the merge conflict locally and finalized it in commit |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
## Release 3.59.0 ### Version Changes - ClientOfficialVersion: 3.58.0 → 3.59.0 - ClientPreviewVersion: 3.59.0 → 3.60.0 - ClientPreviewSuffixVersion: preview.0 → preview.0 ### Changelog (3.59.0 GA) #### Added - [5579](#5579) Change Feed Processor: Adds Lease container export support - [5709](#5709) Performance: Adds caching for URL-encoded AAD authorization signature - [5731](#5731) DNS dot-suffix: Adds TCP DNS dot-suffix for Direct mode to avoid Kubernetes ndots latency - [5755](#5755) Exceptionless: Adds enabling exception less 400 status code - [5756](#5756) Exceptionless: Adds enabling exception less 404/1002 status code - [5757](#5757) Exceptionless: Adds enabling exception less 403 - [5779](#5779) Direct: Adds Direct package version bump to 3.42.4 - [5786](#5786) Region Availability: Adds missing regions from Direct 3.42.4 - [5788](#5788) Socket Handler: Adds HTTP/2 PING keep-alive to detect broken connections in pool #### Fixed - [5553](#5553) NativeDLLs: Fixes Conditionally include win-x64 native DLLs based on RuntimeIdentifier - [5588](#5588) LINQ: Fixes memory leak from Expression.Compile() in all call sites - [5617](#5617) ChangeFeedProcessor: Fixes first-change skip during initial startup by anchoring StartTime - [5636](#5636) CosmosClientBuilder: Fixes self-referencing loop in GetSerializedConfiguration with STJ TypeInfoResolver - [5748](#5748) Routing: Fixes GetOverlappingRanges CPU overhead from repeated JSON deserialization - [5807](#5807) ChangeFeedProcessor: Fixes lease de-duplication for /partitionKey-partitioned lease containers ### Changelog (3.60.0-preview.0) #### Added - [5804](#5804) SemanticReranking: Adds Configurable Request Timeout #### Fixed - [5783](#5783) Container: Fixes SemanticRerankAsync TypeLoadException in derived classes ### API Contract Diff (GA) ```diff diff --git "a/Microsoft.Azure.Cosmos\\contracts\\API_3.58.0.txt" "b/Microsoft.Azure.Cosmos\\contracts\\API_3.59.0.txt" index 1b74a69..6fa9352 100644 --- "a/Microsoft.Azure.Cosmos\\contracts\\API_3.58.0.txt" +++ "b/Microsoft.Azure.Cosmos\\contracts\\API_3.59.0.txt" @@ -60,6 +60,7 @@ namespace Microsoft.Azure.Cosmos public ChangeFeedProcessor Build(); public ChangeFeedProcessorBuilder WithErrorNotification(Container.ChangeFeedMonitorErrorDelegate errorDelegate); public virtual ChangeFeedProcessorBuilder WithInMemoryLeaseContainer(); + public virtual ChangeFeedProcessorBuilder WithInMemoryLeaseContainer(MemoryStream leaseState); public ChangeFeedProcessorBuilder WithInstanceName(string instanceName); public ChangeFeedProcessorBuilder WithLeaseAcquireNotification(Container.ChangeFeedMonitorLeaseAcquireDelegate acquireDelegate); public ChangeFeedProcessorBuilder WithLeaseConfiguration(Nullable<TimeSpan> acquireInterval=default(Nullable<TimeSpan>), Nullable<TimeSpan> expirationInterval=default(Nullable<TimeSpan>), Nullable<TimeSpan> renewInterval=default(Nullable<TimeSpan>)); @@ -956,6 +957,7 @@ namespace Microsoft.Azure.Cosmos public const string NorwayWest = "Norway West"; public const string PolandCentral = "Poland Central"; public const string QatarCentral = "Qatar Central"; + public const string SaudiArabiaEast = "Saudi Arabia East"; public const string SingaporeCentral = "Singapore Central"; public const string SingaporeNorth = "Singapore North"; public const string SouthAfricaNorth = "South Africa North"; @@ -963,6 +965,7 @@ namespace Microsoft.Azure.Cosmos public const string SouthCentralUS = "South Central US"; public const string SouthCentralUS2 = "South Central US 2"; public const string SoutheastAsia = "Southeast Asia"; + public const string SoutheastAsia3 = "Southeast Asia 3"; public const string SoutheastUS = "Southeast US"; public const string SoutheastUS3 = "Southeast US 3"; public const string SoutheastUS5 = "Southeast US 5"; @@ -990,6 +993,7 @@ namespace Microsoft.Azure.Cosmos public const string USSecWest = "USSec West"; public const string USSecWestCentral = "USSec West Central"; public const string WestCentralUS = "West Central US"; + public const string WestCentralUSFRE = "West Central US FRE"; public const string WestEurope = "West Europe"; public const string WestIndia = "West India"; public const string WestUS = "West US"; ``` ### API Contract Diff (Preview) ```diff diff --git "a/Microsoft.Azure.Cosmos\\contracts\\API_3.59.0-preview.0.txt" "b/Microsoft.Azure.Cosmos\\contracts\\API_3.60.0-preview.0.txt" index 1ae52c0..58df10f 100644 --- "a/Microsoft.Azure.Cosmos\\contracts\\API_3.59.0-preview.0.txt" +++ "b/Microsoft.Azure.Cosmos\\contracts\\API_3.60.0-preview.0.txt" @@ -91,6 +91,7 @@ namespace Microsoft.Azure.Cosmos public ChangeFeedProcessor Build(); public ChangeFeedProcessorBuilder WithErrorNotification(Container.ChangeFeedMonitorErrorDelegate errorDelegate); public virtual ChangeFeedProcessorBuilder WithInMemoryLeaseContainer(); + public virtual ChangeFeedProcessorBuilder WithInMemoryLeaseContainer(MemoryStream leaseState); public ChangeFeedProcessorBuilder WithInstanceName(string instanceName); public ChangeFeedProcessorBuilder WithLeaseAcquireNotification(Container.ChangeFeedMonitorLeaseAcquireDelegate acquireDelegate); public ChangeFeedProcessorBuilder WithLeaseConfiguration(Nullable<TimeSpan> acquireInterval=default(Nullable<TimeSpan>), Nullable<TimeSpan> expirationInterval=default(Nullable<TimeSpan>), Nullable<TimeSpan> renewInterval=default(Nullable<TimeSpan>)); @@ -302,7 +303,7 @@ namespace Microsoft.Azure.Cosmos public abstract Task<ResponseMessage> ReplaceItemStreamAsync(Stream streamPayload, string id, PartitionKey partitionKey, ItemRequestOptions requestOptions=null, CancellationToken cancellationToken=default(CancellationToken)); public abstract Task<ThroughputResponse> ReplaceThroughputAsync(ThroughputProperties throughputProperties, RequestOptions requestOptions=null, CancellationToken cancellationToken=default(CancellationToken)); public abstract Task<ThroughputResponse> ReplaceThroughputAsync(int throughput, RequestOptions requestOptions=null, CancellationToken cancellationToken=default(CancellationToken)); - public abstract Task<SemanticRerankResult> SemanticRerankAsync(string rerankContext, IEnumerable<string> documents, IDictionary<string, object> options=null, CancellationToken cancellationToken=default(CancellationToken)); + public virtual Task<SemanticRerankResult> SemanticRerankAsync(string rerankContext, IEnumerable<string> documents, IDictionary<string, object> options=null, CancellationToken cancellationToken=default(CancellationToken)); public abstract Task<ItemResponse<T>> UpsertItemAsync<T>(T item, Nullable<PartitionKey> partitionKey=default(Nullable<PartitionKey>), ItemRequestOptions requestOptions=null, CancellationToken cancellationToken=default(CancellationToken)); public abstract Task<ResponseMessage> UpsertItemStreamAsync(Stream streamPayload, PartitionKey partitionKey, ItemRequestOptions requestOptions=null, CancellationToken cancellationToken=default(CancellationToken)); public delegate Task ChangeFeedHandlerWithManualCheckpoint<T>(ChangeFeedProcessorContext context, IReadOnlyCollection<T> changes, Func<Task> checkpointAsync, CancellationToken cancellationToken); @@ -407,6 +408,7 @@ namespace Microsoft.Azure.Cosmos public int GatewayModeMaxConnectionLimit { get; set; } public Func<HttpClient> HttpClientFactory { get; set; } public Nullable<TimeSpan> IdleTcpConnectionTimeout { get; set; } + public TimeSpan InferenceRequestTimeout { get; set; } public bool LimitToEndpoint { get; set; } public Nullable<int> MaxRequestsPerTcpConnection { get; set; } public Nullable<int> MaxRetryAttemptsOnRateLimitedRequests { get; set; } @@ -1092,6 +1094,7 @@ namespace Microsoft.Azure.Cosmos public const string NorwayWest = "Norway West"; public const string PolandCentral = "Poland Central"; public const string QatarCentral = "Qatar Central"; + public const string SaudiArabiaEast = "Saudi Arabia East"; public const string SingaporeCentral = "Singapore Central"; public const string SingaporeNorth = "Singapore North"; public const string SouthAfricaNorth = "South Africa North"; @@ -1099,6 +1102,7 @@ namespace Microsoft.Azure.Cosmos public const string SouthCentralUS = "South Central US"; public const string SouthCentralUS2 = "South Central US 2"; public const string SoutheastAsia = "Southeast Asia"; + public const string SoutheastAsia3 = "Southeast Asia 3"; public const string SoutheastUS = "Southeast US"; public const string SoutheastUS3 = "Southeast US 3"; public const string SoutheastUS5 = "Southeast US 5"; @@ -1126,6 +1130,7 @@ namespace Microsoft.Azure.Cosmos public const string USSecWest = "USSec West"; public const string USSecWestCentral = "USSec West Central"; public const string WestCentralUS = "West Central US"; + public const string WestCentralUSFRE = "West Central US FRE"; public const string WestEurope = "West Europe"; public const string WestIndia = "West India"; public const string WestUS = "West US"; @@ -1504,6 +1509,7 @@ namespace Microsoft.Azure.Cosmos.Fluent public CosmosClientBuilder WithEnableRemoteRegionPreferredForSessionRetry(bool enableRemoteRegionPreferredForSessionRetry); public CosmosClientBuilder WithFaultInjection(IFaultInjector faultInjector); public CosmosClientBuilder WithHttpClientFactory(Func<HttpClient> httpClientFactory); + public CosmosClientBuilder WithInferenceRequestTimeout(TimeSpan inferenceRequestTimeout); public CosmosClientBuilder WithLimitToEndpoint(bool limitToEndpoint); public CosmosClientBuilder WithPriorityLevel(PriorityLevel priorityLevel); public CosmosClientBuilder WithReadConsistencyStrategy(ReadConsistencyStrategy readConsistencyStrategy); ``` ### Checklist - [ ] Changelog entries reviewed by team - [ ] API contract diff reviewed by Kiran and Kirill - [ ] Preview APIs reviewed (email sent to azurecosmossdkdotnet@microsoft.com) - [ ] Kiran sign-off obtained Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Research report (gist): https://gist.github.com/kirankumarkolli/8bf1cbb9c3ee45dfa64511c4a0cb35ad
Problem statement
In V3 SDK 3.20.0+, when the lease container is partitioned on
/partitionKey(used for Gremlin/Graph API accounts and any customer-chosen layout),DocumentServiceLeaseManagerCosmos.CreateLeaseIfNotExistAsyncsets the new lease's partition-key value toGuid.NewGuid().ToString(). Because Cosmos DB'siduniqueness is scoped per logical partition, two racing hosts that compute the same deterministic leaseidland in different logical partitions and both succeed with201 Createdinstead of the409 Conflictthe dedup logic relies on.Root cause: V3 PR #2491 (2021-05-27) and its V2 predecessor PR #158 silently violated the original V2 PR #105 (2018) invariant that "lease collection PK must be a function of monitored collection PK Range. We already have that function as lease id."
Fix (minimal, env-gated with safe default)
New env var
AZURE_COSMOS_CHANGE_FEED_LEASE_ID_AS_PARTITION_KEY_ENABLEDtruetrue, lease docs in a/partitionKey-partitioned lease container use the deterministic leaseidas the partition-key value (restores the 409-based dedup invariant). Whenfalse, falls back to legacyGuid.NewGuid()behavior.Code changes
ConfigurationManager.cs— New env-var constant and accessorIsChangeFeedLeaseIdAsPartitionKeyEnabled(), mirroring the establishedAZURE_COSMOS_*_ENABLEDpattern.DocumentServiceLeaseManagerCosmos.cs— NewuseDeterministicPartitionKeyfield snapshotted in the constructor (ensures consistency for the manager's lifetime). NewGetLeasePartitionKeyValue(leaseId)helper used at bothCreateLeaseIfNotExistAsynccall sites, replacing theGuid.NewGuid().ToString()calls.Why this is the right fix
/partitionKey) — Gremlin still worksAZURE_COSMOS_CHANGE_FEED_LEASE_ID_AS_PARTITION_KEY_ENABLED=falsereverts per-processTryGetLeaseAsyncuses stored PK valueNot required (explicitly)
id-PK leases; both read/update paths route through the storedPartitionKeyfield.PartitionedByPartitionKeyCollectionRequestOptionsFactory(already routes by stored PK value).Risks and mitigations
id-PK (new) leasesTryGetLeaseAsyncreads storedlease.PartitionKey; old leases remain readable/updatable/idlayoutUpgrade semantics — mixed-fleet behavior
During rolling upgrade (old + new SDK hosts coexist):
lease.PartitionKey).Known limitation — pre-existing duplicates not auto-cleaned
Duplicate leases written by older SDK versions persist until the corresponding PKRange undergoes a split/merge. Manual cleanup (
SELECT * FROM c WHERE c.id = @id+DELETEextras) is needed for existing duplicates.Test coverage
Unit tests —
DocumentServiceLeaseManagerCosmosTests.csCreateLeaseIfNotExistAsync_PartitionKeyBehavior(4 DataRows)PartitionKeysent in the create request and asserts deterministic (PK = lease.Id) or legacy (PK = GUID ≠ lease.Id) behaviorCreateLeaseIfNotExistAsync_DuplicatePkId_Returns409(2 DataRows)ConcurrentDictionary— first create succeeds, second with same (PK, id) gets 409AcquireCompletes_WithPreExistingGuidPartitionKeyRenewCompletes_WithPreExistingGuidPartitionKeyCreatesEPKBasedLease/CreatesPartitionKeyBasedLeaseAssert.AreEqual(lease.Id, lease.PartitionKey)for/partitionKeyfactoryInfrastructure:
[DoNotParallelize]+[TestInitialize]/[TestCleanup]for env var safety. Shared helpers eliminate duplication across overload types.Emulator integration test —
GremlinSmokeTests.csSchema_DefaultsToHavingPartitionKey[DataTestMethod]with two rows: (a) default env → assertspartitionKey == lease.id(dedup invariant), (b) env varfalse→ assertspartitionKeyis a GUID ≠lease.id(legacy behavior)Infrastructure:
[DoNotParallelize]added (env var mutation is process-global).