CrossRegionHedgingAvailabilityStrategy: Fixes ArgumentNullException race condition in hedging cancellation#5613
Merged
Conversation
…gy NullRef fix Adds 8 regression tests covering: - Hedge CTS token cancels in-flight requests (not app CT) - Sender receives hedge CTS token, not application token - App cancellation prevents spawning new hedge requests (do/while fix) - Request not accessed after disposal on cancellation - Stream-based request path (ReadItemStreamAsync) - Primary request fast return skips hedging - All-transient error handling - Concurrent hedging stress test (50 parallel requests)
ArgumentNullException race condition in hedging cancellation
kundadebdatta
approved these changes
Feb 18, 2026
Contributor
|
@kundadebdatta I've opened a new pull request, #5625, to work on those changes. Once the pull request is ready, I'll request review from you. |
kirankumarkolli
approved these changes
Feb 20, 2026
NaluTripician
added a commit
that referenced
this pull request
Feb 20, 2026
… race condition in hedging cancellation (#5613) ## CrossRegionHedgingAvailabilityStrategy: Fixes `ArgumentNullException` race condition in hedging cancellation ### Bug Multiple production customers reported unobserved `ArgumentNullException: Value cannot be null. (Parameter 'request')` crashes originating from `CrossRegionHedgingAvailabilityStrategy.RequestSenderAndResultCheckAsync`. The exception was surfaced as a `TaskScheduler_UnobservedTaskException`, crashing the process. The affected code paths were: - `ContainerCore.ReadItemAsync` → `ReadItemStreamAsync` → `ProcessItemStreamAsync` → `RequestInvokerHandler.SendAsync` → `CrossRegionHedgingAvailabilityStrategy.ExecuteAvailabilityStrategyAsync` ### Root Cause A **race condition** caused by passing the wrong `CancellationToken` to the sender delegate: 1. `ExecuteAvailabilityStrategyAsync` creates a `hedgeRequestsCancellationTokenSource` (linked to the app-provided CT) to coordinate hedge request lifecycle. 2. `CloneAndSendAsync` clones the request inside a `using` block and calls `RequestSenderAndResultCheckAsync`. 3. **Bug:** `RequestSenderAndResultCheckAsync` called `sender.Invoke(request, cancellationToken)` with the **application-provided `CancellationToken`** — not `hedgeRequestsCancellationTokenSource.Token`. 4. When hedge Region B returned a final result (e.g., 200 OK), `hedgeRequestsCancellationTokenSource.Cancel()` was called. 5. **But the in-flight sender for Region A still held the app CT** (e.g., `CancellationToken.None`), which was **never cancelled**. 6. The `CloneAndSendAsync` `using` block exited, **disposing the cloned request**. 7. The Region A sender continued executing with a reference to the now-disposed request → **`ArgumentNullException: Value cannot be null. (Parameter 'request')`**. A secondary issue: when the application CT was cancelled (e2e timeout), the hedge timer (linked to app CT) would fire, and the old code would blindly continue the loop attempting to clone and send new requests on a cancelled path. ### Fix Two changes in `CrossRegionHedgingAvailabilityStrategy.cs`: **1. Pass `hedgeRequestsCancellationTokenSource.Token` to `sender.Invoke()` instead of the app CT** This ensures that when **any** hedge gets a final result and calls `hedgeRequestsCancellationTokenSource.Cancel()`, **all** in-flight senders immediately see their CT cancelled and stop before the cloned request is disposed. The `CancellationTokenSource` and `CancellationToken` parameters were also consolidated into a single `hedgeRequestsCancellationTokenSource` parameter passed through `CloneAndSendAsync` → `RequestSenderAndResultCheckAsync`. **2. Add `do/while` loop to handle spurious timer completions on app CT cancellation** When the app CT is cancelled (e2e timeout), the hedge timer fires via the linked CTS. The old code would `continue` the loop and try to clone a new request. The `do/while` loop now detects `applicationProvidedCancellationToken.IsCancellationRequested` and falls through to consolidate existing request outcomes instead of spawning new hedges. ### Tests Added (8 new unit tests) | Test | Validates | |---|---| | `HedgeCancellationCancelsInFlightRequests_NoNullRef` | Slow primary request's CT is cancelled when a hedge returns a final result — core regression test | | `SenderReceivesHedgeCancellationToken_NotAppToken` | Captures the actual CT passed to each sender and asserts all are from the hedge CTS, not the app CT | | `AppCancellationDuringHedging_DoesNotSpawnNewHedgeRequests` | E2e timeout (app CT cancelled) does not spawn new hedge requests — validates the do/while loop fix | | `MultiRegionHedging_RequestNotAccessedAfterDisposal` | Verifies the cloned request is still accessible when cancellation fires — exact scenario from the crash reports | | `HedgeCancellation_StreamRequest_NoNullRef` | Tests the stream-based code path (ReadItemStreamAsync) from the NullRef2/NullRef3 stack traces | | `PrimaryRequestFinalResult_NoAdditionalHedgesSent` | Fast primary response skips hedging entirely | | `AllHedgesTransientError_ReturnsLastResponse` | All regions return transient errors — strategy returns last response without NullRef | | `ConcurrentHedgingRequests_NoNullRef` | Stress test: 50 concurrent hedging requests with random delays — no NullRef under concurrency | ### Type of change - [x] Bug fix (non-breaking change which fixes an issue) --------- Co-authored-by: Nalu Tripician <27316859+NaluTripician@users.noreply.github.com> (cherry picked from commit 6f56e40)
This was referenced Apr 27, 2026
Open
Merged
Merged
Bump Microsoft.Azure.Cosmos from 3.58.0 to 3.59.0
azureossd/general-database-connectivity-samples#40
Merged
This was referenced May 4, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
CrossRegionHedgingAvailabilityStrategy: Fixes
ArgumentNullExceptionrace condition in hedging cancellationBug
Multiple production customers reported unobserved
ArgumentNullException: Value cannot be null. (Parameter 'request')crashes originating fromCrossRegionHedgingAvailabilityStrategy.RequestSenderAndResultCheckAsync. The exception was surfaced as aTaskScheduler_UnobservedTaskException, crashing the process. The affected code paths were:ContainerCore.ReadItemAsync→ReadItemStreamAsync→ProcessItemStreamAsync→RequestInvokerHandler.SendAsync→CrossRegionHedgingAvailabilityStrategy.ExecuteAvailabilityStrategyAsyncRoot Cause
A race condition caused by passing the wrong
CancellationTokento the sender delegate:ExecuteAvailabilityStrategyAsynccreates ahedgeRequestsCancellationTokenSource(linked to the app-provided CT) to coordinate hedge request lifecycle.CloneAndSendAsyncclones the request inside ausingblock and callsRequestSenderAndResultCheckAsync.RequestSenderAndResultCheckAsynccalledsender.Invoke(request, cancellationToken)with the application-providedCancellationToken— nothedgeRequestsCancellationTokenSource.Token.hedgeRequestsCancellationTokenSource.Cancel()was called.CancellationToken.None), which was never cancelled.CloneAndSendAsyncusingblock exited, disposing the cloned request.ArgumentNullException: Value cannot be null. (Parameter 'request').A secondary issue: when the application CT was cancelled (e2e timeout), the hedge timer (linked to app CT) would fire, and the old code would blindly continue the loop attempting to clone and send new requests on a cancelled path.
Fix
Two changes in
CrossRegionHedgingAvailabilityStrategy.cs:1. Pass
hedgeRequestsCancellationTokenSource.Tokentosender.Invoke()instead of the app CTThis ensures that when any hedge gets a final result and calls
hedgeRequestsCancellationTokenSource.Cancel(), all in-flight senders immediately see their CT cancelled and stop before the cloned request is disposed. TheCancellationTokenSourceandCancellationTokenparameters were also consolidated into a singlehedgeRequestsCancellationTokenSourceparameter passed throughCloneAndSendAsync→RequestSenderAndResultCheckAsync.2. Add
do/whileloop to handle spurious timer completions on app CT cancellationWhen the app CT is cancelled (e2e timeout), the hedge timer fires via the linked CTS. The old code would
continuethe loop and try to clone a new request. Thedo/whileloop now detectsapplicationProvidedCancellationToken.IsCancellationRequestedand falls through to consolidate existing request outcomes instead of spawning new hedges.Tests Added (8 new unit tests)
HedgeCancellationCancelsInFlightRequests_NoNullRefSenderReceivesHedgeCancellationToken_NotAppTokenAppCancellationDuringHedging_DoesNotSpawnNewHedgeRequestsMultiRegionHedging_RequestNotAccessedAfterDisposalHedgeCancellation_StreamRequest_NoNullRefPrimaryRequestFinalResult_NoAdditionalHedgesSentAllHedgesTransientError_ReturnsLastResponseConcurrentHedgingRequests_NoNullRefClosing issues
To automatically close an issue: closes #5623