Skip to content

CrossRegionHedgingAvailabilityStrategy: Fixes ArgumentNullException race condition in hedging cancellation#5613

Merged
kundadebdatta merged 8 commits into
masterfrom
users/fabianm/NullRefInvestigation
Feb 20, 2026
Merged

CrossRegionHedgingAvailabilityStrategy: Fixes ArgumentNullException race condition in hedging cancellation#5613
kundadebdatta merged 8 commits into
masterfrom
users/fabianm/NullRefInvestigation

Conversation

@FabianMeiswinkel
Copy link
Copy Markdown
Member

@FabianMeiswinkel FabianMeiswinkel commented Feb 11, 2026

CrossRegionHedgingAvailabilityStrategy: Fixes ArgumentNullException race condition in hedging cancellation

Bug

Multiple production customers reported unobserved ArgumentNullException: Value cannot be null. (Parameter 'request') crashes originating from CrossRegionHedgingAvailabilityStrategy.RequestSenderAndResultCheckAsync. The exception was surfaced as a TaskScheduler_UnobservedTaskException, crashing the process. The affected code paths were:

  • ContainerCore.ReadItemAsyncReadItemStreamAsyncProcessItemStreamAsyncRequestInvokerHandler.SendAsyncCrossRegionHedgingAvailabilityStrategy.ExecuteAvailabilityStrategyAsync

Root Cause

A race condition caused by passing the wrong CancellationToken to the sender delegate:

  1. ExecuteAvailabilityStrategyAsync creates a hedgeRequestsCancellationTokenSource (linked to the app-provided CT) to coordinate hedge request lifecycle.
  2. CloneAndSendAsync clones the request inside a using block and calls RequestSenderAndResultCheckAsync.
  3. Bug: RequestSenderAndResultCheckAsync called sender.Invoke(request, cancellationToken) with the application-provided CancellationToken — not hedgeRequestsCancellationTokenSource.Token.
  4. When hedge Region B returned a final result (e.g., 200 OK), hedgeRequestsCancellationTokenSource.Cancel() was called.
  5. But the in-flight sender for Region A still held the app CT (e.g., CancellationToken.None), which was never cancelled.
  6. The CloneAndSendAsync using block exited, disposing the cloned request.
  7. The Region A sender continued executing with a reference to the now-disposed request → ArgumentNullException: Value cannot be null. (Parameter 'request').

A secondary issue: when the application CT was cancelled (e2e timeout), the hedge timer (linked to app CT) would fire, and the old code would blindly continue the loop attempting to clone and send new requests on a cancelled path.

Fix

Two changes in CrossRegionHedgingAvailabilityStrategy.cs:

1. Pass hedgeRequestsCancellationTokenSource.Token to sender.Invoke() instead of the app CT

This ensures that when any hedge gets a final result and calls hedgeRequestsCancellationTokenSource.Cancel(), all in-flight senders immediately see their CT cancelled and stop before the cloned request is disposed. The CancellationTokenSource and CancellationToken parameters were also consolidated into a single hedgeRequestsCancellationTokenSource parameter passed through CloneAndSendAsyncRequestSenderAndResultCheckAsync.

2. Add do/while loop to handle spurious timer completions on app CT cancellation

When the app CT is cancelled (e2e timeout), the hedge timer fires via the linked CTS. The old code would continue the loop and try to clone a new request. The do/while loop now detects applicationProvidedCancellationToken.IsCancellationRequested and falls through to consolidate existing request outcomes instead of spawning new hedges.

Tests Added (8 new unit tests)

Test Validates
HedgeCancellationCancelsInFlightRequests_NoNullRef Slow primary request's CT is cancelled when a hedge returns a final result — core regression test
SenderReceivesHedgeCancellationToken_NotAppToken Captures the actual CT passed to each sender and asserts all are from the hedge CTS, not the app CT
AppCancellationDuringHedging_DoesNotSpawnNewHedgeRequests E2e timeout (app CT cancelled) does not spawn new hedge requests — validates the do/while loop fix
MultiRegionHedging_RequestNotAccessedAfterDisposal Verifies the cloned request is still accessible when cancellation fires — exact scenario from the crash reports
HedgeCancellation_StreamRequest_NoNullRef Tests the stream-based code path (ReadItemStreamAsync) from the NullRef2/NullRef3 stack traces
PrimaryRequestFinalResult_NoAdditionalHedgesSent Fast primary response skips hedging entirely
AllHedgesTransientError_ReturnsLastResponse All regions return transient errors — strategy returns last response without NullRef
ConcurrentHedgingRequests_NoNullRef Stress test: 50 concurrent hedging requests with random delays — no NullRef under concurrency

Closing issues

To automatically close an issue: closes #5623

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good!

@FabianMeiswinkel FabianMeiswinkel marked this pull request as draft February 11, 2026 11:27
…gy NullRef fix

Adds 8 regression tests covering:
- Hedge CTS token cancels in-flight requests (not app CT)
- Sender receives hedge CTS token, not application token
- App cancellation prevents spawning new hedge requests (do/while fix)
- Request not accessed after disposal on cancellation
- Stream-based request path (ReadItemStreamAsync)
- Primary request fast return skips hedging
- All-transient error handling
- Concurrent hedging stress test (50 parallel requests)
@NaluTripician NaluTripician changed the title [DRAFT-ONLY][ONLY-FOR-DISCUSSION]NullRef fix in CrossRegionHedgingAvailabilityStrategy Hedging: Fixes NullRef in CrossRegionHedgingAvailabilityStrategy Feb 14, 2026
@NaluTripician NaluTripician changed the title Hedging: Fixes NullRef in CrossRegionHedgingAvailabilityStrategy CrossRegionHedgingAvailabilityStrategy: Fixes ArgumentNullException race condition in hedging cancellation Feb 14, 2026
@NaluTripician NaluTripician marked this pull request as ready for review February 14, 2026 00:44
Copy link
Copy Markdown
Contributor

Copilot AI commented Feb 19, 2026

@kundadebdatta I've opened a new pull request, #5625, to work on those changes. Once the pull request is ready, I'll request review from you.

@kundadebdatta kundadebdatta merged commit 6f56e40 into master Feb 20, 2026
32 checks passed
@kundadebdatta kundadebdatta deleted the users/fabianm/NullRefInvestigation branch February 20, 2026 21:44
NaluTripician added a commit that referenced this pull request Feb 20, 2026
… race condition in hedging cancellation (#5613)

## CrossRegionHedgingAvailabilityStrategy: Fixes `ArgumentNullException`
race condition in hedging cancellation

### Bug

Multiple production customers reported unobserved
`ArgumentNullException: Value cannot be null. (Parameter 'request')`
crashes originating from
`CrossRegionHedgingAvailabilityStrategy.RequestSenderAndResultCheckAsync`.
The exception was surfaced as a `TaskScheduler_UnobservedTaskException`,
crashing the process. The affected code paths were:

- `ContainerCore.ReadItemAsync` → `ReadItemStreamAsync` →
`ProcessItemStreamAsync` → `RequestInvokerHandler.SendAsync` →
`CrossRegionHedgingAvailabilityStrategy.ExecuteAvailabilityStrategyAsync`

### Root Cause

A **race condition** caused by passing the wrong `CancellationToken` to
the sender delegate:

1. `ExecuteAvailabilityStrategyAsync` creates a
`hedgeRequestsCancellationTokenSource` (linked to the app-provided CT)
to coordinate hedge request lifecycle.
2. `CloneAndSendAsync` clones the request inside a `using` block and
calls `RequestSenderAndResultCheckAsync`.
3. **Bug:** `RequestSenderAndResultCheckAsync` called
`sender.Invoke(request, cancellationToken)` with the
**application-provided `CancellationToken`** — not
`hedgeRequestsCancellationTokenSource.Token`.
4. When hedge Region B returned a final result (e.g., 200 OK),
`hedgeRequestsCancellationTokenSource.Cancel()` was called.
5. **But the in-flight sender for Region A still held the app CT**
(e.g., `CancellationToken.None`), which was **never cancelled**.
6. The `CloneAndSendAsync` `using` block exited, **disposing the cloned
request**.
7. The Region A sender continued executing with a reference to the
now-disposed request → **`ArgumentNullException: Value cannot be null.
(Parameter 'request')`**.

A secondary issue: when the application CT was cancelled (e2e timeout),
the hedge timer (linked to app CT) would fire, and the old code would
blindly continue the loop attempting to clone and send new requests on a
cancelled path.

### Fix

Two changes in `CrossRegionHedgingAvailabilityStrategy.cs`:

**1. Pass `hedgeRequestsCancellationTokenSource.Token` to
`sender.Invoke()` instead of the app CT**

This ensures that when **any** hedge gets a final result and calls
`hedgeRequestsCancellationTokenSource.Cancel()`, **all** in-flight
senders immediately see their CT cancelled and stop before the cloned
request is disposed. The `CancellationTokenSource` and
`CancellationToken` parameters were also consolidated into a single
`hedgeRequestsCancellationTokenSource` parameter passed through
`CloneAndSendAsync` → `RequestSenderAndResultCheckAsync`.

**2. Add `do/while` loop to handle spurious timer completions on app CT
cancellation**

When the app CT is cancelled (e2e timeout), the hedge timer fires via
the linked CTS. The old code would `continue` the loop and try to clone
a new request. The `do/while` loop now detects
`applicationProvidedCancellationToken.IsCancellationRequested` and falls
through to consolidate existing request outcomes instead of spawning new
hedges.

### Tests Added (8 new unit tests)

| Test | Validates |
|---|---|
| `HedgeCancellationCancelsInFlightRequests_NoNullRef` | Slow primary
request's CT is cancelled when a hedge returns a final result — core
regression test |
| `SenderReceivesHedgeCancellationToken_NotAppToken` | Captures the
actual CT passed to each sender and asserts all are from the hedge CTS,
not the app CT |
| `AppCancellationDuringHedging_DoesNotSpawnNewHedgeRequests` | E2e
timeout (app CT cancelled) does not spawn new hedge requests — validates
the do/while loop fix |
| `MultiRegionHedging_RequestNotAccessedAfterDisposal` | Verifies the
cloned request is still accessible when cancellation fires — exact
scenario from the crash reports |
| `HedgeCancellation_StreamRequest_NoNullRef` | Tests the stream-based
code path (ReadItemStreamAsync) from the NullRef2/NullRef3 stack traces
|
| `PrimaryRequestFinalResult_NoAdditionalHedgesSent` | Fast primary
response skips hedging entirely |
| `AllHedgesTransientError_ReturnsLastResponse` | All regions return
transient errors — strategy returns last response without NullRef |
| `ConcurrentHedgingRequests_NoNullRef` | Stress test: 50 concurrent
hedging requests with random delays — no NullRef under concurrency |

### Type of change

- [x] Bug fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Nalu Tripician <27316859+NaluTripician@users.noreply.github.com>
(cherry picked from commit 6f56e40)
This was referenced Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CrossRegionHedgingAvailabilityStrategy: Fix ArgumentNullException with a Race Condition in Hedging Cancellation

6 participants