Skip to content

CrossRegionHedgingAvailabilityStrategy: Fixes StackOverflow in CrossRegionHedgingAvailabilityStrategy Observed in .NET Framework 4.7.2#5870

Merged
kundadebdatta merged 4 commits into
mainfrom
users/kundadebdatta/fix_stackoverflow_error_in_availability_strategy
May 16, 2026
Merged

CrossRegionHedgingAvailabilityStrategy: Fixes StackOverflow in CrossRegionHedgingAvailabilityStrategy Observed in .NET Framework 4.7.2#5870
kundadebdatta merged 4 commits into
mainfrom
users/kundadebdatta/fix_stackoverflow_error_in_availability_strategy

Conversation

@kundadebdatta
Copy link
Copy Markdown
Member

Problem statement

Customer using the latest .NET SDK on .NET Framework 4.7.2 hit a System.StackOverflowException
during a regional outage. The dump shows the failing thread's top frames in
CrossRegionHedgingAvailabilityStrategy:

RequestSenderAndResultCheckAsync     (deepest hedging frame; calls into the request pipeline)
  → ExceptionDispatchInfo.Throw
  → TaskAwaiter.ThrowForNonSuccess
  → TaskAwaiter.HandleNonSuccessAndDebuggerNotification
CloneAndSendAsync                    (middle hedging frame)
  → ExceptionDispatchInfo.Throw
  → TaskAwaiter.ThrowForNonSuccess
ExecuteAvailabilityStrategyAsync     (topmost hedging frame)
  → ExceptionDispatchInfo.Throw
  → TaskAwaiter.ThrowForNonSuccess

Every CosmosOperationCanceledException thrown deep in the request pipeline (e.g. when
hedgeRequestsCancellationTokenSource cancels in-flight hedged calls) propagates synchronously
back through the entire pipeline of awaiting async methods. On .NET Framework 4.7.2 each async
method's exception path consumes ~10KB of stack (ExceptionDispatchInfo.Throw
TaskAwaiter.ThrowForNonSuccessHandleNonSuccessAndDebuggerNotification). With the request
pipeline (handlers, retry policies, store clients, etc.) layered under hedging, the cumulative
stack consumption can blow the 1MB managed thread stack.

This is not reproducible on .NET Core / .NET 5+ — those runtimes reset the synchronous
exception stack across await boundaries automatically. Only .NET Framework 4.x consumers
are affected.

Approach

Apply the documented workaround: insert await Task.Yield(); in the catch block of
middle-layer async methods
along the propagation chain. Task.Yield() schedules the
continuation on the threadpool, so the rethrown exception unwinds onto a fresh stack
instead of accumulating frames.

Guidance:

  • Never add it at the deepest frame (it just delays without resetting useful stack).
  • Never add it at the absolute topmost frame (caller has nowhere left to unwind into).
  • Do add it on a middle layer — once is enough per long chain to reset the stack.

Scope

P0 — Fix the hedging stack overflow (this PR)

File: Microsoft.Azure.Cosmos/src/Routing/AvailabilityStrategy/CrossRegionHedgingAvailabilityStrategy.cs

  1. CloneAndSendAsync (middle layer) — currently has no try/catch. Wrap the awaited call
    in try { ... } catch { await Task.Yield(); throw; }. This is the recommended insertion
    point per the runtime guidance — middle-layer, between the deepest hedging frame
    (RequestSenderAndResultCheckAsync) and the topmost (ExecuteAvailabilityStrategyAsync).
  2. RequestSenderAndResultCheckAsyncthrow ex; bug fix — line 372 uses throw ex;
    which resets the captured stack trace. Change to throw; (a tightly-coupled bug discovered
    while reading the same catch block).
  3. Unit test — extend AvailabilityStrategyUnitTests with a regression test that wires up
    a deep, fake handler chain whose deepest stage throws OperationCanceledException, drives
    it through hedging, and asserts the call returns/throws normally without a StackOverflowException.
    We can't directly assert "no SO" (SO terminates the process), but we can assert that the
    continuation after the catch runs on a different thread (proof the yield happened) and that
    the original exception type and message are preserved.

P1 — Audit other deep async chains in the SDK (follow-up)

Same pattern is latent anywhere a long async/await chain rethrows from a catch on
.NET Framework. Candidates worth auditing in a follow-up PR:

File Location Notes
Handler/AbstractRetryHandler.cs SendAsync catches (DocumentClientException, CosmosException, AggregateException, OperationCanceledException) and ExecuteHttpRequestAsync retry loop Sits in the request pipeline; long await chain below it
ChangeFeed/ChangeFeedIteratorCore.cs Two catch (OperationCanceledException) → throw new CosmosOperationCanceledException blocks (lines 243, 287) Async iterator — deep pipeline below
ReadFeed/ReadFeedIteratorCore.cs Same pattern (line 257)
Query/v3Query/QueryIterator.cs Same pattern (line 225)
Resource/ClientContextCore.cs OperationHelperWithRootTraceAsync TryTransformException rethrow (line 615) Top-of-pipeline; arguably "topmost", but worth verifying

Approach for P1:

  • Triage each by stack depth below the catch (deepest pipelines = highest risk).
  • For each confirmed risk, add await Task.Yield() in one middle-layer catch on the
    propagation path (typically the iterator/handler that sits 1 layer above the request
    pipeline, not the very top).
  • Do not add await Task.Yield() everywhere — it has a small perf cost (extra threadpool
    hop) on every exception. One per long chain is sufficient.

Out of scope

  • Refactoring hedging concurrency model.
  • Changing the Task.WhenAny accumulation pattern (the leaked-task issue is separate).
  • Adding Task.Yield on the success path.

Notes

  • Task.Yield() cost: one threadpool dispatch per exception. Negligible vs. the alternative
    (process crash) and only paid on the exception path, which is already slow.
  • Behavior on .NET Core / .NET 5+: identical observable behavior; one extra threadpool hop
    on exceptions. Acceptable.
  • The fix is netstandard2.0 friendly (target framework of Microsoft.Azure.Cosmos).

Validation plan

  1. Build Microsoft.Azure.Cosmos.csproj.
  2. Run AvailabilityStrategyUnitTests and any other existing hedging tests.
  3. Ensure the new regression test passes.
  4. Confirm contracts (API_*.txt) are unchanged — no public API surface changes.

Status

  • Branch: users/kundadebdatta/fix_stackoverflow_error_in_availability_strategy
  • Commit: d04487cef — "Fix StackOverflowException in CrossRegionHedgingAvailabilityStrategy on .NET Framework"
  • Build: clean, 0 warnings.
  • Tests: 103/103 related tests pass (AvailabilityStrategy + ClientRetryPolicy + RetryHandler + CrossRegion + PartitionKeyRangeFailover).
  • Push: not pushed yet — awaiting user confirmation per stored preference.

Next steps

  • Get user confirmation, then push the branch and open a PR.
  • Schedule the P1 audit (per the table above) as a follow-up PR.

…on .NET Framework

A customer running the latest .NET SDK on .NET Framework 4.7.2 hit a
System.StackOverflowException during a regional outage. The failing thread's
top frames were in CrossRegionHedgingAvailabilityStrategy
(RequestSenderAndResultCheckAsync -> CloneAndSendAsync ->
ExecuteAvailabilityStrategyAsync), each followed by ExceptionDispatchInfo.Throw
and TaskAwaiter.ThrowForNonSuccess machinery.

Root cause: on .NET Framework, every awaiting async method consumes ~10KB of
stack on the synchronous exception unwind path. When CosmosOperationCanceledException
is thrown deep in the request pipeline (e.g. after the hedge CTS is signalled),
the synchronous exception propagation through the pipeline + hedging frames can
blow the 1MB managed stack. .NET Core / .NET 5+ are unaffected because they
optimize this path.

P0 fix:
- CloneAndSendAsync: wrap the awaited call in try/catch with await Task.Yield();
  throw; in the catch. The yield reschedules the rethrow on a fresh threadpool
  stack, breaking the synchronous propagation chain (one yield per long chain
  is sufficient; placed at the middle layer per runtime guidance).

Tightly coupled bugs in the same propagation path (also fixed):
- RequestSenderAndResultCheckAsync: throw ex; -> throw; (preserve stack trace).
- ExecuteAvailabilityStrategyAsync: throw lastException; ->
  ExceptionDispatchInfo.Capture(lastException).Throw(); (preserve stack trace).

Test:
- New AvailabilityStrategyUnitTests.SenderException_PropagatesViaYield_PreservesStackTrace
  asserts that (1) the inner OCE's stack trace still contains the deep throwing
  frame after propagation through hedging (regression for throw-ex -> throw)
  and (2) at least one continuation is posted to the active SynchronizationContext
  during exception propagation, observable proof that Task.Yield ran in the catch.
- 103/103 related tests pass (AvailabilityStrategy + ClientRetryPolicy +
  RetryHandler + CrossRegion + PartitionKeyRangeFailover).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 15, 2026

📝 Changelog reminder (non-blocking)

This PR touches shipped source but does not appear to update the
corresponding changelog.md.

Touched (and missing entry for):

  • Microsoft.Azure.Cosmos/src/** ⇒ expected an entry in changelog.md (### Unreleased)

How to decide

Use the rubric in .github/copilot-instructions.md ("Changelog
classifier") or in CONTRIBUTING.md ("Changelog entry"). Quick
version:

  • Customer-observable change (behavior, perf, memory, API) ⇒ add an entry, even if the PR is [Internal].
  • Test-only / CI-only / doc-only / pure-internal-refactor with zero customer-observable effect ⇒ no entry, add the no-changelog-needed label to silence this check.
  • Unsure? Default to adding a one-line entry under #### Other Changes. Reviewer will adjust.

This check is non-blocking — merge is not gated on it. The
reviewer is responsible for the final classification.

@kundadebdatta kundadebdatta self-assigned this May 15, 2026
@kundadebdatta kundadebdatta added the bug Something isn't working label May 15, 2026
Copy link
Copy Markdown
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

ananth7592
ananth7592 previously approved these changes May 15, 2026
Copy link
Copy Markdown
Member

@ananth7592 ananth7592 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one item

Task.Yield() runs on all runtimes including .NET 6+ where SO doesn't exist; comment says "no-op" but it's a threadpool dispatch. The threadpool dispatch adds pressure but since it is much miniscule in impact compared to bottleneck of network timeouts happening during outage /hedge scenario

@kundadebdatta kundadebdatta enabled auto-merge (squash) May 15, 2026 23:47
Copy link
Copy Markdown
Contributor

@NaluTripician NaluTripician left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep review pass. The core fix (single Task.Yield() in CloneAndSendAsync's catch) is correctly placed at the middle layer of the hedging seam, the untyped catch scope is intentional, and the compiler-stored ExceptionDispatchInfo across the await preserves the original stack on the subsequent throw;. Three findings below — two are notable, one is a minor disclosure/coverage gap.

  1. Major — the new regression test's claim of covering the throw ex;throw; change in RequestSenderAndResultCheckAsync is empirically false (verified by reverting just that line and watching the test still pass). The pre-cancelled-CTS path routes through the OCE-filtered catch, not the generic catch where the fix lives, and CosmosOperationCanceledException.StackTrace delegates to the inner OCE's stack trace so the preservation is incidental, not caused by throw;. See inline comment.
  2. Major (merge-time)changelog.md targets #### Bugs Fixed under ### Unreleased, but main (commit 8b03a5a, PR #5864) replaced that with #### Fixed under ### Unreleased Preview. Rebase needed or the entry lands in a heading that no longer exists. See inline comment.
  3. Minor — the throw lastException;ExceptionDispatchInfo.Capture(...).Throw() change in ExecuteAvailabilityStrategyAsync is correct but isn't called out in the PR description and isn't exercised by the new test. See inline comment.

Everything else checked out: scope is exactly the 3 advertised files, PR title matches the prlint regex, postCountDelta > 0 is empirically robust (not flaky), and no disposal/cancellation races introduced.

Comment thread changelog.md
@kundadebdatta kundadebdatta dismissed stale reviews from FabianMeiswinkel and ananth7592 via 1852041 May 16, 2026 04:44
Copy link
Copy Markdown
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kundadebdatta kundadebdatta merged commit d5aefe6 into main May 16, 2026
33 checks passed
@kundadebdatta kundadebdatta deleted the users/kundadebdatta/fix_stackoverflow_error_in_availability_strategy branch May 16, 2026 09:32
NaluTripician added a commit to MichalMarsalek/azure-cosmos-dotnet-v3 that referenced this pull request May 18, 2026
Resolves changelog.md Unreleased Bugs Fixed section: keeps both the Azure#5298 LINQ MemberInit fix entry and the Azure#5870 CrossRegionHedgingAvailabilityStrategy entry from main.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NaluTripician added a commit that referenced this pull request May 18, 2026
Conflicts resolved:
- Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/ChangeFeedProcessorCore.cs:
  Adopted main's defensive structure from #5825 (explicit AVAD throws for
  StartFromBeginning/StartTime plus the shouldAnchorStartTime boolean). The
  Mode != AllVersionsAndDeletes guard ΓÇö the core fix from #5852 ΓÇö is
  preserved. Extended the comment to cite both #5825 and #5846 so future
  readers see the LSN-based rationale and the original cold-start
  regression context.
- changelog.md: Kept both Unreleased Bugs Fixed entries (#5852 cold-start
  regression and #5870 hedging StackOverflow).

Validated by building Microsoft.Azure.Cosmos.Tests and running the
ChangeFeedProcessorCoreTests filter; all 20 tests pass, including both the
PR's regression tests (AC1/AC7/AC8) and main's defensive throw tests added
in #5825.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants