CrossRegionHedgingAvailabilityStrategy: Fixes StackOverflow in CrossRegionHedgingAvailabilityStrategy Observed in .NET Framework 4.7.2#5870
Conversation
…on .NET Framework A customer running the latest .NET SDK on .NET Framework 4.7.2 hit a System.StackOverflowException during a regional outage. The failing thread's top frames were in CrossRegionHedgingAvailabilityStrategy (RequestSenderAndResultCheckAsync -> CloneAndSendAsync -> ExecuteAvailabilityStrategyAsync), each followed by ExceptionDispatchInfo.Throw and TaskAwaiter.ThrowForNonSuccess machinery. Root cause: on .NET Framework, every awaiting async method consumes ~10KB of stack on the synchronous exception unwind path. When CosmosOperationCanceledException is thrown deep in the request pipeline (e.g. after the hedge CTS is signalled), the synchronous exception propagation through the pipeline + hedging frames can blow the 1MB managed stack. .NET Core / .NET 5+ are unaffected because they optimize this path. P0 fix: - CloneAndSendAsync: wrap the awaited call in try/catch with await Task.Yield(); throw; in the catch. The yield reschedules the rethrow on a fresh threadpool stack, breaking the synchronous propagation chain (one yield per long chain is sufficient; placed at the middle layer per runtime guidance). Tightly coupled bugs in the same propagation path (also fixed): - RequestSenderAndResultCheckAsync: throw ex; -> throw; (preserve stack trace). - ExecuteAvailabilityStrategyAsync: throw lastException; -> ExceptionDispatchInfo.Capture(lastException).Throw(); (preserve stack trace). Test: - New AvailabilityStrategyUnitTests.SenderException_PropagatesViaYield_PreservesStackTrace asserts that (1) the inner OCE's stack trace still contains the deep throwing frame after propagation through hedging (regression for throw-ex -> throw) and (2) at least one continuation is posted to the active SynchronizationContext during exception propagation, observable proof that Task.Yield ran in the catch. - 103/103 related tests pass (AvailabilityStrategy + ClientRetryPolicy + RetryHandler + CrossRegion + PartitionKeyRangeFailover). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
📝 Changelog reminder (non-blocking)This PR touches shipped source but does not appear to update the Touched (and missing entry for):
How to decideUse the rubric in
This check is non-blocking — merge is not gated on it. The |
ananth7592
left a comment
There was a problem hiding this comment.
LGTM except for one item
Task.Yield() runs on all runtimes including .NET 6+ where SO doesn't exist; comment says "no-op" but it's a threadpool dispatch. The threadpool dispatch adds pressure but since it is much miniscule in impact compared to bottleneck of network timeouts happening during outage /hedge scenario
…in_availability_strategy
NaluTripician
left a comment
There was a problem hiding this comment.
Deep review pass. The core fix (single Task.Yield() in CloneAndSendAsync's catch) is correctly placed at the middle layer of the hedging seam, the untyped catch scope is intentional, and the compiler-stored ExceptionDispatchInfo across the await preserves the original stack on the subsequent throw;. Three findings below — two are notable, one is a minor disclosure/coverage gap.
- Major — the new regression test's claim of covering the
throw ex;→throw;change inRequestSenderAndResultCheckAsyncis empirically false (verified by reverting just that line and watching the test still pass). The pre-cancelled-CTS path routes through the OCE-filtered catch, not the generic catch where the fix lives, andCosmosOperationCanceledException.StackTracedelegates to the inner OCE's stack trace so the preservation is incidental, not caused bythrow;. See inline comment. - Major (merge-time) —
changelog.mdtargets#### Bugs Fixedunder### Unreleased, butmain(commit 8b03a5a, PR #5864) replaced that with#### Fixedunder### Unreleased Preview. Rebase needed or the entry lands in a heading that no longer exists. See inline comment. - Minor — the
throw lastException;→ExceptionDispatchInfo.Capture(...).Throw()change inExecuteAvailabilityStrategyAsyncis correct but isn't called out in the PR description and isn't exercised by the new test. See inline comment.
Everything else checked out: scope is exactly the 3 advertised files, PR title matches the prlint regex, postCountDelta > 0 is empirically robust (not flaky), and no disposal/cancellation races introduced.
1852041
Resolves changelog.md Unreleased Bugs Fixed section: keeps both the Azure#5298 LINQ MemberInit fix entry and the Azure#5870 CrossRegionHedgingAvailabilityStrategy entry from main. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Conflicts resolved: - Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/ChangeFeedProcessorCore.cs: Adopted main's defensive structure from #5825 (explicit AVAD throws for StartFromBeginning/StartTime plus the shouldAnchorStartTime boolean). The Mode != AllVersionsAndDeletes guard ΓÇö the core fix from #5852 ΓÇö is preserved. Extended the comment to cite both #5825 and #5846 so future readers see the LSN-based rationale and the original cold-start regression context. - changelog.md: Kept both Unreleased Bugs Fixed entries (#5852 cold-start regression and #5870 hedging StackOverflow). Validated by building Microsoft.Azure.Cosmos.Tests and running the ChangeFeedProcessorCoreTests filter; all 20 tests pass, including both the PR's regression tests (AC1/AC7/AC8) and main's defensive throw tests added in #5825. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem statement
Customer using the latest .NET SDK on .NET Framework 4.7.2 hit a
System.StackOverflowExceptionduring a regional outage. The dump shows the failing thread's top frames in
CrossRegionHedgingAvailabilityStrategy:Every
CosmosOperationCanceledExceptionthrown deep in the request pipeline (e.g. whenhedgeRequestsCancellationTokenSourcecancels in-flight hedged calls) propagates synchronouslyback through the entire pipeline of awaiting async methods. On .NET Framework 4.7.2 each async
method's exception path consumes ~10KB of stack (
ExceptionDispatchInfo.Throw→TaskAwaiter.ThrowForNonSuccess→HandleNonSuccessAndDebuggerNotification). With the requestpipeline (handlers, retry policies, store clients, etc.) layered under hedging, the cumulative
stack consumption can blow the 1MB managed thread stack.
This is not reproducible on .NET Core / .NET 5+ — those runtimes reset the synchronous
exception stack across
awaitboundaries automatically. Only .NET Framework 4.x consumersare affected.
Approach
Apply the documented workaround: insert
await Task.Yield();in the catch block ofmiddle-layer async methods along the propagation chain.
Task.Yield()schedules thecontinuation on the threadpool, so the rethrown exception unwinds onto a fresh stack
instead of accumulating frames.
Guidance:
Scope
P0 — Fix the hedging stack overflow (this PR)
File:
Microsoft.Azure.Cosmos/src/Routing/AvailabilityStrategy/CrossRegionHedgingAvailabilityStrategy.csCloneAndSendAsync(middle layer) — currently has no try/catch. Wrap the awaited callin
try { ... } catch { await Task.Yield(); throw; }. This is the recommended insertionpoint per the runtime guidance — middle-layer, between the deepest hedging frame
(
RequestSenderAndResultCheckAsync) and the topmost (ExecuteAvailabilityStrategyAsync).RequestSenderAndResultCheckAsync—throw ex;bug fix — line 372 usesthrow ex;which resets the captured stack trace. Change to
throw;(a tightly-coupled bug discoveredwhile reading the same catch block).
AvailabilityStrategyUnitTestswith a regression test that wires upa deep, fake handler chain whose deepest stage throws
OperationCanceledException, drivesit through hedging, and asserts the call returns/throws normally without a
StackOverflowException.We can't directly assert "no SO" (SO terminates the process), but we can assert that the
continuation after the catch runs on a different thread (proof the yield happened) and that
the original exception type and message are preserved.
P1 — Audit other deep async chains in the SDK (follow-up)
Same pattern is latent anywhere a long async/await chain rethrows from a catch on
.NET Framework. Candidates worth auditing in a follow-up PR:
Handler/AbstractRetryHandler.csSendAsynccatches (DocumentClientException,CosmosException,AggregateException,OperationCanceledException) andExecuteHttpRequestAsyncretry loopChangeFeed/ChangeFeedIteratorCore.cscatch (OperationCanceledException) → throw new CosmosOperationCanceledExceptionblocks (lines 243, 287)ReadFeed/ReadFeedIteratorCore.csQuery/v3Query/QueryIterator.csResource/ClientContextCore.csOperationHelperWithRootTraceAsyncTryTransformExceptionrethrow (line 615)Approach for P1:
await Task.Yield()in one middle-layer catch on thepropagation path (typically the iterator/handler that sits 1 layer above the request
pipeline, not the very top).
await Task.Yield()everywhere — it has a small perf cost (extra threadpoolhop) on every exception. One per long chain is sufficient.
Out of scope
Task.WhenAnyaccumulation pattern (the leaked-task issue is separate).Task.Yieldon the success path.Notes
Task.Yield()cost: one threadpool dispatch per exception. Negligible vs. the alternative(process crash) and only paid on the exception path, which is already slow.
on exceptions. Acceptable.
Microsoft.Azure.Cosmos).Validation plan
Microsoft.Azure.Cosmos.csproj.AvailabilityStrategyUnitTestsand any other existing hedging tests.API_*.txt) are unchanged — no public API surface changes.Status
users/kundadebdatta/fix_stackoverflow_error_in_availability_strategyd04487cef— "Fix StackOverflowException in CrossRegionHedgingAvailabilityStrategy on .NET Framework"Next steps