[Internal] Direct package: Fixes thread pool starvation from blocking calls in RNTBD Dispatcher #5722
Conversation
…calls in RNTBD Dispatcher

Converts the idle timer callback path from synchronous blocking (`.Wait()`) to async (`await`) to prevent thread pool starvation when many RNTBD connections go idle simultaneously.

Changes:
- Dispatcher: add `WaitTaskAsync`, convert `OnIdleTimer` to async `OnIdleTimerAsync`, update `ScheduleIdleTimer` with `.Unwrap()`, add `IAsyncDisposable` + `DisposeAsync`
- IChannel: add `CloseAsync()` to the interface
- Channel: add `IAsyncDisposable` + `DisposeAsync` + `CloseAsync`
- LoadBalancingChannel: add `IAsyncDisposable` + `DisposeAsync` + `CloseAsync`
- LoadBalancingPartition: add `DisposeAsync` with concurrent channel disposal
- LbChannelState: add `DisposeAsync` calling `CloseAsync`
- ChannelDictionary: add `IAsyncDisposable` + `DisposeAsync` with `Task.WhenAll`

All existing sync methods (`Dispose`, `Close`, `WaitTask`) are kept unchanged for backward compatibility.

Fixes: #4393

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR Review Summary

Overall Assessment: The core fix is sound and well-structured.

Context examined:
Key observations:
Findings: 6 Recommendations, 5 Suggestions, 2 Observations (13 total). See inline comments for details.
- Make DisposeAsync idempotent (return if disposed) across all classes
- Move chaosInterceptor call after the disposal guard in Channel.DisposeAsync
- Wrap dispatcher.DisposeAsync in try/finally to protect stateLock.Dispose
- Implement IAsyncDisposable on LbChannelState and LoadBalancingPartition with ValueTask return type for consistency
- Add exception handling around Task.WhenAll in ChannelDictionary, LoadBalancingChannel, and LoadBalancingPartition DisposeAsync
- Add GC.SuppressFinalize(this) to all DisposeAsync implementations
- Pre-size List<Task> with channels.Count in ChannelDictionary
- Add trace logging for swallowed SynchronizationLockException
- Add cross-reference comments between Dispose/DisposeAsync pairs
- Add TODO for upstream IChannelDictionary wiring

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR Review Summary

Overall Assessment: The core fix is well-designed and correct.

Existing Comments: 11 AI-generated comments from a prior review run were found. The second commit addressed most of them (idempotent disposal guards, …).

10 new findings posted as inline comments below (4 Recommendations, 4 Suggestions, 2 Observations).
…ng improvements

- Use Interlocked.CompareExchange for an atomic disposed flag in Dispatcher, Channel, ChannelDictionary, LoadBalancingChannel, and LoadBalancingPartition to prevent double-execution when Dispose() and DisposeAsync() race
- Make sync Dispose() idempotent (return instead of throw) to match async
- Add GC.SuppressFinalize to sync Dispose() paths
- Add try/finally in Channel.Dispose() for stateLock cleanup safety
- Iterate AggregateException.InnerExceptions in Task.WhenAll catches to log all failures, not just the first
- Optimize CloseAsync() to use DisposeAsync().AsTask() instead of an async state machine in Channel and LoadBalancingChannel
- Add a List<Task> capacity hint in LoadBalancingChannel.DisposeAsync()
- Add issue reference to TODO(#4393) in LoadBalancingChannel

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
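The atomic disposed-flag pattern described in this commit can be sketched as follows. This is a minimal illustration, not the SDK source: the class name, the `CleanupCount` counter, and the `Cleanup` body are hypothetical, added so the exactly-once behavior is observable.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch of the Interlocked.CompareExchange disposal guard: a single int flag
// flipped atomically so that when Dispose() and DisposeAsync() race, exactly
// one of them performs the cleanup, and repeat calls are silent no-ops.
public sealed class GuardedResource : IDisposable, IAsyncDisposable
{
    private int disposed; // 0 = live, 1 = disposed

    public int CleanupCount { get; private set; } // illustration only

    public void Dispose()
    {
        // First caller atomically wins; later or concurrent callers return,
        // which also makes Dispose() idempotent instead of throwing.
        if (Interlocked.CompareExchange(ref this.disposed, 1, 0) != 0)
        {
            return;
        }

        this.Cleanup();
        GC.SuppressFinalize(this);
    }

    public async ValueTask DisposeAsync()
    {
        if (Interlocked.CompareExchange(ref this.disposed, 1, 0) != 0)
        {
            return;
        }

        await Task.Yield(); // stand-in for awaiting real async cleanup
        this.Cleanup();
        GC.SuppressFinalize(this);
    }

    private void Cleanup() => this.CleanupCount++; // release channels, timers, ...
}
```

Because both paths share one flag, a caller mixing `Dispose()` and `await DisposeAsync()` still gets exactly one cleanup pass.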
PR Review Summary

Overall Assessment: The core fix (…).

New findings (after deduplication): 6 new issues found, 1 blocking.
What the PR gets right:
AI-generated review summary
- Fix catch (AggregateException) dead code: await unwraps AggregateException, so the catch never fired. Save the Task.WhenAll result to a variable, catch Exception broadly, and access task.Exception for the full error list. Applies to ChannelDictionary, LoadBalancingChannel, LoadBalancingPartition.
- Add an Interlocked.CompareExchange atomic disposal guard to LoadBalancingPartition.Dispose() matching DisposeAsync()
- Add GC.SuppressFinalize to LoadBalancingPartition Dispose/DisposeAsync
- Add GC.SuppressFinalize to LbChannelState Dispose/DisposeAsync

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
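The dead-code fix above can be sketched like this. The helper name and signature are illustrative, not the SDK's: the point is that `await Task.WhenAll(...)` rethrows only the first faulted task's exception (already unwrapped), so a `catch (AggregateException)` around the await never fires, while the saved task's `Exception` property still holds every failure.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Runs all disposals, awaits them together, and collects EVERY failure
// (not just the first one that await surfaces).
static async Task<IReadOnlyList<Exception>> DisposeAllAsync(IEnumerable<Func<Task>> disposals)
{
    var tasks = new List<Task>();
    foreach (Func<Task> d in disposals)
    {
        tasks.Add(d());
    }

    Task whenAll = Task.WhenAll(tasks);
    try
    {
        await whenAll; // rethrows only the FIRST inner exception, unwrapped
    }
    catch (Exception ex)
    {
        // whenAll.Exception is the AggregateException holding ALL failures;
        // it is null only for a purely-canceled combined task.
        IReadOnlyList<Exception>? all = whenAll.Exception?.InnerExceptions;
        return all ?? new[] { ex };
    }

    return Array.Empty<Exception>();
}
```

A caller can then log each inner exception, which is exactly what the commit changes ChannelDictionary, LoadBalancingChannel, and LoadBalancingPartition to do.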
Azure Pipelines: 1 pipeline(s) were filtered out due to trigger conditions.
…ession tests

- Add a detailed comment on ScheduleIdleTimer explaining why .Unwrap() is essential (use-after-dispose risk if removed)
- Improve WaitTaskAsync logging: include the exception type name alongside the message for better diagnostics (CDX1003-compliant)
- Add DispatcherThreadStarvationTests with 7 test cases:
  - Dispose idempotency
  - DisposeAsync idempotency
  - Concurrent Dispose/DisposeAsync race safety
  - DisposeAsync non-blocking behavior
  - Mass concurrent disposal stress test (100 dispatchers)
  - LoadBalancingChannel DisposeAsync idempotency
  - ChannelDictionary DisposeAsync idempotency

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
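The use-after-dispose risk that the ScheduleIdleTimer comment documents comes from how `ContinueWith` treats an async delegate. A small self-contained sketch (the helper name and the TaskCompletionSource plumbing are illustrative):

```csharp
using System;
using System.Threading.Tasks;

// ContinueWith over an async delegate yields a Task<Task>: the OUTER task
// completes as soon as the delegate RETURNS its (still running) inner task.
// Without .Unwrap(), anyone awaiting the continuation — e.g. a
// StopIdleTimer-style cancellation path — would observe "done" while the
// callback is still mid-flight.
static async Task<bool> InnerDoneWhenOuterCompletes()
{
    var release = new TaskCompletionSource();

    Task<Task> outer = Task.CompletedTask.ContinueWith(async _ =>
    {
        await release.Task;           // the "callback" is still running here
    });

    Task inner = outer.Unwrap();       // proxy for the full async operation

    await outer;                       // completes once the delegate returns
    bool innerDone = inner.IsCompleted; // false: callback still in flight

    release.SetResult();
    await inner;                       // .Unwrap() tracks the real lifecycle
    return innerDone;
}
```

This is why the PR schedules the idle timer via `.ContinueWith(OnIdleTimerAsync).Unwrap()`: the unwrapped task only completes when the async callback has truly finished.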
Three benchmarks validating thread pool behavior:

- Concurrent DisposeAsync throughput (10-200 dispatchers)
- Sync vs async dispose latency comparison
- Thread pool stability during mass disposal (200 dispatchers)

Results: 200 async disposals in <1 ms, 0 thread spike, ~5 µs/item.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Deep Analysis, Regression Tests & Benchmark Results

Code Fixes Applied (commit …)
| Test | Validates |
|---|---|
| `Dispose_IsIdempotent` | Double-dispose is a no-op per .NET guidelines |
| `DisposeAsync_IsIdempotent` | Double async dispose is a no-op |
| `ConcurrentDisposeAndDisposeAsync_OnlyOneExecutes` | `Interlocked.CompareExchange` guard: connection disposed exactly once when sync + async race |
| `DisposeAsync_DoesNotBlock_WhenNoReceiveTask` | Async disposal completes promptly (< 5 s) |
| `ManyDisposals_DoNotStarveThreadPool` | 100 concurrent `DisposeAsync` calls don't starve the thread pool |
| `Channel_DisposeAsync_IsIdempotent` | `LoadBalancingChannel` async disposal chain is idempotent |
| `ChannelDictionary_DisposeAsync_IsIdempotent` | Full dictionary disposal with 3 channels via `Task.WhenAll` |
Benchmark Results (3/3 passing)
Concurrent DisposeAsync Throughput:
| Count | Time (ms) | Avg (ms) | TP Threads | TP Responsive |
|---|---|---|---|---|
| 10 | 5 | 0.50 | 3 | ✅ |
| 50 | <1 | 0.00 | 5 | ✅ |
| 100 | <1 | 0.00 | 5 | ✅ |
| 200 | <1 | 0.00 | 5 | ✅ |
Sync vs Async Dispose Latency (100 dispatchers × 3 iterations):
| Method | Avg/item (µs) |
|---|---|
| Sync `Dispose()` | ~6.5 |
| Async `DisposeAsync()` | ~5.5 |
Thread Pool Stability (200 dispatchers):
- Disposal time: 1ms
- Thread spike: 0 (should be << 200)
- Peak thread count: 5 (unchanged from baseline)
- Pending work items: 0
Analysis Summary
The core fix (ContinueWith(OnIdleTimerAsync).Unwrap()) is architecturally sound and correctly eliminates the Path 1 thread pool starvation from idle timer callbacks. The benchmarks confirm zero thread pool overhead from the async conversion — async disposal is actually slightly faster than sync due to Task.WhenAll parallelism at the ChannelDictionary/LoadBalancingChannel level.
Key validation points:
- ✅ Lock ordering preserved (`connectionLock` → `callLock`); all `await` outside locks
- ✅ `.Unwrap()` correctly tracks the full async lifecycle for `StopIdleTimer()` cancellation
- ✅ Atomic `Interlocked.CompareExchange` prevents double-execution across sync/async paths
- ✅ No allocation regression — async state machine (~200 bytes) replaces a blocked thread (~1 MB stack)
- ✅ Backward compatible — all sync paths preserved unchanged
Full analysis report available as dispatcher-thread-starvation-analysis.md.
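The "await outside locks" discipline validated above can be sketched as follows. The class and field names (`ConnectionShim`, `receiveTask`) are hypothetical stand-ins, not the SDK source; the point is the shape of the pattern, which C# partly enforces since `await` inside a `lock` block is a compile error.

```csharp
using System;
using System.Threading.Tasks;

// Sketch of the locking discipline: lock blocks stay synchronous, the awaited
// task is captured under the lock, and the await itself happens with no lock held.
public sealed class ConnectionShim
{
    private readonly object connectionLock = new object();
    private Task? receiveTask;

    public void SetReceiveTask(Task t)
    {
        lock (this.connectionLock)
        {
            this.receiveTask = t;      // mutate shared state under the lock
        }
    }

    public async Task DrainAsync()
    {
        Task? pending;
        lock (this.connectionLock)
        {
            pending = this.receiveTask; // read shared state under the lock
            this.receiveTask = null;
        }

        if (pending != null)
        {
            await pending;              // await with no lock held
        }
    }
}
```

Because the lock is released before the await, a slow pending task cannot block other callers out of the lock, which is what preserves the `connectionLock` → `callLock` ordering described above.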
Analysis, Validation & Benchmark Report

Root Cause Recap

Issue #4393 — on Linux, when many RNTBD connections go idle simultaneously, the idle timer callbacks block thread pool threads on synchronous `.Wait()` calls, starving the pool.

The Fix

The core change converts the sync `OnIdleTimer` callback to async `OnIdleTimerAsync`, scheduled via `.ContinueWith(OnIdleTimerAsync).Unwrap()`, so idle-timer handling awaits instead of blocking.
Code Changes Already Addressed (via PR review feedback)

The following issues were identified in the initial analysis report and have been resolved across 4 review iterations:
Known Remaining Item
Regression Test Results (7/7 ✅)
Benchmark Results (3/3 ✅)

Concurrent DisposeAsync Throughput
200 concurrent async disposals complete in under 1 ms with zero thread pool thread spike.

Sync vs Async Dispose Latency (100 dispatchers × 3 iterations)
No performance regression. Async disposal is comparable to or slightly faster than sync, likely due to `Task.WhenAll` parallelism.

Thread Pool Stability Under Mass Disposal (200 dispatchers)
The async disposal path creates zero additional thread pool pressure. Pre-fix, the sync path with blocking `.Wait()` calls tied up one thread pool thread per connection being disposed.

Verdict

✅ The fix correctly eliminates the thread pool starvation from idle timer callbacks.
Simulates the exact blocking pattern from OnIdleTimer:
- sync mode: ContinueWith callback calls t.Wait() — STARVES the thread pool
- async mode: ContinueWith callback awaits t — the thread pool stays responsive

Results with 200 connections:
- sync: thread pool STARVED, probe latency 10,193 ms
- async: thread pool responsive, probe latency 0 ms

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reproduction Results — Before/After Proof

A standalone repro (…) exercises the two modes side by side.

Sync Mode (base…)
|
| Metric | Sync (`t.Wait()`) | Async (`await t`) |
|---|---|---|
| Thread pool probe | ❌ STARVED (10,193ms) | ✅ Responsive (0ms) |
| Callbacks started | 0/200 | 200/200 |
| Thread spike | Pool exhausted | +1 |
| Total time | 12,237ms | 2,060ms |
The sync path couldn't even start 1 of the 200 callbacks — the thread pool was completely saturated by Task.Run work items that immediately blocked on t.Wait(), preventing the QueueUserWorkItem probe from executing for over 10 seconds. The async path started all 200 callbacks and the probe executed in 0ms.
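The `QueueUserWorkItem` probe mentioned above can be sketched in a few lines; this is an illustrative reimplementation of the measurement idea, not the repro's actual code:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Queue a no-op work item and measure how long it sits in the queue before a
// thread pool thread runs it. On a healthy pool this is ~0 ms; on a starved
// pool it takes as long as thread injection (or a freed thread) takes.
static TimeSpan ProbeThreadPoolLatency()
{
    var done = new TaskCompletionSource();
    var sw = Stopwatch.StartNew();
    ThreadPool.QueueUserWorkItem(_ => done.SetResult());
    done.Task.Wait(); // the probe itself runs on a non-pool (main) thread
    return sw.Elapsed;
}
```

In the repro's sync mode every pool thread was parked in `t.Wait()`, so this probe reported 10,193 ms; in async mode the threads were free and it reported 0 ms.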
```csharp
{
    const int ConnectionCount = 200;

    static async Task Main(string[] args)
```
Is this just a conceptual-possibility repro? Ideally it should be reproduced with SDK code.
Adds 3 end-to-end tests that exercise REAL SDK Dispatcher and TimerPool instances to validate the OnIdleTimerAsync fix:

- EndToEnd_IdleTimerCallbacks_WithPendingReceiveTasks_ThreadPoolRemainsResponsive: creates 50 Dispatchers with injected pending receive tasks, triggers StartIdleTimer via the real TimerPool, verifies 0 ms probe latency.
- EndToEnd_MassAsyncDisposal_ThreadPoolRemainsResponsive: 100 concurrent DisposeAsync calls with pending receive tasks, verifies the thread pool stays responsive during mass disposal.
- EndToEnd_IdleTimerRacesWithDisposal_NoDeadlock: 20 iterations racing the idle timer callback against DisposeAsync to verify no deadlock or use-after-dispose.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… starvation fix

Adds the full ThreadPoolStarvationFix-PR5722 directory containing:

- VALIDATION-REPORT.md: comprehensive analysis with root cause, code review, reproduction results, benchmarks, and risk assessment
- repros/02-sdk-code-repro: SDK-code-faithful reproduction (before/after)
- repros/03-disposal-benchmark: sync vs async dispose throughput/memory benchmarks
- repros/04-integration-stress-test: 8 correctness stress tests
- repros/DispatcherThreadStarvationTests.cs: end-to-end SDK tests backup

This PR is a POC — the actual fix will be merged via the msdata repo.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR scope is too large, and the fix is inconsistent and not easily testable. Will create a new PR with a better testing framework.
Problem
Fixes #4393
On Linux, thread pool threads are blocked by synchronous `.Wait()` calls in the RNTBD `Dispatcher` class. When many connections go idle simultaneously, the `OnIdleTimer` callback runs on thread pool threads via `ContinueWith`, and each one calls `WaitTask()` → `t.Wait()`, blocking the thread. This causes thread pool starvation and makes the service unresponsive.

Two Blocking Paths
Path 1: Idle Timer Callbacks (PRIMARY)
Path 2: Mass Channel Disposal (SECONDARY)
Changes
Dispatcher.cs
- `WaitTaskAsync`: new async counterpart to `WaitTask` using `await` instead of `.Wait()`
- `OnIdleTimerAsync`: converted from sync `OnIdleTimer` — the critical fix that eliminates thread pool starvation from idle timer callbacks
- `ScheduleIdleTimer`: updated to use `.ContinueWith(OnIdleTimerAsync).Unwrap()` for proper async continuation tracking
- `IAsyncDisposable` + `DisposeAsync`: non-blocking disposal path

IChannel.cs
- Adds a `CloseAsync()` method to the interface

Channel.cs
- `IAsyncDisposable` + `DisposeAsync`: uses `await initTask` and `dispatcher.DisposeAsync()`
- `CloseAsync`: delegates to `DisposeAsync`

LoadBalancingChannel.cs
- `IAsyncDisposable` + `DisposeAsync`: concurrent partition disposal via `Task.WhenAll`
- `CloseAsync`: delegates to `DisposeAsync`

LoadBalancingPartition.cs
- `DisposeAsync`: concurrent channel state disposal via `Task.WhenAll`

LbChannelState.cs
- `DisposeAsync`: calls `channel.CloseAsync()` instead of `channel.Close()`

ChannelDictionary.cs
- `IAsyncDisposable` + `DisposeAsync`: concurrent channel closure via `Task.WhenAll`
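The `WaitTask`/`WaitTaskAsync` pair at the heart of the change can be sketched as below. The method names come from the PR; the bodies are illustrative (the SDK versions also handle logging and exceptions), but they show the one-line difference that matters for starvation:

```csharp
using System;
using System.Threading.Tasks;

// Sync form: pins the calling thread pool thread for the entire wait.
// 200 idle connections doing this consume 200 pool threads.
static void WaitTask(Task t)
{
    t.Wait();     // blocks until t completes
}

// Async form: frees the thread at the await and resumes on any pool
// thread once t completes — no threads are held while waiting.
static async Task WaitTaskAsync(Task t)
{
    await t;      // yields the thread while t is pending
}
```

Swapping the idle-timer path from the first form to the second is what turns "one blocked ~1 MB thread per idle connection" into "one ~200-byte state machine per idle connection".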
- Backward compatibility: all existing sync methods (`Dispose()`, `Close()`, `WaitTask()`) kept unchanged
- `lock` blocks remain synchronous; `await` is always outside lock scope
- `.Unwrap()` is essential: ensures `idleTimerTask` properly represents the full async operation lifecycle for `StopIdleTimer()` cancellation tracking
- `DisposeAsync` at each level uses `Task.WhenAll` for parallel channel/partition cleanup

Testing