Fix proactive token refresh bypassing cancellation, leading to unbounded semaphore wait #6054
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates background refresh delegates to use async + ConfigureAwait(false) when fetching tokens in silent/background flows, reducing synchronization-context capture in library code.
Changes:
- Converted background fetch delegates from non-`async` lambdas to `async` lambdas.
- Added `await ...ConfigureAwait(false)` on token refresh / token acquisition tasks in background fetch paths.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/client/Microsoft.Identity.Client/Internal/Requests/Silent/CacheSilentStrategy.cs | Uses await ...ConfigureAwait(false) in background refresh delegate. |
| src/client/Microsoft.Identity.Client/Internal/Requests/OnBehalfOfRequest.cs | Uses await ...ConfigureAwait(false) in background refresh delegate. |
| src/client/Microsoft.Identity.Client/Internal/Requests/ManagedIdentityAuthRequest.cs | Uses await ...ConfigureAwait(false) in background token acquisition delegate. |
| src/client/Microsoft.Identity.Client/Internal/Requests/ClientCredentialRequest.cs | Uses await ...ConfigureAwait(false) in background token acquisition delegate. |
908a3dd to f2e817e
Hi @pmaytak, @AzureAD/id4s-msal-team, would appreciate your review on this fix when you get a chance. This addresses a subtle async disposal issue. The change is minimal. Thank you!
… and causes SemaphoreSlim convoy

The lambda passed to ProcessFetchInBackground uses 'using var' to create a linked CancellationTokenSource, but returns the Task without awaiting it. This causes the linked CTS to be disposed before the async operation completes, breaking the link to the parent cancellation token. As a result, WaitAsync on the static SemaphoreSlim(1,1) becomes permanently unkillable.

When the token endpoint (IMDS) is temporarily unreachable, every proactive refresh background task becomes a permanent semaphore waiter, forming an unbounded convoy. Foreground threads that later need a token are blocked behind the convoy for hours.

Fix: make the lambda async and await the inner call, so 'using var' disposal waits for the async operation to complete and parent cancellation propagates correctly through the linked token.

Fixes all 4 affected locations introduced by PR AzureAD#4471:
- ClientCredentialRequest.cs
- ManagedIdentityAuthRequest.cs
- OnBehalfOfRequest.cs
- CacheSilentStrategy.cs
f2e817e to 858895c
cc @bgavrilMS @gladjohn @pmaytak, would appreciate your review as you are familiar with this area of the codebase.
gladjohn left a comment
LGTM!!! Thanks @jayesh-a-shah
Verified the build passes on a separate PR. Contributor PRs don't seem to trigger ADO builds.
Fixes #6053
The Bug
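PR #4471 changed the `ProcessFetchInBackground` lambda so that it creates a linked `CancellationTokenSource` with `using var` and then returns the task from the inner call without awaiting it.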
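A minimal sketch of the lambda shape described here as buggy. It is not the MSAL source: `GetAccessTokenAsync` is a hypothetical stand-in for the real token acquisition call, and `CreateFetchAction` is a hypothetical wrapper added only so the lambda compiles in isolation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class BuggyLambdaSketch
{
    // Hypothetical stand-in for the real token acquisition call.
    static async Task<string> GetAccessTokenAsync(CancellationToken cancellationToken)
    {
        await Task.Delay(100, cancellationToken).ConfigureAwait(false);
        return "token";
    }

    // Hypothetical wrapper that builds the background fetch delegate.
    public static Func<Task<string>> CreateFetchAction(CancellationToken cancellationToken)
    {
        // Non-async lambda: the `using var` scope ends at `return`, so the
        // linked source is disposed while the returned task is still running.
        return () =>
        {
            using var tokenSource = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            return GetAccessTokenAsync(tokenSource.Token);
        };
    }

    static async Task Main()
    {
        Func<Task<string>> fetch = CreateFetchAction(CancellationToken.None);
        Console.WriteLine(await fetch().ConfigureAwait(false));
    }
}
```

The call still succeeds when nothing is cancelled, which is part of why this shape is easy to miss: the problem only shows up when the parent token is cancelled after the lambda has already returned.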
This lambda is not `async`. `using var` disposes `tokenSource` when the lambda body returns at `return`, which is before `GetAccessTokenAsync` completes. After disposal, the linked token is disconnected from its parent: cancelling the parent CTS no longer cancels the linked token.

Inside `GetAccessTokenAsync`, the code calls `WaitAsync` on `s_semaphoreSlim`, a `static SemaphoreSlim(1,1)` that acts as a process-wide single-concurrency lock. Because the linked token is disconnected, `WaitAsync` can only complete when the semaphore is released; it can never be cancelled.

How This Causes Thread Starvation

Azure.Core's `BearerTokenAuthenticationPolicy` has a background token refresh mechanism. When a cached token approaches expiry, it spawns a background refresh with a 30-second timeout. This 30s CTS is passed through Azure.Identity → MSAL → `ExecuteAsync(30sToken)`. When MSAL's `NeedsRefresh()` is true, it returns the cached token to the caller and spawns a `Task.Run` via `ProcessFetchInBackground` to refresh the token in the background. The lambda captures the 30s token.

Without the bug: The background task calls `WaitAsync(linkedToken)`. If it can't acquire the semaphore within 30s, the parent CTS fires, cancellation propagates through the linked token, `WaitAsync` throws `OperationCanceledException`, and the task exits the semaphore queue. The queue stays small.

With the bug: The linked token is disconnected from the 30s parent CTS. The background task calls `WaitAsync(disconnectedToken)`. The 30s timeout fires but has no effect. The task remains in the semaphore queue indefinitely, waiting for the semaphore to be released by whoever is ahead of it.

When the token endpoint (IMDS) is temporarily unreachable, the semaphore holder takes ~100s (retry loop) before failing and releasing. During the ~20-minute `NeedsRefresh` window, each incoming `GetToken` call spawns a new background task that joins the queue permanently. ~100+ tasks accumulate.

These tasks drain one at a time: each acquires the semaphore, calls IMDS, fails after ~100s of retries, and releases. Foreground threads that need a token enter the same queue behind all accumulated tasks. With ~139 tasks ahead, each taking ~100s, a foreground thread waits ~232 minutes.
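The ~232-minute figure follows from the queue draining one task at a time. A small worked check, assuming every queued task holds the semaphore for the full ~100s retry duration (`QueueWaitEstimate` and `EstimateWait` are hypothetical names used only for this calculation):

```csharp
using System;

static class QueueWaitEstimate
{
    // Wait for a newcomer when each task ahead holds the lock for holdTimePerTask.
    public static TimeSpan EstimateWait(int tasksAhead, TimeSpan holdTimePerTask)
        => TimeSpan.FromTicks(holdTimePerTask.Ticks * tasksAhead);

    static void Main()
    {
        // 139 tasks * 100 s = 13,900 s, roughly 231.7 minutes.
        TimeSpan wait = EstimateWait(139, TimeSpan.FromSeconds(100));
        Console.WriteLine($"{wait.TotalMinutes:F0} minutes"); // prints "232 minutes"
    }
}
```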
The Fix
Make the lambda `async` and `await` the inner call.

In an `async` method, `using var` disposal happens after the `await` completes (the compiler transforms it into a state machine). The linked token stays connected to the parent CTS for the entire duration of the async operation. The 30s timeout from Azure.Core now correctly propagates: background tasks exit the semaphore queue after 30s instead of accumulating permanently.

Production Impact Observed
Service: Azure Site Recovery — Replication Configuration Manager (RCM)
Framework: net462, MSAL 4.83.1, Azure.Identity 1.19.0, Azure.Core 1.51.1
Stamp: ecy-pod01 (eastus2euap), Date: 2026-06-02, 20:48–00:40 UTC
RCM uses managed identity for Azure Storage access. When IMDS became unreachable on one node, ~139 proactive refresh tasks accumulated in the semaphore queue. 4 foreground threads (task execution engine scheduler and workflow threads) were blocked for 189–232 minutes. All completed within 1 second after IMDS recovered. The blocked scheduler thread prevented all task scheduling on the node for the entire duration.
Changes
Fixed all 4 affected call sites (all introduced by PR #4471):
- ClientCredentialRequest.cs
- ManagedIdentityAuthRequest.cs
- OnBehalfOfRequest.cs
- CacheSilentStrategy.cs

Standalone Repro
See issue #6053 for a console app that reproduces the bug on both net462 and net8.0.
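For readers without the issue at hand, the following is a separate, much smaller sketch of the same effect; it is not the console app from the issue. All names here (`LinkedTokenRepro`, `FetchBuggy`, `FetchFixed`, `IsCancelledByParentAsync`, and the stand-in `GetAccessTokenAsync`) are hypothetical, and the 200 ms timeout stands in for the 30s one. It holds a `SemaphoreSlim(1,1)`, starts a waiter through each lambda shape, and checks whether the parent timeout cancels that waiter.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class LinkedTokenRepro
{
    // Single-concurrency lock, mirroring the static SemaphoreSlim(1,1) shape.
    static readonly SemaphoreSlim s_semaphoreSlim = new SemaphoreSlim(1, 1);

    // Hypothetical stand-in: queues on the semaphore using the token it is given.
    static async Task<string> GetAccessTokenAsync(CancellationToken cancellationToken)
    {
        await s_semaphoreSlim.WaitAsync(cancellationToken).ConfigureAwait(false);
        try { return "token"; }
        finally { s_semaphoreSlim.Release(); }
    }

    // Buggy shape: not async, so the linked source is disposed at `return`.
    public static Task<string> FetchBuggy(CancellationToken parent)
    {
        using var tokenSource = CancellationTokenSource.CreateLinkedTokenSource(parent);
        return GetAccessTokenAsync(tokenSource.Token);
    }

    // Fixed shape: async + await keeps the linked source alive until completion.
    public static async Task<string> FetchFixed(CancellationToken parent)
    {
        using var tokenSource = CancellationTokenSource.CreateLinkedTokenSource(parent);
        return await GetAccessTokenAsync(tokenSource.Token).ConfigureAwait(false);
    }

    // True if the waiter is cancelled within 2s of starting, given a 200 ms parent timeout.
    public static async Task<bool> IsCancelledByParentAsync(Func<CancellationToken, Task<string>> fetch)
    {
        // Hold the semaphore to simulate a holder stuck retrying an unreachable endpoint.
        await s_semaphoreSlim.WaitAsync().ConfigureAwait(false);
        try
        {
            using var parent = new CancellationTokenSource(TimeSpan.FromMilliseconds(200));
            Task<string> waiter = fetch(parent.Token);
            Task grace = Task.Delay(TimeSpan.FromSeconds(2));
            Task first = await Task.WhenAny(waiter, grace).ConfigureAwait(false);
            return first == waiter && waiter.IsCanceled;
        }
        finally
        {
            s_semaphoreSlim.Release();
        }
    }

    static async Task Main()
    {
        Console.WriteLine($"buggy shape cancelled by parent: {await IsCancelledByParentAsync(FetchBuggy)}");
        Console.WriteLine($"fixed shape cancelled by parent: {await IsCancelledByParentAsync(FetchFixed)}");
    }
}
```

Under these assumptions the buggy shape should report `False` (the waiter stays queued until the semaphore is released) and the fixed shape should report `True`, matching the "with the bug" and "without the bug" behavior described above.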