Fix proactive token refresh bypassing cancellation, leading to unbounded semaphore wait by jayesh-a-shah · Pull Request #6054 · AzureAD/microsoft-authentication-library-for-dotnet

jayesh-a-shah · 2026-06-07T02:52:01Z

The Bug

PR #4471 changed the ProcessFetchInBackground lambda from:

() => GetAccessTokenAsync(cancellationToken, logger)

to:

() =>
{
    using var tokenSource = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    return GetAccessTokenAsync(tokenSource.Token, logger);
}

This lambda is not async. using var disposes tokenSource when the lambda body returns at return, which is before GetAccessTokenAsync completes. After disposal, the linked token is disconnected from its parent — cancelling the parent CTS no longer cancels the linked token.

Inside GetAccessTokenAsync, the code calls:

await s_semaphoreSlim.WaitAsync(cancellationToken).ConfigureAwait(false);

where s_semaphoreSlim is a static SemaphoreSlim(1,1) — a process-wide single-concurrency lock. Because the linked token is disconnected, WaitAsync can only complete when the semaphore is released — it can never be cancelled.

How This Causes Thread Starvation

Azure.Core's BearerTokenAuthenticationPolicy has a background token refresh mechanism. When a cached token approaches expiry, it spawns a background refresh with a 30-second timeout:

// BearerTokenAuthenticationPolicy.cs, line 37
private static readonly TimeSpan _tokenRefreshRetryDelay = TimeSpan.FromSeconds(30);

// line 371 — background refresh path
var cts = new CancellationTokenSource(_tokenRefreshRetryDelay);

This 30s CTS is passed through Azure.Identity → MSAL → ExecuteAsync(30sToken). When MSAL's NeedsRefresh() is true, it returns the cached token to the caller and spawns a Task.Run via ProcessFetchInBackground to refresh the token in the background. The lambda captures the 30s token.

Without the bug: The background task calls WaitAsync(linkedToken). If it can't acquire the semaphore within 30s, the parent CTS fires, cancellation propagates through the linked token, WaitAsync throws OperationCanceledException, and the task exits the semaphore queue. The queue stays small.

With the bug: The linked token is disconnected from the 30s parent CTS. The background task calls WaitAsync(disconnectedToken). The 30s timeout fires but has no effect — the task remains in the semaphore queue indefinitely, waiting for the semaphore to be released by whoever is ahead of it.

When the token endpoint (IMDS) is temporarily unreachable, the semaphore holder takes ~100s (retry loop) before failing and releasing. During the ~20-minute NeedsRefresh window, each incoming GetToken call spawns a new background task that joins the queue permanently. ~100+ tasks accumulate.

These tasks drain one-at-a-time: each acquires the semaphore, calls IMDS, fails after ~100s of retries, releases. Foreground threads that need a token enter the same queue behind all accumulated tasks. With ~139 tasks ahead, each taking ~100s, a foreground thread waits ~232 minutes.

The Fix

Make the lambda async and await the inner call:

async () =>
{
    using var tokenSource = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    return await GetAccessTokenAsync(tokenSource.Token, logger).ConfigureAwait(false);
}

In an async method, using var disposal happens after the await completes (the compiler transforms it into a state machine). The linked token stays connected to the parent CTS for the entire duration of the async operation. The 30s timeout from Azure.Core now correctly propagates — background tasks exit the semaphore queue after 30s instead of accumulating permanently.

Production Impact Observed

Service: Azure Site Recovery — Replication Configuration Manager (RCM)
Framework: net462, MSAL 4.83.1, Azure.Identity 1.19.0, Azure.Core 1.51.1
Stamp: ecy-pod01 (eastus2euap), Date: 2026-06-02, 20:48–00:40 UTC

RCM uses managed identity for Azure Storage access. When IMDS became unreachable on one node, ~139 proactive refresh tasks accumulated in the semaphore queue. 4 foreground threads (task execution engine scheduler and workflow threads) were blocked for 189–232 minutes. All completed within 1 second after IMDS recovered. The blocked scheduler thread prevented all task scheduling on the node for the entire duration.

Changes

Fixed all 4 affected call sites (all introduced by PR #4471):

ClientCredentialRequest.cs
ManagedIdentityAuthRequest.cs
OnBehalfOfRequest.cs
CacheSilentStrategy.cs

Standalone Repro

See issue #6053 for a console app that reproduces the bug on both net462 and net8.0.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates background refresh delegates to use async + ConfigureAwait(false) when fetching tokens in silent/background flows, reducing synchronization-context capture in library code.

Changes:

Converted background fetch delegates from non-async lambdas to async lambdas.
Added await ...ConfigureAwait(false) on token refresh / token acquisition tasks in background fetch paths.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
src/client/Microsoft.Identity.Client/Internal/Requests/Silent/CacheSilentStrategy.cs	Uses `await ...ConfigureAwait(false)` in background refresh delegate.
src/client/Microsoft.Identity.Client/Internal/Requests/OnBehalfOfRequest.cs	Uses `await ...ConfigureAwait(false)` in background refresh delegate.
src/client/Microsoft.Identity.Client/Internal/Requests/ManagedIdentityAuthRequest.cs	Uses `await ...ConfigureAwait(false)` in background token acquisition delegate.
src/client/Microsoft.Identity.Client/Internal/Requests/ClientCredentialRequest.cs	Uses `await ...ConfigureAwait(false)` in background token acquisition delegate.

jayesh-a-shah · 2026-06-07T03:13:18Z

Hi @pmaytak, @AzureAD/id4s-msal-team — would appreciate your review on this fix when you get a chance.

This addresses a subtle async disposal issue in the ProcessFetchInBackground lambda call sites. The using var linked CTS gets disposed before the async operation completes, which can break cancellation propagation to SemaphoreSlim.WaitAsync under certain conditions. Details and a standalone repro are in issue #6053.

The change is minimal — adding async/await to 4 lambdas so the CTS stays alive until the async operation completes.

Thank you!

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

… and causes SemaphoreSlim convoy The lambda passed to ProcessFetchInBackground uses 'using var' to create a linked CancellationTokenSource, but returns the Task without awaiting it. This causes the linked CTS to be disposed before the async operation completes, breaking the link to the parent cancellation token. As a result, WaitAsync on the static SemaphoreSlim(1,1) becomes permanently unkillable. When the token endpoint (IMDS) is temporarily unreachable, every proactive refresh background task becomes a permanent semaphore waiter, forming an unbounded convoy. Foreground threads that later need a token are blocked behind the convoy for hours. Fix: make the lambda async and await the inner call, so 'using var' disposal waits for the async operation to complete and parent cancellation propagates correctly through the linked token. Fixes all 4 affected locations introduced by PR AzureAD#4471: - ClientCredentialRequest.cs - ManagedIdentityAuthRequest.cs - OnBehalfOfRequest.cs - CacheSilentStrategy.cs

jayesh-a-shah · 2026-06-07T03:38:12Z

cc @bgavrilMS @gladjohn @pmaytak — would appreciate your review as you are familiar with this area of the codebase.

gladjohn

LGTM!!! Thanks @jayesh-a-shah

bgavrilMS · 2026-06-10T20:56:56Z

Verified the build passes on a separate PR. Contributor PRs don't seem to trigger ADO builds.

Copilot AI review requested due to automatic review settings June 7, 2026 02:52

jayesh-a-shah requested a review from a team as a code owner June 7, 2026 02:52

Copilot AI reviewed Jun 7, 2026

View reviewed changes

Comment thread src/client/Microsoft.Identity.Client/Internal/Requests/Silent/CacheSilentStrategy.cs

jayesh-a-shah mentioned this pull request Jun 7, 2026

Background token refresh ignores cancellation, causing threads to block for hours #6053

Closed

jayesh-a-shah force-pushed the fix/cts-disposal-convoy branch from 908a3dd to f2e817e Compare June 7, 2026 03:01

This comment was marked as outdated.

Sign in to view

jayesh-a-shah requested a review from Copilot June 7, 2026 03:31

Copilot AI reviewed Jun 7, 2026

View reviewed changes

Comment thread src/client/Microsoft.Identity.Client/Internal/Requests/Silent/CacheSilentStrategy.cs Outdated

jayesh-a-shah force-pushed the fix/cts-disposal-convoy branch from f2e817e to 858895c Compare June 7, 2026 03:36

jayesh-a-shah changed the title ~~Fix CTS disposal in ProcessFetchInBackground that breaks cancellation and causes SemaphoreSlim convoy~~ Fix proactive token refresh bypassing cancellation, leading to unbounded semaphore wait Jun 7, 2026

gladjohn approved these changes Jun 9, 2026

View reviewed changes

bgavrilMS approved these changes Jun 10, 2026

View reviewed changes

bgavrilMS enabled auto-merge (squash) June 10, 2026 17:54

bgavrilMS disabled auto-merge June 10, 2026 18:02

bgavrilMS mentioned this pull request Jun 10, 2026

Fix CTS disposal in ProcessFetchInBackground that breaks cancellation and causes SemaphoreSlim convoy #6056

Closed

bgavrilMS merged commit 37280a8 into AzureAD:main Jun 10, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix proactive token refresh bypassing cancellation, leading to unbounded semaphore wait#6054

Fix proactive token refresh bypassing cancellation, leading to unbounded semaphore wait#6054
bgavrilMS merged 1 commit into
AzureAD:mainfrom
jayesh-a-shah:fix/cts-disposal-convoy

jayesh-a-shah commented Jun 7, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

This comment was marked as outdated.

Uh oh!

jayesh-a-shah commented Jun 7, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

jayesh-a-shah commented Jun 7, 2026

Uh oh!

gladjohn left a comment

Uh oh!

bgavrilMS commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

jayesh-a-shah commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The Bug

How This Causes Thread Starvation

The Fix

Production Impact Observed

Changes

Standalone Repro

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

This comment was marked as outdated.

Uh oh!

jayesh-a-shah commented Jun 7, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

jayesh-a-shah commented Jun 7, 2026

Uh oh!

gladjohn left a comment

Choose a reason for hiding this comment

Uh oh!

bgavrilMS commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jayesh-a-shah commented Jun 7, 2026 •

edited

Loading