Skip to content

Fix proactive token refresh bypassing cancellation, leading to unbounded semaphore wait#6054

Merged
bgavrilMS merged 1 commit into
AzureAD:mainfrom
jayesh-a-shah:fix/cts-disposal-convoy
Jun 10, 2026
Merged

Fix proactive token refresh bypassing cancellation, leading to unbounded semaphore wait#6054
bgavrilMS merged 1 commit into
AzureAD:mainfrom
jayesh-a-shah:fix/cts-disposal-convoy

Conversation

@jayesh-a-shah

@jayesh-a-shah jayesh-a-shah commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Fixes #6053

The Bug

PR #4471 changed the ProcessFetchInBackground lambda from:

() => GetAccessTokenAsync(cancellationToken, logger)

to:

() =>
{
    using var tokenSource = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    return GetAccessTokenAsync(tokenSource.Token, logger);
}

This lambda is not async. using var disposes tokenSource when the lambda body returns at return, which is before GetAccessTokenAsync completes. After disposal, the linked token is disconnected from its parent — cancelling the parent CTS no longer cancels the linked token.

Inside GetAccessTokenAsync, the code calls:

await s_semaphoreSlim.WaitAsync(cancellationToken).ConfigureAwait(false);

where s_semaphoreSlim is a static SemaphoreSlim(1,1) — a process-wide single-concurrency lock. Because the linked token is disconnected, WaitAsync can only complete when the semaphore is released — it can never be cancelled.

How This Causes Thread Starvation

Azure.Core's BearerTokenAuthenticationPolicy has a background token refresh mechanism. When a cached token approaches expiry, it spawns a background refresh with a 30-second timeout:

// BearerTokenAuthenticationPolicy.cs, line 37
private static readonly TimeSpan _tokenRefreshRetryDelay = TimeSpan.FromSeconds(30);

// line 371 — background refresh path
var cts = new CancellationTokenSource(_tokenRefreshRetryDelay);

This 30s CTS is passed through Azure.Identity → MSAL → ExecuteAsync(30sToken). When MSAL's NeedsRefresh() is true, it returns the cached token to the caller and spawns a Task.Run via ProcessFetchInBackground to refresh the token in the background. The lambda captures the 30s token.

Without the bug: The background task calls WaitAsync(linkedToken). If it can't acquire the semaphore within 30s, the parent CTS fires, cancellation propagates through the linked token, WaitAsync throws OperationCanceledException, and the task exits the semaphore queue. The queue stays small.

With the bug: The linked token is disconnected from the 30s parent CTS. The background task calls WaitAsync(disconnectedToken). The 30s timeout fires but has no effect — the task remains in the semaphore queue indefinitely, waiting for the semaphore to be released by whoever is ahead of it.

When the token endpoint (IMDS) is temporarily unreachable, the semaphore holder takes ~100s (retry loop) before failing and releasing. During the ~20-minute NeedsRefresh window, each incoming GetToken call spawns a new background task that joins the queue permanently. ~100+ tasks accumulate.

These tasks drain one-at-a-time: each acquires the semaphore, calls IMDS, fails after ~100s of retries, releases. Foreground threads that need a token enter the same queue behind all accumulated tasks. With ~139 tasks ahead, each taking ~100s, a foreground thread waits ~232 minutes.

The Fix

Make the lambda async and await the inner call:

async () =>
{
    using var tokenSource = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    return await GetAccessTokenAsync(tokenSource.Token, logger).ConfigureAwait(false);
}

In an async method, using var disposal happens after the await completes (the compiler transforms it into a state machine). The linked token stays connected to the parent CTS for the entire duration of the async operation. The 30s timeout from Azure.Core now correctly propagates — background tasks exit the semaphore queue after 30s instead of accumulating permanently.

Production Impact Observed

Service: Azure Site Recovery — Replication Configuration Manager (RCM)
Framework: net462, MSAL 4.83.1, Azure.Identity 1.19.0, Azure.Core 1.51.1
Stamp: ecy-pod01 (eastus2euap), Date: 2026-06-02, 20:48–00:40 UTC

RCM uses managed identity for Azure Storage access. When IMDS became unreachable on one node, ~139 proactive refresh tasks accumulated in the semaphore queue. 4 foreground threads (task execution engine scheduler and workflow threads) were blocked for 189–232 minutes. All completed within 1 second after IMDS recovered. The blocked scheduler thread prevented all task scheduling on the node for the entire duration.

Changes

Fixed all 4 affected call sites (all introduced by PR #4471):

  • ClientCredentialRequest.cs
  • ManagedIdentityAuthRequest.cs
  • OnBehalfOfRequest.cs
  • CacheSilentStrategy.cs

Standalone Repro

See issue #6053 for a console app that reproduces the bug on both net462 and net8.0.

Copilot AI review requested due to automatic review settings June 7, 2026 02:52
@jayesh-a-shah jayesh-a-shah requested a review from a team as a code owner June 7, 2026 02:52

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates background refresh delegates to use async + ConfigureAwait(false) when fetching tokens in silent/background flows, reducing synchronization-context capture in library code.

Changes:

  • Converted background fetch delegates from non-async lambdas to async lambdas.
  • Added await ...ConfigureAwait(false) on token refresh / token acquisition tasks in background fetch paths.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/client/Microsoft.Identity.Client/Internal/Requests/Silent/CacheSilentStrategy.cs Uses await ...ConfigureAwait(false) in background refresh delegate.
src/client/Microsoft.Identity.Client/Internal/Requests/OnBehalfOfRequest.cs Uses await ...ConfigureAwait(false) in background refresh delegate.
src/client/Microsoft.Identity.Client/Internal/Requests/ManagedIdentityAuthRequest.cs Uses await ...ConfigureAwait(false) in background token acquisition delegate.
src/client/Microsoft.Identity.Client/Internal/Requests/ClientCredentialRequest.cs Uses await ...ConfigureAwait(false) in background token acquisition delegate.

jayesh-a-shah

This comment was marked as outdated.

@jayesh-a-shah

Copy link
Copy Markdown
Contributor Author

Hi @pmaytak, @AzureAD/id4s-msal-team — would appreciate your review on this fix when you get a chance.

This addresses a subtle async disposal issue in the ProcessFetchInBackground lambda call sites. The using var linked CTS gets disposed before the async operation completes, which can break cancellation propagation to SemaphoreSlim.WaitAsync under certain conditions. Details and a standalone repro are in issue #6053.

The change is minimal — adding async/await to 4 lambdas so the CTS stays alive until the async operation completes.

Thank you!

@jayesh-a-shah jayesh-a-shah requested a review from Copilot June 7, 2026 03:31

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

… and causes SemaphoreSlim convoy

The lambda passed to ProcessFetchInBackground uses 'using var' to create a
linked CancellationTokenSource, but returns the Task without awaiting it.
This causes the linked CTS to be disposed before the async operation completes,
breaking the link to the parent cancellation token. As a result, WaitAsync on
the static SemaphoreSlim(1,1) becomes permanently unkillable.

When the token endpoint (IMDS) is temporarily unreachable, every proactive
refresh background task becomes a permanent semaphore waiter, forming an
unbounded convoy. Foreground threads that later need a token are blocked
behind the convoy for hours.

Fix: make the lambda async and await the inner call, so 'using var' disposal
waits for the async operation to complete and parent cancellation propagates
correctly through the linked token.

Fixes all 4 affected locations introduced by PR AzureAD#4471:
- ClientCredentialRequest.cs
- ManagedIdentityAuthRequest.cs
- OnBehalfOfRequest.cs
- CacheSilentStrategy.cs
@jayesh-a-shah jayesh-a-shah force-pushed the fix/cts-disposal-convoy branch from f2e817e to 858895c Compare June 7, 2026 03:36
@jayesh-a-shah

Copy link
Copy Markdown
Contributor Author

cc @bgavrilMS @gladjohn @pmaytak — would appreciate your review as you are familiar with this area of the codebase.

@jayesh-a-shah jayesh-a-shah changed the title Fix CTS disposal in ProcessFetchInBackground that breaks cancellation and causes SemaphoreSlim convoy Fix proactive token refresh bypassing cancellation, leading to unbounded semaphore wait Jun 7, 2026

@gladjohn gladjohn left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!! Thanks @jayesh-a-shah

@bgavrilMS

Copy link
Copy Markdown
Member

Verified the build passes on a separate PR. Contributor PRs don't seem to trigger ADO builds.

This was referenced Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Background token refresh ignores cancellation, causing threads to block for hours

4 participants