IMDSv2 mTLS: Auto-recover from SCHANNEL handshake failures by evicting cached cert and re-minting by gladjohn · Pull Request #5761 · AzureAD/microsoft-authentication-library-for-dotnet

gladjohn · 2026-02-14T18:24:12Z

PR Title

IMDSv2 mTLS: Auto-recover from SCHANNEL handshake failures by evicting
cached cert and re-minting

Summary

This PR fixes a failure in the IMDSv2 mTLS PoP flow on Windows that is
triggered intermittently but, once triggered, persists: MSAL reuses a
cached/persisted mTLS binding certificate whose private key has become
unusable (e.g., corrupted, inaccessible, or otherwise rejected by
SCHANNEL).

When this occurs, the TLS handshake fails and the token request
consistently errors until the bad certificate is manually removed from
the persistent store.

The fix introduces self-healing behavior: when we detect a
SCHANNEL-style failure during the token request, we remove the cached
mTLS binding (both in-memory and persisted store) and retry once. This
forces minting a fresh certificate/key binding and restores normal
operation.

Problem / Issue Description

In the IMDSv2 mTLS PoP flow, MSAL caches the mTLS binding (certificate +
endpoint + canonical client_id) in:

- A process-local in-memory cache
- A best-effort Windows persistent store (CurrentUser\My) via
  FriendlyName tagging

In rare cases, the persisted certificate entry can appear valid (e.g.,
HasPrivateKey == true, unexpired) but still fail during TLS handshake
due to SCHANNEL rejecting the client credential.

This typically manifests as a transport-layer failure during the token
request:

System.Net.Sockets.SocketException (10054):
An existing connection was forcibly closed by the remote host

Wrapped as:

MsalServiceException
ErrorCode: managed_identity_unreachable_network

Because the cached binding is reused on subsequent requests, the failure
becomes sticky: every token acquisition attempt reuses the same bad
certificate and fails repeatedly until the certificate is manually
removed from the store.

Root Cause

The persistent cache selection logic previously treated a certificate as
usable if it:

- Matched the alias via FriendlyName
- Had sufficient remaining lifetime
- Reported HasPrivateKey == true

However, on Windows this is not sufficient.

HasPrivateKey does not guarantee the private key is actually
usable by SCHANNEL. The private key can be:

- Missing
- Inaccessible (ACL issues)
- Provider/KSP corrupted
- Container mismatch
- Hardware-backed state invalid after reboot

When SCHANNEL attempts the client-auth signature, the connection is
reset and the HTTP request fails.

Static validation cannot reliably predict this failure.
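
The static selection criteria listed in this section can be sketched as follows. This is an illustration under assumptions, not MSAL's code: the helper name `LooksUsable`, its parameters, and the exact-match alias comparison are hypothetical. It relies only on standard `X509Certificate2` properties, and a certificate can pass it while SCHANNEL still rejects the key during the handshake.

```csharp
using System;
using System.Security.Cryptography.X509Certificates;

public static class StaticCertCheck
{
    // Hypothetical helper mirroring the three static criteria described above.
    // Passing this check does NOT prove that SCHANNEL can use the private key.
    public static bool LooksUsable(
        X509Certificate2 cert, string alias, TimeSpan minRemainingLifetime)
    {
        if (cert == null) throw new ArgumentNullException(nameof(cert));

        // Assumption: the alias is stored verbatim in FriendlyName.
        bool aliasMatches =
            string.Equals(cert.FriendlyName, alias, StringComparison.Ordinal);

        // NotAfter is expressed in local time, so compare against DateTime.Now.
        bool hasLifetime = cert.NotAfter - DateTime.Now >= minRemainingLifetime;

        // Only reports that a key is associated, not that it can sign.
        bool reportsKey = cert.HasPrivateKey;

        return aliasMatches && hasLifetime && reportsKey;
    }
}
```

Because every input here is metadata, none of the failure modes listed above (ACL issues, provider corruption, container mismatch) would change the result.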

Fix

1) Detect SCHANNEL-style mTLS handshake failures

In ImdsV2ManagedIdentitySource.AuthenticateAsync, catch
MsalServiceException where:

- ErrorCode == managed_identity_unreachable_network
- The exception chain indicates SCHANNEL-style failure
  (e.g., SocketException 10054, "forcibly closed by the remote
  host")
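
A minimal sketch of this kind of detection, walking the inner-exception chain, is shown below. The class and method names are hypothetical, the `MsalServiceException.ErrorCode` check is omitted so the sample carries no MSAL dependency, and the message match is only a fallback assumption because the text can vary by platform and locale.

```csharp
using System;
using System.IO;
using System.Net.Sockets;

public static class HandshakeFailureDetector
{
    // Hypothetical helper: returns true when any exception in the chain
    // looks like a connection reset (SocketError.ConnectionReset == 10054).
    public static bool LooksLikeSchannelReset(Exception ex)
    {
        for (Exception current = ex; current != null; current = current.InnerException)
        {
            if (current is SocketException se &&
                se.SocketErrorCode == SocketError.ConnectionReset)
            {
                return true;
            }

            // Fallback on the message text when no SocketException is present.
            if (current is IOException &&
                current.Message.IndexOf(
                    "forcibly closed", StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return true;
            }
        }

        return false;
    }
}
```

A connection reset can also come from a genuine network fault, so a check like this is a hint rather than proof of a rejected client certificate.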

2) Evict the bad binding from caches

Remove the binding from:

- The in-memory cache
- The Windows persistent store (delete all certs for that alias)
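
The two-level eviction can be sketched roughly as below. Everything here is an assumption made for illustration: the `BindingCacheSketch` type, the use of a plain string as the cached binding, the delegate standing in for the persistent store, and the choice to use the cache key directly as the alias. The method is named after `RemoveBadCert` in this PR, but the logger parameter is left out and the real signature may differ.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class BindingCacheSketch
{
    // In-memory level: cache key -> binding (a string stands in for the real type).
    private readonly ConcurrentDictionary<string, string> _memory =
        new ConcurrentDictionary<string, string>(StringComparer.Ordinal);

    // Persistent level: hypothetical callback that deletes all certs for an alias.
    private readonly Action<string> _deleteAllForAlias;

    public BindingCacheSketch(Action<string> deleteAllForAlias)
    {
        _deleteAllForAlias = deleteAllForAlias
            ?? throw new ArgumentNullException(nameof(deleteAllForAlias));
    }

    public void Set(string cacheKey, string binding) => _memory[cacheKey] = binding;

    public bool TryGet(string cacheKey, out string binding) =>
        _memory.TryGetValue(cacheKey, out binding);

    // Evicts from memory first, then best-effort from the persistent store.
    public void RemoveBadCert(string cacheKey)
    {
        _memory.TryRemove(cacheKey, out _);

        try
        {
            _deleteAllForAlias(cacheKey);
        }
        catch (Exception)
        {
            // Assumption: the persistent store is best-effort, so a failed
            // delete must not block the retry that follows.
        }
    }
}
```

Removing the in-memory entry before touching the store means a store failure still leaves the process unable to reuse the bad binding from memory.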

3) Retry once

Immediately retry base.AuthenticateAsync(...).

Since the cached entry was removed, the flow re-mints a fresh
certificate and proceeds successfully.

Retry is strictly limited to one attempt.
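
The single-retry shape can be sketched as a small generic helper. The names `RetryOnce`, `RunAsync`, `isCertRejection`, and `evict` are hypothetical and only illustrate the control flow; the PR itself wires this logic into `ImdsV2ManagedIdentitySource` around `base.AuthenticateAsync(...)`.

```csharp
using System;
using System.Threading.Tasks;

public static class RetryOnce
{
    // Runs 'attempt'. If it fails with an exception that 'isCertRejection'
    // accepts, calls 'evict' and runs 'attempt' exactly one more time.
    public static async Task<T> RunAsync<T>(
        Func<Task<T>> attempt,
        Func<Exception, bool> isCertRejection,
        Action evict)
    {
        try
        {
            return await attempt().ConfigureAwait(false);
        }
        catch (Exception ex) when (isCertRejection(ex))
        {
            // Drop the cached binding so the next attempt mints a fresh one.
            evict();
        }

        // Second and final attempt: any failure here propagates to the caller.
        return await attempt().ConfigureAwait(false);
    }
}
```

Because the second call sits outside the `try` block, a repeated failure cannot trigger another eviction, which keeps the flow from looping on mint attempts.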

Key Code Changes

`ImdsV2ManagedIdentitySource`

- Added SCHANNEL failure detection.
- Added one-time retry logic with cache eviction.

`MtlsBindingCache`

Added RemoveBadCert(cacheKey, logger) to remove from:
- In-memory cache
- Persistent store

`WindowsPersistentCertificateCache`

- Added DeleteAllForAlias(alias, logger), which deletes all tagged certs
  for the alias (not just expired ones), enabling a full reset of a
  corrupted binding set.
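
A rough sketch of deleting every certificate tagged with an alias from CurrentUser\My follows. It uses the standard `X509Store` API, but the class name, the return value, the missing logger parameter, and the assumption that the alias equals the FriendlyName exactly are all illustrative; the PR's matching rule may differ.

```csharp
using System;
using System.Security.Cryptography.X509Certificates;

public static class PersistentStoreSketch
{
    // Hypothetical helper: removes every cert in CurrentUser\My whose
    // FriendlyName equals the alias, and returns how many were removed.
    public static int DeleteAllForAlias(string alias)
    {
        if (string.IsNullOrEmpty(alias))
            throw new ArgumentException("Alias is required.", nameof(alias));

        int removed = 0;

        using (var store = new X509Store(StoreName.My, StoreLocation.CurrentUser))
        {
            store.Open(OpenFlags.ReadWrite);

            // Certificates returns a snapshot, so removing while iterating is safe.
            foreach (X509Certificate2 cert in store.Certificates)
            {
                try
                {
                    // Assumption: exact, case-sensitive match on FriendlyName.
                    if (string.Equals(cert.FriendlyName, alias, StringComparison.Ordinal))
                    {
                        store.Remove(cert);
                        removed++;
                    }
                }
                finally
                {
                    cert.Dispose();
                }
            }
        }

        return removed;
    }
}
```

Deleting regardless of expiry is the point of the change described above: a corrupted but unexpired certificate would otherwise survive cleanup and be selected again.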

Behavioral Impact

✅ No change for normal success paths.

✅ When SCHANNEL rejects a cached certificate, the system recovers
automatically without manual cleanup.

✅ Cache eviction is scoped to the affected alias (including attestation
tag), so unrelated bindings are not impacted.

⚠️ Retry is limited to one attempt to avoid masking genuine network
failures or causing repeated mint loops.

Example Failure This Fix Addresses

Typical failure:

MsalServiceException: managed_identity_unreachable_network

Inner exceptions:

HttpRequestException
IOException: ... forcibly closed by the remote host
SocketException (10054)

After this PR:

- The failure is detected as a SCHANNEL-style client cert rejection.
- The cached/persisted certificate is purged.
- The request is retried.
- A fresh mTLS binding is minted.
- Token acquisition succeeds.

Testing

Unit tests have been added.

src/client/Microsoft.Identity.Client/ManagedIdentity/V2/ImdsV2ManagedIdentitySource.cs

src/client/Microsoft.Identity.Client/ManagedIdentity/V2/MtlsCertificateCache.cs

src/client/Microsoft.Identity.Client/ManagedIdentity/V2/IPersistentCertificateCache.cs

src/client/Microsoft.Identity.Client/ManagedIdentity/V2/WindowsPersistentCertificateCache.cs

mtls

531683f

gladjohn requested a review from a team as a code owner February 14, 2026 18:24

bgavrilMS reviewed Feb 19, 2026

src/client/Microsoft.Identity.Client/ManagedIdentity/V2/ImdsV2ManagedIdentitySource.cs

bgavrilMS reviewed Feb 19, 2026

src/client/Microsoft.Identity.Client/ManagedIdentity/V2/MtlsCertificateCache.cs

bgavrilMS reviewed Feb 19, 2026

src/client/Microsoft.Identity.Client/ManagedIdentity/V2/IPersistentCertificateCache.cs

bgavrilMS reviewed Feb 19, 2026

src/client/Microsoft.Identity.Client/ManagedIdentity/V2/WindowsPersistentCertificateCache.cs (outdated)

bgavrilMS approved these changes Feb 19, 2026

trwalke reviewed Feb 23, 2026

src/client/Microsoft.Identity.Client/ManagedIdentity/V2/MtlsCertificateCache.cs

trwalke approved these changes Feb 23, 2026

gladjohn added 5 commits February 24, 2026 11:31

Merge branch 'main' into gladjohn/fix-mtls-handshake-failures

270abfc

pr comments

e809575

Merge branch 'gladjohn/fix-mtls-handshake-failures' of https://github…

8107126

….com/AzureAD/microsoft-authentication-library-for-dotnet into gladjohn/fix-mtls-handshake-failures

Merge branch 'main' into gladjohn/fix-mtls-handshake-failures

6a2e4cd

Update template-run-mi-e2e-imdsv2.yaml

b0e3905

trwalke approved these changes Feb 24, 2026

gladjohn merged commit e95d7c6 into main Feb 24, 2026
12 checks passed

gladjohn deleted the gladjohn/fix-mtls-handshake-failures branch February 24, 2026 22:30

gladjohn merged 6 commits into main from gladjohn/fix-mtls-handshake-failures

3 participants