IMDSv2 mTLS: Auto-recover from SCHANNEL handshake failures by evicting cached cert and re-minting #5761

Merged
gladjohn merged 6 commits into main from gladjohn/fix-mtls-handshake-failures on Feb 24, 2026

@gladjohn
Contributor

Fixes #5755

PR Title

IMDSv2 mTLS: Auto-recover from SCHANNEL handshake failures by evicting
cached cert and re-minting


Summary

This PR fixes a rare but sticky failure in the IMDSv2 mTLS PoP flow on
Windows, in which MSAL reuses a cached/persisted mTLS binding
certificate whose private key has become unusable (e.g., corrupted,
inaccessible, or otherwise rejected by SCHANNEL).

When this occurs, the TLS handshake fails and every subsequent token
request fails the same way until the bad certificate is manually
removed from the persistent store.

The fix introduces self-healing behavior: when we detect a
SCHANNEL-style failure during the token request, we remove the cached
mTLS binding (both in-memory and persisted store) and retry once. This
forces minting a fresh certificate/key binding and restores normal
operation.


Problem / Issue Description

In the IMDSv2 mTLS PoP flow, MSAL caches the mTLS binding (certificate +
endpoint + canonical client_id) in:

  • A process-local in-memory cache
  • A best-effort Windows persistent store (CurrentUser\My) via
    FriendlyName tagging

In rare cases, the persisted certificate entry can appear valid (e.g.,
HasPrivateKey == true, unexpired) but still fail during TLS handshake
due to SCHANNEL rejecting the client credential.

This typically manifests as a transport-layer failure during the token
request:

System.Net.Sockets.SocketException (10054):
An existing connection was forcibly closed by the remote host

Wrapped as:

MsalServiceException
ErrorCode: managed_identity_unreachable_network

Because the cached binding is reused on subsequent requests, the failure
becomes sticky: every token acquisition attempt reuses the same bad
certificate and fails repeatedly until the certificate is manually
removed from the store.


Root Cause

The persistent cache selection logic previously treated a certificate as
usable if it:

  • Matched the alias via FriendlyName
  • Had sufficient remaining lifetime
  • Reported HasPrivateKey == true

However, on Windows this is not sufficient.

HasPrivateKey does not guarantee the private key is actually
usable by SCHANNEL. The private key can be:

  • Missing
  • Inaccessible (ACL issues)
  • Held by a corrupted provider/KSP
  • Bound to a mismatched key container
  • Hardware-backed, with state that is invalid after a reboot

When SCHANNEL attempts the client-auth signature, the connection is
reset and the HTTP request fails.

Static validation cannot reliably predict this failure.
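
To illustrate why the static check is not enough, here is a minimal
sketch in Python (a language-neutral model, not the MSAL.NET code).
CertRecord, looks_usable, the key_usable field, and the 24-hour
minimum lifetime are all hypothetical names and values chosen for the
example; only the three criteria themselves come from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CertRecord:
    """Hypothetical stand-in for a persisted certificate entry."""
    friendly_name: str
    not_after: datetime
    has_private_key: bool  # what the store reports
    key_usable: bool       # only observable by actually signing with the key


def looks_usable(cert: CertRecord, alias: str, now: datetime,
                 min_remaining: timedelta = timedelta(hours=24)) -> bool:
    """Static selection check: alias match, remaining lifetime, key flag.

    The 24-hour threshold is an assumption made for this sketch.
    """
    return (cert.friendly_name == alias
            and cert.not_after - now >= min_remaining
            and cert.has_private_key)


now = datetime(2026, 2, 14, tzinfo=timezone.utc)
bad = CertRecord("alias-1", now + timedelta(days=7),
                 has_private_key=True, key_usable=False)

# The static check accepts the certificate even though the key cannot sign.
static_ok = looks_usable(bad, "alias-1", now)
```

The check never reads key_usable, because nothing short of a real
signing operation can observe it. That is why the fix reacts to the
handshake failure instead of trying to predict it.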


Fix

1) Detect SCHANNEL-style mTLS handshake failures

In ImdsV2ManagedIdentitySource.AuthenticateAsync, catch
MsalServiceException where:

  • ErrorCode == managed_identity_unreachable_network
  • The exception chain indicates SCHANNEL-style failure
    (e.g., SocketException 10054, "forcibly closed by the remote
    host")
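
The detection rule above can be sketched as a walk over the exception
chain. This is a Python model of the idea, not the MSAL.NET
implementation: ServiceError and SocketError are hypothetical
stand-ins for MsalServiceException and SocketException, and the exact
matching rules (error code 10054 or the "forcibly closed" message)
are taken from the examples in this description rather than from the
merged code.

```python
from typing import Iterator, Optional

WSAECONNRESET = 10054  # Winsock code for a connection reset by the peer
UNREACHABLE_NETWORK = "managed_identity_unreachable_network"


class SocketError(Exception):
    """Hypothetical stand-in for SocketException."""
    def __init__(self, code: int, message: str):
        super().__init__(message)
        self.code = code


class ServiceError(Exception):
    """Hypothetical stand-in for MsalServiceException."""
    def __init__(self, error_code: str, message: str = ""):
        super().__init__(message)
        self.error_code = error_code


def exception_chain(exc: Optional[BaseException]) -> Iterator[BaseException]:
    """Yield exc and each inner exception, guarding against cycles."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        yield exc
        exc = exc.__cause__ or exc.__context__


def is_schannel_style_failure(exc: BaseException) -> bool:
    """True only for the unreachable-network code with a reset underneath."""
    if getattr(exc, "error_code", None) != UNREACHABLE_NETWORK:
        return False
    for inner in exception_chain(exc):
        if isinstance(inner, SocketError) and inner.code == WSAECONNRESET:
            return True
        if "forcibly closed by the remote host" in str(inner).lower():
            return True
    return False


# Build the chain from the example: service error -> I/O error -> socket error.
sock = SocketError(WSAECONNRESET,
                   "An existing connection was forcibly closed by the remote host")
io_err = OSError("Unable to read data from the transport connection")
io_err.__cause__ = sock
svc = ServiceError(UNREACHABLE_NETWORK, "token request failed")
svc.__cause__ = io_err

detected = is_schannel_style_failure(svc)
```

Requiring both the outer error code and the inner reset keeps the
match narrow, so an ordinary timeout wrapped in the same error code
does not trigger eviction.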

2) Evict the bad binding from caches

Remove the binding from:

  • The in-memory cache
  • The Windows persistent store (delete all certs for that alias)
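
A minimal Python sketch of the two-tier eviction, assuming the cache
key doubles as the persistent-store alias (an assumption for this
example). BindingCache and the injected delete_persisted callback are
hypothetical; the real change lives in MtlsBindingCache.RemoveBadCert.

```python
from typing import Callable, Dict, List, Optional


class BindingCache:
    """Hypothetical two-tier cache: in-memory dict plus a persisted tier."""

    def __init__(self, delete_persisted: Callable[[str], object]):
        self._memory: Dict[str, str] = {}
        self._delete_persisted = delete_persisted

    def put(self, key: str, binding: str) -> None:
        self._memory[key] = binding

    def get(self, key: str) -> Optional[str]:
        return self._memory.get(key)

    def remove_bad_cert(self, key: str) -> None:
        # Drop the in-memory entry first so this process stops reusing it.
        self._memory.pop(key, None)
        # The persisted tier is best-effort: a failure to delete must not
        # mask the original handshake error or block the retry.
        try:
            self._delete_persisted(key)
        except OSError:
            pass


# A dict of alias -> thumbprints stands in for the certificate store.
persisted: Dict[str, List[str]] = {
    "alias-1": ["thumb-a", "thumb-b"],
    "alias-2": ["thumb-c"],
}
cache = BindingCache(lambda alias: persisted.pop(alias, None))
cache.put("alias-1", "binding-1")
cache.put("alias-2", "binding-2")

cache.remove_bad_cert("alias-1")
```

After the call, both tiers have forgotten alias-1 while alias-2 is
untouched, so the next lookup for alias-1 misses and triggers a fresh
mint.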

3) Retry once

Immediately retry base.AuthenticateAsync(...).

Because the cached entry has been removed, the retry mints a fresh
certificate instead of reusing the rejected one, so the handshake can
succeed.

Retry is strictly limited to one attempt.
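
The detect, evict, retry-once control flow can be sketched as follows.
This is a Python model under stated assumptions, not the code in
ImdsV2ManagedIdentitySource: authenticate_with_recovery and its three
callbacks are hypothetical names introduced for illustration.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def authenticate_with_recovery(
    authenticate: Callable[[], T],
    evict_binding: Callable[[], None],
    is_cert_rejection: Callable[[Exception], bool],
) -> T:
    """Run authenticate(); on a cert-rejection failure, evict and retry once."""
    try:
        return authenticate()
    except Exception as first_error:
        if not is_cert_rejection(first_error):
            raise  # unrelated failures are not retried
        evict_binding()
    # Single retry; a second failure propagates to the caller unchanged.
    return authenticate()


class FailsOnce:
    """Simulates a request that fails with a reset, then succeeds."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls == 1:
            raise ConnectionResetError(10054, "forcibly closed by the remote host")
        return "token"


evicted: List[str] = []
request = FailsOnce()
result = authenticate_with_recovery(
    request,
    lambda: evicted.append("alias-1"),
    lambda e: isinstance(e, ConnectionResetError),
)
```

Placing the retry outside the except block means there is no loop to
bound: if the second attempt also fails, the error surfaces as it
would have before this change.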


Key Code Changes

ImdsV2ManagedIdentitySource

  • Added SCHANNEL failure detection.
  • Added one-time retry logic with cache eviction.

MtlsBindingCache

  • Added RemoveBadCert(cacheKey, logger) to remove from:
    • In-memory cache
    • Persistent store

WindowsPersistentCertificateCache

  • Added DeleteAllForAlias(alias, logger)
    Deletes all tagged certs for the alias (not just expired ones),
    enabling a full reset of a corrupted binding set.
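
To show why deleting only expired certificates would not be enough
here, the Python sketch below contrasts the two filters on a toy
store. StoredCert, the function names, and the alias strings
(including the "|attested" suffix standing in for the attestation
tag) are hypothetical; the actual FriendlyName format is not shown in
this description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class StoredCert:
    """Hypothetical stand-in for a tagged certificate in the store."""
    friendly_name: str
    not_after: datetime


def delete_expired_for_alias(store: List[StoredCert], alias: str,
                             now: datetime) -> List[StoredCert]:
    """Expired-only cleanup: an unexpired but unusable cert survives."""
    return [c for c in store
            if not (c.friendly_name == alias and c.not_after <= now)]


def delete_all_for_alias(store: List[StoredCert],
                         alias: str) -> List[StoredCert]:
    """Full reset: every cert tagged with the alias is removed."""
    return [c for c in store if c.friendly_name != alias]


now = datetime(2026, 2, 14, tzinfo=timezone.utc)
alias_a = "example-alias-a|attested"  # hypothetical alias with attestation tag
alias_b = "example-alias-b|attested"

store = [
    StoredCert(alias_a, now + timedelta(days=7)),  # unexpired, key unusable
    StoredCert(alias_a, now - timedelta(days=1)),  # expired
    StoredCert(alias_b, now + timedelta(days=7)),  # unrelated binding
]

after_expired_only = delete_expired_for_alias(store, alias_a, now)
after_full_reset = delete_all_for_alias(store, alias_a)
```

The expired-only pass leaves the unexpired certificate for alias_a in
place, so it would be selected again on the next request. The full
reset removes it while leaving alias_b alone.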

Behavioral Impact

✅ No change for normal success paths.

✅ When SCHANNEL rejects a cached certificate, the system recovers
automatically without manual cleanup.

✅ Cache eviction is scoped to the affected alias (including attestation
tag), so unrelated bindings are not impacted.

⚠️ Retry is limited to one attempt to avoid masking genuine network
failures or causing repeated mint loops.


Example Failure This Fix Addresses

Typical failure:

MsalServiceException: managed_identity_unreachable_network

Inner exceptions:

HttpRequestException
IOException: ... forcibly closed by the remote host
SocketException (10054)

After this PR:

  1. The failure is detected as a SCHANNEL-style client cert rejection.
  2. The cached/persisted certificate is purged.
  3. The request is retried.
  4. A fresh mTLS binding is minted.
  5. Token acquisition succeeds.

Testing

Unit tests have been added.

@gladjohn gladjohn requested a review from a team as a code owner February 14, 2026 18:24
Development

Successfully merging this pull request may close these issues.

Managed Identity V2: Inaccessible KeyGuard-backed certificate causes mTLS failures with no automatic recovery
