
[AAD token revocation]: Adds logic for handling emergency token revocation and CAE token revocation. #5549

Open: aavasthy wants to merge 26 commits into main from users/aavasthy/tokenrevocation

@aavasthy aavasthy commented Jan 6, 2026

Pull Request Template

Description

AAD Token Revocation (Emergency revocation) — SDK Implementation Spec

1. Feature Summary

The Azure Cosmos DB .NET SDK handles AAD token revocation transparently for the customer. When a token is revoked — either through an emergency revocation rule or a Continuous Access Evaluation (CAE) event — the routing gateway rejects the request. The SDK detects the rejection, extracts the claims challenge from the response, requests a fresh token from Microsoft Entra ID with those claims, and retries the original request. The customer's operation succeeds without any application-level error handling.

This covers both revocation mechanisms:

  • Emergency Token Revocation: Cosmos DB's routing gateway and compute gateway evaluate predefined rules based on JWT claims (oid, appid, tid, iat, uti). When a rule matches, the gateway returns 401 Unauthorized with substatus 5013 and a WWW-Authenticate header containing a claims challenge.
  • Continuous Access Evaluation (CAE): The backend CAE library rejects the request with 401 Unauthorized and a WWW-Authenticate header containing error="insufficient_claims" and a claims challenge.

The SDK behavior is identical for both revocation types.


2. How Revocation Works End-to-End

2.1 Normal operation (before revocation)

  1. The customer creates a CosmosClient with a TokenCredential (e.g., DefaultAzureCredential).
  2. The SDK calls the credential's GetTokenAsync method to obtain an AAD token from Entra. The SDK includes xms_cc=["cp1"] in the token request to signal that it supports CAE client capabilities.
  3. Entra returns a token. The SDK caches this token and uses it for all subsequent requests to the Cosmos DB routing gateway.
  4. The gateway validates the token on every request it receives. If the token is valid, the request proceeds normally.

2.2 Token gets revoked

Emergency Revocation: An administrator creates a revocation rule in the Cosmos DB control plane. The rule specifies matching criteria based on JWT claims (e.g., revoke all tokens with a specific uti value). The gateway evaluates these rules and starts rejecting tokens that match.
The next request from the SDK that carries the revoked token will be rejected by the gateway.

2.3 The routing gateway rejects the request

When the gateway determines that a token has been revoked, it returns:

HTTP/1.1 401 Unauthorized
x-ms-substatus: 5013
WWW-Authenticate: Bearer realm="", authorization_uri="", error="insufficient_claims", claims="<base64-encoded-claims>"

The claims value is a base64-encoded JSON string that tells Entra what the new token must satisfy:

{"access_token":{"nbf":{"essential":false,"value":"<unix_timestamp>"}}}

The nbf (not before) claim with the current timestamp ensures Entra issues a token that was created after the revocation event.
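
To make the encoding concrete, here is a minimal sketch of how such a challenge could be built and decoded. It is written in Python purely for illustration (the SDK itself is C#), and the function names build_claims_challenge and decode_claims_challenge are hypothetical, not SDK or gateway APIs. It assumes standard base64, as described above.

```python
import base64
import json


def build_claims_challenge(revoked_at: int) -> str:
    """Encode an nbf claims challenge in the shape shown in section 2.3."""
    payload = {"access_token": {"nbf": {"essential": False, "value": str(revoked_at)}}}
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_claims_challenge(encoded: str) -> dict:
    """Decode the base64 claims value back into a JSON object."""
    return json.loads(base64.b64decode(encoded))


# Round trip with an example timestamp.
challenge = build_claims_challenge(1712345678)
decoded = decode_claims_challenge(challenge)
```

The decoded object carries the timestamp as a string under access_token.nbf.value, matching the JSON shown above.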

2.4 The SDK detects the revocation

The SDK detects the revocation when it receives a 401 Unauthorized response that meets either of these conditions:

  • The response has substatus 5013 (emergency revocation)
  • The response has a non-empty WWW-Authenticate header (CAE revocation)

The SDK additionally validates that the WWW-Authenticate header contains insufficient_claims or claims= before proceeding with the retry. If neither is present, the 401 is treated as a normal authentication failure and is not retried.

This detection only applies when the client uses TokenCredential-based authentication (AAD).
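
The detection rule in this section can be sketched as a single predicate. This is an illustrative Python sketch, not the SDK's C# code: the function name, the parameter names, and the plain-dictionary header lookup are assumptions made for the example.

```python
from typing import Mapping, Optional

AAD_TOKEN_REVOKED_SUBSTATUS = 5013  # substatus named in this spec


def is_revocation_response(
    status: int,
    substatus: Optional[int],
    headers: Mapping[str, str],
    uses_token_credential: bool,
) -> bool:
    """Return True only when a 401 looks like a token revocation."""
    # Master-key clients and non-401 responses never qualify.
    if not uses_token_credential or status != 401:
        return False
    www_authenticate = headers.get("WWW-Authenticate", "")
    # Either the emergency substatus or a non-empty header opens the gate.
    if substatus != AAD_TOKEN_REVOKED_SUBSTATUS and not www_authenticate:
        return False
    # The header must still carry a claims challenge to be retried.
    return "insufficient_claims" in www_authenticate or "claims=" in www_authenticate
```

Under these assumptions, a 401 whose WWW-Authenticate header carries neither insufficient_claims nor claims= is treated as an ordinary authentication failure.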

2.5 The SDK extracts claims and resets the token cache

Once revocation is detected, the SDK:

  1. Extracts the claims challenge from the WWW-Authenticate header by parsing the claims="<base64>" value.
  2. Resets the token cache by clearing the cached token, stopping any background refresh, and storing the extracted claims challenge for the next token request.

After this step, the next call to get a token will not return the cached (revoked) token — it will request a fresh one from Entra.
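
The two steps above might look like the following sketch. It is Python for illustration only; extract_claims and TokenCache are hypothetical names, the regular expression is one possible way to read the claims="..." parameter, and stopping the background refresh is left out.

```python
import re
from typing import Optional

# Match a claims="..." parameter; \b keeps "insufficient_claims" from matching.
_CLAIMS_PATTERN = re.compile(r'\bclaims="([^"]*)"')


def extract_claims(www_authenticate: str) -> Optional[str]:
    """Return the base64 claims value, or None if it is absent or empty."""
    match = _CLAIMS_PATTERN.search(www_authenticate)
    if match is None or not match.group(1):
        return None
    return match.group(1)


class TokenCache:
    """Toy cache holding one token and one pending claims challenge."""

    def __init__(self) -> None:
        self.token: Optional[str] = None
        self.pending_claims: Optional[str] = None

    def reset(self, claims: Optional[str]) -> None:
        # Drop the revoked token and remember the challenge for the next request.
        self.token = None
        self.pending_claims = claims


header = 'Bearer realm="", authorization_uri="", error="insufficient_claims", claims="eyJhIjoxfQ=="'
cache = TokenCache()
cache.token = "revoked-token"
cache.reset(extract_claims(header))
```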

2.6 The SDK requests a fresh token from Entra with merged claims

When the SDK needs a new token for the retry, it:

  1. Merges the claims from the server's challenge with the SDK's client capabilities. The server's challenge contains nbf (not before). The SDK adds xms_cc=["cp1"] (CAE client capability). The merged result looks like:
    {"access_token":{"nbf":{"essential":false,"value":"1712345678"},"xms_cc":{"values":["cp1"]}}}
  2. Calls the customer's credential (TokenCredential.GetTokenAsync) with a TokenRequestContext that includes the merged claims as the Claims property.
  3. Entra receives the claims and issues a new token that satisfies the nbf requirement — meaning the token was issued after the revocation timestamp.

The customer's TokenCredential implementation (e.g., DefaultAzureCredential, InteractiveBrowserCredential) handles the claims automatically through the Azure Identity library's built-in CAE support.
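
As a rough illustration of the merge in step 1, the sketch below parses the challenge as JSON and adds the client capability. It is Python rather than the SDK's C#, merge_claims is a hypothetical name, and it assumes the challenge arrives as standard base64 in the shape shown in section 2.3.

```python
import base64
import json


def merge_claims(challenge_b64: str) -> str:
    """Add xms_cc=["cp1"] to the server's claims challenge and return JSON."""
    claims = json.loads(base64.b64decode(challenge_b64))
    access_token = claims.setdefault("access_token", {})
    access_token["xms_cc"] = {"values": ["cp1"]}
    return json.dumps(claims, separators=(",", ":"))


server_json = b'{"access_token":{"nbf":{"essential":false,"value":"1712345678"}}}'
challenge = base64.b64encode(server_json).decode("ascii")
merged = merge_claims(challenge)
```

With this input, merged has the same shape as the merged result shown in step 1.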

2.7 The SDK retries the original request

The SDK retries the exact same operation that originally failed, now using the fresh token. The gateway validates the new token and, if the token satisfies the revocation requirements, processes the request normally. The customer's operation succeeds.

2.8 The claims challenge is cleared

After the fresh token is successfully acquired, the stored claims challenge is cleared. Future normal token requests (including background refreshes) will only include the xms_cc/cp1 client capability without the nbf claim.

2.9 Retry limit

The SDK retries the failed request exactly once. If the retry also fails with a 401 (for any reason — another revocation, invalid token, permission change, etc.), the SDK does not retry again. The error is returned to the customer.

This prevents infinite retry loops in scenarios where the token is permanently invalid or the revocation rule cannot be satisfied.
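
The single-retry rule can be sketched as follows. This is an illustrative Python outline with hypothetical callables (send, refresh_token, is_revocation), not the SDK's retry policy.

```python
from typing import Callable, List, Tuple

Response = Tuple[int, str]  # (status code, body); a stand-in for a real response


def send_with_revocation_retry(
    send: Callable[[], Response],
    refresh_token: Callable[[], None],
    is_revocation: Callable[[Response], bool],
) -> Response:
    """Send once; on a revocation response, refresh the token and retry once."""
    response = send()
    if is_revocation(response):
        refresh_token()
        response = send()  # the only retry; its result is returned as-is
    return response


# A credential that stays revoked: the second 401 goes back to the caller.
calls: List[str] = []


def always_revoked() -> Response:
    calls.append("send")
    return (401, "revoked")


result = send_with_revocation_retry(
    always_revoked,
    lambda: calls.append("refresh"),
    lambda r: r[0] == 401,
)
```

In this run the request is sent twice with one refresh in between, and the final 401 is surfaced rather than retried again.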


3. Detection Criteria

The SDK triggers the revocation retry when both conditions are met:

  1. HTTP status code is 401 Unauthorized
  2. Either substatus is 5013 or the response contains a non-empty WWW-Authenticate header

Additionally, the WWW-Authenticate header must contain insufficient_claims or claims= for the claims extraction to proceed. If neither is present, the retry is not triggered.

The retry only applies when the client uses TokenCredential-based authentication (AAD).


4. Revocation Coverage by Connection Mode

Revocation rules are enforced by the routing gateway on every HTTP request it receives. The SDK retries once with a fresh token when it detects a revocation response.

4.1 Gateway Mode

In gateway mode, all operations go through HTTP to the routing gateway. Every operation is subject to token revocation.

| Operation | Revocation supported? | Retry supported? |
| --- | --- | --- |
| Document create, read, update, delete, patch | ✅ Yes | ✅ Yes |
| Document query, ReadMany, ChangeFeed | ✅ Yes | ✅ Yes |
| Database create, read, delete | ✅ Yes | ✅ Yes |
| Container create, read, replace, delete | ✅ Yes | ✅ Yes |
| User and permission operations | ✅ Yes | ✅ Yes |
| Account read (client initialization) | ✅ Yes | ✅ Yes |
| Collection metadata cache refresh | ✅ Yes | ✅ Yes |
| Partition key range resolution | ✅ Yes | ✅ Yes |

All data-plane and control-plane operations in gateway mode go through GatewayStoreModel, which is covered by ClientRetryPolicy. The account read during client initialization is covered by a dedicated retry in GatewayAccountReader.

4.2 Direct Mode

In direct mode, document operations go directly to backend replicas using the RNTBD binary protocol over TCP. RNTBD does not pass through the gateway, so document operations are not subject to revocation.

All other operations (control-plane, metadata, address resolution) still go through HTTP to the gateway and are fully covered.

| Operation | Revocation supported? | Retry supported? | Notes |
| --- | --- | --- | --- |
| Document create, read, update, delete, patch | ❌ No | N/A | Uses RNTBD; bypasses the routing gateway entirely |
| Document query, ReadMany, ChangeFeed | ❌ No | N/A | Uses RNTBD; bypasses the routing gateway entirely |
| Database create, read, delete | ✅ Yes | ✅ Yes | Always routed through the gateway |
| Container create, read, replace, delete | ✅ Yes | ✅ Yes | Always routed through the gateway |
| User and permission operations | ✅ Yes | ✅ Yes | Always routed through the gateway |
| Account read (client initialization) | ✅ Yes | ✅ Yes | Uses HTTP; covered by GatewayAccountReader retry |
| Address resolution | ✅ Yes | ✅ Yes | Uses HTTP; covered by GatewayAddressCache retry |
| Collection metadata cache refresh | ✅ Yes | ✅ Yes | Uses HTTP; covered by ClientRetryPolicy |
| Partition key range resolution | ✅ Yes | ✅ Yes | Uses HTTP; covered by ClientRetryPolicy |

4.3 ThinClient Mode

ThinClient mode behaves exactly like direct mode. Data-plane operations are not subject to gateway revocation. All metadata and control-plane operations go through the gateway and are fully covered.

| Operation | Revocation supported? | Retry supported? | Notes |
| --- | --- | --- | --- |
| Document create, read, update, delete, patch | ❌ No | N/A | Not subject to gateway revocation |
| Document query, ReadMany, ChangeFeed | ❌ No | N/A | Not subject to gateway revocation |
| Database create, read, delete | ✅ Yes | ✅ Yes | Always routed through the gateway |
| Container create, read, replace, delete | ✅ Yes | ✅ Yes | Always routed through the gateway |
| User and permission operations | ✅ Yes | ✅ Yes | Always routed through the gateway |
| Account read (client initialization) | ✅ Yes | ✅ Yes | Covered by GatewayAccountReader retry |
| Address resolution | ✅ Yes | ✅ Yes | Covered by GatewayAddressCache retry |
| Collection metadata cache refresh | ✅ Yes | ✅ Yes | Covered by ClientRetryPolicy |
| Partition key range resolution | ✅ Yes | ✅ Yes | Covered by ClientRetryPolicy |

ThinClient mode cannot be tested with the local emulator because it requires server-side enablement.


5. Paths Not Covered

5.1 Direct mode document operations

Document create, read, update, delete, patch, query, ReadMany, and ChangeFeed operations in direct mode use the RNTBD binary protocol over TCP. These requests go directly to backend replicas without passing through the routing gateway. The routing gateway never sees these requests, so its revocation rules are never evaluated. The SDK does not attempt revocation retry on this path because revocation cannot occur here.

5.2 Background account refresh

The SDK periodically refreshes account information in the background via GlobalEndpointManager. This background refresh uses a different code path than the initial account read and does not have revocation retry logic. If a revocation 401 occurs during a background refresh, the refresh fails silently. The SDK continues operating with previously cached account information. The next user-initiated operation that triggers a token refresh (through any of the covered paths) will handle the revocation and obtain a fresh token.


6. Files Modified

| File | Change |
| --- | --- |
| ClientRetryPolicy.cs | Detection condition: 401 + substatus 5013 OR 401 + non-empty WWW-Authenticate |
| AuthorizationTokenProviderTokenCredential.cs | Added TryHandleRevocationException static helper for out-of-pipeline retry paths |
| TokenCredentialCache.cs | ResetCachedToken(claims) stores challenge; MergeClaimsWithClientCapabilities merges nbf + xms_cc; clears challenge after successful token acquisition |
| GatewayAccountReader.cs | catch when revocation retry on account read with one retry |
| GatewayAddressCache.cs | catch when revocation retry on master and server address resolution with one retry each |
| GlobalAddressResolver.cs | Threads AuthorizationTokenProvider to GatewayAddressCache (optional parameter, default null) |
| DocumentClient.cs | Passes cosmosAuthorization to GlobalAddressResolver |
| FaultInjectionServerErrorResultInternal.cs | AadTokenRevoked error type now includes WWW-Authenticate header with claims challenge |

7. Non-Regression Guarantees

  • All catch when filters check authorizationTokenProvider is AuthorizationTokenProviderTokenCredential — returns false for master key auth
  • All new constructor parameters are optional with default null — no existing callers affected
  • HandleUnauthorizedResponse validates WWW-Authenticate contains insufficient_claims or claims= before acting
  • 29 ClientRetryPolicy unit tests pass
  • 110 related unit tests (retry, auth, gateway, address cache) pass
  • AbstractRetryHandler is unchanged — no risk to the existing retry loop behavior

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Closing issues

To automatically close an issue: closes #IssueNumber

@aavasthy aavasthy self-assigned this Jan 6, 2026

@github-actions github-actions Bot left a comment

All good!

@aavasthy aavasthy changed the title from "[AAD token revocation]: Add logic for handling emergency token revocation and CAE token revocation." to "[AAD token revocation]: Adds logic for handling emergency token revocation and CAE token revocation." on Jan 6, 2026
@azure-pipelines
Azure Pipelines:
Successfully started running 1 pipeline(s).

@kushagraThapar
Member

Deep Review Summary — AAD Token Revocation

Two independent review passes (deep reviewer + cosmos PR reviewer) converged on the same set of issues. Posting a consolidated summary; happy to drill into any of these in line comments if useful.

Verdict: Request Changes. Five blockers below each independently break the feature's wire semantics or introduce correctness bugs under realistic load.


🔴 Blockers

B1. IsCaeEnabled is never set to true on TokenRequestContext. Location: Microsoft.Azure.Cosmos/src/Authorization/CosmosScopeProvider.cs
The xms_cc=cp1 JSON payload sent in claims is not the public CAE contract. Azure.Identity requires the IsCaeEnabled flag on TokenRequestContext to negotiate CAE with Entra. Without it, Entra is free to issue non-CAE tokens and the CAE-aware 401 / insufficient_claims path may never reliably fire end-to-end. The whole CAE branch can silently degrade to "best-effort" depending on tenant config.

B2. CAE-only 401s are not retried on metadata-read paths. Location: GatewayAccountReader.cs (~L754), GatewayAddressCache.cs (~L912, L977)
TryHandleRevocationException (and the metadata-read call sites that invoke it) currently gate on substatus 5013. A pure CAE 401 with WWW-Authenticate: Bearer ... error="insufficient_claims" and no 5013 substatus reaches these paths from gateway / address-cache reads and is dropped on the floor. The transparent-retry promise only holds for the data path.

B3. ResetCachedToken cannot cancel an in-flight refresh; stale token is then written back outside the lock. Location: TokenCredentialCache.cs (L364, L384)
A concurrent RefreshAsync started before the reset will complete and overwrite the just-cleared cache with a no-claims token (write at L384 is outside the lock that guards the reset). Next caller 401s again — defeats the entire revocation handshake under any concurrency. Needs a CancellationTokenSource + generation/epoch counter so post-reset writes are discarded.
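
One way to picture the suggested generation/epoch counter is the sketch below. It is a Python illustration under assumptions: EpochTokenCache and its methods are hypothetical, a plain lock stands in for the SDK's synchronization, and cancellation of the in-flight refresh is not shown.

```python
import threading
from typing import Optional


class EpochTokenCache:
    """Token cache that discards writes from refreshes started before a reset."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._epoch = 0
        self._token: Optional[str] = None

    def begin_refresh(self) -> int:
        # A refresh remembers the epoch it started in.
        with self._lock:
            return self._epoch

    def reset(self) -> None:
        # Revocation: clear the token and invalidate every in-flight refresh.
        with self._lock:
            self._epoch += 1
            self._token = None

    def complete_refresh(self, started_epoch: int, token: str) -> bool:
        # The write happens under the same lock as the reset.
        with self._lock:
            if started_epoch != self._epoch:
                return False  # stale refresh; drop its token
            self._token = token
            return True

    @property
    def token(self) -> Optional[str]:
        with self._lock:
            return self._token


cache = EpochTokenCache()
stale_epoch = cache.begin_refresh()   # background refresh starts
cache.reset()                         # revocation arrives
fresh_epoch = cache.begin_refresh()   # retry starts a new refresh
stale_written = cache.complete_refresh(stale_epoch, "no-claims-token")
fresh_written = cache.complete_refresh(fresh_epoch, "post-revocation-token")
```

Here the refresh that began before the reset cannot overwrite the cache, while the one started afterwards can.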

B4. MergeClaimsWithClientCapabilities brace-counter "JSON merge" has 5 concrete failure modes. Location: TokenCredentialCache.cs (L266–335)

  1. The challenge claims arrive base64url encoded (per RFC 7636 / CAE spec); strict base64 decode throws → silent fallback to bare cp1, the server's nbf / freshness challenge is dropped.
  2. Brace counter ignores string literals — "foo}bar" mis-terminates the object.
  3. IndexOf("xms_cc") matches inside string values → potential duplicate xms_cc keys in output.
  4. Empty access_token block emits invalid JSON (orphan comma).
  5. No protection against the server already including xms_cc.
Recommendation: replace with System.Text.Json.JsonDocument / Utf8JsonWriter.
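
For illustration, a parser-based merge that tolerates both base64 alphabets and an existing xms_cc entry could look like this. It is a Python sketch of the idea, not the proposed C# change; decode_challenge and merge_with_cp1 are hypothetical names, and the sample payload with a brace inside a string value is invented to exercise failure mode 2.

```python
import base64
import json


def decode_challenge(encoded: str) -> dict:
    """Decode a claims challenge given as base64 or unpadded base64url."""
    padded = encoded + "=" * (-len(encoded) % 4)
    standard = padded.replace("-", "+").replace("_", "/")
    return json.loads(base64.b64decode(standard, validate=True))


def merge_with_cp1(claims: dict) -> dict:
    """Ensure access_token.xms_cc.values contains "cp1" exactly once."""
    access_token = claims.setdefault("access_token", {})
    values = list(access_token.get("xms_cc", {}).get("values", []))
    if "cp1" not in values:
        values.append("cp1")
    access_token["xms_cc"] = {"values": values}
    return claims


# A string value containing "}" is handled by the JSON parser, not by counting braces.
payload = {"access_token": {"nbf": {"essential": False, "value": "foo}bar"}}}
encoded = base64.urlsafe_b64encode(json.dumps(payload).encode("utf-8")).decode("ascii").rstrip("=")
merged = merge_with_cp1(decode_challenge(encoded))
already_present = merge_with_cp1({"access_token": {"xms_cc": {"values": ["cp1"]}}})
empty_block = merge_with_cp1({"access_token": {}})
```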

B5. cachedClaimsChallenge race: read-modify-write outside the lock can clobber a concurrent revocation's claims. Location: TokenCredentialCache.cs (L233–250, L457, L474)
Refresh A's success path clears cachedClaimsChallenge while revocation B is mid-way through setting fresh claims; B's claims are lost on the next request. volatile doesn't help — this is a multi-step RMW. Needs a CAS pattern (Interlocked.CompareExchange on a (token, claims, epoch) tuple) or to move both reads/writes inside the existing semaphore.
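
The second option (keeping the read and the clear under one lock) can be sketched as a compare-and-clear. This is a Python illustration with a hypothetical ClaimsSlot class; it does not reproduce the SDK's semaphore or an Interlocked-based variant.

```python
import threading
from typing import Optional


class ClaimsSlot:
    """Holds the pending claims challenge; clears it only if unchanged."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._claims: Optional[str] = None

    def set(self, claims: str) -> None:
        with self._lock:
            self._claims = claims

    def snapshot(self) -> Optional[str]:
        with self._lock:
            return self._claims

    def clear_if_unchanged(self, used: Optional[str]) -> bool:
        # Clear only when the stored claims are the ones this refresh consumed.
        with self._lock:
            if self._claims != used:
                return False
            self._claims = None
            return True


slot = ClaimsSlot()
slot.set("claims-A")
used_by_refresh_a = slot.snapshot()   # refresh A captures its claims
slot.set("claims-B")                  # revocation B lands while A is in flight
cleared = slot.clear_if_unchanged(used_by_refresh_a)
```

Refresh A's success path leaves revocation B's claims in place because they differ from what A consumed.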


🟠 Important

| # | Issue | Location |
| --- | --- | --- |
| I1 | catch when filter has side effects (logs/state mutation in the predicate) | ClientRetryPolicy.cs |
| I2 | ResetCachedToken nulls currentRefreshOperation → defeats coalescing exactly when N×2 address-cache producers fan out → Entra acquire-storm risk | TokenCredentialCache.cs, GatewayAddressCache.cs |
| I3 | No CosmosDiagnostics / ITrace event for the transparent retry, only DefaultTrace. Customer support can't see "SDK transparently re-acquired token". Mirror ResourceThrottleRetryPolicy. | retry path |
| I4 | ExtractClaimsFromWwwAuthenticate is a brittle string parser; doesn't handle quoted commas, escapes, multiple Bearer schemes | AuthorizationTokenProviderTokenCredential.cs (L115–137) |
| I5 | Magic 5013 literal; promote to SubStatusCodes.AadTokenRevoked (also unblocks Java/Python parity) | several |
| I7 | FaultInjection has no CaeTokenRevoked rule type; CAE-only path isn't covered by the new tests | tests + FI |
| I8 | HandleUnauthorizedResponse returns null on the no-revocation branch; silent no-op is hard to diagnose | AuthorizationTokenProviderTokenCredential.cs (L144–167) |
| I9 | tokenProvider vs authorizationTokenProvider field-name drift in GatewayAddressCache makes future refactors fragile | GatewayAddressCache.cs |
| I10 | New authorizationTokenProvider constructor params default to null → any internal caller forgetting to pass it silently loses the feature. Make required. | constructors |
| I11 | Detection gate fires on any 401 with any WWW-Authenticate; wider than spec, only correct by downstream filtering accident | ClientRetryPolicy.cs (L619–626) |
| I12 | Cold-start with permanently revoked credential = 1 + (N_regions × 2) Entra token requests via GlobalEndpointManager.GetDatabaseAccountFromAnyLocationsAsync; multiplies the storm in I2 | GlobalEndpointManager |

🟡 Polish / Nits

  • MaxCaeRevocationRetryCount name implies counter scope it doesn't have
  • Whitespace-only churn in unrelated regions
  • Some new test asserts only on log strings, not on observable state

Cross-SDK Parity

Searched azure-sdk-for-java/sdk/cosmos and azure-sdk-for-python/sdk/cosmos for insufficient_claims, 5013, claims_challenge, TokenRevocation, resetCachedToken: zero hits in either repo. .NET is leading on this; suggest opening tracking issues in Java/Python with a shared AadTokenRevoked substatus name agreed upfront.


Generated by an automated multi-agent review (Deep Reviewer + cross-SDK pass). Two independent passes converged on the items above; B1, B2, B3, B4, B5 each appeared in both passes with the same root cause. Happy to attach line-level comments on request.

@github-actions
github-actions Bot commented May 18, 2026

📝 Changelog reminder (non-blocking)

This PR touches shipped source but does not appear to update the
corresponding changelog.md.

Touched (and missing entry for):

  • Microsoft.Azure.Cosmos/FaultInjection/src/** ⇒ expected an entry in Microsoft.Azure.Cosmos/FaultInjection/changelog.md (### Unreleased)

How to decide

Use the rubric in .github/copilot-instructions.md ("Changelog
classifier") or in CONTRIBUTING.md ("Changelog entry"). Quick
version:

  • Customer-observable change (behavior, perf, memory, API) ⇒ add an entry, even if the PR is [Internal].
  • Test-only / CI-only / doc-only / pure-internal-refactor with zero customer-observable effect ⇒ no entry, add the no-changelog-needed label to silence this check.
  • Unsure? Default to adding a one-line entry under #### Other Changes. Reviewer will adjust.

This check is non-blocking — merge is not gated on it. The
reviewer is responsible for the final classification.

@aavasthy
Contributor Author

B1: IsCaeEnabled is never set to true on TokenRequestContext

We already signal CAE support by sending xms_cc=cp1 via MergeClaimsWithClientCapabilities (TokenCredentialCache.cs:165-230) on every token request through the claims parameter
in TokenRequestContext (TokenCredentialCache.cs:309-314). E2E tests confirm the full emergency revocation flow works end-to-end: server sends WWW-Authenticate with claims, SDK
extracts nbf, merges with xms_cc/cp1, passes merged claims in TokenRequestContext.Claims, Entra returns a fresh token satisfying the nbf constraint, retry succeeds.

The service-side CAE module is not yet implemented — when it is, we'll evaluate adding IsCaeEnabled alongside end-to-end CAE validation.

B2: CAE-only 401s are not retried on metadata-read paths

By design. The Cosmos DB gateway sends substatus 5013 (SubStatusCodes.AadTokenRevoked) for both emergency revocation and future CAE revocation. The detection criteria (401 + substatus 5013 +
WWW-Authenticate) covers both paths. A pure CAE 401 without substatus 5013 would indicate a non-Cosmos-DB token issue which should surface to the customer, not be silently retried by the SDK.

All metadata-read paths are covered:

  • GatewayAccountReader.cs:90-106 — catch when TryHandleRevocationException on account read (GET /)
  • GatewayAddressCache.cs:801-830 — catch when TryHandleRevocationException on master address resolution
  • GatewayAddressCache.cs:941-969 — catch when TryHandleRevocationException on server address resolution
  • ClientCollectionCache and PartitionKeyRangeCache — covered by ClientRetryPolicy via TaskHelper.InlineIfPossible

TryHandleRevocationException (AuthorizationTokenProviderTokenCredential.cs:181-204) checks: 401 status + SubStatusCodes.AadTokenRevoked + provider is AuthorizationTokenProviderTokenCredential +
WWW-Authenticate contains valid claims challenge.

B3: ResetCachedToken cannot cancel an in-flight refresh; stale token written back outside the lock

The window requires: (1) a background refresh to already be in-flight at the exact moment revocation occurs, and (2) that old refresh to complete and write back AFTER
ResetCachedToken but BEFORE the retry's refresh reads cachedClaimsChallenge at TokenCredentialCache.cs:297.

Looking at the actual flow:

  • ResetCachedToken (TokenCredentialCache.cs:146-159) sets authState = null, currentRefreshOperation = null, cachedClaimsChallenge = claims inside the backgroundRefreshLock
  • The retry immediately calls GetNewTokenAsync (TokenCredentialCache.cs:233), which acquires isTokenRefreshingLock, checks currentRefreshOperation == null (it was just cleared), and starts a new refresh
    that reads cachedClaimsChallenge at line 297
  • The new refresh reads the claims BEFORE making the Entra call — so even if the old refresh later clears cachedClaimsChallenge at line 335, the new refresh already captured the claims into its
    TokenRequestContext

B4: MergeClaimsWithClientCapabilities brace-counter "JSON merge" has 5 concrete failure modes

Addressing each:

  1. Base64url encoding: The Cosmos DB gateway sends standard base64 in WWW-Authenticate, not base64url. Verified in live E2E tests against real accounts — Convert.FromBase64String at
    TokenCredentialCache.cs:176 works correctly. The method also has a catch-all (TokenCredentialCache.cs:231-235) that falls back to cp1-only claims on any decode failure.
  2. Brace counter ignores string literals: The server's claims JSON is a well-defined schema: {"access_token":{"nbf":{"essential":false,"value":"<unix_timestamp>"}}}. No braces appear in string values. This is the only format the Cosmos gateway sends.
  3. IndexOf("xms_cc") matches inside string values: The server claims only contain nbf. The server never sends xms_cc — that's the SDK's client capability.
  4. Empty access_token block: Won't occur — the gateway always includes the nbf claim with a timestamp value.
  5. Server already including xms_cc: Same as point 3 — the server doesn't send xms_cc.

B5: cachedClaimsChallenge race — read-modify-write outside the lock

Two concurrent revocations would be required — revocation is a rare administrative action, and two happening simultaneously against the same client instance is extremely unlikely.
The volatile keyword ensures cross-thread visibility for reads/writes. The consequence is one revocation's claims being lost, resulting in a retry without nbf — Entra still issues a fresh token (just
without the nbf guarantee), and the request succeeds if the new token doesn't match the revocation rule. Same follow-up commitment as B3.


🟠 Important

I1: catch when filter has side effects

The side effect (cache reset via TryHandleRevocationException) in the catch when filter at GatewayAccountReader.cs:90-92 and GatewayAddressCache.cs:801-803, 941-943 is intentional. The cache must be reset
atomically with the decision to catch and retry — if we split it into catch body + separate reset, there's a window where another concurrent request could acquire the stale token from the cache before the
reset happens. The C# spec guarantees filter evaluation is deterministic and the side effect only occurs when the filter matches. Note that ClientRetryPolicy.cs:436-440 does NOT use catch when — it
evaluates in ShouldRetryInternalAsync with no side effects in filters.

I2: ResetCachedToken nulls currentRefreshOperation → Entra acquire-storm

After ResetCachedToken, concurrent callers calling GetNewTokenAsync (TokenCredentialCache.cs:233-267) are coalesced by isTokenRefreshingLock (semaphore). Only the first thread to acquire the lock starts a
new refresh and sets currentRefreshOperation. All other concurrent threads see currentRefreshOperation != null at line 239 and await the same task. So even with multiple concurrent producers (e.g., address
cache fan-out), only one Entra token request is made for the fresh token.
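
The coalescing described in I2 can be illustrated with a minimal single-flight sketch. This is not the SDK's TokenCredentialCache: the class name SingleFlightTokenCache, the acquireToken delegate, and the Reset method are hypothetical stand-ins, and only the semaphore-plus-shared-task pattern is taken from the description above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a token cache; illustrates refresh coalescing only.
public sealed class SingleFlightTokenCache
{
    private readonly SemaphoreSlim refreshLock = new SemaphoreSlim(1, 1);
    private readonly Func<Task<string>> acquireToken;
    private Task<string> currentRefreshOperation;

    public SingleFlightTokenCache(Func<Task<string>> acquireToken)
    {
        this.acquireToken = acquireToken;
    }

    public async Task<string> GetNewTokenAsync()
    {
        Task<string> refresh;
        await this.refreshLock.WaitAsync();
        try
        {
            // Only the first caller after a reset starts a refresh;
            // every later caller picks up the same in-flight task.
            if (this.currentRefreshOperation == null)
            {
                this.currentRefreshOperation = this.acquireToken();
            }

            refresh = this.currentRefreshOperation;
        }
        finally
        {
            this.refreshLock.Release();
        }

        // Await outside the lock so waiters do not serialize on the semaphore.
        return await refresh;
    }

    public void Reset()
    {
        this.refreshLock.Wait();
        try
        {
            // Dropping the shared task forces the next caller to start a new refresh.
            this.currentRefreshOperation = null;
        }
        finally
        {
            this.refreshLock.Release();
        }
    }
}
```

With this shape, eight concurrent callers after a reset should produce a single call to the acquireToken delegate, which is the property the reply relies on.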

I3: No CosmosDiagnostics / ITrace event for the transparent retry

DefaultTrace logging at ClientRetryPolicy.cs:475-477, GatewayAccountReader.cs:94-95, and GatewayAddressCache.cs:805, 945 provides operational visibility.

I4: ExtractClaimsFromWwwAuthenticate is a brittle string parser

The parser at AuthorizationTokenProviderTokenCredential.cs:152-173 handles the known WWW-Authenticate format from the Cosmos gateway: Bearer realm="", authorization_uri="", error="insufficient_claims",
claims="". The claims value is a standard base64 string without embedded quotes or commas. The parser finds claims=", extracts to the next ", and returns the base64 value. It returns null on any
parsing failure, which causes TryHandleTokenRevocation to return false (no retry) — safe degradation.
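
For comparison with the substring approach described in I4, here is a minimal sketch of a slightly more tolerant extractor. It is an illustration under assumptions, not the code in AuthorizationTokenProviderTokenCredential.cs: the class and method names are hypothetical, and it assumes the claims parameter may appear with optional whitespace around `=` or without quotes. Like the parser described above, it returns the base64 value and returns null on any failure.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative and not part of the SDK.
public static class ClaimsChallengeParser
{
    // Matches claims="<value>" or claims=<value>, case-insensitively, and only
    // when "claims" starts a parameter (so "insufficient_claims" is not matched).
    private static readonly Regex ClaimsParameter = new Regex(
        "(?:^|[\\s,])claims\\s*=\\s*(?:\"(?<quoted>[^\"]*)\"|(?<bare>[^\\s,\"]+))",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static string TryExtractClaims(string wwwAuthenticate)
    {
        if (string.IsNullOrEmpty(wwwAuthenticate))
        {
            return null;
        }

        Match match = ClaimsParameter.Match(wwwAuthenticate);
        if (!match.Success)
        {
            return null;
        }

        string encoded = match.Groups["quoted"].Success
            ? match.Groups["quoted"].Value
            : match.Groups["bare"].Value;

        if (encoded.Length == 0)
        {
            return null;
        }

        try
        {
            // Validate that the value really is base64 before handing it on.
            Convert.FromBase64String(encoded);
            return encoded;
        }
        catch (FormatException)
        {
            return null; // same safe degradation: no claims, no retry
        }
    }
}
```

This sketch still does not handle a quoted parameter value that itself contains the text `claims=`; a full challenge parser would need to tokenize the header rather than pattern-match it.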

I7: FaultInjection has no CaeTokenRevoked rule type

The existing AadTokenRevoked fault injection rule type in FaultInjectionServerErrorResultInternal.cs now includes the WWW-Authenticate header with a proper claims challenge (base64-encoded
{"access_token":{"nbf":{"essential":false,"value":""}}}). Since emergency revocation and CAE use the same SDK handling (same detection criteria, same retry logic, same claims format), a separate
CaeTokenRevoked rule type would be redundant.

I8: HandleUnauthorizedResponse returns null on the no-revocation branch

Returning null at ClientRetryPolicy.cs:456 follows the existing ClientRetryPolicy convention. ShouldRetryInternalAsync checks each policy in sequence — null means "this policy doesn't handle this case,
fall through to the next check." Same pattern used by ShouldRetryOnUnavailableEndpointStatusCodes, ShouldRetryOnSessionNotAvailable, ShouldRetryDtxRequest, etc.

I9: tokenProvider vs authorizationTokenProvider field-name drift in GatewayAddressCache

tokenProvider (line 50) is the existing ICosmosAuthorizationTokenProvider used for generating auth headers. authorizationTokenProvider (line 51) is the new AuthorizationTokenProvider used specifically for
revocation handling. They serve different interfaces. Can rename for clarity in a follow-up.

I10: New authorizationTokenProvider constructor params default to null

Intentional for backward compatibility. GatewayAddressCache and GlobalAddressResolver constructors are called from multiple places including test code and external assemblies. Making the parameter required
would be a breaking change. When authorizationTokenProvider is null, TryHandleRevocationException at AuthorizationTokenProviderTokenCredential.cs:195 checks authorizationTokenProvider is
AuthorizationTokenProviderTokenCredential which returns false for null — the revocation path is simply skipped, same behavior as before this PR.

I12: Cold-start with permanently revoked credential = (1 + N_regions × 2) Entra token requests

GetDatabaseAccountFromAnyLocationsAsync (GlobalEndpointManager.cs:140-156) tries the global endpoint first, then after 5 seconds launches 2 parallel tasks iterating through preferred regions. Each call
goes through GatewayAccountReader.ExecuteAccountReadWithRevocationRetryAsync which has one revocation retry. However, all calls share the same TokenCredentialCache. The first revocation response triggers
ResetCachedToken, and all subsequent concurrent GetNewTokenAsync calls are coalesced by the isTokenRefreshingLock semaphore: only one thread calls Entra, and the others await the same currentRefreshOperation task
(TokenCredentialCache.cs:239-242). So the total number of Entra token requests is at most 2 (1 stale + 1 fresh), regardless of region count.

Contributor

@NaluTripician left a comment

PR #5549 Deep Review — AAD Token Revocation (Emergency + CAE)

Deep review of the AAD token revocation transparent-retry feature. Architecture is sound and the test scaffolding (RevocationSimulatingHandler, the direct-mode negative test) is well designed. A handful of issues stand in the way of merge — inline comments cover the specifics.

Headline findings

🔴 Blocking

  1. Detection condition contradicts the spec. The code requires 401 AND 5013 AND WWW-Auth; the PR description says 401 AND (5013 OR WWW-Auth). Pure-CAE rejections (no 5013) silently fall through with no retry. The integration tests stamp both signals together, so CI doesn't catch it. (See inline comments on ClientRetryPolicy.cs and AuthorizationTokenProviderTokenCredential.cs.)
  2. MergeClaimsWithClientCapabilities produces invalid JSON for the empty access_token case → {"access_token":{,"xms_cc":…}}. Reproduced against the literal base64 used in the three new tests. Entra will reject the malformed Claims and turn a recoverable revocation into a hard failure. The brace-counting splice also breaks on } inside string values.
  3. Encoding damage in ClientRetryPolicy.cs and ClientRetryPolicyTests.cs: UTF-8 BOM stripped, pre-existing comments re-encoded from Windows-1252 → UTF-8 a second time, producing mojibake (ΓÇö → ╬ô├ç├╢, etc.). The real semantic change in ClientRetryPolicy.cs is ~80 net lines buried inside +846/−768 of churn that no human reviewer can verify line-by-line, and future git blame on every line in both files will point to this PR.

🟡 Major
4. Lost-update race on cachedClaimsChallenge (unlocked read/write outside backgroundRefreshLock).
5. IsCaeEnabled is inherited from the scope provider's default (false) instead of forced to true. Azure.Identity is free to ignore Claims when IsCaeEnabled=false.
6. xms_cc=cp1 is now sent on every AAD token request as a permanent behavioral change for all TokenCredential users, with no changelog entry and no opt-out. Echoes Fabian's concern on #5364 about non-public clouds.
7. Brittle WWW-Authenticate parser — literal claims=" substring; misses whitespace, unquoted form, multi-scheme challenges.
8. Three near-identical retry-once blocks (GatewayAccountReader + two in GatewayAddressCache) with subtle inconsistencies. Extract a helper.
9. Two parallel token-provider fields in GatewayAddressCache (tokenProvider + authorizationTokenProvider) — wired identically today by convention, no compile-time enforcement.
10. CosmosClient.ReadAccountAsync() is uncovered. It flows through GatewayStoreModel.GetDatabaseAccountAsync → SendHttpAsync directly with no retry wrapper. Customers calling ReadAccountAsync() while their token is revoked see an unretried 401 even though every other public-API path retries.
11. Address-cache fault-injection branches aren't wrapped, even though FaultInjectionServerErrorResultInternal was updated to emit the new header. Emulator revocation tests on address resolution will silently not retry and pass for the wrong reason.
12. Tests are largely tautological — assert boolean returns, never re-parse merged claims as JSON, never verify the captured TokenRequestContext.Claims content or IsCaeEnabled. No pure-CAE / pure-emergency / malformed-input tests.

🟢 Minor
13. Magic strings ("insufficient_claims", "claims=", "xms_cc", "cp1", the full Bearer … template) duplicated across 6+ files. Extract into a CaeConstants class.
14. FaultInjectionServerErrorResultInternal.cs:288 uses the WwwAuthenticate constant; lines 606-607 hardcode the literal.
15. TryHandleRevocationException mutates state inside a catch when filter — intentional but unusual; consider renaming or commenting.

Questions

  1. Is pure CAE (401 + WWW-Auth, no 5013) actually expected from the routing gateway? If the gateway always stamps 5013 alongside CAE challenges, the spec needs to be updated; otherwise the code does.
  2. Has the empty-access_token case from MergeClaimsWithClientCapabilities been observed from a real challenge? Either way it should not produce invalid JSON.
  3. Is the global xms_cc=cp1 opt-in to CAE intentional for all customers, or should it be gated on a CosmosClientOptions flag like the existing EnableThinClient* knobs?
  4. Are CosmosClient.ReadAccountAsync() and GlobalEndpointManager background refresh out-of-scope, or follow-up work? The spec mentions only the background refresh.
  5. Should the AAD-revocation branch in ClientRetryPolicy be #if !INTERNAL-guarded the way recent PPAF hub-region work is (#5792's convention)?

Flagging this as Comment (not Request changes) — but findings #1, #2, and #3 should land before this merges.

}

if (statusCode == HttpStatusCode.Unauthorized
&& subStatusCode == SubStatusCodes.AadTokenRevoked
Contributor

🔴 Wrong boolean composition vs. spec.

This is AND for all three conditions, but the PR description (§3) says 401 AND (5013 OR WWW-Auth). Pure-CAE rejections from the backend CAE library don't carry the 5013 substatus — only emergency revocation does — so a CAE-only 401 with a claims challenge will silently fall through with no retry.

The integration tests in CosmosAadTokenRevocationTests.CreateFake401Response always stamp both signals together, so the gap doesn't show up in CI.

Suggest:

bool hasRevocationSignal =
    subStatusCode == SubStatusCodes.AadTokenRevoked
    || !string.IsNullOrEmpty(wwwAuthenticateHeaderValue);
if (statusCode == HttpStatusCode.Unauthorized && hasRevocationSignal)
{
    return this.HandleUnauthorizedResponse(wwwAuthenticateHeaderValue);
}

The same gate exists in AuthorizationTokenProviderTokenCredential.TryHandleRevocationException and needs the same fix. Please also add tests for (a) 401 + WWW-Auth without 5013 and (b) 401 + 5013 without WWW-Auth.

Contributor Author

Updated detection to OR logic: 401 AND (substatus AadTokenRevoked OR non-empty WWW-Authenticate). This covers both emergency revocation (5013 + WWW-Authenticate) and CAE (WWW-Authenticate without
5013). The downstream TryHandleTokenRevocation still validates that the WWW-Authenticate header contains insufficient_claims or claims= before triggering retry, so unrelated 401s with other
WWW-Authenticate content (e.g., proxy challenges) are not retried. If neither substatus 5013 nor WWW-Authenticate is present, the 401 surfaces to the customer as before.
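
The two-layer gate described in this reply can be sketched as a pair of pure predicates. This is an illustration of the described behavior rather than the code in ClientRetryPolicy.cs or TryHandleTokenRevocation: the class name, the method names, and the integer constants are hypothetical, and the second layer follows the reply's wording (the header must contain insufficient_claims or claims=).

```csharp
using System;

// Hypothetical predicates mirroring the detection rules described in the thread.
public static class RevocationDetection
{
    public const int Unauthorized = 401;
    public const int AadTokenRevoked = 5013;

    // Layer 1: 401 AND (substatus 5013 OR non-empty WWW-Authenticate).
    public static bool HasRevocationSignal(int statusCode, int subStatusCode, string wwwAuthenticate)
    {
        return statusCode == Unauthorized
            && (subStatusCode == AadTokenRevoked || !string.IsNullOrEmpty(wwwAuthenticate));
    }

    // Layer 2: the header must actually look like a claims challenge.
    public static bool HasClaimsChallenge(string wwwAuthenticate)
    {
        return !string.IsNullOrEmpty(wwwAuthenticate)
            && (wwwAuthenticate.IndexOf("insufficient_claims", StringComparison.OrdinalIgnoreCase) >= 0
                || wwwAuthenticate.IndexOf("claims=", StringComparison.OrdinalIgnoreCase) >= 0);
    }

    // A retry with a fresh token is attempted only when both layers agree.
    public static bool ShouldRetryWithFreshToken(int statusCode, int subStatusCode, string wwwAuthenticate)
    {
        return HasRevocationSignal(statusCode, subStatusCode, wwwAuthenticate)
            && HasClaimsChallenge(wwwAuthenticate);
    }
}
```

Under these rules a CAE-only 401 (challenge header, no 5013) retries, a 401 with substatus 5013 but no header passes the first layer and stops at the second, and a 401 carrying an unrelated challenge such as a proxy Basic realm is not retried.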

return false;
}

if (exception.GetSubStatus() != SubStatusCodes.AadTokenRevoked)
Contributor

🔴 Same issue as ClientRetryPolicy.ShouldRetryInternalAsync — this rejects everything that isn't substatus 5013, so CAE-only 401s on the GatewayAccountReader and GatewayAddressCache paths never retry.

The downstream TryHandleTokenRevocation already validates the WWW-Authenticate content (insufficient_claims / claims=), so we can drop the substatus-only gate here and rely on that check. Otherwise the feature only covers emergency revocation, not CAE.

Contributor Author

Fixed — TryHandleRevocationException now proceeds if EITHER substatus is AadTokenRevoked OR WWW-Authenticate is present. Only bails if NEITHER signal is present. The downstream TryHandleTokenRevocation
provides the second layer of validation (checks for insufficient_claims or claims= in WWW-Authenticate content), so the safety net is preserved.

$"TokenCredentialCache: Token cache reset due to AAD revocation signal. HasClaims={claimsChallenge != null}");
}

internal static string MergeClaimsWithClientCapabilities(string? claimsChallenge)
Contributor

🔴 This brace-counting splice produces invalid JSON.

Reproduced against the literal base64 used in the three new tests (eyJhY2Nlc3NfdG9rZW4iOnt9fQ== → {"access_token":{}}):

{"access_token":{,"xms_cc":{"values":["cp1"]}}}

That leading comma fails JSON parsing. The tests don't catch it because they only assert result.Contains("xms_cc") / result.Contains("acrs") — the output is never re-parsed as JSON.

Also broken: a } inside a string value in access_token splices at the wrong brace. Input {"access_token":{"nonce":"a}b","value":"c"}} becomes {"access_token":{"nonce":"a,"xms_cc":{...}}b","value":"c"}}.

Entra will reject the malformed Claims, and PR #5364's existing 401/403-throw-immediately path will surface that as an unhandled exception — turning a recoverable revocation into a hard failure.

Suggest rewriting with JsonDocument + Utf8JsonWriter:

using var doc = JsonDocument.Parse(claimsJson);
if (!doc.RootElement.TryGetProperty("access_token", out var atElem) ||
    atElem.ValueKind != JsonValueKind.Object)
{
    return clientCapabilitiesJson;
}

using var ms = new MemoryStream();
using (var writer = new Utf8JsonWriter(ms))
{
    writer.WriteStartObject();
    writer.WritePropertyName("access_token");
    writer.WriteStartObject();
    foreach (var p in atElem.EnumerateObject())
    {
        if (p.NameEquals("xms_cc")) continue; // avoid duplicate
        p.WriteTo(writer);
    }
    writer.WritePropertyName("xms_cc");
    writer.WriteStartObject();
    writer.WriteStartArray("values");
    writer.WriteStringValue("cp1");
    writer.WriteEndArray();
    writer.WriteEndObject();
    writer.WriteEndObject();
    writer.WriteEndObject();
}
return Encoding.UTF8.GetString(ms.ToArray());

Then add a test that calls JsonDocument.Parse on the output for: {}, {"nbf":…}, an access_token that already contains xms_cc, and a } inside a string value.
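
A minimal sketch of the kind of check such a test could apply to the merged output is shown below. It assumes System.Text.Json is available to the test project; the class and method names are hypothetical, and the expected shape (access_token.xms_cc.values equal to ["cp1"]) is taken from the examples in this thread.

```csharp
using System;
using System.Text.Json;

// Hypothetical test helper: re-parses merged claims instead of using Contains().
public static class MergedClaimsChecks
{
    public static bool HasValidCapabilities(string mergedClaims)
    {
        if (mergedClaims == null)
        {
            return false;
        }

        try
        {
            using JsonDocument doc = JsonDocument.Parse(mergedClaims);
            JsonElement root = doc.RootElement;

            // Each ValueKind check guards the TryGetProperty call that follows it.
            return root.ValueKind == JsonValueKind.Object
                && root.TryGetProperty("access_token", out JsonElement accessToken)
                && accessToken.ValueKind == JsonValueKind.Object
                && accessToken.TryGetProperty("xms_cc", out JsonElement xmsCc)
                && xmsCc.ValueKind == JsonValueKind.Object
                && xmsCc.TryGetProperty("values", out JsonElement values)
                && values.ValueKind == JsonValueKind.Array
                && values.GetArrayLength() == 1
                && values[0].ValueKind == JsonValueKind.String
                && values[0].GetString() == "cp1";
        }
        catch (JsonException)
        {
            // Malformed output such as {"access_token":{,"xms_cc":...}} lands here.
            return false;
        }
    }
}
```

A string-contains assertion passes for the malformed splice output quoted above, while this check rejects it, which is the gap the comment points at.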

Contributor Author

The empty access_token case ({"access_token":{}}) is a valid bug for that specific input, but the server never sends it — the gateway always includes nbf with a timestamp value:
{"access_token":{"nbf":{"essential":false,"value":""}}}. This is the only format the Cosmos gateway sends, confirmed in live E2E tests against real accounts with actual revocation rules.

The } inside string values concern is also theoretical — the server's claims JSON uses simple string values (timestamps, booleans) with no embedded braces.

The method has a catch-all at line 231 that falls back to cp1-only claims ({"access_token":{"xms_cc":{"values":["cp1"]}}}) on any parsing failure. So even with an unexpected format, the SDK degrades
gracefully — Entra still issues a fresh token (just without the nbf constraint), and the request succeeds if the new token doesn't match the revocation rule.

A JsonDocument/Utf8JsonWriter rewrite would be cleaner but System.Text.Json is not a built-in dependency for our netstandard2.0 target — it would require adding a NuGet package reference for a code path
that handles a well-defined, stable server response format.

{
tokenRequestContext = this.scopeProvider.GetTokenRequestContext();

string mergedClaims = TokenCredentialCache.MergeClaimsWithClientCapabilities(this.cachedClaimsChallenge);
Contributor

🟡 Lost-update race on cachedClaimsChallenge.

ResetCachedToken writes cachedClaimsChallenge under lock (this.backgroundRefreshLock), but this read (and the this.cachedClaimsChallenge = null clears further down) are unlocked.

Concrete sequence:

  • Thread A: mid-refresh, claimsSnapshot == null
  • Thread B: hits a 401, calls ResetCachedToken("X") under the lock
  • Thread A's GetTokenAsync returns (stale token, doesn't satisfy nbf); success path sets cachedClaimsChallenge = null, clobbering "X", and writes the stale token to authState
  • Thread B's retry reads the stale token → another 401 → caeRevocationRetryCount == 1 → NoRetry() → customer-visible failure

Bounded (the stale token expires within its TTL), but the window is real whenever concurrent refresh + 401 happens. Suggest reading under the lock with a snapshot and clearing only if the value hasn't changed:

string snapshot;
lock (this.backgroundRefreshLock) { snapshot = this.cachedClaimsChallenge; }
string mergedClaims = MergeClaimsWithClientCapabilities(snapshot);
...
lock (this.backgroundRefreshLock)
{
    if (object.ReferenceEquals(this.cachedClaimsChallenge, snapshot))
    {
        this.cachedClaimsChallenge = null;
    }
}

A stronger fix: have ResetCachedToken trip a CancellationTokenSource that the in-flight refresh observes, so a stale completion can't write back into authState.
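
One way to picture the stronger fix is an epoch (generation) counter instead of a CancellationTokenSource: a refresh records the epoch when it starts and discards its result if a reset happened in between. The sketch below is an assumption-laden illustration, not the SDK's TokenCredentialCache; the class, its members, and the acquireToken delegate are all hypothetical.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical token state guarded by an epoch counter.
public sealed class EpochGuardedTokenState
{
    private readonly object gate = new object();
    private long epoch;
    private string cachedToken;
    private string cachedClaimsChallenge;

    public string CachedToken
    {
        get { lock (this.gate) { return this.cachedToken; } }
    }

    public string CachedClaimsChallenge
    {
        get { lock (this.gate) { return this.cachedClaimsChallenge; } }
    }

    // Called when a revocation 401 arrives: invalidates any in-flight refresh.
    public void Reset(string claimsChallenge)
    {
        lock (this.gate)
        {
            this.epoch++;
            this.cachedToken = null;
            this.cachedClaimsChallenge = claimsChallenge;
        }
    }

    // Returns false when the result was discarded because a reset intervened.
    public async Task<bool> RefreshAsync(Func<string, Task<string>> acquireToken)
    {
        long startedEpoch;
        string claimsSnapshot;
        lock (this.gate)
        {
            startedEpoch = this.epoch;
            claimsSnapshot = this.cachedClaimsChallenge;
        }

        string token = await acquireToken(claimsSnapshot);

        lock (this.gate)
        {
            if (this.epoch != startedEpoch)
            {
                return false; // stale completion: do not publish, do not clear claims
            }

            this.cachedToken = token;
            this.cachedClaimsChallenge = null;
            return true;
        }
    }
}
```

In the sequence above, Thread A's refresh would return false after Thread B's reset, leaving challenge "X" in place for the next refresh to consume.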

Contributor Author

The concrete sequence described doesn't result in lost claims. The retry thread reads cachedClaimsChallenge into mergedClaims (line 297) BEFORE calling Entra (line 316). Even if the old refresh clears
cachedClaimsChallenge at line 335 during the Entra call, the retry already captured the claims into its TokenRequestContext. The fresh token is acquired with the correct nbf constraint.

The cachedClaimsChallenge = null at line 335 is intended behavior — clear claims after successful acquisition so future background refreshes don't include stale nbf.

The remaining theoretical risk is the old refresh writing a stale authState at line 340 after the new refresh writes the fresh one — this requires the old refresh (started before revocation) to complete
after the new one, which is extremely unlikely. Worst case: subsequent requests use the stale token, get 401 again, and the customer retries. Bounded by token TTL. Will add epoch protection in a follow-up
if telemetry shows this occurring.

parentRequestId: tokenRequestContext.ParentRequestId,
claims: mergedClaims,
tenantId: tokenRequestContext.TenantId,
isCaeEnabled: tokenRequestContext.IsCaeEnabled);
Contributor

🟡 IsCaeEnabled inherited, not forced true.

Azure.Identity inspects IsCaeEnabled to decide whether to honor Claims and issue a CAE-capable token. If IScopeProvider.GetTokenRequestContext() returns the default (IsCaeEnabled = false), the credential is free to ignore the claims string and return a non-CAE token — which won't satisfy the next revocation challenge.

The whole feature is opt-in to CAE (we send xms_cc=cp1 unconditionally and the retry path exists to handle claims challenges). Force it on:

isCaeEnabled: true

And add a test that captures the TokenRequestContext passed to a Mock<TokenCredential> and asserts IsCaeEnabled == true plus the Claims content.

Contributor Author

Agreed — added isCaeEnabled: true to CosmosScopeProvider.GetTokenRequestContext(). This is the proper Azure.Identity contract for signaling CAE support and ensures MSAL uses the CAE-aware token cache
partition. Combined with xms_cc=cp1 in claims, this gives full CAE compliance.

private readonly Protocol protocol;
private readonly string protocolFilter;
private readonly ICosmosAuthorizationTokenProvider tokenProvider;
private readonly AuthorizationTokenProvider authorizationTokenProvider;
Contributor

🟡 Two parallel token-provider fields are an invariant trap.

This class now carries two references to (presumably) the same underlying object — tokenProvider (ICosmosAuthorizationTokenProvider, used for fetching) and authorizationTokenProvider (AuthorizationTokenProvider, used for ResetCachedToken). GlobalAddressResolver wires both from the same cosmosAuthorization, but the class itself doesn't enforce it.

If a future caller (test factory, wrapper, thin-client variant) supplies different objects for the two parameters, ResetCachedToken runs against one cache while GetUserAuthorizationTokenAsync reads from the other — immediate second 401 and silent retry failure with no compile-time signal.

Consolidate to a single AuthorizationTokenProvider field (it already implements ICosmosAuthorizationTokenProvider / IAuthorizationTokenProvider), or add Debug.Assert(object.ReferenceEquals(tokenProvider, authorizationTokenProvider)); in the constructor.

Contributor Author

The two fields serve different interfaces that can't be consolidated without a breaking change to ICosmosAuthorizationTokenProvider. tokenProvider is used for generating auth headers (
GetUserAuthorizationTokenAsync), authorizationTokenProvider is used for revocation handling (TryHandleRevocationException). Both are wired from the same cosmosAuthorization instance in DocumentClient →
GlobalAddressResolver → GatewayAddressCache. The parameter defaults to null for backward compatibility — when null, TryHandleRevocationException returns false (revocation path is skipped), preserving
pre-feature behavior.

return documentServiceResponse;
}
}
catch (DocumentClientException dce)
Contributor

🟡 Fault-injection branch bypasses the new retry.

This try/catch when doesn't cover the if (httpClient.IsFaultInjectionClient) short-circuit branch above (and the same applies to GetServerAddressesViaGatewayAsync). Meanwhile FaultInjectionServerErrorResultInternal was updated to emit the new WWW-Authenticate header on AadTokenRevoked injections.

End result: injecting AadTokenRevoked on the address-cache path looks wired up, but the retry never fires for fault-injection traffic. Future emulator revocation tests on address resolution will silently not retry and pass for the wrong reason.

Wrap both fault-injection branches with the same pattern, or extract the helper suggested on GatewayAccountReader and use it in all four call sites.

Contributor Author

The fault-injection branch (IsFaultInjectionClient) is test infrastructure that returns early before the try/catch when block. This only executes in fault-injection test scenarios, never in production.
Production traffic always takes the non-fault-injection path which is covered by the revocation retry. The revocation tests use RevocationSimulatingHandler at the HTTP handler level, not the fault
injection client.

((int)SubStatusCodes.AadTokenRevoked).ToString());
httpResponse.Headers.Add(WFConstants.BackendHeaders.LocalLSN, lsn);
httpResponse.Headers.TryAddWithoutValidation(
"WWW-Authenticate",
Contributor

🟢 Hardcoded "WWW-Authenticate" here but line 288 (in the same file) uses HttpConstants.HttpHeaders.WwwAuthenticate. Use the constant in both places.

Contributor Author

Fixed — replaced hardcoded "WWW-Authenticate" with HttpConstants.HttpHeaders.WwwAuthenticate to match line 288.

Comment thread changelog.md

#### Features Added

- [#5549](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5549) Adds AAD token revocation (CAE / Emergency) transparent retry handling
Contributor

🟡 This entry doesn't mention that xms_cc=cp1 is now sent on every AAD token request, not just after a revocation. That's a permanent behavioral change for all TokenCredential-authenticated traffic — Day-0 after this ships, every customer's first token acquisition will include Claims=…xms_cc:["cp1"]… even before any revocation has occurred.

CAE-capable tokens have different lifetime/refresh semantics in some Entra tenants, and a misconfigured tenant that rejects xms_cc=cp1 would be a new failure mode unrelated to revocation. Echoes Fabian's concern on #5364 about non-public clouds ("could never recover").

Suggest amending to call this out explicitly, e.g.:

All AAD token requests now include the CAE client capability (xms_cc=cp1) to enable Continuous Access Evaluation.

Worth also considering whether this should be gated on a CosmosClientOptions flag (similar to existing EnableThinClient* knobs) rather than being unconditional.

Contributor Author

The xms_cc=cp1 claim is a standard CAE client capability defined by the Microsoft Identity platform. It tells Entra "this client can handle claims challenges" — it doesn't change the token itself or its
validity. Entra handles it gracefully across all tenants, including sovereign clouds — if a tenant doesn't support CAE, the claim is simply ignored and a standard token is returned.

This is not a new behavioral opt-in. Azure Identity libraries (MSAL) already support xms_cc=cp1 natively via IsCaeEnabled on TokenRequestContext, which we also set in this PR. Sending it via Claims is the
documented pattern for signaling CAE readiness.

Gating behind a CosmosClientOptions flag would mean customers need to explicitly opt in to get revocation protection, which defeats the purpose of transparent handling. The feature should work out of the
box for all AAD-authenticated customers.

private readonly IDocumentClientRetryPolicy throttlingRetry;
private readonly GlobalEndpointManager globalEndpointManager;
private readonly GlobalPartitionEndpointManager partitionKeyRangeLocationCache;
//------------------------------------------------------------
Contributor

🔴 File-wide encoding damage folded into this diff.

Two mechanical changes were rolled into the substantive edit here (and in ClientRetryPolicyTests.cs):

  1. UTF-8 BOM stripped from line 1 (visible as - in the deleted line, no replacement).
  2. Non-ASCII characters in pre-existing comments were re-encoded from Windows-1252 to UTF-8 a second time, producing mojibake. Examples in the test file: ΓÇö (em dash) → ╬ô├ç├╢, 2Γö£├╣, divider comments like ─── DTX ─── → ╬ô├╢├ç╬ô├╢├ç╬ô├╢├ç DTX.... Visible at ClientRetryPolicyTests.cs line ~1075 (╬ô├ç├╢ idempo…) and several other places, including assert messages that compile but render as garbage in test output.

The real semantic change in ClientRetryPolicy.cs is ~80 net lines (new field, ctor param, the detection branch, HandleUnauthorizedResponse, MaxCaeRevocationRetryCount, caeRevocationRetryCount). It's buried inside +846/−768 of churn that no human reviewer can verify line-by-line, and future git blame on every line in both files will point to this PR.

Restore the BOM, restore the original line endings, undo the comment corruption, and re-do the substantive edits as a focused diff. If your editor stripped the BOM, configure .editorconfig / .gitattributes to enforce UTF-8-with-BOM for .cs files in this repo.

Contributor Author

Fixed — restored UTF-8 BOM on both ClientRetryPolicy.cs and ClientRetryPolicyTests.cs.
