[AAD token revocation]: Adds logic for handling emergency token revocation and CAE token revocation.#5549
[AAD token revocation]: Adds logic for handling emergency token revocation and CAE token revocation.#5549aavasthy wants to merge 26 commits into
Conversation
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
Deep Review Summary — AAD Token RevocationTwo independent review passes (deep reviewer + cosmos PR reviewer) converged on the same set of issues. Posting a consolidated summary; happy to drill into any of these in line comments if useful. Verdict: Request Changes. Five blockers below each independently break the feature's wire semantics or introduce correctness bugs under realistic load. 🔴 BlockersB1. B2. CAE-only 401s are not retried on metadata-read paths — B3. B4.
B5. 🟠 Important
🟡 Polish / Nits
Cross-SDK ParitySearched Generated by an automated multi-agent review (Deep Reviewer + cross-SDK pass). Two independent passes converged on the items above; B1, B2, B3, B4, B5 each appeared in both passes with the same root cause. Happy to attach line-level comments on request. |
📝 Changelog reminder (non-blocking)This PR touches shipped source but does not appear to update the Touched (and missing entry for):
How to decideUse the rubric in
This check is non-blocking — merge is not gated on it. The |
B1: IsCaeEnabled is never set to true on TokenRequestContext We already signal CAE support by sending xms_cc=cp1 via MergeClaimsWithClientCapabilities (TokenCredentialCache.cs:165-230) on every token request through the claims parameter The service-side CAE module is not yet implemented — when it is, we'll evaluate adding IsCaeEnabled alongside end-to-end CAE validation. B2: CAE-only 401s are not retried on metadata-read paths By design. The Cosmos DB gateway sends substatus 5013 (SubStatusCodes.AadTokenRevoked) for both emergency revocation and future CAE revocation. The detection criteria (401 + substatus 5013 + All metadata-read paths are covered:
TryHandleRevocationException (AuthorizationTokenProviderTokenCredential.cs:181-204) checks: 401 status + SubStatusCodes.AadTokenRevoked + provider is AuthorizationTokenProviderTokenCredential + B3: ResetCachedToken cannot cancel an in-flight refresh; stale token written back outside the lock The window requires: (1) a background refresh to already be in-flight at the exact moment revocation occurs, and (2) that old refresh to complete and write back AFTER Looking at the actual flow:
B4: MergeClaimsWithClientCapabilities brace-counter "JSON merge" has 5 concrete failure modes Addressing each:
B5: cachedClaimsChallenge race — read-modify-write outside the lock Two concurrent revocations would be required — revocation is a rare administrative action, and two happening simultaneously against the same client instance is extremely unlikely. 🟠 Important I1: catch when filter has side effects The side effect (cache reset via TryHandleRevocationException) in the catch when filter at GatewayAccountReader.cs:90-92 and GatewayAddressCache.cs:801-803, 941-943 is intentional. The cache must be reset I2: ResetCachedToken nulls currentRefreshOperation → Entra acquire-storm After ResetCachedToken, concurrent callers calling GetNewTokenAsync (TokenCredentialCache.cs:233-267) are coalesced by isTokenRefreshingLock (semaphore). Only the first thread to acquire the lock starts a I3: No CosmosDiagnostics / ITrace event for the transparent retry DefaultTrace logging at ClientRetryPolicy.cs:475-477, GatewayAccountReader.cs:94-95, and GatewayAddressCache.cs:805, 945 provides operational visibility. I4: ExtractClaimsFromWwwAuthenticate is a brittle string parser The parser at AuthorizationTokenProviderTokenCredential.cs:152-173 handles the known WWW-Authenticate format from the Cosmos gateway: Bearer realm="", authorization_uri="", error="insufficient_claims", I7: FaultInjection has no CaeTokenRevoked rule type The existing AadTokenRevoked fault injection rule type in FaultInjectionServerErrorResultInternal.cs now includes the WWW-Authenticate header with a proper claims challenge (base64-encoded I8: HandleUnauthorizedResponse returns null on the no-revocation branch Returning null at ClientRetryPolicy.cs:456 follows the existing ClientRetryPolicy convention. ShouldRetryInternalAsync checks each policy in sequence — null means "this policy doesn't handle this case, I9: tokenProvider vs authorizationTokenProvider field-name drift in GatewayAddressCache tokenProvider (line 50) is the existing ICosmosAuthorizationTokenProvider used for generating auth headers. authorizationTokenProvider (line 51) is the new AuthorizationTokenProvider used specifically for I10: New authorizationTokenProvider constructor params default to null Intentional for backward compatibility. GatewayAddressCache and GlobalAddressResolver constructors are called from multiple places including test code and external assemblies. Making the parameter required I12: Cold-start with permanently revoked credential = (1 + N_regions × 2) Entra token requests GetDatabaseAccountFromAnyLocationsAsync (GlobalEndpointManager.cs:140-156) tries the global endpoint first, then after 5 seconds launches 2 parallel tasks iterating through preferred regions. Each call |
NaluTripician
left a comment
There was a problem hiding this comment.
PR #5549 Deep Review — AAD Token Revocation (Emergency + CAE)
Deep review of the AAD token revocation transparent-retry feature. Architecture is sound and the test scaffolding (RevocationSimulatingHandler, the direct-mode negative test) is well designed. A handful of issues stand in the way of merge — inline comments cover the specifics.
Headline findings
🔴 Blocking
- Detection condition contradicts the spec. The code requires
401 AND 5013 AND WWW-Auth; the PR description says401 AND (5013 OR WWW-Auth). Pure-CAE rejections (no5013) silently fall through with no retry. The integration tests stamp both signals together, so CI doesn't catch it. (See inline comments onClientRetryPolicy.csandAuthorizationTokenProviderTokenCredential.cs.) MergeClaimsWithClientCapabilitiesproduces invalid JSON for the emptyaccess_tokencase →{"access_token":{,"xms_cc":…}}. Reproduced against the literal base64 used in the three new tests. Entra will reject the malformedClaimsand turn a recoverable revocation into a hard failure. The brace-counting splice also breaks on}inside string values.- Encoding damage in
ClientRetryPolicy.csandClientRetryPolicyTests.cs— UTF-8 BOM stripped, pre-existing comments re-encoded from Windows-1252 → UTF-8 a second time, producing mojibake (ΓÇö→╬ô├ç├╢, etc.). The real semantic change inClientRetryPolicy.csis ~80 net lines buried inside +846/−768 of churn that no human reviewer can verify line-by-line, and futuregit blameon every line in both files will point to this PR.
🟡 Major
4. Lost-update race on cachedClaimsChallenge (unlocked read/write outside backgroundRefreshLock).
5. IsCaeEnabled is inherited from the scope provider's default (false) instead of forced to true. Azure.Identity is free to ignore Claims when IsCaeEnabled=false.
6. xms_cc=cp1 is now sent on every AAD token request as a permanent behavioral change for all TokenCredential users, with no changelog entry and no opt-out. Echoes Fabian's concern on #5364 about non-public clouds.
7. Brittle WWW-Authenticate parser — literal claims=" substring; misses whitespace, unquoted form, multi-scheme challenges.
8. Three near-identical retry-once blocks (GatewayAccountReader + two in GatewayAddressCache) with subtle inconsistencies. Extract a helper.
9. Two parallel token-provider fields in GatewayAddressCache (tokenProvider + authorizationTokenProvider) — wired identically today by convention, no compile-time enforcement.
10. CosmosClient.ReadAccountAsync() is uncovered. It flows through GatewayStoreModel.GetDatabaseAccountAsync → SendHttpAsync directly with no retry wrapper. Customers calling ReadAccountAsync() while their token is revoked see an unretried 401 even though every other public-API path retries.
11. Address-cache fault-injection branches aren't wrapped, even though FaultInjectionServerErrorResultInternal was updated to emit the new header. Emulator revocation tests on address resolution will silently not retry and pass for the wrong reason.
12. Tests are largely tautological — assert boolean returns, never re-parse merged claims as JSON, never verify the captured TokenRequestContext.Claims content or IsCaeEnabled. No pure-CAE / pure-emergency / malformed-input tests.
🟢 Minor
13. Magic strings ("insufficient_claims", "claims=", "xms_cc", "cp1", the full Bearer … template) duplicated across 6+ files. Extract into a CaeConstants class.
14. FaultInjectionServerErrorResultInternal.cs:288 uses the WwwAuthenticate constant; lines 606-607 hardcode the literal.
15. TryHandleRevocationException mutates state inside a catch when filter — intentional but unusual; consider renaming or commenting.
Questions
- Is pure CAE (401 + WWW-Auth, no
5013) actually expected from the routing gateway? If the gateway always stamps5013alongside CAE challenges, the spec needs to be updated; otherwise the code does. - Has the empty-
access_tokencase fromMergeClaimsWithClientCapabilitiesbeen observed from a real challenge? Either way it should not produce invalid JSON. - Is the global
xms_cc=cp1opt-in to CAE intentional for all customers, or should it be gated on aCosmosClientOptionsflag like the existingEnableThinClient*knobs? - Are
CosmosClient.ReadAccountAsync()andGlobalEndpointManagerbackground refresh out-of-scope, or follow-up work? The spec mentions only the background refresh. - Should the AAD-revocation branch in
ClientRetryPolicybe#if !INTERNAL-guarded the way recent PPAF hub-region work is (#5792's convention)?
Flagging this as Comment (not Request changes) — but findings #1, #2, and #3 should land before this merges.
| } | ||
|
|
||
| if (statusCode == HttpStatusCode.Unauthorized | ||
| && subStatusCode == SubStatusCodes.AadTokenRevoked |
There was a problem hiding this comment.
🔴 Wrong boolean composition vs. spec.
This is AND for all three conditions, but the PR description (§3) says 401 AND (5013 OR WWW-Auth). Pure-CAE rejections from the backend CAE library don't carry the 5013 substatus — only emergency revocation does — so a CAE-only 401 with a claims challenge will silently fall through with no retry.
The integration tests in CosmosAadTokenRevocationTests.CreateFake401Response always stamp both signals together, so the gap doesn't show up in CI.
Suggest:
bool hasRevocationSignal =
subStatusCode == SubStatusCodes.AadTokenRevoked
|| !string.IsNullOrEmpty(wwwAuthenticateHeaderValue);
if (statusCode == HttpStatusCode.Unauthorized && hasRevocationSignal)
{
return this.HandleUnauthorizedResponse(wwwAuthenticateHeaderValue);
}The same gate exists in AuthorizationTokenProviderTokenCredential.TryHandleRevocationException and needs the same fix. Please also add tests for (a) 401 + WWW-Auth without 5013 and (b) 401 + 5013 without WWW-Auth.
There was a problem hiding this comment.
updated detection to OR logic: 401 AND (substatus AadTokenRevoked OR non-empty WWW-Authenticate). This covers both emergency revocation (5013 + WWW-Authenticate) and CAE (WWW-Authenticate without
5013). The downstream TryHandleTokenRevocation still validates that the WWW-Authenticate header contains insufficient_claims or claims= before triggering retry, so unrelated 401s with other
WWW-Authenticate content (e.g., proxy challenges) are not retried. If neither substatus 5013 nor WWW-Authenticate is present, the 401 surfaces to the customer as before.
| return false; | ||
| } | ||
|
|
||
| if (exception.GetSubStatus() != SubStatusCodes.AadTokenRevoked) |
There was a problem hiding this comment.
🔴 Same issue as ClientRetryPolicy.ShouldRetryInternalAsync — this rejects everything that isn't substatus 5013, so CAE-only 401s on the GatewayAccountReader and GatewayAddressCache paths never retry.
The downstream TryHandleTokenRevocation already validates the WWW-Authenticate content (insufficient_claims / claims=), so we can drop the substatus-only gate here and rely on that check. Otherwise the feature only covers emergency revocation, not CAE.
There was a problem hiding this comment.
Fixed — TryHandleRevocationException now proceeds if EITHER substatus is AadTokenRevoked OR WWW-Authenticate is present. Only bails if NEITHER signal is present. The downstream TryHandleTokenRevocation
provides the second layer of validation (checks for insufficient_claims or claims= in WWW-Authenticate content), so the safety net is preserved.
| $"TokenCredentialCache: Token cache reset due to AAD revocation signal. HasClaims={claimsChallenge != null}"); | ||
| } | ||
|
|
||
| internal static string MergeClaimsWithClientCapabilities(string? claimsChallenge) |
There was a problem hiding this comment.
🔴 This brace-counting splice produces invalid JSON.
Reproduced against the literal base64 used in the three new tests (eyJhY2Nlc3NfdG9rZW4iOnt9fQ== → {"access_token":{}}):
{"access_token":{,"xms_cc":{"values":["cp1"]}}}
That leading comma fails JSON parsing. The tests don't catch it because they only assert result.Contains("xms_cc") / result.Contains("acrs") — the output is never re-parsed as JSON.
Also broken: a } inside a string value in access_token splices at the wrong brace. Input {"access_token":{"nonce":"a}b","value":"c"}} becomes {"access_token":{"nonce":"a,"xms_cc":{...}}b","value":"c"}}.
Entra will reject the malformed Claims, and PR #5364's existing 401/403-throw-immediately path will surface that as an unhandled exception — turning a recoverable revocation into a hard failure.
Suggest rewriting with JsonDocument + Utf8JsonWriter:
using var doc = JsonDocument.Parse(claimsJson);
if (!doc.RootElement.TryGetProperty("access_token", out var atElem) ||
atElem.ValueKind != JsonValueKind.Object)
{
return clientCapabilitiesJson;
}
using var ms = new MemoryStream();
using (var writer = new Utf8JsonWriter(ms))
{
writer.WriteStartObject();
writer.WritePropertyName("access_token");
writer.WriteStartObject();
foreach (var p in atElem.EnumerateObject())
{
if (p.NameEquals("xms_cc")) continue; // avoid duplicate
p.WriteTo(writer);
}
writer.WritePropertyName("xms_cc");
writer.WriteStartObject();
writer.WriteStartArray("values");
writer.WriteStringValue("cp1");
writer.WriteEndArray();
writer.WriteEndObject();
writer.WriteEndObject();
writer.WriteEndObject();
}
return Encoding.UTF8.GetString(ms.ToArray());Then add a test that calls JsonDocument.Parse on the output for: {}, {"nbf":…}, an access_token that already contains xms_cc, and a } inside a string value.
There was a problem hiding this comment.
The empty access_token case ({"access_token":{}}) is a valid bug for that specific input, but the server never sends it — the gateway always includes nbf with a timestamp value:
{"access_token":{"nbf":{"essential":false,"value":""}}}. This is the only format the Cosmos gateway sends, confirmed in live E2E tests against real accounts with actual revocation rules.
The } inside string values concern is also theoretical — the server's claims JSON uses simple string values (timestamps, booleans) with no embedded braces.
The method has a catch-all at line 231 that falls back to cp1-only claims ({"access_token":{"xms_cc":{"values":["cp1"]}}}) on any parsing failure. So even with an unexpected format, the SDK degrades
gracefully — Entra still issues a fresh token (just without the nbf constraint), and the request succeeds if the new token doesn't match the revocation rule.
A JsonDocument/Utf8JsonWriter rewrite would be cleaner but System.Text.Json is not a built-in dependency for our netstandard2.0 target — it would require adding a NuGet package reference for a code path
that handles a well-defined, stable server response format.
| { | ||
| tokenRequestContext = this.scopeProvider.GetTokenRequestContext(); | ||
|
|
||
| string mergedClaims = TokenCredentialCache.MergeClaimsWithClientCapabilities(this.cachedClaimsChallenge); |
There was a problem hiding this comment.
🟡 Lost-update race on cachedClaimsChallenge.
ResetCachedToken writes cachedClaimsChallenge under lock (this.backgroundRefreshLock), but this read (and the this.cachedClaimsChallenge = null clears further down) are unlocked.
Concrete sequence:
- Thread A: mid-refresh,
claimsSnapshot == null - Thread B: hits a 401, calls
ResetCachedToken("X")under the lock - Thread A's
GetTokenAsyncreturns (stale token, doesn't satisfynbf); success path setscachedClaimsChallenge = null, clobbering "X", and writes the stale token toauthState - Thread B's retry reads the stale token → another 401 →
caeRevocationRetryCount == 1→NoRetry()→ customer-visible failure
Bounded (the stale token expires within its TTL), but the window is real whenever concurrent refresh + 401 happens. Suggest reading under the lock with a snapshot and clearing only if the value hasn't changed:
string snapshot;
lock (this.backgroundRefreshLock) { snapshot = this.cachedClaimsChallenge; }
string mergedClaims = MergeClaimsWithClientCapabilities(snapshot);
...
lock (this.backgroundRefreshLock)
{
if (object.ReferenceEquals(this.cachedClaimsChallenge, snapshot))
{
this.cachedClaimsChallenge = null;
}
}A stronger fix: have ResetCachedToken trip a CancellationTokenSource that the in-flight refresh observes, so a stale completion can't write back into authState.
There was a problem hiding this comment.
The concrete sequence described doesn't result in lost claims. The retry thread reads cachedClaimsChallenge into mergedClaims (line 297) BEFORE calling Entra (line 316). Even if the old refresh clears
cachedClaimsChallenge at line 335 during the Entra call, the retry already captured the claims into its TokenRequestContext. The fresh token is acquired with the correct nbf constraint.
The cachedClaimsChallenge = null at line 335 is intended behavior — clear claims after successful acquisition so future background refreshes don't include stale nbf.
The remaining theoretical risk is the old refresh writing a stale authState at line 340 after the new refresh writes the fresh one — this requires the old refresh (started before revocation) to complete
after the new one, which is extremely unlikely. Worst case: subsequent requests use the stale token, get 401 again, and the customer retries. Bounded by token TTL. Will add epoch protection in a follow-up
if telemetry shows this occurring.
| parentRequestId: tokenRequestContext.ParentRequestId, | ||
| claims: mergedClaims, | ||
| tenantId: tokenRequestContext.TenantId, | ||
| isCaeEnabled: tokenRequestContext.IsCaeEnabled); |
There was a problem hiding this comment.
🟡 IsCaeEnabled inherited, not forced true.
Azure.Identity inspects IsCaeEnabled to decide whether to honor Claims and issue a CAE-capable token. If IScopeProvider.GetTokenRequestContext() returns the default (IsCaeEnabled = false), the credential is free to ignore the claims string and return a non-CAE token — which won't satisfy the next revocation challenge.
The whole feature is opt-in to CAE (we send xms_cc=cp1 unconditionally and the retry path exists to handle claims challenges). Force it on:
isCaeEnabled: trueAnd add a test that captures the TokenRequestContext passed to a Mock<TokenCredential> and asserts IsCaeEnabled == true plus the Claims content.
There was a problem hiding this comment.
Agreed — added isCaeEnabled: true to CosmosScopeProvider.GetTokenRequestContext(). This is the proper Azure.Identity contract for signaling CAE support and ensures MSAL uses the CAE-aware token cache
partition. Combined with xms_cc=cp1 in claims, this gives full CAE compliance.
| private readonly Protocol protocol; | ||
| private readonly string protocolFilter; | ||
| private readonly ICosmosAuthorizationTokenProvider tokenProvider; | ||
| private readonly AuthorizationTokenProvider authorizationTokenProvider; |
There was a problem hiding this comment.
🟡 Two parallel token-provider fields are an invariant trap.
This class now carries two references to (presumably) the same underlying object — tokenProvider (ICosmosAuthorizationTokenProvider, used for fetching) and authorizationTokenProvider (AuthorizationTokenProvider, used for ResetCachedToken). GlobalAddressResolver wires both from the same cosmosAuthorization, but the class itself doesn't enforce it.
If a future caller (test factory, wrapper, thin-client variant) supplies different objects for the two parameters, ResetCachedToken runs against one cache while GetUserAuthorizationTokenAsync reads from the other — immediate second 401 and silent retry failure with no compile-time signal.
Consolidate to a single AuthorizationTokenProvider field (it already implements ICosmosAuthorizationTokenProvider / IAuthorizationTokenProvider), or add Debug.Assert(object.ReferenceEquals(tokenProvider, authorizationTokenProvider)); in the constructor.
There was a problem hiding this comment.
The two fields serve different interfaces that can't be consolidated without a breaking change to ICosmosAuthorizationTokenProvider. tokenProvider is used for generating auth headers (
GetUserAuthorizationTokenAsync), authorizationTokenProvider is used for revocation handling (TryHandleRevocationException). Both are wired from the same cosmosAuthorization instance in DocumentClient →
GlobalAddressResolver → GatewayAddressCache. The parameter defaults to null for backward compatibility — when null, TryHandleRevocationException returns false (revocation path is skipped), preserving
pre-feature behavior.
| return documentServiceResponse; | ||
| } | ||
| } | ||
| catch (DocumentClientException dce) |
There was a problem hiding this comment.
🟡 Fault-injection branch bypasses the new retry.
This try/catch when doesn't cover the if (httpClient.IsFaultInjectionClient) short-circuit branch above (and the same applies to GetServerAddressesViaGatewayAsync). Meanwhile FaultInjectionServerErrorResultInternal was updated to emit the new WWW-Authenticate header on AadTokenRevoked injections.
End result: injecting AadTokenRevoked on the address-cache path looks wired up, but the retry never fires for fault-injection traffic. Future emulator revocation tests on address resolution will silently not retry and pass for the wrong reason.
Wrap both fault-injection branches with the same pattern, or extract the helper suggested on GatewayAccountReader and use it in all four call sites.
There was a problem hiding this comment.
The fault-injection branch (IsFaultInjectionClient) is test infrastructure that returns early before the try/catch when block. This only executes in fault-injection test scenarios, never in production.
Production traffic always takes the non-fault-injection path which is covered by the revocation retry. The revocation tests use RevocationSimulatingHandler at the HTTP handler level, not the fault
injection client.
| ((int)SubStatusCodes.AadTokenRevoked).ToString()); | ||
| httpResponse.Headers.Add(WFConstants.BackendHeaders.LocalLSN, lsn); | ||
| httpResponse.Headers.TryAddWithoutValidation( | ||
| "WWW-Authenticate", |
There was a problem hiding this comment.
🟢 Hardcoded "WWW-Authenticate" here but line 288 (in the same file) uses HttpConstants.HttpHeaders.WwwAuthenticate. Use the constant in both places.
There was a problem hiding this comment.
Fixed — replaced hardcoded "WWW-Authenticate" with HttpConstants.HttpHeaders.WwwAuthenticate to match line 288.
|
|
||
| #### Features Added | ||
|
|
||
| - [#5549](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5549) Adds AAD token revocation (CAE / Emergency) transparent retry handling |
There was a problem hiding this comment.
🟡 This entry doesn't mention that xms_cc=cp1 is now sent on every AAD token request, not just after a revocation. That's a permanent behavioral change for all TokenCredential-authenticated traffic — Day-0 after this ships, every customer's first token acquisition will include Claims=…xms_cc:["cp1"]… even before any revocation has occurred.
CAE-capable tokens have different lifetime/refresh semantics in some Entra tenants, and a misconfigured tenant that rejects xms_cc=cp1 would be a new failure mode unrelated to revocation. Echoes Fabian's concern on #5364 about non-public clouds ("could never recover").
Suggest amending to call this out explicitly, e.g.:
All AAD token requests now include the CAE client capability (
xms_cc=cp1) to enable Continuous Access Evaluation.
Worth also considering whether this should be gated on a CosmosClientOptions flag (similar to existing EnableThinClient* knobs) rather than being unconditional.
There was a problem hiding this comment.
The xms_cc=cp1 claim is a standard CAE client capability defined by the Microsoft Identity platform. It tells Entra "this client can handle claims challenges" — it doesn't change the token itself or its
validity. Entra handles it gracefully across all tenants, including sovereign clouds — if a tenant doesn't support CAE, the claim is simply ignored and a standard token is returned.
This is not a new behavioral opt-in. Azure Identity libraries (MSAL) already support xms_cc=cp1 natively via IsCaeEnabled on TokenRequestContext, which we also set in this PR. Sending it via Claims is the
documented pattern for signaling CAE readiness.
Gating behind a CosmosClientOptions flag would mean customers need to explicitly opt in to get revocation protection, which defeats the purpose of transparent handling. The feature should work out of the
box for all AAD-authenticated customers.
| private readonly IDocumentClientRetryPolicy throttlingRetry; | ||
| private readonly GlobalEndpointManager globalEndpointManager; | ||
| private readonly GlobalPartitionEndpointManager partitionKeyRangeLocationCache; | ||
| //------------------------------------------------------------ |
There was a problem hiding this comment.
🔴 File-wide encoding damage folded into this diff.
Two mechanical changes were rolled into the substantive edit here (and in ClientRetryPolicyTests.cs):
- UTF-8 BOM stripped from line 1 (visible as
-in the deleted line, no replacement). - Non-ASCII characters in pre-existing comments were re-encoded from Windows-1252 to UTF-8 a second time, producing mojibake. Examples in the test file:
ΓÇö(em dash) →╬ô├ç├╢,2×→2Γö£├╣, divider comments like─── DTX ───→╬ô├╢├ç╬ô├╢├ç╬ô├╢├ç DTX.... Visible atClientRetryPolicyTests.csline ~1075 (╬ô├ç├╢ idempo…) and several other places, including assert messages that compile but render as garbage in test output.
The real semantic change in ClientRetryPolicy.cs is ~80 net lines (new field, ctor param, the detection branch, HandleUnauthorizedResponse, MaxCaeRevocationRetryCount, caeRevocationRetryCount). It's buried inside +846/−768 of churn that no human reviewer can verify line-by-line, and future git blame on every line in both files will point to this PR.
Restore the BOM, restore the original line endings, undo the comment corruption, and re-do the substantive edits as a focused diff. If your editor stripped the BOM, configure .editorconfig / .gitattributes to enforce UTF-8-with-BOM for .cs files in this repo.
There was a problem hiding this comment.
Fixed — restored UTF-8 BOM on both ClientRetryPolicy.cs and ClientRetryPolicyTests.cs.
Pull Request Template
Description
AAD Token Revocation (Emergency revocation) — SDK Implementation Spec
1. Feature Summary
The Azure Cosmos DB .NET SDK handles AAD token revocation transparently for the customer. When a token is revoked — either through an emergency revocation rule or a Continuous Access Evaluation (CAE) event — the routing gateway rejects the request. The SDK detects the rejection, extracts the claims challenge from the response, requests a fresh token from Microsoft Entra ID with those claims, and retries the original request. The customer's operation succeeds without any application-level error handling.
This covers both revocation mechanisms:
401 Unauthorizedwith substatus5013and aWWW-Authenticateheader containing a claims challenge.401 Unauthorizedand aWWW-Authenticateheader containingerror="insufficient_claims"and a claims challenge.The SDK behavior is identical for both revocation types.
2. How Revocation Works End-to-End
2.1 Normal operation (before revocation)
CosmosClientwith aTokenCredential(e.g.,DefaultAzureCredential).GetTokenAsyncmethod to obtain an AAD token from Entra. The SDK includesxms_cc=["cp1"]in the token request to signal that it supports CAE client capabilities.2.2 Token gets revoked
Emergency Revocation: An administrator creates a revocation rule in the Cosmos DB control plane. The rule specifies matching criteria based on JWT claims (e.g., revoke all tokens with a specific
utivalue). The gateway evaluates these rules and starts rejecting tokens that match.The next request from the SDK that carries the revoked token will be rejected by the gateway.
2.3 The routing gateway rejects the request
When the gateway determines that a token has been revoked, it returns:
The
claimsvalue is a base64-encoded JSON string that tells Entra what the new token must satisfy:{"access_token":{"nbf":{"essential":false,"value":"<unix_timestamp>"}}}The
nbf(not before) claim with the current timestamp ensures Entra issues a token that was created after the revocation event.2.4 The SDK detects the revocation
The SDK detects the revocation when it receives a
401 Unauthorizedresponse that meets either of these conditions:5013(emergency revocation)WWW-Authenticateheader (CAE revocation)The SDK additionally validates that the
WWW-Authenticateheader containsinsufficient_claimsorclaims=before proceeding with the retry. If neither is present, the 401 is treated as a normal authentication failure and is not retried.This detection only applies when the client uses
TokenCredential-based authentication (AAD).2.5 The SDK extracts claims and resets the token cache
Once revocation is detected, the SDK:
WWW-Authenticateheader by parsing theclaims="<base64>"value.After this step, the next call to get a token will not return the cached (revoked) token — it will request a fresh one from Entra.
2.6 The SDK requests a fresh token from Entra with merged claims
When the SDK needs a new token for the retry, it:
nbf(not before). The SDK addsxms_cc=["cp1"](CAE client capability). The merged result looks like:{"access_token":{"nbf":{"essential":false,"value":"1712345678"},"xms_cc":{"values":["cp1"]}}}TokenCredential.GetTokenAsync) with aTokenRequestContextthat includes the merged claims as theClaimsproperty.nbfrequirement — meaning the token was issued after the revocation timestamp.The customer's
TokenCredentialimplementation (e.g.,DefaultAzureCredential,InteractiveBrowserCredential) handles the claims automatically through the Azure Identity library's built-in CAE support.2.7 The SDK retries the original request
The SDK retries the exact same operation that originally failed, now using the fresh token. The gateway validates the new token, if it finds that it satisfies the revocation requirements then processes the request normally. The customer's operation succeeds.
2.8 The claims challenge is cleared
After the fresh token is successfully acquired, the stored claims challenge is cleared. Future normal token requests (including background refreshes) will only include the
xms_cc/cp1client capability without thenbfclaim.2.9 Retry limit
The SDK retries the failed request exactly once. If the retry also fails with a 401 (for any reason — another revocation, invalid token, permission change, etc.), the SDK does not retry again. The error is returned to the customer.
This prevents infinite retry loops in scenarios where the token is permanently invalid or the revocation rule cannot be satisfied.
3. Detection Criteria
The SDK triggers the revocation retry when both conditions are met:
401 Unauthorized5013or the response contains a non-emptyWWW-AuthenticateheaderAdditionally, the
WWW-Authenticateheader must containinsufficient_claimsorclaims=for the claims extraction to proceed. If neither is present, the retry is not triggered.The retry only applies when the client uses
TokenCredential-based authentication (AAD).4. Revocation Coverage by Connection Mode
Revocation rules are enforced by the routing gateway on every HTTP request it receives. The SDK retries once with a fresh token when it detects a revocation response.
4.1 Gateway Mode
In gateway mode, all operations go through HTTP to the routing gateway. Every operation is subject to token revocation.
All data-plane and control-plane operations in gateway mode go through
GatewayStoreModel, which is covered byClientRetryPolicy. The account read during client initialization is covered by a dedicated retry inGatewayAccountReader.4.2 Direct Mode
In direct mode, document operations go directly to backend replicas using the RNTBD binary protocol over TCP. RNTBD does not pass through the gateway, so document operations are not subject to revocation.
All other operations (control-plane, metadata, address resolution) still go through HTTP to the gateway and are fully covered.
4.3 ThinClient Mode
ThinClient mode behaves exactly like direct mode. Data-plane operations are not subject to gateway revocation. All metadata and control-plane operations go through the gateway and are fully covered.
ThinClient mode cannot be tested with the local emulator because it requires server-side enablement.
5. Paths Not Covered
5.1 Direct mode document operations
Document create, read, update, delete, patch, query, ReadMany, and ChangeFeed operations in direct mode use the RNTBD binary protocol over TCP. These requests go directly to backend replicas without passing through the routing gateway. The routing gateway never sees these requests, so its revocation rules are never evaluated. The SDK does not attempt revocation retry on this path because revocation cannot occur here.
5.2 Background account refresh
The SDK periodically refreshes account information in the background via
GlobalEndpointManager. This background refresh uses a different code path than the initial account read and does not have revocation retry logic. If a revocation 401 occurs during a background refresh, the refresh fails silently. The SDK continues operating with previously cached account information. The next user-initiated operation that triggers a token refresh (through any of the covered paths) will handle the revocation and obtain a fresh token.6. Files Modified
ClientRetryPolicy.csWWW-AuthenticateAuthorizationTokenProviderTokenCredential.csTryHandleRevocationExceptionstatic helper for out-of-pipeline retry pathsTokenCredentialCache.csResetCachedToken(claims)stores challenge;MergeClaimsWithClientCapabilitiesmerges nbf + xms_cc; clears challenge after successful token acquisitionGatewayAccountReader.cscatch whenrevocation retry on account read with one retryGatewayAddressCache.cscatch whenrevocation retry on master and server address resolution with one retry eachGlobalAddressResolver.csAuthorizationTokenProvidertoGatewayAddressCache(optional parameter, default null)DocumentClient.cscosmosAuthorizationtoGlobalAddressResolverFaultInjectionServerErrorResultInternal.csAadTokenRevokederror type now includesWWW-Authenticateheader with claims challenge7. Non-Regression Guarantees
catch whenfilters checkauthorizationTokenProvider is AuthorizationTokenProviderTokenCredential— returns false for master key authnull— no existing callers affectedHandleUnauthorizedResponsevalidatesWWW-Authenticatecontainsinsufficient_claimsorclaims=before actingClientRetryPolicyunit tests passAbstractRetryHandleris unchanged — no risk to the existing retry loop behaviorPlease delete options that are not relevant.
Closing issues
To automatically close an issue: closes #IssueNumber