Diagnostics: Adds Hedging Detection API (HedgingStarted, GetRequestedRegions and GetRespondedRegions)#5868
Diagnostics: Adds Hedging Detection API (HedgingStarted, GetRequestedRegions and GetRespondedRegions)#5868NaluTripician wants to merge 14 commits into
HedgingStarted, GetRequestedRegions and GetRespondedRegions)#5868Conversation
…mosDiagnostics accessors
Introduces the Hedging Detection API public surface on CosmosDiagnostics:
- New public readonly struct Microsoft.Azure.Cosmos.RequestedRegion
- New public enum Microsoft.Azure.Cosmos.RequestedRegionReason (non-exhaustive)
- Three new virtual methods on CosmosDiagnostics with safe defaults:
bool HedgingStarted()
IReadOnlyList<RequestedRegion> GetRequestedRegions()
IReadOnlyList<string> GetRespondedRegions()
Internal storage is a new HedgingDetectionState class attached to TraceSummary
(shared per-operation). CosmosTraceDiagnostics overrides delegate to that state.
No new TraceDatum is added, so the Diagnostics.ToString() JSON shape is unchanged.
Issue #5867. No callers yet; dispatch-site wiring lands in follow-up commits.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wires the HedgingDetectionState appends end-to-end: - ClientRetryPolicy.OnBeforeSendRequest: after route resolution, when a retry is in flight, tag Properties[DispatchReasonPropertyKey] with OperationRetry or RegionFailover depending on retryContext. - CrossRegionHedgingAvailabilityStrategy.CloneAndSendAsync: for hedge arms (requestNumber > 0) set Properties[DispatchReasonPropertyKey] = Hedging. Skipped for requestNumber == 0 so the primary remains tagged Initial. - TransportHandler.ProcessMessageAsync: after ToDocumentServiceRequest() and before storeProxy.ProcessMessageAsync, call new helper AppendDispatchedRegion which reads the reason from Properties (defaulting to Initial), resolves the region via GlobalEndpointManager.GetLocation, appends a RequestedRegion entry to HedgingDetectionState, and removes the property so subsequent retries default unless reset. - ClientSideRequestStatisticsTraceDatum: at the three RecordResponse / RecordHttpResponse / RecordHttpException sites, also append responded regions to HedgingDetectionState alongside the existing AddRegionContacted call. Build is clean for netstandard2.0. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- RequestedRegionTests: equality (case-insensitive), GetHashCode, ToString, null-name guard, operator ==/!=, mixed-type Equals. - HedgingDetectionStateTests: defaults, null/empty guards, HedgingStarted flip semantics, ordering/duplicates preservation, snapshot independence, thread-safety (8 writers x 500 iterations). - CosmosDiagnosticsBackwardCompatTests: customer subclass that never overrides the new virtuals — confirms safe default of false / empty list (AC9). - DotNetSDKAPI.net6.json: regenerated to reflect the new public surface (3 virtual methods, RequestedRegion struct, RequestedRegionReason enum). - changelog.md: adds Unreleased Preview entry for issue #5867. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ew 3-API design Replaces the legacy IsHedged/GetHedgedRegions proposal (preserved on feature/hedging-detection-api-legacy-pr5741 at commits 162dab8 and 388cedb) with the approved 3-API design covering HedgingStarted, GetRequestedRegions, GetRespondedRegions plus the RequestedRegion struct and RequestedRegionReason enum. Tracks issue #5867. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Diagnostics sample — what callers will seeAttaching a sample of how the new surface looks at call-sites and what the values look like for a few realistic dispatch shapes. The Caller usageItemResponse<Book> response = await container.ReadItemAsync<Book>(id, pk);
CosmosDiagnostics diagnostics = response.Diagnostics;
bool hedged = diagnostics.HedgingStarted();
IReadOnlyList<RequestedRegion> requested = diagnostics.GetRequestedRegions();
IReadOnlyList<string> responded = diagnostics.GetRespondedRegions();
logger.LogInformation(
"Hedging started: {Hedged}; requested=[{Req}]; responded=[{Resp}]",
hedged,
string.Join(", ", requested), // uses RequestedRegion.ToString() → "East US:Initial"
string.Join(", ", responded));
// Non-exhaustive enum — always include a default arm
foreach (RequestedRegion r in requested)
{
switch (r.Reason)
{
case RequestedRegionReason.Initial: /* first dispatch */ break;
case RequestedRegionReason.OperationRetry: /* same-region retry */ break;
case RequestedRegionReason.RegionFailover: /* cross-region failover */ break;
case RequestedRegionReason.Hedging: /* hedge arm dispatched */ break;
case RequestedRegionReason.CircuitBreakerProbe: /* PPCB probe (reserved in v1) */ break;
case RequestedRegionReason.TransportRetry: /* reserved in v1 */ break;
default: /* future-proof */ break;
}
}Scenarios1. Single-region happy path (no hedging configured)2. Same-region retry (e.g. 410 Gone → SDK retries primary)3. Region fail-over (preferred-locations list exhausts primary, fails over)4. Hedging configured, primary wins under threshold (no hedge arm dispatched)
5. Hedging configured, hedge arm wins (primary slow / failed)6. Hedging configured, two hedge arms, last responds (worst case)Equality / formatting on
|
| /// subclasses of <see cref="CosmosDiagnostics"/>). | ||
| /// </para> | ||
| /// </remarks> | ||
| public virtual bool HedgingStarted() |
There was a problem hiding this comment.
🟡 Recommendation (F2) No end-to-end tests cover the wiring this PR adds.
The 17 added tests cover HedgingDetectionState and RequestedRegion in isolation plus AC9 backward-compat defaults — but nothing verifies the integration that is the entire ask of the PR:
ClientRetryPolicy.OnBeforeSendRequestactually writesOperationRetryvsRegionFailoverfor the two branches ofRetryRequestOnPreferredLocations.CrossRegionHedgingAvailabilityStrategy.CloneAndSendAsyncwritesHedgingonly onrequestNumber > 0.TransportHandler.AppendDispatchedRegionresolves a region, appends an entry, and removes the property.- An end-to-end hedge fan-out via the existing mock pipeline produces the expected
HedgingStarted()/GetRequestedRegions()/GetRespondedRegions()sequence.
AC1–AC5 are essentially asserted by inspection only. CrossRegionHedgingAvailabilityStrategyTests is the natural home for the integration coverage, and the gap is what would have caught F3 (hedge-arm retry overwrite) before merge.
| Uri endpoint = serviceRequest.RequestContext?.LocationEndpointToRoute; | ||
| if (endpoint != null && globalEndpointManager != null) | ||
| { | ||
| regionName = globalEndpointManager.GetLocation(endpoint); |
There was a problem hiding this comment.
🟡 Recommendation (F4) Thin-client / PPAF / per-partition routed dispatches produce no entry in GetRequestedRegions().
GlobalEndpointManager.GetLocation(uri) walks AvailableWriteEndpointByLocation and AvailableReadEndpointByLocation and returns null for any URI not in those maps. That includes:
- Thin-client endpoints (
ResolveThinClientEndpointinClientRetryPolicy.cs:274-277returns endpoints fromThinClientReadEndpoints/ThinClientWriteEndpoints, whichGetLocationdoes not consult). - PPAF / PPCB-rerouted URIs.
- Any per-partition routed endpoint.
For those operations AppendDispatchedRegion bails silently — but ClientSideRequestStatisticsTraceDatum.RecordResponse/RecordHttpResponse still call AppendResponded because the responded region is derived from a different source. Net result: GetRespondedRegions() has entries but GetRequestedRegions() is empty for thin-client and PPCB scenarios — a surprising asymmetry that the XML docs do not disclose.
Suggested fix (lowest cost): add a remark to the XML doc on CosmosDiagnostics.GetRequestedRegions and HedgingStarted calling out that thin-client / PPAF dispatches are not currently included, and cross-reference AC6/AC7. Better: fall back to serviceRequest.RequestContext.RegionName when GetLocation returns null. Best: make GlobalEndpointManager aware of thin-client endpoints.
Addresses F1 review feedback on PR #5868. The previous enum layout had Initial = 0, which meant default(RequestedRegion) yielded { RegionName = null, Reason = Initial } — indistinguishable from a real first dispatch (any new RequestedRegion[N], uninitialized field, or deserialized value where Reason was absent produced a phantom Initial entry that callers could not attribute to themselves). After GA this would be an API-shape problem that could not be fixed without a breaking change. Reserves Unknown = 0 as the explicit default sentinel and renumbers Initial = 1, OperationRetry = 2, TransportRetry = 3, Hedging = 4, RegionFailover = 5, CircuitBreakerProbe = 6. The SDK never emits Unknown from a real dispatch — its presence is the signal that a RequestedRegion came from struct-default construction rather than an SDK dispatch. Updates RequestedRegionReason XML docs to document the sentinel, the contract baseline (DotNetSDKAPI.net6.json) to include the new Unknown field, the OpenSpec requirement + scenario, the tasks.md entry, and the changelog. Adds two new tests in RequestedRegionTests: DefaultStruct_ReasonIsUnknownSentinel_NotInitial pins the runtime invariant and Enum_UnknownIsZeroAndInitialIsOne pins the byte layout so any future renumbering is caught.
Addresses F3 review feedback on PR #5868. Before this fix, when a hedge arm itself triggered a retry (e.g. 410 Gone, 449), ClientRetryPolicy.OnBeforeSendRequest unconditionally overwrote the Hedging dispatch-reason value with OperationRetry or RegionFailover. TransportHandler had already consumed the prior Hedging tag for the first arm dispatch, so the retry of the hedge arm was recorded as a same-region retry, silently losing the hedge origin from the GetRequestedRegions() sequence. Adds a preservation guard: when Properties[DispatchReasonPropertyKey] already holds RequestedRegionReason.Hedging, the policy leaves it in place rather than overwriting. Non-Hedging tags continue to be overwritten by the new retry reason so the policy can still correctly distinguish OperationRetry from RegionFailover across normal (non-hedge) retries. Adds two unit tests in ClientRetryPolicyTests: OnBeforeSendRequest_HedgeArmRetry_PreservesHedgingReason pins the preservation invariant (drives a 410/LeaseNotFound on a pre-tagged hedge arm and asserts Hedging survives the second OnBeforeSendRequest), and OnBeforeSendRequest_NonHedgeRetry_OverwritesPreviousReason pins the inverse so the guard is strictly scoped to the Hedging value and does not block normal retry-reason transitions.
…ceeds Addresses F5 review feedback on PR #5868. Previously TransportHandler.AppendDispatchedRegion consumed (Remove'd) the DispatchReasonPropertyKey from Properties immediately after reading it, then attempted to resolve the region name from the routing endpoint and bailed early if resolution failed (the F4 thin-client / PPCB / per-partition routed scenarios). The dispatch-reason signal had now been consumed but never recorded — if anything downstream re-entered the dispatch path on the same DocumentServiceRequest, the original reason was gone and the next entry defaulted to Initial even though the upstream caller intended a different reason. Splits the Remove away from the read: capture whether the property was present, then attempt region resolution and bail (leaving Properties intact) on failure, and only Remove the key after a successful state.AppendRequested call. This preserves the signal across re-dispatch attempts that hit the F4 resolution gap, and is a strict superset of the previous semantics for the happy path.
…flag Addresses F7 review feedback on PR #5868. The public CosmosDiagnostics.HedgingStarted() virtual is documented as O(1) and safe to call on the diagnostics hot path, but the previous implementation acquired Monitor.Enter on regionLock for every read of a monotonic-true flag (set once inside AppendRequested and never reset). That added avoidable contention with the writer-side AppendRequested / AppendResponded callers on the same lock. Marks the hedgingStarted backing field volatile and reads it directly without acquiring regionLock. The write continues to happen under regionLock so it is ordered with the requestedRegions list mutation that triggered it; volatile gives readers an acquire-fence so the flip cannot be reordered before the list Add that established it. Memory-model behavior is preserved (no torn reads on bool; monotonic-true once flipped) without the Monitor.Enter / Exit pair on every CosmosDiagnostics.HedgingStarted() call. Adds two unit tests in HedgingDetectionStateTests: HedgingStartedBackingField_IsDeclaredVolatile uses reflection on FieldInfo.GetRequiredCustomModifiers() to assert IsVolatile is set so an accidental drop of the modifier during a future refactor is caught at the test layer rather than at production runtime, and HedgingStarted_LockFreeRead_ObservesWriterFlip functionally exercises three concurrent readers spinning on the getter while a writer flips the flag, asserting all readers observe true within a 5-second deadline.
HedgingStarted, GetRequestedRegions and GetRespondedRegions)
Addresses follow-up F7 review feedback on PR #5868. The volatile keyword on hedgingStarted is REQUIRED on the reader side (acquire fence for the lock-free HedgingStarted getter) but is REDUNDANT on the writer side (the regionLock release already publishes the store). Without an explicit comment, a future contributor could 'optimize' the writer-side lock away and break the diagnostics invariant that HedgingStarted == true implies at least one Hedging entry is visible in GetRequestedRegionsSnapshot(). - Rewrites the field-level comment to spell out reader-side vs. writer-side semantics and the consequence of moving the write outside regionLock. - Adds a short inline comment at the write site pointing back to the field comment. No behavioral change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…retries Addresses follow-up F3 review feedback on PR #5868. The previous F3 fix in ClientRetryPolicy.OnBeforeSendRequest added a preservation guard that skips overwriting Properties[DispatchReasonPropertyKey] when the existing value is already Hedging. However, the guard was effectively dead code in production because TransportHandler.AppendDispatchedRegion removed the key (under the F5 deferred-Remove fix) BEFORE the retry-driven re-entry of OnBeforeSendRequest could observe it. Since RequestMessage.Properties and the cached DocumentServiceRequest.Properties are the same dictionary reference, the Remove drained the source too. Fix: AppendDispatchedRegion now skips the Remove when the dispatch reason is Hedging. The key stays in Properties so the next physical retry's OnBeforeSendRequest can observe it, which lets the existing CRP preservation guard correctly classify the retry as Hedging instead of overwriting it with RegionFailover/OperationRetry. Tests: - AppendDispatchedRegion_HedgingReason_LeavesPropertyForRetry: unit test that the property remains after AppendDispatchedRegion when reason is Hedging. - AppendDispatchedRegion_NonHedgingReason_RemovesPropertyAfterConsume: inverse unit test confirming the key is still removed for non-Hedging reasons. - HedgeArmRetry_ProductionOrder_RecordsBothPhysicalAttemptsAsHedging: drives the production sequence (OnBefore -> AppendDispatched -> 410 -> ShouldRetry -> OnBefore -> AppendDispatched) and asserts both physical attempts are recorded as Hedging. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Diagnostics: Adds hedging detection API (
HedgingStarted/GetRequestedRegions/GetRespondedRegions)Closes #5867 (partial — covers .NET; Java and Python tracked separately).
Summary
Adds a focused, hot-path-friendly public surface on
CosmosDiagnosticsso callers can observe per-operation hedging behavior without parsing theToString()JSON:virtual bool HedgingStarted()—trueiff the SDK dispatched ≥ 1 hedge arm.virtual IReadOnlyList<RequestedRegion> GetRequestedRegions()— every region the SDK dispatched to, in observed order, tagged with aRequestedRegionReason.virtual IReadOnlyList<string> GetRespondedRegions()— every region that produced a response, in arrival order, duplicates allowed.RequestedRegion(readonly struct,IEquatable<>, case-insensitive name equality) andRequestedRegionReasonenum (Initial,OperationRetry,RegionFailover,Hedging,CircuitBreakerProbe,TransportRetry).All members are virtual with safe defaults (
false/ empty list); custom subclasses ofCosmosDiagnosticsare unaffected.Why this design
TraceSummary.HedgingDetectionState, not as aTraceDatum. This means theDiagnostics.ToString()JSON shape is unchanged, while the API remains O(1). Mirrors the existingRegionsContactedpattern.RequestMessage.Properties[__CosmosDB_HedgingDetection_NextDispatchReason]—Propertiesis already deep-cloned per hedge arm byRequestMessage.Clone()and is shared withDocumentServiceRequest.Properties, so the upstream sites (ClientRetryPolicy,CrossRegionHedgingAvailabilityStrategy) can signal the downstream dispatch site (TransportHandler) without adding a new internal contract.GlobalEndpointManager.GetLocation(locationEndpointToRoute)—DocumentServiceRequest.RequestContext.RegionNameis not populated until after the append site.requestNumber > 0) are only tagged insideCloneAndSendAsync, which is invoked only after the previous threshold delay elapses without primary cancellation.See
openspec/changes/hedging-detection-api/design.mdfor the full design write-up.Wiring
CrossRegionHedgingAvailabilityStrategy.CloneAndSendAsyncrequestNumber > 0setProperties[KEY] = Hedging.ClientRetryPolicy.OnBeforeSendRequestRouteToLocation, whenretryContext != nullsetProperties[KEY]toRegionFailoverorOperationRetry.TransportHandler.ProcessMessageAsyncToDocumentServiceRequest, append aRequestedRegion, then remove the property.ClientSideRequestStatisticsTraceDatumGetRespondedRegions.AC coverage
HedgingDetectionState.HedgingStartedflag, flipped on firstHedgingappend.requestNumber > 0; threshold delay gates dispatch.Initialrequested, region recorded as responded.ClientRetryPolicytagsOperationRetryfor same-region retries.ClientRetryPolicytagsRegionFailoverwhenRetryRequestOnPreferredLocations.CircuitBreakerProbereserved in enum but not populated — PPCB rerouting decision happens insideMicrosoft.Azure.Cosmos.Direct'sstoreProxy.ProcessMessageAsync, after the pre-dispatch append site. Tracked as follow-up requiring a Direct-package change.TransportRetryreserved but per-channel transport retries inside Direct are not surfaced at this layer in v1.GetRequestedRegions(Hedging) andGetRespondedRegions.CosmosDiagnosticsreturnfalse/ empty —CosmosDiagnosticsBackwardCompatTests.IReadOnlyList<T>returns immutable snapshot arrays.ToString()JSON shape unchanged.RequestedRegionis a readonly struct with case-insensitive name equality,==/!=, customGetHashCode,ToString.GetRespondedRegionsallowed and preserved.Files changed
Public surface (
src/Diagnostics)CosmosDiagnostics.cs— 3 new virtual methods with SE-013<remarks>.CosmosTraceDiagnostics.cs— overrides delegate toHedgingDetectionState.RequestedRegion.cs(new) — readonly struct.RequestedRegionReason.cs(new) — enum.Internal state (
src/Tracing)HedgingDetectionState.cs(new) — lock-protected per-trace state.TraceSummary.cs— eagerHedgingDetectionStateproperty.Wiring
Routing/AvailabilityStrategy/CrossRegionHedgingAvailabilityStrategy.csClientRetryPolicy.csHandler/TransportHandler.csTracing/TraceData/ClientSideRequestStatisticsTraceDatum.csTests
Diagnostics/RequestedRegionTests.cs(6 tests)Diagnostics/HedgingDetectionStateTests.cs(8 tests incl. concurrency)Diagnostics/CosmosDiagnosticsBackwardCompatTests.cs(3 tests, AC9)Contract / changelog / OpenSpec
tests/Microsoft.Azure.Cosmos.Tests/Contracts/DotNetSDKAPI.net6.jsonchangelog.mdopenspec/changes/hedging-detection-api/{proposal.md,design.md,tasks.md,specs/hedging-detection/spec.md}Validation
dotnet build Microsoft.Azure.Cosmos\src\Microsoft.Azure.Cosmos.csproj -c Debug— clean ✅dotnet build Microsoft.Azure.Cosmos\src\Microsoft.Azure.Cosmos.csproj -c Release— clean ✅Open questions / follow-ups
CircuitBreakerProbe— full PPCB tagging needs a small change inMicrosoft.Azure.Cosmos.Direct'sDocumentServiceRequestContextto flag probe requests. Suggested follow-up: surface anIsPpcbProbeflag and append aCircuitBreakerProbe-tagged entry at the rerouting site.TransportRetry— same Direct-package consideration for per-channel transport retries.Microsoft.Azure.Cosmos.EmulatorTestsinfra or add a single live-account test under an environment-gated test category.internal-spec.md.cc @kirankumarkolli — reviewer of record per the internal spec.