
Diagnostics: Adds Hedging Detection API (HedgingStarted, GetRequestedRegions and GetRespondedRegions)#5868

Open
NaluTripician wants to merge 14 commits into main from feature/hedging-detection-api



@NaluTripician
Contributor

Diagnostics: Adds hedging detection API (HedgingStarted / GetRequestedRegions / GetRespondedRegions)

Closes #5867 (partial — covers .NET; Java and Python tracked separately).

Summary

Adds a focused, hot-path-friendly public surface on CosmosDiagnostics so callers can observe per-operation hedging behavior without parsing the ToString() JSON:

  • virtual bool HedgingStarted(): true if and only if the SDK dispatched at least one hedge arm.
  • virtual IReadOnlyList<RequestedRegion> GetRequestedRegions() — every region the SDK dispatched to, in observed order, tagged with a RequestedRegionReason.
  • virtual IReadOnlyList<string> GetRespondedRegions() — every region that produced a response, in arrival order, duplicates allowed.
  • New types: RequestedRegion (readonly struct, IEquatable<>, case-insensitive name equality) and RequestedRegionReason enum (Initial, OperationRetry, RegionFailover, Hedging, CircuitBreakerProbe, TransportRetry).

All members are virtual with safe defaults (false / empty list); custom subclasses of CosmosDiagnostics are unaffected.

Why this design

  • State lives on TraceSummary.HedgingDetectionState, not as a TraceDatum. This means the Diagnostics.ToString() JSON shape is unchanged, while the API remains O(1). Mirrors the existing RegionsContacted pattern.
  • Cross-handler signaling via RequestMessage.Properties[__CosmosDB_HedgingDetection_NextDispatchReason]. Properties is already deep-cloned per hedge arm by RequestMessage.Clone() and is shared with DocumentServiceRequest.Properties, so the upstream sites (ClientRetryPolicy, CrossRegionHedgingAvailabilityStrategy) can signal the downstream dispatch site (TransportHandler) without adding a new internal contract.
  • Region resolution at dispatch via GlobalEndpointManager.GetLocation(locationEndpointToRoute). DocumentServiceRequest.RequestContext.RegionName is not populated until after the append site.
  • No phantom Hedging entries — hedge arms (requestNumber > 0) are only tagged inside CloneAndSendAsync, which is invoked only after the previous threshold delay elapses without primary cancellation.
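
The tag-then-consume signaling described in the bullets above can be sketched as follows. This is a minimal illustration, not the PR's code: `SketchDispatchReason` and `DispatchSignalSketch` are hypothetical names, a plain dictionary stands in for `RequestMessage.Properties`, and only the key string is taken from the description.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the PR's RequestedRegionReason enum.
public enum SketchDispatchReason { Initial, OperationRetry, RegionFailover, Hedging }

public static class DispatchSignalSketch
{
    // Key name as quoted in the PR description.
    public const string Key = "__CosmosDB_HedgingDetection_NextDispatchReason";

    // Upstream site (retry policy or hedging strategy): tag the next dispatch.
    public static void Tag(IDictionary<string, object> properties, SketchDispatchReason reason)
    {
        properties[Key] = reason;
    }

    // Downstream dispatch site: read the tag (defaulting to Initial), then clear it
    // so that a later untagged dispatch is not attributed to a stale reason.
    public static SketchDispatchReason Consume(IDictionary<string, object> properties)
    {
        SketchDispatchReason reason = SketchDispatchReason.Initial;
        if (properties.TryGetValue(Key, out object value) && value is SketchDispatchReason tagged)
        {
            reason = tagged;
        }

        properties.Remove(Key);
        return reason;
    }
}
```

Because the tag travels in a dictionary that both handlers already see, neither side needs a reference to the other; the only shared contract is the key string.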

See openspec/changes/hedging-detection-api/design.md for the full design write-up.

Wiring

| Site | Change |
| --- | --- |
| CrossRegionHedgingAvailabilityStrategy.CloneAndSendAsync | For requestNumber > 0 set Properties[KEY] = Hedging. |
| ClientRetryPolicy.OnBeforeSendRequest | After RouteToLocation, when retryContext != null set Properties[KEY] to RegionFailover or OperationRetry. |
| TransportHandler.ProcessMessageAsync | After ToDocumentServiceRequest, append a RequestedRegion, then remove the property. |
| ClientSideRequestStatisticsTraceDatum | At all 3 response-record sites, also append the region to GetRespondedRegions. |

AC coverage

| AC | Status | Mechanism |
| --- | --- | --- |
| AC1 | | HedgingDetectionState.HedgingStarted flag, flipped on first Hedging append. |
| AC2 | | Hedge arm tag set only inside requestNumber > 0; threshold delay gates dispatch. |
| AC3 | | Single-region happy path: Initial requested, region recorded as responded. |
| AC4 | | ClientRetryPolicy tags OperationRetry for same-region retries. |
| AC5 | | ClientRetryPolicy tags RegionFailover when RetryRequestOnPreferredLocations. |
| AC6 | ⏸ Deferred | CircuitBreakerProbe reserved in enum but not populated: the PPCB rerouting decision happens inside Microsoft.Azure.Cosmos.Direct's storeProxy.ProcessMessageAsync, after the pre-dispatch append site. Tracked as follow-up requiring a Direct-package change. |
| AC7 | ⏸ Deferred | TransportRetry reserved but per-channel transport retries inside Direct are not surfaced at this layer in v1. |
| AC8 | | Hedge winner appears both in GetRequestedRegions (Hedging) and GetRespondedRegions. |
| AC9 | | Default virtuals on CosmosDiagnostics return false / empty (CosmosDiagnosticsBackwardCompatTests). |
| AC10 | | IReadOnlyList<T> returns immutable snapshot arrays. |
| AC11 | | State off the trace tree → ToString() JSON shape unchanged. |
| AC12 | | RequestedRegion is a readonly struct with case-insensitive name equality, ==/!=, custom GetHashCode, ToString. |
| AC13 | | No phantom Hedging entries; see design §5. |
| AC14 | | Duplicates in GetRespondedRegions allowed and preserved. |
| AC15 | | Lock-protected snapshots; concurrency test (8 writers × 500 iterations). |
| AC16 | ⏸ Tracked | Live multi-region smoke test deferred to a follow-up emulator/live-account test; requires an environment outside the per-commit CI matrix. |

Files changed

Public surface (src/Diagnostics)

  • CosmosDiagnostics.cs — 3 new virtual methods with SE-013 <remarks>.
  • CosmosTraceDiagnostics.cs — overrides delegate to HedgingDetectionState.
  • RequestedRegion.cs (new) — readonly struct.
  • RequestedRegionReason.cs (new) — enum.

Internal state (src/Tracing)

  • HedgingDetectionState.cs (new) — lock-protected per-trace state.
  • TraceSummary.cs — eager HedgingDetectionState property.

Wiring

  • Routing/AvailabilityStrategy/CrossRegionHedgingAvailabilityStrategy.cs
  • ClientRetryPolicy.cs
  • Handler/TransportHandler.cs
  • Tracing/TraceData/ClientSideRequestStatisticsTraceDatum.cs

Tests

  • Diagnostics/RequestedRegionTests.cs (6 tests)
  • Diagnostics/HedgingDetectionStateTests.cs (8 tests incl. concurrency)
  • Diagnostics/CosmosDiagnosticsBackwardCompatTests.cs (3 tests, AC9)

Contract / changelog / OpenSpec

  • tests/Microsoft.Azure.Cosmos.Tests/Contracts/DotNetSDKAPI.net6.json
  • changelog.md
  • openspec/changes/hedging-detection-api/{proposal.md,design.md,tasks.md,specs/hedging-detection/spec.md}

Validation

  • dotnet build Microsoft.Azure.Cosmos\src\Microsoft.Azure.Cosmos.csproj -c Debug — clean ✅
  • dotnet build Microsoft.Azure.Cosmos\src\Microsoft.Azure.Cosmos.csproj -c Release — clean ✅
  • New unit tests — 17 / 17 ✅
  • GA contract enforcement test — 107 / 107 ✅

Open questions / follow-ups

  • AC6 / CircuitBreakerProbe — full PPCB tagging needs a small change in Microsoft.Azure.Cosmos.Direct's DocumentServiceRequestContext to flag probe requests. Suggested follow-up: surface an IsPpcbProbe flag and append a CircuitBreakerProbe-tagged entry at the rerouting site.
  • AC7 / TransportRetry — same Direct-package consideration for per-channel transport retries.
  • AC16 — live multi-region smoke test; can either reuse the existing Microsoft.Azure.Cosmos.EmulatorTests infra or add a single live-account test under an environment-gated test category.
  • Java / Python issue threads should be opened separately, referencing issue #5867 ([Feature] Hedging Detection API — public accessors on CosmosDiagnostics) and the cross-SDK invariants in internal-spec.md.

cc @kirankumarkolli — reviewer of record per the internal spec.

Copilot AI and others added 5 commits May 14, 2026 13:39
…mosDiagnostics accessors

Introduces the Hedging Detection API public surface on CosmosDiagnostics:
- New public readonly struct Microsoft.Azure.Cosmos.RequestedRegion
- New public enum Microsoft.Azure.Cosmos.RequestedRegionReason (non-exhaustive)
- Three new virtual methods on CosmosDiagnostics with safe defaults:
    bool HedgingStarted()
    IReadOnlyList<RequestedRegion> GetRequestedRegions()
    IReadOnlyList<string> GetRespondedRegions()

Internal storage is a new HedgingDetectionState class attached to TraceSummary
(shared per-operation). CosmosTraceDiagnostics overrides delegate to that state.
No new TraceDatum is added, so the Diagnostics.ToString() JSON shape is unchanged.

Issue #5867. No callers yet; dispatch-site wiring lands in follow-up commits.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wires the HedgingDetectionState appends end-to-end:

- ClientRetryPolicy.OnBeforeSendRequest: after route resolution, when a
  retry is in flight, tag Properties[DispatchReasonPropertyKey] with
  OperationRetry or RegionFailover depending on retryContext.
- CrossRegionHedgingAvailabilityStrategy.CloneAndSendAsync: for hedge arms
  (requestNumber > 0) set Properties[DispatchReasonPropertyKey] = Hedging.
  Skipped for requestNumber == 0 so the primary remains tagged Initial.
- TransportHandler.ProcessMessageAsync: after ToDocumentServiceRequest()
  and before storeProxy.ProcessMessageAsync, call new helper
  AppendDispatchedRegion which reads the reason from Properties (defaulting
  to Initial), resolves the region via GlobalEndpointManager.GetLocation,
  appends a RequestedRegion entry to HedgingDetectionState, and removes
  the property so subsequent retries default unless reset.
- ClientSideRequestStatisticsTraceDatum: at the three RecordResponse /
  RecordHttpResponse / RecordHttpException sites, also append responded
  regions to HedgingDetectionState alongside the existing AddRegionContacted
  call.

Build is clean for netstandard2.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- RequestedRegionTests: equality (case-insensitive), GetHashCode, ToString,
  null-name guard, operator ==/!=, mixed-type Equals.
- HedgingDetectionStateTests: defaults, null/empty guards, HedgingStarted
  flip semantics, ordering/duplicates preservation, snapshot independence,
  thread-safety (8 writers x 500 iterations).
- CosmosDiagnosticsBackwardCompatTests: customer subclass that never
  overrides the new virtuals — confirms safe default of false / empty list
  (AC9).
- DotNetSDKAPI.net6.json: regenerated to reflect the new public surface
  (3 virtual methods, RequestedRegion struct, RequestedRegionReason enum).
- changelog.md: adds Unreleased Preview entry for issue #5867.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ew 3-API design

Replaces the legacy IsHedged/GetHedgedRegions proposal (preserved on
feature/hedging-detection-api-legacy-pr5741 at commits 162dab8 and
388cedb) with the approved 3-API design covering HedgingStarted,
GetRequestedRegions, GetRespondedRegions plus the RequestedRegion struct
and RequestedRegionReason enum. Tracks issue #5867.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician NaluTripician marked this pull request as ready for review May 18, 2026 18:42
@NaluTripician
Contributor Author

Diagnostics sample — what callers will see

Attaching a sample of how the new surface looks at call-sites and what the values look like for a few realistic dispatch shapes. The Diagnostics.ToString() JSON shape is unchanged (state lives off the trace tree on TraceSummary.HedgingDetectionState), so the only observable difference is the three new accessors on CosmosDiagnostics.

Caller usage

ItemResponse<Book> response = await container.ReadItemAsync<Book>(id, pk);
CosmosDiagnostics diagnostics = response.Diagnostics;

bool hedged = diagnostics.HedgingStarted();
IReadOnlyList<RequestedRegion> requested = diagnostics.GetRequestedRegions();
IReadOnlyList<string>          responded = diagnostics.GetRespondedRegions();

logger.LogInformation(
    "Hedging started: {Hedged}; requested=[{Req}]; responded=[{Resp}]",
    hedged,
    string.Join(", ", requested),                 // uses RequestedRegion.ToString() → "East US:Initial"
    string.Join(", ", responded));

// Non-exhaustive enum — always include a default arm
foreach (RequestedRegion r in requested)
{
    switch (r.Reason)
    {
        case RequestedRegionReason.Initial:             /* first dispatch */              break;
        case RequestedRegionReason.OperationRetry:      /* same-region retry */           break;
        case RequestedRegionReason.RegionFailover:      /* cross-region failover */       break;
        case RequestedRegionReason.Hedging:             /* hedge arm dispatched */        break;
        case RequestedRegionReason.CircuitBreakerProbe: /* PPCB probe (reserved in v1) */ break;
        case RequestedRegionReason.TransportRetry:      /* reserved in v1 */              break;
        default:                                        /* future-proof */                break;
    }
}

Scenarios

1. Single-region happy path (no hedging configured)

HedgingStarted()      = false
GetRequestedRegions() = [ East US:Initial ]
GetRespondedRegions() = [ East US ]

2. Same-region retry (e.g. 410 Gone → SDK retries primary)

HedgingStarted()      = false
GetRequestedRegions() = [ East US:Initial, East US:OperationRetry ]
GetRespondedRegions() = [ East US, East US ]

3. Region fail-over (preferred-locations list exhausts primary, fails over)

HedgingStarted()      = false
GetRequestedRegions() = [ East US:Initial, West US:RegionFailover ]
GetRespondedRegions() = [ East US, West US ]

4. Hedging configured, primary wins under threshold (no hedge arm dispatched)

HedgingStarted()      = false           // ← AC1: registered ≠ dispatched
GetRequestedRegions() = [ East US:Initial ]
GetRespondedRegions() = [ East US ]

Note: HedgingStarted() == false here does not mean hedging was disabled. The primary simply responded within the configured threshold, so no hedge arm was ever dispatched. To check configuration, inspect CosmosClientOptions.AvailabilityStrategy / RequestOptions.AvailabilityStrategy directly.

5. Hedging configured, hedge arm wins (primary slow / failed)

HedgingStarted()      = true
GetRequestedRegions() = [ East US:Initial, West US:Hedging ]
GetRespondedRegions() = [ West US, East US ]   // hedge winner first; late primary response preserved (AC14)

6. Hedging configured, two hedge arms, last responds (worst case)

HedgingStarted()      = true
GetRequestedRegions() = [ East US:Initial, West US:Hedging, North Europe:Hedging ]
GetRespondedRegions() = [ North Europe, West US, East US ]
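
One way a caller might interpret these sequences is to ask whether the first region to respond had been dispatched as a hedge arm. The helper below is a rough sketch under stated assumptions: `HedgeOutcomeSketch` is a hypothetical name, and plain `(Region, Reason)` tuples stand in for `RequestedRegion` so the snippet compiles without the SDK. It relies only on the arrival-order guarantee of `GetRespondedRegions()` described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HedgeOutcomeSketch
{
    // Returns true when the first region to respond was requested with reason "Hedging".
    // Region names are compared case-insensitively, matching RequestedRegion equality.
    public static bool FirstResponderWasHedgeArm(
        IReadOnlyList<(string Region, string Reason)> requested,
        IReadOnlyList<string> responded)
    {
        if (responded.Count == 0)
        {
            return false; // nothing responded, so there is no winner to classify
        }

        string first = responded[0];
        return requested.Any(r =>
            r.Reason == "Hedging" &&
            string.Equals(r.Region, first, StringComparison.OrdinalIgnoreCase));
    }
}
```

For scenario 5 this returns true (West US responds first and was a hedge arm); for scenario 4 it returns false.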

Equality / formatting on RequestedRegion

RequestedRegion a = new RequestedRegion("East US", RequestedRegionReason.Hedging);
RequestedRegion b = new RequestedRegion("east us", RequestedRegionReason.Hedging);

a == b           // true (case-insensitive on name, exact on reason)
a.ToString()     // "East US:Hedging"
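
For readers who want to see how such equality is typically implemented, here is a minimal sketch of a readonly struct with the same observable behavior. It is not the PR's source: `SketchRequestedRegion` and `SketchReason` are hypothetical names, and the choice of `OrdinalIgnoreCase` is an assumption (the PR only states "case-insensitive").

```csharp
using System;

// Hypothetical mirror of the PR's enum; only the member names come from the PR.
public enum SketchReason { Initial, OperationRetry, RegionFailover, Hedging }

public readonly struct SketchRequestedRegion : IEquatable<SketchRequestedRegion>
{
    public SketchRequestedRegion(string regionName, SketchReason reason)
    {
        this.RegionName = regionName ?? throw new ArgumentNullException(nameof(regionName));
        this.Reason = reason;
    }

    public string RegionName { get; }

    public SketchReason Reason { get; }

    // Case-insensitive on the name, exact on the reason.
    public bool Equals(SketchRequestedRegion other) =>
        string.Equals(this.RegionName, other.RegionName, StringComparison.OrdinalIgnoreCase)
        && this.Reason == other.Reason;

    public override bool Equals(object obj) =>
        obj is SketchRequestedRegion other && this.Equals(other);

    // The hash must use the same comparer as Equals, or equal values could hash differently.
    public override int GetHashCode()
    {
        int nameHash = this.RegionName == null
            ? 0 // default(SketchRequestedRegion) has a null name
            : StringComparer.OrdinalIgnoreCase.GetHashCode(this.RegionName);
        return unchecked((nameHash * 397) ^ (int)this.Reason);
    }

    public static bool operator ==(SketchRequestedRegion left, SketchRequestedRegion right) => left.Equals(right);

    public static bool operator !=(SketchRequestedRegion left, SketchRequestedRegion right) => !left.Equals(right);

    public override string ToString() => $"{this.RegionName}:{this.Reason}";
}
```

Hashing the name with the case-insensitive comparer is what keeps the struct usable as a dictionary or set key when region names differ only in casing.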

Backward-compat note (AC9)

Any customer-authored subclass of CosmosDiagnostics that does not override the new virtuals returns:

HedgingStarted()      = false
GetRequestedRegions() = []   // Array.Empty<RequestedRegion>()
GetRespondedRegions() = []   // Array.Empty<string>()

This is also what the SDK returns for diagnostics objects produced before this version's plumbing populated them.

Diagnostics.ToString() JSON shape

Unchanged. State is on TraceSummary.HedgingDetectionState, not in the trace tree, so existing dashboards / log parsers that consume the JSON are unaffected.

Comment thread Microsoft.Azure.Cosmos/src/Diagnostics/RequestedRegionReason.cs Outdated
/// subclasses of <see cref="CosmosDiagnostics"/>).
/// </para>
/// </remarks>
public virtual bool HedgingStarted()
Member

🟡 Recommendation (F2) No end-to-end tests cover the wiring this PR adds.

The 17 added tests cover HedgingDetectionState and RequestedRegion in isolation plus AC9 backward-compat defaults — but nothing verifies the integration that is the entire ask of the PR:

  • ClientRetryPolicy.OnBeforeSendRequest actually writes OperationRetry vs RegionFailover for the two branches of RetryRequestOnPreferredLocations.
  • CrossRegionHedgingAvailabilityStrategy.CloneAndSendAsync writes Hedging only on requestNumber > 0.
  • TransportHandler.AppendDispatchedRegion resolves a region, appends an entry, and removes the property.
  • An end-to-end hedge fan-out via the existing mock pipeline produces the expected HedgingStarted()/GetRequestedRegions()/GetRespondedRegions() sequence.

AC1–AC5 are essentially asserted by inspection only. CrossRegionHedgingAvailabilityStrategyTests is the natural home for the integration coverage, and the gap is what would have caught F3 (hedge-arm retry overwrite) before merge.

Comment thread Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs
Uri endpoint = serviceRequest.RequestContext?.LocationEndpointToRoute;
if (endpoint != null && globalEndpointManager != null)
{
    regionName = globalEndpointManager.GetLocation(endpoint);
Member

🟡 Recommendation (F4) Thin-client / PPAF / per-partition routed dispatches produce no entry in GetRequestedRegions().

GlobalEndpointManager.GetLocation(uri) walks AvailableWriteEndpointByLocation and AvailableReadEndpointByLocation and returns null for any URI not in those maps. That includes:

  • Thin-client endpoints (ResolveThinClientEndpoint in ClientRetryPolicy.cs:274-277 returns endpoints from ThinClientReadEndpoints/ThinClientWriteEndpoints, which GetLocation does not consult).
  • PPAF / PPCB-rerouted URIs.
  • Any per-partition routed endpoint.

For those operations AppendDispatchedRegion bails silently — but ClientSideRequestStatisticsTraceDatum.RecordResponse/RecordHttpResponse still call AppendResponded because the responded region is derived from a different source. Net result: GetRespondedRegions() has entries but GetRequestedRegions() is empty for thin-client and PPCB scenarios — a surprising asymmetry that the XML docs do not disclose.

Suggested fix (lowest cost): add a remark to the XML doc on CosmosDiagnostics.GetRequestedRegions and HedgingStarted calling out that thin-client / PPAF dispatches are not currently included, and cross-reference AC6/AC7. Better: fall back to serviceRequest.RequestContext.RegionName when GetLocation returns null. Best: make GlobalEndpointManager aware of thin-client endpoints.
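
The middle option (fall back to the request context's region name) could look roughly like the sketch below. Everything here is illustrative: `RegionFallbackSketch` is a hypothetical helper, `primaryLookup` stands in for `GlobalEndpointManager.GetLocation`, and `contextRegionName` stands in for `RequestContext.RegionName`. Note that the PR description says RegionName may not be populated yet at the append site, so the fallback can still come back empty.

```csharp
using System;

public static class RegionFallbackSketch
{
    // primaryLookup: assumed to return null for endpoints it does not know about
    // (thin-client, PPAF / PPCB-rerouted, per-partition routed).
    // contextRegionName: the fallback source; may itself be null or empty.
    public static string ResolveRegion(Uri endpoint, Func<Uri, string> primaryLookup, string contextRegionName)
    {
        string region = null;
        if (endpoint != null && primaryLookup != null)
        {
            region = primaryLookup(endpoint);
        }

        // Fall back only when the primary lookup produced nothing.
        if (string.IsNullOrEmpty(region))
        {
            region = contextRegionName;
        }

        // Normalize "nothing found" to null so the caller can skip the append.
        return string.IsNullOrEmpty(region) ? null : region;
    }
}
```

A caller would append a requested-region entry only when the result is non-null, which narrows the asymmetry with GetRespondedRegions() without changing GlobalEndpointManager.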

Addresses F1 review feedback on PR #5868. The previous enum layout had Initial = 0, which meant default(RequestedRegion) yielded { RegionName = null, Reason = Initial } — indistinguishable from a real first dispatch (any new RequestedRegion[N], uninitialized field, or deserialized value where Reason was absent produced a phantom Initial entry that callers could not attribute to themselves). After GA this would be an API-shape problem that could not be fixed without a breaking change.

Reserves Unknown = 0 as the explicit default sentinel and renumbers Initial = 1, OperationRetry = 2, TransportRetry = 3, Hedging = 4, RegionFailover = 5, CircuitBreakerProbe = 6. The SDK never emits Unknown from a real dispatch — its presence is the signal that a RequestedRegion came from struct-default construction rather than an SDK dispatch.

Updates RequestedRegionReason XML docs to document the sentinel, the contract baseline (DotNetSDKAPI.net6.json) to include the new Unknown field, the OpenSpec requirement + scenario, the tasks.md entry, and the changelog. Adds two new tests in RequestedRegionTests: DefaultStruct_ReasonIsUnknownSentinel_NotInitial pins the runtime invariant and Enum_UnknownIsZeroAndInitialIsOne pins the byte layout so any future renumbering is caught.
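
The effect of reserving zero for the sentinel can be illustrated with a small sketch. `SketchRegionReason` and `SketchRegionEntry` are hypothetical names; the numeric values are copied from the commit message above, and `IsFromDispatch` is an illustrative helper that is not claimed to exist in the SDK.

```csharp
// Hypothetical mirror of the renumbered enum described in the commit message.
public enum SketchRegionReason
{
    Unknown = 0, // default sentinel, never emitted by a real dispatch
    Initial = 1,
    OperationRetry = 2,
    TransportRetry = 3,
    Hedging = 4,
    RegionFailover = 5,
    CircuitBreakerProbe = 6,
}

public readonly struct SketchRegionEntry
{
    public SketchRegionEntry(string regionName, SketchRegionReason reason)
    {
        this.RegionName = regionName;
        this.Reason = reason;
    }

    public string RegionName { get; }

    public SketchRegionReason Reason { get; }

    // False for default(SketchRegionEntry) and for elements of a freshly allocated array,
    // because the zero-initialized Reason is Unknown rather than Initial.
    public bool IsFromDispatch => this.Reason != SketchRegionReason.Unknown;
}
```

With the earlier layout (Initial = 0), the same check would have reported every zero-initialized entry as a genuine first dispatch.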
Addresses F3 review feedback on PR #5868. Before this fix, when a hedge arm itself triggered a retry (e.g. 410 Gone, 449), ClientRetryPolicy.OnBeforeSendRequest unconditionally overwrote the Hedging dispatch-reason value with OperationRetry or RegionFailover. TransportHandler had already consumed the prior Hedging tag for the first arm dispatch, so the retry of the hedge arm was recorded as a same-region retry, silently losing the hedge origin from the GetRequestedRegions() sequence.

Adds a preservation guard: when Properties[DispatchReasonPropertyKey] already holds RequestedRegionReason.Hedging, the policy leaves it in place rather than overwriting. Non-Hedging tags continue to be overwritten by the new retry reason so the policy can still correctly distinguish OperationRetry from RegionFailover across normal (non-hedge) retries.

Adds two unit tests in ClientRetryPolicyTests: OnBeforeSendRequest_HedgeArmRetry_PreservesHedgingReason pins the preservation invariant (drives a 410/LeaseNotFound on a pre-tagged hedge arm and asserts Hedging survives the second OnBeforeSendRequest), and OnBeforeSendRequest_NonHedgeRetry_OverwritesPreviousReason pins the inverse so the guard is strictly scoped to the Hedging value and does not block normal retry-reason transitions.
…ceeds

Addresses F5 review feedback on PR #5868. Previously TransportHandler.AppendDispatchedRegion removed DispatchReasonPropertyKey from Properties immediately after reading it, then attempted to resolve the region name from the routing endpoint and bailed early if resolution failed (the F4 thin-client / PPCB / per-partition routed scenarios). In that case the dispatch-reason signal had been consumed but never recorded: if anything downstream re-entered the dispatch path on the same DocumentServiceRequest, the original reason was gone and the next entry defaulted to Initial even though the upstream caller intended a different reason.

Splits the Remove away from the read: capture whether the property was present, then attempt region resolution and bail (leaving Properties intact) on failure, and only Remove the key after a successful state.AppendRequested call. This preserves the signal across re-dispatch attempts that hit the F4 resolution gap, and is a strict superset of the previous semantics for the happy path.
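
The read / resolve / append / remove ordering described above can be sketched as follows. This is an illustration only: `DeferredRemoveSketch` is a hypothetical helper, a dictionary stands in for Properties, `resolveRegion` stands in for the endpoint-to-region lookup, and entries are recorded as "Region:Reason" strings rather than RequestedRegion values.

```csharp
using System;
using System.Collections.Generic;

public static class DeferredRemoveSketch
{
    // Key name as quoted in the PR description.
    public const string Key = "__CosmosDB_HedgingDetection_NextDispatchReason";

    // Returns true when an entry was appended to 'requested'.
    public static bool TryAppend(
        IDictionary<string, object> properties,
        Func<string> resolveRegion,
        List<string> requested)
    {
        // Step 1: read the tag without removing it; default to "Initial" when absent.
        bool tagged = properties.TryGetValue(Key, out object value);
        string reason = tagged && value != null ? value.ToString() : "Initial";

        // Step 2: resolve the region; on failure bail out and leave the tag in place.
        string region = resolveRegion();
        if (string.IsNullOrEmpty(region))
        {
            return false;
        }

        // Step 3: record the entry, and only then consume the tag.
        requested.Add($"{region}:{reason}");
        if (tagged)
        {
            properties.Remove(Key);
        }

        return true;
    }
}
```

The ordering matters only on the failure path: when resolution fails, a later attempt on the same properties still sees the intended reason.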
…flag

Addresses F7 review feedback on PR #5868. The public CosmosDiagnostics.HedgingStarted() virtual is documented as O(1) and safe to call on the diagnostics hot path, but the previous implementation acquired Monitor.Enter on regionLock for every read of a monotonic-true flag (set once inside AppendRequested and never reset). That added avoidable contention with the writer-side AppendRequested / AppendResponded callers on the same lock.

Marks the hedgingStarted backing field volatile and reads it directly without acquiring regionLock. The write continues to happen under regionLock so it is ordered with the requestedRegions list mutation that triggered it; volatile gives readers an acquire-fence so the flip cannot be reordered before the list Add that established it. Memory-model behavior is preserved (no torn reads on bool; monotonic-true once flipped) without the Monitor.Enter / Exit pair on every CosmosDiagnostics.HedgingStarted() call.

Adds two unit tests in HedgingDetectionStateTests: HedgingStartedBackingField_IsDeclaredVolatile uses reflection on FieldInfo.GetRequiredCustomModifiers() to assert IsVolatile is set so an accidental drop of the modifier during a future refactor is caught at the test layer rather than at production runtime, and HedgingStarted_LockFreeRead_ObservesWriterFlip functionally exercises three concurrent readers spinning on the getter while a writer flips the flag, asserting all readers observe true within a 5-second deadline.
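
The locking arrangement described in this commit message can be sketched as below. `HedgeStateSketch` is a hypothetical class, not the PR's HedgingDetectionState: it keeps only one list of strings, but it shows the same split between lock-protected writes and snapshots on one side and a lock-free read of a volatile, monotonic flag on the other.

```csharp
using System.Collections.Generic;

public sealed class HedgeStateSketch
{
    private readonly object regionLock = new object();
    private readonly List<string> requested = new List<string>();

    // volatile: readers may observe the flag without taking regionLock.
    // The flag only ever goes from false to true.
    private volatile bool hedgingStarted;

    // Lock-free read; no Monitor.Enter / Exit on the hot path.
    public bool HedgingStarted => this.hedgingStarted;

    public void AppendRequested(string region, bool isHedge)
    {
        lock (this.regionLock)
        {
            this.requested.Add(region);
            if (isHedge)
            {
                // Written inside the lock, after the Add, so the list entry exists
                // before any reader can see the flag as true.
                this.hedgingStarted = true;
            }
        }
    }

    // Returns a copy so callers never observe later mutations.
    public IReadOnlyList<string> Snapshot()
    {
        lock (this.regionLock)
        {
            return this.requested.ToArray();
        }
    }
}
```

Keeping the write inside the lock is what preserves the invariant that a true flag implies a visible hedge entry in a subsequent snapshot.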
@kundadebdatta kundadebdatta changed the title from "Diagnostics: Adds hedging detection API (HedgingStarted / GetRequestedRegions / GetRespondedRegions)" to "Diagnostics: Adds Hedging Detection API (HedgingStarted, GetRequestedRegions and GetRespondedRegions)" May 29, 2026
@kundadebdatta kundadebdatta added the Diagnostics (Issues around diagnostics and troubleshooting) and Hedging (Any issue/feature request related to request hedging) labels May 29, 2026
kundadebdatta and others added 3 commits May 29, 2026 17:15
Addresses follow-up F7 review feedback on PR #5868. The volatile keyword
on hedgingStarted is REQUIRED on the reader side (acquire fence for the
lock-free HedgingStarted getter) but is REDUNDANT on the writer side
(the regionLock release already publishes the store). Without an
explicit comment, a future contributor could 'optimize' the writer-side
lock away and break the diagnostics invariant that HedgingStarted ==
true implies at least one Hedging entry is visible in
GetRequestedRegionsSnapshot().

- Rewrites the field-level comment to spell out reader-side vs.
  writer-side semantics and the consequence of moving the write
  outside regionLock.
- Adds a short inline comment at the write site pointing back to the
  field comment.

No behavioral change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…retries

Addresses follow-up F3 review feedback on PR #5868.

The previous F3 fix in ClientRetryPolicy.OnBeforeSendRequest added a
preservation guard that skips overwriting Properties[DispatchReasonPropertyKey]
when the existing value is already Hedging. However, the guard was effectively
dead code in production because TransportHandler.AppendDispatchedRegion removed
the key (under the F5 deferred-Remove fix) BEFORE the retry-driven re-entry of
OnBeforeSendRequest could observe it. Since RequestMessage.Properties and the
cached DocumentServiceRequest.Properties are the same dictionary reference,
the Remove drained the source too.

Fix: AppendDispatchedRegion now skips the Remove when the dispatch reason is
Hedging. The key stays in Properties so the next physical retry's
OnBeforeSendRequest can observe it, which lets the existing CRP preservation
guard correctly classify the retry as Hedging instead of overwriting it with
RegionFailover/OperationRetry.

Tests:
- AppendDispatchedRegion_HedgingReason_LeavesPropertyForRetry: unit test that
  the property remains after AppendDispatchedRegion when reason is Hedging.
- AppendDispatchedRegion_NonHedgingReason_RemovesPropertyAfterConsume: inverse
  unit test confirming the key is still removed for non-Hedging reasons.
- HedgeArmRetry_ProductionOrder_RecordsBothPhysicalAttemptsAsHedging: drives
  the production sequence (OnBefore -> AppendDispatched -> 410 -> ShouldRetry
  -> OnBefore -> AppendDispatched) and asserts both physical attempts are
  recorded as Hedging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
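
The interaction between the retry-policy guard and the transport-side retention can be modeled in a few lines. This is a simplified sketch of the sequence in the commit message above, not the SDK's code: `HedgeRetrySketch` and `SketchHedgeReason` are hypothetical names, and one instance models the property bag of a single request (in the SDK each hedge arm gets its own cloned Properties).

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for RequestedRegionReason.
public enum SketchHedgeReason { Initial, OperationRetry, RegionFailover, Hedging }

public sealed class HedgeRetrySketch
{
    // Key name as quoted in the PR description.
    public const string Key = "__CosmosDB_HedgingDetection_NextDispatchReason";

    private readonly Dictionary<string, object> properties = new Dictionary<string, object>();

    public List<SketchHedgeReason> Recorded { get; } = new List<SketchHedgeReason>();

    // Hedging strategy: tag a hedge arm before its first dispatch.
    public void TagHedgeArm() => this.properties[Key] = SketchHedgeReason.Hedging;

    // Retry policy: tag the retry unless the request is already marked as a hedge arm.
    public void OnBeforeRetry(SketchHedgeReason retryReason)
    {
        if (this.properties.TryGetValue(Key, out object current)
            && current is SketchHedgeReason existing
            && existing == SketchHedgeReason.Hedging)
        {
            return; // preservation guard: keep the hedge origin
        }

        this.properties[Key] = retryReason;
    }

    // Transport dispatch: record the reason, then remove the tag unless it is Hedging,
    // so that a retry of the same hedge arm can still observe it.
    public void Dispatch()
    {
        SketchHedgeReason reason = SketchHedgeReason.Initial;
        if (this.properties.TryGetValue(Key, out object value) && value is SketchHedgeReason tagged)
        {
            reason = tagged;
        }

        this.Recorded.Add(reason);
        if (reason != SketchHedgeReason.Hedging)
        {
            this.properties.Remove(Key);
        }
    }
}
```

Without the retention in Dispatch, the guard in OnBeforeRetry would never see a Hedging value on re-entry, which is the dead-code problem this commit describes.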

Labels

Diagnostics: Issues around diagnostics and troubleshooting
Hedging: Any issue/feature request related to request hedging

Development

Successfully merging this pull request may close these issues.

[Feature] Hedging Detection API — public accessors on CosmosDiagnostics

3 participants