Skip to content

Routing: Adds Gateway Account Property Flag for Dynamic Hedging Control#5829

Open
NaluTripician wants to merge 11 commits into
mainfrom
users/ntripician/ppaf-gateway-hedging-override
Open

Routing: Adds Gateway Account Property Flag for Dynamic Hedging Control#5829
NaluTripician wants to merge 11 commits into
mainfrom
users/ntripician/ppaf-gateway-hedging-override

Conversation

@NaluTripician
Copy link
Copy Markdown
Contributor

@NaluTripician NaluTripician commented May 5, 2026

Description

Adds a Gateway-controlled disableCrossRegionalHedging account property that lets on-call engineers dynamically disable PPAF cross-region hedging without rolling back PPAF entirely.

When the flag is true, the SDK clears the AvailabilityStrategy on the next account-properties refresh; toggling it back to false restores the customer-configured strategy or rebuilds the SDK default.

Type of change

  • Bug fix / mitigation lever (non-breaking; adds an operational escape hatch)
  • New feature
  • Breaking change

Spec

Changes

  • AccountProperties: new internal DisableCrossRegionalHedging (nullable bool) bound to JSON key disableCrossRegionalHedging.
  • GlobalEndpointManager: OnEnablePartitionLevelFailoverConfigChanged is now Action<bool, bool>; tracks lastKnownDisableCrossRegionalHedging so refresh fires on either flag change.
  • DocumentClient:
    • Captures the initial gateway flag during GetInitializationTaskAsync and stashes any customer-configured AvailabilityStrategy when the flag is true at startup.
    • InitializePartitionLevelFailoverWithDefaultHedging short-circuits when the flag is true.
    • UpdatePartitionLevelFailoverConfigWithAccountRefresh reconciles both PPAF state and the disable flag via ApplyHedgingStrategyForCurrentState.
    • Customer strategies are stashed/restored across toggles; SDK-default strategies are not stashed (rebuilt deterministically from PPAF state).
  • 10 new unit tests in GatewayHedgingOverrideTests.
  • Updates 2 existing serializer contract tests that pin the AccountProperties JSON shape.

Behavior

State Result
flag true, customer strategy present Strategy stashed, AvailabilityStrategy=null
flag true, SDK default present AvailabilityStrategy=null (no stash; deterministic)
Toggle false, strategy stashed Restored from stash
Toggle false, no stash, PPAF on SDK default rebuilt
DisablePartitionLevelFailoverClientLevelOverride=true Gateway flag ignored entirely

Validation

This change is a runtime mitigation lever for a production feature (PPAF cross-region hedging), so correctness across the toggle lifecycle and CI green across the full SDK matrix were both treated as gating signals.

Unit-test coverage (10 new tests in GatewayHedgingOverrideTests)

Each row in the Behavior table above has a dedicated test, plus dedicated coverage for the JSON contract and the init-time path:

JSON contract / forward-compat (AccountProperties)

  • ..._DisableCrossRegionalHedging_True_Deserializes -- flag true round-trips.
  • ..._False_Deserializes -- flag false round-trips.
  • ..._Absent_DefaultsToNull -- missing key deserializes as null (i.e. "no opinion"), preserving existing behavior for older accounts.
  • ..._Unknown_FieldDoesNotBreakDeserialization -- unknown sibling fields don't break deserialization (forward-compat guard for future Gateway additions).

Init-time path (InitializePartitionLevelFailoverWithDefaultHedging)

  • ..._FlagTrue_SkipsApplyingDefaultStrategy -- when the flag is true at startup, the SDK default hedging strategy is not installed.
  • ..._FlagFalse_AppliesDefaultStrategy -- when the flag is false, the SDK default is installed as before (no regression for the normal path).

Refresh-time reconciliation (UpdatePartitionLevelFailoverConfigWithAccountRefresh)

  • UpdateConfig_FlagToggleOn_StashesCustomerStrategyAndClearsAvailabilityStrategy -- flipping the flag to true clears ConnectionPolicy.AvailabilityStrategy and stashes the customer's original reference (verified with AreSame, not just structural equality).
  • UpdateConfig_FlagToggleOff_RestoresStashedCustomerStrategy -- flipping back to false restores the exact same customer strategy instance and clears the stash, validating the full enable -> disable -> re-enable cycle from task 6.6.
  • UpdateConfig_FlagTrue_DoesNotStashSDKDefaultStrategy -- when only the SDK default is present, the flag clears it but does not stash it; toggling back rebuilds a fresh SDK default. This guards the "stash only customer-configured strategies" invariant called out in the design.
  • UpdateConfig_ClientLevelOverrideDisabled_FlagIsIgnored -- when DisablePartitionLevelFailoverClientLevelOverride=true, the Gateway flag is ignored end-to-end and the customer strategy is left untouched (client-level override remains absolute).

The tests use compile-time-safe internal test-only properties (DisableCrossRegionalHedgingForTests, CustomerConfiguredAvailabilityStrategyForTests) instead of reflection, so any future rename of the underlying fields breaks the build rather than silently passing.

Contract tests

The two existing tests that pin the AccountProperties JSON shape (CosmosJsonSerializerUnitTests, CosmosSerializerCoreTests) were updated and re-run, confirming the new optional property is the only change to the wire contract.

Review-feedback hardening (commit 816e334)

A second pass addressed correctness concerns that aren't easy to assert from a unit test but materially raise confidence:

  • Thread safety: introduced hedgingStrategyLock to serialize the (flag, stash, ConnectionPolicy.AvailabilityStrategy) mutation sequence, so concurrent refresh callbacks cannot interleave stash/clear/restore steps.
  • Init-time race: moved the initial flag capture and customer-strategy stash from GetInitializationTaskAsync into InitializeGatewayConfigurationReaderAsync, before the GlobalEndpointManager background-refresh loop starts and before the change-event handler is subscribed -- eliminating the theoretical window where a refresh-driven event could be overwritten by a later init-time assignment.
  • Early-return guard: UpdatePartitionLevelFailoverConfigWithAccountRefresh no-ops when neither PPAF enablement nor the hedging flag changed, protecting future direct callers from spurious AvailabilityStrategy mutation.
  • Shadowing guard: renamed locals (isEnabled -> latestIsEnabled, disableCrossRegionalHedging -> latestDisableCrossRegionalHedging) so a future edit that drops a this. prefix can't silently flip semantics.
  • Loud-fail diagnostic: ApplyHedgingStrategyForCurrentState emits a TraceWarning if it would silently drop a non-default AvailabilityStrategy while a previously-stashed customer strategy is still held, so this edge case is observable in support logs rather than invisible.

CI

All 31 required checks are green on the final commit, including:

  • Microsoft.Azure.Cosmos.Tests (full unit-test run) and Microsoft.Azure.Cosmos.Tests Coverage.
  • Full EmulatorTests matrix: Query/ChangeFeed/ReadFeed/Batch/Client Telemetry, MultiMaster, MultiRegion, ThinClient, Others, and the Flaky lane.
  • Both preview lanes (Preview Tests Release, Preview Flag Release, Preview Encryption EmulatorTests Release).
  • Encryption and Encryption.Custom emulator tests (project-ref and package-ref).
  • Static Analysis, CodeQL, and all Analyze (...) jobs.
  • PerformanceTests Release and CosmosBenchmark (no perf regression in the hot path; the new logic only runs on account-refresh and at init, not per-request).

Live long-running soak (~9.3 hours, 837 K ops, 22 toggle cycles)

In addition to the unit/contract tests and CI matrix, we ran the dedicated PpafHedgingValidator harness end-to-end against a real PPAF-enabled account in our internal test environment. While the harness ran the workload + fault injection, a separate operator script toggled disableCrossRegionalHedging on the account every 15 minutes by alternating UpdateGlobalDatabaseAccount.ps1 -PropertyValue true and RemoveGlobalDatabaseAccountConfigurationOverrides.ps1 -- exactly the on-call mitigation gesture this PR is designed to support.

The 12 h budgeted run was cut short by an OS auto-update at 2026-05-13 03:44 UTC after 09:18:16 of wall time / 837,303 ops / 22 full toggle cycles captured. The artifacts (samples.csv, events.jsonl, requests.jsonl, report.md, timeline.png) cover the full 9.3 h window.

Headline result -- hedging efficacy under live chaos (post-warmup samples, fault windows only)

Gateway flag SDK ConnectionPolicy.AvailabilityStrategy (in-process) Samples Mean p99 Total hedges
disableCrossRegionalHedging = false (hedging enabled) SdkDefault (CrossRegionHedging) 79 706 ms 3,889
disableCrossRegionalHedging = true (hedging disabled) null 93 1,454 ms 65

When the gateway flag flipped to false, the SDK rebuilt the hedging strategy and hedges fired against the faulted North Central US primary -- cutting p99 by ~750 ms (~51%). When the gateway flag flipped to true, the strategy cleared and the ~60x drop in hedge count demonstrates the disable path actually disabled hedging. The 65 residual hedges in the null bucket are in-flight requests from the prior SdkDefault window completing with their pre-flip hedges -- by construction the SDK cannot retroactively cancel those.

Toggle-cadence alignment

After SDK warmup (~20:16 UTC onward), strategy_state -- read directly from in-process ConnectionPolicy.AvailabilityStrategy via reflection -- flipped between Null and SdkDefault on a perfect 15-minute cadence, exactly matching the external Start-Sleep -Seconds 900 toggle script. Post-warmup cycle ratio: 52% Null / 48% SdkDefault (effectively 50/50). The timeline.png flag panel shows the gateway flag and the SDK-observed flag overlay one another across all 36 transitions.

Other signals

  • 0 errors across 837,303 operations.
  • 4,590 hedged requests total; 3,898 of them hedged away from the faulted primary region (i.e. the SDK followed the fault, which is the whole point of PPAF hedging).
  • report.md tabulates 33 of 36 detected transitions as PASS. The 3 strict-comparator misses (No DocumentQuery Linq to SQL evaluation #4 / skip/take? #8 / Preserve some of the way creating a query worked in v2 #26 -- 48 / 47 / 65 hedges) are all false -> true boundary noise from the in-flight tail described above; the strict numerical comparator does not yet treat a sub-poll-interval in-flight tail as a pass, but the underlying SDK behavior is correct.

Validator-side caveats (harness bugs, not product code)

The GatewayPropertyProbe and SdkStateInspector reflective probes regressed mid-run, so per-sample gw_flag / sdk_flag / sdk_lastKnown columns were stuck at 0 for the whole run. Transition timestamps in report.md are therefore synthesized from strategy_state (Null -> flag=true, SdkDefault -> flag=false) by the harness's --rebuild-from-csv path, which is accurate to the 30 s poll boundary. strategy_state itself is a direct in-process read of ConnectionPolicy.AvailabilityStrategy and is the authoritative signal -- and as the table above shows, it tracks the external toggle cadence perfectly and yields a decisive separation in both p99 latency and hedge count between the two flag states.

Adds a Gateway-controlled `disableCrossRegionalHedging` account property that
lets on-call engineers dynamically disable PPAF cross-region hedging without
rolling back PPAF entirely. When the flag is true, the SDK clears the
AvailabilityStrategy on the next account-properties refresh; toggling it back
to false restores the customer-configured strategy or the SDK default.

Changes
- AccountProperties: new internal `DisableCrossRegionalHedging` (nullable bool)
  bound to the JSON key `disableCrossRegionalHedging`.
- GlobalEndpointManager: extends `OnEnablePartitionLevelFailoverConfigChanged`
  to `Action<bool, bool>` and tracks `lastKnownDisableCrossRegionalHedging`
  so the event fires when either flag changes.
- DocumentClient:
  - Captures the initial gateway flag during `GetInitializationTaskAsync` and
    stashes any customer-configured `AvailabilityStrategy` if the flag is
    true at startup.
  - `InitializePartitionLevelFailoverWithDefaultHedging` short-circuits when
    the flag is true.
  - `UpdatePartitionLevelFailoverConfigWithAccountRefresh` now reconciles
    both PPAF state and the disable flag via `ApplyHedgingStrategyForCurrentState`,
    which restores customer strategies when the flag flips back off and skips
    stashing SDK-default strategies (regenerated deterministically).
- Tests: 10 new unit tests covering deserialization, init-time skip, customer-
  strategy stash/restore, SDK-default rebuild, and client-level override.
- Updates 2 existing serializer contract tests that pin the `AccountProperties`
  JSON shape.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
Successfully started running 1 pipeline(s).

Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/Routing/GlobalEndpointManager.cs
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs
Comment thread Microsoft.Azure.Cosmos/src/Routing/GlobalEndpointManager.cs
Comment thread .gitignore
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs Outdated
…back

Addresses review comments on PR #5829:
- #3 Renames shadowing parameters in UpdatePartitionLevelFailoverConfigWithAccountRefresh
  (isEnabled -> latestIsEnabled, disableCrossRegionalHedging -> latestDisableCrossRegionalHedging)
  so a future edit dropping a 'this.' prefix cannot silently change semantics.
- #6 Adds early-return guard when neither PPAF enablement nor the hedging flag changed,
  protecting future direct callers from spurious AvailabilityStrategy mutation.
- #5 Introduces hedgingStrategyLock and serializes the (flag, stash, ConnectionPolicy.AvailabilityStrategy)
  mutation sequence in UpdatePartitionLevelFailoverConfigWithAccountRefresh and ApplyHedgingStrategyForCurrentState,
  so concurrent invocations cannot interleave stash/clear/restore steps.
- #11 Emits a TraceWarning when ApplyHedgingStrategyForCurrentState would silently drop a non-default
  AvailabilityStrategy because a previously-stashed customer strategy is still held.
- #8 Replaces reflection-based field access in GatewayHedgingOverrideTests with internal test-only
  properties (DisableCrossRegionalHedgingForTests, CustomerConfiguredAvailabilityStrategyForTests),
  so renames are caught at compile time.
- #2 / #12 Moves the init-time flag capture and customer-strategy stash from GetInitializationTaskAsync
  into InitializeGatewayConfigurationReaderAsync, before the GEM background-refresh loop is started
  and before the change-event handler is subscribed. This eliminates the theoretical race where a
  refresh-driven event could be overwritten by a subsequent init-time field assignment, and removes
  the inconsistent null-check style.
- #10 Removes unrelated .coding-harness/ entry from .gitignore.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread Microsoft.Azure.Cosmos/src/Resource/Settings/AccountProperties.cs
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs Outdated
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs
@kundadebdatta
Copy link
Copy Markdown
Member

Please add changelog entry as well.

Copy link
Copy Markdown
Member

@kundadebdatta kundadebdatta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting on author.

…iew feedback

Addresses the 2026-05-14 review comments from @kundadebdatta on PR #5829:

- AccountProperties.DisableCrossRegionalHedging: adds a TODO noting that the
  JSON property name should move to Constants.Properties once the underlying
  constant ships in the Microsoft.Azure.Cosmos.Direct package. Also drops the
  stale `only honored for accounts with PPAF enabled'' phrase from the doc
  remarks, matching the design clarification that the flag applies to both
  PPAF and non-PPAF accounts.

- DocumentClient.hedgingStrategyLock: expands the field-level comment to call
  out that the lock is NOT load-bearing under current callers (GEM serializes
  refreshes; init runs before subscription) and exists specifically for the
  internal test accessors and as future-proofing for new direct callers.

- UpdatePartitionLevelFailoverConfigWithAccountRefresh: expands the comments
  on the no-change early-return guard and on the unconditional
  ApplyHedgingStrategyForCurrentState() call to make their rationale explicit
  (guard exists for direct callers / unit tests; reconcile is intentionally
  shared between the ppafEnablementChanged and hedgingFlagChanged paths
  because PPAF toggles also need to install or drop the SDK default).

- UpdatePartitionLevelFailoverConfigWithAccountRefresh: adds a matching
  `Successfully reconciled hedging strategy dynamically'' trace for
  hedging-flag-only transitions, and moves the PPAF `Successfully updated
  PPAF configuration dynamically'' trace inside the ppafEnablementChanged
  branch so each transition logs exactly one terminal trace.

- DisableCrossRegionalHedgingForTests / CustomerConfiguredAvailabilityStrategyForTests:
  expands the comment to explain why these accessors exist instead of
  mocking AccountProperties (no current mocking seam past
  GatewayAccountReader.InitializeReaderAsync, which performs real I/O) and
  notes that they should be removed once such a seam is introduced.

- changelog.md: adds an [Unreleased] section with the PR #5829 entry under
  Added.

No behavioral changes -- comments, traces, and the changelog entry only.
GatewayHedgingOverrideTests (10 tests) continue to pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician
Copy link
Copy Markdown
Contributor Author

Addressed in d9041ca. Added an [Unreleased] section at the top of changelog.md (matching the Keep a Changelog format the file's header references) with the PR #5829 entry under Added:

### <a name="unreleased"/> [Unreleased]

#### Added

- [5829](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5829) Routing: Adds Gateway Account Property Flag for Dynamic Hedging Control. Introduces a Gateway-controlled `disableCrossRegionalHedging` account property that lets operators dynamically disable PPAF cross-region hedging (and any customer-configured `AvailabilityStrategy`) without rolling back PPAF entirely. The SDK reconciles `ConnectionPolicy.AvailabilityStrategy` against the Gateway-supplied flag on each account-properties refresh; toggling the flag back to `false` restores the customer-configured strategy or rebuilds the SDK default.

The release author can rename Unreleased to the actual version (e.g., 3.60.0-preview.1 or 3.60.0) at release time. Let me know if you'd prefer the entry be slotted directly into a specific existing version section instead.

…-gateway-hedging-override

# Conflicts:
#	changelog.md
Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @NaluTripician — thanks for the careful hardening work in commit 816e334fc and the second round in d9041cae, the lock discipline + init-time relocation closed off the most concerning windows, and the changelog entry at HEAD is great. A few collaborative follow-ups from a fresh-eyes pass on commit 95938a5c2. None of these duplicate Debdatta's open N1/N2/N3 (replies on those are tracked separately).


🟡 M1 — Per-request RequestOptions.AvailabilityStrategy may bypass the new operator override

Could you please help me understand the scope of this PR vs the unchecked tasks 4.1 (RequestInvokerHandler enforcement) and 6.3 (per-request test) in openspec/changes/ppaf-dynamic-hedging-control/tasks.md?

Microsoft.Azure.Cosmos/src/Handler/RequestInvokerHandler.cs:132 resolves the per-request strategy as:

AvailabilityStrategy strategy = request.RequestOptions?.AvailabilityStrategy
    ?? this.client.DocumentClient.ConnectionPolicy.AvailabilityStrategy;

So a customer that sets AvailabilityStrategy on RequestOptions per-call would keep speculating cross-region after the gateway flag is true — which seems to contradict the operator-mitigation intent.

Two options to consider, either is fine — mostly want the scope to be unambiguous before merge:

(a) Implement task 4.1 — add an internal DocumentClient.IsHedgingDisabledByGateway accessor (read under hedgingStrategyLock) and have RequestInvokerHandler.AvailabilityStrategy(...) return null when set; or
(b) Explicitly downscope — strike the per-request scenarios from spec.md, mark task 4.1 / 6.3 as out of scope in tasks.md, and note in the PR description that per-request enforcement is a follow-up.

Happy to defer to (b) if that's the agreed scope.


🟡 M4 — Java parity gap (informational; happy to be told there's already a tracking issue)

I am curious whether there is a Java tracking issue for this — could you please help me understand whether Java is intentionally lagging on the gateway-driven knob?

I checked Azure/azure-sdk-for-java main and found:

$ grep -rn --include="*.java" "disableCrossRegionalHedging" \
    sdk/cosmos/azure-cosmos/src sdk/cosmos/azure-cosmos-tests/src
(no matches)

$ grep -n "IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF" \
    sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java
333:    private static final String DEFAULT_IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF = "true";
334:    private static final String IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF = "COSMOS.IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF";
335:    private static final String IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF_VARIABLE = "COSMOS_IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF";

Java currently has the related-but-distinct client-side JVM lever COSMOS.IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF, but no server-driven override. Without one, an operator flipping the gateway flag will get inconsistent behavior across .NET vs Java clients on the same account — which is exactly what this knob seems designed to avoid. Java already has the GEM perPartitionAutomaticFailoverConfigModifier callback plumbing (GlobalEndpointManager.java:60), so mirroring this would be additive rather than new infrastructure. Python has no SDK-default PPAF hedging at all (default availability_strategy=False), so the gap there is largely cosmetic and probably worth a single note in spec.md calling Python out of scope.


Thanks again — happy to chat through any of these if helpful.

Comment thread Microsoft.Azure.Cosmos/src/Routing/GlobalEndpointManager.cs
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs
Round 2 review feedback (kushagraThapar, kundadebdatta) on PR #5829:

* RequestInvokerHandler.AvailabilityStrategy now short-circuits to null when
  the gateway flag is true, giving the operator override absolute precedence
  over per-request and client-level AvailabilityStrategy (K-1, K-5).
* DocumentClient gains an internal IsHedgingDisabledByGateway accessor that
  reads the cached flag under hedgingStrategyLock.
* GlobalEndpointManager.RefreshDatabaseAccountInternalAsync now guards the
  hedging change-detection with HasValue (mirrors the existing PPAF guard).
  An absent property is treated as "no signal" rather than implicit false,
  so a transient property-drop response cannot silently re-enable hedging
  during the exact window the operator most wants it disabled. The cached
  lastKnownDisableCrossRegionalHedging is preserved across null reads (K-2).
* Specs/design/proposal updated: the flag now applies to both PPAF and
  non-PPAF accounts; the only documented bypass is a customer-configured
  client-level AvailabilityStrategy (K-3). Added Cross-SDK Parity section
  pointing at the Java tracking issue.
* Tests:
  - GatewayHedgingOverrideTests: RequestInvokerHandler precedence
    (flag-true returns null even with per-request strategy; flag-false
    honors the per-request strategy verbatim).
  - GlobalEndpointManagerTest: property-drop-after-true does NOT fire the
    change event; explicit-false-after-true DOES fire it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician
Copy link
Copy Markdown
Contributor Author

Thanks @kushagraThapar — addressed both M1 and M4 in 6433c5afbc67c0ee31772737c760c604a3443ec2.

Re: M1 (Per-request RequestOptions.AvailabilityStrategy bypasses the operator override)

Going with option (a). The shipped spec.md (Requirement: Gateway flag disables all hedging when trueScenario: PPAF account with request-level hedging override and flag set to true) already requires this to be blocked, and tasks 4.1 / 6.3 were exactly the unchecked items. Done in three small pieces:

  1. DocumentClient — added internal bool IsHedgingDisabledByGateway that reads this.disableCrossRegionalHedging under hedgingStrategyLock (read-only, atomic).
  2. RequestInvokerHandler.AvailabilityStrategy(RequestMessage) — short-circuits to null when IsHedgingDisabledByGateway is true, before resolving any per-request or client-level strategy. Operator override now has absolute precedence regardless of where the strategy was configured.
  3. GatewayHedgingOverrideTests — added RequestInvokerHandler_FlagTrue_PerRequestStrategy_ReturnsNull covering task 6.3 (and the symmetric RequestInvokerHandler_FlagFalse_HonorsPerRequestStrategy for regression protection).

Tasks 4.1 and 6.3 ticked off in tasks.md.

Re: M4 (Java parity)

Status:

  • Java: confirmed the grep — disableCrossRegionalHedging is not yet in azure-sdk-for-java. Filed Cosmos: Add gateway-driven disableCrossRegionalHedging handling (mirror .NET PR #5829) azure-sdk-for-java#49192 (mirror the .NET PR; reuse the existing perPartitionAutomaticFailoverConfigModifier callback plumbing in GlobalEndpointManager.java). Acceptance criteria explicitly include the HasValue guard (M2 in Java too) and the per-request precedence guarantee (M1 in Java too).
  • Python: agreed it's cosmetic — no SDK-default PPAF hedging, so the per-request override path that motivates this knob doesn't exist there.

Added a ## Cross-SDK Parity section to proposal.md recording this status: .NET implements the gateway-driven knob; Java has a tracking issue and is expected to mirror; Python is explicitly out of scope. Keeps the in-tree design record honest about what's shipping where and avoids the inconsistent-behavior-across-clients trap you flagged.

NaluTripician and others added 2 commits May 18, 2026 11:42
…labilityStrategy

The cref RequestInvokerHandler.AvailabilityStrategy in DocumentClient.cs could
not be resolved because (1) RequestInvokerHandler lives in the
Microsoft.Azure.Cosmos.Handlers namespace and is not imported by DocumentClient,
and (2) the simple member name 'AvailabilityStrategy' is ambiguous with the
public AvailabilityStrategy class. This produced error CS1574 and broke every
downstream build job (21 failing CI checks).

Qualify the cref with the Handlers namespace and the (RequestMessage) parameter
signature so the doc-comment resolver picks the method overload unambiguously.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs
Comment thread Microsoft.Azure.Cosmos/src/DocumentClient.cs Outdated
NaluTripician and others added 2 commits May 19, 2026 11:15
…ree reads

Addresses two unresolved review threads from @kundadebdatta on PR #5829:

1. IsHedgingDisabledByGateway was taking a monitor lock on every request
   through RequestInvokerHandler — contradicting its own "atomic kill-switch"
   contract and adding uncontended-but-non-zero cost to the SDK hot path
   (CI benchmarks would not surface this since they run steady-state).

2. InitializePartitionLevelFailoverWithDefaultHedging at line 6963 read
   disableCrossRegionalHedging without any synchronization, even though
   it executes after the GEM background-refresh loop is live — a narrow
   but real window where a refresh-driven write could be missed for lack
   of an acquire barrier.

Both findings collapse to the same fix:

- Declare disableCrossRegionalHedging as `private volatile bool` so reads
  have acquire semantics and writes have release semantics. The
  hedgingStrategyLock continues to serialize the multi-field mutation
  sequence (flag + stashed customer strategy + ConnectionPolicy.AvailabilityStrategy)
  on the refresh / init-stash paths.
- Drop the lock from IsHedgingDisabledByGateway; collapse it to an
  expression-body property returning the volatile field directly.
- Drop the lock from the DisableCrossRegionalHedgingForTests getter for
  consistency; keep it on the setter so test writes happens-before any
  subsequent reconcile-path mutation observes them.
- Rewrite the field-level and hedgingStrategyLock comments to spell out
  the volatile contract explicitly, so future readers do not reintroduce
  a lock around the hot read.

Adds InitializePartitionLevelFailoverWithDefaultHedging_AfterRefreshDrivenFlagFlip_SkipsApplyingDefaultStrategy
to GatewayHedgingOverrideTests, which exercises the exact race scenario
the reviewer called out: a refresh-driven UpdatePartitionLevelFailoverConfigWithAccountRefresh(true, true)
followed by the init-path call to InitializePartitionLevelFailoverWithDefaultHedging,
asserting the SDK default is never installed. Complements the existing
synthetic-setter coverage in InitializePartitionLevelFailoverWithDefaultHedging_FlagTrue_SkipsApplyingDefaultStrategy.

Validation: Microsoft.Azure.Cosmos.Tests Release run — 2750 passed, 0
new failures. All 13 GatewayHedgingOverrideTests pass (12 prior + 1 new).
The single failure on this leg (CosmosClientTests.InvalidKey_ExceptionFullStacktrace)
is a pre-existing environment-dependent failure unrelated to this change,
verified to reproduce on the pre-edit PR head.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-gateway-hedging-override

# Conflicts:
#	changelog.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants