Routing: Adds Gateway Account Property Flag for Dynamic Hedging Control#5829
Routing: Adds Gateway Account Property Flag for Dynamic Hedging Control#5829NaluTripician wants to merge 11 commits into
Conversation
Adds a Gateway-controlled `disableCrossRegionalHedging` account property that
lets on-call engineers dynamically disable PPAF cross-region hedging without
rolling back PPAF entirely. When the flag is true, the SDK clears the
AvailabilityStrategy on the next account-properties refresh; toggling it back
to false restores the customer-configured strategy or the SDK default.
Changes
- AccountProperties: new internal `DisableCrossRegionalHedging` (nullable bool)
bound to the JSON key `disableCrossRegionalHedging`.
- GlobalEndpointManager: extends `OnEnablePartitionLevelFailoverConfigChanged`
to `Action<bool, bool>` and tracks `lastKnownDisableCrossRegionalHedging`
so the event fires when either flag changes.
- DocumentClient:
- Captures the initial gateway flag during `GetInitializationTaskAsync` and
stashes any customer-configured `AvailabilityStrategy` if the flag is
true at startup.
- `InitializePartitionLevelFailoverWithDefaultHedging` short-circuits when
the flag is true.
- `UpdatePartitionLevelFailoverConfigWithAccountRefresh` now reconciles
both PPAF state and the disable flag via `ApplyHedgingStrategyForCurrentState`,
which restores customer strategies when the flag flips back off and skips
stashing SDK-default strategies (regenerated deterministically).
- Tests: 10 new unit tests covering deserialization, init-time skip, customer-
strategy stash/restore, SDK-default rebuild, and client-level override.
- Updates 2 existing serializer contract tests that pin the `AccountProperties`
JSON shape.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
…back Addresses review comments on PR #5829: - #3 Renames shadowing parameters in UpdatePartitionLevelFailoverConfigWithAccountRefresh (isEnabled -> latestIsEnabled, disableCrossRegionalHedging -> latestDisableCrossRegionalHedging) so a future edit dropping a 'this.' prefix cannot silently change semantics. - #6 Adds early-return guard when neither PPAF enablement nor the hedging flag changed, protecting future direct callers from spurious AvailabilityStrategy mutation. - #5 Introduces hedgingStrategyLock and serializes the (flag, stash, ConnectionPolicy.AvailabilityStrategy) mutation sequence in UpdatePartitionLevelFailoverConfigWithAccountRefresh and ApplyHedgingStrategyForCurrentState, so concurrent invocations cannot interleave stash/clear/restore steps. - #11 Emits a TraceWarning when ApplyHedgingStrategyForCurrentState would silently drop a non-default AvailabilityStrategy because a previously-stashed customer strategy is still held. - #8 Replaces reflection-based field access in GatewayHedgingOverrideTests with internal test-only properties (DisableCrossRegionalHedgingForTests, CustomerConfiguredAvailabilityStrategyForTests), so renames are caught at compile time. - #2 / #12 Moves the init-time flag capture and customer-strategy stash from GetInitializationTaskAsync into InitializeGatewayConfigurationReaderAsync, before the GEM background-refresh loop is started and before the change-event handler is subscribed. This eliminates the theoretical race where a refresh-driven event could be overwritten by a subsequent init-time field assignment, and removes the inconsistent null-check style. - #10 Removes unrelated .coding-harness/ entry from .gitignore. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Please add changelog entry as well. |
…iew feedback Addresses the 2026-05-14 review comments from @kundadebdatta on PR #5829: - AccountProperties.DisableCrossRegionalHedging: adds a TODO noting that the JSON property name should move to Constants.Properties once the underlying constant ships in the Microsoft.Azure.Cosmos.Direct package. Also drops the stale `only honored for accounts with PPAF enabled'' phrase from the doc remarks, matching the design clarification that the flag applies to both PPAF and non-PPAF accounts. - DocumentClient.hedgingStrategyLock: expands the field-level comment to call out that the lock is NOT load-bearing under current callers (GEM serializes refreshes; init runs before subscription) and exists specifically for the internal test accessors and as future-proofing for new direct callers. - UpdatePartitionLevelFailoverConfigWithAccountRefresh: expands the comments on the no-change early-return guard and on the unconditional ApplyHedgingStrategyForCurrentState() call to make their rationale explicit (guard exists for direct callers / unit tests; reconcile is intentionally shared between the ppafEnablementChanged and hedgingFlagChanged paths because PPAF toggles also need to install or drop the SDK default). - UpdatePartitionLevelFailoverConfigWithAccountRefresh: adds a matching `Successfully reconciled hedging strategy dynamically'' trace for hedging-flag-only transitions, and moves the PPAF `Successfully updated PPAF configuration dynamically'' trace inside the ppafEnablementChanged branch so each transition logs exactly one terminal trace. - DisableCrossRegionalHedgingForTests / CustomerConfiguredAvailabilityStrategyForTests: expands the comment to explain why these accessors exist instead of mocking AccountProperties (no current mocking seam past GatewayAccountReader.InitializeReaderAsync, which performs real I/O) and notes that they should be removed once such a seam is introduced. - changelog.md: adds an [Unreleased] section with the PR #5829 entry under Added. No behavioral changes -- comments, traces, and the changelog entry only. GatewayHedgingOverrideTests (10 tests) continue to pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Addressed in d9041ca. Added an The release author can rename |
…-gateway-hedging-override # Conflicts: # changelog.md
kushagraThapar
left a comment
There was a problem hiding this comment.
Hi @NaluTripician — thanks for the careful hardening work in commit 816e334fc and the second round in d9041cae, the lock discipline + init-time relocation closed off the most concerning windows, and the changelog entry at HEAD is great. A few collaborative follow-ups from a fresh-eyes pass on commit 95938a5c2. None of these duplicate Debdatta's open N1/N2/N3 (replies on those are tracked separately).
🟡 M1 — Per-request RequestOptions.AvailabilityStrategy may bypass the new operator override
Could you please help me understand the scope of this PR vs the unchecked tasks 4.1 (RequestInvokerHandler enforcement) and 6.3 (per-request test) in openspec/changes/ppaf-dynamic-hedging-control/tasks.md?
Microsoft.Azure.Cosmos/src/Handler/RequestInvokerHandler.cs:132 resolves the per-request strategy as:
AvailabilityStrategy strategy = request.RequestOptions?.AvailabilityStrategy
?? this.client.DocumentClient.ConnectionPolicy.AvailabilityStrategy;So a customer that sets AvailabilityStrategy on RequestOptions per-call would keep speculating cross-region after the gateway flag is true — which seems to contradict the operator-mitigation intent.
Two options to consider, either is fine — mostly want the scope to be unambiguous before merge:
(a) Implement task 4.1 — add an internal DocumentClient.IsHedgingDisabledByGateway accessor (read under hedgingStrategyLock) and have RequestInvokerHandler.AvailabilityStrategy(...) return null when set; or
(b) Explicitly downscope — strike the per-request scenarios from spec.md, mark task 4.1 / 6.3 as out of scope in tasks.md, and note in the PR description that per-request enforcement is a follow-up.
Happy to defer to (b) if that's the agreed scope.
🟡 M4 — Java parity gap (informational; happy to be told there's already a tracking issue)
I am curious whether there is a Java tracking issue for this — could you please help me understand whether Java is intentionally lagging on the gateway-driven knob?
I checked Azure/azure-sdk-for-java main and found:
$ grep -rn --include="*.java" "disableCrossRegionalHedging" \
sdk/cosmos/azure-cosmos/src sdk/cosmos/azure-cosmos-tests/src
(no matches)
$ grep -n "IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF" \
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java
333: private static final String DEFAULT_IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF = "true";
334: private static final String IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF = "COSMOS.IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF";
335: private static final String IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF_VARIABLE = "COSMOS_IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF";
Java currently has the related-but-distinct client-side JVM lever COSMOS.IS_READ_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF, but no server-driven override. Without one, an operator flipping the gateway flag will get inconsistent behavior across .NET vs Java clients on the same account — which is exactly what this knob seems designed to avoid. Java already has the GEM perPartitionAutomaticFailoverConfigModifier callback plumbing (GlobalEndpointManager.java:60), so mirroring this would be additive rather than new infrastructure. Python has no SDK-default PPAF hedging at all (default availability_strategy=False), so the gap there is largely cosmetic and probably worth a single note in spec.md calling Python out of scope.
Thanks again — happy to chat through any of these if helpful.
Round 2 review feedback (kushagraThapar, kundadebdatta) on PR #5829: * RequestInvokerHandler.AvailabilityStrategy now short-circuits to null when the gateway flag is true, giving the operator override absolute precedence over per-request and client-level AvailabilityStrategy (K-1, K-5). * DocumentClient gains an internal IsHedgingDisabledByGateway accessor that reads the cached flag under hedgingStrategyLock. * GlobalEndpointManager.RefreshDatabaseAccountInternalAsync now guards the hedging change-detection with HasValue (mirrors the existing PPAF guard). An absent property is treated as "no signal" rather than implicit false, so a transient property-drop response cannot silently re-enable hedging during the exact window the operator most wants it disabled. The cached lastKnownDisableCrossRegionalHedging is preserved across null reads (K-2). * Specs/design/proposal updated: the flag now applies to both PPAF and non-PPAF accounts; the only documented bypass is a customer-configured client-level AvailabilityStrategy (K-3). Added Cross-SDK Parity section pointing at the Java tracking issue. * Tests: - GatewayHedgingOverrideTests: RequestInvokerHandler precedence (flag-true returns null even with per-request strategy; flag-false honors the per-request strategy verbatim). - GlobalEndpointManagerTest: property-drop-after-true does NOT fire the change event; explicit-false-after-true DOES fire it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thanks @kushagraThapar — addressed both M1 and M4 in Re: M1 (Per-request
|
…labilityStrategy The cref RequestInvokerHandler.AvailabilityStrategy in DocumentClient.cs could not be resolved because (1) RequestInvokerHandler lives in the Microsoft.Azure.Cosmos.Handlers namespace and is not imported by DocumentClient, and (2) the simple member name 'AvailabilityStrategy' is ambiguous with the public AvailabilityStrategy class. This produced error CS1574 and broke every downstream build job (21 failing CI checks). Qualify the cref with the Handlers namespace and the (RequestMessage) parameter signature so the doc-comment resolver picks the method overload unambiguously. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ree reads Addresses two unresolved review threads from @kundadebdatta on PR #5829: 1. IsHedgingDisabledByGateway was taking a monitor lock on every request through RequestInvokerHandler — contradicting its own "atomic kill-switch" contract and adding uncontended-but-non-zero cost to the SDK hot path (CI benchmarks would not surface this since they run steady-state). 2. InitializePartitionLevelFailoverWithDefaultHedging at line 6963 read disableCrossRegionalHedging without any synchronization, even though it executes after the GEM background-refresh loop is live — a narrow but real window where a refresh-driven write could be missed for lack of an acquire barrier. Both findings collapse to the same fix: - Declare disableCrossRegionalHedging as `private volatile bool` so reads have acquire semantics and writes have release semantics. The hedgingStrategyLock continues to serialize the multi-field mutation sequence (flag + stashed customer strategy + ConnectionPolicy.AvailabilityStrategy) on the refresh / init-stash paths. - Drop the lock from IsHedgingDisabledByGateway; collapse it to an expression-body property returning the volatile field directly. - Drop the lock from the DisableCrossRegionalHedgingForTests getter for consistency; keep it on the setter so test writes happens-before any subsequent reconcile-path mutation observes them. - Rewrite the field-level and hedgingStrategyLock comments to spell out the volatile contract explicitly, so future readers do not reintroduce a lock around the hot read. Adds InitializePartitionLevelFailoverWithDefaultHedging_AfterRefreshDrivenFlagFlip_SkipsApplyingDefaultStrategy to GatewayHedgingOverrideTests, which exercises the exact race scenario the reviewer called out: a refresh-driven UpdatePartitionLevelFailoverConfigWithAccountRefresh(true, true) followed by the init-path call to InitializePartitionLevelFailoverWithDefaultHedging, asserting the SDK default is never installed. Complements the existing synthetic-setter coverage in InitializePartitionLevelFailoverWithDefaultHedging_FlagTrue_SkipsApplyingDefaultStrategy. Validation: Microsoft.Azure.Cosmos.Tests Release run — 2750 passed, 0 new failures. All 13 GatewayHedgingOverrideTests pass (12 prior + 1 new). The single failure on this leg (CosmosClientTests.InvalidKey_ExceptionFullStacktrace) is a pre-existing environment-dependent failure unrelated to this change, verified to reproduce on the pre-edit PR head. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-gateway-hedging-override # Conflicts: # changelog.md
Description
Adds a Gateway-controlled
disableCrossRegionalHedgingaccount property that lets on-call engineers dynamically disable PPAF cross-region hedging without rolling back PPAF entirely.When the flag is
true, the SDK clears theAvailabilityStrategyon the next account-properties refresh; toggling it back tofalserestores the customer-configured strategy or rebuilds the SDK default.Type of change
Spec
openspec/changes/ppaf-dynamic-hedging-control/(proposal/design/spec/tasks)Changes
AccountProperties: new internalDisableCrossRegionalHedging(nullable bool) bound to JSON keydisableCrossRegionalHedging.GlobalEndpointManager:OnEnablePartitionLevelFailoverConfigChangedis nowAction<bool, bool>; trackslastKnownDisableCrossRegionalHedgingso refresh fires on either flag change.DocumentClient:GetInitializationTaskAsyncand stashes any customer-configuredAvailabilityStrategywhen the flag is true at startup.InitializePartitionLevelFailoverWithDefaultHedgingshort-circuits when the flag is true.UpdatePartitionLevelFailoverConfigWithAccountRefreshreconciles both PPAF state and the disable flag viaApplyHedgingStrategyForCurrentState.GatewayHedgingOverrideTests.AccountPropertiesJSON shape.Behavior
AvailabilityStrategy=nullAvailabilityStrategy=null(no stash; deterministic)DisablePartitionLevelFailoverClientLevelOverride=trueValidation
This change is a runtime mitigation lever for a production feature (PPAF cross-region hedging), so correctness across the toggle lifecycle and CI green across the full SDK matrix were both treated as gating signals.
Unit-test coverage (10 new tests in
GatewayHedgingOverrideTests)Each row in the Behavior table above has a dedicated test, plus dedicated coverage for the JSON contract and the init-time path:
JSON contract / forward-compat (
AccountProperties)..._DisableCrossRegionalHedging_True_Deserializes-- flagtrueround-trips...._False_Deserializes-- flagfalseround-trips...._Absent_DefaultsToNull-- missing key deserializes asnull(i.e. "no opinion"), preserving existing behavior for older accounts...._Unknown_FieldDoesNotBreakDeserialization-- unknown sibling fields don't break deserialization (forward-compat guard for future Gateway additions).Init-time path (
InitializePartitionLevelFailoverWithDefaultHedging)..._FlagTrue_SkipsApplyingDefaultStrategy-- when the flag istrueat startup, the SDK default hedging strategy is not installed...._FlagFalse_AppliesDefaultStrategy-- when the flag isfalse, the SDK default is installed as before (no regression for the normal path).Refresh-time reconciliation (
UpdatePartitionLevelFailoverConfigWithAccountRefresh)UpdateConfig_FlagToggleOn_StashesCustomerStrategyAndClearsAvailabilityStrategy-- flipping the flag totrueclearsConnectionPolicy.AvailabilityStrategyand stashes the customer's original reference (verified withAreSame, not just structural equality).UpdateConfig_FlagToggleOff_RestoresStashedCustomerStrategy-- flipping back tofalserestores the exact same customer strategy instance and clears the stash, validating the full enable -> disable -> re-enable cycle from task 6.6.UpdateConfig_FlagTrue_DoesNotStashSDKDefaultStrategy-- when only the SDK default is present, the flag clears it but does not stash it; toggling back rebuilds a fresh SDK default. This guards the "stash only customer-configured strategies" invariant called out in the design.UpdateConfig_ClientLevelOverrideDisabled_FlagIsIgnored-- whenDisablePartitionLevelFailoverClientLevelOverride=true, the Gateway flag is ignored end-to-end and the customer strategy is left untouched (client-level override remains absolute).The tests use compile-time-safe
internaltest-only properties (DisableCrossRegionalHedgingForTests,CustomerConfiguredAvailabilityStrategyForTests) instead of reflection, so any future rename of the underlying fields breaks the build rather than silently passing.Contract tests
The two existing tests that pin the
AccountPropertiesJSON shape (CosmosJsonSerializerUnitTests,CosmosSerializerCoreTests) were updated and re-run, confirming the new optional property is the only change to the wire contract.Review-feedback hardening (commit
816e334)A second pass addressed correctness concerns that aren't easy to assert from a unit test but materially raise confidence:
hedgingStrategyLockto serialize the(flag, stash, ConnectionPolicy.AvailabilityStrategy)mutation sequence, so concurrent refresh callbacks cannot interleave stash/clear/restore steps.GetInitializationTaskAsyncintoInitializeGatewayConfigurationReaderAsync, before theGlobalEndpointManagerbackground-refresh loop starts and before the change-event handler is subscribed -- eliminating the theoretical window where a refresh-driven event could be overwritten by a later init-time assignment.UpdatePartitionLevelFailoverConfigWithAccountRefreshno-ops when neither PPAF enablement nor the hedging flag changed, protecting future direct callers from spuriousAvailabilityStrategymutation.isEnabled->latestIsEnabled,disableCrossRegionalHedging->latestDisableCrossRegionalHedging) so a future edit that drops athis.prefix can't silently flip semantics.ApplyHedgingStrategyForCurrentStateemits aTraceWarningif it would silently drop a non-defaultAvailabilityStrategywhile a previously-stashed customer strategy is still held, so this edge case is observable in support logs rather than invisible.CI
All 31 required checks are green on the final commit, including:
Microsoft.Azure.Cosmos.Tests(full unit-test run) andMicrosoft.Azure.Cosmos.Tests Coverage.EmulatorTestsmatrix:Query/ChangeFeed/ReadFeed/Batch/Client Telemetry,MultiMaster,MultiRegion,ThinClient,Others, and theFlakylane.Preview Tests Release,Preview Flag Release,Preview Encryption EmulatorTests Release).EncryptionandEncryption.Customemulator tests (project-ref and package-ref).Static Analysis,CodeQL, and allAnalyze (...)jobs.PerformanceTests ReleaseandCosmosBenchmark(no perf regression in the hot path; the new logic only runs on account-refresh and at init, not per-request).Live long-running soak (~9.3 hours, 837 K ops, 22 toggle cycles)
In addition to the unit/contract tests and CI matrix, we ran the dedicated
PpafHedgingValidatorharness end-to-end against a real PPAF-enabled account in our internal test environment. While the harness ran the workload + fault injection, a separate operator script toggleddisableCrossRegionalHedgingon the account every 15 minutes by alternatingUpdateGlobalDatabaseAccount.ps1 -PropertyValue trueandRemoveGlobalDatabaseAccountConfigurationOverrides.ps1-- exactly the on-call mitigation gesture this PR is designed to support.The 12 h budgeted run was cut short by an OS auto-update at
2026-05-13 03:44 UTCafter 09:18:16 of wall time / 837,303 ops / 22 full toggle cycles captured. The artifacts (samples.csv,events.jsonl,requests.jsonl,report.md,timeline.png) cover the full 9.3 h window.Headline result -- hedging efficacy under live chaos (post-warmup samples, fault windows only)
ConnectionPolicy.AvailabilityStrategy(in-process)disableCrossRegionalHedging = false(hedging enabled)SdkDefault(CrossRegionHedging)disableCrossRegionalHedging = true(hedging disabled)nullWhen the gateway flag flipped to
false, the SDK rebuilt the hedging strategy and hedges fired against the faultedNorth Central USprimary -- cutting p99 by ~750 ms (~51%). When the gateway flag flipped totrue, the strategy cleared and the ~60x drop in hedge count demonstrates the disable path actually disabled hedging. The 65 residual hedges in thenullbucket are in-flight requests from the priorSdkDefaultwindow completing with their pre-flip hedges -- by construction the SDK cannot retroactively cancel those.Toggle-cadence alignment
After SDK warmup (~
20:16 UTConward),strategy_state-- read directly from in-processConnectionPolicy.AvailabilityStrategyvia reflection -- flipped betweenNullandSdkDefaulton a perfect 15-minute cadence, exactly matching the externalStart-Sleep -Seconds 900toggle script. Post-warmup cycle ratio: 52%Null/ 48%SdkDefault(effectively 50/50). Thetimeline.pngflag panel shows the gateway flag and the SDK-observed flag overlay one another across all 36 transitions.Other signals
report.mdtabulates 33 of 36 detected transitions as PASS. The 3 strict-comparator misses (No DocumentQuery Linq to SQL evaluation #4 / skip/take? #8 / Preserve some of the way creating a query worked in v2 #26 -- 48 / 47 / 65 hedges) are allfalse -> trueboundary noise from the in-flight tail described above; the strict numerical comparator does not yet treat a sub-poll-interval in-flight tail as a pass, but the underlying SDK behavior is correct.Validator-side caveats (harness bugs, not product code)
The
GatewayPropertyProbeandSdkStateInspectorreflective probes regressed mid-run, so per-samplegw_flag/sdk_flag/sdk_lastKnowncolumns were stuck at0for the whole run. Transition timestamps inreport.mdare therefore synthesized fromstrategy_state(Null-> flag=true,SdkDefault-> flag=false) by the harness's--rebuild-from-csvpath, which is accurate to the 30 s poll boundary.strategy_stateitself is a direct in-process read ofConnectionPolicy.AvailabilityStrategyand is the authoritative signal -- and as the table above shows, it tracks the external toggle cadence perfectly and yields a decisive separation in both p99 latency and hedge count between the two flag states.