@simorenoh (Member) commented Jun 15, 2025

Per-Partition Automatic Failover

This PR adds the ability for the SDK to use per-partition automatic failover (PPAF) as a resiliency mechanism for write requests in single-write, multi-region Cosmos accounts. Big picture, PPAF allows the SDK to reach out to other regions when the main write region becomes unavailable for a given partition: while the partition is having issues in the main write region, the SDK routes write requests targeting that partition to one of the account's available read regions. Once the original endpoint is available again, the service lets the SDK know that failing back is possible (403.3), and the SDK attempts to return to the main write region by cycling through the account's available regions in a round-robin fashion. Neither preferred regions nor excluded regions play a role in PPAF, since the service alone decides where the current write region for a partition is located. It is also worth noting that enabling PPAF also enables per-partition circuit breaker (PPCB).

Unlike PPCB, PPAF is a service feature - that is, it won't work right out the box with just the SDK, but also requires the database account to be configured with the feature. More information on the feature and how to enable it for an account can be found here: https://learn.microsoft.com/azure/cosmos-db/how-to-configure-per-partition-automatic-failover

TL;DR: to use these enhancements, a user will need:

  • An account configured with PPAF as per the linked document above.
  • A single-write, multi-region Cosmos account.
  • More than one region available for their account.

Design and new classes

_GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover

Takes care of the regional routing and partition unavailability tracking when PPAF is enabled by using its own cache of PartitionLevelFailoverInfo objects.
Instance attributes:

  • partition_range_to_failover_info: maps a partition key range to its respective PartitionLevelFailoverInfo object.
  • ppaf_thresholds_tracker: maps a partition key range to the number of consecutive errors of a given type.
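A minimal sketch of this manager's threshold bookkeeping, assuming a configurable consecutive-error threshold (the attribute names follow the description above; `record_error` and `record_success` are hypothetical helpers, not the SDK's API):

```python
import threading
from collections import defaultdict

class GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover:
    """Sketch of the PPAF endpoint manager's two caches: per-partition
    failover info and per-partition consecutive-error counters."""

    DEFAULT_FAILOVER_THRESHOLD = 10

    def __init__(self, threshold=DEFAULT_FAILOVER_THRESHOLD):
        # partition key range -> its PartitionLevelFailoverInfo object
        self.partition_range_to_failover_info = {}
        # (partition key range, error type) -> consecutive error count
        self.ppaf_thresholds_tracker = defaultdict(int)
        self._threshold = threshold
        self._lock = threading.Lock()

    def record_error(self, pk_range, error_type):
        """Count a consecutive error; return True once the threshold is
        reached and a partition-level failover should be attempted."""
        with self._lock:
            self.ppaf_thresholds_tracker[(pk_range, error_type)] += 1
            return self.ppaf_thresholds_tracker[(pk_range, error_type)] >= self._threshold

    def record_success(self, pk_range, error_type):
        """A success resets the consecutive-error counter for this range."""
        with self._lock:
            self.ppaf_thresholds_tracker.pop((pk_range, error_type), None)
```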

PartitionLevelFailoverInfo

Holds the relevant partition-level failover information for a partition. This information is used to route requests to the next available region based on what is known about each region's availability.
Instance attributes:

  • unavailable_regional_endpoints: Set holding the regional endpoints that have been marked as unavailable for this partition.
  • current_regional_endpoint: the current regional endpoint to be used by PPAF requests - stored in order to ensure that requests are properly routed to this region if PPAF is enabled.
  • _lock: To ensure updating logic is thread-safe for a given partition.
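A minimal sketch of this per-partition state, assuming a simple cycle-to-next-endpoint helper (`try_move_to_next_endpoint` is illustrative, not the SDK's actual method):

```python
import threading

class PartitionLevelFailoverInfo:
    """Tracks per-partition failover state: which regional endpoints are
    known-unavailable and which endpoint writes are currently routed to."""

    def __init__(self):
        self.unavailable_regional_endpoints = set()
        self.current_regional_endpoint = None
        self._lock = threading.Lock()  # guards updates for this partition

    def try_move_to_next_endpoint(self, account_endpoints):
        """Mark the current endpoint unavailable and cycle to the next
        available regional endpoint. Returns False when every endpoint
        for the account has been exhausted."""
        with self._lock:
            if self.current_regional_endpoint is not None:
                self.unavailable_regional_endpoints.add(self.current_regional_endpoint)
            for endpoint in account_endpoints:
                if endpoint not in self.unavailable_regional_endpoints:
                    self.current_regional_endpoint = endpoint
                    return True
            return False
```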

Request flow with PPAF

```mermaid
flowchart TD
    A[User sends request] --> B{Is PPAF valid? Do we have regions available, is this a write request, is this a single-write multi-region account}

    B -- No --> C[Use per-partition Circuit Breaker GlobalEndpointManager]

    B -- Yes --> D{Is PK range info from this request cached?}

    D -- No --> E[Create new cache entry]
    E --> F[Use default GlobalEndpointManager to resolve endpoint]

    D -- Yes --> G{Is current request regional endpoint unavailable?}

    G -- No --> H[Update partition info with current request endpoint]
    H --> I[Send request to current endpoint]

    G -- Yes --> J[Cycle through available endpoints]
    J --> K{Found available endpoint?}

    K -- Yes --> L[Update partition info with new endpoint]
    L --> M[Send request to new endpoint]

    K -- No --> N[Reset cache entry]
    N --> O[Use default GlobalEndpointManager to resolve endpoint]
```
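The routing decision in the flowchart can be sketched in Python as follows (all names here are illustrative stand-ins, not the SDK's actual classes or signatures):

```python
class _FailoverInfo:
    """Minimal stand-in for PartitionLevelFailoverInfo (sketch only)."""
    def __init__(self):
        self.unavailable_regional_endpoints = set()
        self.current_regional_endpoint = None

def resolve_write_endpoint(cache, pk_range, request_endpoint, account_endpoints,
                           ppaf_applicable=True, default_endpoint="default"):
    """Follows the flowchart: route a write request for `pk_range` given the
    PPAF cache. All names here are illustrative, not the SDK's API."""
    if not ppaf_applicable:
        return default_endpoint                 # circuit-breaker / default path

    info = cache.get(pk_range)
    if info is None:
        cache[pk_range] = _FailoverInfo()       # create new cache entry
        return default_endpoint                 # default endpoint resolution

    if request_endpoint not in info.unavailable_regional_endpoints:
        info.current_regional_endpoint = request_endpoint
        return request_endpoint                 # current endpoint is healthy

    # Current endpoint is unavailable: cycle through the account's endpoints.
    for endpoint in account_endpoints:
        if endpoint not in info.unavailable_regional_endpoints:
            info.current_regional_endpoint = endpoint
            return endpoint

    cache[pk_range] = _FailoverInfo()           # every region exhausted: reset
    return default_endpoint
```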

Difference in behaviors and status codes

This section covers the behavior for the different status codes relevant to PPAF error handling. Some of these use a threshold-based approach that requires 10 consecutive failures (the default value can be changed) before performing a partition-level failover, while others immediately attempt a partition-level failover.

| Error codes | Behavior |
| --- | --- |
| 403.3, 503 | Direct partition-level failover. This behavior aligns with the .NET SDK; 503s have their own dedicated policy. |
| 408, 500, 502, 504, ServiceResponseErrors | Threshold-based failover logic; we only fail over after 10 consecutive exceptions. |
| 404.1002 | Region failover will use partition-level failover info from PPAF if it is available, otherwise falls back to default behavior. |
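As a rough illustration of the table, a classification helper might look like this (the action names and the `threshold` default of 10 mirror the text above; this is not the SDK's actual retry-policy code):

```python
# Status codes that require consecutive failures before failing over.
THRESHOLD_BASED_CODES = {408, 500, 502, 504}

def ppaf_action(status_code, sub_status=None, consecutive_failures=0, threshold=10):
    """Map a response's status code to the PPAF behavior from the table."""
    if (status_code, sub_status) == (403, 3) or status_code == 503:
        return "immediate_partition_failover"
    if status_code in THRESHOLD_BASED_CODES:
        # Only fail over once enough consecutive failures have accumulated.
        return "partition_failover" if consecutive_failures >= threshold else "count_and_retry"
    if (status_code, sub_status) == (404, 1002):
        return "use_ppaf_info_if_available"
    return "default"
```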

Concerns

Some of the things below have not been tested or done yet, and should be looked at before releasing this feature.

  • Missing work on session token false progress - the session token container fixes PR (#40366) might be needed as a prerequisite, since it fixes much of the session token logic within the SDK.
  • Partition splits can change the range that a given partition key range ID or partition key value maps to, and might therefore cause an issue with the partition info cache if a split happens while the write partition is in the middle of a failover. We should verify partition split behavior, since I'm not 100% certain we refresh the partition key range cache in many situations.
  • Need to verify that the configuration sent from the service for an account with PPAF enabled matches the configuration being used in the code - should be a quick check against a live account with the config.
    • Update: Verified that enablePerPartitionFailoverEnabled is the correct config from the service.
  • Need to verify diagnostics and logging supply PPAF information to our users -> confirmed, since we log the different requests failing over.

Additional work to be done

  • Add README entry for all of this
  • Add ability for PPAF to be enabled dynamically -> this was always enabled by default, since we update the database account cache with every health check and read the cache to verify whether we should enforce the config in the global endpoint manager. Checks run every 5 minutes, so it should take at most 5 minutes for the feature to be enabled dynamically.
  • Add retry mechanisms for 408, 502, 503, 504 exceptions based on consecutive request failures
  • Add retry mechanisms for 404.1002 to utilize partition-level failover region as opposed to main failover region if the info is available
  • Add retry mechanisms for ServiceResponseErrors

Eventually I'd also like to add tests with more regions in order to properly test excluded regions -> ideally, we would configure an account with 3 regions to more effectively verify multi-region behavior and exercise excluded regions in greater depth.

Addresses #39686

@simorenoh simorenoh marked this pull request as ready for review June 16, 2025 13:09
Copilot AI review requested due to automatic review settings June 16, 2025 13:09
@simorenoh simorenoh requested a review from a team as a code owner June 16, 2025 13:09
Copilot AI left a comment

Pull Request Overview

This PR implements per-partition automatic failover for write requests in multi-region Cosmos accounts. The changes add new classes and methods for partition-level failover, update fault injection and retry policies to support the new configuration, and adjust client connection and location cache logic.

  • Extended fault injection logic in tests to support max fault counts and custom responses.
  • Updated endpoint manager and retry policies to integrate per-partition automatic failover.
  • Added new constants and updated client connection logic for the PPAF feature.

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| sdk/cosmos/azure-cosmos/tests/_fault_injection_transport_async.py | Updated fault injection API with new optional parameters for controlling fault counts. |
| sdk/cosmos/azure-cosmos/tests/_fault_injection_transport.py | Mirrored changes in fault injection for the synchronous transport variant. |
| sdk/cosmos/azure-cosmos/azure/cosmos/documents.py | Added a new flag to enable per-partition failover behavior in the database account. |
| sdk/cosmos/azure-cosmos/azure/cosmos/aio/_global_partition_endpoint_manager_per_partition_automatic_failover_async.py | Introduced the asynchronous partition endpoint manager handling per-partition failover logic. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_constants.py | Added new constants for enabling per-partition automatic failover. |
| (Multiple files) | Updated retry policies, request routing, and the circuit breaker endpoint manager to support the new feature. |
Comments suppressed due to low confidence (3)

sdk/cosmos/azure-cosmos/azure/cosmos/_global_partition_endpoint_manager_circuit_breaker_core.py:62

  • The use of chained environment variable lookups for configuration may lead to unexpected behavior if the two settings differ. Consider splitting the checks for per-partition failover and the traditional circuit breaker into separate conditions to improve clarity and maintainability.
circuit_breaker_enabled = os.environ.get(Constants.PER_PARTITION_AUTOMATIC_FAILOVER_ENABLED_CONFIG, os.environ.get(Constants.CIRCUIT_BREAKER_ENABLED_CONFIG, Constants.CIRCUIT_BREAKER_ENABLED_CONFIG_DEFAULT)).lower() == "true"
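For illustration, the suggested split might look like the following sketch, where the config keys are hypothetical stand-ins for the real constants and `resolve_flags` takes an environment mapping (this also encodes the documented rule that enabling PPAF enables the circuit breaker):

```python
# Hypothetical config keys; the real names live in the SDK's Constants class.
PPAF_CONFIG = "PER_PARTITION_AUTOMATIC_FAILOVER_ENABLED"
CIRCUIT_BREAKER_CONFIG = "CIRCUIT_BREAKER_ENABLED"

def resolve_flags(env):
    """Return (ppaf_enabled, circuit_breaker_enabled) from an env mapping,
    with the two settings checked separately for clarity."""
    ppaf_enabled = env.get(PPAF_CONFIG, "false").lower() == "true"
    # PPAF implies the per-partition circuit breaker.
    circuit_breaker_enabled = ppaf_enabled or env.get(CIRCUIT_BREAKER_CONFIG, "false").lower() == "true"
    return ppaf_enabled, circuit_breaker_enabled
```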

sdk/cosmos/azure-cosmos/azure/cosmos/aio/_global_partition_endpoint_manager_per_partition_automatic_failover_async.py:119

  • When no available regional endpoint is found and the failover cache is reset, it would be helpful to add targeted tests (or logging enhancements) to ensure this behavior is correctly handled in production scenarios.
self.partition_range_to_failover_info[pk_range_wrapper] = PartitionLevelFailoverInfo()

sdk/cosmos/azure-cosmos/azure/cosmos/_location_cache.py:476

  • [nitpick] The renaming from 'orderedLocations' to 'ordered_locations' improves readability; please ensure that any related documentation or in-code comments are updated to reflect this consistent naming convention.
for location in ordered_locations:

@tvaron3 (Member) commented Jul 7, 2025

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@github-actions

github-actions bot commented Jul 8, 2025

API Change Check

APIView identified API level changes in this PR and created the following API reviews

azure-cosmos

@simorenoh (Member Author)

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@simorenoh (Member Author)

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@tvaron3 (Member) left a comment

LGTM

@simorenoh (Member Author)

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@simorenoh (Member Author)

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 (Member)

LGTM - thanks for the great work.

@simorenoh simorenoh merged commit 97ce0dd into main Nov 21, 2025
21 checks passed
@simorenoh simorenoh deleted the cosmos-ppaf branch November 21, 2025 17:03
msyyc pushed a commit that referenced this pull request Nov 25, 2025
* sync PPAF

* async changes

* Update test_per_partition_automatic_failover_async.py

* CI fixes

* changelog

* broken link

* Update test_location_cache.py

* change PPAF detection logic

* Update _global_partition_endpoint_manager_circuit_breaker_core.py

* Update _global_partition_endpoint_manager_circuit_breaker_core.py

* fix tests and remove environment variable

* fix tests

* revert excluded locations change

* fix analyze

* test excluded locations

* Add different error handling for 503 and 408s, update README

* mypy, cspell, pylint

* remove tag from tests since config is service based

* add threshold-based retries for 408, 5xx errors

* update constant use, rollback session token PR change

* threshold based retries

* Update _base.py

* cspell, test fixes

* Update _service_unavailable_retry_policy.py

* mypy, pylint

* 503 behavior change, use regional contexts

* mypy, pylint, tests

* special-casing 503s

* small fix

* exclude region tests

* session retry tests

* pylint, cspell

* change errors since 503 is now retried directly

* Update sdk/cosmos/azure-cosmos/README.md

Co-authored-by: Abhijeet Mohanty <[email protected]>

* address comments

update changelog, update docs, add typehints and documentation

* Update _service_unavailable_retry_policy.py

* small test updates for 503 behavior

* further comments

* Update test_per_partition_circuit_breaker_sm_mrr.py

* test fixes

* Update test_excluded_locations.py

* small improvement to region-finding

* pylint

* Update _global_partition_endpoint_manager_per_partition_automatic_failover.py

* address comments, add threshold lock

* add more comments

* edge cases

* changes from testing

* pylint

* fixes pylint/mypy

* mypy complaining about assigning str to none

* testing changes - will roll back later

* Update _endpoint_discovery_retry_policy.py

* Update _asynchronous_request.py

* add user agent feature flags

* Update test_per_partition_automatic_failover_async.py

* move user agent logic

* sync and async match, remove print statements

* leftover timer

* Update _retry_utility.py

* use constants

* pylint

* Update CHANGELOG.md

* react to comments

* Update _retry_utility.py

* mypy pylint

* test fixes

* add lock to failure additions

---------

Co-authored-by: tvaron3 <[email protected]>
Co-authored-by: Abhijeet Mohanty <[email protected]>

5 participants