@simorenoh (Member) commented Jun 15, 2025

Per-Partition Automatic Failover

This PR adds the ability for the SDK to use per-partition automatic failover (PPAF) as a resiliency mechanism for write requests in single-write, multi-region Cosmos accounts. Big picture, PPAF allows the SDK to reach out to other regions when the main write region becomes unavailable for a given partition: while the partition is having issues in the main write region, the SDK routes write requests targeting that partition to one of the account's available read regions. Once the original endpoint is available again, the service lets the SDK know that failing back is possible (403.3), and the SDK attempts to return to the main write region by cycling through the account's available regions in a round-robin fashion. Neither preferred regions nor excluded regions play a role in PPAF, since the service alone decides where the current write region for a partition is located. It is also worth noting that enabling PPAF also enables per-partition circuit breaker (PPCB).

Unlike PPCB, PPAF is a service feature - that is, it won't work right out the box with just the SDK, but also requires the database account to be configured with the feature. More information on the feature and how to enable it for an account can be found here: https://learn.microsoft.com/azure/cosmos-db/how-to-configure-per-partition-automatic-failover

TL;DR: to use these enhancements, a user will need:

  • An account configured with PPAF as per the linked document above.
  • A single-write, multi-region Cosmos account.
  • More than one region available for their account.

Design and new classes

_GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover

Takes care of the regional routing and partition unavailability tracking when PPAF is enabled by using its own cache of PartitionLevelFailoverInfo objects.
Instance attributes:

  • partition_range_to_failover_info: maps a partition key range to its respective PartitionLevelFailoverInfo object.
  • ppaf_thresholds_tracker: maps a partition key range to the number of consecutive errors of a given type.
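A minimal sketch of this manager's threshold bookkeeping, assuming a configurable consecutive-error threshold (the attribute names follow the description above; `record_error` and `record_success` are hypothetical helpers, not the SDK's API):

```python
import threading
from collections import defaultdict

class GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover:
    """Sketch of the PPAF endpoint manager's two caches: per-partition
    failover info and per-partition consecutive-error counters."""

    DEFAULT_FAILOVER_THRESHOLD = 10

    def __init__(self, threshold=DEFAULT_FAILOVER_THRESHOLD):
        # partition key range -> its PartitionLevelFailoverInfo object
        self.partition_range_to_failover_info = {}
        # (partition key range, error type) -> consecutive error count
        self.ppaf_thresholds_tracker = defaultdict(int)
        self._threshold = threshold
        self._lock = threading.Lock()

    def record_error(self, pk_range, error_type):
        """Count a consecutive error; return True once the threshold is
        reached and a partition-level failover should be attempted."""
        with self._lock:
            self.ppaf_thresholds_tracker[(pk_range, error_type)] += 1
            return self.ppaf_thresholds_tracker[(pk_range, error_type)] >= self._threshold

    def record_success(self, pk_range, error_type):
        """A success resets the consecutive-error counter for this range."""
        with self._lock:
            self.ppaf_thresholds_tracker.pop((pk_range, error_type), None)
```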

PartitionLevelFailoverInfo

Holds the relevant partition-level failover information for a partition. This information is used to route requests to the next available region based on what is known about each region's availability.
Instance attributes:

  • unavailable_regional_endpoints: Set holding the regional endpoints that have been marked as unavailable for this partition.
  • current_regional_endpoint: the current regional endpoint to be used by PPAF requests - stored in order to ensure that requests are properly routed to this region if PPAF is enabled.
  • _lock: To ensure updating logic is thread-safe for a given partition.
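A minimal sketch of this per-partition state, assuming a simple cycle-to-next-endpoint helper (`try_move_to_next_endpoint` is illustrative, not the SDK's actual method):

```python
import threading

class PartitionLevelFailoverInfo:
    """Tracks per-partition failover state: which regional endpoints are
    known-unavailable and which endpoint writes are currently routed to."""

    def __init__(self):
        self.unavailable_regional_endpoints = set()
        self.current_regional_endpoint = None
        self._lock = threading.Lock()  # guards updates for this partition

    def try_move_to_next_endpoint(self, account_endpoints):
        """Mark the current endpoint unavailable and cycle to the next
        available regional endpoint. Returns False when every endpoint
        for the account has been exhausted."""
        with self._lock:
            if self.current_regional_endpoint is not None:
                self.unavailable_regional_endpoints.add(self.current_regional_endpoint)
            for endpoint in account_endpoints:
                if endpoint not in self.unavailable_regional_endpoints:
                    self.current_regional_endpoint = endpoint
                    return True
            return False
```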

Request flow with PPAF

```mermaid
flowchart TD
    A[User sends request] --> B{Is PPAF valid? Do we have regions available, is this a write request, is this a single-write multi-region account}

    B -- No --> C[Use per-partition Circuit Breaker GlobalEndpointManager]

    B -- Yes --> D{Is PK range info from this request cached?}

    D -- No --> E[Create new cache entry]
    E --> F[Use default GlobalEndpointManager to resolve endpoint]

    D -- Yes --> G{Is current request regional endpoint unavailable?}

    G -- No --> H[Update partition info with current request endpoint]
    H --> I[Send request to current endpoint]

    G -- Yes --> J[Cycle through available endpoints]
    J --> K{Found available endpoint?}

    K -- Yes --> L[Update partition info with new endpoint]
    L --> M[Send request to new endpoint]

    K -- No --> N[Reset cache entry]
    N --> O[Use default GlobalEndpointManager to resolve endpoint]
```
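The routing decision in the flowchart can be sketched in Python as follows (all names here are illustrative stand-ins, not the SDK's actual classes or signatures):

```python
class _FailoverInfo:
    """Minimal stand-in for PartitionLevelFailoverInfo (sketch only)."""
    def __init__(self):
        self.unavailable_regional_endpoints = set()
        self.current_regional_endpoint = None

def resolve_write_endpoint(cache, pk_range, request_endpoint, account_endpoints,
                           ppaf_applicable=True, default_endpoint="default"):
    """Follows the flowchart: route a write request for `pk_range` given the
    PPAF cache. All names here are illustrative, not the SDK's API."""
    if not ppaf_applicable:
        return default_endpoint                 # circuit-breaker / default path

    info = cache.get(pk_range)
    if info is None:
        cache[pk_range] = _FailoverInfo()       # create new cache entry
        return default_endpoint                 # default endpoint resolution

    if request_endpoint not in info.unavailable_regional_endpoints:
        info.current_regional_endpoint = request_endpoint
        return request_endpoint                 # current endpoint is healthy

    # Current endpoint is unavailable: cycle through the account's endpoints.
    for endpoint in account_endpoints:
        if endpoint not in info.unavailable_regional_endpoints:
            info.current_regional_endpoint = endpoint
            return endpoint

    cache[pk_range] = _FailoverInfo()           # every region exhausted: reset
    return default_endpoint
```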

Difference in behaviors and status codes

This section covers the behavior for the different status codes relevant to PPAF error handling. Some of these use a threshold-based approach that requires 10 consecutive failures (the default value can be changed) before performing a partition-level failover, while others immediately attempt a partition-level failover.

| Error codes | Behavior |
| --- | --- |
| 403.3, 503 | Direct partition-level failover. This behavior aligns with the .NET SDK; 503s have their own dedicated policy. |
| 408, 500, 502, 504, ServiceResponseErrors | Threshold-based failover logic; we only fail over after 10 consecutive exceptions. |
| 404.1002 | Region failover will use partition-level failover info from PPAF if it is available, otherwise falls back to default behavior. |
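As a rough illustration of the table, a classification helper might look like this (the action names and the `threshold` default of 10 mirror the text above; this is not the SDK's actual retry-policy code):

```python
# Status codes that require consecutive failures before failing over.
THRESHOLD_BASED_CODES = {408, 500, 502, 504}

def ppaf_action(status_code, sub_status=None, consecutive_failures=0, threshold=10):
    """Map a response's status code to the PPAF behavior from the table."""
    if (status_code, sub_status) == (403, 3) or status_code == 503:
        return "immediate_partition_failover"
    if status_code in THRESHOLD_BASED_CODES:
        # Only fail over once enough consecutive failures have accumulated.
        return "partition_failover" if consecutive_failures >= threshold else "count_and_retry"
    if (status_code, sub_status) == (404, 1002):
        return "use_ppaf_info_if_available"
    return "default"
```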

Concerns

Some of the things below have not been tested or done yet, and should be looked at before releasing this feature.

  • Missing work on session token false progress - the session token container fixes PR (#40366) might be needed as a prerequisite, since it fixes much of the session token logic within the SDK.
  • Partition splits can change the range that a given partition key range ID or partition key value maps to, and might therefore cause an issue with the partition info cache if a split happens while the write partition is in the middle of a failover. We should verify partition split behavior, since I'm not 100% certain we refresh the partition key range cache in many situations.
  • Need to verify that the configuration sent from the service for an account with PPAF enabled matches the configuration being used in the code - should be a quick check against a live account with the config.
    • Update: Verified that enablePerPartitionFailoverEnabled is the correct config from the service.
  • Need to verify diagnostics and logging supply PPAF information to our users -> confirmed, since we log the different requests failing over.

Additional work to be done

  • Add README entry for all of this
  • Add ability for PPAF to be enabled dynamically -> this was always enabled by default, since we update the database account cache with every health check and read the cache to verify whether we should enforce the config in the global endpoint manager. Checks run every 5 minutes, so it should take at most 5 minutes for the feature to be enabled dynamically.
  • Add retry mechanisms for 408, 502, 503, 504 exceptions based on consecutive request failures
  • Add retry mechanisms for 404.1002 to utilize partition-level failover region as opposed to main failover region if the info is available
  • Add retry mechanisms for ServiceResponseErrors

Eventually I'd also like to add tests with more regions in order to properly test excluded regions -> ideally, we would configure an account with 3 regions to more effectively verify multi-region behavior and exercise excluded regions in greater depth.

Addresses #39686

@simorenoh simorenoh marked this pull request as ready for review June 16, 2025 13:09
Copilot AI review requested due to automatic review settings June 16, 2025 13:09
@simorenoh simorenoh requested a review from a team as a code owner June 16, 2025 13:09
Copilot AI left a comment

Pull Request Overview

This PR implements per-partition automatic failover for write requests in multi-region Cosmos accounts. The changes add new classes and methods for partition-level failover, update fault injection and retry policies to support the new configuration, and adjust client connection and location cache logic.

  • Extended fault injection logic in tests to support max fault counts and custom responses.
  • Updated endpoint manager and retry policies to integrate per-partition automatic failover.
  • Added new constants and updated client connection logic for the PPAF feature.

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| sdk/cosmos/azure-cosmos/tests/_fault_injection_transport_async.py | Updated fault injection API with new optional parameters for controlling fault counts. |
| sdk/cosmos/azure-cosmos/tests/_fault_injection_transport.py | Mirrored changes in fault injection for the synchronous transport variant. |
| sdk/cosmos/azure-cosmos/azure/cosmos/documents.py | Added a new flag to enable per-partition failover behavior in the database account. |
| sdk/cosmos/azure-cosmos/azure/cosmos/aio/_global_partition_endpoint_manager_per_partition_automatic_failover_async.py | Introduced the asynchronous partition endpoint manager handling per-partition failover logic. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_constants.py | Added new constants for enabling per-partition automatic failover. |
| (Multiple files) | Updated retry policies, request routing, and the circuit breaker endpoint manager to support the new feature. |
Comments suppressed due to low confidence (3)

sdk/cosmos/azure-cosmos/azure/cosmos/_global_partition_endpoint_manager_circuit_breaker_core.py:62

  • The use of chained environment variable lookups for configuration may lead to unexpected behavior if the two settings differ. Consider splitting the checks for per-partition failover and the traditional circuit breaker into separate conditions to improve clarity and maintainability.
circuit_breaker_enabled = os.environ.get(Constants.PER_PARTITION_AUTOMATIC_FAILOVER_ENABLED_CONFIG, os.environ.get(Constants.CIRCUIT_BREAKER_ENABLED_CONFIG, Constants.CIRCUIT_BREAKER_ENABLED_CONFIG_DEFAULT)).lower() == "true"
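For illustration, the suggested split might look like the following sketch, where the config keys are hypothetical stand-ins for the real constants and `resolve_flags` takes an environment mapping (this also encodes the documented rule that enabling PPAF enables the circuit breaker):

```python
# Hypothetical config keys; the real names live in the SDK's Constants class.
PPAF_CONFIG = "PER_PARTITION_AUTOMATIC_FAILOVER_ENABLED"
CIRCUIT_BREAKER_CONFIG = "CIRCUIT_BREAKER_ENABLED"

def resolve_flags(env):
    """Return (ppaf_enabled, circuit_breaker_enabled) from an env mapping,
    with the two settings checked separately for clarity."""
    ppaf_enabled = env.get(PPAF_CONFIG, "false").lower() == "true"
    # PPAF implies the per-partition circuit breaker.
    circuit_breaker_enabled = ppaf_enabled or env.get(CIRCUIT_BREAKER_CONFIG, "false").lower() == "true"
    return ppaf_enabled, circuit_breaker_enabled
```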

sdk/cosmos/azure-cosmos/azure/cosmos/aio/_global_partition_endpoint_manager_per_partition_automatic_failover_async.py:119

  • When no available regional endpoint is found and the failover cache is reset, it would be helpful to add targeted tests (or logging enhancements) to ensure this behavior is correctly handled in production scenarios.
self.partition_range_to_failover_info[pk_range_wrapper] = PartitionLevelFailoverInfo()

sdk/cosmos/azure-cosmos/azure/cosmos/_location_cache.py:476

  • [nitpick] The renaming from 'orderedLocations' to 'ordered_locations' improves readability; please ensure that any related documentation or in-code comments are updated to reflect this consistent naming convention.
for location in ordered_locations:

@tvaron3 (Member) commented Jul 7, 2025

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@github-actions

github-actions bot commented Jul 8, 2025

API Change Check

APIView identified API level changes in this PR and created the following API reviews

azure-cosmos

@simorenoh (Member Author)

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@simorenoh (Member Author)

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@tvaron3 (Member) left a comment

LGTM

@simorenoh (Member Author)

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@simorenoh (Member Author)

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 (Member)

LGTM - thanks for the great work.

@simorenoh simorenoh merged commit 97ce0dd into main Nov 21, 2025
21 checks passed
@simorenoh simorenoh deleted the cosmos-ppaf branch November 21, 2025 17:03
msyyc pushed a commit that referenced this pull request Nov 25, 2025
* sync PPAF

* async changes

* Update test_per_partition_automatic_failover_async.py

* CI fixes

* changelog

* broken link

* Update test_location_cache.py

* change PPAF detection logic

* Update _global_partition_endpoint_manager_circuit_breaker_core.py

* Update _global_partition_endpoint_manager_circuit_breaker_core.py

* fix tests and remove environment variable

* fix tests

* revert excluded locations change

* fix analyze

* test excluded locations

* Add different error handling for 503 and 408s, update README

* mypy, cspell, pylint

* remove tag from tests since config is service based

* add threshold-based retries for 408, 5xx errors

* update constant use, rollback session token PR change

* threshold based retries

* Update _base.py

* cspell, test fixes

* Update _service_unavailable_retry_policy.py

* mypy, pylint

* 503 behavior change, use regional contexts

* mypy, pylint, tests

* special-casing 503s

* small fix

* exclude region tests

* session retry tests

* pylint, cspell

* change errors since 503 is now retried directly

* Update sdk/cosmos/azure-cosmos/README.md

Co-authored-by: Abhijeet Mohanty <[email protected]>

* address comments

update changelog, update docs, add typehints and documentation

* Update _service_unavailable_retry_policy.py

* small test updates for 503 behavior

* further comments

* Update test_per_partition_circuit_breaker_sm_mrr.py

* test fixes

* Update test_excluded_locations.py

* small improvement to region-finding

* pylint

* Update _global_partition_endpoint_manager_per_partition_automatic_failover.py

* address comments, add threshold lock

* add more comments

* edge cases

* changes from testing

* pylint

* fixes pylint/mypy

* mypy complaining about assigning str to none

* testing changes - will roll back later

* Update _endpoint_discovery_retry_policy.py

* Update _asynchronous_request.py

* add user agent feature flags

* Update test_per_partition_automatic_failover_async.py

* move user agent logic

* sync and async match, remove print statements

* leftover timer

* Update _retry_utility.py

* use constants

* pylint

* Update CHANGELOG.md

* react to comments

* Update _retry_utility.py

* mypy pylint

* test fixes

* add lock to failure additions

---------

Co-authored-by: tvaron3 <[email protected]>
Co-authored-by: Abhijeet Mohanty <[email protected]>

5 participants