[Cosmos] Per-Partition Automatic Failover #41588
Pull Request Overview
This PR implements per-partition automatic failover for write requests in multi-region Cosmos accounts. The changes add new classes and methods for partition-level failover, update fault injection and retry policies to support the new configuration, and adjust client connection and location cache logic.
- Extended fault injection logic in tests to support max fault counts and custom responses.
- Updated endpoint manager and retry policies to integrate per-partition automatic failover.
- Added new constants and updated client connection logic for the PPAF feature.
Reviewed Changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/tests/_fault_injection_transport_async.py | Updated fault injection API with new optional parameters for controlling fault counts. |
| sdk/cosmos/azure-cosmos/tests/_fault_injection_transport.py | Mirrored changes in fault injection for the synchronous transport variant. |
| sdk/cosmos/azure-cosmos/azure/cosmos/documents.py | Added a new flag to enable per-partition failover behavior in the database account. |
| sdk/cosmos/azure-cosmos/azure/cosmos/aio/_global_partition_endpoint_manager_per_partition_automatic_failover_async.py | Introduced the asynchronous partition endpoint manager handling per-partition failover logic. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_constants.py | Added new constants for enabling per-partition automatic failover. |
| (Multiple files) | Updated retry policies, request routing, and the circuit breaker endpoint manager to support the new feature. |
Comments suppressed due to low confidence (3)
sdk/cosmos/azure-cosmos/azure/cosmos/_global_partition_endpoint_manager_circuit_breaker_core.py:62
- The use of chained environment variable lookups for configuration may lead to unexpected behavior if the two settings differ. Consider splitting the checks for per-partition failover and the traditional circuit breaker into separate conditions to improve clarity and maintainability.
circuit_breaker_enabled = os.environ.get(Constants.PER_PARTITION_AUTOMATIC_FAILOVER_ENABLED_CONFIG, os.environ.get(Constants.CIRCUIT_BREAKER_ENABLED_CONFIG, Constants.CIRCUIT_BREAKER_ENABLED_CONFIG_DEFAULT)).lower() == "true"
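A minimal sketch of what the suggested split could look like, reading each setting on its own instead of nesting one lookup as the default of the other. The environment variable names and the helper `_env_flag` are hypothetical and used only for illustration; combining the two flags with `or` is also an assumption, based on the PR description's note that enabling PPAF enables PPCB.

```python
import os

# Hypothetical environment variable names, used only for this illustration.
PPAF_ENV = "EXAMPLE_PPAF_ENABLED"
CIRCUIT_BREAKER_ENV = "EXAMPLE_CIRCUIT_BREAKER_ENABLED"


def _env_flag(name: str, default: str = "False") -> bool:
    """Read a boolean-like environment variable; only 'true' (any case) enables it."""
    return os.environ.get(name, default).strip().lower() == "true"


def is_circuit_breaker_enabled() -> bool:
    # Each setting is read separately. With the chained lookup, a PPAF value
    # of "false" would hide a circuit breaker value of "true"; here it does not.
    ppaf_enabled = _env_flag(PPAF_ENV)
    circuit_breaker_enabled = _env_flag(CIRCUIT_BREAKER_ENV)
    # Assumption: PPAF implies the circuit breaker, so either flag turns it on.
    return ppaf_enabled or circuit_breaker_enabled
```

Keeping the two reads separate makes the precedence explicit, which is the maintainability point the suppressed comment raises.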
sdk/cosmos/azure-cosmos/azure/cosmos/aio/_global_partition_endpoint_manager_per_partition_automatic_failover_async.py:119
- When no available regional endpoint is found and the failover cache is reset, it would be helpful to add targeted tests (or logging enhancements) to ensure this behavior is correctly handled in production scenarios.
self.partition_range_to_failover_info[pk_range_wrapper] = PartitionLevelFailoverInfo()
sdk/cosmos/azure-cosmos/azure/cosmos/_location_cache.py:476
- [nitpick] The renaming from 'orderedLocations' to 'ordered_locations' improves readability; please ensure that any related documentation or in-code comments are updated to reflect this consistent naming convention.
for location in ordered_locations:
…into cosmos-ppaf # Conflicts: # sdk/cosmos/azure-cosmos/tests/test_per_partition_circuit_breaker_sm_mrr.py
/azp run python - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
API Change Check
APIView identified API level changes in this PR and created the following API reviews
tvaron3
left a comment
LGTM
LGTM - thanks for the great work.
* sync PPAF
* async changes
* Update test_per_partition_automatic_failover_async.py
* CI fixes
* changelog
* broken link
* Update test_location_cache.py
* change PPAF detection logic
* Update _global_partition_endpoint_manager_circuit_breaker_core.py
* Update _global_partition_endpoint_manager_circuit_breaker_core.py
* fix tests and remove environment variable
* fix tests
* revert excluded locations change
* fix analyze
* test excluded locations
* Add different error handling for 503 and 408s, update README
* mypy, cspell, pylint
* remove tag from tests since config is service based
* add threshold-based retries for 408, 5xx errors
* update constant use, rollback session token PR change
* threshold based retries
* Update _base.py
* cspell, test fixes
* Update _service_unavailable_retry_policy.py
* mypy, pylint
* 503 behavior change, use regional contexts
* mypy, pylint, tests
* special-casing 503s
* small fix
* exclude region tests
* session retry tests
* pylint, cspell
* change errors since 503 is now retried directly
* Update sdk/cosmos/azure-cosmos/README.md Co-authored-by: Abhijeet Mohanty <[email protected]>
* address comments update changelog, update docs, add typehints and documentation
* Update _service_unavailable_retry_policy.py
* small test updates for 503 behavior
* further comments
* Update test_per_partition_circuit_breaker_sm_mrr.py
* test fixes
* Update test_excluded_locations.py
* small improvement to region-finding
* pylint
* Update _global_partition_endpoint_manager_per_partition_automatic_failover.py
* address comments, add threshold lock
* add more comments
* edge cases
* changes from testing
* pylint
* fixes pylint/mypy
* mypy complaining about assigning str to none
* testing changes - will roll back later
* Update _endpoint_discovery_retry_policy.py
* Update _asynchronous_request.py
* add user agent feature flags
* Update test_per_partition_automatic_failover_async.py
* move user agent logic
* sync and async match, remove print statements
* leftover timer
* Update _retry_utility.py
* use constants
* pylint
* Update CHANGELOG.md
* react to comments
* Update _retry_utility.py
* mypy pylint
* test fixes
* add lock to failure additions
---------
Co-authored-by: tvaron3 <[email protected]>
Co-authored-by: Abhijeet Mohanty <[email protected]>
Per-Partition Automatic Failover
This PR adds the ability for the SDK to use per-partition automatic failover (PPAF) as a resiliency mechanism for write requests in single-write multi-region Cosmos accounts. Big picture, PPAF allows the SDK to reach out to other regions when the main write region becomes unavailable for a given partition. While the partition is having issues in the main write region, the SDK can route write requests targeting that partition to one of the account's available read regions. Once the endpoint is available again, the service lets the SDK know that failing back is now possible (403.3), and the SDK attempts to return to the main write region by cycling through the account's available regions in a round-robin fashion. Neither preferred regions nor excluded regions play a role in PPAF, since the location of the current write region for a partition depends entirely on the service. It is also worth noting that enabling PPAF also enables the per-partition circuit breaker (PPCB).
Unlike PPCB, PPAF is a service feature - that is, it won't work right out of the box with just the SDK; the database account must also be configured with the feature. More information on the feature and how to enable it for an account can be found here: https://learn.microsoft.com/azure/cosmos-db/how-to-configure-per-partition-automatic-failover
TLDR: in order to use these enhancements, a user will need to have:
Design and new classes
_GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover
Takes care of the regional routing and partition unavailability tracking when PPAF is enabled by using its own cache of PartitionLevelFailoverInfo objects.
Instance attributes:
- partition_range_to_failover_info: mapping of a partition key range to their respective PartitionLevelFailoverInfo object.
- ppaf_thresholds_tracker: mapping of a partition key range to the number of consecutive errors of a given type
PartitionLevelFailoverInfo
Holds the relevant partition level failover information for a partition. This information is used in order to route requests to the next available region based on the known information about its availability.
Instance attributes:
- unavailable_regional_endpoints: Set holding the regional endpoints that have been marked as unavailable for this partition.
- current_regional_endpoint: the current regional endpoint to be used by PPAF requests - stored in order to ensure that requests are properly routed to this region if PPAF is enabled.
- _lock: To ensure updating logic is thread-safe for a given partition.
Request flow with PPAF
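As a rough illustration of how a per-partition cache of PartitionLevelFailoverInfo objects could drive routing along the branches of the flowchart in this section, here is a simplified sketch. It is not the SDK implementation: the method `try_move_to_next_location`, the function `resolve_endpoint`, its parameters, and the use of plain strings for endpoints and partition key ranges are all assumptions made for the example. It also assumes the "Is PPAF valid?" check has already passed.

```python
import threading
from typing import Dict, List, Optional, Set


class PartitionLevelFailoverInfo:
    """Simplified stand-in for the class described above (illustrative only)."""

    def __init__(self) -> None:
        self.unavailable_regional_endpoints: Set[str] = set()
        self.current_regional_endpoint: Optional[str] = None
        self._lock = threading.Lock()

    def try_move_to_next_location(self, endpoints: List[str], failed: str) -> bool:
        """Mark `failed` unavailable and select the first endpoint still available."""
        with self._lock:
            self.unavailable_regional_endpoints.add(failed)
            for endpoint in endpoints:
                if endpoint not in self.unavailable_regional_endpoints:
                    self.current_regional_endpoint = endpoint
                    return True
            return False


def resolve_endpoint(
    cache: Dict[str, PartitionLevelFailoverInfo],
    pk_range: str,
    request_endpoint: str,
    endpoints: List[str],
    default_endpoint: str,
) -> Optional[str]:
    """Pick the endpoint for a write request targeting `pk_range`."""
    info = cache.get(pk_range)
    if info is None:
        # Not cached yet: create an entry and defer to the default resolution.
        cache[pk_range] = PartitionLevelFailoverInfo()
        return default_endpoint
    if request_endpoint not in info.unavailable_regional_endpoints:
        # The request's endpoint is still considered healthy for this partition.
        info.current_regional_endpoint = request_endpoint
        return request_endpoint
    if info.try_move_to_next_location(endpoints, request_endpoint):
        return info.current_regional_endpoint
    # Every endpoint is marked unavailable: reset the entry and start over.
    cache[pk_range] = PartitionLevelFailoverInfo()
    return default_endpoint
```

In this sketch `default_endpoint` stands in for whatever the default GlobalEndpointManager would resolve, and the retry policies that mark endpoints unavailable are left out.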
flowchart TD
    A[User sends request] --> B{Is PPAF valid? do we have regions available, is this a write request, is this a multi-write account}
    B -- No --> C[Use per-partition Circuit Breaker GlobalEndpointManager]
    B -- Yes --> D{Is PK range info from this request cached?}
    D -- No --> E[Create new cache entry]
    E --> F[Use default GlobalEndpointManager to resolve endpoint]
    D -- Yes --> G{Is current request regional endpoint unavailable?}
    G -- No --> H[Update partition info with current request endpoint]
    H --> I[Send request to current endpoint]
    G -- Yes --> J[Cycle through available endpoints]
    J --> K{Found available endpoint?}
    K -- Yes --> L[Update partition info with new endpoint]
    L --> M[Send request to new endpoint]
    K -- No --> N[Reset cache entry]
    N --> O[Use default GlobalEndpointManager to resolve endpoint]
Difference in behaviors and status codes
This section covers the behavior for the different status codes that are relevant to PPAF error handling. Some of these use a threshold-based approach that requires 10 consecutive failures (the default, which can be changed) before performing a partition-level failover, while others immediately attempt a partition-level failover.
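The threshold-based approach can be pictured with a small counter keyed by partition key range and error type. This is a sketch under stated assumptions, not the SDK's code: the class `ConsecutiveFailureTracker`, its method names, and the choice to clear counts on any success are hypothetical; only the default of 10 consecutive failures comes from the description above.

```python
import threading
from collections import defaultdict
from typing import DefaultDict, Tuple


class ConsecutiveFailureTracker:
    """Counts consecutive failures per (partition key range, status code)."""

    def __init__(self, threshold: int = 10) -> None:
        self._threshold = threshold
        self._counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)
        self._lock = threading.Lock()

    def record_failure(self, pk_range: str, status_code: int) -> bool:
        """Count one failure; return True when the threshold is reached."""
        with self._lock:
            key = (pk_range, status_code)
            self._counts[key] += 1
            if self._counts[key] >= self._threshold:
                # Threshold hit: reset so the next failover needs a fresh streak.
                self._counts[key] = 0
                return True
            return False

    def record_success(self, pk_range: str) -> None:
        """Assumption: a success breaks the streak for every error type."""
        with self._lock:
            for key in [k for k in self._counts if k[0] == pk_range]:
                del self._counts[key]
```

A caller would trigger a partition-level failover only when `record_failure` returns True, while status codes that fail over immediately would bypass the tracker entirely.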
Concerns
Some of the things below have not been tested or done yet, and should be looked at before releasing this feature.
enablePerPartitionFailoverEnabled is the correct config from the service.
Additional work to be done
Eventually I'd like to also add tests with more regions in order to properly test excluded regions -> ideally, we configure an account with 3 regions to more effectively verify multi-region behavior and use excluded regions with greater depth.
Addresses #39686