Per Partition Circuit Breaker #40302

tvaron3 · 2025-03-31T22:51:15Z

Problem

Certain issues are hard to diagnose from the client side as either transient or terminal availability issues. These could be network issues, partition upgrades, partition migrations, etc. For these issues, the sdk would retry the requests in another region, but would never mark the region as unavailable unless the failures were seen in the sdk health check.

Goal

Per partition circuit breaker is meant to lower the granularity of a failover to the partition level for 408 and 5xx status codes. The sdk should now not only fail over the requests but also mark the partition as unavailable. This should prevent future requests from being tried on the affected partition for a time period.

Solution

Scope

Per partition circuit breaker is applicable for

any consistency level
document and partition key operations
single write region accounts with multiple read regions
multiple write region accounts

New Request Flow

flowchart TD
    A[Operation] --> B[Obtain effective partition key range from partition key or obtain partition key range id]
    B --> C[Use partition key range cache to determine partition key range]
    C --> G[Check if request can be marked as healthy tentative if necessary time has passed]
    G --> D[Use GlobalPartitionEndpointManagerForCircuitBreaker to extract Unavailable regions for partition key range]
    D --> E[Add them to effective excluded regions for operation]
    E --> F[Let GlobalEndpointManager determine next region]
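
The last three steps of the flow can be sketched roughly as below. This is only an illustration under stated assumptions, not the SDK's code: `unavailable_regions_by_pk_range`, `effective_excluded_regions` and `pick_region` are hypothetical names standing in for what GlobalPartitionEndpointManagerForCircuitBreaker and GlobalEndpointManager do, and the fallback when every region is excluded is an assumption.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for the circuit breaker's per-partition state:
# partition key range id -> regions currently considered unavailable.
unavailable_regions_by_pk_range: Dict[str, Set[str]] = {
    "0": {"West US"},
}


def effective_excluded_regions(pk_range_id: str, user_excluded: List[str]) -> List[str]:
    """Append the partition's unavailable regions to the user's excluded regions."""
    excluded = list(user_excluded)
    for region in sorted(unavailable_regions_by_pk_range.get(pk_range_id, set())):
        if region not in excluded:
            excluded.append(region)
    return excluded


def pick_region(preferred_regions: List[str], excluded: List[str]) -> str:
    """Return the first preferred region that is not excluded.

    Falling back to the first preferred region when all are excluded
    is an assumption made for this sketch.
    """
    for region in preferred_regions:
        if region not in excluded:
            return region
    return preferred_regions[0]


excluded = effective_excluded_regions("0", ["East US 2"])
region = pick_region(["West US", "East US 2", "Central US"], excluded)
```

Here the request for partition key range "0" skips both the user-excluded region and the region the circuit breaker marked unavailable.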

New State

Partitions will now have 4 health states tracked by a new class, PartitionHealthTracker. The failure rate and consecutive failures will be tracked for each partition. The statistics, including the number of successes and failures, will be tracked for one minute and then reset for a partition. Once the partition reaches one of the thresholds, it will be marked as unavailable. Requests will not be routed to partitions marked as unhealthy or unhealthy tentative for a region. The unavailable regions will be appended to the excluded locations from the user.

Healthy: This status indicates that the partition has seen only successful requests. A partition that is not tracked by PartitionHealthTracker is considered to be in Healthy status.

Unhealthy Tentative: This status indicates that a partition reached one of the thresholds. It will be unavailable for 1 minute until it is confirmed to be Unhealthy or Healthy.

Unhealthy: A partition is put in such a state after the sdk tried to recover the partition and failed. Requests will not go to a partition in this state.

Healthy Tentative: A request gets marked healthy tentative to check if a partition is healthy again. This request will have a request timeout of 6 seconds and will not be retried. Only one request should be marked healthy tentative when it is time to recover.
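
The threshold tracking described at the start of this section (consecutive failures plus a failure rate over a one minute window) might look roughly like the sketch below. `PartitionStats` and its fields are hypothetical and are not the PartitionHealthTracker implementation. The use of `>=` comparisons and the reset of the consecutive counter together with the window are assumptions.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PartitionStats:
    """Hypothetical per-partition counters for one region (illustrative only)."""
    consecutive_failure_limit: int = 10      # read default from the description
    failure_percentage_limit: float = 90.0   # default failure percentage tolerated
    min_requests_for_rate: int = 100         # minimum requests before the rate applies
    window_seconds: float = 60.0             # statistics are reset after one minute
    successes: int = 0
    failures: int = 0
    consecutive_failures: int = 0
    window_start: float = field(default_factory=time.monotonic)

    def record(self, success: bool, now: Optional[float] = None) -> bool:
        """Record one result and return True if a threshold was reached."""
        now = time.monotonic() if now is None else now
        if now - self.window_start >= self.window_seconds:
            # Start a fresh window (assumption: all counters reset together).
            self.successes = self.failures = self.consecutive_failures = 0
            self.window_start = now
        if success:
            self.successes += 1
            self.consecutive_failures = 0
            return False
        self.failures += 1
        self.consecutive_failures += 1
        total = self.successes + self.failures
        rate_breached = (
            total >= self.min_requests_for_rate
            and 100.0 * self.failures / total >= self.failure_percentage_limit
        )
        return self.consecutive_failures >= self.consecutive_failure_limit or rate_breached
```

A caller would move the partition to Unhealthy Tentative the first time `record` returns True.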

stateDiagram
   state "Healthy" as Healthy
   state "Unhealthy" as Unhealthy
   state "Healthy Tentative" as HealthyTentative
   state "Unhealthy Tentative" as UnhealthyTentative
   Start --> Healthy
   Healthy --> UnhealthyTentative : Failure breaking threshold
   HealthyTentative --> Healthy : Success
   HealthyTentative --> Unhealthy : Failure
   UnhealthyTentative --> HealthyTentative: After 60s
   Unhealthy --> HealthyTentative : After Some Time Period Following Exponential Backoff
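
The transitions in the diagram can be written down as a small lookup table. This is a minimal sketch: the `Health` enum, the event names and the helper functions are made up for illustration and do not come from the SDK.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "Healthy"
    UNHEALTHY_TENTATIVE = "Unhealthy Tentative"
    HEALTHY_TENTATIVE = "Healthy Tentative"
    UNHEALTHY = "Unhealthy"


# (current state, event) -> next state, one entry per edge in the diagram.
# The event names are hypothetical labels for the diagram's edge captions.
TRANSITIONS = {
    (Health.HEALTHY, "threshold_breached"): Health.UNHEALTHY_TENTATIVE,
    (Health.UNHEALTHY_TENTATIVE, "wait_elapsed"): Health.HEALTHY_TENTATIVE,
    (Health.HEALTHY_TENTATIVE, "probe_succeeded"): Health.HEALTHY,
    (Health.HEALTHY_TENTATIVE, "probe_failed"): Health.UNHEALTHY,
    (Health.UNHEALTHY, "backoff_elapsed"): Health.HEALTHY_TENTATIVE,
}


def next_state(state: Health, event: str) -> Health:
    """Return the next state, or the current one if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


def is_blocked(state: Health) -> bool:
    """Unhealthy and unhealthy tentative partitions receive no requests."""
    return state in (Health.UNHEALTHY, Health.UNHEALTHY_TENTATIVE)
```

Keeping the edges in one table makes it easy to check that Unhealthy Tentative never goes straight back to Healthy without a probe request.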

Service Request Errors are not tracked by the circuit breaker and keep their existing behavior: there are three in-region retries and then the region gets marked as unavailable.

Client Timeout Errors are not tracked by the circuit breaker because this error only gets raised right before a retry, so the relevant errors are already being tracked.

New Environment Variables

"AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER": Default will be false.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_READ": Default will be 10 errors.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_WRITE": Default will be 5 errors.
"AZURE_COSMOS_FAILURE_PERCENTAGE_TOLERATED": Default will be 90 percent.
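
As an illustration of how these variables and defaults fit together, the sketch below reads them with `os.environ`. The helper names and the accepted boolean spellings are assumptions. How the SDK itself parses these values is not shown in this description.

```python
import os


def _env_bool(name: str, default: bool) -> bool:
    # Assumption: "1", "true" and "yes" (any case) count as enabled.
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")


def _env_int(name: str, default: int) -> int:
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default


def load_circuit_breaker_settings() -> dict:
    """Collect the four settings with the defaults listed in the description."""
    return {
        "enabled": _env_bool("AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER", False),
        "read_errors": _env_int("AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_READ", 10),
        "write_errors": _env_int("AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_WRITE", 5),
        "failure_percentage": _env_int("AZURE_COSMOS_FAILURE_PERCENTAGE_TOLERATED", 90),
    }
```

To opt in, a user would set `AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER` in the process environment before creating the client.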

Other potential configs to expose

initial unavailable time. default 1 minute
max unavailable time. default 20 minutes
exponential backoff factor. default 2
minimum number of requests for failure rate threshold. default 100
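
With the defaults above, the time a partition stays unavailable between recovery attempts would grow as in the sketch below. `unavailable_seconds` is a hypothetical helper, and counting from zero failed recoveries is an assumption.

```python
def unavailable_seconds(failed_recoveries: int,
                        initial: float = 60.0,
                        factor: float = 2.0,
                        maximum: float = 20 * 60.0) -> float:
    """Wait before the next recovery attempt after `failed_recoveries` failed probes."""
    if failed_recoveries < 0:
        raise ValueError("failed_recoveries must be non-negative")
    # Exponential growth from the initial wait, capped at the maximum.
    return min(initial * factor ** failed_recoveries, maximum)


schedule = [unavailable_seconds(n) for n in range(6)]
# 60, 120, 240, 480, 960 seconds, then capped at 1200 seconds (20 minutes)
```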

Other Implementations and Differences

In the other sdks, there is no failure rate threshold for failing over a partition. This was added to the python sdk for the scenario where a partition is seeing a lot of non-consecutive errors that cause availability loss.
In this implementation, unlike the other sdks, the partitions do not get marked as healthy in a background task.
The python sdk tracks all 5xx errors, not just 503s.

Azure/azure-sdk-for-java#39265
Azure/azure-cosmos-dotnet-v3#5023

Other Changes

Container cache will have two mappings: one from the container rid and one from the container link. The first document request on a container will now initialize the cache.
Fixed bug in the resource token parsing logic where we wouldn't populate the authorization header for metadata requests in the lifecycle of a document operation.
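
A cache with two mappings could be shaped like the sketch below. `ContainerPropertiesCache` and its methods are hypothetical. Only the `_rid` property name is taken from the code in this PR.

```python
from typing import Dict, Optional


class ContainerPropertiesCache:
    """Hypothetical cache reachable by container link or by container rid."""

    def __init__(self) -> None:
        self._by_link: Dict[str, dict] = {}
        self._by_rid: Dict[str, dict] = {}

    def set(self, container_link: str, properties: dict) -> None:
        # Both mappings point at the same properties object.
        self._by_link[container_link] = properties
        self._by_rid[properties["_rid"]] = properties

    def get_by_link(self, container_link: str) -> Optional[dict]:
        return self._by_link.get(container_link)

    def get_by_rid(self, rid: str) -> Optional[dict]:
        return self._by_rid.get(rid)
```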

Follow up work

Queries without partition information like "Select * From c" and read all items are currently not covered by circuit breaker. These will need to be changed to make requests to all the partitions instead of leaving this work up to the sdk.
Read and write hedging (dependent on non idempotent retries). [Feature] Hedging #40919
Deleting the pk range entry of the parent partition in the partition health tracker in the case of a split

Relevant Issue

#39687

tvaron3 · 2025-05-12T23:56:22Z

/azp run python - cosmos - tests

azure-pipelines · 2025-05-12T23:56:40Z

Azure Pipelines successfully started running 1 pipeline(s).

sdk/cosmos/azure-cosmos/azure/cosmos/aio/_retry_utility_async.py

sdk/cosmos/azure-cosmos/azure/cosmos/_global_partition_endpoint_manager_circuit_breaker.py

sdk/cosmos/azure-cosmos/azure/cosmos/_partition_health_tracker.py

FabianMeiswinkel

LGTM - Thanks!

simorenoh

two small comments, great work!!

sdk/cosmos/azure-cosmos/azure/cosmos/_global_partition_endpoint_manager_circuit_breaker.py

sdk/cosmos/azure-cosmos/pytest.ini

…into tvaron3/ppcb # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md # sdk/cosmos/azure-cosmos/azure/cosmos/_global_endpoint_manager.py

…into tvaron3/ppcb # Conflicts: # sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py # sdk/cosmos/azure-cosmos/azure/cosmos/_request_object.py # sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py # sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py # sdk/cosmos/azure-cosmos/azure/cosmos/aio/_container.py # sdk/cosmos/azure-cosmos/azure/cosmos/aio/_cosmos_client_connection_async.py # sdk/cosmos/azure-cosmos/tests/test_excluded_locations.py # sdk/cosmos/azure-cosmos/tests/test_excluded_locations_async.py # sdk/cosmos/azure-cosmos/tests/test_location_cache.py

tvaron3 · 2025-05-27T18:27:32Z

/azp run python - cosmos - tests

azure-pipelines · 2025-05-27T18:27:52Z

Azure Pipelines successfully started running 1 pipeline(s).

sdk/cosmos/azure-cosmos/azure/cosmos/_base.py

sdk/cosmos/azure-cosmos/azure/cosmos/_container_recreate_retry_policy.py

sdk/cosmos/azure-cosmos/azure/cosmos/_location_cache.py

sdk/cosmos/azure-cosmos/azure/cosmos/_retry_utility.py

sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py

…into tvaron3/ppcb # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md # sdk/cosmos/azure-cosmos/azure/cosmos/_utils.py

tvaron3 · 2025-05-28T00:19:08Z

/azp run python - cosmos - tests

azure-pipelines · 2025-05-28T00:19:28Z

Azure Pipelines successfully started running 1 pipeline(s).

tvaron3 · 2025-05-28T17:24:36Z

/azp run python - cosmos - tests

azure-pipelines · 2025-05-28T17:24:57Z

Azure Pipelines successfully started running 1 pipeline(s).

allenkim0129 · 2025-05-28T21:17:20Z

sdk/cosmos/azure-cosmos/azure/cosmos/_synchronized_request.py

    client_timeout = kwargs.get('timeout')
    start_time = time.time()
+    if request_params.healthy_tentative_location:
+        read_timeout = connection_policy.RecoveryReadTimeout


This timeout can be overridden by connection_policy.DBAReadTimeout in line 99. Would this be okay?

allenkim0129 · 2025-05-28T21:33:07Z

sdk/cosmos/azure-cosmos/azure/cosmos/aio/_container.py

            request_options["maxIntegratedCacheStaleness"] = max_integrated_cache_staleness_in_ms
-        if self.container_link in self.__get_client_container_caches():
-            request_options["containerRID"] = self.__get_client_container_caches()[self.container_link]["_rid"]
+        await self._get_properties_with_options(request_options)


Nit: Since the _get_properties_with_options method returns the container properties, which is self.client_connection._container_properties_cache[self.container_link], we can use the return value to get the containerRID in line 365.

sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py

tvaron3 and others added 28 commits February 5, 2025 19:03

change default read timeout

ff20cf9

fix tests

40e43c4

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-python …

faf6c27

…into tvaron3/readtimeout

Add read timeout tests for database account calls

aefe30b

fix timeout retry policy

9a234f8

Fixed the timeout logic

8859c9f

Merge pull request #2 from tvaron3/tvaron3/readTimeout

8b166fc

Fixed the timeout logic

Fixed the timeout retry policy

ac78da9

Merge pull request #3 from tvaron3/readtimeout

e8bc02e

Fixed the timeout retry policy

Mock tests for timeout and failover retry policy

09aac90

Merge branch 'tvaron3/readtimeout' of https://github.com/tvaron3/azur…

48a20fa

…e-sdk-for-python into users/fabianm/tests

Create test_dummy.py

f22e7d2

Update test_dummy.py

dd8a466

Update test_dummy.py

8ac11c5

Update test_dummy.py

b53e2e9

Iterating on fault injection tooling

973ec44

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-python …

f25af53

…into users/fabianm/tests

Refactoring to have FaultInjectionTransport in its own file

5d72848

Update test_dummy.py

8c9aa4b

Reafctoring FaultInjectionTransport

7260e9d

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-python …

bf3e60b

…into users/fabianm/tests

Iterating on tests

0705aeb

Prettifying tests

baf7aea

small refactoring

e90b722

Adding MM topology on Emulator

cb58896

Adding cross region retry tests

46ec31c

Add Excluded Locations Feature

f03f51f

initial ppcb changes

cf42098

github-actions bot added the Cosmos label Mar 31, 2025

github-project-automation bot added this to CosmosDB Python Eco-System Mar 31, 2025

jeet1995 reviewed May 13, 2025

sdk/cosmos/azure-cosmos/azure/cosmos/aio/_retry_utility_async.py (outdated)

jeet1995 reviewed May 13, 2025

sdk/cosmos/azure-cosmos/azure/cosmos/_global_partition_endpoint_manager_circuit_breaker.py

jeet1995 reviewed May 13, 2025

sdk/cosmos/azure-cosmos/azure/cosmos/_partition_health_tracker.py

react to comments

08075bd

FabianMeiswinkel approved these changes May 19, 2025

simorenoh mentioned this pull request May 19, 2025

[Cosmos] Session container fixes #40366

Closed

simorenoh requested changes May 23, 2025

sdk/cosmos/azure-cosmos/azure/cosmos/_global_partition_endpoint_manager_circuit_breaker.py (outdated)

sdk/cosmos/azure-cosmos/pytest.ini (outdated)

tvaron3 added 4 commits May 27, 2025 09:44

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-python …

adce28f

…into tvaron3/ppcb # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md # sdk/cosmos/azure-cosmos/azure/cosmos/_global_endpoint_manager.py

react to async client changes from merge

da362fc

react to comments

c29c52a

fix tests

2feca2a

allenkim0129 reviewed May 27, 2025

tvaron3 added 3 commits May 27, 2025 17:07

react to comments and fix tests

c0441bf

react to comments and fix tests

5f8b7e8

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-python …

9c37222

…into tvaron3/ppcb # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md # sdk/cosmos/azure-cosmos/azure/cosmos/_utils.py

fix tests

3010aa5

simorenoh approved these changes May 29, 2025

tvaron3 merged commit 37c99f0 into Azure:main May 29, 2025
32 checks passed

github-project-automation bot moved this to Done in CosmosDB Python Eco-System May 29, 2025

allenkim0129 reviewed May 29, 2025

bambriz mentioned this pull request Jun 30, 2025

[Cosmos] Session container fixes new branch #41678

Merged

7 participants
