@tvaron3 tvaron3 commented Mar 31, 2025

Problem

Certain issues are hard to classify from the client side as either transient or terminal availability issues. These could be network issues, partition upgrades, partition migrations, etc. For these issues, the sdk would retry the requests in another region, but would never mark the region as unavailable unless the failures were also seen in the sdk health check.

Goal

Per partition circuit breaker is meant to lower the granularity of a failover to the partition level for 408 and 5xx status codes. The sdk should now not only fail over the requests but also mark the partition as unavailable. This should prevent future requests from being tried on the affected partition for a period of time.

Solution

Scope

Per partition circuit breaker is applicable for

  • any consistency level
  • document and partition key operations
  • single write region accounts with multiple read regions
  • multiple write region accounts

New Request Flow

flowchart TD
    A[Operation] --> B[Obtain effective partition key range from partition key or obtain partition key range id]
    B --> C[Use partition key range cache to determine partition key range]
    C--> G[Check if request can be marked as healthy tentative if necessary time has passed]
    G --> D[Use GlobalPartitionEndpointManagerForCircuitBreaker to extract Unavailable regions for partition key range]
    D --> E[Add them to effective excluded regions for operation]
    E --> F[Let GlobalEndpointManager determine next region]
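
The last three steps of the flow above can be sketched roughly as follows. This is only an illustration, not the sdk's code: the dictionary standing in for the state held by GlobalPartitionEndpointManagerForCircuitBreaker, both function names, and the fallback to the first preferred region are assumptions.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for the circuit breaker's state:
# partition key range id -> regions currently considered unavailable.
unavailable_regions_by_pk_range: Dict[str, Set[str]] = {
    "0": {"West US"},
}


def effective_excluded_regions(pk_range_id: str, user_excluded: List[str]) -> List[str]:
    """Append the partition's unavailable regions to the user's excluded locations."""
    excluded = list(user_excluded)
    for region in sorted(unavailable_regions_by_pk_range.get(pk_range_id, set())):
        if region not in excluded:
            excluded.append(region)
    return excluded


def next_region(preferred: List[str], excluded: List[str]) -> str:
    """Pick the first preferred region that is not excluded.

    Falling back to the first preferred region when everything is excluded
    is an assumption made for this sketch.
    """
    for region in preferred:
        if region not in excluded:
            return region
    return preferred[0]
```

With this data, a request on partition key range "0" that already excludes "East US" would skip both "East US" and "West US" and land on the next preferred region.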

New State

Partitions will now have 4 health states tracked by a new class, PartitionHealthTracker. The failure rate and consecutive failures will be tracked per partition. The statistics, including the number of successes and failures, will be tracked for one minute and then reset for a partition. Once the partition reaches one of the thresholds, it will be marked as unavailable. Requests will not be routed to partitions marked as Unhealthy or Unhealthy Tentative for a region. The unavailable regions will be appended to the excluded locations supplied by the user.
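
A minimal sketch of the two thresholds, assuming a hypothetical `PartitionStats` class (this is not the sdk's tracker). The default values mirror the read settings listed later in this description; the use of `>=` for the comparisons, and keeping the consecutive failure count across window resets, are assumptions.

```python
import time
from dataclasses import dataclass, field


@dataclass
class PartitionStats:
    """Hypothetical per-partition, per-region counters."""
    consecutive_failures: int = 0
    successes: int = 0
    failures: int = 0
    window_start: float = field(default_factory=time.monotonic)

    def record(self, success: bool, now: float, window_seconds: float = 60.0) -> None:
        # Reset the success/failure counts once the one-minute window has elapsed.
        if now - self.window_start >= window_seconds:
            self.successes = 0
            self.failures = 0
            self.window_start = now
        if success:
            self.successes += 1
            self.consecutive_failures = 0
        else:
            self.failures += 1
            self.consecutive_failures += 1

    def breaches_threshold(self, consecutive_limit: int = 10,
                           failure_percentage: float = 90.0,
                           min_requests: int = 100) -> bool:
        # Threshold 1: too many failures in a row.
        if self.consecutive_failures >= consecutive_limit:
            return True
        # Threshold 2: failure rate, only once enough requests have been seen.
        total = self.successes + self.failures
        return total >= min_requests and 100.0 * self.failures / total >= failure_percentage
```

The failure rate check is what catches a partition whose errors are frequent but interleaved with occasional successes, which the consecutive counter alone would miss.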

Healthy: This status indicates that the partition has seen only successful requests. A partition that is not tracked by PartitionHealthTracker is considered to be in Healthy status.

Unhealthy Tentative: This status indicates that a partition reached one of the thresholds. It will be unavailable for 1 minute until it is confirmed to be Unhealthy or Healthy.

Unhealthy: A partition is put in such a state after the sdk tried to recover the partition and failed. Requests will not go to a partition in this state.

Healthy Tentative: A request gets marked healthy tentative to check if a partition is healthy again. This request will have a request timeout of 6 seconds and will not be retried. Only one request should be marked healthy tentative when it is time to recover.

stateDiagram
   state "Healthy" as Healthy
   state "Unhealthy" as Unhealthy
   state "Healthy Tentative" as HealthyTentative
   state "Unhealthy Tentative" as UnhealthyTentative
   Start --> Healthy
   Healthy --> UnhealthyTentative : Failure breaking threshold
   HealthyTentative --> Healthy : Success
   HealthyTentative --> Unhealthy : Failure
   UnhealthyTentative --> HealthyTentative: After 60s
   Unhealthy --> HealthyTentative : After Some Time Period Following Exponential Backoff
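
The state diagram above can be written as a small transition table. This is a sketch only: the `Health` enum, the event names, and the choice to ignore unknown events are all hypothetical and not taken from the sdk.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "Healthy"
    UNHEALTHY_TENTATIVE = "Unhealthy Tentative"
    HEALTHY_TENTATIVE = "Healthy Tentative"
    UNHEALTHY = "Unhealthy"


# (current state, event) -> next state. Event names are made up for this sketch.
TRANSITIONS = {
    (Health.HEALTHY, "threshold_breached"): Health.UNHEALTHY_TENTATIVE,
    (Health.UNHEALTHY_TENTATIVE, "wait_elapsed"): Health.HEALTHY_TENTATIVE,
    (Health.HEALTHY_TENTATIVE, "probe_succeeded"): Health.HEALTHY,
    (Health.HEALTHY_TENTATIVE, "probe_failed"): Health.UNHEALTHY,
    (Health.UNHEALTHY, "backoff_elapsed"): Health.HEALTHY_TENTATIVE,
}


def next_state(state: Health, event: str) -> Health:
    """Return the next state, leaving the state unchanged for events with no transition."""
    return TRANSITIONS.get((state, event), state)
```

Note that there is no direct edge from either unhealthy state back to Healthy: recovery always goes through a single Healthy Tentative probe request.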

Service Request Errors are not tracked by the circuit breaker, and their behavior stays the same: there are three in-region retries, and then the region gets marked as unavailable.

Client Timeout Errors are not tracked by the circuit breaker because this error is only raised right before a retry, so the relevant errors are already being tracked.

New Environment Variables

"AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER": Default will be false.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_READ": Default will be 10 errors.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_WRITE": Default will be 5 errors.
"AZURE_COSMOS_FAILURE_PERCENTAGE_TOLERATED": Default will be 90 percent.
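
As an illustration of how these variables and their defaults fit together, the sketch below reads them from a mapping. The helper name `read_circuit_breaker_settings` and the way the boolean is parsed (case-insensitive "true") are assumptions, not the sdk's actual parsing logic.

```python
import os
from typing import Any, Dict, Mapping


def read_circuit_breaker_settings(environ: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Read the circuit breaker settings, falling back to the documented defaults."""
    enabled_raw = environ.get("AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER", "False")
    return {
        # Assumption: any casing of "true" enables the feature.
        "enabled": enabled_raw.strip().lower() == "true",
        "consecutive_errors_read": int(
            environ.get("AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_READ", "10")),
        "consecutive_errors_write": int(
            environ.get("AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_WRITE", "5")),
        "failure_percentage": int(
            environ.get("AZURE_COSMOS_FAILURE_PERCENTAGE_TOLERATED", "90")),
    }
```

Since the feature is off by default, opting in would look like setting `AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER` to `True` in the process environment before the client is created.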

Other potential configs to expose

  • initial unavailable time. default 1 minute
  • max unavailable time. default 20 minutes
  • exponential backoff factor. default 2
  • minimum number of requests for failure rate threshold. default 100
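
The first three values combine into an exponential backoff for how long a partition stays unavailable. A minimal sketch, assuming the wait is multiplied by the factor after each failed recovery attempt and capped at the maximum; the function name and the exact formula are illustrative and not taken from the sdk.

```python
def unavailable_seconds(failed_recoveries: int,
                        initial: float = 60.0,
                        factor: float = 2.0,
                        maximum: float = 20 * 60.0) -> float:
    """Time a partition stays unavailable after `failed_recoveries` failed probes."""
    return min(initial * factor ** failed_recoveries, maximum)
```

Under these assumptions the waits would be 1, 2, 4, 8, and 16 minutes, after which every further wait is capped at 20 minutes.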

Other Implementations and Differences

  • In the other sdks, there is no failure rate threshold for failing over a partition. This was added to the python sdk for the scenario where a partition is seeing a lot of non-consecutive errors that cause availability loss.
  • In this implementation, unlike the other sdks, partitions do not get marked as healthy in a background task.
  • The python sdk tracks all 5xx errors, not just 503s.

Azure/azure-sdk-for-java#39265
Azure/azure-cosmos-dotnet-v3#5023

Other Changes

  • Container cache will have two mappings: one from the container rid and one from the container link. The first document request on a container will now initialize the cache.
  • Fixed bug in the resource token parsing logic where we wouldn't populate the authorization header for metadata requests in the lifecycle of a document operation.

Follow up work

  • Queries without partition information like "Select * From c" and read all items are currently not covered by circuit breaker. These will need to be changed to make requests to all the partitions instead of leaving this work up to the sdk.
  • Read and write hedging (dependent on non idempotent retries). [Feature] Hedging #40919
  • Deleting the pk range entry of the parent partition in the partition health tracker in the case of a split

Relevant Issue

#39687

tvaron3 and others added 28 commits February 5, 2025 19:03
Fixed the timeout retry policy

tvaron3 commented May 12, 2025

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel FabianMeiswinkel left a comment

LGTM - Thanks!

@simorenoh simorenoh left a comment

two small comments, great work!!

tvaron3 added 4 commits May 27, 2025 09:44
…into tvaron3/ppcb

# Conflicts:
#	sdk/cosmos/azure-cosmos/CHANGELOG.md
#	sdk/cosmos/azure-cosmos/azure/cosmos/_global_endpoint_manager.py
…into tvaron3/ppcb

# Conflicts:
#	sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py
#	sdk/cosmos/azure-cosmos/azure/cosmos/_request_object.py
#	sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py
#	sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py
#	sdk/cosmos/azure-cosmos/azure/cosmos/aio/_container.py
#	sdk/cosmos/azure-cosmos/azure/cosmos/aio/_cosmos_client_connection_async.py
#	sdk/cosmos/azure-cosmos/tests/test_excluded_locations.py
#	sdk/cosmos/azure-cosmos/tests/test_excluded_locations_async.py
#	sdk/cosmos/azure-cosmos/tests/test_location_cache.py

tvaron3 commented May 27, 2025

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

tvaron3 added 3 commits May 27, 2025 17:07
…into tvaron3/ppcb

# Conflicts:
#	sdk/cosmos/azure-cosmos/CHANGELOG.md
#	sdk/cosmos/azure-cosmos/azure/cosmos/_utils.py

tvaron3 commented May 28, 2025

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

tvaron3 commented May 28, 2025

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@tvaron3 tvaron3 merged commit 37c99f0 into Azure:main May 29, 2025
32 checks passed
client_timeout = kwargs.get('timeout')
start_time = time.time()
if request_params.healthy_tentative_location:
    read_timeout = connection_policy.RecoveryReadTimeout

This timeout can be overridden by connection_policy.DBAReadTimeout in line 99. Would this be okay?

request_options["maxIntegratedCacheStaleness"] = max_integrated_cache_staleness_in_ms
if self.container_link in self.__get_client_container_caches():
    request_options["containerRID"] = self.__get_client_container_caches()[self.container_link]["_rid"]
await self._get_properties_with_options(request_options)

@allenkim0129 allenkim0129 May 28, 2025

Nit: Since the _get_properties_with_options method returns the container properties, which is self.client_connection._container_properties_cache[self.container_link], we can use the return value to get the containerRID in line 365.

Projects

Status: Done

7 participants