-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Per Partition Circuit Breaker #40302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…into tvaron3/readtimeout
Fixed the timeout logic
Fixed the timeout retry policy
…into users/fabianm/tests
…into users/fabianm/tests
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
sdk/cosmos/azure-cosmos/azure/cosmos/aio/_retry_utility_async.py
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/azure/cosmos/_global_partition_endpoint_manager_circuit_breaker.py
Show resolved
Hide resolved
FabianMeiswinkel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - Thanks!
simorenoh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two small comments, great work!!
sdk/cosmos/azure-cosmos/azure/cosmos/_global_partition_endpoint_manager_circuit_breaker.py
Outdated
Show resolved
Hide resolved
…into tvaron3/ppcb # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md # sdk/cosmos/azure-cosmos/azure/cosmos/_global_endpoint_manager.py
…into tvaron3/ppcb # Conflicts: # sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py # sdk/cosmos/azure-cosmos/azure/cosmos/_request_object.py # sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py # sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py # sdk/cosmos/azure-cosmos/azure/cosmos/aio/_container.py # sdk/cosmos/azure-cosmos/azure/cosmos/aio/_cosmos_client_connection_async.py # sdk/cosmos/azure-cosmos/tests/test_excluded_locations.py # sdk/cosmos/azure-cosmos/tests/test_excluded_locations_async.py # sdk/cosmos/azure-cosmos/tests/test_location_cache.py
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
sdk/cosmos/azure-cosmos/azure/cosmos/_container_recreate_retry_policy.py
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py
Show resolved
Hide resolved
…into tvaron3/ppcb # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md # sdk/cosmos/azure-cosmos/azure/cosmos/_utils.py
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| client_timeout = kwargs.get('timeout') | ||
| start_time = time.time() | ||
| if request_params.healthy_tentative_location: | ||
| read_timeout = connection_policy.RecoveryReadTimeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This timeout can be overridden by connection_policy.DBAReadTimeout in line 99. Would this be okay?
| request_options["maxIntegratedCacheStaleness"] = max_integrated_cache_staleness_in_ms | ||
| if self.container_link in self.__get_client_container_caches(): | ||
| request_options["containerRID"] = self.__get_client_container_caches()[self.container_link]["_rid"] | ||
| await self._get_properties_with_options(request_options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Since _get_properties_with_options method returns container properties, which is self.client_connection._container_properties_cache[self.container_link], we can use the return value to get the containerRID in line365.
Problem
There are certain issues that hard to diagnose from the client side if these are transient or if they are terminal availability issues. These could be network issues, partition upgrades, partition migrations, etc. For these issues, the sdk would retry the requests on another region, but would never mark the region as unavailable unless the failures were seen in the sdk health check.
Goal
Per partition circuit breaker is meant to lower the granularity down of a failover to the partition level for 408, 5xx status codes. The sdk should also now not only failover the requests but mark the partition as unavailable. This should prevent future requests for a time period from trying on the affected partition.
Solution
Scope
Per partition circuit breaker is applicable for
New Request Flow
flowchart TD A[Operation] --> B[Obtain effective partition key range from partition key or Obtain partition key range id ] B --> C[Use partition key range cache to determine partition key range] C--> G[Check if request can be marked as healthy tentative if necessary time has passed] G --> D[Use GlobalPartitionEndpointManagerForCircuitBreaker to extract Unavailable regions for partition key range] D --> E[Add them to effective excluded regions for operation] E --> F[Let GlobalEndpointManager determine next region]New State
Partitions will now have 4 health states tracked by a new class ParitionHealthTracker. The failure rate and consecutive failures will be tracked for partition. The statistics including the number of success and failures will be tracked for one minute and then reset for a partition. Once the partition reaches one of the thresholds it will be marked as unavailable. Requests will not be routed to partitions marked as unhealthy or unhealthy tentative for a region. The unavailable regions will be appended to the excluded locations from the user.
Healthy: This status indicates that the partition has seen only successful requests. A partition that is not tracked by PartitionHealthTracker is considered to be in Healthy status.
Unhealthy Tentative: This status indicates that a partition reached one of the thresholds. It will be unavailable for 1 minute until it is confirmed to be Unhealthy or Healthy.
Unhealthy: A partition is put in such a state after the sdk tried to recover the partition and failed. Requests will not go to a partition in this state.
Healthy Tentative: A request gets marked healthy tentative to check if a partition is healthy again. This request will have a request timeout of 6 seconds and will not be retried. Only one request should be marked healthy tentative when it is time to recover.
Service Request Errors are not tracked for circuit breaker and will keep behavior the same. There are three in region retries and then a region gets marked as unavailable.
Client Timeout Errors are not being tracked by circuit breaker because this error only gets raised right before a retry so the relevant errors are already being tracked.
New Environment Variables
"AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER": Default will be false.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_READ": Default will be 10 errors.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_WRITE": Default will be 5 errors.
"AZURE_COSMOS_FAILURE_PERCENTAGE_TOLERATED": Default would be 90 percent.
Other potential configs to expose
Other Implementations and Differences
Azure/azure-sdk-for-java#39265
Azure/azure-cosmos-dotnet-v3#5023
Other Changes
Follow up work
Relevant Issue
#39687