Partition level circuit breaker #39265
Introduction
A physical partition for a region might see the following issues:
- Read Session Not Available (404/1002) errors because the region has not progressed up to the requested session token.
- Any issue with the primary replica could result in write-availability loss. Even with threshold-based availability strategy and non-idempotent retriable writes enabled, a write may only succeed from the hedged region after the configured threshold duration, which can breach the P99 latency SLA requirement of a downstream application.
- In certain cases, although rare (due to 4 replicas being eligible to serve reads), reads could be stuck in a retry loop in the local region - e.g. when the replica in the local region's physical partition is lagging and requests to this region keep seeing 404:1002s (in Session consistency scenarios). Such requests can time out if an end-to-end timeout is set for them.
The partition-level circuit breaker feature aims to improve tail latency and write availability (when a primary replica is down) by bookmarking terminal failures for a particular physical partition in a particular region, short-circuiting requests to that physical partition in regions where these terminal failures have exceeded a certain threshold, and instead forcing the SDK to leverage other available regions for that physical partition.
DISCLAIMER: This feature is only applicable when using the SDK with multi-write Cosmos DB accounts.
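For context, the end-to-end timeout and threshold-based availability strategy referenced above are configured through the public API. Below is a minimal sketch; the endpoint, key, regions and duration values are illustrative placeholders, not recommendations:

```java
import java.time.Duration;
import java.util.List;

import com.azure.cosmos.CosmosAsyncClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosEndToEndOperationLatencyPolicyConfig;
import com.azure.cosmos.CosmosEndToEndOperationLatencyPolicyConfigBuilder;
import com.azure.cosmos.ThresholdBasedAvailabilityStrategy;

public class AvailabilityStrategyExample {
    public static void main(String[] args) {
        // End-to-end timeout of 2s per operation; hedge to the next preferred
        // region after 500ms, then every additional 100ms (illustrative values).
        CosmosEndToEndOperationLatencyPolicyConfig e2ePolicy =
            new CosmosEndToEndOperationLatencyPolicyConfigBuilder(Duration.ofSeconds(2))
                .availabilityStrategy(new ThresholdBasedAvailabilityStrategy(
                    Duration.ofMillis(500), Duration.ofMillis(100)))
                .build();

        CosmosAsyncClient client = new CosmosClientBuilder()
            .endpoint("<account-endpoint>") // placeholder
            .key("<account-key>")           // placeholder
            .preferredRegions(List.of("East US", "West US"))
            .endToEndOperationLatencyPolicyConfig(e2ePolicy)
            .buildAsyncClient();
    }
}
```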
Design
Major flows involved
The feature involves two major flows:
- Tracking terminal failures from a region and marking that region as Unavailable for a physical partition.
- Transitioning the health status of a region for a physical partition between Healthy, HealthyWithFailures, HealthyTentative and Unavailable.

Tracking terminal failures from a region
When a service-generated 503 (Service Unavailable) is seen on an operation:

```mermaid
flowchart TD
    A[Upstream layer] --> |503 bubbled up|B[ClientRetryPolicy]
    B --> |Handle exception for partition key range and region contacted pair|C[GlobalPartitionEndpointManagerForCircuitBreaker]
```

When an OperationCancelledException is seen on an operation:

```mermaid
flowchart TD
    A[Downstream layer] --> |OperationCancelledException|B[RxDocumentClientImpl]
    B --> |Handle exception for a partition key range and first contacted region|C[GlobalPartitionEndpointManagerForCircuitBreaker]
```

When a RequestTimeoutException is seen for write requests: reads which time out surface as a GoneException, which helps such requests to be retried. Writes with the non-idempotent write retry policy disabled fail with a RequestTimeoutException - this can be tracked as well for circuit-breaking behavior.

```mermaid
flowchart TD
    A[Upstream layer] --> |RequestTimeoutException bubbled up|B[ClientRetryPolicy]
    B --> |Handle exception for partition key range and region contacted pair provided it is a write operation with non-idempotent write retry policy disabled|C[GlobalPartitionEndpointManagerForCircuitBreaker]
```

When a hedged request succeeds and the request to the first contacted region is cancelled (CANCEL signal):

```mermaid
flowchart TD
    A[Downstream layer] --> |CANCEL signal|B[RxDocumentClientImpl]
    B --> |Handle exception for partition key range and first contacted region provided hedged region saw successful response for that partition key range|C[GlobalPartitionEndpointManagerForCircuitBreaker]
```

Short circuiting a region
For point operations:

```mermaid
flowchart TD
    A[Point operation] --> B[Extract partition key and partition key definition to determine effective partition key string]
    B --> C[Use partition key range cache to determine partition key range]
    C --> D[Use GlobalPartitionEndpointManagerForCircuitBreaker to extract Unavailable regions for partition key range]
    D --> E[Add them to effective excluded regions for point operation]
    E --> F[Let GlobalEndpointManager determine next region]
```

For query operations:

```mermaid
flowchart TD
    A[Query operation] --> B[In DocumentProducer, apply feed range filtering to an EPK-range scoped request]
    B --> C[Map the resolved partition key range for the feed range]
    C --> D[Use GlobalPartitionEndpointManagerForCircuitBreaker to extract Unavailable regions for partition key range]
    D --> E[Add them to effective excluded regions for EPK-scoped request]
    E --> F[Let GlobalEndpointManager determine next region]
```

For change feed operations:

```mermaid
flowchart TD
    A[Change feed operation] --> B[In ChangeFeedQueryImpl, apply feed range filtering to an EPK-range scoped request]
    B --> C[Map the resolved partition key range for the feed range]
    C --> D[Use GlobalPartitionEndpointManagerForCircuitBreaker to extract Unavailable regions for partition key range]
    D --> E[Add them to effective excluded regions for EPK-scoped request]
    E --> F[Let GlobalEndpointManager determine next region]
```
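All three flows share the same short-circuit step: the Unavailable regions reported by the circuit breaker for the resolved partition key range are merged into the request's excluded regions before region selection. A minimal sketch of that merge, with the helper name and URI-based region representation as illustrative assumptions:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper: append the circuit breaker's Unavailable regions for a
// partition to the request-level excluded regions, so that GlobalEndpointManager
// skips them when determining the next region to route to.
final class ExcludedRegionsHelper {
    static List<URI> effectiveExcludedRegions(
            List<URI> requestLevelExcludedRegions,
            List<URI> unavailableRegionsForPartition) {
        List<URI> merged = new ArrayList<>(requestLevelExcludedRegions);
        merged.addAll(unavailableRegionsForPartition);
        return merged;
    }
}
```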
Classes / Concepts introduced

GlobalPartitionEndpointManagerForCircuitBreaker

This class stores mappings between a PartitionKeyRangeWrapper instance (which encapsulates the physical partition representation along with the collection rid) and physical-partition-specific health metadata. The health metadata is further segregated per location, with the partition's health in each location represented by an instance of LocationSpecificHealthContext.
The GlobalPartitionEndpointManagerForCircuitBreaker exposes methods to handle an exception for a partition and region pair, to handle a successful response from a region for a partition, and to identify which regions are unavailable for a partition.
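A minimal sketch of this surface, assuming a ConcurrentHashMap-backed mapping; the stand-in types and method bodies are illustrative, not the SDK implementation:

```java
import java.net.URI;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Stand-ins for the types named in this design (fields elided, illustrative only).
record PartitionKeyRangeWrapper(String partitionKeyRangeId, String collectionResourceId) {}
class PartitionLevelLocationUnavailabilityInfo {
    // holds one LocationSpecificHealthContext per region for this partition
}

// Illustrative sketch of the manager's surface.
class GlobalPartitionEndpointManagerForCircuitBreaker {

    private final ConcurrentHashMap<PartitionKeyRangeWrapper, PartitionLevelLocationUnavailabilityInfo>
        partitionLevelUnavailabilityInfo = new ConcurrentHashMap<>();

    // Record a terminal failure seen for this partition from this region.
    void handleLocationExceptionForPartitionKeyRange(PartitionKeyRangeWrapper wrapper, URI region) {
        partitionLevelUnavailabilityInfo
            .computeIfAbsent(wrapper, w -> new PartitionLevelLocationUnavailabilityInfo());
        // ...delegate the health-status transition for 'region' to the stored info
    }

    // Record a success, possibly promoting a HealthyTentative region toward Healthy.
    void handleLocationSuccessForPartitionKeyRange(PartitionKeyRangeWrapper wrapper, URI region) {
        // ...reset failure counters / upgrade status via the transition handler
    }

    // Regions currently short-circuited (Unavailable) for this partition.
    List<URI> getUnavailableRegionsForPartitionKeyRange(PartitionKeyRangeWrapper wrapper) {
        return List.of(); // sketch: would read per-location statuses from the stored info
    }
}
```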
LocationHealthStatus

This type is an enum which encapsulates the availability status of a region for a physical partition. Below are the four possible availability statuses:

- Healthy: the region has seen only successful requests.
- HealthyWithFailures: the region has started seeing failures but is still within the tolerated threshold. The SDK will still attempt to send requests to such a region for the concerned physical partition.
- Unavailable: a region is put into this status when its failure rate crosses a certain threshold. Once in this status, the SDK will not route requests to this region for the concerned physical partition.
- HealthyTentative: a region is put into this status after it has been in Unavailable status beyond a configured duration. The SDK will route requests to a region in HealthyTentative status. Upon seeing a certain threshold of successes, such a region is promoted to Healthy; upon failures it is demoted back to Unavailable. The failure tolerance for demotion to Unavailable is lower from HealthyTentative than from HealthyWithFailures.

LocationSpecificHealthContext

This class maintains state on the exception count, success count, health status, and since when a particular location has been unavailable for a partition.
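A compact sketch of these two concepts; the field names (and the Instant-based tracking of when a location became unavailable) are illustrative assumptions:

```java
import java.time.Instant;

// The four availability statuses a region can have for a physical partition.
enum LocationHealthStatus {
    Healthy,
    HealthyWithFailures,
    Unavailable,
    HealthyTentative
}

// Per-location health state for one physical partition (field names illustrative).
class LocationSpecificHealthContext {
    int exceptionCountForReads;
    int exceptionCountForWrites;
    int successCount;
    LocationHealthStatus status = LocationHealthStatus.Healthy;
    Instant unavailableSince; // set when the region is demoted to Unavailable
}
```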
LocationSpecificHealthContextTransitionHandler

This class atomically modifies the state of a LocationSpecificHealthContext instance.

ConsecutiveExceptionCountBasedCircuitBreaker

Depending on the success or failure of a request, this class determines how to increment the exception and success counts, and whether the health status of a particular region can be maintained or has to be promoted or demoted.
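A minimal sketch of the consecutive-exception-count rule, reusing the LocationSpecificHealthContext sketch above; the concrete tolerance values are assumptions (the real defaults are configurable):

```java
// Illustrative sketch: consecutive failures trip the breaker, with a lower
// tolerance when retesting a recovering (HealthyTentative) region.
class ConsecutiveExceptionCountBasedCircuitBreaker {

    static final int EXCEPTION_TOLERANCE_FROM_HEALTHY_WITH_FAILURES = 10; // assumed
    static final int EXCEPTION_TOLERANCE_FROM_HEALTHY_TENTATIVE = 5;      // assumed, lower
    static final int SUCCESS_COUNT_NEEDED_FOR_PROMOTION = 5;              // assumed

    // A region is demoted once consecutive failures cross the status-specific tolerance.
    boolean shouldHealthStatusBeDowngraded(LocationSpecificHealthContext ctx) {
        int tolerance = ctx.status == LocationHealthStatus.HealthyTentative
            ? EXCEPTION_TOLERANCE_FROM_HEALTHY_TENTATIVE
            : EXCEPTION_TOLERANCE_FROM_HEALTHY_WITH_FAILURES;
        return ctx.exceptionCountForReads >= tolerance
            || ctx.exceptionCountForWrites >= tolerance;
    }

    // A HealthyTentative region is promoted to Healthy after enough consecutive successes.
    boolean canHealthStatusBeUpgraded(LocationSpecificHealthContext ctx) {
        return ctx.status == LocationHealthStatus.HealthyTentative
            && ctx.successCount >= SUCCESS_COUNT_NEEDED_FOR_PROMOTION;
    }
}
```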
Sequence diagram involving the above classes:

```mermaid
sequenceDiagram
    PartitionLevelLocationUnavailabilityInfo->>LocationSpecificHealthContextTransitionHandler: 1. handleException
    LocationSpecificHealthContextTransitionHandler->>ConsecutiveExceptionBasedCircuitBreaker: 2. handleException
    LocationSpecificHealthContextTransitionHandler->>ConsecutiveExceptionBasedCircuitBreaker: 3. shouldHealthStatusBeDowngraded
    LocationSpecificHealthContextTransitionHandler->>PartitionLevelLocationUnavailabilityInfo: 4. Updated LocationSpecificHealthContext
    PartitionLevelLocationUnavailabilityInfo->>LocationSpecificHealthContextTransitionHandler: 1. handleSuccess
    LocationSpecificHealthContextTransitionHandler->>ConsecutiveExceptionBasedCircuitBreaker: 2. handleSuccess
    LocationSpecificHealthContextTransitionHandler->>ConsecutiveExceptionBasedCircuitBreaker: 3. canHealthStatusBeUpgraded
    LocationSpecificHealthContextTransitionHandler->>PartitionLevelLocationUnavailabilityInfo: 4. Updated LocationSpecificHealthContext
```

Marking a region Unavailable in case consecutive failures are received
```mermaid
flowchart TD
    A[Service] --> |Server generated 503s|CRP[ClientRetryPolicy]
    B[Upstream layer] --> |RequestTimeoutException for write|CRP
    CRP --> GPEMCB[GlobalPartitionEndpointManagerForCircuitBreaker]
    RXCL[RxDocumentClientImpl] --> |OperationCancelledException|GPEMCB
    GPEMCB --> C[handleLocationExceptionForPartitionKeyRange]
    C --> D[Store mapping between pkRange and partitionLevelLocationUnavailabilityInfo if it does not exist already]
    D --> E[Increment read / write specific failure counter for region]
    E --> F{Is failure threshold exceeded for region?}
    F --> |Yes|I{Are other regions available for the partition?}
    I --> |Yes|G[Mark region as Unavailable]
    I --> |No|K[Remove mapping between pkRange and partitionLevelLocationUnavailabilityInfo if mapping exists]
```

Marking a region as Healthy or bookmarking a success response from the region
```mermaid
flowchart TD
    RxDocumentClientImpl --> |handleSuccess|C[GlobalPartitionEndpointManagerForCircuitBreaker]
    C --> D{Is region HealthyTentative?}
    C --> E{Is region HealthyWithFailures?}
    C --> H{Is region Healthy?}
    D --> |Yes|F[Mark region as Healthy and reset exception counters provided thresholds are met]
    E --> |Yes|G[Reset exception counters]
    H --> |Yes|I[Do nothing]
```

Checking if failover is possible at all for the physical partition
```mermaid
flowchart TD
    A[Obtain applicable regions for request from GlobalEndpointManager] --> B{Is there a region which is Healthy / HealthyWithFailures / HealthyTentative or isn't present in PartitionLevelLocationUnavailabilityInfo?}
    B --> |Yes|C[Return true for failover is possible]
    B --> |No|D[Remove health metadata tracking for the partition]
```

Marking a partition as HealthyTentative (so that it is set up to get marked as Healthy)
A background process iterates through the Unavailable regions for tracked partitions and attempts to mark them as HealthyTentative. In direct connectivity mode, the SDK attempts to establish connections to the addresses of a partition in a region which has been in Unavailable status for greater than a certain time window; if this connection attempt succeeds, the partition for that region is marked as HealthyTentative. A hedged sketch of the eligibility check follows the flowchart below.

```mermaid
flowchart TD
    A[Iterate through partition key ranges which have Unavailable status for certain regions] --> B{Has the region been in Unavailable status > unavailability staleness window?}
    B --> |No|F[Let the region be in Unavailable status]
    B --> |Yes|C{Is the client in direct connectivity mode?}
    C --> |Yes|D{Is the connection establishment to addresses in the Unavailable region for partition key range successful?}
    D --> |Yes|E[Mark the region as HealthyTentative]
    D --> |No|F
    C --> |No|E
```
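A hedged sketch of the eligibility check the background thread might apply; the helper name is illustrative, and the staleness window is the configurable duration described in this design:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative: a region qualifies for a recovery attempt only once it has been
// in Unavailable status longer than the unavailability staleness window.
final class RecoveryEligibility {
    static boolean pastStalenessWindow(Instant unavailableSince, Duration stalenessWindow) {
        return Instant.now().isAfter(unavailableSince.plus(stalenessWindow));
    }
}
```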
Design Philosophy

- Let ClientRetryPolicy, LocationCache and GlobalEndpointManager pick the next applicable region (a region which is both available from the client's perspective and not excluded from the request's perspective) for the target physical partition, and let GlobalPartitionEndpointManagerForCircuitBreaker short-circuit accordingly.
- A partition marked Unavailable is marked so for both reads and writes. The idea is to avoid issues around writing to one region and reading from another (cross-regional replication lag). This will lead to RU spikes in the failed-over region.

Config changes needed
- The feature can be enabled, and the failure thresholds used to mark a partition Unavailable for a given region can be tuned (see the hedged sketch below).
- A background process kicks in periodically, checks for partitions which have been in Unavailable status beyond the allowed duration, and tries to promote them to HealthyTentative status. The time interval at which this background process kicks in can also be configured.
- A partition stays in Unavailable status for a default of 30s, after which the background thread attempts to recover it to HealthyTentative status. How long a partition should stay in Unavailable status can also be modified.
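A hedged sketch of these knobs, assuming they are driven by JVM system properties; the property names and JSON shape are assumptions and should be verified against the released SDK:

```java
// Sketch only - property names and JSON shape are assumptions, verify against the SDK.
public final class CircuitBreakerConfigExample {
    public static void main(String[] args) {
        // Enable the feature and tune the consecutive-failure tolerances.
        System.setProperty(
            "COSMOS.PARTITION_LEVEL_CIRCUIT_BREAKER_CONFIG",
            "{\"isPartitionLevelCircuitBreakerEnabled\": true,"
                + " \"circuitBreakerType\": \"CONSECUTIVE_EXCEPTION_COUNT_BASED\","
                + " \"consecutiveExceptionCountToleratedForReads\": 10,"
                + " \"consecutiveExceptionCountToleratedForWrites\": 5}");

        // How often the background recovery process kicks in (seconds).
        System.setProperty("COSMOS.STALE_PARTITION_UNAVAILABILITY_REFRESH_INTERVAL_IN_SECONDS", "60");

        // How long a partition stays Unavailable before a recovery attempt (seconds).
        System.setProperty("COSMOS.ALLOWED_PARTITION_UNAVAILABILITY_DURATION_IN_SECONDS", "30");
    }
}
```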
Diagnostic changes

Perf Benchmark Results
P99 latency comparison between the main-branch clone and the feature branch:

```kusto
Results
| where TIMESTAMP >= ago(1d)
| where BranchName has "MainClone" or BranchName has "PartitionLevelCircuitBreaker"
| summarize max(P99LatencyInMs) by bin(TIMESTAMP, 5m), strcat(BranchName, ":", CommitId)
| render timechart;
```
Success rate comparison between the same branches:

```kusto
Results
| where TIMESTAMP >= ago(1d)
| where BranchName has "MainClone" or BranchName has "PartitionLevelCircuitBreaker"
| summarize max(SuccessRate) by bin(TIMESTAMP, 5m), strcat(BranchName, ":", CommitId)
| render timechart;
```