
Conversation

@jeet1995
Member

@jeet1995 jeet1995 commented Mar 16, 2024

Introduction

A physical partition for a region might see the following issues:

  • A primary replica could be on a node that is unhealthy.
  • A partition could be getting throttled.
  • A partition could be undergoing migrations.
  • A partition could be undergoing upgrades.
  • A partition could be undergoing splits.
  • A replica in a partition might return 404/1002 Read Session Not Available errors because it has not caught up to the requested session token.

Any issue with the primary replica could result in write-availability loss. Even with threshold-based availability strategy and non-idempotent retriable writes enabled, a write may only succeed from the hedged region after the configured threshold duration, which can breach the P99 latency SLA of a downstream application. In certain cases, although rare (since four replicas are eligible to serve reads), reads could get stuck in a retry loop in the local region - e.g., when the replica in the local region's physical partition is lagging and requests to this region keep seeing 404:1002s (in Session consistency scenarios). Such requests can time out if an end-to-end timeout is set for the request.

The partition-level circuit breaker feature strives to improve tail latency and write availability (when a primary replica is down) by bookmarking terminal failures for a particular physical partition in a particular region, short-circuiting requests to that physical partition in any region where these terminal failures have exceeded a certain threshold, and instead forcing the SDK to use other available regions for that physical partition.

DISCLAIMER: This feature is only applicable when using the SDK with multi-write Cosmos DB accounts.
 

Design

Major flows involved

  • Tracking terminal failures for a physical partition within a region.
  • Short circuiting requests to an Unavailable region for a physical partition.
  • Transitioning the state of a region for a physical partition between Healthy, HealthyWithFailures, HealthyTentative and Unavailable.

Tracking terminal failures from a region

  • When 503s or 500s are seen from a region
flowchart TD
    A[Upstream layer] --> |503 bubbled up|B[ClientRetryPolicy]
    B --> |Handle exception for partition key range and region contacted pair|C[GlobalPartitionEndpointManagerForCircuitBreaker]
  • When a 408 / 20008 or an OperationCancelledException is seen for an operation.
flowchart TD
    A[Downstream layer] --> |OperationCancelledException|B[RxDocumentClientImpl]
    B --> |Handle exception for a partition key range and first contacted region|C[GlobalPartitionEndpointManagerForCircuitBreaker]
  • When RequestTimeoutException is seen for write requests.
    • With the non-idempotent write retry policy enabled, transit timeouts for writes are wrapped as GoneException, which helps such requests be retried.
    • When it is not enabled, transit timeouts on writes materialize as RequestTimeoutException - this can also be tracked for circuit-breaking behavior.
flowchart TD
    A[Upstream layer] --> |RequestTimeoutException bubbled up|B[ClientRetryPolicy]
    B --> |Handle exception for partition key range and region contacted pair, provided it is a write operation with non-idempotent write retry policy disabled|C[GlobalPartitionEndpointManagerForCircuitBreaker]
  • When threshold-based availability strategy is enabled
flowchart TD
    A[Downstream layer] --> |CANCEL signal|B[RxDocumentClientImpl]
    B --> |Handle exception for partition key range and first contacted region provided hedged region saw successful response for that partition key range|C[GlobalPartitionEndpointManagerForCircuitBreaker]

Short circuiting a region

  • Short circuiting a region for a point operation
flowchart TD
    A[Point operation] --> B[Extract partition key and partition key definition to determine effective partition key string]
    B --> C[Use partition key range cache to determine partition key range]
    C --> D[Use GlobalPartitionEndpointManagerForCircuitBreaker to extract Unavailable regions for partition key range]
    D --> E[Add them to effective excluded regions for point operation]
    E --> F[Let GlobalEndpointManager determine next region]
  • Short circuiting a region for a query
flowchart TD
    A[Query operation] --> B[In DocumentProducer, apply feed range filtering to an EPK-range scoped request]
    B --> C[Map the resolved partition key range for the feed range]
    C --> D[Use GlobalPartitionEndpointManagerForCircuitBreaker to extract Unavailable regions for partition key range]
    D --> E[Add them to effective excluded regions for EPK-scoped request]
    E --> F[Let GlobalEndpointManager determine next region]
  • Short circuiting a region for a document change-feed operation.
flowchart TD
    A[Change feed operation] --> B[In ChangeFeedQueryImpl, apply feed range filtering to an EPK-range scoped request]
    B --> C[Map the resolved partition key range for the feed range]
    C --> D[Use GlobalPartitionEndpointManagerForCircuitBreaker to extract Unavailable regions for partition key range]
    D --> E[Add them to effective excluded regions for EPK-scoped request]
    E --> F[Let GlobalEndpointManager determine next region]
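All three short-circuiting flows above share the same pattern: resolve the request's target partition key range, ask GlobalPartitionEndpointManagerForCircuitBreaker for the regions currently Unavailable for that range, and merge them into the request's excluded regions before GlobalEndpointManager picks the endpoint. The sketch below illustrates only that merging step with simplified stand-in types; it is not the SDK's internal code.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-ins for SDK-internal types; all names here are illustrative only.
public class ExcludedRegionsSketch {

    record PartitionKeyRange(String collectionRid, String pkRangeId) {}

    // Stand-in for GlobalPartitionEndpointManagerForCircuitBreaker's lookup of
    // Unavailable regions per partition key range.
    static final Map<PartitionKeyRange, Set<String>> unavailableRegionsByRange =
        new ConcurrentHashMap<>();

    // Merge the caller-supplied excluded regions with the regions the circuit
    // breaker has marked Unavailable for this partition key range.
    static List<String> effectiveExcludedRegions(
            PartitionKeyRange resolvedRange, List<String> requestExcludedRegions) {
        Set<String> effective = new HashSet<>(requestExcludedRegions);
        effective.addAll(unavailableRegionsByRange.getOrDefault(resolvedRange, Set.of()));
        return new ArrayList<>(effective);
    }

    public static void main(String[] args) {
        PartitionKeyRange range = new PartitionKeyRange("collRid", "0");
        unavailableRegionsByRange.put(range, Set.of("west us 2"));

        // GlobalEndpointManager would then pick the next preferred region not in this list.
        System.out.println(effectiveExcludedRegions(range, List.of())); // [west us 2]
    }
}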

Classes / Concepts introduced

GlobalPartitionEndpointManagerForCircuitBreaker

This class stores mappings between a PartitionKeyRangeWrapper instance (which encapsulates the physical partition representation along with the collection rid) and physical-partition-specific health metadata. That health metadata is further segregated per location, with each location's health represented by an instance of LocationSpecificHealthContext.

The GlobalPartitionEndpointManagerForCircuitBreaker exposes methods to handle an exception for a partition and region pair, to handle a successful response from a region for a partition, and to identify which regions are unavailable for a partition.
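As a rough illustration of those three responsibilities, the sketch below shows an approximate method surface. Only handleLocationExceptionForPartitionKeyRange is a name taken from the flowcharts later in this description; the other names and signatures are assumptions, not the SDK's actual API.

import java.net.URI;
import java.util.List;

// Illustrative sketch only: approximate shape of the bookkeeping surface described
// above, not the SDK's actual API.
interface PartitionCircuitBreakerSketch {

    // Record a terminal failure (503/500, 408/20008, OperationCancelledException,
    // write RequestTimeoutException, ...) seen for this partition key range in the given region.
    void handleLocationExceptionForPartitionKeyRange(
        PartitionKeyRangeWrapperSketch partitionKeyRangeWrapper, URI regionEndpoint);

    // Record a successful response so failure counters can be reset and recovery can progress.
    void handleLocationSuccessForPartitionKeyRange(
        PartitionKeyRangeWrapperSketch partitionKeyRangeWrapper, URI regionEndpoint);

    // Regions currently short-circuited (Unavailable) for this partition key range.
    List<URI> getUnavailableRegionsForPartitionKeyRange(
        PartitionKeyRangeWrapperSketch partitionKeyRangeWrapper);
}

// Encapsulates the physical partition representation plus the collection rid, as described above.
record PartitionKeyRangeWrapperSketch(String collectionResourceId, String partitionKeyRangeId) {}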

LocationHealthStatus

This type is an enum which encapsulates the availability status of a region for a physical partition. Below are the four possible availability statuses a region can be in for a physical partition (a minimal enum sketch follows the list):

  • Healthy: This status indicates that the region has seen only successful requests.
  • HealthyWithFailures: This status indicates that a region started seeing failures but is still within the threshold. The SDK will still attempt to send requests to such a region for the concerned physical partition.
  • Unavailable: A region is put in such a status when the failure rate crosses a certain threshold. Once the region is put in such a status, the SDK will not route requests to this region for the concerned physical partition.
  • HealthyTentative: A region is put in this status after it has been in Unavailable status for longer than a configured duration. The SDK will route requests to a region in HealthyTentative status. Upon seeing a certain threshold count of successes, such a region is promoted to Healthy status, and upon failure it is demoted back to Unavailable status. The exception tolerance for demotion to Unavailable is lower from HealthyTentative than from HealthyWithFailures.
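A minimal enum sketch of the four statuses above (illustrative only; the SDK's actual type may differ):

// Illustrative sketch of the four availability statuses described above.
public enum LocationHealthStatusSketch {
    // Only successful requests seen for this region / partition.
    Healthy,
    // Failures observed but still under the threshold; requests are still routed here.
    HealthyWithFailures,
    // Failure threshold crossed; requests are short-circuited away from this region.
    Unavailable,
    // Unavailable for longer than the configured duration; requests are routed again,
    // and a run of successes promotes the region back to Healthy.
    HealthyTentative
}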

LocationSpecificHealthContext

This class maintains state on exception counts, success counts, the health status, and the point in time since which a particular location has been unavailable for a partition.
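A minimal, immutable sketch of that state, reusing the LocationHealthStatusSketch enum from the previous sketch. The field names mirror what surfaces in the diagnostics output shown later in this description, but the type itself is illustrative, not the SDK's class.

import java.time.Instant;

// Illustrative, immutable snapshot of the per-location state described above.
record LocationSpecificHealthContextSketch(
    int exceptionCountForWriteForCircuitBreaking,
    int exceptionCountForReadForCircuitBreaking,
    int successCountForWriteForRecovery,
    int successCountForReadForRecovery,
    LocationHealthStatusSketch locationHealthStatus,
    Instant unavailableSince) {}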

LocationSpecificHealthContextTransitionHandler

This class atomically modifies the state of a LocationSpecificHealthContext instance.

ConsecutiveExceptionCountBasedCircuitBreaker

Depending on whether a request succeeded or failed, this class figures out how to increment the success or exception count and whether the health status of a particular region should be maintained, promoted, or demoted.
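The sketch below models that decision logic under a few stated assumptions: the read/write exception tolerances default to the 10/5 values from the configuration section further below, the reduced tolerance while in HealthyTentative is represented as a simple halving, and the success count needed for recovery is a constructor parameter. Only the method names shouldHealthStatusBeDowngraded and canHealthStatusBeUpgraded come from the sequence diagram below; everything else is illustrative.

// Illustrative decision logic only, reusing the LocationHealthStatusSketch enum sketched
// earlier; the halved tolerance in HealthyTentative and the recovery threshold are
// assumptions modeled on the description above, not the SDK's actual constants.
public class ConsecutiveExceptionCountCircuitBreakerSketch {

    private final int exceptionCountToleratedForReads;   // e.g. 10, the config default below
    private final int exceptionCountToleratedForWrites;  // e.g. 5, the config default below
    private final int successCountNeededForRecovery;     // assumed knob

    public ConsecutiveExceptionCountCircuitBreakerSketch(
            int exceptionCountToleratedForReads,
            int exceptionCountToleratedForWrites,
            int successCountNeededForRecovery) {
        this.exceptionCountToleratedForReads = exceptionCountToleratedForReads;
        this.exceptionCountToleratedForWrites = exceptionCountToleratedForWrites;
        this.successCountNeededForRecovery = successCountNeededForRecovery;
    }

    // Should the region be demoted to Unavailable? The tolerance is lower while the
    // region is HealthyTentative than while it is HealthyWithFailures.
    boolean shouldHealthStatusBeDowngraded(
            LocationHealthStatusSketch currentStatus, boolean isReadOnly, int consecutiveExceptions) {
        int tolerated = isReadOnly ? exceptionCountToleratedForReads : exceptionCountToleratedForWrites;
        if (currentStatus == LocationHealthStatusSketch.HealthyTentative) {
            tolerated = Math.max(1, tolerated / 2); // assumed reduction factor
        }
        return consecutiveExceptions >= tolerated;
    }

    // Can a HealthyTentative region be promoted back to Healthy?
    boolean canHealthStatusBeUpgraded(
            LocationHealthStatusSketch currentStatus, int consecutiveSuccesses) {
        return currentStatus == LocationHealthStatusSketch.HealthyTentative
            && consecutiveSuccesses >= successCountNeededForRecovery;
    }
}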

Sequence diagram involving the above classes.

sequenceDiagram
    PartitionLevelLocationUnavailabilityInfo->>LocationSpecificHealthContextTransitionHandler: 1. handleException
    LocationSpecificHealthContextTransitionHandler->>PartitionLevelLocationUnavailabilityInfo: 4. Updated LocationSpecificHealthContext
    PartitionLevelLocationUnavailabilityInfo->>LocationSpecificHealthContextTransitionHandler: 1. handleSuccess
    LocationSpecificHealthContextTransitionHandler->>PartitionLevelLocationUnavailabilityInfo: 4. Updated LocationSpecificHealthContext
    LocationSpecificHealthContextTransitionHandler->>ConsecutiveExceptionCountBasedCircuitBreaker: 2. handleSuccess
    LocationSpecificHealthContextTransitionHandler->>ConsecutiveExceptionCountBasedCircuitBreaker: 3. canHealthStatusBeUpgraded
    LocationSpecificHealthContextTransitionHandler->>ConsecutiveExceptionCountBasedCircuitBreaker: 2. handleException
    LocationSpecificHealthContextTransitionHandler->>ConsecutiveExceptionCountBasedCircuitBreaker: 3. shouldHealthStatusBeDowngraded


Marking a region Unavailable when consecutive failures are received (a sketch follows the flowchart).

flowchart TD
    A[Service] --> |Server generated 503s|CRP[ClientRetryPolicy]
    B[Upstream layer] --> |RequestTimeoutException for write|CRP
    CRP -->GPEMCB[GlobalPartitionEndpointManagerForCircuitBreaker]
    GPEMCB --> C[handleLocationExceptionForPartitionKeyRange]
    RXCL[RxDocumentClientImpl] -->|OperationCancelledException|GPEMCB
    C --> D[Store mapping between pkRange and partitionLevelLocationUnavailabilityInfo if it does not already exist]
    D --> E[Increment read / write specific failure counter for region]
    E --> F{Is failure threshold exceeded for region?}
    F --> |Yes|I{Are other regions available for the partition?}
    I --> |Yes|G[Mark region as Unavailable]
    I --> |No|K[Remove mapping between pkRange and partitionLevelLocationUnavailabilityInfo if the mapping exists]
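A self-contained sketch of the flow charted above, assuming the 10/5 consecutive-failure tolerances from the configuration section below; all class and method names are simplified stand-ins for the SDK internals. Note the guard at the end: a region is only marked Unavailable if at least one other region can still serve the partition, which is the same check that appears in the failover-possibility flow further below.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Self-contained, illustrative sketch of the flow charted above; all names are
// simplified stand-ins, not the actual implementation.
public class MarkUnavailableSketch {

    enum Status { Healthy, HealthyWithFailures, Unavailable }

    // Consecutive-failure tolerances, mirroring the defaults mentioned below (10 reads / 5 writes).
    static final int READ_THRESHOLD = 10;
    static final int WRITE_THRESHOLD = 5;

    // Per-partition tracking of region status and consecutive failures.
    static class PartitionHealth {
        final Map<String, Status> statusByRegion = new ConcurrentHashMap<>();
        final Map<String, AtomicInteger> failuresByRegion = new ConcurrentHashMap<>();
    }

    static final Map<String, PartitionHealth> healthByPkRange = new ConcurrentHashMap<>();

    static void handleLocationException(String pkRangeId, String region,
                                        Set<String> accountRegions, boolean isRead) {
        // 1. Create the per-partition tracking entry lazily.
        PartitionHealth health = healthByPkRange.computeIfAbsent(pkRangeId, k -> new PartitionHealth());

        // 2. Increment the read/write-specific consecutive failure counter for this region.
        int failures = health.failuresByRegion
            .computeIfAbsent(region, r -> new AtomicInteger()).incrementAndGet();
        health.statusByRegion.put(region, Status.HealthyWithFailures);

        // 3. Only short-circuit if the threshold is exceeded AND some other region can
        //    still serve the partition; otherwise drop the tracking entry so the
        //    partition is never left with zero routable regions.
        if (failures >= (isRead ? READ_THRESHOLD : WRITE_THRESHOLD)) {
            boolean otherRegionAvailable = accountRegions.stream()
                .anyMatch(r -> !r.equals(region)
                    && health.statusByRegion.getOrDefault(r, Status.Healthy) != Status.Unavailable);
            if (otherRegionAvailable) {
                health.statusByRegion.put(region, Status.Unavailable);
            } else {
                healthByPkRange.remove(pkRangeId);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> regions = Set.of("east us", "west us 2");
        for (int i = 0; i < WRITE_THRESHOLD; i++) {
            handleLocationException("0", "west us 2", regions, false);
        }
        System.out.println(healthByPkRange.get("0").statusByRegion); // {west us 2=Unavailable}
    }
}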

Marking a region as Healthy or bookmarking a successful response from the region.

flowchart TD
    RxDocumentClientImpl -->|handleSuccess|C[GlobalPartitionEndpointManagerForCircuitBreaker]
    C -->D{Is region HealthyTentative?}
    C -->E{Is region HealthyWithFailures?}
    C -->H{Is region Healthy?}
    D -->|Yes|F[Mark region as Healthy and reset exception counters provided thresholds are met]
    E -->|Yes|G[Reset exception counters]
    H --> |Yes|I[Do nothing]

Checking if failover is possible at all for the physical partition

flowchart TD
    A[Obtain applicable regions for request from GlobalEndpointManager] --> B{Is there a region which is Healthy / HealthyWithFailures / HealthyTentative or isn’t present in PartitionLevelLocationUnavailabilityInfo?}
    B --> |Yes|C[Return true for failover is possible]
    B --> |No|D[Remove health metadata tracking for the partition]

Marking a partition as HealthyTentative (so that it is set up to get marked as Healthy)

  • A background thread runs through partitions which have Unavailable regions and attempts to mark them as HealthyTentative.
  • For a partition whose region has been in Unavailable status for longer than a certain time window, a query which selects 1 item is first run against that partition. If this query succeeds, the region is marked as HealthyTentative for that partition.
flowchart TD
    A[Iterate through partition key ranges which have Unavailable status for certain regions] --> B{Has the region been in Unavailable status > unavailability staleness window} 
    B --> |Yes|C{Is the client in direct connectivity mode?}
    C --> |Yes|D{Is the connection establishment to addresses in the Unavailable region for partition key range successful?}
    C --> |No|E
    D --> |Yes|E[Mark the region as HealthyTentative]
    B --> |No|F[Let the region be in Unavailable status]
    D --> |No|F
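A sketch of that background recovery loop is shown below, with the direct-mode connectivity probe abstracted into a predicate. The interval and staleness window correspond to the configurable values described in the configuration section further below; all names are illustrative rather than the SDK's actual implementation.

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiPredicate;

// Illustrative sketch of the background recovery flow charted above.
public class StalePartitionRecoverySketch {

    record RegionKey(String pkRangeId, String region) {}

    // pkRange+region -> time at which the region was marked Unavailable for that partition.
    static final Map<RegionKey, Instant> unavailableSince = new ConcurrentHashMap<>();
    // pkRange+region -> true once promoted to HealthyTentative.
    static final Map<RegionKey, Boolean> healthyTentative = new ConcurrentHashMap<>();

    static void startRecoveryLoop(Duration refreshInterval,
                                  Duration allowedUnavailabilityWindow,
                                  boolean isDirectMode,
                                  BiPredicate<String, String> canConnectToPartitionInRegion) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            Instant now = Instant.now();
            unavailableSince.forEach((key, since) -> {
                // Only consider regions that have been Unavailable longer than the window.
                if (Duration.between(since, now).compareTo(allowedUnavailabilityWindow) < 0) {
                    return;
                }
                // In direct mode, additionally require that connections to the partition's
                // replica addresses in that region can be established.
                boolean probeOk = !isDirectMode
                    || canConnectToPartitionInRegion.test(key.pkRangeId(), key.region());
                if (probeOk) {
                    healthyTentative.put(key, Boolean.TRUE);
                    unavailableSince.remove(key);
                }
            });
        }, refreshInterval.toSeconds(), refreshInterval.toSeconds(), TimeUnit.SECONDS);
    }
}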

Design Philosophy

  • A request targeted at a region is short-circuited before being routed to an available region. The design philosophy is to let ClientRetryPolicy, LocationCache and GlobalEndpointManager pick the next applicable region (a region which is both available from the client's perspective and not excluded from the request's perspective) for the target physical partition, and to let GlobalPartitionEndpointManagerForCircuitBreaker short-circuit accordingly.
  • A region marked as Unavailable is marked so for both reads and writes. The idea is to avoid issues around writing to one region and reading from another (cross-regional replication lag). This will lead to RU spikes in the failed-over region.

Config changes needed

  • In the first iteration of the partition-level circuit breaker, exception thresholds are maintained as an integral value, and a consecutive count of exceptions or successes is needed to demote or promote the health status of a partition.
  • The application developer can specify exception thresholds for a partition to be marked as Unavailable for a given region in the following manner:
System.setProperty(
    "COSMOS.PARTITION_LEVEL_CIRCUIT_BREAKER_CONFIG",
    "{\"isPartitionLevelCircuitBreakerEnabled\": true, "
        + "\"circuitBreakerType\": \"CONSECUTIVE_EXCEPTION_COUNT_BASED\","
        + "\"consecutiveExceptionCountToleratedForReads\": 10,"
        + "\"consecutiveExceptionCountToleratedForWrites\": 5"
        + "}");
  • A background process runs every 1 minute to check for regions which have been in Unavailable status for a partition beyond the allowed duration and tries to promote them to HealthyTentative status. The interval at which this background process kicks in can be configured as below:
System.setProperty("COSMOS.STALE_PARTITION_UNAVAILABILITY_REFRESH_INTERVAL_IN_SECONDS", "60");
  • A partition stays in Unavailable status for a region for a default of 30s, after which the background thread attempts to recover it to HealthyTentative status. How long a partition should remain Unavailable can be modified as below:
System.setProperty("COSMOS.ALLOWED_PARTITION_UNAVAILABILITY_DURATION_IN_SECONDS", "30");
  • DISCLAIMER: The above settings act as lower bounds - setting anything lower resets the value to the default above. This avoids extremely aggressive thresholds for circuit breaking or recovering a partition, and guards against settings for which negative values would not make sense in this context - such as threshold counts and durations.
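For illustration, the clamping rule described in the disclaimer could look like the following hypothetical helper; the property name is real, while the helper itself and its use of the 60-second default are assumptions, not the SDK's actual parsing code.

// Hypothetical illustration of the lower-bound rule described above: values below
// the documented default are reset to that default.
public class ConfigLowerBoundSketch {

    static int clampToDefault(String propertyName, int defaultValue) {
        String raw = System.getProperty(propertyName);
        int requested = raw == null ? defaultValue : Integer.parseInt(raw);
        return Math.max(requested, defaultValue);
    }

    public static void main(String[] args) {
        System.setProperty("COSMOS.STALE_PARTITION_UNAVAILABILITY_REFRESH_INTERVAL_IN_SECONDS", "10");
        // A request for 10 s is clamped back up to the 60 s default.
        System.out.println(clampToDefault(
            "COSMOS.STALE_PARTITION_UNAVAILABILITY_REFRESH_INTERVAL_IN_SECONDS", 60)); // 60
    }
}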

Diagnostic changes

  • Changes were made to include the circuit-breaking thresholds and the partition's per-region health status in the request diagnostics:
{
    "userAgent": "azsdk-java-cosmos/4.63.0-beta.1 Windows11/10.0 JRE/18.0.2.1",
    "activityId": "a1ed2393-0308-4ed1-97d4-c75ace0ef075",
    "requestLatencyInMs": 28,
    "requestStartTimeUTC": "2024-07-05T16:23:30.557367900Z",
    "requestEndTimeUTC": "2024-07-05T16:23:30.585384500Z",
    "responseStatisticsList": [],
    "supplementalResponseStatisticsList": [],
    "addressResolutionStatistics": {},
    "regionsContacted": [
        "south central us"
    ],
    "retryContext": {
        "statusAndSubStatusCodes": null,
        "retryCount": 0,
        "retryLatency": 0
    },
    "metadataDiagnosticsContext": {
        "metadataDiagnosticList": null
    },
    "serializationDiagnosticsContext": {
        "serializationDiagnosticsList": null
    },
    "gatewayStatisticsList": [
        {
            "sessionToken": "0:0#2#1=-1#5=-1",
            "operationType": "Read",
            "resourceType": "Document",
            "statusCode": 200,
            "subStatusCode": 0,
            "requestCharge": 1.0,
            "requestTimeline": [
                {
                    "eventName": "connectionAcquired",
                    "startTimeUTC": "2024-07-05T16:23:30.558367900Z",
                    "durationInMilliSecs": 0.0
                },
                {
                    "eventName": "connectionConfigured",
                    "startTimeUTC": "2024-07-05T16:23:30.558367900Z",
                    "durationInMilliSecs": 1.0021
                },
                {
                    "eventName": "requestSent",
                    "startTimeUTC": "2024-07-05T16:23:30.559370Z",
                    "durationInMilliSecs": 0.0
                },
                {
                    "eventName": "transitTime",
                    "startTimeUTC": "2024-07-05T16:23:30.559370Z",
                    "durationInMilliSecs": 26.0145
                },
                {
                    "eventName": "received",
                    "startTimeUTC": "2024-07-05T16:23:30.585384500Z",
                    "durationInMilliSecs": 0.0
                }
            ],
            "partitionKeyRangeId": "0",
            "responsePayloadSizeInBytes": 369,
            "faultInjectionEvaluationResults": [
                "service-unavailable-rule-c3a95e10-b4e3-4c4b-80c7-686f848399d6 [RegionEndpoint mismatch: Expected [[https://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/]], Actual [https://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/]]"
            ],
            "locationToLocationSpecificHealthContext": {
                "v": {
                    "east us": {
                        "exceptionCountForWriteForCircuitBreaking": 0,
                        "exceptionCountForReadForCircuitBreaking": 0,
                        "successCountForWriteForRecovery": 0,
                        "successCountForReadForRecovery": 0,
                        "locationHealthStatus": "Healthy",
                        "unavailableSince": "+1000000000-12-31T23:59:59.999999999Z"
                    },
                    "west us 2": {
                        "exceptionCountForWriteForCircuitBreaking": 0,
                        "exceptionCountForReadForCircuitBreaking": 0,
                        "successCountForWriteForRecovery": 0,
                        "successCountForReadForRecovery": 0,
                        "locationHealthStatus": "Unavailable",
                        "unavailableSince": "2024-07-05T16:23:30.280312700Z"
                    },
                    "south central us": {
                        "exceptionCountForWriteForCircuitBreaking": 0,
                        "exceptionCountForReadForCircuitBreaking": 0,
                        "successCountForWriteForRecovery": 0,
                        "successCountForReadForRecovery": 0,
                        "locationHealthStatus": "Healthy",
                        "unavailableSince": "+1000000000-12-31T23:59:59.999999999Z"
                    }
                }
            }
        }
    ],
    "samplingRateSnapshot": 1.0,
    "bloomFilterInsertionCountSnapshot": 0,
    "systemInformation": {
        "usedMemory": "60457 KB",
        "availableMemory": "4133847 KB",
        "systemCpuLoad": "(2024-07-05T16:23:12.550066Z 5.8%), (2024-07-05T16:23:17.545800700Z 11.3%), (2024-07-05T16:23:20.900012800Z 6.7%), (2024-07-05T16:23:25.914797Z 4.6%), (2024-07-05T16:23:26.812925600Z 2.8%), (2024-07-05T16:23:28.280351400Z 5.2%)",
        "availableProcessors": 8
    },
    "clientCfgs": {
        "id": 4,
        "machineId": "uuid:4eccf1d0-db74-44c7-bdf6-71e627cf0dc8",
        "connectionMode": "GATEWAY",
        "numberOfClients": 1,
        "excrgns": "[]",
        "clientEndpoints": {
            "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx": 4
        },
        "connCfg": {
            "rntbd": null,
            "gw": "(cps:1000, nrto:PT1M, icto:PT1M, p:false)",
            "other": "(ed: true, cs: false, rv: true)"
        },
        "consistencyCfg": "(consistency: Session, mm: true, prgns: [westus2,southcentralus,eastus])",
        "proactiveInitCfg": "",
        "e2ePolicyCfg": "",
        "sessionRetryCfg": "",
        "partitionLevelCircuitBreakerCfg": "(cb: true, type: CONSECUTIVE_EXCEPTION_COUNT_BASED, rexcntt: 10, wexcntt: 5)"
    }
}

Perf Benchmark Results

  • The below results are for a workload of 5 million reads on a collection with 2 physical partitions and an operation concurrency of 40.
  • There is a 1%-2% regression for both P99 latency and success rate (throughput).

Results
| where TIMESTAMP >= ago(1d)
| where BranchName has "MainClone" or BranchName has "PartitionLevelCircuitBreaker"
| summarize max(P99LatencyInMs) by bin(TIMESTAMP, 5m), strcat(BranchName, ":", CommitId)
| render timechart;

[Timechart: max P99LatencyInMs per 5-minute bin, split by branch and commit]

Results
| where TIMESTAMP >= ago(1d)
| where BranchName has "MainClone" or BranchName has "PartitionLevelCircuitBreaker"
| summarize max(SuccessRate) by bin(TIMESTAMP, 5m), strcat(BranchName, ":", CommitId)
| render timechart;

[Timechart: max SuccessRate per 5-minute bin, split by branch and commit]

@azure-sdk
Collaborator

API change check

APIView has identified API level changes in this PR and created following API reviews.

com.azure:azure-cosmos

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 added 4 commits July 17, 2024 12:15
…rtitionLevelCircuitBreaker

# Conflicts:
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/TestSuiteBase.java
@azure-sdk
Collaborator

API change check

APIView has identified API level changes in this PR and created following API reviews.

com.azure:azure-cosmos

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@moarychan
Member

Fixed Spring cosmos test pipeline failure #41227

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
