
Conversation

@jeet1995
Member

@jeet1995 jeet1995 commented Mar 16, 2024

Introduction

A physical partition for a region might see the following issues:

  • A primary replica could be on a node that is unhealthy.
  • A partition could be getting throttled.
  • A partition could be undergoing migrations.
  • A partition could be undergoing upgrades.
  • A partition could be undergoing splits.
  • A replica in a partition might return 404/1002 Read Session Not Available errors because it has not caught up to the requested session token.

Any issue with the primary replica could result in write-availability loss. Even with threshold-based availability strategy and non-idempotent retriable writes enabled, a write may only succeed from the hedged region after the configured threshold duration, which can breach the P99 latency SLA of a downstream application. In certain cases, although rare (since four replicas are eligible to serve reads), reads could get stuck in a retry loop in the local region - e.g., when the replica in the local region's physical partition is lagging and requests to this region keep seeing 404:1002s (in Session consistency scenarios). Such requests can time out if an end-to-end timeout is set for the request.

The partition-level circuit breaker feature strives to improve tail latency and write availability (when a primary replica is down) by bookmarking terminal failures for a particular physical partition in a particular region, short-circuiting requests to that physical partition in any region where these terminal failures have exceeded a certain threshold, and instead forcing the SDK to use other available regions for that physical partition.

DISCLAIMER: This feature is only applicable when using the SDK with multi-write Cosmos DB accounts.
 

Design

Major flows involved

  • Tracking terminal failures for a physical partition within a region.
  • Short circuiting requests to an Unavailable region for a physical partition.
  • Transitioning the state of a region for a physical partition between Healthy, HealthyWithFailures, HealthyTentative and Unavailable.

Tracking terminal failures from a region

  • When 503s or 500s are seen from a region
flowchart TD
    A[Upstream layer] --> |503 bubbled up|B[ClientRetryPolicy]
    B --> |Handle exception for partition key range and region contacted pair|C[GlobalPartitionEndpointManagerForCircuitBreaker]
  • When a 408 / 20008 or an OperationCancelledException is seen for an operation.
flowchart TD
    A[Downstream layer] --> |OperationCancelledException|B[RxDocumentClientImpl]
    B --> |Handle exception for a partition key range and first contacted region|C[GlobalPartitionEndpointManagerForCircuitBreaker]
  • When RequestTimeoutException is seen for write requests.
    • With the non-idempotent write retry policy enabled, transit timeouts for writes are wrapped as GoneException, which helps such requests be retried.
    • When it is not enabled, transit timeouts on writes materialize as RequestTimeoutException - this can also be tracked for circuit-breaking behavior.
flowchart TD
    A[Upstream layer] --> |RequestTimeoutException bubbled up|B[ClientRetryPolicy]
    B --> |Handle exception for partition key range and region contacted pair, provided it is a write operation with non-idempotent write retry policy disabled|C[GlobalPartitionEndpointManagerForCircuitBreaker]
  • When threshold-based availability strategy is enabled
flowchart TD
    A[Downstream layer] --> |CANCEL signal|B[RxDocumentClientImpl]
    B --> |Handle exception for partition key range and first contacted region provided hedged region saw successful response for that partition key range|C[GlobalPartitionEndpointManagerForCircuitBreaker]

Short circuiting a region

  • Short circuiting a region for a point operation
flowchart TD
    A[Point operation] --> B[Extract partition key and partition key definition to determine effective partition key string]
    B --> C[Use partition key range cache to determine partition key range]
    C --> D[Use GlobalPartitionEndpointManagerForCircuitBreaker to extract Unavailable regions for partition key range]
    D --> E[Add them to effective excluded regions for point operation]
    E --> F[Let GlobalEndpointManager determine next region]
  • Short circuiting a region for a query
flowchart TD
    A[Query operation] --> B[In DocumentProducer, apply feed range filtering to an EPK-range scoped request]
    B --> C[Map the resolved partition key range for the feed range]
    C --> D[Use GlobalPartitionEndpointManagerForCircuitBreaker to extract Unavailable regions for partition key range]
    D --> E[Add them to effective excluded regions for EPK-scoped request]
    E --> F[Let GlobalEndpointManager determine next region]
  • Short circuiting a region for a document change-feed operation.
flowchart TD
    A[Change feed operation] --> B[In ChangeFeedQueryImpl, apply feed range filtering to an EPK-range scoped request]
    B --> C[Map the resolved partition key range for the feed range]
    C --> D[Use GlobalPartitionEndpointManagerForCircuitBreaker to extract Unavailable regions for partition key range]
    D --> E[Add them to effective excluded regions for EPK-scoped request]
    E --> F[Let GlobalEndpointManager determine next region]
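All three short-circuiting flows above share the same pattern: resolve the request's target partition key range, ask GlobalPartitionEndpointManagerForCircuitBreaker for the regions currently Unavailable for that range, and merge them into the request's excluded regions before GlobalEndpointManager picks the endpoint. The sketch below illustrates only that merging step with simplified stand-in types; it is not the SDK's internal code.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-ins for SDK-internal types; all names here are illustrative only.
public class ExcludedRegionsSketch {

    record PartitionKeyRange(String collectionRid, String pkRangeId) {}

    // Stand-in for GlobalPartitionEndpointManagerForCircuitBreaker's lookup of
    // Unavailable regions per partition key range.
    static final Map<PartitionKeyRange, Set<String>> unavailableRegionsByRange =
        new ConcurrentHashMap<>();

    // Merge the caller-supplied excluded regions with the regions the circuit
    // breaker has marked Unavailable for this partition key range.
    static List<String> effectiveExcludedRegions(
            PartitionKeyRange resolvedRange, List<String> requestExcludedRegions) {
        Set<String> effective = new HashSet<>(requestExcludedRegions);
        effective.addAll(unavailableRegionsByRange.getOrDefault(resolvedRange, Set.of()));
        return new ArrayList<>(effective);
    }

    public static void main(String[] args) {
        PartitionKeyRange range = new PartitionKeyRange("collRid", "0");
        unavailableRegionsByRange.put(range, Set.of("west us 2"));

        // GlobalEndpointManager would then pick the next preferred region not in this list.
        System.out.println(effectiveExcludedRegions(range, List.of())); // [west us 2]
    }
}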

Classes / Concepts introduced

GlobalPartitionEndpointManagerForCircuitBreaker

This class stores mappings between a PartitionKeyRangeWrapper instance (which encapsulates the physical partition representation along with the collection rid) and physical-partition-specific health metadata. That health metadata is further segregated per location, with each location's health represented by an instance of LocationSpecificHealthContext.

The GlobalPartitionEndpointManagerForCircuitBreaker exposes methods to handle an exception for a partition and region pair, to handle a successful response from a region for a partition, and to identify which regions are unavailable for a partition.
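As a rough illustration of those three responsibilities, the sketch below shows an approximate method surface. Only handleLocationExceptionForPartitionKeyRange is a name taken from the flowcharts later in this description; the other names and signatures are assumptions, not the SDK's actual API.

import java.net.URI;
import java.util.List;

// Illustrative sketch only: approximate shape of the bookkeeping surface described
// above, not the SDK's actual API.
interface PartitionCircuitBreakerSketch {

    // Record a terminal failure (503/500, 408/20008, OperationCancelledException,
    // write RequestTimeoutException, ...) seen for this partition key range in the given region.
    void handleLocationExceptionForPartitionKeyRange(
        PartitionKeyRangeWrapperSketch partitionKeyRangeWrapper, URI regionEndpoint);

    // Record a successful response so failure counters can be reset and recovery can progress.
    void handleLocationSuccessForPartitionKeyRange(
        PartitionKeyRangeWrapperSketch partitionKeyRangeWrapper, URI regionEndpoint);

    // Regions currently short-circuited (Unavailable) for this partition key range.
    List<URI> getUnavailableRegionsForPartitionKeyRange(
        PartitionKeyRangeWrapperSketch partitionKeyRangeWrapper);
}

// Encapsulates the physical partition representation plus the collection rid, as described above.
record PartitionKeyRangeWrapperSketch(String collectionResourceId, String partitionKeyRangeId) {}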

LocationHealthStatus

This type is an enum which encapsulates the availability status of a region for a physical partition. Below are the four possible availability statuses a region can be in for a physical partition (a minimal enum sketch follows the list):

  • Healthy: This status indicates that the region has seen only successful requests.
  • HealthyWithFailures: This status indicates that a region started seeing failures but is still within the threshold. The SDK will still attempt to send requests to such a region for the concerned physical partition.
  • Unavailable: A region is put in such a status when the failure rate crosses a certain threshold. Once the region is put in such a status, the SDK will not route requests to this region for the concerned physical partition.
  • HealthyTentative: A region is put in this status after it has been in Unavailable status for longer than a configured duration. The SDK will route requests to a region in HealthyTentative status. Upon seeing a certain threshold count of successes, such a region is promoted to Healthy status, and upon failure it is demoted back to Unavailable status. The exception tolerance for demotion to Unavailable is lower from HealthyTentative than from HealthyWithFailures.
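A minimal enum sketch of the four statuses above (illustrative only; the SDK's actual type may differ):

// Illustrative sketch of the four availability statuses described above.
public enum LocationHealthStatusSketch {
    // Only successful requests seen for this region / partition.
    Healthy,
    // Failures observed but still under the threshold; requests are still routed here.
    HealthyWithFailures,
    // Failure threshold crossed; requests are short-circuited away from this region.
    Unavailable,
    // Unavailable for longer than the configured duration; requests are routed again,
    // and a run of successes promotes the region back to Healthy.
    HealthyTentative
}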

LocationSpecificHealthContext

This class maintains state on exception counts, success counts, the health status, and the point in time since which a particular location has been unavailable for a partition.
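A minimal, immutable sketch of that state, reusing the LocationHealthStatusSketch enum from the previous sketch. The field names mirror what surfaces in the diagnostics output shown later in this description, but the type itself is illustrative, not the SDK's class.

import java.time.Instant;

// Illustrative, immutable snapshot of the per-location state described above.
record LocationSpecificHealthContextSketch(
    int exceptionCountForWriteForCircuitBreaking,
    int exceptionCountForReadForCircuitBreaking,
    int successCountForWriteForRecovery,
    int successCountForReadForRecovery,
    LocationHealthStatusSketch locationHealthStatus,
    Instant unavailableSince) {}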

LocationSpecificHealthContextTransitionHandler

This class atomically modifies the state of a LocationSpecificHealthContext instance.

ConsecutiveExceptionCountBasedCircuitBreaker

Depending on whether a request succeeded or failed, this class figures out how to increment the success or exception count and whether the health status of a particular region should be maintained, promoted, or demoted.
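The sketch below models that decision logic under a few stated assumptions: the read/write exception tolerances default to the 10/5 values from the configuration section further below, the reduced tolerance while in HealthyTentative is represented as a simple halving, and the success count needed for recovery is a constructor parameter. Only the method names shouldHealthStatusBeDowngraded and canHealthStatusBeUpgraded come from the sequence diagram below; everything else is illustrative.

// Illustrative decision logic only, reusing the LocationHealthStatusSketch enum sketched
// earlier; the halved tolerance in HealthyTentative and the recovery threshold are
// assumptions modeled on the description above, not the SDK's actual constants.
public class ConsecutiveExceptionCountCircuitBreakerSketch {

    private final int exceptionCountToleratedForReads;   // e.g. 10, the config default below
    private final int exceptionCountToleratedForWrites;  // e.g. 5, the config default below
    private final int successCountNeededForRecovery;     // assumed knob

    public ConsecutiveExceptionCountCircuitBreakerSketch(
            int exceptionCountToleratedForReads,
            int exceptionCountToleratedForWrites,
            int successCountNeededForRecovery) {
        this.exceptionCountToleratedForReads = exceptionCountToleratedForReads;
        this.exceptionCountToleratedForWrites = exceptionCountToleratedForWrites;
        this.successCountNeededForRecovery = successCountNeededForRecovery;
    }

    // Should the region be demoted to Unavailable? The tolerance is lower while the
    // region is HealthyTentative than while it is HealthyWithFailures.
    boolean shouldHealthStatusBeDowngraded(
            LocationHealthStatusSketch currentStatus, boolean isReadOnly, int consecutiveExceptions) {
        int tolerated = isReadOnly ? exceptionCountToleratedForReads : exceptionCountToleratedForWrites;
        if (currentStatus == LocationHealthStatusSketch.HealthyTentative) {
            tolerated = Math.max(1, tolerated / 2); // assumed reduction factor
        }
        return consecutiveExceptions >= tolerated;
    }

    // Can a HealthyTentative region be promoted back to Healthy?
    boolean canHealthStatusBeUpgraded(
            LocationHealthStatusSketch currentStatus, int consecutiveSuccesses) {
        return currentStatus == LocationHealthStatusSketch.HealthyTentative
            && consecutiveSuccesses >= successCountNeededForRecovery;
    }
}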

Sequence diagram involving the above classes.

sequenceDiagram
    PartitionLevelLocationUnavailabilityInfo->>LocationSpecificHealthContextTransitionHandler: 1. handleException
    LocationSpecificHealthContextTransitionHandler->>PartitionLevelLocationUnavailabilityInfo: 4. Updated LocationSpecificHealthContext
    PartitionLevelLocationUnavailabilityInfo->>LocationSpecificHealthContextTransitionHandler: 1. handleSuccess
    LocationSpecificHealthContextTransitionHandler->>PartitionLevelLocationUnavailabilityInfo: 4. Updated LocationSpecificHealthContext
    LocationSpecificHealthContextTransitionHandler->>ConsecutiveExceptionCountBasedCircuitBreaker: 2. handleSuccess
    LocationSpecificHealthContextTransitionHandler->>ConsecutiveExceptionCountBasedCircuitBreaker: 3. canHealthStatusBeUpgraded
    LocationSpecificHealthContextTransitionHandler->>ConsecutiveExceptionCountBasedCircuitBreaker: 2. handleException
    LocationSpecificHealthContextTransitionHandler->>ConsecutiveExceptionCountBasedCircuitBreaker: 3. shouldHealthStatusBeDowngraded


Marking a region Unavailable when consecutive failures are received (a sketch follows the flowchart).

flowchart TD
    A[Service] --> |Server generated 503s|CRP[ClientRetryPolicy]
    B[Upstream layer] --> |RequestTimeoutException for write|CRP
    CRP -->GPEMCB[GlobalPartitionEndpointManagerForCircuitBreaker]
    GPEMCB --> C[handleLocationExceptionForPartitionKeyRange]
    RXCL[RxDocumentClientImpl] -->|OperationCancelledException|GPEMCB
    C --> D[Store mapping between pkRange and partitionLevelLocationUnavailabilityInfo if it does not already exist]
    D --> E[Increment read / write specific failure counter for region]
    E --> F{Is failure threshold exceeded for region?}
    F --> |Yes|I{Are other regions available for the partition?}
    I --> |Yes|G[Mark region as Unavailable]
    I --> |No|K[Remove mapping between pkRange and partitionLevelLocationUnavailabilityInfo if the mapping exists]
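A self-contained sketch of the flow charted above, assuming the 10/5 consecutive-failure tolerances from the configuration section below; all class and method names are simplified stand-ins for the SDK internals. Note the guard at the end: a region is only marked Unavailable if at least one other region can still serve the partition, which is the same check that appears in the failover-possibility flow further below.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Self-contained, illustrative sketch of the flow charted above; all names are
// simplified stand-ins, not the actual implementation.
public class MarkUnavailableSketch {

    enum Status { Healthy, HealthyWithFailures, Unavailable }

    // Consecutive-failure tolerances, mirroring the defaults mentioned below (10 reads / 5 writes).
    static final int READ_THRESHOLD = 10;
    static final int WRITE_THRESHOLD = 5;

    // Per-partition tracking of region status and consecutive failures.
    static class PartitionHealth {
        final Map<String, Status> statusByRegion = new ConcurrentHashMap<>();
        final Map<String, AtomicInteger> failuresByRegion = new ConcurrentHashMap<>();
    }

    static final Map<String, PartitionHealth> healthByPkRange = new ConcurrentHashMap<>();

    static void handleLocationException(String pkRangeId, String region,
                                        Set<String> accountRegions, boolean isRead) {
        // 1. Create the per-partition tracking entry lazily.
        PartitionHealth health = healthByPkRange.computeIfAbsent(pkRangeId, k -> new PartitionHealth());

        // 2. Increment the read/write-specific consecutive failure counter for this region.
        int failures = health.failuresByRegion
            .computeIfAbsent(region, r -> new AtomicInteger()).incrementAndGet();
        health.statusByRegion.put(region, Status.HealthyWithFailures);

        // 3. Only short-circuit if the threshold is exceeded AND some other region can
        //    still serve the partition; otherwise drop the tracking entry so the
        //    partition is never left with zero routable regions.
        if (failures >= (isRead ? READ_THRESHOLD : WRITE_THRESHOLD)) {
            boolean otherRegionAvailable = accountRegions.stream()
                .anyMatch(r -> !r.equals(region)
                    && health.statusByRegion.getOrDefault(r, Status.Healthy) != Status.Unavailable);
            if (otherRegionAvailable) {
                health.statusByRegion.put(region, Status.Unavailable);
            } else {
                healthByPkRange.remove(pkRangeId);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> regions = Set.of("east us", "west us 2");
        for (int i = 0; i < WRITE_THRESHOLD; i++) {
            handleLocationException("0", "west us 2", regions, false);
        }
        System.out.println(healthByPkRange.get("0").statusByRegion); // {west us 2=Unavailable}
    }
}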

Marking a region as Healthy or bookmarking a successful response from the region.

flowchart TD
    RxDocumentClientImpl -->|handleSuccess|C[GlobalPartitionEndpointManagerForCircuitBreaker]
    C -->D{Is region HealthyTentative?}
    C -->E{Is region HealthyWithFailures?}
    C -->H{Is region Healthy?}
    D -->|Yes|F[Mark region as Healthy and reset exception counters provided thresholds are met]
    E -->|Yes|G[Reset exception counters]
    H --> |Yes|I[Do nothing]

Checking if failover is possible at all for the physical partition

flowchart TD
    A[Obtain applicable regions for request from GlobalEndpointManager] --> B{Is there a region which is Healthy / HealthyWithFailures / HealthyTentative or isn’t present in PartitionLevelLocationUnavailabilityInfo?}
    B --> |Yes|C[Return true for failover is possible]
    B --> |No|D[Remove health metadata tracking for the partition]

Marking a partition as HealthyTentative (so that it is set up to get marked as Healthy)

  • A background thread runs through partitions which have Unavailable regions and attempts to mark them as HealthyTentative.
  • For a partition whose region has been in Unavailable status for longer than a certain time window, a query which selects 1 item is first run against that partition. If this query succeeds, the region is marked as HealthyTentative for that partition.
flowchart TD
    A[Iterate through partition key ranges which have Unavailable status for certain regions] --> B{Has the region been in Unavailable status > unavailability staleness window} 
    B --> |Yes|C{Is the client in direct connectivity mode?}
    C --> |Yes|D{Is the connection establishment to addresses in the Unavailable region for partition key range successful?}
    C --> |No|E
    D --> |Yes|E[Mark the region as HealthyTentative]
    B --> |No|F[Let the region be in Unavailable status]
    D --> |No|F
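A sketch of that background recovery loop is shown below, with the direct-mode connectivity probe abstracted into a predicate. The interval and staleness window correspond to the configurable values described in the configuration section further below; all names are illustrative rather than the SDK's actual implementation.

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiPredicate;

// Illustrative sketch of the background recovery flow charted above.
public class StalePartitionRecoverySketch {

    record RegionKey(String pkRangeId, String region) {}

    // pkRange+region -> time at which the region was marked Unavailable for that partition.
    static final Map<RegionKey, Instant> unavailableSince = new ConcurrentHashMap<>();
    // pkRange+region -> true once promoted to HealthyTentative.
    static final Map<RegionKey, Boolean> healthyTentative = new ConcurrentHashMap<>();

    static void startRecoveryLoop(Duration refreshInterval,
                                  Duration allowedUnavailabilityWindow,
                                  boolean isDirectMode,
                                  BiPredicate<String, String> canConnectToPartitionInRegion) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            Instant now = Instant.now();
            unavailableSince.forEach((key, since) -> {
                // Only consider regions that have been Unavailable longer than the window.
                if (Duration.between(since, now).compareTo(allowedUnavailabilityWindow) < 0) {
                    return;
                }
                // In direct mode, additionally require that connections to the partition's
                // replica addresses in that region can be established.
                boolean probeOk = !isDirectMode
                    || canConnectToPartitionInRegion.test(key.pkRangeId(), key.region());
                if (probeOk) {
                    healthyTentative.put(key, Boolean.TRUE);
                    unavailableSince.remove(key);
                }
            });
        }, refreshInterval.toSeconds(), refreshInterval.toSeconds(), TimeUnit.SECONDS);
    }
}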

Design Philosophy

  • A request targeted at a region is short-circuited before being routed to an available region. The design philosophy is to let ClientRetryPolicy, LocationCache and GlobalEndpointManager pick the next applicable region (a region which is both available from the client's perspective and not excluded from the request's perspective) for the target physical partition, and to let GlobalPartitionEndpointManagerForCircuitBreaker short-circuit accordingly.
  • A region marked as Unavailable is marked so for both reads and writes. The idea is to avoid issues around writing to one region and reading from another (cross-regional replication lag). This will lead to RU spikes in the failed-over region.

Config changes needed

  • In the first iteration of the partition-level circuit breaker, exception thresholds are maintained as an integral value, and a consecutive count of exceptions or successes is needed to demote or promote the health status of a partition.
  • The application developer can specify exception thresholds for a partition to be marked as Unavailable for a given region in the following manner:
System.setProperty(
    "COSMOS.PARTITION_LEVEL_CIRCUIT_BREAKER_CONFIG",
    "{\"isPartitionLevelCircuitBreakerEnabled\": true, "
        + "\"circuitBreakerType\": \"CONSECUTIVE_EXCEPTION_COUNT_BASED\","
        + "\"consecutiveExceptionCountToleratedForReads\": 10,"
        + "\"consecutiveExceptionCountToleratedForWrites\": 5"
        + "}");
  • A background process runs every 1 minute to check for regions which have been in Unavailable status for a partition beyond the allowed duration and tries to promote them to HealthyTentative status. The interval at which this background process kicks in can be configured as below:
System.setProperty("COSMOS.STALE_PARTITION_UNAVAILABILITY_REFRESH_INTERVAL_IN_SECONDS", "60");
  • A partition stays in Unavailable status for a region for a default of 30s, after which the background thread attempts to recover it to HealthyTentative status. How long a partition should remain Unavailable can be modified as below:
System.setProperty("COSMOS.ALLOWED_PARTITION_UNAVAILABILITY_DURATION_IN_SECONDS", "30");
  • DISCLAIMER: The above settings act as lower bounds - setting anything lower resets the value to the default above. This avoids extremely aggressive thresholds for circuit breaking or recovering a partition, and guards against settings for which negative values would not make sense in this context - such as threshold counts and durations.
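For illustration, the clamping rule described in the disclaimer could look like the following hypothetical helper; the property name is real, while the helper itself and its use of the 60-second default are assumptions, not the SDK's actual parsing code.

// Hypothetical illustration of the lower-bound rule described above: values below
// the documented default are reset to that default.
public class ConfigLowerBoundSketch {

    static int clampToDefault(String propertyName, int defaultValue) {
        String raw = System.getProperty(propertyName);
        int requested = raw == null ? defaultValue : Integer.parseInt(raw);
        return Math.max(requested, defaultValue);
    }

    public static void main(String[] args) {
        System.setProperty("COSMOS.STALE_PARTITION_UNAVAILABILITY_REFRESH_INTERVAL_IN_SECONDS", "10");
        // A request for 10 s is clamped back up to the 60 s default.
        System.out.println(clampToDefault(
            "COSMOS.STALE_PARTITION_UNAVAILABILITY_REFRESH_INTERVAL_IN_SECONDS", 60)); // 60
    }
}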

Diagnostic changes

  • Changes were made to include the circuit-breaking thresholds and the partition's per-region health status in the request diagnostics:
{
    "userAgent": "azsdk-java-cosmos/4.63.0-beta.1 Windows11/10.0 JRE/18.0.2.1",
    "activityId": "a1ed2393-0308-4ed1-97d4-c75ace0ef075",
    "requestLatencyInMs": 28,
    "requestStartTimeUTC": "2024-07-05T16:23:30.557367900Z",
    "requestEndTimeUTC": "2024-07-05T16:23:30.585384500Z",
    "responseStatisticsList": [],
    "supplementalResponseStatisticsList": [],
    "addressResolutionStatistics": {},
    "regionsContacted": [
        "south central us"
    ],
    "retryContext": {
        "statusAndSubStatusCodes": null,
        "retryCount": 0,
        "retryLatency": 0
    },
    "metadataDiagnosticsContext": {
        "metadataDiagnosticList": null
    },
    "serializationDiagnosticsContext": {
        "serializationDiagnosticsList": null
    },
    "gatewayStatisticsList": [
        {
            "sessionToken": "0:0#2#1=-1#5=-1",
            "operationType": "Read",
            "resourceType": "Document",
            "statusCode": 200,
            "subStatusCode": 0,
            "requestCharge": 1.0,
            "requestTimeline": [
                {
                    "eventName": "connectionAcquired",
                    "startTimeUTC": "2024-07-05T16:23:30.558367900Z",
                    "durationInMilliSecs": 0.0
                },
                {
                    "eventName": "connectionConfigured",
                    "startTimeUTC": "2024-07-05T16:23:30.558367900Z",
                    "durationInMilliSecs": 1.0021
                },
                {
                    "eventName": "requestSent",
                    "startTimeUTC": "2024-07-05T16:23:30.559370Z",
                    "durationInMilliSecs": 0.0
                },
                {
                    "eventName": "transitTime",
                    "startTimeUTC": "2024-07-05T16:23:30.559370Z",
                    "durationInMilliSecs": 26.0145
                },
                {
                    "eventName": "received",
                    "startTimeUTC": "2024-07-05T16:23:30.585384500Z",
                    "durationInMilliSecs": 0.0
                }
            ],
            "partitionKeyRangeId": "0",
            "responsePayloadSizeInBytes": 369,
            "faultInjectionEvaluationResults": [
                "service-unavailable-rule-c3a95e10-b4e3-4c4b-80c7-686f848399d6 [RegionEndpoint mismatch: Expected [[https://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/]], Actual [https://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/]]"
            ],
            "locationToLocationSpecificHealthContext": {
                "v": {
                    "east us": {
                        "exceptionCountForWriteForCircuitBreaking": 0,
                        "exceptionCountForReadForCircuitBreaking": 0,
                        "successCountForWriteForRecovery": 0,
                        "successCountForReadForRecovery": 0,
                        "locationHealthStatus": "Healthy",
                        "unavailableSince": "+1000000000-12-31T23:59:59.999999999Z"
                    },
                    "west us 2": {
                        "exceptionCountForWriteForCircuitBreaking": 0,
                        "exceptionCountForReadForCircuitBreaking": 0,
                        "successCountForWriteForRecovery": 0,
                        "successCountForReadForRecovery": 0,
                        "locationHealthStatus": "Unavailable",
                        "unavailableSince": "2024-07-05T16:23:30.280312700Z"
                    },
                    "south central us": {
                        "exceptionCountForWriteForCircuitBreaking": 0,
                        "exceptionCountForReadForCircuitBreaking": 0,
                        "successCountForWriteForRecovery": 0,
                        "successCountForReadForRecovery": 0,
                        "locationHealthStatus": "Healthy",
                        "unavailableSince": "+1000000000-12-31T23:59:59.999999999Z"
                    }
                }
            }
        }
    ],
    "samplingRateSnapshot": 1.0,
    "bloomFilterInsertionCountSnapshot": 0,
    "systemInformation": {
        "usedMemory": "60457 KB",
        "availableMemory": "4133847 KB",
        "systemCpuLoad": "(2024-07-05T16:23:12.550066Z 5.8%), (2024-07-05T16:23:17.545800700Z 11.3%), (2024-07-05T16:23:20.900012800Z 6.7%), (2024-07-05T16:23:25.914797Z 4.6%), (2024-07-05T16:23:26.812925600Z 2.8%), (2024-07-05T16:23:28.280351400Z 5.2%)",
        "availableProcessors": 8
    },
    "clientCfgs": {
        "id": 4,
        "machineId": "uuid:4eccf1d0-db74-44c7-bdf6-71e627cf0dc8",
        "connectionMode": "GATEWAY",
        "numberOfClients": 1,
        "excrgns": "[]",
        "clientEndpoints": {
            "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx": 4
        },
        "connCfg": {
            "rntbd": null,
            "gw": "(cps:1000, nrto:PT1M, icto:PT1M, p:false)",
            "other": "(ed: true, cs: false, rv: true)"
        },
        "consistencyCfg": "(consistency: Session, mm: true, prgns: [westus2,southcentralus,eastus])",
        "proactiveInitCfg": "",
        "e2ePolicyCfg": "",
        "sessionRetryCfg": "",
        "partitionLevelCircuitBreakerCfg": "(cb: true, type: CONSECUTIVE_EXCEPTION_COUNT_BASED, rexcntt: 10, wexcntt: 5)"
    }
}

Perf Benchmark Results

  • The below results are for a workload of 5 million reads on a collection with 2 physical partitions and an operation concurrency of 40.
  • There is a 1%-2% regression for both P99 latency and success rate (throughput).

Results
| where TIMESTAMP >= ago(1d)
| where BranchName has "MainClone" or BranchName has "PartitionLevelCircuitBreaker"
| summarize max(P99LatencyInMs) by bin(TIMESTAMP, 5m), strcat(BranchName, ":", CommitId)
| render timechart;

[Timechart: max P99LatencyInMs per 5-minute bin, split by branch and commit]

Results
| where TIMESTAMP >= ago(1d)
| where BranchName has "MainClone" or BranchName has "PartitionLevelCircuitBreaker"
| summarize max(SuccessRate) by bin(TIMESTAMP, 5m), strcat(BranchName, ":", CommitId)
| render timechart;

[Timechart: max SuccessRate per 5-minute bin, split by branch and commit]

@azure-sdk
Collaborator

API change check

APIView has identified API level changes in this PR and created following API reviews.

com.azure:azure-cosmos

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 added 4 commits July 17, 2024 12:15
…rtitionLevelCircuitBreaker

# Conflicts:
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/TestSuiteBase.java
@azure-sdk
Collaborator

API change check

APIView has identified API level changes in this PR and created following API reviews.

com.azure:azure-cosmos

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@moarychan
Member

Fixed Spring cosmos test pipeline failure #41227

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
