Skip to content

Conversation

@NaluTripician
Copy link
Contributor

@NaluTripician NaluTripician commented Apr 4, 2025

Pull Request Template

Description

Context

Currently, there is a bug in the SDK where upon a cold start of the SDK and rare edge cases involving online/offline-ing regions, where only query requests are made, the SDK will not retry certain status code responses from metadata requests causing the entire request to fail. The correct behavior would be for the SDK to do cross region retries on these metadata requests.

This pulls request includes several updates to enhance error handling and retry logic in the Cosmos DB SDK. The changes mainly focus on extending support for additional server error types and improving retry policies for various scenarios.

Improvements to retry logic:

  • ClientRetryPolicy.cs: Enhanced retry logic to handle InternalServerError , DatabaseAccountNotFound, and LeaseNotFound status codes.
  • MetadataRequestThrottleRetryPolicy.cs: Refactored retry policy to handle additional status codes and renamed methods and constants to reflect the broader scope of endpoint unavailability.

FaultInjection enhancements to error handling testing:

  • FaultInjectionRuleBuilder.cs: Added support for additional server error types such as DatabaseAccountNotFound, ServiceUnavailable, InternalServerError, and LeaseNotFound in the for metadata requests.
  • FaultInjectionServerErrorType.cs: Updated the FaultInjectionServerErrorType enum to include LeaseNotFound and corrected the status code for DatabaseAccountNotFound.
  • FaultInjectionServerErrorResultInternal.cs: Added handling for LeaseNotFound and updated the status code for DatabaseAccountNotFound in the GetInjectedServerError method.

Testing updates:

  • CosmosItemIntegrationTests.cs: Added a new test method MetadataEndpointUnavailableCrossRegionalRetryTest to validate the retry logic for various server error types.
  • ClientRetryPolicyTests.cs: Extended existing tests to cover additional status codes and substatus codes.

Type of change

Please delete options that are not relevant.

  • [] Bug fix (non-breaking change which fixes an issue)

Closing issues

To automatically close an issue: closes #4710

@NaluTripician NaluTripician self-assigned this Apr 4, 2025
@NaluTripician NaluTripician marked this pull request as ready for review April 8, 2025 19:04
kundadebdatta
kundadebdatta previously approved these changes Apr 8, 2025
Copy link
Member

@kundadebdatta kundadebdatta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@kundadebdatta kundadebdatta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Now.

@NaluTripician NaluTripician added the auto-merge Enables automation to merge PRs label Apr 12, 2025
@kirankumarkolli kirankumarkolli merged commit cd21ae6 into master Apr 12, 2025
26 checks passed
@kirankumarkolli kirankumarkolli deleted the users/nalutripician/queryPipelineRetryPolicyHookUp branch April 12, 2025 04:05
NaluTripician added a commit that referenced this pull request Apr 12, 2025
…retried with a client cold start with only query requests (#5108)

# Pull Request Template

## Description

### Context 

Currently, there is a bug in the SDK where upon a cold start of the SDK
and rare edge cases involving online/offline-ing regions, where only
query requests are made, the SDK will not retry certain status code
responses from metadata requests causing the entire request to fail. The
correct behavior would be for the SDK to do cross region retries on
these metadata requests.

This pulls request includes several updates to enhance error handling
and retry logic in the Cosmos DB SDK. The changes mainly focus on
extending support for additional server error types and improving retry
policies for various scenarios.

### Improvements to retry logic:

*
[`ClientRetryPolicy.cs`](diffhunk://#diff-2b056512ca285b1d95e025e31f60345059fa92d958becc38f90a6fb54ce1bbb4R331-R341):
Enhanced retry logic to handle `InternalServerError` ,
`DatabaseAccountNotFound`, and `LeaseNotFound` status codes.
*
[`MetadataRequestThrottleRetryPolicy.cs`](diffhunk://#diff-a5ed5985909c3dcb6e4ce186cdd662d590dac5297ea14e68560c7d1eca307be4L26-R28):
Refactored retry policy to handle additional status codes and renamed
methods and constants to reflect the broader scope of endpoint
unavailability.

### FaultInjection enhancements to error handling testing:

*
[`FaultInjectionRuleBuilder.cs`](diffhunk://#diff-d827164a4a6a0d8737e6598f8132c915ef48a1fc01daaa6422706f770dada5d5L152-R156):
Added support for additional server error types such as
`DatabaseAccountNotFound`, `ServiceUnavailable`, `InternalServerError`,
and `LeaseNotFound` in the for metadata requests.
*
[`FaultInjectionServerErrorType.cs`](diffhunk://#diff-0c89faa9a48c428a7a98662d995474e34295618ac60e677ad9762fd048f33601L75-R82):
Updated the `FaultInjectionServerErrorType` enum to include
`LeaseNotFound` and corrected the status code for
`DatabaseAccountNotFound`.
*
[`FaultInjectionServerErrorResultInternal.cs`](diffhunk://#diff-1ae8256c6d505a8f3b0a350978a0cc9a08f6234f7328f28f1db14302ee691d72L473-R473):
Added handling for `LeaseNotFound` and updated the status code for
`DatabaseAccountNotFound` in the `GetInjectedServerError` method.
* 
### Testing updates:

*
[`CosmosItemIntegrationTests.cs`](diffhunk://#diff-16d429adf686a32936696d2014afab3fc8faf91f10c880850fb8b30f8b96bb33R153-R214):
Added a new test method
`MetadataEndpointUnavailableCrossRegionalRetryTest` to validate the
retry logic for various server error types.
*
[`ClientRetryPolicyTests.cs`](diffhunk://#diff-d3fdfdc5d4f4d8af2c2cc463d928285680e7695861422ef2e3330d1a956807e1L167-R176):
Extended existing tests to cover additional status codes and substatus
codes.

## Type of change

Please delete options that are not relevant.

- [] Bug fix (non-breaking change which fixes an issue)

## Closing issues

To automatically close an issue: closes #4710

---------

Co-authored-by: Kiran Kumar Kolli <[email protected]>
kirankumarkolli pushed a commit that referenced this pull request Apr 12, 2025
…5124)

# Pull Request Template

## Description

Hotfix will include crucial bugfix:
#5108

## Type of change

Please delete options that are not relevant.

## Closing issues

To automatically close an issue: closes #IssueNumber
NaluTripician added a commit that referenced this pull request Apr 12, 2025
…5124)

Hotfix will include crucial bugfix:
#5108

Please delete options that are not relevant.

To automatically close an issue: closes #IssueNumber
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge Enables automation to merge PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: Handling 403:1008 for address resolution calls.

5 participants