Skip to content

Conversation

@kundadebdatta
Copy link
Member

@kundadebdatta kundadebdatta commented Nov 14, 2025

Description:

This PR is introducing GlobalEndpointManager, ClientRetryPolicy, LocationCache and AccountProperties to support cross-regional routing in the rust SDK. Some of the key design decisions are highlighted below:

  • We used Moka as a caching library which supports thread safe access of the cached data and uses LRFU based cache eviction mechanism.

  • The cache refresh has been done on a TTL (time to live) model, where each incoming request has the opportunity to refresh AccountProperties and LocationCache, if the TTL expires or force_refresh is passed as True.

  • We haven't took the background refresh route yet, since we will need to be a little careful and optimized for the rust bg task management. However, our long term goal is to refresh the cache in the background, so that we do not penalize an incoming request to refresh the cache.

The below flow diagram depicts the entire component interactions to better understand how the request will be made.

Flow Diagram:

sequenceDiagram
    participant J as ContainerClient
    participant A as CosmosPipeline
    participant B as RetryHandler
    participant C as ClientRetryPolicy
    participant D as Core Pipeline <br> send_raw <br> through delegate
    participant E as GlobalEndpointManager
    participant F as LocationCache


    J-->>A: 1. send(CosmosRequest)
    A->>B: 2. send<Sender, Fut>()
    loop Retry Iterations with force-refresh flag = false/ true
        B->>C: 3. before_send_request(request)
        C->>E: 4. refresh_location_async() <br> Refresh Cache only when TTL Expires OR <br> force_refresh is requested. <br><br> NO-OP when either of the above is false.
        E->>E: 5. get_database_account() <br> Fetch Account Properties <br> from Gateway.
        E->>F: 6. on_database_account_read() <br> Refresh Location Cache
        C->>E: 7. resolve_service_endpoint() <br> Fetch and set the next endpoint in <br> CosmosRequest to route request.
        B->>D: 8. Call sender delegate(request)
        B->>C: 9. Invoke should_retry(result)
        B->>B: 10. RetryAfter <br> (TimeSpan.Zero)
    end
        B->>J: 11. Return result if No Retry
Loading

Closing issues

Fixes #3175

@kundadebdatta kundadebdatta self-assigned this Nov 14, 2025
@kundadebdatta kundadebdatta added Cosmos The azure_cosmos crate high-availability labels Nov 14, 2025
@github-actions
Copy link

github-actions bot commented Nov 14, 2025

API Change Check

APIView identified API level changes in this PR and created the following API reviews

azure_data_cosmos

Copy link
Member

@analogrelay analogrelay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review. Still working through GEM and Location Cache, but wanted to share some of my initial stylistic feedback

@kundadebdatta kundadebdatta changed the title [WIP] Cosmos: Adds GlobalEndpointManager, ClientRetryPolicy and LocationCache To Support Cross-Regional Routing Cosmos: Adds GlobalEndpointManager, ClientRetryPolicy and LocationCache To Support Cross-Regional Routing Nov 19, 2025
@kundadebdatta kundadebdatta marked this pull request as ready for review November 19, 2025 19:51
@kundadebdatta kundadebdatta requested a review from a team as a code owner November 19, 2025 19:51
Copilot AI review requested due to automatic review settings November 19, 2025 19:51
Copilot finished reviewing on behalf of kundadebdatta November 19, 2025 19:53
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces comprehensive cross-regional routing support for the Cosmos DB Rust SDK by adding GlobalEndpointManager, ClientRetryPolicy, LocationCache, and MetadataRequestRetryPolicy. Key changes include:

  • Implementation of multi-region endpoint management with failover support
  • TTL-based caching (600s) using Moka library for account properties and location information
  • Comprehensive retry policies for data plane (ClientRetryPolicy) and metadata operations (MetadataRequestRetryPolicy)
  • Refactoring of CosmosRequest from separate builder to unified builder pattern
  • Resource type renaming from Items to Documents for consistency

Reviewed Changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
routing/global_endpoint_manager.rs New component managing endpoint resolution, failover, and account properties caching
routing/location_cache.rs New component tracking regional endpoints, availability, and routing preferences
retry_policies/client_retry_policy.rs New retry policy for data plane operations with session token, endpoint failure, and throttling handling
retry_policies/metadata_request_retry_policy.rs New retry policy for metadata operations with simplified failover logic
retry_policies/mod.rs Refactored from trait to enum-based pattern, added substatus code helpers
retry_policies/resource_throttle_retry_policy.rs Removed async_trait, made methods public for enum delegation
cosmos_request.rs Merged builder into main struct, removed CosmosRequestBuilder, added ResourceLink
pipeline/mod.rs Updated to accept GlobalEndpointManager and pre-configured pipeline
clients/*.rs Updated to use Arc and new CosmosRequest builder pattern
resource_context.rs Renamed Items to Documents, added is_meta_data method
operation_context.rs Added ReadFeed and Execute variants, added http_method function
constants.rs Added SubStatusCode enum and ACCOUNT_PROPERTIES_KEY constant
models/account_properties.rs Added PartialEq, Eq, Hash derives to AccountRegion
options/mod.rs Changed application_preferred_regions to Vec<Cow<'static, str>>
Comments suppressed due to low confidence (1)

sdk/cosmos/azure_data_cosmos/src/routing/location_cache.rs:1

  • The get_substatus_code_from_response function clones the response unnecessarily on line 184. The function signature takes &RawResponse but internally clones it with response.clone(). This clone is unnecessary and impacts performance. Change line 184 to: let sub_status_code = get_substatus_code_from_response(response); and update the function in retry_policies/mod.rs to remove the .clone() call.
// Copyright (c) Microsoft Corporation. All rights reserved.

Copy link
Member

@analogrelay analogrelay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure we're getting the architecture of this right for Rust. There's a lot of data being moved around, which is cheap in .NET/Java, but in Rust it results in a lot of .clone() operations, which require copying actual bytes in memory and gets quite slow.

But here are my first set of comments. We can refactor this code to improve it later, so I'm not necessarily going to block merging on figuring out the exact right Rust architecture, but I have this growing sense that we should consider getting back to basics and trying to work out the "Rust-y" way to achieve our underlying goals rather than just how to translate existing logic in to Rust.

@kundadebdatta kundadebdatta enabled auto-merge (squash) November 20, 2025 02:45
Copy link
Member

@analogrelay analogrelay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small style nit, but it's not critical

Copy link
Member

@heaths heaths left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved for stuff outside sdk/cosmos. I really only looked at where moka was used and was told it will be more complex later and worth the investment.

@kundadebdatta kundadebdatta merged commit b592daf into main Nov 20, 2025
17 checks passed
@kundadebdatta kundadebdatta deleted the users/kundadebdatta/3175_build_global_endpoint_manager branch November 20, 2025 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Cosmos The azure_cosmos crate high-availability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cosmos: Build GlobalEndpointManager For Endpoint Management

4 participants