-
Notifications
You must be signed in to change notification settings - Fork 320
Cosmos: Adds GlobalEndpointManager, ClientRetryPolicy and LocationCache To Support Cross-Regional Routing
#3348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cosmos: Adds GlobalEndpointManager, ClientRetryPolicy and LocationCache To Support Cross-Regional Routing
#3348
Conversation
API Change CheckAPIView identified API level changes in this PR and created the following API reviews |
analogrelay
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review. Still working through GEM and Location Cache, but wanted to share some of my initial stylistic feedback
GlobalEndpointManager, ClientRetryPolicy and LocationCache To Support Cross-Regional RoutingGlobalEndpointManager, ClientRetryPolicy and LocationCache To Support Cross-Regional Routing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces comprehensive cross-regional routing support for the Cosmos DB Rust SDK by adding GlobalEndpointManager, ClientRetryPolicy, LocationCache, and MetadataRequestRetryPolicy. Key changes include:
- Implementation of multi-region endpoint management with failover support
- TTL-based caching (600s) using Moka library for account properties and location information
- Comprehensive retry policies for data plane (ClientRetryPolicy) and metadata operations (MetadataRequestRetryPolicy)
- Refactoring of CosmosRequest from separate builder to unified builder pattern
- Resource type renaming from
ItemstoDocumentsfor consistency
Reviewed Changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| routing/global_endpoint_manager.rs | New component managing endpoint resolution, failover, and account properties caching |
| routing/location_cache.rs | New component tracking regional endpoints, availability, and routing preferences |
| retry_policies/client_retry_policy.rs | New retry policy for data plane operations with session token, endpoint failure, and throttling handling |
| retry_policies/metadata_request_retry_policy.rs | New retry policy for metadata operations with simplified failover logic |
| retry_policies/mod.rs | Refactored from trait to enum-based pattern, added substatus code helpers |
| retry_policies/resource_throttle_retry_policy.rs | Removed async_trait, made methods public for enum delegation |
| cosmos_request.rs | Merged builder into main struct, removed CosmosRequestBuilder, added ResourceLink |
| pipeline/mod.rs | Updated to accept GlobalEndpointManager and pre-configured pipeline |
| clients/*.rs | Updated to use Arc and new CosmosRequest builder pattern |
| resource_context.rs | Renamed Items to Documents, added is_meta_data method |
| operation_context.rs | Added ReadFeed and Execute variants, added http_method function |
| constants.rs | Added SubStatusCode enum and ACCOUNT_PROPERTIES_KEY constant |
| models/account_properties.rs | Added PartialEq, Eq, Hash derives to AccountRegion |
| options/mod.rs | Changed application_preferred_regions to Vec<Cow<'static, str>> |
Comments suppressed due to low confidence (1)
sdk/cosmos/azure_data_cosmos/src/routing/location_cache.rs:1
- The
get_substatus_code_from_responsefunction clones the response unnecessarily on line 184. The function signature takes&RawResponsebut internally clones it withresponse.clone(). This clone is unnecessary and impacts performance. Change line 184 to:let sub_status_code = get_substatus_code_from_response(response);and update the function inretry_policies/mod.rsto remove the.clone()call.
// Copyright (c) Microsoft Corporation. All rights reserved.
sdk/cosmos/azure_data_cosmos/src/retry_policies/client_retry_policy.rs
Outdated
Show resolved
Hide resolved
analogrelay
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure we're getting the architecture of this right for Rust. There's a lot of data being moved around, which is cheap in .NET/Java, but in Rust it results in a lot of .clone() operations, which require copying actual bytes in memory and gets quite slow.
But here are my first set of comments. We can refactor this code to improve it later, so I'm not necessarily going to block merging on figuring out the exact right Rust architecture, but I have this growing sense that we should consider getting back to basics and trying to work out the "Rust-y" way to achieve our underlying goals rather than just how to translate existing logic in to Rust.
sdk/cosmos/azure_data_cosmos/src/retry_policies/resource_throttle_retry_policy.rs
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/src/retry_policies/client_retry_policy.rs
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/src/routing/global_endpoint_manager.rs
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/src/retry_policies/client_retry_policy.rs
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/src/retry_policies/client_retry_policy.rs
Outdated
Show resolved
Hide resolved
analogrelay
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small style nit, but it's not critical
sdk/cosmos/azure_data_cosmos/src/retry_policies/client_retry_policy.rs
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure_data_cosmos/src/retry_policies/client_retry_policy.rs
Outdated
Show resolved
Hide resolved
heaths
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved for stuff outside sdk/cosmos. I really only looked at where moka was used and was told it will be more complex later and worth the investment.
Description:
This PR is introducing
GlobalEndpointManager,ClientRetryPolicy,LocationCacheandAccountPropertiesto support cross-regional routing in the rust SDK. Some of the key design decisions are highlighted below:We used Moka as a caching library which supports thread safe access of the cached data and uses LRFU based cache eviction mechanism.
The cache refresh has been done on a TTL (time to live) model, where each incoming request has the opportunity to refresh
AccountPropertiesandLocationCache, if the TTL expires orforce_refreshis passed asTrue.We haven't took the background refresh route yet, since we will need to be a little careful and optimized for the rust bg task management. However, our long term goal is to refresh the cache in the background, so that we do not penalize an incoming request to refresh the cache.
The below flow diagram depicts the entire component interactions to better understand how the request will be made.
Flow Diagram:
sequenceDiagram participant J as ContainerClient participant A as CosmosPipeline participant B as RetryHandler participant C as ClientRetryPolicy participant D as Core Pipeline <br> send_raw <br> through delegate participant E as GlobalEndpointManager participant F as LocationCache J-->>A: 1. send(CosmosRequest) A->>B: 2. send<Sender, Fut>() loop Retry Iterations with force-refresh flag = false/ true B->>C: 3. before_send_request(request) C->>E: 4. refresh_location_async() <br> Refresh Cache only when TTL Expires OR <br> force_refresh is requested. <br><br> NO-OP when either of the above is false. E->>E: 5. get_database_account() <br> Fetch Account Properties <br> from Gateway. E->>F: 6. on_database_account_read() <br> Refresh Location Cache C->>E: 7. resolve_service_endpoint() <br> Fetch and set the next endpoint in <br> CosmosRequest to route request. B->>D: 8. Call sender delegate(request) B->>C: 9. Invoke should_retry(result) B->>B: 10. RetryAfter <br> (TimeSpan.Zero) end B->>J: 11. Return result if No RetryClosing issues
Fixes #3175