-
Notifications
You must be signed in to change notification settings - Fork 498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade Resiliency: Fixes Code to Clean-up Unhealthy Connection and LbChannelState Object. #4928
Merged
microsoft-github-policy-service
merged 2 commits into
master
from
users/kundadebdatta/upgrade_direct_package_to_3_37_3
Dec 11, 2024
Merged
Upgrade Resiliency: Fixes Code to Clean-up Unhealthy Connection and LbChannelState Object. #4928
microsoft-github-policy-service
merged 2 commits into
master
from
users/kundadebdatta/upgrade_direct_package_to_3_37_3
Dec 11, 2024
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
kundadebdatta
changed the title
[Internal] Direct Package Upgrade: Refactors Code to Bump Up
Upgrade Resiliency: Fixes Code to Clean-up Unhealthy Connection and LbChannelState Object.
Dec 10, 2024
Cosmos.Direct
Package to 3.37.3
kundadebdatta
requested review from
khdang,
sboshra,
adityasa,
neildsh,
kirankumarkolli,
FabianMeiswinkel,
kirillg and
Pilchie
as code owners
December 11, 2024 12:40
kirankumarkolli
approved these changes
Dec 11, 2024
aavasthy
approved these changes
Dec 11, 2024
microsoft-github-policy-service
bot
deleted the
users/kundadebdatta/upgrade_direct_package_to_3_37_3
branch
December 11, 2024 18:51
kundadebdatta
added a commit
that referenced
this pull request
Dec 23, 2024
…bChannelState Object. (#4928) This PR bumps up the `Cosmos.Direct` package version to `3.37.3` which is released on top of `EN20241007` release. In a nut-shell, the change is focused on cleaning up the unhealthy dangling connections, that were created as a part of the "Advanced Replica Selection" flow. **Problem:** One or our recent conversations with the IC3 team has helped us unveiling a potential issue with the RNTBD connection creation and management flow in the `LoadBalancingPartition`. The team is using the .NET SDK version `3.39.0-preview` to connect to cosmos db. Recently they encountered a potential memory leak in some of their clusters and upon investigation, it appeared that one of the root cause is the underlying `CosmosClient` is keeping a high number of unhealthy and unused `LbChannelState`s. In a nut shell, below are few of the account configurations and facts: • **account-1:** 9 Partitions with 1 unique tenant. There are approx 4 to 8 clients for this tenant. 2 * no of replica regions is the client count. They have the connection warm-up enabled on this account. • **account-2** 2592 Partitions with 249 tenants/ feds. Connections created in happy path scenario: 249 x Y (Y = number of active clients for that account). Connection warm-up disabled on this account. • **account-3:** 27 Partitions with 13 tenants/ feds. CreateAndInitialize They have the connection warm-up enabled on this account. To understand this in more detail, please take a look at the memory dump below. ![ic3_memory_dump_accounts_hidden](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/b78585de-5318-436a-8511-0107d570e8d7) _[Fig-1: The above figure shows a snapshot of the memory dump taken for multiple accounts This also unviels the potential memory leak by the unhealthy connection.]_ Upon further analysis of the memory dump, it is clear that: - The number of stale unhealthy connections are higher in the accounts where we have the replica validation enabled along with the connection warm-up. - Without the connection warm-up, the number of stale unhealthy connections are comparatively lower, but still good enough to increase the memory foot-print. ![ic3_new_sdk_high_memory_accounts_hidden](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/ea3a4f82-daca-40f8-bcd6-d29da806a3c8) _[Fig-2: The above figure shows how the memory footprint got increased over time, along with incoming requests. The service was finally needed to be restarted to free up the memory]_ - Even without the replica validation feature, the memory footprint showed a consistent increase over time. ![ic3_old_sdk_high_memory_accounts_hidden](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/471a50c3-1786-4f12-b5e3-1ab192c6fef5) _[Fig-3: The above figure shows the memory consumption from the IC3 partner-api service which is using an older version (v 3.25.0) of the .NET SDK and the memory consumption kept increasing with time.]_ **Analysis :** Upon further digging in to the memory dump, and re-producing the scenario locally, it was noted that: - **With Replica Validation Enabled:** Each of the impacted `LoadBalancingPartition` was holding more than `1` Unhealthy stale `LbChannelState` (which is a wrapper around the `Dispatcher` and a `Channel`), when the connection to the backend replica was closed deterministically. - **With Replica Validation Disabled:** Each of the impacted `LoadBalancingPartition` was holding exactly `1` Unhealthy stale `LbChannelState` (which is a wrapper around the `Dispatcher` and a `Channel`), when the connection to the backend replica was closed deterministically. Let's take a look at the below diagram to understand this in more detail: ![image](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/f4e2d494-962b-41da-ae80-0d542d62accf) _[Fig-4: The above figure shows an instance of the `LoadBalancingPartition` holding more than one entry of unhealthy `LbChannelState`.]_ By looking at the above memory dump snapshot, it is clear that these stale `LbChannelState` entries are kept in the `LoadBalancingPartition`, until they are removed the `openChannels` list, which is responsible for maintaining the number of channels (healthy or unhealthy) for that particular endpoint. If they are not cleaned up proactively (which is exactly this case), it might end up claiming extra memory overhead. With increasing number of partitions, connections over the time, things get worse with all these unused, yet low hanging `LbChannelState`s claiming more and more memory, and causing it to a memory leak. This is the potential root cause of the increased memory consumption. **Proposed Solution :** There are few changes proposed to fix this scenario. These are discussed briefly in the below section: - During the replica validation phase, in the `OpenConnectionAsync()`, proactively remove all the `Unhealthy` connections from the `openChannels` within the `LoadBalancingPartition`. This guarantees that any unhealthy `LbChannelStates` will be removed from the `LoadBalancingPartition`, freeing up the additional memory. Please delete options that are not relevant. - [x] Bug fix (non-breaking change which fixes an issue) To automatically close an issue: closes #4467
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Pull Request Template
Description
This PR bumps up the
Cosmos.Direct
package version to3.37.3
which is released on top ofEN20241007
release. In a nut-shell, the change is focused on cleaning up the unhealthy dangling connections, that were created as a part of the "Advanced Replica Selection" flow.Problem: One or our recent conversations with the IC3 team has helped us unveiling a potential issue with the RNTBD connection creation and management flow in the
LoadBalancingPartition
. The team is using the .NET SDK version3.39.0-preview
to connect to cosmos db. Recently they encountered a potential memory leak in some of their clusters and upon investigation, it appeared that one of the root cause is the underlyingCosmosClient
is keeping a high number of unhealthy and unusedLbChannelState
s.In a nut shell, below are few of the account configurations and facts:
• account-1: 9 Partitions with 1 unique tenant. There are approx 4 to 8 clients for this tenant. 2 * no of replica regions is the client count. They have the connection warm-up enabled on this account.
• account-2 2592 Partitions with 249 tenants/ feds. Connections created in happy path scenario: 249 x Y (Y = number of active clients for that account). Connection warm-up disabled on this account.
• account-3: 27 Partitions with 13 tenants/ feds. CreateAndInitialize They have the connection warm-up enabled on this account.
To understand this in more detail, please take a look at the memory dump below.
[Fig-1: The above figure shows a snapshot of the memory dump taken for multiple accounts This also unviels the potential memory leak by the unhealthy connection.]
Upon further analysis of the memory dump, it is clear that:
The number of stale unhealthy connections are higher in the accounts where we have the replica validation enabled along with the connection warm-up.
Without the connection warm-up, the number of stale unhealthy connections are comparatively lower, but still good enough to increase the memory foot-print.
[Fig-2: The above figure shows how the memory footprint got increased over time, along with incoming requests. The service was finally needed to be restarted to free up the memory]
Even without the replica validation feature, the memory footprint showed a consistent increase over time.
[Fig-3: The above figure shows the memory consumption from the IC3 partner-api service which is using an older version (v 3.25.0) of the .NET SDK and the memory consumption kept increasing with time.]
Analysis : Upon further digging in to the memory dump, and re-producing the scenario locally, it was noted that:
With Replica Validation Enabled: Each of the impacted
LoadBalancingPartition
was holding more than1
Unhealthy staleLbChannelState
(which is a wrapper around theDispatcher
and aChannel
), when the connection to the backend replica was closed deterministically.With Replica Validation Disabled: Each of the impacted
LoadBalancingPartition
was holding exactly1
Unhealthy staleLbChannelState
(which is a wrapper around theDispatcher
and aChannel
), when the connection to the backend replica was closed deterministically.Let's take a look at the below diagram to understand this in more detail:
[Fig-4: The above figure shows an instance of the
LoadBalancingPartition
holding more than one entry of unhealthyLbChannelState
.]By looking at the above memory dump snapshot, it is clear that these stale
LbChannelState
entries are kept in theLoadBalancingPartition
, until they are removed theopenChannels
list, which is responsible for maintaining the number of channels (healthy or unhealthy) for that particular endpoint. If they are not cleaned up proactively (which is exactly this case), it might end up claiming extra memory overhead. With increasing number of partitions, connections over the time, things get worse with all these unused, yet low hangingLbChannelState
s claiming more and more memory, and causing it to a memory leak. This is the potential root cause of the increased memory consumption.Proposed Solution :
There are few changes proposed to fix this scenario. These are discussed briefly in the below section:
OpenConnectionAsync()
, proactively remove all theUnhealthy
connections from theopenChannels
within theLoadBalancingPartition
. This guarantees that any unhealthyLbChannelStates
will be removed from theLoadBalancingPartition
, freeing up the additional memory.Type of change
Please delete options that are not relevant.
Closing issues
To automatically close an issue: closes #4467