add all host map to priority set for fast host searching #17290
snowp merged 15 commits into envoyproxy:main
Conversation
Signed-off-by: wbpcode <wbphub@live.com>
@mattklein123 Before proceeding with this work, I hope you can review the feasibility of this solution. 😃
Signed-off-by: wbpcode <wbphub@live.com>
mattklein123
left a comment
Thanks for working on this. One question to get started.
/wait-any
```cpp
  }
});
// Update read only all host map in worker threads.
cluster_manager->thread_local_clusters_[info->name()]->priority_set_.setReadOnlyAllHostMap(map);
```
Why can't this map just be calculated on the main thread and then passed by shared_ptr to all of the workers similar to how everything else is passed in the updateClusterMembership() call? It's unclear to me why this needs to be special cased like this?
In an update after aggregation, `updateClusterMembership()` will be executed multiple times (once for each priority). The host map only needs to be updated once. Although it is not a problem to add it to `updateClusterMembership()`, it would just set the same host map pointer repeatedly several times.
it will only be repeated several times to set the pointer of the host map.
As long as the copy doesn't happen multiple times I don't think this matters? My preference would be to keep the code as simple as possible unless we have a known perf issue. This code is already super complicated.
This code is already super complicated.
Yes, the code is complex enough already. It took me a long time to understand the source code of ClusterManager and EDS.
I will modify and improve this PR based on your suggestions.
```cpp
// Copy old read only host map to mutable host map.
mutable_all_host_map_ = read_only_all_host_map_ != nullptr
                            ? std::make_shared<HostMap>(*read_only_all_host_map_)
                            : std::make_shared<HostMap>();
```
Would this be prohibitive for large clusters? Even if just one host changes, we end up doing an O(n) copy of this map.
Speaking of large clusters, since this would add both space and computational overhead to large clusters, should we make this configurable or only enabled when actually used in some way?
I think we are already paying this copy cost for all the other data structures, right? (Not that it's good to add yet another one.) One option for this particular structure would be to use R/W locks and then we wouldn't need to actually do any copies on it.
I guess we do pay this when processing the EDS update as a whole, so we wouldn't make it that much worse.
I was originally thinking in terms of the update callbacks, most (all?) of which operate at the per-priority level, so a change in P0 won't trigger updates to P1, P2, ...
Yeah, good point re: per-priority, though I'm not sure how much that matters in practice for most users. I think though that as long as we can replace the EDS private one with this one it's probably a wash?
Yep if this lets us replace the all_hosts_ map then I think we're good
@snowp Yep. This will replace the `all_hosts_` map in EDS. This may still bring additional overhead, because each priority may be updated, causing the callback to be executed multiple times. In the original EDS, because `batchUpdate` was used, `all_hosts_` would only be copied once. But this problem can be alleviated by setting `merge_timeout`.
Another way to avoid the extra overhead is to use an additional `memberUpdateCallback` to synchronize `read_only_all_host_map_` to the worker threads instead of directly reusing `postThreadLocalClusterUpdate`. This ensures that when `batchUpdate` is used, there is at most one map copy. I think this is also a point that can be discussed.
I think we are already paying this copy cost for all the other data structures, right? (Not that it's good to add yet another one.) One option for this particular structure would be to use R/W locks and then we wouldn't need to actually do any copies on it.

@mattklein123 I have considered using R/W locks, but I don't want occasional membership updates to add the cost of acquiring a read lock to every request.
Signed-off-by: wbpcode <wbphub@live.com>
Signed-off-by: wbpcode <wbphub@live.com>
I think this PR is ready for a review. @mattklein123 😄
GCC CI passed. 😂 It seems that the memory test is not very stable.
Signed-off-by: wbpcode <wbphub@live.com>
Signed-off-by: wbpcode <wbphub@live.com>
/retest
Retrying Azure Pipelines:
mattklein123
left a comment
Thanks, in general this looks like it's on the right track. I have a few comments and I would definitely like @snowp to take a look. Thank you!
/wait
```cpp
                     overprovisioning_factor);

// Update read only all host map in worker threads.
cluster_entry->priority_set_.setReadOnlyAllHostMap(read_only_host_map);
```
Can this just be folded into the updateHosts() call?
```cpp
// When the configuration contains duplicate hosts, only the first one will be retained.
if (all_new_hosts.count(address->asString()) > 0) {
  continue;
}
```
Is this an unrelated bug fix? Or is this somehow required with this new code?
```cpp
if (all_new_hosts.count(address->asString()) > 0) {
  continue;
}

if (updated_hosts.count(host->address()->asString())) {
  continue;
}
```
I see, is this basically just moved to a new location?
Yep. I removed `updated_hosts`, so to ensure that the existing behavior is unchanged, I need to de-duplicate the hosts in other locations.
There is a slight difference in doing the bookkeeping here vs the new location in that doing it here includes endpoints in PENDING_DYNAMIC_REMOVAL
That said I don't think I can come up with a case where that matters after https://github.com/envoyproxy/envoy/pull/14483/files, so this is probably fine
```cpp
// PrioritySet
const HostMapConstSharedPtr& readOnlyAllHostMap() const override {
  // Check if the host set in the main thread PrioritySet has been updated.
  if (mutable_all_host_map_ != nullptr) {
    read_only_all_host_map_ = std::move(mutable_all_host_map_);
    ASSERT(mutable_all_host_map_ == nullptr);
  }
  return read_only_all_host_map_;
}

// PrioritySetImpl
void setReadOnlyAllHostMap(HostMapConstSharedPtr) override {
  // Using an external host map to update the read_only_all_host_map_ is not allowed in
  // MainPrioritySetImpl.
  NOT_IMPLEMENTED_GCOVR_EXCL_LINE;
}
```
Per my other comment I wonder if this could be removed from the interface and just be part of the implementation?
OK, we can make the interface simpler. 🤔
Signed-off-by: wbpcode <wbphub@live.com>
snowp
left a comment
Thanks, this change makes sense to me; just a few smaller comments.
```cpp
HostMapConstSharedPtr host_map = cm_cluster.cluster().prioritySet().crossPriorityHostMap();

tls_.runOnAllThreads([info = cm_cluster.cluster().info(), params = std::move(params),
                      add_or_update_cluster, load_balancer_factory, map = std::move(host_map)](
```
nit: Maybe call it `host_map` within the lambda as well? `map` seems a bit too generic.
```diff
 LocalityWeightsConstSharedPtr locality_weights, const HostVector& hosts_added,
-const HostVector& hosts_removed, uint64_t overprovisioning_factor) {
+const HostVector& hosts_removed, uint64_t overprovisioning_factor,
+const HostMapConstSharedPtr& cross_priority_host_map) {
```
I think usually we'd pass a `shared_ptr` by value and `std::move` it to avoid copies where appropriate.
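The convention being suggested can be shown in miniature. A hypothetical sketch (`ThreadLocalClusterSketch` and `setCrossPriorityHostMap` are made-up names, not Envoy's actual API):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

using HostMap = std::unordered_map<std::string, std::string>;
using HostMapConstSharedPtr = std::shared_ptr<const HostMap>;

// Taking the shared_ptr by value and moving it into place lets a caller that
// no longer needs its own copy transfer ownership with zero atomic refcount
// operations; a caller that still needs its copy simply passes an lvalue and
// pays exactly one refcount increment.
class ThreadLocalClusterSketch {
public:
  void setCrossPriorityHostMap(HostMapConstSharedPtr host_map) {
    cross_priority_host_map_ = std::move(host_map);
  }
  const HostMapConstSharedPtr& map() const { return cross_priority_host_map_; }

private:
  HostMapConstSharedPtr cross_priority_host_map_;
};
```

A `const HostMapConstSharedPtr&` parameter, by contrast, forces a refcount bump when the callee stores it, even if the caller's copy is about to be discarded.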
```cpp
// All host map for current resolve target. When we have multiple resolve targets, multiple
// targets may contain two different host objects with the same address. This host map cannot be
// replaced by the read only all host map in the priority set.
HostMap all_hosts_;
```
Is there anything that would prevent someone from using the shared map for strict DNS clusters? Seems like it would not be valid?
The shared map still works for strict DNS clusters. However, we cannot use the shared map to replace this `all_hosts_`.
A strict DNS cluster removes duplicate hosts within a resolve target, but if multiple resolve targets contain hosts with the same address, those hosts exist at the same time. As a global host map, the shared map can neither keep two different hosts with the same address for two different resolve targets nor leave the strict DNS cluster's logic unchanged.
So the shared map cannot replace `all_hosts_` (note that `all_hosts_` here is a member of the resolve target).
I guess my concern would be around people trying to implement sticky LB for strict DNS but running into weird edge cases: suppose they use regular LB to select an endpoint from the STRICT_DNS cluster and want to pin this LB selection.
This works fine if there is only one host per IP within the cluster, but if it is configured with multiple resolve targets that resolve to the same IP, some of which have different metadata, you might run into a situation where the first LB selects a different endpoint than what the shared map gives you when you ask for the IP.
I don't think we need to fix this, but perhaps some documentation around this limitation with the strict DNS cluster would be good?
This works fine if there is only one host per ip within the cluster, but if this is configured with multiple resolve targets that resolve to the same ip, some of which have different metadata, you might run into a situation where the first LB selects a different endpoint than what the shared map gives you when you ask for the IP
Yes, this is a known limitation. I will add some new comments to illustrate this problem.
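The limitation can be illustrated with a hypothetical, heavily simplified sketch (`Host` and `buildCrossPriorityMap` are made-up stand-ins): a first-wins global map keyed by address can keep only one of two hosts that share an address across resolve targets.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for a host: an address plus per-target metadata.
struct Host {
  std::string address;
  std::string metadata;
};

// A cross-priority map keyed by address keeps only the first host seen for a
// given address, so a second resolve target's host with the same address (but
// different metadata) is silently shadowed.
std::map<std::string, Host> buildCrossPriorityMap(const std::vector<Host>& hosts) {
  std::map<std::string, Host> map;
  for (const auto& h : hosts) {
    map.emplace(h.address, h); // emplace is a no-op if the address already exists
  }
  return map;
}
```

This is exactly the edge case described above: a sticky-LB lookup by IP may return a different host object (with different metadata) than the one the first load balancer selected.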
```cpp
void batchHostUpdate(BatchUpdateCb& callback) override;

const HostMapConstSharedPtr& crossPriorityHostMap() const override {
```
Should this return by value? Seems a bit safer so nothing stores this by ref and tries to access it?
I think if someone needs to store it, then a copy of the shared pointer should be created from this ref instead of storing the reference directly. 🤔
As I imagine it, this method has two main usage scenarios:
- First, in the main thread, we use this method to get the latest host map. At this point, as the code shows, I create a copy of the shared pointer.
- Second, in the worker threads, we obtain a reference to the host map through this method and then perform a host search, but do not save this reference. In this case, returning a reference avoids unnecessary atomic operations.
So is this in preparation for another usage where we access it without taking a copy from the worker threads? All the usages in this PR seems to make a copy. In general forcing a copy via the API makes the API a bit safer since it means the caller can't mess up by trying to access the ref later on, so it makes it less likely for bugs to happen later
If we are trying to allow for ref access via the ptr I think we need to add some comments around the lifetime of the returned reference since it is different from the lifetime of the PSet
Another strategy would be to have a std::shared_ptr<T> and a const T& accessor, which lets consumers of the API choose between a call that has very clear ownership semantics and one that requires some special attention
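That dual-accessor strategy could look roughly like this. A hypothetical sketch with stand-in types (`PrioritySetSketch` and `crossPriorityHostMapRef` are made-up names):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

using HostMap = std::unordered_map<std::string, std::string>;
using HostMapConstSharedPtr = std::shared_ptr<const HostMap>;

class PrioritySetSketch {
public:
  // Clear ownership semantics: the caller gets its own reference and the map
  // cannot be destroyed underneath it, at the cost of an atomic refcount bump.
  HostMapConstSharedPtr crossPriorityHostMap() const { return map_; }

  // Hot-path accessor: no refcount traffic, but the returned reference is only
  // valid until the next update on this thread and must never be stored.
  const HostMap& crossPriorityHostMapRef() const { return *map_; }

  void update(HostMapConstSharedPtr new_map) { map_ = std::move(new_map); }

private:
  HostMapConstSharedPtr map_ = std::make_shared<HostMap>();
};
```

The by-value accessor is the safe default; the by-ref accessor exists only for callers (like a per-request host search) that use the result immediately and can document that lifetime constraint.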
So is this in preparation for another usage where we access it without taking a copy from the worker threads?

Yes, it is preparation for sticky LB.
🤔 Maybe we can return the shared pointer directly. As long as we don't call this method frequently, there won't be too many atomic operations.
```cpp
/**
 * Specialized PrioritySetImpl designed for the main thread. It will update and maintain the read
 * only all host map when the hosts set changes.
```
```cpp
// Check if the host set in the main thread PrioritySet has been updated.
if (mutable_cross_priority_host_map_ != nullptr) {
  const_cross_priority_host_map_ = std::move(mutable_cross_priority_host_map_);
  ASSERT(mutable_cross_priority_host_map_ == nullptr);
```
This ASSERT seems a bit unnecessary since we null it out on the previous line, but I don't feel that strongly about it
Because I need to modify two data members (`mutable_cross_priority_host_map_` and `const_cross_priority_host_map_`) in a const method (`crossPriorityHostMap() const`), I added the ASSERT to ensure that the move is executed correctly.
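A minimal sketch of this lazy-swap pattern with stand-in types (`LazySwapSketch` is a made-up name): both slots must be `mutable` because the accessor is `const`, and the ASSERT merely documents that `std::move` of a `shared_ptr` leaves the source empty.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

using HostMap = std::unordered_map<std::string, std::string>;
using HostMapConstSharedPtr = std::shared_ptr<const HostMap>;

class LazySwapSketch {
public:
  // Called on updates: only the mutable slot is filled.
  void update(HostMapConstSharedPtr m) { mutable_map_ = std::move(m); }

  // The const accessor publishes the pending map on the first read after an update.
  const HostMapConstSharedPtr& crossPriorityHostMap() const {
    if (mutable_map_ != nullptr) {
      const_map_ = std::move(mutable_map_);
      assert(mutable_map_ == nullptr); // a moved-from shared_ptr is guaranteed empty
    }
    return const_map_;
  }

private:
  // mutable: both members change inside a const accessor.
  mutable HostMapConstSharedPtr mutable_map_;
  mutable HostMapConstSharedPtr const_map_;
};
```

Subsequent reads take the fast path (the `if` is false) and return the published pointer unchanged.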
```cpp
// Only mutable host map can be updated directly. Read only host map will not be updated before
// 'crossPriorityHostMap' is called.
```
not be updated until
I think we also prefer backticks (`) instead of quotes (') when referencing function names
mattklein123
left a comment
Thanks this LGTM at a high level. I will defer remaining review to @snowp who I think is better able to review this code than I am.
Signed-off-by: wbpcode <wbphub@live.com>
I would bump the limit. This PR is going to slightly increase the memory usage for static clusters, but not substantially so for EDS clusters (since we're replacing the existing map), which I imagine is the main cluster type used for large deployments. Thanks for raising!
/wait
Signed-off-by: wbpcode <wbphub@live.com>
@snowp So, what should I do next with this PR? 🤔
Signed-off-by: wbpcode <wbphub@live.com>
I'd suggest increasing the limits just enough to make this PR pass consistently.
Oh, I see. It seems that I misunderstood the meaning of your last comment. 🤣
I executed the memory consumption test case 500 times and got 1000 results. Here are the statistical results. @snowp @mattklein123 From the results, the fluctuation range is very large. Finally, I adjusted the limit to 40650 to ensure that CI will not be affected in most cases.
Signed-off-by: wbpcode <wbphub@live.com>
Do we have any likely explanation for the large fluctuations? Are they coming from this change, or is the fluctuation already there and we are just increasing it by a constant amount? cc @jmarantz as well (for when he's back)
This PR should not bring such fluctuations. But I can try to run the same test on main to see the distribution of memory consumption.
@snowp It seems that the fluctuations have always been large. Here are some detailed distributions, before this PR and after this PR.
For the record, the fluctuations started with some new version of tcmalloc we pulled in at some point. Prior to that, the measurements were always exact. Thanks for providing the distribution. Looks like bumping up the per-cluster cost by 500 bytes is needed and (I assume) warranted by the benefits of this change.
I found a point where we may be able to reduce some of the memory consumption.
Signed-off-by: wbpcode <wbphub@live.com>
The latest memory consumption distribution (the latest commit reduces memory consumption by nearly 200-300 bytes for each cluster):
Thanks for reducing the memory footprint! If I understand correctly, you removed one of the callbacks and instead reused an existing one? Just trying to make sure I understand :)
About the same. I removed an unnecessary callback and did what the callback needed to do directly in `updateHosts` (building the cross-priority host map).
* main:
  - Fix for fuzz tests failing due to invalid corpus paths (envoyproxy#17767)
  - kafka: fix integration test (envoyproxy#17764)
  - Fix typo in cluster.proto (envoyproxy#17755)
  - cluster manager: add drainConnections() API (envoyproxy#17747)
  - kafka broker filter: move to contrib (envoyproxy#17750)
  - quiche: switch external dependency to github (envoyproxy#17732)
  - quiche: implement stream idle timeout in codec (envoyproxy#17674)
  - Update c-ares to 1.17.2 (envoyproxy#17704)
  - Fix dns resolve fuzz bug (envoyproxy#17107)
  - Remove members that shadow members of the base class (envoyproxy#17713)
  - thrift proxy: missing parts from the previous PR (envoyproxy#17668)
  - thrift-proxy: cleanup ConnectionManager::ActiveRpc (envoyproxy#17734)
  - listener: extra warning for deprecated use_proxy_proto field (envoyproxy#17736)
  - kafka: add support for metadata request in mesh-filter (envoyproxy#17597)
  - upstream: add all host map to priority set for fast host searching (envoyproxy#17290)
  - Remove the support for `hidden_envoy_deprecated_per_filter_config` (envoyproxy#17725)
  - tls: SDS support for private key providers (envoyproxy#16737)
  - bazel: update rules_foreign_cc (envoyproxy#17445)

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Introduce a new stateful session extension. This is a change after #17848 and #17290. In #17290, we added a new cross-priority host map for fast host searching. In #17848, we extended the `LoadBalancerContext` interface to provide an override host and to select the upstream host by the override host. Finally, in this change, we add a new API to allow users to extract the session state from the request and change the result of load balancing by setting the override host value. Related doc: https://docs.google.com/document/d/1IU4b76AgOXijNa4sew1gfBfSiOMbZNiEt5Dhis8QpYg/edit?usp=sharing

Risk Level: Medium
Testing: Added
Docs Changes: Added
Release Notes: Added
Platform-Specific Features: N/A

Signed-off-by: wbpcode <wbphub@live.com>
Signed-off-by: wbpcode <wbphub@live.com>
Commit Message: add all host map to priority set for fast host searching
Additional Description:
Add all host map to priority set of main thread and then sync it to different worker threads. This host map will be used for fast host searching.
This is one of the most critical tasks related to #16698. Check #16698 and #17242 for more background info.
**This PR is still under active development, and related tests have not been added yet. It can only be used for early reviews.**
Risk Level: Low.
Testing: Wait.
Docs Changes: N/A.
Release Notes: N/A.