upstream: make sure all_hosts_ is updated correctly #4575
mattklein123 merged 13 commits into envoyproxy:master from dio:fix-4548
Conversation
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
@snowp can you help take a look at this and the related issue? Much appreciated.
Yeah I’ll have a look later today.
    parent_.updateAllHosts(hosts_added, hosts_removed, locality_lb_endpoint_.priority());
  }

  // TODO(dio): Not sure if we should do this if active health checking is disabled.
We should; the all_hosts_ mapping is also used to determine whether the new hosts match within the same priority, which affects hosts_changed.
Part of my comment in #4548 that's relevant to this PR:
The reason we're seeing duplicate addresses is because we fail to match a new address to an existing one, which is caused by the fact that we reset the

Yeah, I think I'll add From my quick experiment, with CDS response similar to #4548.
cc @tonya11en who I think is also hitting this at Lyft.
    host_addresses.insert((*i)->address()->asString());
    ++i;
  } else {
    i = target->hosts_.erase(i);
In the case of multiple instances of the same host, you could probably save yourself a few calls to erase if you just used std::remove_if(i, hosts_.end(), XXXX). Your predicate function would just check for the existence of each element in host_addresses. std::remove_if also has the same complexity as a vector erase.
remove_if doesn't work on associative containers so that won't work.
@lizan target->hosts_ is a vector, so it will work. Which container are you referring to?
Ah, ignore my comment above; I thought it was about host_addresses (unordered_set).
  if (target->locality_lb_endpoint_.priority() == current_priority) {
    priority_state_manager.registerHostForPriority(host, target->locality_lb_endpoint_,
                                                   target->lb_endpoint_, absl::nullopt);
  std::unordered_set<std::string> host_addresses;
I'm a little confused on what the underlying bug actually is here. Can you add more comments so it's easier to understand what this code is doing now? Also, should whatever this is doing actually be done inside the priority state manager so it applies to all discovery types? Is this related to @snowp's comment in the issue around reconciling DNS with EDS behavior? Maybe the comments will help me understand more. Feel free to leave TODOs on what else needs to be done later. Thank you!
Sure, will add more comments here. I think this dedupe here is probably not required anymore, since the underlying problem is really in how we update the internal all_hosts_ data (for EDS and STATIC, the updated hosts consist of all host sets, while for STRICT_DNS they do not; it depends on each DNS resolution). This:
envoy/source/common/upstream/upstream_impl.cc
Lines 1208 to 1213 in e3c0de0
will make sure the priority state manager can actually handle this (no duplications).
@mattklein123 for following up on this issue, I have opened #4590 to track it, based on @snowp's comment. I'll take a look at that later.
mattklein123 left a comment:
Thanks for fixing this. This makes sense to me at a high level but I would really appreciate more comments. Thank you! @snowp can you also review the change itself and any additional comments that you would like added?
  }

  // This makes sure parent_.all_hosts_ is updated with all resolved hosts from all
  // priorities. This reconciliation is required since we check each new host againsts this
typo "againsts"
I think this issue is confusing enough that we should have more explanation. Can you be really explicit about the flow here and why this is necessary? I think I understand but I want to make sure the next person that comes along has a really clear understanding of the bug and how we are fixing it for now. My limited understanding is that previously we would set "all hosts" to just this resolve group, which would lead to logic in the future continuing to add hosts that already exist? That's about where my understanding ends without paging back in all this code and the recent changes. So I think more explanation would be great. :)
I would potentially also reference the opened tracking issue to clean this up in the comment.
The problem was that whenever a target resolved, it would set its hosts as all_hosts_, so that when another target resolved it wouldn't see its own hosts in all_hosts_. It would think that they're new hosts, so it would add them to its host list over and over again.
I'd probably add a comment explaining this exact thing, I think it would provide a decent explanation as to why we're doing this.
@mattklein123 @snowp I have updated the comment accordingly.
snowp left a comment:
Approach looks good to me until we can figure out what we want to do with this longer term.
  //
  // TODO(dio): The uniqueness of a host address resolved in STRICT_DNS cluster per priority is not
  // guaranteed. Need a clear agreement on the behavior here, whether it is allowable to have
  // duplicated hosts inside a priority. And if we want to enforce this behavior, it should be done
The other open question is whether we want to allow duplicated hosts between priorities. EDS does not allow this.
I think duplicated hosts between priorities could land us in trouble, but I'm not sure if the scenario is plausible. Consider this:
host A: 50% of endpoints, duplicated between priority 0 and 1.
host B: 25% of endpoints, priority 0
host C: 25% of endpoints, priority 1
If a large portion of host A's endpoints are unhealthy and trigger some percentage of traffic to go to the priority 1 hosts, the overall health of priority 1 is affected as well and could trigger failover into an even lower priority. It seems unintuitive and unnecessary to allow duplicate hosts between priorities.
I'd like to hear an argument in the other direction if anyone has a use-case in mind.
This has been brought up in other issues: #4280 (comment)
The issue talks about EDS and would probably be supported by the indirection that's later suggested in the issue, but I'm not sure if we could do the same for STRICT_DNS.
mattklein123 left a comment:
Thank you! The comment is super clear. Great work!
Signed-off-by: Dhi Aurrahman <dio@tetrate.io> Signed-off-by: Aaltan Ahmad <aa@stripe.com>
Description:
To make sure all_hosts_ is updated correctly with all hosts from all priorities, and no rebuild happens if the update is the same.
This #4548 seems to happen only for STRICT_DNS clusters.
Risk Level: Medium
Testing: unit tests checking update_no_rebuild when hosts_change equals false, existing tests, manual tests.
Docs Changes: N/A
Release Notes: N/A
Fixes #4548
Signed-off-by: Dhi Aurrahman dio@tetrate.io