
upstream: make sure all_hosts_ is updated correctly #4575

Merged

mattklein123 merged 13 commits into envoyproxy:master from dio:fix-4548 on Oct 4, 2018

Conversation

@dio (Member) commented Oct 1, 2018

Description:
Make sure all_hosts_ is updated correctly with all hosts from all priorities, and skip the rebuild when the update leaves the hosts unchanged.

Issue #4548 seems to happen only for STRICT_DNS clusters.

Risk Level: Medium
Testing: unit tests checking update_no_rebuild when hosts_changed is false; existing tests; manual tests.
Docs Changes: N/A
Release Notes: N/A

Fixes #4548

Signed-off-by: Dhi Aurrahman dio@tetrate.io

dio added 3 commits October 1, 2018 19:58
@mattklein123 (Member) commented:

@snowp can you help take a look at this and the related issue? Much appreciated.

@snowp (Contributor) commented Oct 1, 2018 via email

parent_.updateAllHosts(hosts_added, hosts_removed, locality_lb_endpoint_.priority());
}

// TODO(dio): Not sure if we should do this if active health checking is disabled.
A Contributor commented on the diff:

We should; the all_hosts_ mapping is also used to determine whether the new hosts match within the same priority, which affects hosts_changed.

@snowp (Contributor) commented Oct 1, 2018

Part of my comment in #4548 that's relevant to this PR:

I think we probably want to not use the updated_hosts mechanism for STRICT_EDS (since it's based around updating an entire priority) and instead keep track of what hosts were resolved per resolve target. With that, we could compute the diff for each resolve target after resolution and update all_hosts_ based on that (allowing cross priority moves).

The reason we're seeing duplicate addresses is because we fail to match a new address to an existing one, which is caused by the fact that we reset the all_hosts_ map between resolves. While I think simply de-duping the addresses will solve the leak and risk of DDOS, it will cause unnecessary "cluster changed" triggers and priority moves would reset the HC status. Ideally we'd fix the core issue, but if we want to rush this in for 1.8 (I'm not sure when that's going to get cut?) I'd be okay with de-duping as a stop gap.

@dio (Member, Author) commented Oct 1, 2018

Yeah, I think I'll add an update_no_rebuild counter to this to actually monitor the updates.

From my quick experiment, with a CDS response similar to the one in #4548:

cluster.service1.update_attempt: 34
cluster.service1.update_empty: 0
cluster.service1.update_failure: 0
cluster.service1.update_no_rebuild: 32
cluster.service1.update_success: 34

dio added 2 commits October 2, 2018 06:42
@dio changed the title from "[WIP] upstream: make sure we have no duplicated hosts" to "upstream: make sure we have no duplicated hosts" on Oct 2, 2018
dio added 2 commits October 2, 2018 07:32
@mattklein123 (Member) commented:

cc @tonya11en who I think is also hitting this at Lyft.

host_addresses.insert((*i)->address()->asString());
++i;
} else {
i = target->hosts_.erase(i);
A Member commented on the diff:

In the case of multiple instances of the same host, you could probably save yourself a few calls to erase if you just used std::remove_if(i, hosts_.end(), XXXX). Your predicate function would just check for the existence of each element in host_addresses. std::remove_if also has the same complexity as a vector erase.

A Member replied:

remove_if doesn't work on associative containers, so that won't work.

A Member replied:

@lizan target->hosts_ is a vector, so it will work. Which container are you referring to?

A Member replied:

Ah, ignore my comment above; I thought it was about host_addresses (an unordered_set).

dio added 2 commits October 3, 2018 09:39
@mattklein123 mattklein123 self-assigned this Oct 3, 2018
if (target->locality_lb_endpoint_.priority() == current_priority) {
priority_state_manager.registerHostForPriority(host, target->locality_lb_endpoint_,
target->lb_endpoint_, absl::nullopt);
std::unordered_set<std::string> host_addresses;
A Member commented on the diff:

I'm a little confused on what the underlying bug actually is here. Can you add more comments so it's easier to understand what this code is doing now? Also, should whatever this is doing actually be done inside the priority state manager so it applies to all discovery types? Is this related to @snowp's comment in the issue around reconciling DNS with EDS behavior? Maybe the comments will help me understand more. Feel free to leave TODOs on what else needs to be done later. Thank you!

@dio (Member, Author) commented Oct 3, 2018:

Sure, I'll add more comments here. Although I think this dedupe is probably no longer required, since the underlying problem is really in how we update the internal all_hosts_ data (for EDS (STATIC), the updated hosts consist of all host sets, while for STRICT_DNS they depend on each DNS resolution). This:

for (const auto& set : parent_.prioritySet().hostSetsPerPriority()) {
  for (const auto& host : set->hosts()) {
    updated_hosts.insert({host->address()->asString(), host});
  }
}
parent_.updateHostMap(std::move(updated_hosts));
should remedy the issue.

Will make sure the priority state manager can actually handle this (no duplications).

@dio (Member, Author) commented:

@mattklein123 to follow up on this issue, I have opened #4590 to track it. This is based on @snowp's comment. I'll take a look at that later.

@dio changed the title from "upstream: make sure we have no duplicated hosts" to "upstream: make sure all_hosts_ is updated correctly" on Oct 3, 2018
@mattklein123 (Member) left a comment:

Thanks for fixing this. This makes sense to me at a high level but I would really appreciate more comments. Thank you! @snowp can you also review the change itself and any additional comments that you would like added?

}

// This makes sure parent_.all_hosts_ is updated with all resolved hosts from all
// priorities. This reconciliation is required since we check each new host againsts this
@mattklein123 (Member) commented Oct 3, 2018:

typo "againsts"

I think this issue is confusing enough that we should have more explanation. Can you be really explicit about the flow here and why this is necessary? I think I understand but I want to make sure the next person that comes along has a really clear understanding of the bug and how we are fixing it for now. My limited understanding is that previously we would set "all hosts" to just this resolve group, which would lead to logic in the future continuing to add hosts that already exist? That's about where my understanding ends without paging back in all this code and the recent changes. So I think more explanation would be great. :)

I would potentially also reference the opened tracking issue to clean this up in the comment.

A Contributor replied:

The problem was that whenever a target resolved, it would set its hosts as all_hosts_, so that when another target resolved it wouldn't see its own hosts in all_hosts_. It would think that they're new hosts, so it would add them to its host list over and over again.

I'd probably add a comment explaining this exact thing, I think it would provide a decent explanation as to why we're doing this.

The Member Author replied:

@mattklein123 @snowp I have updated the comment accordingly.

@snowp (Contributor) left a comment:

Approach looks good to me until we can figure out what we want to do with this longer term.

}

// This makes sure parent_.all_hosts_ is updated with all resolved hosts from all
// priorities. This reconciliation is required since we check each new host againsts this
//
// TODO(dio): The uniqueness of a host address resolved in STRICT_DNS cluster per priority is not
// guaranteed. Need a clear agreement on the behavior here, whether it is allowable to have
// duplicated hosts inside a priority. And if we want to enforce this behavior, it should be done
A Contributor commented on the diff:

The other open question is whether we want to allow duplicated hosts between priorities. EDS does not allow this.

A Member replied:

I think duplicated hosts between priorities could land us in trouble, but I'm not sure if the scenario is plausible. Consider this:

host A: 50% of endpoints, duplicated between priority 0 and 1.
host B: 25% of endpoints, priority 0
host C: 25% of endpoints, priority 1

If a large portion of host A's endpoints are unhealthy and trigger some percentage of traffic to go to the priority 1 hosts, the overall health of priority 1 is affected as well and could trigger failover into an even lower priority. It seems unintuitive and unnecessary to allow duplicate hosts between priorities.

I'd like to hear an argument in the other direction if anyone has a use-case in mind.

A Contributor replied:

This has been brought up in other issues: #4280 (comment)

The issue talks about EDS and would probably be supported by the indirection that's later suggested in the issue, but I'm not sure if we could do the same for STRICT_DNS.

@mattklein123 (Member) left a comment:

Thank you! The comment is super clear. Great work!

@mattklein123 mattklein123 merged commit 0f42fd6 into envoyproxy:master Oct 4, 2018
aa-stripe pushed a commit to aa-stripe/envoy that referenced this pull request Oct 11, 2018
Signed-off-by: Aaltan Ahmad <aa@stripe.com>