
upstream: make sure all_hosts_ is updated correctly #4575

Merged

mattklein123 merged 13 commits into envoyproxy:master from dio:fix-4548 on Oct 4, 2018

Conversation

@dio (Member) commented Oct 1, 2018

Description:
Make sure all_hosts_ is updated correctly with all hosts from all priorities, and skip the rebuild when the update leaves the hosts unchanged.

Issue #4548 seems to happen only for STRICT_DNS clusters.

Risk Level: Medium
Testing: unit tests checking update_no_rebuild when hosts_changed is false; existing tests; manual tests.
Docs Changes: N/A
Release Notes: N/A

Fixes #4548

Signed-off-by: Dhi Aurrahman dio@tetrate.io

dio added 3 commits October 1, 2018 19:58
@mattklein123 (Member) commented:

@snowp can you help take a look at this and the related issue? Much appreciated.

@snowp (Contributor) commented Oct 1, 2018 via email

parent_.updateAllHosts(hosts_added, hosts_removed, locality_lb_endpoint_.priority());
}

// TODO(dio): Not sure if we should do this if active health checking is disabled.
A Contributor commented on the diff:

We should; the all_hosts_ mapping is also used to determine whether the new hosts match within the same priority, which affects hosts_changed.

@snowp (Contributor) commented Oct 1, 2018

Part of my comment in #4548 that's relevant to this PR:

I think we probably want to not use the updated_hosts mechanism for STRICT_EDS (since it's based around updating an entire priority) and instead keep track of what hosts were resolved per resolve target. With that, we could compute the diff for each resolve target after resolution and update all_hosts_ based on that (allowing cross priority moves).

The reason we're seeing duplicate addresses is because we fail to match a new address to an existing one, which is caused by the fact that we reset the all_hosts_ map between resolves. While I think simply de-duping the addresses will solve the leak and risk of DDOS, it will cause unnecessary "cluster changed" triggers and priority moves would reset the HC status. Ideally we'd fix the core issue, but if we want to rush this in for 1.8 (I'm not sure when that's going to get cut?) I'd be okay with de-duping as a stop gap.

@dio (Member, Author) commented Oct 1, 2018

Yeah, I think I'll add an update_no_rebuild counter to this to actually monitor the updates.

From my quick experiment, with a CDS response similar to the one in #4548:

cluster.service1.update_attempt: 34
cluster.service1.update_empty: 0
cluster.service1.update_failure: 0
cluster.service1.update_no_rebuild: 32
cluster.service1.update_success: 34

dio added 2 commits October 2, 2018 06:42
@dio changed the title from "[WIP] upstream: make sure we have no duplicated hosts" to "upstream: make sure we have no duplicated hosts" on Oct 2, 2018
dio added 2 commits October 2, 2018 07:32
@mattklein123 (Member) commented:

cc @tonya11en who I think is also hitting this at Lyft.

host_addresses.insert((*i)->address()->asString());
++i;
} else {
i = target->hosts_.erase(i);
A Member commented on the diff:

In the case of multiple instances of the same host, you could probably save yourself a few calls to erase if you just used std::remove_if(i, hosts_.end(), XXXX). Your predicate function would just check for the existence of each element in host_addresses. std::remove_if also has the same complexity as a vector erase.

A Member replied:

remove_if doesn't work on associative containers, so that won't work.

A Member replied:

@lizan target->hosts_ is a vector, so it will work. Which container are you referring to?

A Member replied:

Ah, ignore my comment above; I thought it was about host_addresses (an unordered_set).

dio added 2 commits October 3, 2018 09:39
@mattklein123 mattklein123 self-assigned this Oct 3, 2018
if (target->locality_lb_endpoint_.priority() == current_priority) {
priority_state_manager.registerHostForPriority(host, target->locality_lb_endpoint_,
target->lb_endpoint_, absl::nullopt);
std::unordered_set<std::string> host_addresses;
A Member commented on the diff:

I'm a little confused on what the underlying bug actually is here. Can you add more comments so it's easier to understand what this code is doing now? Also, should whatever this is doing actually be done inside the priority state manager so it applies to all discovery types? Is this related to @snowp's comment in the issue around reconciling DNS with EDS behavior? Maybe the comments will help me understand more. Feel free to leave TODOs on what else needs to be done later. Thank you!

@dio (Member, Author) commented Oct 3, 2018:

Sure, I'll add more comments here. Although I think this dedupe is probably no longer required, since the underlying problem is really in how we update the internal all_hosts_ data (for EDS (STATIC), the updated hosts consist of all host sets, while for STRICT_DNS they depend on each DNS resolution). This:

for (const auto& set : parent_.prioritySet().hostSetsPerPriority()) {
  for (const auto& host : set->hosts()) {
    updated_hosts.insert({host->address()->asString(), host});
  }
}
parent_.updateHostMap(std::move(updated_hosts));
should remedy the issue.

Will make sure the priority state manager can actually handle this (no duplications).

@dio (Member, Author) commented:

@mattklein123 to follow up on this issue, I have opened #4590 to track it. This is based on @snowp's comment. I'll take a look at that later.

@dio changed the title from "upstream: make sure we have no duplicated hosts" to "upstream: make sure all_hosts_ is updated correctly" on Oct 3, 2018
@mattklein123 (Member) left a comment:

Thanks for fixing this. This makes sense to me at a high level but I would really appreciate more comments. Thank you! @snowp can you also review the change itself and any additional comments that you would like added?

}

// This makes sure parent_.all_hosts_ is updated with all resolved hosts from all
// priorities. This reconciliation is required since we check each new host againsts this
@mattklein123 (Member) commented Oct 3, 2018:

typo "againsts"

I think this issue is confusing enough that we should have more explanation. Can you be really explicit about the flow here and why this is necessary? I think I understand but I want to make sure the next person that comes along has a really clear understanding of the bug and how we are fixing it for now. My limited understanding is that previously we would set "all hosts" to just this resolve group, which would lead to logic in the future continuing to add hosts that already exist? That's about where my understanding ends without paging back in all this code and the recent changes. So I think more explanation would be great. :)

I would potentially also reference the opened tracking issue to clean this up in the comment.

A Contributor replied:

The problem was that whenever a target resolved, it would set its hosts as all_hosts_, so that when another target resolved it wouldn't see its own hosts in all_hosts_. It would think that they're new hosts, so it would add them to its host list over and over again.

I'd probably add a comment explaining this exact thing, I think it would provide a decent explanation as to why we're doing this.

The Member Author replied:

@mattklein123 @snowp I have updated the comment accordingly.

@snowp (Contributor) left a comment:

Approach looks good to me until we can figure out what we want to do with this longer term.

}

// This makes sure parent_.all_hosts_ is updated with all resolved hosts from all
// priorities. This reconciliation is required since we check each new host againsts this
//
// TODO(dio): The uniqueness of a host address resolved in STRICT_DNS cluster per priority is not
// guaranteed. Need a clear agreement on the behavior here, whether it is allowable to have
// duplicated hosts inside a priority. And if we want to enforce this behavior, it should be done
A Contributor commented on the diff:

The other open question is whether we want to allow duplicated hosts between priorities. EDS does not allow this.

A Member replied:

I think duplicated hosts between priorities could land us in trouble, but I'm not sure if the scenario is plausible. Consider this:

host A: 50% of endpoints, duplicated between priority 0 and 1.
host B: 25% of endpoints, priority 0
host C: 25% of endpoints, priority 1

If a large portion of host A's endpoints are unhealthy and trigger some percentage of traffic to go to the priority 1 hosts, the overall health of priority 1 is affected as well and could trigger failover into an even lower priority. It seems unintuitive and unnecessary to allow duplicate hosts between priorities.

I'd like to hear an argument in the other direction if anyone has a use-case in mind.

A Contributor replied:

This has been brought up in other issues: #4280 (comment)

The issue talks about EDS and would probably be supported by the indirection that's later suggested in the issue, but I'm not sure if we could do the same for STRICT_DNS.

@mattklein123 (Member) left a comment:

Thank you! The comment is super clear. Great work!

@mattklein123 mattklein123 merged commit 0f42fd6 into envoyproxy:master Oct 4, 2018
aa-stripe pushed a commit to aa-stripe/envoy that referenced this pull request Oct 11, 2018
Signed-off-by: Aaltan Ahmad <aa@stripe.com>