upstream: rebuild cluster when health check config is changed#4075
upstream: rebuild cluster when health check config is changed#4075htuch merged 21 commits intoenvoyproxy:masterfrom dio:update-health-check-address
Conversation
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
| if (*(*i)->healthCheckAddress() != *host->healthCheckAddress()) { | ||
| // If the health check address is updated, set the current host's health check address | ||
| // using the updated one. | ||
| (*i)->setHealthCheckAddress(host->healthCheckAddress()); |
There was a problem hiding this comment.
Although this is safe from a locking perspective since HC is only on the main thread, it seems like the HC address should be an immutable property of a host? How often does it really change? Did we consider making this a thing that causes a host to get removed/re-added? (Basically having the host check pivot off of normal address and HC address)
There was a problem hiding this comment.
should be an immutable property of a host
Yeah, on the second thought this PR seems like a quick hack for me.
How often does it really change? Did we consider making this a thing that causes a host to get removed/re-added?
In practice probably not that often. Will update accordingly to make it as a case of causing a host to get removed/re-added.
There was a problem hiding this comment.
Yeah, makes sense to do a full add/remove, this code is already crazy complicated ☣️
…ck-address Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
| i = current_hosts.erase(i); | ||
| found = true; | ||
| } else { | ||
| if (address_matched && health_check_changed) { |
There was a problem hiding this comment.
Sorry little confused, why is this change necessary? Same below? Seems like just pivoting on address + HC address should be enough above?
There was a problem hiding this comment.
Tried by only introducing the change above. But from my test, without this, it leads to keeping the old hosts.
There was a problem hiding this comment.
OK, can you take a look at why that is? This code is getting really complicated and could use some more comments anyway. :)
There was a problem hiding this comment.
Sure. Will try to trace out the flow and craft some comments.
There was a problem hiding this comment.
Referring to the case example by @bplotnick in here: https://github.com/bplotnick/envoy_examples/blob/master/healthcheck_updates/envoy_config/clusters.yaml, the drain_connections_on_host_removal is not being set (as true) for the EDS cluster config (type: EDS). If this set to true, we don't need these two changes. Do you think if we have health checker, we should always set this flag to true?
There was a problem hiding this comment.
These two changes make sure, regardless the flag, we always rebuild and skip keeping healthy (old/current) hosts.
There was a problem hiding this comment.
Heh whoops. That was actually unintentional. In my original configuration where I saw this issue, I had that flag set 😛
There was a problem hiding this comment.
😄 yeah, in that case, we only need to add https://github.com/envoyproxy/envoy/pull/4075/files#diff-cd280606785949980237b2a7f08b7885R872 as suggested by @mattklein123.
There was a problem hiding this comment.
But probably that's the correct behavior? Once the old hosts die, it will be removed eventually.
There was a problem hiding this comment.
I need a little time to grok this conversation thread. I will follow up today or tomorrow. Thank you!
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
| health_checker_ != nullptr && *(*i)->healthCheckAddress() != *host->healthCheckAddress(); | ||
| // If we find a host matched based on address and the health check address is not changed, we | ||
| // keep it. However we do change weight inline so do that here. | ||
| if (address_matched && !health_check_changed) { |
There was a problem hiding this comment.
It's not your fault, but this entire method is completely unreadable to me; it's hard to keep track of what current_hosts represent, which hosts are to be removed, under which conditions we're going to do a rebuild, inplace modify or remove/add. Do you think there's a way to explain to the layman how this works? Each time I reread this method, by head hurts more and more (and I'm partly to blame for sure).
There was a problem hiding this comment.
Yes I agree. This has been an iterative process over time and I agree it's now not very readable and it hurts my head also. I think it would be worth it to have a lot more comments in here, possibly variable name changes, and maybe even functions broken out. @dio since you are in here now do you mind taking a look?
There was a problem hiding this comment.
since you are in here now do you mind taking a look?
Sure, love to have a look 🙂. Will spend some time dive into this.
There was a problem hiding this comment.
It seems there is a quite intensive refactoring in here: #3959
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
|
Thanks to @snowp the code is a lot clearer now. Guess this PR is ready for review. Thanks! |
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
htuch
left a comment
There was a problem hiding this comment.
Looks much nicer, a couple of comments.
test/common/upstream/eds_test.cc
Outdated
| EXPECT_EQ(1UL, stats_.counter("cluster.name.update_no_rebuild").value()); | ||
| } | ||
|
|
||
| TEST_F(EdsTest, EndpointUpdateHealthCheckConfig) { |
There was a problem hiding this comment.
Arguably, you could subclass the test fixture EdsTest and add in a bunch of helper methods to make these tests less verbose; there is a lot of boilerplate that could be refactored away to make the tests more readable that way.
| health_checker_ != nullptr && existing_host_found && | ||
| *existing_host->second->healthCheckAddress() != *host->healthCheckAddress(); | ||
|
|
||
| if (existing_host_found && !health_check_changed) { |
There was a problem hiding this comment.
I think this can still be simplified a bit, since later on, we might add another rebuild condition. I'd rename health_check_changed to no_inplace_host_update. Also, maybe add a comment to the top of this if block explaining what it's doing in terms of hosts insertion/deletion behavior.
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
| health_checker_ != nullptr && existing_host_found && | ||
| *existing_host->second->healthCheckAddress() != *host->healthCheckAddress(); | ||
|
|
||
| // When there is a match and we decided to do in-place update, we potentially updating the |
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
* origin/master: (34 commits) fuzz: fixes oss-fuzz: 9895 (envoyproxy#4189) docs: added developer docs for pprof/tcmalloc testing (envoyproxy#4194) fix sometimes returns response body to HEAD requests (envoyproxy#3985) admin: fix config dump order (envoyproxy#4197) thrift: allow translation between downstream and upstream protocols (envoyproxy#4136) Use local_info.node() instead of bootstrap.node() whenever possible (envoyproxy#4120) upstream: allow extension protocol options to be used with http filters (envoyproxy#4165) [thrift_proxy] Replace local HeaderMap with Http::HeaderMap[Impl] (envoyproxy#4169) docs: improve gRPC-JSON filter doc (envoyproxy#4184) stats: refactoring MetricImpl without strings (envoyproxy#4190) fuzz: coverage profile generation instructions. (envoyproxy#4193) upstream: rebuild cluster when health check config is changed (envoyproxy#4075) build: use clang-6.0. (envoyproxy#4168) thrift_proxy: introduce header transport (envoyproxy#4082) tcp: allow connection pool callers to store protocol state (envoyproxy#4131) configs: match latest API changes (envoyproxy#4185) Fix wrong mock function name. (envoyproxy#4187) Bump yaml-cpp so it builds with Visual Studio 15.8 (envoyproxy#4182) Converting envoy configs to V2 (envoyproxy#2957) Add timestamp to HealthCheckEvent definition (envoyproxy#4119) ... Signed-off-by: Snow Pettersen <snowp@squareup.com>
…#1938) This is far from finished, but it reduces memory usage by ~10%. Pulling the following changes from github.com/envoyproxy/envoy: c1cc68d stats: refactoring MetricImpl without strings (envoyproxy#4190) 36809d8 fuzz: coverage profile generation instructions. (envoyproxy#4193) ba40cc9 upstream: rebuild cluster when health check config is changed (envoyproxy#4075) 05c0d52 build: use clang-6.0. (envoyproxy#4168) 01f403e thrift_proxy: introduce header transport (envoyproxy#4082) 564d256 tcp: allow connection pool callers to store protocol state (envoyproxy#4131) 3e1d643 configs: match latest API changes (envoyproxy#4185) bc6a10c Fix wrong mock function name. (envoyproxy#4187) e994c1c Bump yaml-cpp so it builds with Visual Studio 15.8 (envoyproxy#4182) 3d1325e Converting envoy configs to V2 (envoyproxy#2957) 8d0680f Add timestamp to HealthCheckEvent definition (envoyproxy#4119) 497efb9 server: handle non-EnvoyExceptions safely if thrown in constructor. (envoyproxy#4173) 6d6fafd config: strengthen validation for gRPC config sources. (envoyproxy#4171) 132302c fuzz: reduce log level when running under fuzz engine. (envoyproxy#4161) 7c04ac2 test: fix V6EmptyOptions in coverage with IPv6 enable (envoyproxy#4155) 1b2219b ci: remove deprecated bazel --batch option (envoyproxy#4166) 2db6a4c ci: update clang to version 6.0 in the Ubuntu build image. (envoyproxy#4157) 71152b7 ratelimit: Add ratelimit custom response headers (envoyproxy#4015) 3062874 ssl: make Ssl::Connection const everywhere (envoyproxy#4179) 706e262 Handle validation of non expiring tokens in jwt_authn filter (envoyproxy#4007) f06e958 fuzz: tag trivial fuzzers with no_fuzz. (envoyproxy#4156) 27fb1d3 thrift_proxy: add service name matching to router implementation (envoyproxy#4130) 8c189a5 Make over provisioning factor configurable (envoyproxy#4003) 6c08fb4 Making test less flaky (envoyproxy#4149) 592775b fuzz: bare bones HCM fuzzer. (envoyproxy#4118) For istio/istio#7912. Signed-off-by: Piotr Sikora <piotrsikora@google.com>

Description:
This PR makes sure changing health check address triggers rebuilding the cluster.
Risk Level: medium
Testing: unit, manual as depicted by @bplotnick in #4070.
Docs Changes: N/A
Release Notes: N/A
Fixes #4070
Signed-off-by: Dhi Aurrahman dio@tetrate.io