network: support draining connections after triggered DNS refresh#2225
network: support draining connections after triggered DNS refresh#2225
Conversation
Signed-off-by: Mike Schore <mike.schore@gmail.com>
|
Some C++ tests need to be added/fixed and I'm going to be renaming the public |
|
Do you want to pass in a host predicate like was exposed in envoyproxy/envoy#21074? |
+1 seems like we should use the predicate? I think from the cache you can actually get all host names, so you could even get all the hosts pre-refresh, put them in a map, and then just drain each host when you get any type of resolution completion. |
I was considering doing this first at least over the weekend, since there's some extra complexity with individual host predicates (I think we either need to do @mattklein123's suggestion and track each individually and/or define a time window to watch for resolutions so we don't continue to drain indefinitely). I agree we'd still want to iterate to eventually using host predicates, but this simpler version could perhaps be easier to test and reason about initially. One thing I could do though is log which host resolution we see that triggers the drain. |
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Augustyniak
left a comment
There was a problem hiding this comment.
I think that we should strongly consider supporting draining connection using the predicate thing that @mattklein123 added - I know that you mentioned that you were planning to work on it.
As it is now, the reliability of us draining connections (arguably) goes down significantly one we drain connection after the first DNS resolution comes back. That's true especially with the number of hosts (often >= 4) that we deal with in Android Driver app. In theory we could ship this 'simple' version first and improve it later but with our long iteration cycle and limited visibility into the state of user sessions (and the high cost of looking at them) I think that we should strongly consider shipping refresh + drain connections in its 'final' form from the get go.
The current version is harder to investigate than the 'final' one as it's going to be harder to say whether our draining of connections was performed at the right time or not. Obviously we can check this on session by session basis (if we add more events) but it's still going to be hard (if not impossible) to tell how we are doing in general.
| Thread::LockGuard lock{network_state_.mutex_}; | ||
| if (network_state_.network_ == network) { | ||
| // Provide a non-current key preventing further scheduled effects (e.g. DNS refresh). | ||
| return network_state_.configuration_key_ - 1; |
There was a problem hiding this comment.
Can we log something in here? using ENVOY_LOG_EVENT
There was a problem hiding this comment.
I think it will honestly just add volume/noise if we log here, given we already log all network updates.
| if (enable_drain_post_dns_refresh_ && pending_drain_) { | ||
| pending_drain_ = false; | ||
| if (status == Network::DnsResolver::ResolutionStatus::Success) { | ||
| cluster_manager_.drainConnections(); |
There was a problem hiding this comment.
let's add a log here? Unless there is some log inside of drainConnections (maybe using ENVOY_LOG_EVENT?)
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
|
I have a branch where I have work in progress for supporting the predicate-based refresh, but the functionality here is green, covered, and ready to go, so when folks come online Monday, I suggest we merge at least for some initial validation. I should have my PR for predicate support up later in the day. |
|
|
||
| envoy_netconf_t Configurator::setPreferredNetwork(envoy_network_t network) { | ||
| Thread::LockGuard lock{network_state_.mutex_}; | ||
| if (network_state_.network_ == network) { |
There was a problem hiding this comment.
I think that this check will break a DNS a network change update for when we move from no-internet connection to wi-fi connection and NWPathMonitor is disabled. This is because of this logic that does not differentiate between the lack of connection and wi-fi connection https://github.com/envoyproxy/envoy-mobile/blob/main/library/objective-c/EnvoyNetworkMonitor.m#L125-L132
There was a problem hiding this comment.
We can update NetworkPathMonitor so that it properly differentiates between no-connection and wi-fi connection but in general addition of this check does not seem to be related to the changes in this PR and it does increase a risk of the changes in this PR- especially as this functionality is not behind a FF.
There was a problem hiding this comment.
I don't quite understand this change in general? Can you add more comments? We can also discuss in our sync.
There was a problem hiding this comment.
Added further comments, but also disabled this check for now.
mattklein123
left a comment
There was a problem hiding this comment.
Thanks for working on this. Some thoughts before our discussion.
|
|
||
| envoy_netconf_t Configurator::setPreferredNetwork(envoy_network_t network) { | ||
| Thread::LockGuard lock{network_state_.mutex_}; | ||
| if (network_state_.network_ == network) { |
There was a problem hiding this comment.
I don't quite understand this change in general? Can you add more comments? We can also discuss in our sync.
| pending_drain_ = false; | ||
| } else if (!dns_callbacks_handle_) { | ||
| if (auto dns_cache = dnsCache()) { | ||
| dns_callbacks_handle_ = dns_cache->addUpdateCallbacks(*this); |
There was a problem hiding this comment.
For simplicity I think it would be simpler to just always have the callbacks registered but up to you.
There was a problem hiding this comment.
Added clarifying comment to the code here.
| Network::DnsResolver::ResolutionStatus status) { | ||
| if (enable_drain_post_dns_refresh_ && pending_drain_) { | ||
| pending_drain_ = false; | ||
| if (status == Network::DnsResolver::ResolutionStatus::Success) { |
There was a problem hiding this comment.
I'm not sure if we want to gate this on success or not. If we are doing this because of a network switch or background to foreground switch it seems better to just always do it after we tried to refresh DNS?
Can we add more comments somewhere on generally when this is going to fire? Does this change only do this on network switch or are we also doing background to foreground as well?
Also, if we end up leaving this implementation and not doing the per-host version, can you add comments/TODO for the downsides of this implementation? I'm torn on whether it's worth it to do this change or not without the per-host version. It seems not that much harder to me to snap the cache state with all hosts and then just keep track of refresh state for each one.
There was a problem hiding this comment.
Removed the check, and added an additional comment. I think it may be possible to refine this logic to only drain for a specific host when either DNS has changed, or we have reason to believe prior connections are probably defunct, but after further consideration, I agree that the simplest thing to do is always trigger drain.
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Description: Expose an option to enable connections draining post dns refresh on Engine iOS builder. This change was accidentally not implemented as part of #2225. Risk Level: Low, addition of a new functionality. Testing: Unit Docs Changes: Release Notes: updated Signed-off-by: Rafal Augustyniak raugustyniak@lyft.com
| const std::string& host, const Extensions::Common::DynamicForwardProxy::DnsHostInfoSharedPtr&, | ||
| Network::DnsResolver::ResolutionStatus) { | ||
| if (enable_drain_post_dns_refresh_ && pending_drain_) { | ||
| pending_drain_ = false; |
There was a problem hiding this comment.
Is onDnsResolutionComplete guaranteed to always be invoked on the same thread as setDrainPostDnsRefreshEnabled? If not, should we guard this with a mutex?
There was a problem hiding this comment.
Generally speaking, all C++ code in Envoy Mobile should be assumed to run on a single thread with only a very few exceptions. As it happens, I added a clarifying comment on the subject for this class elsewhere in the PR:
https://github.com/envoyproxy/envoy-mobile/pull/2225/files#diff-41a6b7732eaa28a4c8cd16a941d7d7cb70b91428a7a47ff352a1bd9b3c9e5446R66
…rtion * origin/main: (57 commits) network: add enableDrainPostDnsRefresh to iOS (#2242) envoy: update to em-cherry (#2241) network: support draining connections after triggered DNS refresh (#2225) Bump rules_apple to 0.34.2 (#2236) Bump Lyft Support Rotation (#2232) ci: add support for `/retest` command (#2219) format: add SwiftLint to check-format script (#2230) use 64 bit emulators for test (#2228) envoy: update to `d0befbb` & add `h2ExtendKeepaliveTimeout` (#2229) Add Ryan Hamilton to OWNERS.md (#2224) Android cert verifier: first import from chromium/net (#2222) Cleanup: remove the unused stats sink metrics_service (#2220) bazel: move back to symbol mapping table files (#2218) Add new version history section (#2209) build: simplify jnilib copy (#2214) ci: Update macOS version to macOS 12 (#2208) Update releasing.rst (#2200) support: Post Lyft support rotation changes to Slack (#2207) Don't run bump_support_rotation GitHub Action on forks (#2204) Release 0.4.6 (#2201) ... Signed-off-by: JP Simard <jp@jpsim.com>
…ol--fix-documentation-comment-issues * origin/main: Fix `bump_lyft_support_rotation.sh` posting to Slack (#2234) jni: fix mismatched return value types (#2252) Remove bintray as a maven source (#2248) test: refine connection drain test (#2245) cleanup: remove comment in Http::Client that no longer applies (#2203) cleanup: fix test with inaccurate description (#2202) network: perform post-DNS connection drain on a per-host basis (#2240) network: add enableDrainPostDnsRefresh to iOS (#2242) envoy: update to em-cherry (#2241) network: support draining connections after triggered DNS refresh (#2225) Bump rules_apple to 0.34.2 (#2236) Bump Lyft Support Rotation (#2232) Signed-off-by: JP Simard <jp@jpsim.com>
* main: envoy: update to efbbb04 (#2258) Update rules_java, rules_detekt, rules_jvm_external (#2249) Upgrade rules_jvm_external from 4.1 -> 4.2 (#2254) CancelProofEnvoyStream (#2250) swift: add DrString tool & fix documentation comment issues (#2233) Fix `bump_lyft_support_rotation.sh` posting to Slack (#2234) jni: fix mismatched return value types (#2252) Remove bintray as a maven source (#2248) test: refine connection drain test (#2245) cleanup: remove comment in Http::Client that no longer applies (#2203) cleanup: fix test with inaccurate description (#2202) network: perform post-DNS connection drain on a per-host basis (#2240) network: add enableDrainPostDnsRefresh to iOS (#2242) envoy: update to em-cherry (#2241) network: support draining connections after triggered DNS refresh (#2225) Signed-off-by: JP Simard <jp@jpsim.com>
Description: Adds the facility to drain connections after the resolution of a soft DNS refresh. A refresh may be triggered directly via the Engine API, or as a result of a network status update provided by the OS. Draining connections does not interrupt existing connections or requests, but will establish new connections for any further requests.
Risk Level: Moderate
Testing: New & Updated Coverage in Builder, Config, and Network::Configurator
Signed-off-by: Mike Schore mike.schore@gmail.com