network: support draining connections after triggered DNS refresh by goaway · Pull Request #2225 · envoyproxy/envoy-mobile

goaway · 2022-04-28T19:08:16Z

Description: Adds the facility to drain connections after the resolution of a soft DNS refresh. A refresh may be triggered directly via the Engine API, or as a result of a network status update provided by the OS. Draining connections does not interrupt existing connections or requests, but will establish new connections for any further requests.
Risk Level: Moderate
Testing: New & Updated Coverage in Builder, Config, and Network::Configurator

Signed-off-by: Mike Schore mike.schore@gmail.com

goaway · 2022-04-29T19:59:05Z

Some C++ tests need to be added/fixed, and I'm going to rename the public drainConnections call to something else. Other than that, this should be ready for someone to at least look at the logic of what's happening.

jpsim · 2022-04-29T20:05:33Z

Do you want to pass in a host predicate like was exposed in envoyproxy/envoy#21074?

mattklein123 · 2022-04-29T20:39:35Z

Do you want to pass in a host predicate like was exposed in envoyproxy/envoy#21074?

+1 seems like we should use the predicate? I think from the cache you can actually get all host names, so you could even get all the hosts pre-refresh, put them in a map, and then just drain each host when you get any type of resolution completion.

goaway · 2022-04-29T22:31:24Z

Do you want to pass in a host predicate like was exposed in envoyproxy/envoy#21074?

I was considering doing this first at least over the weekend, since there's some extra complexity with individual host predicates (I think we either need to do @mattklein123's suggestion and track each individually and/or define a time window to watch for resolutions so we don't continue to drain indefinitely).

I agree we'd still want to iterate to eventually using host predicates, but this simpler version could perhaps be easier to test and reason about initially. One thing I could do though is log which host resolution we see that triggers the drain.
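For reference, the per-host idea discussed above (snapshot the cached hosts before the refresh, then drain each host when its resolution completes) could look roughly like the sketch below. This is only an illustration under stated assumptions: `PendingHostDrains`, `beginRefresh`, and `onResolutionComplete` are hypothetical names, a set stands in for the map mentioned above, and the drain itself is an injected callback rather than any real Envoy call.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Envoy Mobile: remembers which hosts still
// need a drain after a triggered DNS refresh.
class PendingHostDrains {
public:
  using DrainFn = std::function<void(const std::string& host)>;

  explicit PendingHostDrains(DrainFn drain) : drain_(std::move(drain)) {
    assert(drain_ && "a drain callback is required");
  }

  // Snapshot the hosts known to the cache before the refresh is kicked off.
  // Starting a new refresh discards whatever was still pending from the last one.
  void beginRefresh(const std::vector<std::string>& cached_hosts) {
    pending_.clear();
    pending_.insert(cached_hosts.begin(), cached_hosts.end());
  }

  // Call on every resolution completion, whatever its status.
  // Returns true if this completion triggered a drain for the host.
  bool onResolutionComplete(const std::string& host) {
    if (pending_.erase(host) == 0) {
      return false; // Not in the snapshot, or already drained once.
    }
    drain_(host);
    return true;
  }

  // True once every snapshotted host has been drained.
  bool idle() const { return pending_.empty(); }

private:
  DrainFn drain_;
  std::unordered_set<std::string> pending_;
};
```

Because each host is erased from the snapshot when it is drained, later resolutions for that host do not drain again, which is one way to avoid draining indefinitely without a separate time window. Hosts that never complete simply stay pending until the next `beginRefresh`.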

Augustyniak

I think that we should strongly consider supporting draining connections using the host predicate that @mattklein123 added. I know you mentioned that you were planning to work on it.

As it is now, the reliability of our connection draining (arguably) goes down significantly once we drain connections after the first DNS resolution comes back. That's especially true given the number of hosts (often >= 4) that we deal with in the Android Driver app. In theory we could ship this 'simple' version first and improve it later, but with our long iteration cycle and limited visibility into the state of user sessions (and the high cost of looking at them), I think that we should strongly consider shipping refresh + drain connections in its 'final' form from the get-go.

The current version is harder to investigate than the 'final' one, as it's going to be harder to say whether our draining of connections was performed at the right time or not. Obviously we can check this on a session-by-session basis (if we add more events), but it's still going to be hard (if not impossible) to tell how we are doing in general.

Augustyniak · 2022-04-29T22:37:19Z

library/common/network/configurator.cc

  Thread::LockGuard lock{network_state_.mutex_};
+  if (network_state_.network_ == network) {
+    // Provide a non-current key preventing further scheduled effects (e.g. DNS refresh).
+    return network_state_.configuration_key_ - 1;


Can we log something here, using ENVOY_LOG_EVENT?

I think it will honestly just add volume/noise if we log here, given we already log all network updates.

Augustyniak · 2022-04-29T22:38:51Z

library/common/network/configurator.cc

+  if (enable_drain_post_dns_refresh_ && pending_drain_) {
+    pending_drain_ = false;
+    if (status == Network::DnsResolver::ResolutionStatus::Success) {
+      cluster_manager_.drainConnections();


Let's add a log here, unless there is already a log inside drainConnections (maybe using ENVOY_LOG_EVENT?).

goaway · 2022-05-02T13:01:37Z

I have a branch where I have work in progress for supporting the predicate-based refresh, but the functionality here is green, covered, and ready to go, so when folks come online Monday, I suggest we merge at least for some initial validation.

I should have my PR for predicate support up later in the day.

Augustyniak · 2022-05-02T13:08:57Z

library/common/network/configurator.cc


 envoy_netconf_t Configurator::setPreferredNetwork(envoy_network_t network) {
  Thread::LockGuard lock{network_state_.mutex_};
+  if (network_state_.network_ == network) {


I think that this check will break the network change update (and the DNS refresh it triggers) when we move from no internet connection to a Wi-Fi connection and NWPathMonitor is disabled. This is because the following logic does not differentiate between the lack of a connection and a Wi-Fi connection: https://github.com/envoyproxy/envoy-mobile/blob/main/library/objective-c/EnvoyNetworkMonitor.m#L125-L132

We can update NetworkPathMonitor so that it properly differentiates between no connection and a Wi-Fi connection, but in general the addition of this check does not seem to be related to the changes in this PR, and it does increase the risk of this PR, especially as this functionality is not behind a FF.

I don't quite understand this change in general? Can you add more comments? We can also discuss in our sync.

Added further comments, but also disabled this check for now.
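To make the check under discussion easier to follow, here is a minimal, self-contained sketch of the idea in the diff fragment: when the reported network is unchanged, the caller gets back a key that is not the current one, so any follow-up work scheduled against that key is skipped. `NetConf`, `NetworkType`, `NetworkState`, and `isCurrent` are simplified stand-ins invented for this illustration, the 16-bit key width is an assumption, and the mutex from the real code is omitted.

```cpp
#include <cstdint>

// Simplified stand-ins for illustration; this is not the real Configurator.
using NetConf = std::uint16_t; // Assumed key width.
enum class NetworkType { Generic, Wlan, Wwan };

class NetworkState {
public:
  // Returns a key identifying the configuration produced by this update.
  NetConf setPreferredNetwork(NetworkType network) {
    if (network_ == network) {
      // No change: hand back a non-current key so that effects scheduled
      // against it (e.g. a DNS refresh) are not carried out.
      return static_cast<NetConf>(key_ - 1);
    }
    network_ = network;
    return ++key_;
  }

  // Follow-up work should only proceed if its key is still the current one.
  bool isCurrent(NetConf key) const { return key == key_; }

private:
  NetworkType network_{NetworkType::Generic};
  NetConf key_{0};
};
```

The concern raised above maps directly onto the equality test: if two distinct real-world states (no connection and Wi-Fi) are reported as the same `NetworkType`, the transition between them is treated as "no change" and the refresh never runs.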

mattklein123

Thanks for working on this. Some thoughts before our discussion.

mattklein123 · 2022-05-02T17:13:17Z

library/common/network/configurator.cc


 envoy_netconf_t Configurator::setPreferredNetwork(envoy_network_t network) {
  Thread::LockGuard lock{network_state_.mutex_};
+  if (network_state_.network_ == network) {


I don't quite understand this change in general? Can you add more comments? We can also discuss in our sync.

mattklein123 · 2022-05-02T17:13:44Z

library/common/network/configurator.cc

+    pending_drain_ = false;
+  } else if (!dns_callbacks_handle_) {
+    if (auto dns_cache = dnsCache()) {
+      dns_callbacks_handle_ = dns_cache->addUpdateCallbacks(*this);


I think it would be simpler to just always have the callbacks registered, but up to you.

Added clarifying comment to the code here.

mattklein123 · 2022-05-02T17:15:35Z

library/common/network/configurator.cc

+    Network::DnsResolver::ResolutionStatus status) {
+  if (enable_drain_post_dns_refresh_ && pending_drain_) {
+    pending_drain_ = false;
+    if (status == Network::DnsResolver::ResolutionStatus::Success) {


I'm not sure if we want to gate this on success or not. If we are doing this because of a network switch or background to foreground switch it seems better to just always do it after we tried to refresh DNS?

Can we add more comments somewhere on generally when this is going to fire? Does this change only do this on network switch or are we also doing background to foreground as well?

Also, if we end up leaving this implementation and not doing the per-host version, can you add comments/TODO for the downsides of this implementation? I'm torn on whether it's worth it to do this change or not without the per-host version. It seems not that much harder to me to snap the cache state with all hosts and then just keep track of refresh state for each one.

Removed the check, and added an additional comment. I think it may be possible to refine this logic to only drain for a specific host when either DNS has changed, or we have reason to believe prior connections are probably defunct, but after further consideration, I agree that the simplest thing to do is always trigger drain.
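The behavior settled on here (arm a drain when a refresh is triggered, then drain once on the next resolution completion regardless of its status) can be sketched as below. This is an illustration only: `DrainAfterRefresh`, `onRefreshTriggered`, and the local `ResolutionStatus` enum are hypothetical stand-ins, and the drain is an injected callback rather than a real cluster manager call.

```cpp
#include <functional>
#include <utility>

// Local stand-in for the resolver's status type; values are illustrative.
enum class ResolutionStatus { Success, Failure };

// Hypothetical one-shot drain trigger, not the real Configurator.
class DrainAfterRefresh {
public:
  explicit DrainAfterRefresh(std::function<void()> drain_all)
      : drain_all_(std::move(drain_all)) {}

  void setEnabled(bool enabled) {
    enabled_ = enabled;
    if (!enabled_) {
      pending_ = false; // Disabling also cancels any armed drain.
    }
  }

  // Call when a DNS refresh is triggered (API call or network change).
  void onRefreshTriggered() { pending_ = enabled_; }

  // Call for every resolution completion. The status is deliberately
  // ignored: the drain fires once after the refresh attempt either way.
  void onDnsResolutionComplete(ResolutionStatus /*status*/) {
    if (!enabled_ || !pending_) {
      return;
    }
    pending_ = false;
    drain_all_();
  }

private:
  std::function<void()> drain_all_;
  bool enabled_{false};
  bool pending_{false};
};
```

Note that this version drains everything on the first completion it sees, which is exactly the limitation the per-host follow-up is meant to address.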

Description: Expose an option to enable connections draining post dns refresh on Engine iOS builder. This change was accidentally not implemented as part of #2225.
Risk Level: Low, addition of a new functionality.
Testing: Unit
Docs Changes:
Release Notes: updated
Signed-off-by: Rafal Augustyniak raugustyniak@lyft.com

jpsim · 2022-05-03T19:05:22Z

library/common/network/configurator.cc

+    const std::string& host, const Extensions::Common::DynamicForwardProxy::DnsHostInfoSharedPtr&,
+    Network::DnsResolver::ResolutionStatus) {
+  if (enable_drain_post_dns_refresh_ && pending_drain_) {
+    pending_drain_ = false;


Is onDnsResolutionComplete guaranteed to always be invoked on the same thread as setDrainPostDnsRefreshEnabled? If not, should we guard this with a mutex?

Generally speaking, all C++ code in Envoy Mobile should be assumed to run on a single thread with only a very few exceptions. As it happens, I added a clarifying comment on the subject for this class elsewhere in the PR:
https://github.com/envoyproxy/envoy-mobile/pull/2225/files#diff-41a6b7732eaa28a4c8cd16a941d7d7cb70b91428a7a47ff352a1bd9b3c9e5446R66

…rtion * origin/main: (57 commits) network: add enableDrainPostDnsRefresh to iOS (#2242) envoy: update to em-cherry (#2241) network: support draining connections after triggered DNS refresh (#2225) Bump rules_apple to 0.34.2 (#2236) Bump Lyft Support Rotation (#2232) ci: add support for `/retest` command (#2219) format: add SwiftLint to check-format script (#2230) use 64 bit emulators for test (#2228) envoy: update to `d0befbb` & add `h2ExtendKeepaliveTimeout` (#2229) Add Ryan Hamilton to OWNERS.md (#2224) Android cert verifier: first import from chromium/net (#2222) Cleanup: remove the unused stats sink metrics_service (#2220) bazel: move back to symbol mapping table files (#2218) Add new version history section (#2209) build: simplify jnilib copy (#2214) ci: Update macOS version to macOS 12 (#2208) Update releasing.rst (#2200) support: Post Lyft support rotation changes to Slack (#2207) Don't run bump_support_rotation GitHub Action on forks (#2204) Release 0.4.6 (#2201) ... Signed-off-by: JP Simard <jp@jpsim.com>

…ol--fix-documentation-comment-issues * origin/main: Fix `bump_lyft_support_rotation.sh` posting to Slack (#2234) jni: fix mismatched return value types (#2252) Remove bintray as a maven source (#2248) test: refine connection drain test (#2245) cleanup: remove comment in Http::Client that no longer applies (#2203) cleanup: fix test with inaccurate description (#2202) network: perform post-DNS connection drain on a per-host basis (#2240) network: add enableDrainPostDnsRefresh to iOS (#2242) envoy: update to em-cherry (#2241) network: support draining connections after triggered DNS refresh (#2225) Bump rules_apple to 0.34.2 (#2236) Bump Lyft Support Rotation (#2232) Signed-off-by: JP Simard <jp@jpsim.com>

* main: envoy: update to efbbb04 (#2258) Update rules_java, rules_detekt, rules_jvm_external (#2249) Upgrade rules_jvm_external from 4.1 -> 4.2 (#2254) CancelProofEnvoyStream (#2250) swift: add DrString tool & fix documentation comment issues (#2233) Fix `bump_lyft_support_rotation.sh` posting to Slack (#2234) jni: fix mismatched return value types (#2252) Remove bintray as a maven source (#2248) test: refine connection drain test (#2245) cleanup: remove comment in Http::Client that no longer applies (#2203) cleanup: fix test with inaccurate description (#2202) network: perform post-DNS connection drain on a per-host basis (#2240) network: add enableDrainPostDnsRefresh to iOS (#2242) envoy: update to em-cherry (#2241) network: support draining connections after triggered DNS refresh (#2225) Signed-off-by: JP Simard <jp@jpsim.com>

internal setup for connection draining

6c6bccc

Signed-off-by: Mike Schore <mike.schore@gmail.com>

goaway changed the title ~~network: support draining connections after forced DNS refresh~~ network: support draining connections after triggered DNS refresh Apr 28, 2022

goaway added 4 commits April 29, 2022 19:25

green

32f8799

Signed-off-by: Mike Schore <mike.schore@gmail.com>

configuration wiring and tests + fixes

7623d39

Signed-off-by: Mike Schore <mike.schore@gmail.com>

format

fb52423

Signed-off-by: Mike Schore <mike.schore@gmail.com>

spelling

e42b881

Signed-off-by: Mike Schore <mike.schore@gmail.com>

goaway marked this pull request as ready for review April 29, 2022 12:49

goaway requested review from Augustyniak, jpsim and mattklein123 April 29, 2022 19:59

mattklein123 self-assigned this Apr 29, 2022

Merge branch 'main' into ms/drain-cxns

b897c04

Signed-off-by: Mike Schore <mike.schore@gmail.com>

Augustyniak reviewed Apr 29, 2022

goaway added 3 commits May 2, 2022 19:59

fix and add tests

2af1ba2

Signed-off-by: Mike Schore <mike.schore@gmail.com>

format

afec7b7

Signed-off-by: Mike Schore <mike.schore@gmail.com>

add log event to drain

fff6d27

Signed-off-by: Mike Schore <mike.schore@gmail.com>

goaway requested a review from Augustyniak May 2, 2022 12:08

Augustyniak reviewed May 2, 2022

mattklein123 requested changes May 2, 2022

goaway added 4 commits May 3, 2022 05:29

rename drainConnections to resetConnectivityState

0bfb557

Signed-off-by: Mike Schore <mike.schore@gmail.com>

add additional clarifying comments

d75e77b

Signed-off-by: Mike Schore <mike.schore@gmail.com>

fixes and comments

383f2d5

Signed-off-by: Mike Schore <mike.schore@gmail.com>

format and todos

d034970

Signed-off-by: Mike Schore <mike.schore@gmail.com>

goaway requested review from Augustyniak and mattklein123 May 2, 2022 23:51

Augustyniak approved these changes May 3, 2022

goaway merged commit 75cef7d into main May 3, 2022

goaway deleted the ms/drain-cxns branch May 3, 2022 07:44

Augustyniak mentioned this pull request May 3, 2022

network: add enableDrainPostDnsRefresh to iOS #2242

Merged

jpsim reviewed May 3, 2022
