Skip to content

network: support draining connections after triggered DNS refresh#2225

Merged
goaway merged 13 commits intomainfrom
ms/drain-cxns
May 3, 2022
Merged

network: support draining connections after triggered DNS refresh#2225
goaway merged 13 commits intomainfrom
ms/drain-cxns

Conversation

@goaway
Copy link
Contributor

@goaway goaway commented Apr 28, 2022

Description: Adds the facility to drain connections after the resolution of a soft DNS refresh. A refresh may be triggered directly via the Engine API, or as a result of a network status update provided by the OS. Draining connections does not interrupt existing connections or requests, but will establish new connections for any further requests.
Risk Level: Moderate
Testing: New & Updated Coverage in Builder, Config, and Network::Configurator

Signed-off-by: Mike Schore mike.schore@gmail.com

Signed-off-by: Mike Schore <mike.schore@gmail.com>
@goaway goaway changed the title network: support draining connections after forced DNS refresh network: support draining connections after triggered DNS refresh Apr 28, 2022
goaway added 4 commits April 29, 2022 19:25
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
@goaway goaway marked this pull request as ready for review April 29, 2022 12:49
@goaway
Copy link
Contributor Author

goaway commented Apr 29, 2022

Some C++ tests need to be added/fixed and I'm going to be renaming the public drainConnections call to be something else, but other than that this should be ready for someone to look at the logic at least of what's happening.

@mattklein123 mattklein123 self-assigned this Apr 29, 2022
@jpsim
Copy link
Contributor

jpsim commented Apr 29, 2022

Do you want to pass in a host predicate like was exposed in envoyproxy/envoy#21074?

@mattklein123
Copy link
Member

Do you want to pass in a host predicate like was exposed in envoyproxy/envoy#21074?

+1 seems like we should use the predicate? I think from the cache you can actually get all host names, so you could even get all the hosts pre-refresh, put them in a map, and then just drain each host when you get any type of resolution completion.

@goaway
Copy link
Contributor Author

goaway commented Apr 29, 2022

Do you want to pass in a host predicate like was exposed in envoyproxy/envoy#21074?

I was considering doing this first at least over the weekend, since there's some extra complexity with individual host predicates (I think we either need to do @mattklein123's suggestion and track each individually and/or define a time window to watch for resolutions so we don't continue to drain indefinitely).

I agree we'd still want to iterate to eventually using host predicates, but this simpler version could perhaps be easier to test and reason about initially. One thing I could do though is log which host resolution we see that triggers the drain.

Signed-off-by: Mike Schore <mike.schore@gmail.com>
Copy link
Contributor

@Augustyniak Augustyniak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should strongly consider supporting draining connection using the predicate thing that @mattklein123 added - I know that you mentioned that you were planning to work on it.

As it is now, the reliability of us draining connections (arguably) goes down significantly one we drain connection after the first DNS resolution comes back. That's true especially with the number of hosts (often >= 4) that we deal with in Android Driver app. In theory we could ship this 'simple' version first and improve it later but with our long iteration cycle and limited visibility into the state of user sessions (and the high cost of looking at them) I think that we should strongly consider shipping refresh + drain connections in its 'final' form from the get go.

The current version is harder to investigate than the 'final' one as it's going to be harder to say whether our draining of connections was performed at the right time or not. Obviously we can check this on session by session basis (if we add more events) but it's still going to be hard (if not impossible) to tell how we are doing in general.

Thread::LockGuard lock{network_state_.mutex_};
if (network_state_.network_ == network) {
// Provide a non-current key preventing further scheduled effects (e.g. DNS refresh).
return network_state_.configuration_key_ - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we log something in here? using ENVOY_LOG_EVENT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will honestly just add volume/noise if we log here, given we already log all network updates.

if (enable_drain_post_dns_refresh_ && pending_drain_) {
pending_drain_ = false;
if (status == Network::DnsResolver::ResolutionStatus::Success) {
cluster_manager_.drainConnections();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a log here? Unless there is some log inside of drainConnections (maybe using ENVOY_LOG_EVENT?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goaway added 3 commits May 2, 2022 19:59
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
@goaway goaway requested a review from Augustyniak May 2, 2022 12:08
@goaway
Copy link
Contributor Author

goaway commented May 2, 2022

I have a branch where I have work in progress for supporting the predicate-based refresh, but the functionality here is green, covered, and ready to go, so when folks come online Monday, I suggest we merge at least for some initial validation.

I should have my PR for predicate support up later in the day.


envoy_netconf_t Configurator::setPreferredNetwork(envoy_network_t network) {
Thread::LockGuard lock{network_state_.mutex_};
if (network_state_.network_ == network) {
Copy link
Contributor

@Augustyniak Augustyniak May 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this check will break a DNS a network change update for when we move from no-internet connection to wi-fi connection and NWPathMonitor is disabled. This is because of this logic that does not differentiate between the lack of connection and wi-fi connection https://github.com/envoyproxy/envoy-mobile/blob/main/library/objective-c/EnvoyNetworkMonitor.m#L125-L132

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can update NetworkPathMonitor so that it properly differentiates between no-connection and wi-fi connection but in general addition of this check does not seem to be related to the changes in this PR and it does increase a risk of the changes in this PR- especially as this functionality is not behind a FF.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this change in general? Can you add more comments? We can also discuss in our sync.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added further comments, but also disabled this check for now.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. Some thoughts before our discussion.


envoy_netconf_t Configurator::setPreferredNetwork(envoy_network_t network) {
Thread::LockGuard lock{network_state_.mutex_};
if (network_state_.network_ == network) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this change in general? Can you add more comments? We can also discuss in our sync.

pending_drain_ = false;
} else if (!dns_callbacks_handle_) {
if (auto dns_cache = dnsCache()) {
dns_callbacks_handle_ = dns_cache->addUpdateCallbacks(*this);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For simplicity I think it would be simpler to just always have the callbacks registered but up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added clarifying comment to the code here.

Network::DnsResolver::ResolutionStatus status) {
if (enable_drain_post_dns_refresh_ && pending_drain_) {
pending_drain_ = false;
if (status == Network::DnsResolver::ResolutionStatus::Success) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we want to gate this on success or not. If we are doing this because of a network switch or background to foreground switch it seems better to just always do it after we tried to refresh DNS?

Can we add more comments somewhere on generally when this is going to fire? Does this change only do this on network switch or are we also doing background to foreground as well?

Also, if we end up leaving this implementation and not doing the per-host version, can you add comments/TODO for the downsides of this implementation? I'm torn on whether it's worth it to do this change or not without the per-host version. It seems not that much harder to me to snap the cache state with all hosts and then just keep track of refresh state for each one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the check, and added an additional comment. I think it may be possible to refine this logic to only drain for a specific host when either DNS has changed, or we have reason to believe prior connections are probably defunct, but after further consideration, I agree that the simplest thing to do is always trigger drain.

goaway added 4 commits May 3, 2022 05:29
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: Mike Schore <mike.schore@gmail.com>
@goaway goaway requested review from Augustyniak and mattklein123 May 2, 2022 23:51
@goaway goaway merged commit 75cef7d into main May 3, 2022
@goaway goaway deleted the ms/drain-cxns branch May 3, 2022 07:44
Augustyniak added a commit that referenced this pull request May 3, 2022
Description: Expose an option to enable connections draining post dns refresh on Engine iOS builder. This change was accidentally not implemented as part of #2225.
Risk Level: Low, addition of a new functionality.
Testing: Unit
Docs Changes:
Release Notes: updated

Signed-off-by: Rafal Augustyniak raugustyniak@lyft.com
const std::string& host, const Extensions::Common::DynamicForwardProxy::DnsHostInfoSharedPtr&,
Network::DnsResolver::ResolutionStatus) {
if (enable_drain_post_dns_refresh_ && pending_drain_) {
pending_drain_ = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is onDnsResolutionComplete guaranteed to always be invoked on the same thread as setDrainPostDnsRefreshEnabled? If not, should we guard this with a mutex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking, all C++ code in Envoy Mobile should be assumed to run on a single thread with only a very few exceptions. As it happens, I added a clarifying comment on the subject for this class elsewhere in the PR:
https://github.com/envoyproxy/envoy-mobile/pull/2225/files#diff-41a6b7732eaa28a4c8cd16a941d7d7cb70b91428a7a47ff352a1bd9b3c9e5446R66

jpsim added a commit that referenced this pull request May 3, 2022
…rtion

* origin/main: (57 commits)
  network: add enableDrainPostDnsRefresh to iOS (#2242)
  envoy: update to em-cherry (#2241)
  network: support draining connections after triggered DNS refresh (#2225)
  Bump rules_apple to 0.34.2 (#2236)
  Bump Lyft Support Rotation (#2232)
  ci: add support for `/retest` command (#2219)
  format: add SwiftLint to check-format script (#2230)
  use 64 bit emulators for test (#2228)
  envoy: update to `d0befbb` & add `h2ExtendKeepaliveTimeout` (#2229)
  Add Ryan Hamilton to OWNERS.md (#2224)
  Android cert verifier: first import from chromium/net (#2222)
  Cleanup: remove the unused stats sink metrics_service (#2220)
  bazel: move back to symbol mapping table files (#2218)
  Add new version history section (#2209)
  build: simplify jnilib copy (#2214)
  ci: Update macOS version to macOS 12 (#2208)
  Update releasing.rst (#2200)
  support: Post Lyft support rotation changes to Slack (#2207)
  Don't run bump_support_rotation GitHub Action on forks (#2204)
  Release 0.4.6 (#2201)
  ...

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit that referenced this pull request May 5, 2022
…ol--fix-documentation-comment-issues

* origin/main:
  Fix `bump_lyft_support_rotation.sh` posting to Slack (#2234)
  jni: fix mismatched return value types (#2252)
  Remove bintray as a maven source (#2248)
  test: refine connection drain test (#2245)
  cleanup: remove comment in Http::Client that no longer applies (#2203)
  cleanup: fix test with inaccurate description (#2202)
  network: perform post-DNS connection drain on a per-host basis (#2240)
  network: add enableDrainPostDnsRefresh to iOS (#2242)
  envoy: update to em-cherry (#2241)
  network: support draining connections after triggered DNS refresh (#2225)
  Bump rules_apple to 0.34.2 (#2236)
  Bump Lyft Support Rotation (#2232)

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit that referenced this pull request May 6, 2022
* main:
  envoy: update to efbbb04 (#2258)
  Update rules_java, rules_detekt, rules_jvm_external (#2249)
  Upgrade rules_jvm_external from 4.1 -> 4.2 (#2254)
  CancelProofEnvoyStream (#2250)
  swift: add DrString tool & fix documentation comment issues (#2233)
  Fix `bump_lyft_support_rotation.sh` posting to Slack (#2234)
  jni: fix mismatched return value types (#2252)
  Remove bintray as a maven source (#2248)
  test: refine connection drain test (#2245)
  cleanup: remove comment in Http::Client that no longer applies (#2203)
  cleanup: fix test with inaccurate description (#2202)
  network: perform post-DNS connection drain on a per-host basis (#2240)
  network: add enableDrainPostDnsRefresh to iOS (#2242)
  envoy: update to em-cherry (#2241)
  network: support draining connections after triggered DNS refresh (#2225)

Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants