Make health check loop wait for any required SDS secrets to be loaded… by mpuncel · Pull Request #17756 · envoyproxy/envoy

mpuncel · 2021-08-18T13:50:26Z

Commit Message: Make health check loop wait for any required SDS secrets to be loaded before starting.
Additional Description: This should avoid an issue where Envoy might take over a minute to warm clusters with default settings when SDS and active health checking are used. The issue was caused by health checks starting before secrets are ready and then waiting for no_traffic_interval (default 60s) before trying again. This change makes health checks wait for SDS secrets to be ready before starting.
Risk Level: Medium
Testing: Concurrency test covers the bug that caused this to be reverted initially, and was tested locally with --runs_per_test 100, confirming test failures when the bug is present and none when it was fixed.
Docs Changes: None
Release Notes: Included
Platform Specific Features: None
Fixes #12389, #15977, #17529

… before starting. Signed-off-by: Michael Puncel <mpuncel@squareup.com>

mpuncel · 2021-08-18T13:56:15Z

This PR is essentially a duplicate of #16236 (which got closed as stale) but with PR feedback addressed. I plan on rolling out this PR internally to confirm it fixes the Envoy bootup time in our environment and also reduce the risk of surprises.

The final feedback on the last PR was:

From what threads are callbacks added to this list [in ssl_socket.cc]? Do adds happen from outside the thread where onAddOrUpdateSecret() is called?

Callbacks are only added from the main thread, which is also the thread that calls onAddOrUpdateSecret.

On what thread to these callbacks run?

I don't know what the threading model looks like for health checker vs SDS. I think there's potential for callbacks being invoked from the wrong thread.

Matt helped me in Slack to confirm that health checking and SDS are all run from the main thread. So the callback to begin health checks happens on the correct thread.

I see no mechanism to cancel pending callbacks on ActiveHealthCheckSession deletion. Is it possible for it to take a long time for secrets to become available, and for the ActiveHealthCheckSession to be deleted before this callback is executed?

This was something missing from the last PR. I added a check to ensure that the ActiveHealthCheckSession wasn't deleted into the callback before attempting to start health checks. I accomplished this with attempting to get a weak pointer to an otherwise-useless uint32 in the ActiveHealthCheckSession struct.

mpuncel · 2021-08-18T16:53:10Z

/retest

repokitteh-read-only · 2021-08-18T16:53:13Z

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17756 (comment) was created by @mpuncel.

see: more, trace.

mpuncel · 2021-08-18T19:41:42Z

/retest

repokitteh-read-only · 2021-08-18T19:41:47Z

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17756 (comment) was created by @mpuncel.

see: more, trace.

mpuncel · 2021-08-18T20:00:29Z

I'm pretty sure the test failures are not related to my change, they're all in fuzz tests with errors like:

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CorpusExamples/FuzzerCorpusTest
[ RUN      ] CorpusExamples/FuzzerCorpusTest.RunOneCorpusFile/0
[2021-08-18 19:47:48.237][16][info][misc] [external/envoy/test/fuzz/main.cc:45] Corpus file: external/envoy/test/common/upstream/least_request_load_balancer_corpus
unknown file: Failure
C++ exception with description "Invalid path: external/envoy/test/common/upstream/least_request_load_balancer_corpus" thrown in the test body.
[  FAILED  ] CorpusExamples/FuzzerCorpusTest.RunOneCorpusFile/0, where GetParam() = "external/envoy/test/common/upstream/least_request_load_balancer_corpus" (0 ms)
[----------] 1 test from CorpusExamples/FuzzerCorpusTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] CorpusExamples/FuzzerCorpusTest.RunOneCorpusFile/0, where GetParam() = "external/envoy/test/common/upstream/least_request_load_balancer_corpus"

* main: Fix for fuzz tests failing due to invalid corpus paths (envoyproxy#17767) kafka: fix integration test (envoyproxy#17764) Fix typo in cluster.proto (envoyproxy#17755) cluster manager: add drainConnections() API (envoyproxy#17747) kafka broker filter: move to contrib (envoyproxy#17750) quiche: switch external dependency to github (envoyproxy#17732) quiche: implement stream idle timeout in codec (envoyproxy#17674) Update c-ares to 1.17.2 (envoyproxy#17704) Fix dns resolve fuzz bug (envoyproxy#17107) Remove members that shadow members of the base class (envoyproxy#17713) thrift proxy: missing parts from the previous PR (envoyproxy#17668) thrift-proxy: cleanup ConnectionManager::ActiveRpc (envoyproxy#17734) listener: extra warning for deprecated use_proxy_proto field (envoyproxy#17736) kafka: add support for metadata request in mesh-filter (envoyproxy#17597) upstream: add all host map to priority set for fast host searching (envoyproxy#17290) Remove the support for `hidden_envoy_deprecated_per_filter_config` (envoyproxy#17725) tls: SDS support for private key providers (envoyproxy#16737) bazel: update rules_foreign_cc (envoyproxy#17445) Signed-off-by: Michael Puncel <mpuncel@squareup.com>

rojkov

Thanks! Added a couple of nitpicks.

rojkov · 2021-08-23T08:49:25Z

source/extensions/transport_sockets/tls/ssl_socket.cc

+  }
+
+  if (should_run_callbacks) {
+    {


This block looks redundant.

rojkov · 2021-08-23T09:06:18Z

source/extensions/transport_sockets/tls/ssl_socket.cc

+      {
+        absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);
+        secrets_ready_callbacks_.push_back(callback);
+      }


This block could be less nested if executed as an else branch of the condition below.

rojkov · 2021-08-23T09:07:18Z

source/extensions/transport_sockets/tls/ssl_socket.cc

+  }
+
+  if (should_run_callbacks) {
+    {


redundant nested block

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

rojkov

Thanks! Looking good. Added a couple of other nits.
You may try to merge the latest main to fix the coverage issue.

rojkov · 2021-08-24T09:37:07Z

envoy/network/transport_socket.h

  virtual bool supportsAlpn() const { return false; }
+
+  /**
+   * @param a callback to be invoked when the secrets required by the created transport


Suggested change

* @param a callback to be invoked when the secrets required by the created transport

* @param callback supplies a callback to be invoked when the secrets required by the created transport

rojkov · 2021-08-24T09:41:33Z

source/extensions/transport_sockets/tls/ssl_socket.cc

+void ClientSslSocketFactory::addReadyCb(std::function<void()> callback) {
+  bool immediately_run_callback = false;
+  {
+    absl::ReaderMutexLock l(&ssl_ctx_mu_);
+    if (ssl_ctx_) {
+      immediately_run_callback = true;
+    } else {
+      absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);
+      secrets_ready_callbacks_.push_back(callback);
+    }
+  }
+  if (immediately_run_callback) {
+    callback();
+  }
+}


Consider this change here and in ServerSslSocketFactory::addReadyCb(). This way the ssl_ctx lock could be released earlier.

Suggested change

void ClientSslSocketFactory::addReadyCb(std::function<void()> callback) {

bool immediately_run_callback = false;

{

absl::ReaderMutexLock l(&ssl_ctx_mu_);

if (ssl_ctx_) {

immediately_run_callback = true;

} else {

absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);

secrets_ready_callbacks_.push_back(callback);

}

}

if (immediately_run_callback) {

callback();

}

}

void ClientSslSocketFactory::addReadyCb(std::function<void()> callback) {

bool immediately_run_callback = false;

{

absl::ReaderMutexLock l(&ssl_ctx_mu_);

if (ssl_ctx_) {

immediately_run_callback = true;

}

}

if (immediately_run_callback) {

callback();

} else {

absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);

secrets_ready_callbacks_.push_back(callback);

}

}

don't I need to hold both locks when adding to the callbacks list? otherwise another thread could set ssl_ctx_mu and then not see this callback should be run

What are the implications if the missed callback runs next time onAddOrUpdateSecret() is called?

I think the next onAddOrUpdateSecret() could potentially be hours or days later, e.g. served by an SDS server with a low certificate refresh rate. that would mean health checks don't start for that long and Envoy is stuck in a warming state

Alright, SGTM.

rojkov · 2021-08-25T07:32:46Z

/wait

* main: (32 commits) Stop processing pending H/2 frames if connection transitioned to the closed state http2: limit use of deferred resets in the http2 codec to server-side connections Abort filter chain iteration on local reply Reject or strip fragment from request URI ext-authz: merge duplicate headers from client request in check request common: introduce stable logger /w examples in DNS (envoyproxy#17772) route: fast return when route matches failed (envoyproxy#17769) owners: add owners for dubbo proxy network filter (envoyproxy#17820) config/router/tcp_proxy/options: v2 API, boosting and --bootstrap-version CLI removal. (envoyproxy#17724) coverage: revert the limit http/cache to 92.6. (envoyproxy#17817) network: rename SocketAddressProvider as ConnectionInfoProvider (envoyproxy#17717) test: bumping coverage (envoyproxy#17757) conn_pool: Minor cleanups to ConnPoolBaseImpl (envoyproxy#17710) Split VaryHeader into VaryAllowList and VaryUtils to organize vary-related logic (envoyproxy#17728) ext_proc: Make tests more resilient to IPv6 support (envoyproxy#17784) Remove invlaid backquote from doc (envoyproxy#17797) rocketmq: move to contrib (envoyproxy#17796) kafka: upstream kafka facade in mesh-filter (envoyproxy#17783) ecds: create shared base class for DynamicFilterConfigProviderImpl (envoyproxy#17735) Change log level from debug to trace (envoyproxy#17774) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

rojkov

Thank you! Looks really good.

rojkov · 2021-08-27T06:24:03Z

source/extensions/transport_sockets/tls/ssl_socket.cc

+void ClientSslSocketFactory::addReadyCb(std::function<void()> callback) {
+  bool immediately_run_callback = false;
+  {
+    absl::ReaderMutexLock l(&ssl_ctx_mu_);
+    if (ssl_ctx_) {
+      immediately_run_callback = true;
+    } else {
+      absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);
+      secrets_ready_callbacks_.push_back(callback);
+    }
+  }
+  if (immediately_run_callback) {
+    callback();
+  }
+}


Alright, SGTM.

mpuncel · 2021-08-30T20:16:57Z

Before merging this I actually want to check if there is a simpler way of achieving my goals here by changing some ordering in when the cluster manager starts health checks. I think there is already a step in that flow that waits for SDS to load via init manager, which might cut down on the complexity of the change if it works.

rojkov · 2021-08-31T06:32:44Z

/wait

github-actions · 2021-09-30T08:01:23Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions · 2021-10-07T08:01:35Z

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Make health check loop wait for any required SDS secrets to be loaded…

ffa08e1

… before starting. Signed-off-by: Michael Puncel <mpuncel@squareup.com>

mpuncel requested review from alyssawilk, antoniovicente, asraa, ggreenway and lizan as code owners August 18, 2021 13:50

rojkov reviewed Aug 23, 2021

View reviewed changes

phlax assigned rojkov Aug 23, 2021

PR feedback: remove unnecessary nesting

d18d23e

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

rojkov reviewed Aug 24, 2021

View reviewed changes

repokitteh-read-only bot added the waiting label Aug 25, 2021

mpuncel added 2 commits August 25, 2021 09:25

PR feedback: comment format fix

0cc47e5

Signed-off-by: Michael Puncel <mpuncel@squareup.com>

repokitteh-read-only bot removed the waiting label Aug 25, 2021

rojkov approved these changes Aug 27, 2021

View reviewed changes

repokitteh-read-only bot added the waiting label Aug 31, 2021

mpuncel mentioned this pull request Sep 14, 2021

upstream: start health checker after init manager complete #18119

Closed

github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 30, 2021

github-actions bot closed this Oct 7, 2021

	* @param a callback to be invoked when the secrets required by the created transport
	* @param callback supplies a callback to be invoked when the secrets required by the created transport

Conversation

mpuncel commented Aug 18, 2021

Uh oh!

mpuncel commented Aug 18, 2021

Uh oh!

mpuncel commented Aug 18, 2021

Uh oh!

repokitteh-read-only bot commented Aug 18, 2021

Uh oh!

mpuncel commented Aug 18, 2021

Uh oh!

repokitteh-read-only bot commented Aug 18, 2021

Uh oh!

mpuncel commented Aug 18, 2021

Uh oh!

rojkov left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rojkov left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rojkov commented Aug 25, 2021

Uh oh!

rojkov left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mpuncel commented Aug 30, 2021

Uh oh!

rojkov commented Aug 31, 2021

Uh oh!

github-actions bot commented Sep 30, 2021

Uh oh!

github-actions bot commented Oct 7, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants