Skip to content

Make health check loop wait for any required SDS secrets to be loaded…#17756

Closed
mpuncel wants to merge 5 commits intoenvoyproxy:mainfrom
mpuncel:mpuncel/sds-hc-sequence-2
Closed

Make health check loop wait for any required SDS secrets to be loaded…#17756
mpuncel wants to merge 5 commits intoenvoyproxy:mainfrom
mpuncel:mpuncel/sds-hc-sequence-2

Conversation

@mpuncel
Copy link
Contributor

@mpuncel mpuncel commented Aug 18, 2021

Commit Message: Make health check loop wait for any required SDS secrets to be loaded before starting.
Additional Description: This should avoid an issue where Envoy might take over a minute to warm clusters with default settings when SDS and active health checking are used. The issue was caused by health checks starting before secrets are ready and then waiting for no_traffic_interval (default 60s) before trying again. This change makes health checks wait for SDS secrets to be ready before starting.
Risk Level: Medium
Testing: Concurrency test covers the bug that caused this to be reverted initially, and was tested locally with --runs_per_test 100, confirming test failures when the bug is present and none when it was fixed.
Docs Changes: None
Release Notes: Included
Platform Specific Features: None
Fixes #12389, #15977, #17529

… before starting.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel
Copy link
Contributor Author

mpuncel commented Aug 18, 2021

This PR is essentially a duplicate of #16236 (which got closed as stale) but with PR feedback addressed. I plan on rolling out this PR internally to confirm it fixes the Envoy bootup time in our environment and also reduce the risk of surprises.

The final feedback on the last PR was:

From what threads are callbacks added to this list [in ssl_socket.cc]? Do adds happen from outside the thread where onAddOrUpdateSecret() is called?

Callbacks are only added from the main thread, which is also the thread that calls onAddOrUpdateSecret.

On what thread to these callbacks run?

I don't know what the threading model looks like for health checker vs SDS. I think there's potential for callbacks being invoked from the wrong thread.

Matt helped me in Slack to confirm that health checking and SDS are all run from the main thread. So the callback to begin health checks happens on the correct thread.

I see no mechanism to cancel pending callbacks on ActiveHealthCheckSession deletion. Is it possible for it to take a long time for secrets to become available, and for the ActiveHealthCheckSession to be deleted before this callback is executed?

This was something missing from the last PR. I added a check to ensure that the ActiveHealthCheckSession wasn't deleted into the callback before attempting to start health checks. I accomplished this with attempting to get a weak pointer to an otherwise-useless uint32 in the ActiveHealthCheckSession struct.

@mpuncel
Copy link
Contributor Author

mpuncel commented Aug 18, 2021

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17756 (comment) was created by @mpuncel.

see: more, trace.

@mpuncel
Copy link
Contributor Author

mpuncel commented Aug 18, 2021

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17756 (comment) was created by @mpuncel.

see: more, trace.

@mpuncel
Copy link
Contributor Author

mpuncel commented Aug 18, 2021

I'm pretty sure the test failures are not related to my change, they're all in fuzz tests with errors like:

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CorpusExamples/FuzzerCorpusTest
[ RUN      ] CorpusExamples/FuzzerCorpusTest.RunOneCorpusFile/0
[2021-08-18 19:47:48.237][16][info][misc] [external/envoy/test/fuzz/main.cc:45] Corpus file: external/envoy/test/common/upstream/least_request_load_balancer_corpus
unknown file: Failure
C++ exception with description "Invalid path: external/envoy/test/common/upstream/least_request_load_balancer_corpus" thrown in the test body.
[  FAILED  ] CorpusExamples/FuzzerCorpusTest.RunOneCorpusFile/0, where GetParam() = "external/envoy/test/common/upstream/least_request_load_balancer_corpus" (0 ms)
[----------] 1 test from CorpusExamples/FuzzerCorpusTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] CorpusExamples/FuzzerCorpusTest.RunOneCorpusFile/0, where GetParam() = "external/envoy/test/common/upstream/least_request_load_balancer_corpus"

* main:
  Fix for fuzz tests failing due to invalid corpus paths (envoyproxy#17767)
  kafka: fix integration test (envoyproxy#17764)
  Fix typo in cluster.proto (envoyproxy#17755)
  cluster manager: add drainConnections() API (envoyproxy#17747)
  kafka broker filter: move to contrib (envoyproxy#17750)
  quiche: switch external dependency to github (envoyproxy#17732)
  quiche: implement stream idle timeout in codec (envoyproxy#17674)
  Update c-ares to 1.17.2 (envoyproxy#17704)
  Fix dns resolve fuzz bug (envoyproxy#17107)
  Remove members that shadow members of the base class (envoyproxy#17713)
  thrift proxy: missing parts from the previous PR (envoyproxy#17668)
  thrift-proxy: cleanup ConnectionManager::ActiveRpc (envoyproxy#17734)
  listener: extra warning for deprecated use_proxy_proto field (envoyproxy#17736)
  kafka: add support for metadata request in mesh-filter (envoyproxy#17597)
  upstream: add all host map to priority set for fast host searching (envoyproxy#17290)
  Remove the support for `hidden_envoy_deprecated_per_filter_config` (envoyproxy#17725)
  tls: SDS support for private key providers (envoyproxy#16737)
  bazel: update rules_foreign_cc (envoyproxy#17445)

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Copy link
Member

@rojkov rojkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Added a couple of nitpicks.

}

if (should_run_callbacks) {
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block looks redundant.

Comment on lines +410 to +413
{
absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);
secrets_ready_callbacks_.push_back(callback);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block could be less nested if executed as an else branch of the condition below.

}

if (should_run_callbacks) {
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant nested block

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Copy link
Member

@rojkov rojkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looking good. Added a couple of other nits.
You may try to merge the latest main to fix the coverage issue.

virtual bool supportsAlpn() const { return false; }

/**
* @param a callback to be invoked when the secrets required by the created transport
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param a callback to be invoked when the secrets required by the created transport
* @param callback supplies a callback to be invoked when the secrets required by the created transport

Comment on lines +401 to +415
void ClientSslSocketFactory::addReadyCb(std::function<void()> callback) {
bool immediately_run_callback = false;
{
absl::ReaderMutexLock l(&ssl_ctx_mu_);
if (ssl_ctx_) {
immediately_run_callback = true;
} else {
absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);
secrets_ready_callbacks_.push_back(callback);
}
}
if (immediately_run_callback) {
callback();
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider this change here and in ServerSslSocketFactory::addReadyCb(). This way the ssl_ctx lock could be released earlier.

Suggested change
void ClientSslSocketFactory::addReadyCb(std::function<void()> callback) {
bool immediately_run_callback = false;
{
absl::ReaderMutexLock l(&ssl_ctx_mu_);
if (ssl_ctx_) {
immediately_run_callback = true;
} else {
absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);
secrets_ready_callbacks_.push_back(callback);
}
}
if (immediately_run_callback) {
callback();
}
}
void ClientSslSocketFactory::addReadyCb(std::function<void()> callback) {
bool immediately_run_callback = false;
{
absl::ReaderMutexLock l(&ssl_ctx_mu_);
if (ssl_ctx_) {
immediately_run_callback = true;
}
}
if (immediately_run_callback) {
callback();
} else {
absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);
secrets_ready_callbacks_.push_back(callback);
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't I need to hold both locks when adding to the callbacks list? otherwise another thread could set ssl_ctx_mu and then not see this callback should be run

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the implications if the missed callback runs next time onAddOrUpdateSecret() is called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the next onAddOrUpdateSecret() could potentially be hours or days later, e.g. served by an SDS server with a low certificate refresh rate. that would mean health checks don't start for that long and Envoy is stuck in a warming state

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, SGTM.

@rojkov
Copy link
Member

rojkov commented Aug 25, 2021

/wait

* main: (32 commits)
  Stop processing pending H/2 frames if connection transitioned to the closed state
  http2: limit use of deferred resets in the http2 codec to server-side connections
  Abort filter chain iteration on local reply
  Reject or strip fragment from request URI
  ext-authz: merge duplicate headers from client request in check request
  common: introduce stable logger /w examples in DNS  (envoyproxy#17772)
  route: fast return when route matches failed (envoyproxy#17769)
  owners: add owners for dubbo proxy network filter (envoyproxy#17820)
  config/router/tcp_proxy/options: v2 API, boosting and --bootstrap-version CLI removal. (envoyproxy#17724)
  coverage: revert the limit http/cache to 92.6. (envoyproxy#17817)
  network: rename SocketAddressProvider as ConnectionInfoProvider (envoyproxy#17717)
  test: bumping coverage (envoyproxy#17757)
  conn_pool: Minor cleanups to ConnPoolBaseImpl (envoyproxy#17710)
  Split VaryHeader into VaryAllowList and VaryUtils to organize vary-related logic (envoyproxy#17728)
  ext_proc: Make tests more resilient to IPv6 support (envoyproxy#17784)
  Remove invlaid backquote from doc (envoyproxy#17797)
  rocketmq: move to contrib (envoyproxy#17796)
  kafka: upstream kafka facade in mesh-filter (envoyproxy#17783)
  ecds: create shared base class for DynamicFilterConfigProviderImpl (envoyproxy#17735)
  Change log level from debug to trace (envoyproxy#17774)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Copy link
Member

@rojkov rojkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Looks really good.

Comment on lines +401 to +415
void ClientSslSocketFactory::addReadyCb(std::function<void()> callback) {
bool immediately_run_callback = false;
{
absl::ReaderMutexLock l(&ssl_ctx_mu_);
if (ssl_ctx_) {
immediately_run_callback = true;
} else {
absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);
secrets_ready_callbacks_.push_back(callback);
}
}
if (immediately_run_callback) {
callback();
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, SGTM.

@mpuncel
Copy link
Contributor Author

mpuncel commented Aug 30, 2021

Before merging this I actually want to check if there is a simpler way of achieving my goals here by changing some ordering in when the cluster manager starts health checks. I think there is already a step in that flow that waits for SDS to load via init manager, which might cut down on the complexity of the change if it works.

@rojkov
Copy link
Member

rojkov commented Aug 31, 2021

/wait

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 30, 2021
@github-actions
Copy link

github-actions bot commented Oct 7, 2021

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Oct 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale stalebot believes this issue/PR has not been touched recently waiting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Should HC activation be delayed until needed secrets are available?

2 participants