Make health check loop wait for any required SDS secrets to be loaded… #17756
mpuncel wants to merge 5 commits into envoyproxy:main
… before starting. Signed-off-by: Michael Puncel <mpuncel@squareup.com>
This PR is essentially a duplicate of #16236 (which was closed as stale), but with the PR feedback addressed. I plan to roll this PR out internally to confirm that it fixes the slow Envoy bootup in our environment and to reduce the risk of surprises. The final feedback on the last PR was:
Callbacks are only added from the main thread, which is also the thread that calls onAddOrUpdateSecret.
Matt helped me in Slack to confirm that health checking and SDS are all run from the main thread. So the callback to begin health checks happens on the correct thread.
This was something missing from the last PR. I added a check to ensure that the ActiveHealthCheckSession has not been deleted by the time the callback runs, before attempting to start health checks. I did this by attempting to lock a weak pointer to an otherwise-unused uint32 in the ActiveHealthCheckSession struct.
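The weak-pointer guard described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `Session`, `liveness_token_`, `starts_`, and `makeReadyCallback()` are hypothetical names, and it assumes (as discussed above) that the callback and the destruction of the session both happen on the same thread.

```cpp
#include <cstdint>
#include <functional>
#include <memory>

// Hypothetical stand-in for ActiveHealthCheckSession; not Envoy's real class.
struct Session {
  // Otherwise-unused value whose only job is to let callbacks detect destruction.
  std::shared_ptr<uint32_t> liveness_token_ = std::make_shared<uint32_t>(0);
  int starts_ = 0;

  void start() { ++starts_; }

  // Builds a callback that is safe to invoke even after the session is destroyed,
  // provided invocation and destruction happen on the same thread.
  std::function<void()> makeReadyCallback() {
    std::weak_ptr<uint32_t> weak_token = liveness_token_;
    return [this, weak_token]() {
      // lock() yields an empty pointer once the session (and its token) is gone,
      // so `this` is never dereferenced after destruction.
      if (weak_token.lock()) {
        start();
      }
    };
  }
};
```

The token adds one small allocation per session; in exchange, the factory that stores the callback does not need to know anything about session lifetimes.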
/retest
Retrying Azure Pipelines:
I'm pretty sure the test failures are unrelated to my change; they're all in fuzz tests, with errors like:
* main:
  Fix for fuzz tests failing due to invalid corpus paths (envoyproxy#17767)
  kafka: fix integration test (envoyproxy#17764)
  Fix typo in cluster.proto (envoyproxy#17755)
  cluster manager: add drainConnections() API (envoyproxy#17747)
  kafka broker filter: move to contrib (envoyproxy#17750)
  quiche: switch external dependency to github (envoyproxy#17732)
  quiche: implement stream idle timeout in codec (envoyproxy#17674)
  Update c-ares to 1.17.2 (envoyproxy#17704)
  Fix dns resolve fuzz bug (envoyproxy#17107)
  Remove members that shadow members of the base class (envoyproxy#17713)
  thrift proxy: missing parts from the previous PR (envoyproxy#17668)
  thrift-proxy: cleanup ConnectionManager::ActiveRpc (envoyproxy#17734)
  listener: extra warning for deprecated use_proxy_proto field (envoyproxy#17736)
  kafka: add support for metadata request in mesh-filter (envoyproxy#17597)
  upstream: add all host map to priority set for fast host searching (envoyproxy#17290)
  Remove the support for `hidden_envoy_deprecated_per_filter_config` (envoyproxy#17725)
  tls: SDS support for private key providers (envoyproxy#16737)
  bazel: update rules_foreign_cc (envoyproxy#17445)
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
rojkov
left a comment
Thanks! Added a couple of nitpicks.
  }

  if (should_run_callbacks) {
    {

  {
    absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);
    secrets_ready_callbacks_.push_back(callback);
  }
This block could be less nested if executed as an else branch of the condition below.
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
rojkov
left a comment
Thanks! Looking good. Added a couple of other nits.
You may try to merge the latest main to fix the coverage issue.
envoy/network/transport_socket.h
Outdated
  virtual bool supportsAlpn() const { return false; }

  /**
   * @param a callback to be invoked when the secrets required by the created transport
-   * @param a callback to be invoked when the secrets required by the created transport
+   * @param callback supplies a callback to be invoked when the secrets required by the created transport
void ClientSslSocketFactory::addReadyCb(std::function<void()> callback) {
  bool immediately_run_callback = false;
  {
    absl::ReaderMutexLock l(&ssl_ctx_mu_);
    if (ssl_ctx_) {
      immediately_run_callback = true;
    } else {
      absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);
      secrets_ready_callbacks_.push_back(callback);
    }
  }
  if (immediately_run_callback) {
    callback();
  }
}
Consider this change here and in ServerSslSocketFactory::addReadyCb(). This way the ssl_ctx lock could be released earlier.
- void ClientSslSocketFactory::addReadyCb(std::function<void()> callback) {
-   bool immediately_run_callback = false;
-   {
-     absl::ReaderMutexLock l(&ssl_ctx_mu_);
-     if (ssl_ctx_) {
-       immediately_run_callback = true;
-     } else {
-       absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);
-       secrets_ready_callbacks_.push_back(callback);
-     }
-   }
-   if (immediately_run_callback) {
-     callback();
-   }
- }
+ void ClientSslSocketFactory::addReadyCb(std::function<void()> callback) {
+   bool immediately_run_callback = false;
+   {
+     absl::ReaderMutexLock l(&ssl_ctx_mu_);
+     if (ssl_ctx_) {
+       immediately_run_callback = true;
+     }
+   }
+   if (immediately_run_callback) {
+     callback();
+   } else {
+     absl::WriterMutexLock m(&secrets_ready_callbacks_mu_);
+     secrets_ready_callbacks_.push_back(callback);
+   }
+ }
Don't I need to hold both locks when adding to the callbacks list? Otherwise another thread could set ssl_ctx_ between my check and my push_back, and then never see that this callback should be run.
What are the implications if the missed callback runs next time onAddOrUpdateSecret() is called?
I think the next onAddOrUpdateSecret() could potentially be hours or days later, e.g. when the secrets are served by an SDS server with a low certificate refresh rate. That would mean health checks don't start for that long and Envoy is stuck in a warming state.
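The concern in this thread is a check-then-enqueue race: if the readiness check and the push_back are not covered by the same critical section, the "secrets ready" transition can slip in between them and the callback is queued after the queue has already been drained. A minimal sketch of the both-locks-held variant follows. It is not the PR's code: `ReadyNotifier` and `markReady()` are hypothetical names, a plain `bool` stands in for `ssl_ctx_`, and `std::mutex` is used instead of `absl::Mutex` to keep the example self-contained.

```cpp
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical simplification of a socket factory's readiness tracking.
class ReadyNotifier {
public:
  // Runs the callback immediately if already ready, otherwise queues it.
  void addReadyCb(std::function<void()> callback) {
    {
      std::lock_guard<std::mutex> ctx_lock(ctx_mu_);
      if (!ready_) {
        // ctx_mu_ is still held here, so markReady() cannot flip ready_ and
        // drain the queue between the check above and this push_back.
        std::lock_guard<std::mutex> cb_lock(callbacks_mu_);
        callbacks_.push_back(std::move(callback));
        return;
      }
    }
    // Already ready: invoke outside of any lock.
    callback();
  }

  // Stand-in for the "secret added or updated" notification.
  void markReady() {
    std::vector<std::function<void()>> to_run;
    {
      // Same lock order as addReadyCb (ctx_mu_, then callbacks_mu_).
      std::lock_guard<std::mutex> ctx_lock(ctx_mu_);
      ready_ = true;
      std::lock_guard<std::mutex> cb_lock(callbacks_mu_);
      to_run.swap(callbacks_);
    }
    for (auto& cb : to_run) {
      cb();
    }
  }

private:
  std::mutex ctx_mu_;
  bool ready_ = false;
  std::mutex callbacks_mu_;
  std::vector<std::function<void()>> callbacks_;
};
```

Both paths invoke callbacks only after releasing the locks, so a callback that re-enters `addReadyCb` does not deadlock. Whether the real code needs this depends on which threads can reach each path, which is the open question in the thread above.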
/wait
* main: (32 commits)
  Stop processing pending H/2 frames if connection transitioned to the closed state
  http2: limit use of deferred resets in the http2 codec to server-side connections
  Abort filter chain iteration on local reply
  Reject or strip fragment from request URI
  ext-authz: merge duplicate headers from client request in check request
  common: introduce stable logger /w examples in DNS (envoyproxy#17772)
  route: fast return when route matches failed (envoyproxy#17769)
  owners: add owners for dubbo proxy network filter (envoyproxy#17820)
  config/router/tcp_proxy/options: v2 API, boosting and --bootstrap-version CLI removal. (envoyproxy#17724)
  coverage: revert the limit http/cache to 92.6. (envoyproxy#17817)
  network: rename SocketAddressProvider as ConnectionInfoProvider (envoyproxy#17717)
  test: bumping coverage (envoyproxy#17757)
  conn_pool: Minor cleanups to ConnPoolBaseImpl (envoyproxy#17710)
  Split VaryHeader into VaryAllowList and VaryUtils to organize vary-related logic (envoyproxy#17728)
  ext_proc: Make tests more resilient to IPv6 support (envoyproxy#17784)
  Remove invlaid backquote from doc (envoyproxy#17797)
  rocketmq: move to contrib (envoyproxy#17796)
  kafka: upstream kafka facade in mesh-filter (envoyproxy#17783)
  ecds: create shared base class for DynamicFilterConfigProviderImpl (envoyproxy#17735)
  Change log level from debug to trace (envoyproxy#17774)
  ...
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Before merging this, I want to check whether there is a simpler way to achieve my goals by reordering when the cluster manager starts health checks. I think there is already a step in that flow that waits for SDS to load via the init manager, which might cut down on the complexity of this change if it works.
/wait
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
Commit Message: Make health check loop wait for any required SDS secrets to be loaded before starting.
Additional Description: This should avoid an issue where Envoy might take over a minute to warm clusters with default settings when SDS and active health checking are used. The issue was caused by health checks starting before secrets are ready and then waiting for no_traffic_interval (default 60s) before trying again. This change makes health checks wait for SDS secrets to be ready before starting.
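To make the timing argument concrete, here is a deliberately simplified model of when the first health check that can succeed is sent. It assumes, per the description above, that a check attempted before the secrets are loaded fails and is retried only every no_traffic_interval, and it ignores jitter and every other health check interval. `firstUsableCheck` and its parameters are hypothetical names for illustration, not Envoy APIs.

```cpp
#include <chrono>

using std::chrono::milliseconds;

// Returns the time (measured from cluster creation) of the first health check
// that is sent at or after the moment the SDS secrets become available.
milliseconds firstUsableCheck(milliseconds secrets_ready_at,
                              milliseconds no_traffic_interval,
                              bool wait_for_secrets) {
  if (wait_for_secrets || no_traffic_interval.count() <= 0) {
    // With the change: the first check is sent as soon as secrets are ready.
    return secrets_ready_at;
  }
  if (secrets_ready_at.count() <= 0) {
    return milliseconds(0);
  }
  // Without the change: checks go out at 0, I, 2I, ... and every check sent
  // before the secrets arrive fails, so round up to the next multiple of I.
  const auto interval = no_traffic_interval.count();
  const auto intervals = (secrets_ready_at.count() + interval - 1) / interval;
  return milliseconds(intervals * interval);
}
```

Under this model, secrets that arrive 2 seconds after startup with the default 60 second no_traffic_interval give a first usable check at 60 seconds without waiting and at 2 seconds with waiting, which matches the "over a minute to warm" symptom described above.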
Risk Level: Medium
Testing: A concurrency test covers the bug that caused this to be reverted initially. It was run locally with --runs_per_test 100, confirming that the test fails when the bug is present and passes once it is fixed.
Docs Changes: None
Release Notes: Included
Platform Specific Features: None
Fixes #12389, #15977, #17529