health_checker: Make health check loop wait for any required SDS secrets to be loaded before starting.#16236
health_checker: Make health check loop wait for any required SDS secrets to be loaded before starting.#16236mpuncel wants to merge 7 commits intoenvoyproxy:mainfrom
Conversation
This commit is a revival of envoyproxy#13516 which aimed to solve envoyproxy#12389. That PR was merged and then reverted due to a deadlock bug, which is still present in this commit. The fix will be applied in a subsequent commit in order to make the resolution clearer to PR reviewers. Signed-off-by: Michael Puncel <mpuncel@squareup.com>
The bug was caused by not queuing callbacks in addReadyCb() while holding the lock. This means that you could make the decision to NOT run the callback, get a secret concurrently which would run all callbacks, and then add it to the queue at which point it would never run. This also changed the name of addReadyCb() to addHealthCheckingReadyCb() on HostImpl, and made the logic add a callback for the health checking transport socket. That's the socket that we actually want health checks to wait on which may be different from the data plane socket. Signed-off-by: Michael Puncel <mpuncel@squareup.com>
|
For reviewers: the first commit is basically just a revert-of-a-revert of #13516 and then rebased onto The second commit has the fix, and then the Run on my local machine demonstrating the bug: And after it was fixed: |
| // catch a race condition introduced in | ||
| // https://github.com/envoyproxy/envoy/pull/13516 where a callback would never | ||
| // be called due to a concurrency bug. | ||
| TEST_P(SslSocketTest, ServerAddSecretsReadyCallbackParallel) { |
There was a problem hiding this comment.
This and the next test are probably a bit odd since they start a thread to try to tickle a concurrency bug that has been fixed now, and only happened in ~5% of runs on my machine. This means if there is a regression for some reason we might not see it in the PR that introduces it.
Let me know if there is a better way of handling tests like this
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
| // 1 microsecond of sleep seems to be a sweet spot for giving the addReadyCb | ||
| // thread enough of a head start to reliably cause the test to fail with buggy | ||
| // code. | ||
| std::this_thread::sleep_for(std::chrono::microseconds(1)); |
There was a problem hiding this comment.
I'm failing the style check because of this, I'm looking into whether the issue can be reproduced with ThreadSynchronizer instead (thanks Greg for suggesting that on Slack).
It's a bit weird because with the buggy code I certainly can reproduce it 100% of the time with that, but then when it is fixed there is nowhere I can put the syncPoint() that would let me deadlock it (because the syncpoint would need to be put in a place where the mutex is held).
All of this makes me wonder if this test isn't worth checking in at all: maybe the value was just in demonstrating the bug and that it was fixed, but it's not so useful as a regression test since it's testing a very specific concurrency bug?
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
|
/assign-from @envoyproxy/first-pass-reviewers |
|
@envoyproxy/first-pass-reviewers assignee is @antoniovicente |
antoniovicente
left a comment
There was a problem hiding this comment.
Thanks for the fixes to SDS handling and health checks.
I'ld like to better understand threading involved in SDS updates vs health checkers. Please see comments below.
| if (ssl_ctx_) { | ||
| immediately_run_callback = true; | ||
| } else { | ||
| secrets_ready_callbacks_.push_back(callback); |
There was a problem hiding this comment.
What mutex guards secrets_ready_callbacks_ ?
You only have a reader lock on ssl_ctx_mu_ here, I think you need a write lock.
| const std::vector<std::string> server_names_; | ||
| mutable absl::Mutex ssl_ctx_mu_; | ||
| Envoy::Ssl::ServerContextSharedPtr ssl_ctx_ ABSL_GUARDED_BY(ssl_ctx_mu_); | ||
| std::list<std::function<void()>> secrets_ready_callbacks_; |
There was a problem hiding this comment.
Please add ABSL_GUARDED_BY
std::list<std::function<void()>> secrets_ready_callbacks_ ABSL_GUARDED_BY(ssl_ctx_mu_);
| if (ssl_ctx_) { | ||
| immediately_run_callback = true; | ||
| } else { | ||
| secrets_ready_callbacks_.push_back(callback); |
There was a problem hiding this comment.
I see no mechanism to cancel pending callbacks on ActiveHealthCheckSession deletion. Is it possible for it to take a long time for secrets to become available, and for the ActiveHealthCheckSession to be deleted before this callback is executed?
| Network::TransportSocketPtr | ||
| createTransportSocket(Network::TransportSocketOptionsSharedPtr options) const override; | ||
|
|
||
| // TODO(mpuncel) only invoke callback() once secrets are ready. |
There was a problem hiding this comment.
IIRC TSI socket does not use SDS to propagate credentials. I think that the current implementation is good enough.
| createTransportSocket(Network::TransportSocketOptionsSharedPtr options) const override; | ||
| bool implementsSecureTransport() const override { return false; } | ||
| bool usesProxyProtocolOptions() const override { return false; } | ||
| void addReadyCb(std::function<void()> callback) override { callback(); } |
There was a problem hiding this comment.
// TODO(mpuncel) only invoke callback() once secrets are ready. may be appropriate in this case.
| createTransportSocket(Network::TransportSocketOptionsSharedPtr options) const override; | ||
| bool implementsSecureTransport() const override; | ||
| bool usesProxyProtocolOptions() const override { return true; } | ||
| void addReadyCb(std::function<void()> callback) override { callback(); }; |
There was a problem hiding this comment.
// TODO(mpuncel) only invoke callback() once secrets are ready. may be appropriate in this case.
| } | ||
| if (should_run_callbacks) { | ||
| for (const auto& cb : secrets_ready_callbacks_) { | ||
| cb(); |
There was a problem hiding this comment.
On what thread to these callbacks run?
I don't know what the threading model looks like for health checker vs SDS. I think there's potential for callbacks being invoked from the wrong thread.
| } | ||
| } | ||
| if (should_run_callbacks) { | ||
| for (const auto& cb : secrets_ready_callbacks_) { |
There was a problem hiding this comment.
From what threads are callbacks added to this list? Do adds happen from outside the thread where onAddOrUpdateSecret() is called?
|
thanks for the feedback, I'll hopefully be able to address it in a couple days |
|
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
|
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Commit Message: Make health check loop wait for any required SDS secrets to be loaded before starting.
Additional Description: This should avoid an issue where Envoy might take over a minute to warm clusters with default settings when SDS and active health checking are used. The issue was caused by health checks starting before secrets are ready and then waiting for
no_traffic_interval(default 60s) before trying again. This change makes health checks wait for SDS secrets to be ready before starting.Risk Level: Medium
Testing: Concurrency test covers the bug that caused this to be reverted initially, and was tested locally with --runs_per_test 100, confirming test failures when the bug is present and none when it was fixed.
Docs Changes: None
Release Notes: Included
Platform Specific Features: None
Fixes #12389, #15977