sds: keep warming when dynamic inserted cluster can't be extracted secret entity #13344

lizan merged 28 commits into envoyproxy:master
Conversation
/cc @willliamaronli

@lizan @Shikugawa is there any progress on this PR? I think this is a fix for the release-blocker issue istio/istio#22443; could we accelerate it?
common_tls_context.tls_certificate_sds_secret_configs()) {
  auto& config_name = sds_secret_config.name();
  auto& config_source = sds_secret_config.sds_config();
  if (!secret_manager_.checkTlsCertificateEntityExists(config_source, config_name)) {
Could you add a comment explaining when checkTlsCertificateEntityExists can return false?
What is the recovery path if it's a temporary failure? An implicit retry? Redelivery of the same SDS config or cluster config?
I think a cluster whose secret entity is missing should be treated as having failed to initialize, so it should stay warming forever. Do you have any ideas for making this more reliable?
howardjohn left a comment:
I tried this out on my cluster; I believe this has introduced a crash:
2020-10-14T20:09:44.575361Z critical envoy assert assert failure: !std::any_of(cluster_list.begin(), cluster_list.end(), [&existing_warming_cluster](Cluster* cluster) { return existing_warming_cluster->second->cluster_.get() == cluster; }).
2020-10-14T20:09:44.575747Z critical envoy backtrace Caught Aborted, suspect faulting address 0x53900000012
2020-10-14T20:09:44.575826Z critical envoy backtrace Backtrace (use tools/stack_decode.py to get line numbers):
2020-10-14T20:09:44.575838Z critical envoy backtrace Envoy version: 697c43c3b7d970bb01de403fcac2a9bf218a2562/1.17.0-dev/Clean/DEBUG/BoringSSL
2020-10-14T20:09:44.767932Z critical envoy backtrace #0: Envoy::SignalAction::sigHandler() [0x55b51b353a5c]
2020-10-14T20:09:44.768136Z critical envoy backtrace #1: __restore_rt [0x7fcf9c1a08a0]
2020-10-14T20:09:44.918339Z critical envoy backtrace #2: Envoy::Upstream::CdsApiImpl::onConfigUpdate() [0x55b51a5dd073]
2020-10-14T20:09:45.052725Z critical envoy backtrace #3: Envoy::Upstream::CdsApiImpl::onConfigUpdate() [0x55b51a5dc689]
2020-10-14T20:09:45.184972Z critical envoy backtrace #4: Envoy::Config::GrpcSubscriptionImpl::onConfigUpdate() [0x55b51ad67d41]
2020-10-14T20:09:45.352374Z critical envoy backtrace #5: Envoy::Config::GrpcMuxImpl::onDiscoveryResponse() [0x55b51ad6cece]
2020-10-14T20:09:45.542644Z critical envoy backtrace #6: Envoy::Config::GrpcStream<>::onReceiveMessage() [0x55b51ad72cd7]
2020-10-14T20:09:45.699578Z critical envoy backtrace #7: Envoy::Grpc::AsyncStreamCallbacks<>::onReceiveMessageRaw() [0x55b51ad72869]
2020-10-14T20:09:45.871283Z critical envoy backtrace #8: Envoy::Grpc::AsyncStreamImpl::onData() [0x55b51ada11f6]
2020-10-14T20:09:46.418313Z critical envoy backtrace #9: Envoy::Http::AsyncStreamImpl::encodeData() [0x55b51adae98b]
2020-10-14T20:09:46.823011Z critical envoy backtrace #10: Envoy::Router::Filter::onUpstreamData() [0x55b51ae2fa4e]
2020-10-14T20:09:47.082308Z critical envoy backtrace #11: Envoy::Router::UpstreamRequest::decodeData() [0x55b51ae460e6]
2020-10-14T20:09:47.248533Z critical envoy backtrace #12: Envoy::Http::ResponseDecoderWrapper::decodeData() [0x55b51a868b3c]
2020-10-14T20:09:47.404514Z critical envoy backtrace #13: Envoy::Http::Http2::ConnectionImpl::onFrameReceived() [0x55b51acdd652]
2020-10-14T20:09:47.595835Z critical envoy backtrace #14: Envoy::Http::Http2::ConnectionImpl::Http2Callbacks::Http2Callbacks()::$_18::operator()() [0x55b51ace9b00]
2020-10-14T20:09:47.765358Z critical envoy backtrace #15: Envoy::Http::Http2::ConnectionImpl::Http2Callbacks::Http2Callbacks()::$_18::__invoke() [0x55b51ace9ac5]
2020-10-14T20:09:47.938590Z critical envoy backtrace #16: session_call_on_frame_received [0x55b51b09d802]
2020-10-14T20:09:48.091134Z critical envoy backtrace #17: nghttp2_session_on_data_received [0x55b51b09f2ae]
2020-10-14T20:09:48.227366Z critical envoy backtrace #18: session_process_data_frame [0x55b51b0a3ba7]
2020-10-14T20:09:48.379274Z critical envoy backtrace #19: nghttp2_session_mem_recv [0x55b51b0a1cba]
2020-10-14T20:09:48.542720Z critical envoy backtrace #20: Envoy::Http::Http2::ConnectionImpl::innerDispatch() [0x55b51acd945f]
2020-10-14T20:09:48.767699Z critical envoy backtrace #21: Envoy::Http::Http2::ConnectionImpl::dispatch()::$_8::operator()() [0x55b51acebc4a]
2020-10-14T20:09:48.946253Z critical envoy backtrace #22: std::_Function_handler<>::_M_invoke() [0x55b51acebaeb]
2020-10-14T20:09:49.130159Z critical envoy backtrace #23: std::function<>::operator()() [0x55b51b011791]
2020-10-14T20:09:49.306681Z critical envoy backtrace #24: Envoy::Http::Utility::exceptionToStatus() [0x55b51b005438]
2020-10-14T20:09:49.501596Z critical envoy backtrace #25: Envoy::Http::Http2::ConnectionImpl::dispatch() [0x55b51acd8cbc]
2020-10-14T20:09:49.645102Z critical envoy backtrace #26: Envoy::Http::Http2::ConnectionImpl::dispatch() [0x55b51acd8ddb]
2020-10-14T20:09:49.886405Z critical envoy backtrace #27: Envoy::Http::CodecClient::onData() [0x55b51a99b41a]
2020-10-14T20:09:50.161979Z critical envoy backtrace #28: Envoy::Http::CodecClient::CodecReadFilter::onData() [0x55b51a99fde7]
2020-10-14T20:09:50.403897Z critical envoy backtrace #29: Envoy::Network::FilterManagerImpl::onContinueReading() [0x55b51a519e35]
2020-10-14T20:09:50.553617Z critical envoy backtrace #30: Envoy::Network::FilterManagerImpl::onRead() [0x55b51a51a2da]
2020-10-14T20:09:50.694561Z critical envoy backtrace #31: Envoy::Network::ConnectionImpl::onRead() [0x55b51a5022e1]
2020-10-14T20:09:50.862013Z critical envoy backtrace #32: Envoy::Network::ConnectionImpl::onReadReady() [0x55b51a50d156]
2020-10-14T20:09:51.022075Z critical envoy backtrace #33: Envoy::Network::ConnectionImpl::onFileEvent() [0x55b51a5090ef]
2020-10-14T20:09:51.162114Z critical envoy backtrace #34: Envoy::Network::ConnectionImpl::ConnectionImpl()::$_6::operator()() [0x55b51a51208e]
2020-10-14T20:09:51.330130Z critical envoy backtrace #35: std::_Function_handler<>::_M_invoke() [0x55b51a511f51]
2020-10-14T20:09:51.529677Z critical envoy backtrace #36: std::function<>::operator()() [0x55b51a4e9fba]
2020-10-14T20:09:51.726144Z critical envoy backtrace #37: Envoy::Event::FileEventImpl::mergeInjectedEventsAndRunCb() [0x55b51a4e8fcb]
2020-10-14T20:09:51.907346Z critical envoy backtrace #38: Envoy::Event::FileEventImpl::assignEvents()::$_1::operator()() [0x55b51a4e9589]
2020-10-14T20:09:52.064777Z critical envoy backtrace #39: Envoy::Event::FileEventImpl::assignEvents()::$_1::__invoke() [0x55b51a4e90d6]
2020-10-14T20:09:52.240164Z critical envoy backtrace #40: event_persist_closure [0x55b51b08029e]
2020-10-14T20:09:52.400234Z critical envoy backtrace #41: event_process_active_single_queue [0x55b51b07f8c8]
2020-10-14T20:09:52.558007Z critical envoy backtrace #42: event_process_active [0x55b51b07a08a]
2020-10-14T20:09:52.780285Z critical envoy backtrace #43: event_base_loop [0x55b51b078f3c]
2020-10-14T20:09:52.942622Z critical envoy backtrace #44: Envoy::Event::LibeventScheduler::run() [0x55b51b05dcb4]
2020-10-14T20:09:53.131037Z critical envoy backtrace #45: Envoy::Event::DispatcherImpl::run() [0x55b51a4def6a]
2020-10-14T20:09:53.305119Z critical envoy backtrace #46: Envoy::Server::InstanceImpl::run() [0x55b51a4698d2]
2020-10-14T20:09:53.532208Z critical envoy backtrace #47: Envoy::MainCommonBase::run() [0x55b51786f074]
2020-10-14T20:09:53.742968Z critical envoy backtrace #48: Envoy::MainCommon::run() [0x55b517872dce]
2020-10-14T20:09:53.945325Z critical envoy backtrace #49: Envoy::MainCommon::main() [0x55b51786fd7c]
2020-10-14T20:09:54.087207Z critical envoy backtrace #50: main [0x55b51786cd3b]
2020-10-14T20:09:54.087701Z critical envoy backtrace #51: __libc_start_main [0x7fcf9bdbeb97]
AsyncClient 0x25b4bf272300, stream_id_: 1817134650148571154
&stream_info_:
StreamInfoImpl 0x25b4bf272490, protocol_: 1, response_code_: null, response_code_details_: null, health_check_request_: 0, route_name_:
I have hit this fairly often; it happens roughly 5% of the time.
@howardjohn This problem is caused in the cluster manager, so I will fix it in #12783. Thank you for reporting!

@lizan ping
lizan left a comment:
LGTM, @htuch or @mattklein123 for non-Tetrate review.
// Context, when it failed to extract them via SDS, it will fail to change cluster status from
// warming to active. In current implementation, there is no strategy to activate clusters
// which failed to initialize at once.
// TODO(shikugawa): Consider retry strategy of clusters which failed to activate at once.
This seems a bit dicey. If I'm reading the code/comments correctly, we have a race condition in cluster warming where a slow SDS update will leave a cluster stuck in the warming state forever.
@htuch Sorry, I can't see what problem would be caused by a cluster being stuck in the warming state forever; that would be the expected behavior. Are you saying the retry comment is not required?
The issue is that this is very surprising behavior to an Envoy operator who experiences a race between SDS availability and cluster warming. In fact, I'd argue that this effectively breaks eventual warming semantics.
It looks like, from the comment, that the right long-term solution is to continue warming when SDS makes the necessary secret available.
I think we should have this behavior. If you don't want to do it in this PR, then we should at least put in place a runtime flag that disables this new behavior by default, and leave it disabled until we actually fix warming for folks who are relying on eventual consistency.
Sure. So what you are saying is that the cluster should keep warming and become available once the missing secret entity arrives, right? This feature is a blocker for the next Istio release, so for now I will put the new behavior behind a runtime feature flag.
@htuch SGTM.
…cret entity (envoyproxy#13344)

This PR highly depends on envoyproxy#12783. Changed to keep clusters warming when dynamically inserted clusters (whose initialization has not finished) fail to extract the TLS certificate and certificate validation context; they should not be marked as ACTIVE clusters.

Risk Level: Mid
Testing: Unit
Docs Changes:
Release Notes: Added
Fixes envoyproxy#11120; future work: envoyproxy#13777

Signed-off-by: Shikugawa <rei@tetrate.io>
* cluster manager: avoid immediate activation for dynamic inserted cluster when initialize (envoyproxy#12783)
* sds: keep warming when dynamic inserted cluster can't be extracted secret entity (envoyproxy#13344)

Signed-off-by: Shikugawa <rei@tetrate.io>
Co-authored-by: Rei Shimizu <rei@tetrate.io>
…acted secret entity (envoyproxy#13344)"

This reverts commit e5a8c21.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
Commit Message: fix #11120. This PR highly depends on #12783. Changed to keep clusters warming when dynamically inserted clusters (whose initialization has not finished) fail to extract the TLS certificate and certificate validation context; they should not be marked as ACTIVE clusters.
Additional Description:
Risk Level: Mid
Testing: Unit
Docs Changes:
Release Notes: