Skip to content

sds: keep warming when dynamic inserted cluster can't be extracted secret entity#13344

Merged
lizan merged 28 commits intoenvoyproxy:masterfrom
Shikugawa:fix-sds-activate-timing
Oct 27, 2020
Merged

sds: keep warming when dynamic inserted cluster can't be extracted secret entity#13344
lizan merged 28 commits intoenvoyproxy:masterfrom
Shikugawa:fix-sds-activate-timing

Conversation

@Shikugawa
Copy link
Member

@Shikugawa Shikugawa commented Oct 1, 2020

Commit Message: fix #11120. This PR highly depends on #12783. Changed to keep warming if dynamic inserted clusters (when initialize doesn't finished) failed to extract TLS certificate and certificate validation context. They shouldn't be indicated as ACTIVE cluster.
Additional Description:
Risk Level: Mid
Testing: Unit
Docs Changes:
Release Notes:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]

…ter when initialize

Signed-off-by: Shikugawa <rei@tetrate.io>
Signed-off-by: Shikugawa <rei@tetrate.io>
Signed-off-by: Shikugawa <rei@tetrate.io>
…cret entity

Signed-off-by: Shikugawa <rei@tetrate.io>
Signed-off-by: Shikugawa <rei@tetrate.io>
Signed-off-by: Shikugawa <rei@tetrate.io>
@Shikugawa Shikugawa marked this pull request as ready for review October 1, 2020 09:32
@Shikugawa
Copy link
Member Author

/review @lizan @JimmyCYJ

Signed-off-by: Shikugawa <rei@tetrate.io>
Signed-off-by: Shikugawa <rei@tetrate.io>
@Shikugawa Shikugawa requested a review from snowp as a code owner October 1, 2020 11:38
Signed-off-by: Shikugawa <rei@tetrate.io>
Signed-off-by: Shikugawa <rei@tetrate.io>
@williamaronli
Copy link

/cc @willliamaronli

@williamaronli
Copy link

@lizan @Shikugawa do we have any process for this PR. I think this is a fix for this release blocker issue: istio/istio#22443

could we accelerate it

common_tls_context.tls_certificate_sds_secret_configs()) {
auto& config_name = sds_secret_config.name();
auto& config_source = sds_secret_config.sds_config();
if (!secret_manager_.checkTlsCertificateEntityExists(config_source, config_name)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add the comment when checkTlsCertificateEntityExists could return false?

What is the recovery path to fix if it's temporary failure? An implicit retry? Redeliver the same SDS config or cluster config?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that there is no secret entity should be treated as failed to initialize cluster, so that it should be warming forever. Do you have any idea about this to be more reliable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment addition done.

Signed-off-by: Shikugawa <rei@tetrate.io>
Signed-off-by: Shikugawa <rei@tetrate.io>
Signed-off-by: Shikugawa <rei@tetrate.io>
Copy link
Contributor

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this out on my cluster, I believe this has introduced a crash:

2020-10-14T20:09:44.575361Z     critical        envoy assert    assert failure: !std::any_of(cluster_list.begin(), cluster_list.end(), [&existing_warming_cluster](Cluster* cluster) { return existing_warming_cluster->second->cluster_.get() == cluster; }).
2020-10-14T20:09:44.575747Z     critical        envoy backtrace Caught Aborted, suspect faulting address 0x53900000012
2020-10-14T20:09:44.575826Z     critical        envoy backtrace Backtrace (use tools/stack_decode.py to get line numbers):
2020-10-14T20:09:44.575838Z     critical        envoy backtrace Envoy version: 697c43c3b7d970bb01de403fcac2a9bf218a2562/1.17.0-dev/Clean/DEBUG/BoringSSL
2020-10-14T20:09:44.767932Z     critical        envoy backtrace #0: Envoy::SignalAction::sigHandler() [0x55b51b353a5c]
2020-10-14T20:09:44.768136Z     critical        envoy backtrace #1: __restore_rt [0x7fcf9c1a08a0]
2020-10-14T20:09:44.918339Z     critical        envoy backtrace #2: Envoy::Upstream::CdsApiImpl::onConfigUpdate() [0x55b51a5dd073]
2020-10-14T20:09:45.052725Z     critical        envoy backtrace #3: Envoy::Upstream::CdsApiImpl::onConfigUpdate() [0x55b51a5dc689]
2020-10-14T20:09:45.184972Z     critical        envoy backtrace #4: Envoy::Config::GrpcSubscriptionImpl::onConfigUpdate() [0x55b51ad67d41]
2020-10-14T20:09:45.352374Z     critical        envoy backtrace #5: Envoy::Config::GrpcMuxImpl::onDiscoveryResponse() [0x55b51ad6cece]
2020-10-14T20:09:45.542644Z     critical        envoy backtrace #6: Envoy::Config::GrpcStream<>::onReceiveMessage() [0x55b51ad72cd7]
2020-10-14T20:09:45.699578Z     critical        envoy backtrace #7: Envoy::Grpc::AsyncStreamCallbacks<>::onReceiveMessageRaw() [0x55b51ad72869]
2020-10-14T20:09:45.871283Z     critical        envoy backtrace #8: Envoy::Grpc::AsyncStreamImpl::onData() [0x55b51ada11f6]
2020-10-14T20:09:46.418313Z     critical        envoy backtrace #9: Envoy::Http::AsyncStreamImpl::encodeData() [0x55b51adae98b]
2020-10-14T20:09:46.823011Z     critical        envoy backtrace #10: Envoy::Router::Filter::onUpstreamData() [0x55b51ae2fa4e]
2020-10-14T20:09:47.082308Z     critical        envoy backtrace #11: Envoy::Router::UpstreamRequest::decodeData() [0x55b51ae460e6]
2020-10-14T20:09:47.248533Z     critical        envoy backtrace #12: Envoy::Http::ResponseDecoderWrapper::decodeData() [0x55b51a868b3c]
2020-10-14T20:09:47.404514Z     critical        envoy backtrace #13: Envoy::Http::Http2::ConnectionImpl::onFrameReceived() [0x55b51acdd652]
2020-10-14T20:09:47.595835Z     critical        envoy backtrace #14: Envoy::Http::Http2::ConnectionImpl::Http2Callbacks::Http2Callbacks()::$_18::operator()() [0x55b51ace9b00]
2020-10-14T20:09:47.765358Z     critical        envoy backtrace #15: Envoy::Http::Http2::ConnectionImpl::Http2Callbacks::Http2Callbacks()::$_18::__invoke() [0x55b51ace9ac5]
2020-10-14T20:09:47.938590Z     critical        envoy backtrace #16: session_call_on_frame_received [0x55b51b09d802]
2020-10-14T20:09:48.091134Z     critical        envoy backtrace #17: nghttp2_session_on_data_received [0x55b51b09f2ae]
2020-10-14T20:09:48.227366Z     critical        envoy backtrace #18: session_process_data_frame [0x55b51b0a3ba7]
2020-10-14T20:09:48.379274Z     critical        envoy backtrace #19: nghttp2_session_mem_recv [0x55b51b0a1cba]
2020-10-14T20:09:48.542720Z     critical        envoy backtrace #20: Envoy::Http::Http2::ConnectionImpl::innerDispatch() [0x55b51acd945f]
2020-10-14T20:09:48.767699Z     critical        envoy backtrace #21: Envoy::Http::Http2::ConnectionImpl::dispatch()::$_8::operator()() [0x55b51acebc4a]
2020-10-14T20:09:48.946253Z     critical        envoy backtrace #22: std::_Function_handler<>::_M_invoke() [0x55b51acebaeb]
2020-10-14T20:09:49.130159Z     critical        envoy backtrace #23: std::function<>::operator()() [0x55b51b011791]
2020-10-14T20:09:49.306681Z     critical        envoy backtrace #24: Envoy::Http::Utility::exceptionToStatus() [0x55b51b005438]
2020-10-14T20:09:49.501596Z     critical        envoy backtrace #25: Envoy::Http::Http2::ConnectionImpl::dispatch() [0x55b51acd8cbc]
2020-10-14T20:09:49.645102Z     critical        envoy backtrace #26: Envoy::Http::Http2::ConnectionImpl::dispatch() [0x55b51acd8ddb]
2020-10-14T20:09:49.886405Z     critical        envoy backtrace #27: Envoy::Http::CodecClient::onData() [0x55b51a99b41a]
2020-10-14T20:09:50.161979Z     critical        envoy backtrace #28: Envoy::Http::CodecClient::CodecReadFilter::onData() [0x55b51a99fde7]
2020-10-14T20:09:50.403897Z     critical        envoy backtrace #29: Envoy::Network::FilterManagerImpl::onContinueReading() [0x55b51a519e35]
2020-10-14T20:09:50.553617Z     critical        envoy backtrace #30: Envoy::Network::FilterManagerImpl::onRead() [0x55b51a51a2da]
2020-10-14T20:09:50.694561Z     critical        envoy backtrace #31: Envoy::Network::ConnectionImpl::onRead() [0x55b51a5022e1]
2020-10-14T20:09:50.862013Z     critical        envoy backtrace #32: Envoy::Network::ConnectionImpl::onReadReady() [0x55b51a50d156]
2020-10-14T20:09:51.022075Z     critical        envoy backtrace #33: Envoy::Network::ConnectionImpl::onFileEvent() [0x55b51a5090ef]
2020-10-14T20:09:51.162114Z     critical        envoy backtrace #34: Envoy::Network::ConnectionImpl::ConnectionImpl()::$_6::operator()() [0x55b51a51208e]
2020-10-14T20:09:51.330130Z     critical        envoy backtrace #35: std::_Function_handler<>::_M_invoke() [0x55b51a511f51]
2020-10-14T20:09:51.529677Z     critical        envoy backtrace #36: std::function<>::operator()() [0x55b51a4e9fba]
2020-10-14T20:09:51.726144Z     critical        envoy backtrace #37: Envoy::Event::FileEventImpl::mergeInjectedEventsAndRunCb() [0x55b51a4e8fcb]
2020-10-14T20:09:51.907346Z     critical        envoy backtrace #38: Envoy::Event::FileEventImpl::assignEvents()::$_1::operator()() [0x55b51a4e9589]
2020-10-14T20:09:52.064777Z     critical        envoy backtrace #39: Envoy::Event::FileEventImpl::assignEvents()::$_1::__invoke() [0x55b51a4e90d6]
2020-10-14T20:09:52.240164Z     critical        envoy backtrace #40: event_persist_closure [0x55b51b08029e]
2020-10-14T20:09:52.400234Z     critical        envoy backtrace #41: event_process_active_single_queue [0x55b51b07f8c8]
2020-10-14T20:09:52.558007Z     critical        envoy backtrace #42: event_process_active [0x55b51b07a08a]
2020-10-14T20:09:52.780285Z     critical        envoy backtrace #43: event_base_loop [0x55b51b078f3c]
2020-10-14T20:09:52.942622Z     critical        envoy backtrace #44: Envoy::Event::LibeventScheduler::run() [0x55b51b05dcb4]
2020-10-14T20:09:53.131037Z     critical        envoy backtrace #45: Envoy::Event::DispatcherImpl::run() [0x55b51a4def6a]
2020-10-14T20:09:53.305119Z     critical        envoy backtrace #46: Envoy::Server::InstanceImpl::run() [0x55b51a4698d2]
2020-10-14T20:09:53.532208Z     critical        envoy backtrace #47: Envoy::MainCommonBase::run() [0x55b51786f074]
2020-10-14T20:09:53.742968Z     critical        envoy backtrace #48: Envoy::MainCommon::run() [0x55b517872dce]
2020-10-14T20:09:53.945325Z     critical        envoy backtrace #49: Envoy::MainCommon::main() [0x55b51786fd7c]
2020-10-14T20:09:54.087207Z     critical        envoy backtrace #50: main [0x55b51786cd3b]
2020-10-14T20:09:54.087701Z     critical        envoy backtrace #51: __libc_start_main [0x7fcf9bdbeb97]
AsyncClient 0x25b4bf272300, stream_id_: 1817134650148571154
&stream_info_:
  StreamInfoImpl 0x25b4bf272490, protocol_: 1, response_code_: null, response_code_details_: null, health_check_request_: 0, route_name_:

I have gotten this fairly often, looks like ~5% of the time

@Shikugawa
Copy link
Member Author

@howardjohn This problem is caused on cluster manager. So I will fix it on #12783. Thank you for reporting!

Signed-off-by: Shikugawa <rei@tetrate.io>
@Shikugawa
Copy link
Member Author

@lizan ping

Signed-off-by: Shikugawa <rei@tetrate.io>
Signed-off-by: Shikugawa <rei@tetrate.io>
…-sds-activate-timing

Signed-off-by: Shikugawa <rei@tetrate.io>
Signed-off-by: Shikugawa <rei@tetrate.io>
lizan
lizan previously approved these changes Oct 23, 2020
Copy link
Member

@lizan lizan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @htuch or @mattklein123 for non-Tetrands review.

@mattklein123 mattklein123 self-assigned this Oct 23, 2020
// Context, when it failed to extract them via SDS, it will fail to change cluster status from
// warming to active. In current implementation, there is no strategy to activate clusters
// which failed to initialize at once.
// TODO(shikugawa): Consider retry strategy of clusters which failed to activate at once.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit dicey. If I'm reading the code/comments correctly, we have a race condition in cluster warming where a slow SDS update will leave a cluster stuck in the warming state forever.

Copy link
Member Author

@Shikugawa Shikugawa Oct 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@htuch I can't imagine what problem will cause if warming state is being stuck forever. Sorry. It will be an expected behavior. What you say is that the retry comment is not required, isn't it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that this is very surprising behavior to an Envoy operator who experiences a race between SDS availability and cluster warming. In fact, I'd argue that this effectively breaks eventual warming semantics.

It looks like, from the comment, that the right long-term solution is to continue warming when SDS makes the necessary secret available.

I think we should have this behavior. If you don't want to do it in this PR, then we should at least put in place a runtime flag that disables this new behavior by default, and leave it disabled until we actually fix warming for folks who are relying on eventual consistency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. So to speak, what you said is that we should implement to be available by keeping warming after no-available secret entity behavior occured, right? This feature is the blocker of next Istio release. So I overcome this by setting runtime feature flag temporary.

Signed-off-by: Shikugawa <rei@tetrate.io>
Signed-off-by: Shikugawa <rei@tetrate.io>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lizan are you in agreement here from an API shepherd angle?

Signed-off-by: Shikugawa <rei@tetrate.io>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo @lizan confirming the behavioral assertions I made. Also, can you open a tracking issue and reference it in the PR for the future work to be done to keep warming after secret failure?

@lizan
Copy link
Member

lizan commented Oct 27, 2020

@htuch SGTM.

@lizan lizan merged commit e5a8c21 into envoyproxy:master Oct 27, 2020
lizan pushed a commit to lizan/envoy that referenced this pull request Oct 30, 2020
…cret entity (envoyproxy#13344)

This PR highly depends on envoyproxy#12783. Changed to keep warming if dynamic inserted clusters (when initialize doesn't finished) failed to extract TLS certificate and certificate validation context. They shouldn't be indicated as ACTIVE cluster.
Risk Level: Mid
Testing: Unit
Docs Changes:
Release Notes: Added
Fixes envoyproxy#11120, future work: envoyproxy#13777

Signed-off-by: Shikugawa <rei@tetrate.io>
@lizan lizan mentioned this pull request Oct 30, 2020
istio-testing pushed a commit to istio/envoy that referenced this pull request Nov 2, 2020
* cluster manager: avoid immediate activation for dynamic inserted cluster when initialize (envoyproxy#12783)

Signed-off-by: Shikugawa <rei@tetrate.io>

* sds: keep warming when dynamic inserted cluster can't be extracted secret entity (envoyproxy#13344)

This PR highly depends on envoyproxy#12783. Changed to keep warming if dynamic inserted clusters (when initialize doesn't finished) failed to extract TLS certificate and certificate validation context. They shouldn't be indicated as ACTIVE cluster.
Risk Level: Mid
Testing: Unit
Docs Changes:
Release Notes: Added
Fixes envoyproxy#11120, future work: envoyproxy#13777

Signed-off-by: Shikugawa <rei@tetrate.io>

Co-authored-by: Rei Shimizu <rei@tetrate.io>
rgs1 pushed a commit to rgs1/envoy that referenced this pull request Nov 3, 2020
…acted secret entity (envoyproxy#13344)"

This reverts commit e5a8c21.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
mattklein123 pushed a commit that referenced this pull request Nov 3, 2020
…acted secret entity (#13344)" (#13886)

This reverts commit e5a8c21.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sds: cluster not warming while certificates are being fetched; immediately marked active

7 participants