-
Notifications
You must be signed in to change notification settings - Fork 5.5k
sds: keep warming when dynamic inserted cluster can't be extracted secret entity #13894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 43 commits
b882bde
7e9332d
8e99e81
478f578
b4af995
8cda02f
5841c9e
99276bc
6e95a07
3c600ec
fa718a9
cb5a373
456b5b9
b77dae0
c24b729
476d706
96a71c4
3a2cfea
34ea49c
b2b99be
5bbae5f
1e4f950
31354e4
39606e6
0ad2773
186c305
5f83864
1cdfbe6
0e237a1
2a46a40
231bfff
3c6902f
d05a1a3
a4fcf00
1575c36
a18c4a6
9fb4528
fb056ad
31bfe83
6ff7ae2
d1cf6a2
37a272f
c83b7c8
85639eb
edead46
150a789
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -431,6 +431,19 @@ void ClusterManagerImpl::onClusterInit(Cluster& cluster) { | |
| // We have a situation that clusters will be immediately active, such as static and primary | ||
| // cluster. So we must have this prevention logic here. | ||
| if (cluster_data != warming_clusters_.end()) { | ||
| // If there is no secret entity, currently supports only TLS Certificate and Validation | ||
| // Context, when it failed to extract them via SDS, it will fail to change cluster status from | ||
| // warming to active. In current implementation, there is no strategy to activate clusters | ||
| // which failed to initialize at once. | ||
| // TODO(shikugawa): To implement to be available by keeping warming after no-available secret | ||
| // entity behavior occurred. And remove | ||
| // `envoy.reloadable_features.cluster_keep_warming_no_secret_entity` runtime feature flag. | ||
| if (Runtime::runtimeFeatureEnabled( | ||
| "envoy.reloadable_features.cluster_keep_warming_no_secret_entity") && | ||
| !cluster.info()->transportSocketMatcher().factoriesReady()) { | ||
| ENVOY_LOG(warn, "Failed to activate {}", cluster.info()->name()); | ||
| return; | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I still don't understand how this blocks initialization. Unless I am missing something, the ClusterManagerInitHelper will still complete initialization with this early return. This will cause workers to start, etc., leading to 503s. If this is not the case can you update the comments? This will block warming -> active in a subsequent CDS update, which is marginally better, but I don't think it fixes the problem of server init?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add the additional comment from the offline:
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right. That's what I thought. |
||
| } | ||
| clusterWarmingToActive(cluster.info()->name()); | ||
| updateClusterCounts(); | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the cluster ever activate? I think this function never gets called again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is keep the cluster warming unless it has the a valid secret entity since it can't be used in meaningful way, because the cluster will end up using NotReadySslSocket and return 503s always. That's what the TODO above is about (#13952) and the long term fix #13952
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But how can the cluster ever leave warming? It needs to get updated again by the management server?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, until #13777 (that's the TODO in the previous comment but I pasted wrong)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But stepping back, what does this change give us? If the cluster stays in warming and the server starts up, we will still get 503s? Or if a cluster gets stuck in warming even if the user sets a timeout, it seems like it would be extremely difficult to debug this?
What is the exact scenario we are trying to protect against? I assume the server came up and SDS is not ready during a CDS update? Because I don't think this woulds block initial server startup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK that's fine, but I don't really understand what the implications of this are and whether it's going to cause more confusing behavior that is also hard to debug. Can you add more comments and we can discuss? Do we need stats for this also? I just don't know what we are dealing with here honestly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this won't cause more confusing behavior. While I agree that the current behavior is still confusing because the cluster with not ready ssl will fail when it gets traffic. The problem that this PR fix is that Envoy won't advertise itself is ready in that case.
The runtime flag is off by default and will be removed when we have long term fix. I don't think we need a stats for this given we have a warn level log and the flag is off by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK can you add more comments and we can go from there? I think we are going in circles as I'm unclear on what the old/new behavior is. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattklein123 reworded the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@howardjohn I have seen such an issue as well. Was this root caused to a bug in envoy?