config: do not finish initialization on stream disconnection#7427
mattklein123 merged 20 commits into envoyproxy:master
Conversation
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
|
/retest |
|
🔨 rebuilding |
|
@mattklein123 @htuch Please see if this makes sense |
|
IMO this is not safe, since a bad management server can cause Envoy to hang. It seems like maybe you want some kind of feature where you can have some number of retries during init before you give up? |
|
@mattklein123 But the problem is that an accidental network disconnection can wipe out the cluster members. Also, I thought initial_fetch_timeout is meant to do that safety handling, i.e. when no response comes within the time, it just moves ahead with the initialization. But looking at the initial fetch timeout timer callback implementation, unfortunately it is just passing Should we pass a valid exception like |
|
@ramaraochavali my general feeling is that we need to have sane defaults. Personally, I would be in favor of making a default initial fetch timeout if none is specified. Perhaps 15s or so? If we have a default in place, as well as good documentation around this, I would be more inclined to do a change like this. @htuch WDYT? |
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
|
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
|
@mattklein123 I agree we should have sane defaults for However I see your point about having some sane defaults for `initial_fetch_timeout` so that we are guaranteed not to hang in any case. So should I first push another PR to just change the |
IMO yes. @htuch any thoughts here? |
|
What is the underlying temporal property we want here? Is it something like "When an Envoy is initialized, it will have the complete state from the management server of all its APIs" or more like "An Envoy is always guaranteed to initialize within a bounded period of time, with a best effort made to obtain the complete set of xDS configuration within that subject to the management server availability"? Do we want to support both models? Either way, any PR should probably elaborate on the xDS docs and explain what the key properties we're shooting for here. |
|
I think @mattklein123 is leaning towards "An Envoy is always guaranteed to initialize within a bounded period of time, with a best effort made to obtain the complete set of xDS configuration within that subject to the management server availability", which seems reasonable to me. Is that correct @mattklein123 ? |
|
Yeah that is my thinking. |
|
That's fair, I think this needs to be documented in the server initialization docs, since this is a key principle we need to respect going forward. |
|
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
|
This is waiting on this PR #7571 |
|
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
mattklein123
left a comment
Thanks for working on this. At a high level I think this change makes sense, but I would like to work on making the control flow more clear and less hacky. I'm going to also assign @htuch to review as he has worked on this code a lot more. Thank you!
/wait
| ENVOY_LOG(warn, "gRPC config: initial fetch timed out for {}", type_url_); | ||
| callbacks_.onConfigUpdateFailed(nullptr); | ||
| try { | ||
| throw EnvoyException("initial fetch timed out"); |
You should be able to allocate an EnvoyException on the stack and pass it into the update failed function.
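A minimal sketch of that suggestion, assuming a callback that takes a non-owning `const EnvoyException*`. The `EnvoyException` class below is a hypothetical stand-in (not Envoy's real header), and `UpdateFailedCallback` and `reportInitialFetchTimeout` are illustrative names that do not appear in the PR:

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for Envoy's EnvoyException; only what() matters here.
class EnvoyException : public std::runtime_error {
public:
  explicit EnvoyException(const std::string& message) : std::runtime_error(message) {}
};

// Assumed callback shape: the failure handler receives a non-owning pointer.
using UpdateFailedCallback = std::function<void(const EnvoyException*)>;

// Constructs the exception on the stack and passes its address to the callback,
// with no throw/catch round trip. The pointer is valid only during the call.
void reportInitialFetchTimeout(const UpdateFailedCallback& on_failed) {
  const EnvoyException e("initial fetch timed out");
  on_failed(&e);
}
```

The callee must not retain the pointer after it returns, since the exception object goes out of scope with the caller's stack frame.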
| stats_.update_failure_.inc(); | ||
| ENVOY_LOG(debug, "gRPC update for {} failed", type_url_); | ||
| } else { | ||
| // fetch timeout should be disabled only when the actual timeout happens - not on network |
This seems very fragile to me that we are using exceptions in this way to figure out a timeout vs. not, etc. Can you rework this code to make the control flow a lot more obvious? It's possible that you might need to rework how the update failed functions work.
Agree this is fragile. I did not try to change that because it triggers more changes because of how the update failed functions work today and this existed earlier as well. But it would be good to clean this up. Let me try.
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
…ction Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
|
@mattklein123 @htuch Added |
htuch
left a comment
LGTM modulo a comment. This is a nice improvement to the understandability of config update failure!
/wait
source/common/upstream/eds.cc
Outdated
| void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason, | ||
| const EnvoyException* e) { | ||
| // We should not call onPreInitComplete if this method called because of stream disconnection. | ||
| if (e == nullptr) { |
Why not use the ConfigUpdateFailureReason here rather than the indirect e == nullptr? This seems to be one of the few places (and the motivating example) where there is a benefit from plumbing this reason into the subscriber callbacks, but we're not using it.
Great Catch @htuch. I meant to change this and forgot. Thanks. Changed now, PTAL.
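The point of this thread can be sketched roughly as follows: branch on an explicit failure reason instead of inferring it from a null exception pointer. `ConnectionFailure` and `UpdateRejected` appear in the diff excerpts in this conversation; the `FetchTimedOut` enumerator, the `InitTracker` type, and the `shouldFinishInit` helper are assumptions made for illustration and are not the PR's actual code:

```cpp
#include <cassert>

// ConnectionFailure and UpdateRejected are named in the PR diff; FetchTimedOut is
// an assumed name for the initial-fetch-timeout case.
enum class ConfigUpdateFailureReason { ConnectionFailure, FetchTimedOut, UpdateRejected };

// Hypothetical helper: a stream disconnection must not complete initialization,
// while a timeout or a rejected update lets startup continue.
bool shouldFinishInit(ConfigUpdateFailureReason reason) {
  switch (reason) {
  case ConfigUpdateFailureReason::ConnectionFailure:
    return false;
  case ConfigUpdateFailureReason::FetchTimedOut:
  case ConfigUpdateFailureReason::UpdateRejected:
    return true;
  }
  // Not reached for valid enumerators; keeps compilers that warn about missing returns quiet.
  return true;
}

// Hypothetical stand-in for the cluster's init state.
struct InitTracker {
  bool pre_init_complete = false;

  void onConfigUpdateFailed(ConfigUpdateFailureReason reason) {
    if (shouldFinishInit(reason)) {
      pre_init_complete = true;
    }
  }
};
```

Listing every enumerator without a `default:` label lets the compiler flag a missed case if a new reason is added later.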
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
mattklein123
left a comment
Thanks this is a great improvement. A few small comments and will defer to @htuch for further review.
| } else { | ||
| ENVOY_LOG(warn, "Filesystem config update failure: {}", e.what()); | ||
| stats_.update_failure_.inc(); | ||
| callbacks_.onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason::ConnectionFailure, |
This is a bit of a strange error code to use here, but I understand why you did it. Maybe a small TODO/comment?
| break; | ||
| case Envoy::Config::ConfigUpdateFailureReason::UpdateRejected: | ||
| // We expect Envoy exception to be thrown when update is rejected. | ||
| ASSERT(e); |
| stats_.update_rejected_.inc(); | ||
| ENVOY_LOG(warn, "gRPC config for {} rejected: {}", type_url_, e->what()); | ||
| break; | ||
| default: |
this default case should not be needed.
| UNREFERENCED_PARAMETER(e); | ||
| void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason reason, | ||
| const EnvoyException*) { | ||
| // We should not call onPreInitComplete if this method is called because of stream disconnection. |
Can you add a comment here that this might hang init forever if the user has disabled the init timeout? Might be worth a TODO to warn in this case?
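One way the suggested TODO could look, as a sketch only. It assumes that a zero `initial_fetch_timeout` means the timeout is disabled; `initTimeoutWarning` is a hypothetical helper, and the message text is illustrative rather than taken from Envoy:

```cpp
#include <chrono>
#include <string>

// Hypothetical helper: returns a warning when the initial fetch timeout is
// disabled (assumed here to be a value of zero), and an empty string otherwise.
std::string initTimeoutWarning(std::chrono::milliseconds initial_fetch_timeout) {
  if (initial_fetch_timeout.count() == 0) {
    return "initial fetch timeout is disabled; initialization may wait forever "
           "if the management server stream keeps disconnecting";
  }
  return "";
}
```

A caller could log the returned message at warn level when it is non-empty.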
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
|
@mattklein123 @htuch addressed the feedback. PTAL. |
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
|
@htuch merged master. PTAL |
|
@htuch @mattklein123 can this be merged now that tests have passed? Otherwise it might run into master merge issues because of the number of files. |
Description: As part of #6151 we ensured that Envoy initialization would not finish until a named response arrives. I found that when Envoy sends an EDS request for a cluster and the management server is disconnected/reconnected, Envoy proceeds with initialization even if the named response is not sent. That is because stream disconnection triggers an onConfigUpdate callback, and we call onPreInitComplete when we get the onConfigUpdate callback. This PR ensures that we don't call onPreInitComplete on stream disconnects.
Risk Level: Low
Testing: Added a case to test this in the existing test case.
Docs Changes: N/A
Release Notes: N/A