config: do not finish initialization on stream disconnection by ramaraochavali · Pull Request #7427 · envoyproxy/envoy

ramaraochavali · 2019-06-30T07:02:39Z

Description: As part of #6151 we ensured that Envoy initialization would not finish until a named response arrives. I found that when Envoy sends an EDS request for a cluster and the management server is disconnected/reconnected, Envoy proceeds with initialization even if the named response has not been sent. That is because stream disconnection triggers an onConfigUpdate callback, and we call onPreInitComplete when we get that callback. This PR ensures that we don't call onPreInitComplete on stream disconnects.
Risk Level: Low
Testing: Added a case to test this in the existing test case.
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

ramaraochavali · 2019-06-30T10:22:50Z

/retest

repokitteh-read-only · 2019-06-30T10:22:54Z

🔨 rebuilding ci/circleci: asan (failed build)

🐱

Caused by: a #7427 (comment) was created by @ramaraochavali.

see: more, trace.

ramaraochavali · 2019-06-30T13:22:57Z

@mattklein123 @htuch Please see if this makes sense

mattklein123 · 2019-07-01T01:16:43Z

IMO this is not safe, since a bad management server can cause Envoy to hang. It seems like maybe you want some kind of feature where you can have some number of retries during init before you give up?

ramaraochavali · 2019-07-01T08:28:20Z

@mattklein123 But the problem is that an accidental network disconnection can wipe out the cluster members. Also, I thought initial_fetch_timeout is meant to do that safety handling, i.e. when no response comes within the configured time, it just moves ahead with the initialization.

But looking at the initial fetch timeout timer callback implementation, unfortunately it is just passing nullptr assuming that onConfigUpdate would move forward with init which would be an issue with this change.

envoy/source/common/config/grpc_mux_subscription_impl.cc

Line 26 in ae02dc6

callbacks_.onConfigUpdateFailed(nullptr);

Should we pass a valid exception like TimeoutException here instead of nullptr - so that initial_fetch_timeout can be used as a safety belt here ?

mattklein123 · 2019-07-01T13:26:02Z

@ramaraochavali my general feeling is that we need to have sane defaults. Personally, I would be in favor of making a default initial fetch timeout if none is specified. Perhaps 15s or so? If we have a default in place, as well as good documentation around this, I would be more inclined to do a change like this. @htuch WDYT?

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

stale · 2019-07-08T14:14:08Z

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

ramaraochavali · 2019-07-09T04:27:20Z

@mattklein123 I agree we should have sane defaults for initial_fetch_timeout. There is an inconsistency in the initialization behaviour when initial_fetch_timeout is not configured: if the management server is connected but does not send any response, the initialization process waits (for however long it takes), whereas if there is an accidental network disconnection it just finishes the initialization process. So I think it is a related but slightly different problem.

However, I see your point about having some sane defaults for initial_fetch_timeout so that we are guaranteed not to hang in any case.

So should I first push another PR to just change the initial_fetch_timeout to have default of 15s?

mattklein123 · 2019-07-09T14:04:07Z

> So should I first push another PR to just change the initial_fetch_timeout to have default of 15s?

IMO yes. @htuch any thoughts here?

htuch · 2019-07-09T18:53:21Z

What is the underlying temporal property we want here? Is it something like "When an Envoy is initialized, it will have the complete state from the management server of all its APIs" or more like "An Envoy is always guaranteed to initialize within a bounded period of time, with a best effort made to obtain the complete set of xDS configuration within that subject to the management server availability"?

Do we want to support both models? Either way, any PR should probably elaborate on the xDS docs and explain what the key properties we're shooting for here.

ramaraochavali · 2019-07-10T06:04:32Z

I think @mattklein123 is leaning towards "An Envoy is always guaranteed to initialize within a bounded period of time, with a best effort made to obtain the complete set of xDS configuration within that subject to the management server availability", which seems reasonable to me. Is that correct @mattklein123?

mattklein123 · 2019-07-10T15:27:58Z

Yeah that is my thinking.

htuch · 2019-07-10T21:57:52Z

That's fair, I think this needs to be documented in the server initialization docs, since this is a key principle we need to respect going forward.
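
The bounded-initialization property agreed on here can be sketched as a small decision function. This is only an illustrative model, not Envoy code: `InitOutcome`, `InitState`, and `evaluateInit` are hypothetical names, and treating a timeout of 0 as "disabled" is an assumption made for the sketch.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical model of init progress; these names are illustrative, not Envoy's.
enum class InitOutcome { ConfigReceived, TimedOut, StillWaiting };

struct InitState {
  std::chrono::milliseconds initial_fetch_timeout; // 0 is assumed to mean "no timeout"
  bool config_received;
};

// Decides whether initialization may proceed after `elapsed` time has passed.
InitOutcome evaluateInit(const InitState& state, std::chrono::milliseconds elapsed) {
  assert(elapsed.count() >= 0);
  if (state.config_received) {
    return InitOutcome::ConfigReceived;
  }
  const bool timeout_enabled = state.initial_fetch_timeout.count() > 0;
  if (timeout_enabled && elapsed >= state.initial_fetch_timeout) {
    return InitOutcome::TimedOut; // best effort exhausted: move on without full config
  }
  return InitOutcome::StillWaiting; // unbounded when the timeout is disabled
}
```

With a non-zero timeout the function always leaves `StillWaiting` eventually, which is the "bounded period of time" guarantee; with the timeout disabled it can stay in `StillWaiting` indefinitely.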

stale · 2019-07-17T22:13:46Z

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

ramaraochavali · 2019-07-18T04:47:57Z

This is waiting on this PR #7571

stale · 2019-07-30T09:05:25Z

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

mattklein123

Thanks for working on this. At a high level I think this change makes sense, but I would like to work on making the control flow more clear and less hacky. I'm going to also assign @htuch to review as he has worked on this code a lot more. Thank you!

/wait

mattklein123 · 2019-07-30T23:15:28Z

source/common/config/grpc_mux_subscription_impl.cc

-      ENVOY_LOG(warn, "gRPC config: initial fetch timed out for {}", type_url_);
-      callbacks_.onConfigUpdateFailed(nullptr);
+      try {
+        throw EnvoyException("initial fetch timed out");


You should be able to allocate an EnvoyException on the stack and pass it into the update failed function.
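
A minimal sketch of that suggestion, assuming a stand-in `EnvoyException` that derives from `std::runtime_error` and a hypothetical `Callbacks` type with the single-argument `onConfigUpdateFailed` shape used at this point in the PR. It is not the code that was merged; it only shows that no try/throw/catch is needed to hand an exception object to the callback.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for Envoy's EnvoyException (assumed to derive from std::runtime_error).
class EnvoyException : public std::runtime_error {
public:
  explicit EnvoyException(const std::string& message) : std::runtime_error(message) {}
};

// Hypothetical callback receiver that records what it was told.
struct Callbacks {
  std::string last_error;
  void onConfigUpdateFailed(const EnvoyException* e) {
    last_error = (e != nullptr) ? e->what() : "(no exception)";
  }
};

// Timer callback: build the exception on the stack and pass its address.
void onInitialFetchTimeout(Callbacks& callbacks) {
  EnvoyException timeout("initial fetch timed out");
  callbacks.onConfigUpdateFailed(&timeout);
}

// Usage: the receiver sees a non-null exception carrying the timeout message.
void exampleTimeoutPassesException() {
  Callbacks callbacks;
  onInitialFetchTimeout(callbacks);
  assert(callbacks.last_error == "initial fetch timed out");
}
```

The pointer is only valid for the duration of the call, so a receiver that needs the message later has to copy it, as `Callbacks` does here.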

mattklein123 · 2019-07-30T23:17:11Z

source/common/config/grpc_mux_subscription_impl.cc

    stats_.update_failure_.inc();
    ENVOY_LOG(debug, "gRPC update for {} failed", type_url_);
  } else {
+    // fetch timeout should be disabled only when the actual timeout happens - not on network


This seems very fragile to me that we are using exceptions in this way to figure out a timeout vs. not, etc. Can you rework this code to make the control flow a lot more obvious? It's possible that you might need to rework how the update failed functions work.

Agree this is fragile. I did not try to change that because it triggers more changes because of how the update failed functions work today and this existed earlier as well. But it would be good to clean this up. Let me try.

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

…ction Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

ramaraochavali · 2019-07-31T14:24:23Z

@mattklein123 @htuch Added the ConfigUpdateFailureReason enum and changed the flow. PTAL. The envoy-macos pipeline failure seems unrelated to this PR. Can you trigger that build again?
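
For readers following the thread, the reworked flow can be sketched roughly as below. `ConnectionFailure` and `UpdateRejected` appear in the diff hunks quoted in this review; `FetchTimedout`, the `RecordingCallbacks` type, and the three `on...` producer functions are assumptions made for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for Envoy's EnvoyException (assumed to derive from std::runtime_error).
class EnvoyException : public std::runtime_error {
public:
  explicit EnvoyException(const std::string& message) : std::runtime_error(message) {}
};

// FetchTimedout is an assumed enumerator name; the other two appear in the diff.
enum class ConfigUpdateFailureReason { ConnectionFailure, FetchTimedout, UpdateRejected };

// Hypothetical receiver that records every failure it is notified about.
struct RecordingCallbacks {
  std::vector<ConfigUpdateFailureReason> reasons;
  void onConfigUpdateFailed(ConfigUpdateFailureReason reason, const EnvoyException* e) {
    // A rejected update is expected to carry the exception that rejected it.
    assert(reason != ConfigUpdateFailureReason::UpdateRejected || e != nullptr);
    reasons.push_back(reason);
  }
};

// Each failure source states its reason explicitly instead of encoding it in `e`.
void onStreamDisconnected(RecordingCallbacks& callbacks) {
  callbacks.onConfigUpdateFailed(ConfigUpdateFailureReason::ConnectionFailure, nullptr);
}

void onInitialFetchTimeout(RecordingCallbacks& callbacks) {
  callbacks.onConfigUpdateFailed(ConfigUpdateFailureReason::FetchTimedout, nullptr);
}

void onUpdateRejected(RecordingCallbacks& callbacks, const EnvoyException& e) {
  callbacks.onConfigUpdateFailed(ConfigUpdateFailureReason::UpdateRejected, &e);
}
```

The point of the design is that a null exception pointer no longer has to double as a signal for "disconnect" or "timeout".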

htuch

LGTM modulo a comment. This is a nice improvement to the understandability of config update failure!
/wait

htuch · 2019-07-31T14:36:27Z

source/common/upstream/eds.cc

+void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason,
+                                          const EnvoyException* e) {
+  //  We should not call onPreInitComplete if this method called because of stream disconnection.
+  if (e == nullptr) {


Why not use the ConfigUpdateFailureReason here rather than the indirect e == nullptr? This seems to be one of the few places (and the motivating example) where there is a benefit from plumbing this reason into the subscriber callbacks, but we're not using it.

Great Catch @htuch. I meant to change this and forgot. Thanks. Changed now, PTAL.
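
A rough model of the reason-based check, for illustration only. `EdsClusterModel` is a hypothetical stand-in for `EdsClusterImpl`, `FetchTimedout` is an assumed enumerator name, and the real class has far more state than the single flag used here.

```cpp
#include <cassert>

class EnvoyException; // only used through a pointer in this sketch

// FetchTimedout is an assumed enumerator name; the other two appear in the diff.
enum class ConfigUpdateFailureReason { ConnectionFailure, FetchTimedout, UpdateRejected };

// Hypothetical stand-in for the EDS cluster's init handling.
class EdsClusterModel {
public:
  void onConfigUpdateFailed(ConfigUpdateFailureReason reason, const EnvoyException*) {
    // A stream disconnect must not finish initialization; other failures may.
    if (reason != ConfigUpdateFailureReason::ConnectionFailure) {
      onPreInitComplete();
    }
  }
  bool preInitComplete() const { return pre_init_complete_; }

private:
  void onPreInitComplete() { pre_init_complete_ = true; }
  bool pre_init_complete_ = false;
};

// Usage: a disconnect leaves init pending, a later fetch timeout completes it.
void exampleDisconnectThenTimeout() {
  EdsClusterModel cluster;
  cluster.onConfigUpdateFailed(ConfigUpdateFailureReason::ConnectionFailure, nullptr);
  assert(!cluster.preInitComplete());
  cluster.onConfigUpdateFailed(ConfigUpdateFailureReason::FetchTimedout, nullptr);
  assert(cluster.preInitComplete());
}
```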

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

mattklein123

Thanks this is a great improvement. A few small comments and will defer to @htuch for further review.

mattklein123 · 2019-07-31T19:58:13Z

source/common/config/filesystem_subscription_impl.cc

    } else {
      ENVOY_LOG(warn, "Filesystem config update failure: {}", e.what());
      stats_.update_failure_.inc();
+      callbacks_.onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason::ConnectionFailure,


This is a bit of a strange error code to use here, but I understand why you did it. Maybe a small TODO/comment?

mattklein123 · 2019-07-31T19:58:53Z

source/common/config/grpc_mux_subscription_impl.cc

+    break;
+  case Envoy::Config::ConfigUpdateFailureReason::UpdateRejected:
+    // We expect Envoy exception to be thrown when update is rejected.
+    ASSERT(e);


nit: e != nullptr

mattklein123 · 2019-07-31T19:59:20Z

source/common/config/grpc_mux_subscription_impl.cc

    stats_.update_rejected_.inc();
    ENVOY_LOG(warn, "gRPC config for {} rejected: {}", type_url_, e->what());
+    break;
+  default:


this default case should not be needed.
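
One common reason to drop a `default:` label in a switch over an enum is that GCC and Clang can then warn (via `-Wswitch`, part of `-Wall`) when a newly added enumerator is left unhandled. A small sketch of that pattern, using a hypothetical `reasonName` helper and an assumed `FetchTimedout` enumerator; neither is taken from the PR:

```cpp
#include <cassert>
#include <string>

// FetchTimedout is an assumed enumerator name; the other two appear in the diff.
enum class ConfigUpdateFailureReason { ConnectionFailure, FetchTimedout, UpdateRejected };

// Hypothetical helper: every enumerator has its own case and there is no
// `default`, so adding a new enumerator triggers a compiler warning here.
std::string reasonName(ConfigUpdateFailureReason reason) {
  switch (reason) {
  case ConfigUpdateFailureReason::ConnectionFailure:
    return "connection_failure";
  case ConfigUpdateFailureReason::FetchTimedout:
    return "fetch_timed_out";
  case ConfigUpdateFailureReason::UpdateRejected:
    return "update_rejected";
  }
  // Only reachable if an out-of-range value was cast into the enum.
  assert(false && "unhandled ConfigUpdateFailureReason");
  return "unknown";
}
```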

mattklein123 · 2019-07-31T20:00:28Z

source/common/upstream/eds.cc

-  UNREFERENCED_PARAMETER(e);
+void EdsClusterImpl::onConfigUpdateFailed(Envoy::Config::ConfigUpdateFailureReason reason,
+                                          const EnvoyException*) {
+  //  We should not call onPreInitComplete if this method is called because of stream disconnection.


Can you add a comment here that this might hang init forever if the user has disabled the init timeout? Might be worth a TODO to warn in this case?
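
The warning suggested in that TODO could look something like the following. This is a sketch under assumptions: `initHangWarning` is a hypothetical helper, the message text is invented, and a timeout of 0 is assumed to mean the initial fetch timeout is disabled.

```cpp
#include <cassert>
#include <chrono>
#include <string>

// Hypothetical helper: returns a warning when a disabled initial fetch timeout
// means a silent or disconnected management server could block init forever.
// Returns an empty string when no warning is needed.
std::string initHangWarning(const std::string& cluster_name,
                            std::chrono::milliseconds initial_fetch_timeout) {
  assert(initial_fetch_timeout.count() >= 0);
  if (initial_fetch_timeout.count() == 0) {
    return "cluster '" + cluster_name +
           "': initial_fetch_timeout is disabled; initialization may wait forever "
           "if the management server never responds";
  }
  return std::string();
}
```

A caller would log the returned text at warning level when it is non-empty, for example once at cluster creation.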

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

ramaraochavali · 2019-08-01T03:17:54Z

@mattklein123 @htuch addressed the feedback. PTAL.

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

htuch

Thanks!

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

ramaraochavali · 2019-08-02T01:20:42Z

@htuch merged master. PTAL

ramaraochavali · 2019-08-02T03:23:56Z

@htuch @mattklein123 can this be merged now that the tests have passed? Otherwise it might run into master merge conflicts again because of the number of files touched.

ramaraochavali added 3 commits June 30, 2019 12:26

do not finish initialization on diconnect

5db975e

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

Merge branch 'master' into fix/init_disconnection

3258141

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

fix format

db1f94c

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

mattklein123 self-assigned this Jul 1, 2019

mattklein123 added the waiting label Jul 1, 2019

ramaraochavali added 3 commits July 5, 2019 16:11

move timer call

25f5eb5

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

Merge branch 'master' into fix/init_disconnection

5b2b4eb

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

format

b4b5c1a

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

stale bot added the stale label Jul 8, 2019

stale bot removed the stale label Jul 9, 2019

ramaraochavali mentioned this pull request Jul 14, 2019

config: change default initial fetch timeout to 15s #7571

Merged

stale bot removed the stale label Jul 18, 2019

repokitteh-read-only bot removed the waiting label Jul 23, 2019

mattklein123 added the waiting label Jul 23, 2019

mattklein123 requested changes Jul 30, 2019

repokitteh-read-only bot added the waiting label Jul 30, 2019

mattklein123 assigned htuch Jul 30, 2019

add config update failure reason

ffcdd62

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

repokitteh-read-only bot removed the waiting label Jul 31, 2019

ramaraochavali added 6 commits July 31, 2019 14:49

fix missed compilation

882b090

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

fix format

585fc14

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

fix test

86f881d

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

fix filesubscription test

3c5a0aa

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

Merge remote-tracking branch 'upstream/master' into fix/init_disconne…

e1eb790

…ction Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

fix http subscription testt

1ef2350

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

htuch suggested changes Jul 31, 2019

repokitteh-read-only bot added the waiting label Jul 31, 2019

change to check for reason

42f711e

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

repokitteh-read-only bot removed the waiting label Jul 31, 2019

mattklein123 previously approved these changes Jul 31, 2019

mattklein123 removed their assignment Jul 31, 2019

address review comments

58f02d6

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

ramaraochavali dismissed mattklein123’s stale review via 58f02d6 August 1, 2019 03:17

correct spelling

e9800c5

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

htuch approved these changes Aug 1, 2019

resolve conflicts

91f61ea

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

htuch approved these changes Aug 2, 2019

mattklein123 merged commit 0957e9c into envoyproxy:master Aug 2, 2019

ramaraochavali deleted the fix/init_disconnection branch August 3, 2019 06:13

ramaraochavali mentioned this pull request Dec 4, 2019

xds: undo the revert (#9044) of delta+SotW unification #9189

Closed
