
xds: revert delta/SoTW changes #9044

Merged: mattklein123 merged 1 commit into master from revert_xds on Nov 16, 2019

Conversation

@mattklein123
Member

Revert "config: minor cleanup: remove DeltaSubscriptionState::resource_names_ (#8918)"
This reverts commit 80aedc1.

Revert "config: rename NewGrpcMuxImpl -> GrpcMuxImpl (#8919)"
This reverts commit 6d50553.

Revert "config: reinstate #8478 (unification of delta and SotW xDS), reverted by #8939 (#8974)"
This reverts commit a37522c.


Signed-off-by: Matt Klein <mklein@lyft.com>
@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.


Caused by: #9044 was opened by mattklein123.


@mattklein123
Member Author

See #8974 (comment)

cc @rgs1 @fredlas @wgallagher @lizan @htuch

@rgs1
Member

rgs1 commented Nov 16, 2019

Confirming that this makes the stuck-on-startup problem go away.

@mattklein123
Member Author

I am vaguely concerned that this issue is actually caused by 80aedc1, since resource_names_ wasn't deleted (cc @wgallagher, I missed this in code review), but it looks benign to me AFAICT. I think we should just merge this, and then @fredlas @wgallagher, can you work with @rgs1 next week to debug? Thank you.

@fredlas
Contributor

fredlas commented Nov 18, 2019

@rgs1 can you share logs of this happening?

My first instinct is to look at #8350 (comment). I wonder if some gRPC xDS setups were unintentionally relying on the old behavior, which we concluded was unnecessarily finicky: if a ClusterLoadAssignment arrives before any Cluster, the old behavior would reject that CLA update. In the new behavior, the update is not rejected. I wonder if there are servers that, seeing the first CLA get "accepted", assume Envoy has the CLA and don't see a need to resend it, even when Envoy sends another CLA request after actually getting the Cluster.

The immediate symptom would be that Envoy never loads the CLA it needs. It would send the request but never get the response (although you would be able to see that the response it needs was sent earlier).
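The interaction hypothesized above can be shown with a toy simulation. To be clear, this is not Envoy or xDS code: the class, the drop-the-early-CLA detail, and the server's resend-only-on-NACK policy are all illustrative assumptions.

```python
# Toy simulation (not Envoy code) of the hypothesis above: a server that
# resends a ClusterLoadAssignment (CLA) only when it is NACKed can leave
# the client without the CLA once early CLAs are ACKed instead of rejected.

class ToyClient:
    def __init__(self, reject_early_cla):
        self.reject_early_cla = reject_early_cla  # True models the old behavior
        self.clusters = set()
        self.assignments = set()

    def on_cluster(self, name):
        self.clusters.add(name)

    def on_cla(self, name):
        if name not in self.clusters:
            if self.reject_early_cla:
                return "nack"        # old behavior: reject, prompting a resend
            return "ack"             # new behavior: accept (modeled here as a drop)
        self.assignments.add(name)
        return "ack"

def run(reject_early_cla):
    client = ToyClient(reject_early_cla)
    reply = client.on_cla("cluster_a")   # CLA arrives before its Cluster
    client.on_cluster("cluster_a")
    if reply == "nack":                  # this toy server resends only on NACK
        client.on_cla("cluster_a")
    return "cluster_a" in client.assignments

print(run(reject_early_cla=True))    # old behavior: True (CLA eventually loaded)
print(run(reject_early_cla=False))   # new behavior: False (stuck without the CLA)
```

Under these assumptions, the only thing that changes between the two runs is whether the early CLA is NACKed, which is exactly the behavior difference described above.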

@fredlas
Contributor

fredlas commented Nov 25, 2019

@rgs1 ping; you're the only one I know of who saw a problem, so I need to see what you saw.

@rgs1
Member

rgs1 commented Nov 25, 2019

@fredlas sorry for dropping the ball, I was carried away by 40970bd last week. Do you have a PR for reintroducing this that I can take for a test drive to fetch logs & get a repro?

fredlas added a commit to fredlas/envoy that referenced this pull request Nov 25, 2019
Signed-off-by: Fred Douglas <fredlas@google.com>
@fredlas
Contributor

fredlas commented Nov 25, 2019

No problem!

Here is a ready-to-be-a-PR branch that reintroduces the reverted xDS unification and restores the corner-case behavior present in the older state-of-the-world gRPC xDS. I expect that this one extra change will either completely fix the problem or at least leave it reproducing the same way. (The commit containing essentially the code that broke for you is just a few commits back: 38a29ed.)

https://github.com/fredlas/envoy/tree/UNR_unrevert_xds_unification

@rgs1
Member

rgs1 commented Nov 25, 2019

Giving it a try now.

@rgs1
Member

rgs1 commented Nov 25, 2019

@fredlas still happening with your branch; it gets stuck in hot restart. Let me do a clean start (no process running) and see what the logs say...

@rgs1
Member

rgs1 commented Nov 26, 2019

Not sure if I mentioned this before, so FWIW we get stuck in:

"state": "PRE_INITIALIZING",

rgs1 pushed a commit to rgs1/envoy that referenced this pull request Nov 27, 2019
Debugging issues like the one in envoyproxy#9044 is really hard without being
able to see what's going on in the CM.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@rgs1
Member

rgs1 commented Nov 27, 2019

Mentioned this offline to @fredlas, but re-posting here for posterity:

Is it possible that the ordering of updates changed with this? For our consistently broken setup, we found out that we can make it work if I start Envoy with:

a) remove the listener that has routes inlined with no RDS (there are 3 listeners in our problematic setup)
b) remove all clusters except one (in case that makes fetching secrets easier)
c) remove all routes that reference removed clusters

and then:

d) add the list of original clusters
e) add the list of original routes
f) add the original listeners

and voila, Envoy is happy and (all) listeners are up.

This fixes the issue both on startup (e.g. do a/b/c, start, and then do d/e/f after the first minimal config comes up) and if we start broken (stuck) and then do a/b/c/d/e/f.
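The order in steps (d)-(f), clusters first, then routes, then listeners, matches the general xDS guidance that a resource should only reference resources that already exist. A toy sketch of that dependency chain (not Envoy code; all names are illustrative):

```python
# Toy dependency check (not Envoy code): a listener is ready only when the
# route it references is ready, and a route only when its clusters exist.

def ready_listeners(clusters, routes, listeners):
    """Return listeners whose whole route/cluster dependency chain is satisfied."""
    ready_routes = {name for name, cluster_refs in routes.items()
                    if set(cluster_refs) <= set(clusters)}
    return {name for name, route_ref in listeners.items()
            if route_ref in ready_routes}

clusters = {"cluster_a"}
routes = {"route_a": ["cluster_a"], "route_b": ["cluster_missing"]}
listeners = {"listener_a": "route_a", "listener_b": "route_b"}
print(ready_listeners(clusters, routes, listeners))  # {'listener_a'}
```

In this sketch, pushing listeners before their routes and clusters leaves them unready, mirroring how updating in (d)/(e)/(f) order avoids dangling references.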

mattklein123 pushed a commit that referenced this pull request Nov 29, 2019
Debugging issues like the one in #9044 is really hard without being
able to see what's going on in the CM.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@rgs1
Member

rgs1 commented Dec 2, 2019

Chatted offline again with @fredlas; we found a workaround by fixing something in our SPIFFE infrastructure. We could arguably still consider this a regression, but since we have a workaround I am less inclined to block this any longer (and no one else has complained yet, so maybe we were the only ones hitting this).

@mattklein123
Member Author

I'm really uncomfortable re-merging this until we understand the problem. Is there any way that we can sort that and then we can decide what to do?

@fredlas
Contributor

fredlas commented Dec 2, 2019

To avoid falsely claiming credit by omission, I'll point out that "we" means entirely people on his end; I did not help find that particular problem!

To clarify a bit (although to really resolve this we should probably take it to Slack): Raul is saying (paraphrasing from our discussion) that they originally had what they now consider to be a broken config. The change they made to get it working with the new code is not a workaround, and it's somewhat mysterious that the old code was able to work with the old broken config.

@kyessenov
Contributor

Given that it was the holidays, I would not assume that a lack of complaints is a good indicator that this won't break someone. Could we document somewhere what exactly changed in the protocol, and ideally codify the change in a test?

@fredlas
Contributor

fredlas commented Dec 2, 2019

I think you are seeing this revert as implying a breakage in the code, which turned out not to be the case*. They may indeed have been the only ones to try the new version. However, it is not a case of "one attempt to use it broke, we fixed that, and are now going to re-merge it" (which would certainly be a good reason to slow down). Rather, it is "this code is untested, other than one case where it worked correctly". Relative to when it was first merged, there's actually a bit less cause for concern.

*It sounds like the hang they saw was actually a desirable outcome of the new version's cleaner approach. @rgs1, are you willing/able to elaborate on that? Although it's spooky that the old code "worked" (in a case where it is better not to), I don't think the old code is worth further effort.

@kyessenov
Contributor

What I worry about is that control planes are hoping for eventual-consistency convergence, which requires that Envoy not get stuck with some broken config in the meantime. So it matters what sort of config changed the behavior. We should not expect every control plane to do everything right; they don't, and they mostly rely on Envoy to sort itself out.
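The "rely on Envoy to sort itself out" model can be sketched as a toy fixpoint: the control plane just pushes its full state-of-the-world snapshot, and the client keeps accepting whatever resources have their dependencies satisfied until nothing changes. This is not Envoy code; the dependency map and names are illustrative assumptions.

```python
# Toy model (not Envoy code): a client that re-examines a full snapshot until
# a fixpoint converges even if the snapshot arrives in a worst-case order.

def apply_snapshot(accepted, snapshot, deps):
    """Accept resources whose dependencies are already accepted (or absent)."""
    changed = True
    while changed:                      # accepting one resource may unlock others
        changed = False
        for name in snapshot:
            if name not in accepted and deps.get(name, set()) <= accepted:
                accepted.add(name)
                changed = True
    return accepted

deps = {"route_a": {"cluster_a"}, "listener_a": {"route_a"}}
snapshot = ["listener_a", "route_a", "cluster_a"]   # worst-case ordering
accepted = apply_snapshot(set(), snapshot, deps)
print(accepted == {"cluster_a", "route_a", "listener_a"})  # True
```

The point of the sketch is only that convergence requires the client not to wedge itself on a partially-applicable snapshot, which is exactly the property being discussed.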

@mattklein123
Member Author

I do share @kyessenov's concern, though per an offline conversation I was convinced that this is enough of a reported edge case, and broken on the management server side in a specific enough way, that it might not be worth spending more time digging into it.

So in a perfect world I would like more detail, but I'm not sure we can gather the resources for the in-depth investigation required. I will defer to @rgs1 to decide.
