
xds: revert delta/SoTW changes #9044

Merged: mattklein123 merged 1 commit into master from revert_xds on Nov 16, 2019

Conversation

@mattklein123
Member

Revert "config: minor cleanup: remove DeltaSubscriptionState::resource_names_ (#8918)"
This reverts commit 80aedc1.

Revert "config: rename NewGrpcMuxImpl -> GrpcMuxImpl (#8919)"
This reverts commit 6d50553.

Revert "config: reinstate #8478 (unification of delta and SotW xDS), reverted by #8939 (#8974)"
This reverts commit a37522c.


Signed-off-by: Matt Klein <mklein@lyft.com>
@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.


Caused by: #9044 was opened by mattklein123.


@mattklein123
Member Author

See #8974 (comment)

cc @rgs1 @fredlas @wgallagher @lizan @htuch

@rgs1
Member

rgs1 commented Nov 16, 2019

Confirming that this makes the stuck-on-startup problem go away.

@mattklein123
Member Author

I am vaguely concerned that this issue is actually caused by 80aedc1, since resource_names_ wasn't deleted (cc @wgallagher, I missed this in code review), but it looks benign to me AFAICT. I think we should just merge this, and then @fredlas @wgallagher, can you work with @rgs1 next week to debug? Thank you.

@fredlas
Contributor

fredlas commented Nov 18, 2019

@rgs1 can you share logs of this happening?

My first instinct is to look at #8350 (comment). I wonder if some gRPC xDS setups were unintentionally relying on the old behavior, which we concluded was unnecessarily finicky: if a ClusterLoadAssignment arrives before any Cluster, the old behavior would reject that CLA update. In the new behavior, the update is not rejected. I wonder if there are servers that, seeing the first CLA get "accepted", assume Envoy has the CLA and don't see a need to resend it, even when Envoy sends another CLA request after actually getting the Cluster.

The immediate symptom would be that Envoy never loads the CLA it needs. It would send the request but never get the response (although you would be able to see that the response it needs was sent earlier).
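The interaction hypothesized above can be shown with a toy simulation. To be clear, this is not Envoy or xDS code: the class, the drop-the-early-CLA detail, and the server's resend-only-on-NACK policy are all illustrative assumptions.

```python
# Toy simulation (not Envoy code) of the hypothesis above: a server that
# resends a ClusterLoadAssignment (CLA) only when it is NACKed can leave
# the client without the CLA once early CLAs are ACKed instead of rejected.

class ToyClient:
    def __init__(self, reject_early_cla):
        self.reject_early_cla = reject_early_cla  # True models the old behavior
        self.clusters = set()
        self.assignments = set()

    def on_cluster(self, name):
        self.clusters.add(name)

    def on_cla(self, name):
        if name not in self.clusters:
            if self.reject_early_cla:
                return "nack"        # old behavior: reject, prompting a resend
            return "ack"             # new behavior: accept (modeled here as a drop)
        self.assignments.add(name)
        return "ack"

def run(reject_early_cla):
    client = ToyClient(reject_early_cla)
    reply = client.on_cla("cluster_a")   # CLA arrives before its Cluster
    client.on_cluster("cluster_a")
    if reply == "nack":                  # this toy server resends only on NACK
        client.on_cla("cluster_a")
    return "cluster_a" in client.assignments

print(run(reject_early_cla=True))    # old behavior: True (CLA eventually loaded)
print(run(reject_early_cla=False))   # new behavior: False (stuck without the CLA)
```

Under these assumptions, the only thing that changes between the two runs is whether the early CLA is NACKed, which is exactly the behavior difference described above.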

@fredlas
Contributor

fredlas commented Nov 25, 2019

@rgs1 ping; you're the only one I know of who saw a problem, so I need to see what you saw.

@rgs1
Member

rgs1 commented Nov 25, 2019

@fredlas sorry for dropping the ball, I was carried away by 40970bd last week. Do you have a PR for reintroducing this that I can take for a test drive to fetch logs & get a repro?

fredlas added a commit to fredlas/envoy that referenced this pull request Nov 25, 2019
Signed-off-by: Fred Douglas <fredlas@google.com>
@fredlas
Contributor

fredlas commented Nov 25, 2019

No problem!

Here is a ready-to-be-a-PR branch that reintroduces the reverted xDS unification and restores the corner-case behavior present in the older state-of-the-world gRPC xDS. I expect that this one extra change will either completely fix the problem or at least leave it reproducing the same way. (The commit containing essentially the code that broke for you is just a few commits back: 38a29ed.)

https://github.com/fredlas/envoy/tree/UNR_unrevert_xds_unification

@rgs1
Member

rgs1 commented Nov 25, 2019

Giving it a try now.

@rgs1
Member

rgs1 commented Nov 25, 2019

@fredlas still happening with your branch; it gets stuck in hot restart. Let me do a clean start (no process running) and see what the logs say...

@rgs1
Member

rgs1 commented Nov 26, 2019

Not sure if I mentioned this before, so FWIW we get stuck in:

"state": "PRE_INITIALIZING",

rgs1 pushed a commit to rgs1/envoy that referenced this pull request Nov 27, 2019
Debugging issues like the one in envoyproxy#9044 is really hard without being
able to see what's going on in the CM.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@rgs1
Member

rgs1 commented Nov 27, 2019

Mentioned this offline to @fredlas, but re-posting here for posterity:

Is it possible that the ordering of updates changed with this? For our consistently broken setup, we found out that we can make it work if I start Envoy with:

a) remove the listener that has routes inlined with no RDS (there are 3 listeners in our problematic setup)
b) remove all clusters except one (in case that makes fetching secrets easier)
c) remove all routes that reference removed clusters

and then:

d) add the list of original clusters
e) add the list of original routes
f) add the original listeners

and voila, Envoy is happy and (all) listeners are up.

This fixes the issue both on startup (e.g. do a/b/c, start, and then do d/e/f after the first minimal config comes up) and if we start broken (stuck) and then do a/b/c/d/e/f.
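The order in steps (d)-(f), clusters first, then routes, then listeners, matches the general xDS guidance that a resource should only reference resources that already exist. A toy sketch of that dependency chain (not Envoy code; all names are illustrative):

```python
# Toy dependency check (not Envoy code): a listener is ready only when the
# route it references is ready, and a route only when its clusters exist.

def ready_listeners(clusters, routes, listeners):
    """Return listeners whose whole route/cluster dependency chain is satisfied."""
    ready_routes = {name for name, cluster_refs in routes.items()
                    if set(cluster_refs) <= set(clusters)}
    return {name for name, route_ref in listeners.items()
            if route_ref in ready_routes}

clusters = {"cluster_a"}
routes = {"route_a": ["cluster_a"], "route_b": ["cluster_missing"]}
listeners = {"listener_a": "route_a", "listener_b": "route_b"}
print(ready_listeners(clusters, routes, listeners))  # {'listener_a'}
```

In this sketch, pushing listeners before their routes and clusters leaves them unready, mirroring how updating in (d)/(e)/(f) order avoids dangling references.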

mattklein123 pushed a commit that referenced this pull request Nov 29, 2019
Debugging issues like the one in #9044 is really hard without being
able to see what's going on in the CM.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@rgs1
Member

rgs1 commented Dec 2, 2019

Chatted offline again with @fredlas; we found a workaround by fixing something in our SPIFFE infrastructure. We could arguably still consider this a regression, but since we have a workaround I am less inclined to block this any longer (and no one else has complained yet, so maybe we were the only ones hitting this).

@mattklein123
Member Author

I'm really uncomfortable re-merging this until we understand the problem. Is there any way that we can sort that and then we can decide what to do?

@fredlas
Contributor

fredlas commented Dec 2, 2019

To avoid falsely claiming credit by omission, I'll point out that "we" means entirely people on his end; I did not help find that particular problem!

To clarify a bit (although to really resolve this we should probably take it to Slack): Raul is saying (paraphrasing from our discussion) that they originally had what they now consider to be a broken config. The change they made to get it working with the new code is not a workaround, and it's somewhat mysterious that the old code was able to work with the old broken config.

@kyessenov
Contributor

Given that it was the holidays, I would not assume that a lack of complaints is a good indicator that this won't break someone. Could we document somewhere what exactly changed in the protocol, and ideally codify the change in a test?

@fredlas
Contributor

fredlas commented Dec 2, 2019

I think you are seeing this revert as implying a breakage in the code, which turned out not to be the case*. They may indeed have been the only ones to try the new version. However, it is not a case of "one attempt to use it broke, we fixed that, and are now going to re-merge it" (which would certainly be a good reason to slow down). Rather, it is "this code is untested, other than one case where it worked correctly". Relative to when it was first merged, there's actually a bit less cause for concern.

*It sounds like the hang they saw was actually a desirable outcome of the new version's cleaner approach. @rgs1, are you willing/able to elaborate on that? Although it's spooky that the old code "worked" (in a case where it is better not to), I don't think the old code is worth further effort.

@kyessenov
Contributor

What I worry about is that control planes are hoping for eventual-consistency convergence, which requires that Envoy not get stuck with some broken config in the meantime. So it matters what sort of config changed the behavior. We should not expect every control plane to do everything right; they don't, and they mostly rely on Envoy to sort itself out.
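The "rely on Envoy to sort itself out" model can be sketched as a toy fixpoint: the control plane just pushes its full state-of-the-world snapshot, and the client keeps accepting whatever resources have their dependencies satisfied until nothing changes. This is not Envoy code; the dependency map and names are illustrative assumptions.

```python
# Toy model (not Envoy code): a client that re-examines a full snapshot until
# a fixpoint converges even if the snapshot arrives in a worst-case order.

def apply_snapshot(accepted, snapshot, deps):
    """Accept resources whose dependencies are already accepted (or absent)."""
    changed = True
    while changed:                      # accepting one resource may unlock others
        changed = False
        for name in snapshot:
            if name not in accepted and deps.get(name, set()) <= accepted:
                accepted.add(name)
                changed = True
    return accepted

deps = {"route_a": {"cluster_a"}, "listener_a": {"route_a"}}
snapshot = ["listener_a", "route_a", "cluster_a"]   # worst-case ordering
accepted = apply_snapshot(set(), snapshot, deps)
print(accepted == {"cluster_a", "route_a", "listener_a"})  # True
```

The point of the sketch is only that convergence requires the client not to wedge itself on a partially-applicable snapshot, which is exactly the property being discussed.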

@mattklein123
Member Author

I do share @kyessenov's concern, though per an offline conversation I was convinced that this is enough of a reported edge case, and broken on the management server side in a specific enough way, that it might not be worth spending more time digging into it.

So in a perfect world I would like more detail, but I'm not sure we can gather the resources for the in-depth investigation required. I will defer to @rgs1 to decide.
