…e_names_ (#8918)" This reverts commit 80aedc1. Revert "config: rename NewGrpcMuxImpl -> GrpcMuxImpl (#8919)" This reverts commit 6d50553. Revert "config: reinstate #8478 (unification of delta and SotW xDS), reverted by #8939 (#8974)" This reverts commit a37522c. Signed-off-by: Matt Klein <mklein@lyft.com>
|
Confirming this makes the stuck on startup problem go away. |
|
I am vaguely concerned this issue is somehow actually 80aedc1 since |
|
@rgs1 can you share logs of this happening? My first instinct is to look at #8350 (comment). I wonder if some gRPC xDS setups were unintentionally relying on the old behavior that we concluded was unnecessarily finicky: if a ClusterLoadAssignment arrives before any Cluster, the old behavior would reject that CLA update. In the new behavior, the update is not rejected. I wonder if there are servers that, seeing the first CLA get "accepted", assume Envoy has the CLA, and don't see a need to send it again even when Envoy sends another CLA request after actually getting the Cluster. The immediate symptom would be that Envoy would never load the CLA it needs. It would send the request, but never get the response (although you would be able to see that the response it needs had been sent earlier). |
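To make the hypothesis above concrete, here is a minimal toy model in Python. It is only a sketch of the suspected interaction, not Envoy code: `ToyClient`, `ToyServer`, and `run` are hypothetical names, and the model assumes that a CLA accepted before its Cluster exists is not retained by the client, and that the server never resends a CLA it believes was accepted.

```python
from dataclasses import dataclass, field


@dataclass
class ToyClient:
    """Hypothetical stand-in for an xDS client; not Envoy's implementation."""
    reject_early_cla: bool  # True models the old behavior, False the new one
    clusters: set = field(default_factory=set)
    endpoints: dict = field(default_factory=dict)

    def on_cluster(self, name):
        self.clusters.add(name)

    def on_cla(self, name, hosts):
        """Return True if the update is accepted, False if it is rejected."""
        if name not in self.clusters:
            # Assumption: an early CLA cannot be applied because there is no
            # Cluster to attach it to. Old behavior rejects it; new accepts it.
            return not self.reject_early_cla
        self.endpoints[name] = hosts
        return True


@dataclass
class ToyServer:
    """Hypothetical management server that never resends an accepted CLA."""
    hosts: dict
    accepted: set = field(default_factory=set)

    def serve_cla(self, client, name):
        if name in self.accepted:
            return False  # server assumes the client already has this CLA
        if client.on_cla(name, self.hosts[name]):
            self.accepted.add(name)
        return True


def run(reject_early_cla):
    """Return True if the client ends up with endpoints for its cluster."""
    client = ToyClient(reject_early_cla)
    server = ToyServer({"backend": ["10.0.0.1:80"]})
    server.serve_cla(client, "backend")  # CLA arrives before the Cluster
    client.on_cluster("backend")         # Cluster arrives
    server.serve_cla(client, "backend")  # client requests the CLA again
    return "backend" in client.endpoints
```

Under these assumptions, `run(True)` (old behavior) ends with the endpoints loaded, while `run(False)` (new behavior) leaves the client waiting for a CLA that the server never resends. Whether any real management server behaves this way is exactly the open question.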
|
@rgs1 ping; you're the only one I know of who saw a problem, so I need to see what you saw. |
Signed-off-by: Fred Douglas <fredlas@google.com>
|
No problem! Here is a ready-to-be-a-PR branch that reintroduces the reverted xDS unification stuff, and restores the corner case behavior present in the older state-of-the-world gRPC xDS. I expect that one extra change should either completely fix the problem, or else at least not change how the problem recurs. (The commit containing roughly exactly the code that broke for you is just a few commits back, 38a29ed ). https://github.com/fredlas/envoy/tree/UNR_unrevert_xds_unification |
|
Giving it a try now. |
|
@fredlas still happening with your branch; gets stuck in hot restart. Let me do a clean -- no process running -- start and see what the logs say... |
|
Not sure if I mentioned this before, so FWIW we get stuck in: |
Debugging issues like the one in envoyproxy#9044 are really hard without being able to see what's going on in the CM. Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
|
Mentioned this offline to @fredlas, but re-posting here for posterity: Is it possible that ordering of updates changed with this? For our consistently broken setup we found out that we can make it work, if I start Envoy with: a) removed the listener that has routes inlined with no RDS (there are 3 listeners in our problematic setup) and then: d) add list of original clusters and voila, Envoy is happy and (all) listeners are up. This fixes the issue both on startup (e.g.: do a/b/c and start and then do d/e/f after the first minimal config came up) or if we start broken (stuck) and then do a/b/c/d/e/f. |
Debugging issues like the one in #9044 are really hard without being able to see what's going on in the CM. Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
|
Chatted offline again with @fredlas; we found a workaround by fixing something in our spiffe infrastructure. Surely we could still consider this a regression, but since we have a workaround I am less inclined to block this any longer (and no one else has complained yet, so maybe we were the only ones hitting this). |
|
I'm really uncomfortable re-merging this until we understand the problem. Is there any way that we can sort that and then we can decide what to do? |
|
To avoid a lie-by-omission false claiming of credit, I'll point out that "we" means entirely people on his end; I did not help with finding that particular problem! To clarify a bit (although probably to really resolve this we should take it to Slack), Raul is saying (paraphrasing from our discussion) they originally had what they now consider to be a broken config. The change they made to get it working with the new code is not a workaround, and it's somewhat mysterious that the old code was able to work with the old broken config. |
|
Given that it was holidays, I would not assume that lack of complaints is a good indicator that this won't break someone. Could we document what exactly changed in the protocol somewhere, and ideally, codify the change in a test? |
|
I think you are seeing this revert in terms of there being a breakage in the code, which turned out not to be the case*. They may indeed have been the only ones to try the new version. However, it is not a case of "one attempt to use it broke, we fixed that, and are now going to re-merge it" (which would certainly be a good reason to slow down). Rather, it would be "this code is untested, other than one case where it worked correctly". Relative to being merged for the first time, there's actually a bit less cause for concern. *It sounds like the hang they saw was actually a desirable outcome of the new version's cleaner approach. @rgs1, are you willing/able to elaborate on that? Although it's spooky that the old code "worked" (in a case where it is better not to), I don't think the old code is worth further effort. |
|
What I worry about is that control planes are hoping for eventual consistency convergence, which requires that Envoy does not get stuck with some broken config in the meantime. So it matters what sort of config changed the behavior. E.g., do not expect every control plane to do everything right; they don't, and mostly rely on Envoy to sort itself out. |
|
I do share @kyessenov's concern, though per offline conversation I was convinced that this was enough of a reported edge case, and also broken on the management server side in a pretty specific way, that it might not be worth spending time digging into this more. So in a perfect world I would like more detail, but I'm not sure that we can gather the resources for the in-depth investigation required. Will defer to @rgs1 to decide. |
Revert "config: minor cleanup: remove DeltaSubscriptionState::resource_names_ (#8918)"
This reverts commit 80aedc1.
Revert "config: rename NewGrpcMuxImpl -> GrpcMuxImpl (#8919)"
This reverts commit 6d50553.
Revert "config: reinstate #8478 (unification of delta and SotW xDS), reverted by #8939 (#8974)"
This reverts commit a37522c.