xds: undo the revert (#9044) of delta+SotW unification #9189
fredlas wants to merge 14 commits into envoyproxy:master
Conversation
Signed-off-by: Fred Douglas <fredlas@google.com>
…unification Signed-off-by: Fred Douglas <fredlas@google.com>
This reverts commit 30ce222. This restoration was to test my theory that 8350 was the cause of the hang. Now that we have established it wasn't, it will be good to have this cleaner, more sensible behavior back in. Signed-off-by: Fred Douglas <fredlas@google.com>
@rgs1 @fredlas one last question on this: @rgs1 do you have xDS init timeouts disabled in your environment? If you do not, and Envoy would still hang in some case, I think we do have to understand the original cause of the hang. If you have the init timeouts disabled for some reason, I can potentially accept this, though I'm uncomfortable with it. cc @wgallagher @kyessenov

/wait-any
@mattklein123 you mean
OK, that's what I was afraid of. Sorry, we can't merge this until we understand what is going on. Something has changed with regard to the init timeout handling. @rgs1 are you able to re-break your SDS implementation and potentially get more logs? Thank you hugely in advance.
Nice, that's a pretty solid lead. After a quick look, I notice one perhaps important difference in when the old/new code makes that call. I would expect the difference there to be that the old code gives up in the face of some failures, whereas the new code keeps trying. If the config setup was relying on that giving up, that might explain the hang?
Possibly, but we should still be bound by the 10s timeout. So perhaps there is some difference because of which we are not adhering to the timeout. cc @ramaraochavali who has looked at this a lot.
Hmm... yes, that is true, but it looks like it might be different if the particular error was a connection failure. I think what would happen in the new code, if an exception-throwing failure is involved, is that the failure handling actually differs. Now, when the init timeout fires, the result is an onConfigUpdateFailed() call.
I will let @ramaraochavali chime in, but I think the expected behavior is that we always time out after the timeout period, while allowing retries if there are connection errors. This must be where the functional change is, and this definitely needs to be fixed. It's great, though, that we are getting close to understanding the problem!
Yes. The original behavior was: Envoy initialization should always time out and proceed within the configured time, making a best effort to get the initial config during that window (and retries will continue in the background to get the config even after initialization is done). The relevant PRs that made this change are #7571 and #7427. Based on what is being discussed here, it is not finishing initialization after 10s, which seems to be the regression. I haven't looked at the new code to see what the problem is, though.
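The contract described above can be sketched as a toy state machine (all names here are illustrative, not Envoy's actual classes): fetch failures only count as retries, while either a usable config or the fetch timeout finishes initialization.

```cpp
#include <cassert>
#include <chrono>

// Toy model of the init contract discussed above: init always finishes
// within the configured fetch timeout, and failed fetch attempts merely
// schedule retries that keep going in the background. Purely a sketch.
class InitFetchState {
public:
  explicit InitFetchState(std::chrono::milliseconds timeout) : timeout_(timeout) {}

  // Called when a usable initial config arrives.
  void onInitialConfig() { init_done_ = true; }

  // Called by the dispatcher when `timeout_` elapses with no config yet.
  void onFetchTimeout() {
    init_done_ = true;
    timed_out_ = true;
  }

  // A failed fetch attempt does NOT finish init; it just means a retry.
  void onFetchFailure() { ++retries_; }

  bool initDone() const { return init_done_; }
  bool timedOut() const { return timed_out_; }
  int retries() const { return retries_; }
  std::chrono::milliseconds timeout() const { return timeout_; }

private:
  std::chrono::milliseconds timeout_;
  bool init_done_ = false;
  bool timed_out_ = false;
  int retries_ = 0;
};
```

Under this model, no number of failures can block init past the timeout; the regression under discussion is that something is preventing the timeout path from running.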
It's not clear from Event::Timer's comments whether that timer is a one-time thing, or a "fire repeatedly until disabled" thing. But, if the latter, then there would be an ongoing stream of onConfigUpdateFailed() calls, forever. Would that be an issue for init completing? If so, then the reordering of the two lines in the new code could be the cause.
…ing behavior, which treats any news about the fetch as precluding a timeout, as opposed to only wanting a successful one. Signed-off-by: Fred Douglas <fredlas@google.com>
BTW, you can check the init_fetch_timeout stat on the setup where it is initializing successfully, to see whether the initialization finished because of the timeout (10s) with the old code. AFAIK, the timer is a one-time thing.
Ok, added some more logging; @rgs1 please give it a run! Even if it doesn't work, the logs will hopefully be enough to point me to the real problem...
@rgs1 friendly ping on this. Did you have a chance to test this out? Thank you!
In progress - I merged just now so that the thing being tested would include the recent security patches. |
Sorry, getting to this now... should have something to share by tomorrow.
@mattklein123 so I just dropped @fredlas' branch on top of master as of yesterday, and got two hosts stuck on init... And what's interesting is, they are not even running the "problematic" spiffe config that triggered this in the first place. I'll get some log entries and follow up offline with @fredlas.
Awesome, thank you!
So, it seems we are stuck at a loop there. From our logs, we never leave that loop (timestamps removed for readability, but it loops every ~2 secs):
So we have two possibilities for exiting that loop. Let me check what's going on in those...
I think (and this looks confirmed by other lines in the log showing messages from the server) that it is not an infinite loop, but rather repeated visits to that loop. The server is sending a new DiscoveryResponse every ~2 seconds, prompting those bursts of activity.
Quick update: I can't repro the previous loop anymore. On the other hand, although it's still failing to load listeners/clusters, the fetch timeout seems to be working, judging by what I am getting in the logs. And server state gets to what I think we believe was the original behavior that regressed?
And... ignore what I just said. I moved the repro to the wrong env, where configs were missing, thus the above error... Trying again.
Ok, sorry for the above confusion when my repro got mixed up with other ongoing experiments. This is what we have with the log level turned up: we are accepting the same ClusterLoadAssignment version again and again... @mattklein123 @ramaraochavali would that cause init to be stuck?
IMO it should not be stuck here and should continue one way or the other, so there is some bug, but TBH I'm pretty far removed from the xDS client these days. cc @htuch @fredlas @wgallagher ^
Interesting, I think it is because the init fetch timer is disabled on every config update here: https://github.com/envoyproxy/envoy/blob/master/source/common/config/grpc_mux_subscription_impl.cc#L52, so the timeout will not help. The timeout is also not intended for this case. But, once ClusterLoadAssignments are received for every EDS cluster, cluster initialization should move forward to the next stage if all clusters have endpoints, or will time out if endpoints for some clusters were missing.
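A minimal sketch of that hypothesis (toy classes, not Envoy's actual Event::Timer or subscription code): because the one-shot init fetch timer is disabled on every config update, a steady stream of updates that never complete init also never lets the timeout fire.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Toy one-shot timer: armed by enable(), disarmed by disable(), and it
// disarms itself before running its callback when it fires.
struct OneShotTimer {
  std::function<void()> cb_;
  bool armed_ = false;

  void enable(std::function<void()> cb) {
    cb_ = std::move(cb);
    armed_ = true;
  }
  void disable() { armed_ = false; }
  // Simulates the dispatcher firing the timer after the deadline.
  void fire() {
    if (armed_) {
      armed_ = false; // one-shot
      cb_();
    }
  }
};

// Toy subscription mirroring the behavior described above: *any* config
// update disables the init fetch timer, whether or not it completes init.
struct ToySubscription {
  OneShotTimer init_fetch_timer_;
  bool init_complete_ = false;

  void start() {
    // If the timer fires first, init proceeds anyway (the timeout path).
    init_fetch_timer_.enable([this] { init_complete_ = true; });
  }

  void onConfigUpdate(bool completes_init) {
    init_fetch_timer_.disable();
    if (completes_init) {
      init_complete_ = true;
    }
  }
};
```

With no updates, firing the timer completes init; but after even one update that fails to complete init (e.g. accepting the same version again), the timer is disarmed and init can stay stuck forever, which matches the observed hang.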
I think it's worth mentioning #9206, which is a problem seen when using the existing DELTA_GRPC (so, basically identical logic to this). Unlike there, our current case does see an LDS request sent, so it's not exactly the same problem. However, it feels pretty close, since it's about init getting stuck. The facts of that one point to something being wrong outside of the xDS client logic. That is, everything seen of the xDS client's behavior is correct (AFAICT), but Envoy is using GrpcSubscription in a way that gets it into trouble. (Or rather, not using it: the problem there is that Envoy never even starts an LDS subscription.)
@fredlas I think wherever the problem itself is (in the xDS client code or "outside" it in Envoy), the fact remains that there is a behavior change in this new patch that we must understand. If we don't feel that we can understand it in a reasonable amount of time, I think we need to start thinking about how to break it into smaller pieces. I think @wgallagher had some potential thoughts on how to do that.
I think that splitting up this change into smaller pieces is probably the best way to get it submitted. I don't have any specific ideas about where the split lines should be, but I would be more than happy to help if that's the route we decide we want to take.
It turns out the every-2-seconds update thing was a red herring; exactly the same happens with the working head-of-master code (and everything is fine there). It sounds like the only real problem is listeners not getting received. So, I think this problem is the same as #9206, or else very close.
Yeah, doing as much as possible in small pieces might be necessary. "Something about init" is not much to go on, and if nobody who is familiar with the init code narrows it down, it will be hard to make progress. Unfortunately, I think breaking it up is going to be awkward. The old gRPC ADS code needed to be rewritten because it wasn't amenable to a slightly different sibling version like delta, and it wasn't amenable because the logic had evolved to be interwoven in a single ball with itself, in places where the new code needs to be modular. I think at some point there will be fundamental mismatches. But, at least as a start, the big rewrite could be split up. The only other logic touching watches is the "maintain a set to avoid dupes" part of the mux code.
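For illustration, the dupe-avoidance idea could look something like this toy helper (the name and shape are assumptions, not Envoy's actual API): when multiple watches subscribe to overlapping resource names, the request sent to the server should list each name only once.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy sketch of "maintain a set to avoid dupes": merge the resource names
// from several watches into a single deduplicated, sorted list.
std::vector<std::string>
mergeWatchedNames(const std::vector<std::vector<std::string>>& watches) {
  std::set<std::string> unique_names; // set membership provides the dedupe
  for (const auto& watch : watches) {
    unique_names.insert(watch.begin(), watch.end());
  }
  return {unique_names.begin(), unique_names.end()};
}
```

Because std::set is ordered, the merged list also comes out sorted, which keeps the union stable across repeated requests.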
@fredlas I think @wgallagher has some thoughts on how to split this up. I'm going to close this out for now and @wgallagher will follow up with next steps. Thank you for helping to look into this issue. Really appreciate it.
Originally #8478. Also includes a bit of cleanup and renaming that happened after:
#8918 and #8919.