
xds: undo the revert (#9044) of delta+SotW unification #9189

Closed
fredlas wants to merge 14 commits into envoyproxy:master from fredlas:UNR_unrevert_xds_unification

Conversation

@fredlas
Contributor

fredlas commented Dec 2, 2019

Originally #8478. Also includes a bit of cleanup and renaming that happened after:
#8918 and #8919.

This reverts commit 30ce222.
This restoration was to test my theory that 8350 was the cause
of the hang. Now that we have established it wasn't, it will be
good to have this cleaner, more sensible behavior back in.

@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #9189 was opened by fredlas.


@mattklein123
Member

@rgs1 @fredlas one last question on this: @rgs1 do you have xDS init timeouts disabled in your environment? If you do not, and Envoy would still hang in some case, I think we do have to understand the original cause of the hang. If you have the init timeouts disabled for some reason I can potentially accept this though I'm uncomfortable. cc @wgallagher @kyessenov

/wait-any

@rgs1
Member

rgs1 commented Dec 3, 2019

> @rgs1 @fredlas one last question on this: @rgs1 do you have xDS init timeouts disabled in your environment? If you do not, and Envoy would still hang in some case, I think we do have to understand the original cause of the hang. If you have the init timeouts disabled for some reason I can potentially accept this though I'm uncomfortable. cc @wgallagher @kyessenov

@mattklein123 you mean initial_fetch_timeout (https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/core/config_source.proto#L151) right? If so, it's set to 10s.

@mattklein123
Member

> @mattklein123 you mean initial_fetch_timeout (https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/core/config_source.proto#L151) right? If so, it's set to 10s.

OK, that's what I was afraid of. Sorry, we can't merge this until we understand what is going on. Something has changed with regard to the init timeout handling. @rgs1 are you able to re-break your SDS implementation and potentially get more logs? Thank you hugely in advance.

@fredlas
Contributor Author

fredlas commented Dec 3, 2019

Nice, that's a pretty solid lead. After a quick look, I notice one perhaps important difference in when the old/new code calls disableInitFetchTimeoutTimer(). In onConfigUpdate(), i.e. handling an update and nothing has yet thrown an exception, there is a call through to the onConfigUpdate() of the subscription's callbacks_. The old code disables the timer before calling that onConfigUpdate(), the new code disables it after.

I would expect the difference there to be that the old code gives up in the face of some failures, whereas the new code keeps trying. If the config setup was relying on that giving up, that might explain the hang?
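The ordering difference described above can be sketched with a toy model (the names `FakeTimer`, `updateOldOrdering`, and `updateNewOrdering` are illustrative, not Envoy's API): if the callbacks' onConfigUpdate() throws, disabling the timer before the call leaves it disabled ("gives up"), while disabling after leaves it armed ("keeps trying").

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Toy stand-in for the init fetch timeout timer (hypothetical, not Envoy's Event::Timer).
struct FakeTimer {
  bool armed = true;
  void disable() { armed = false; }
};

// Old ordering: disable the timer, then deliver the update to the
// subscription's callbacks. If the callback throws, the timer is already
// disabled -- no further timeout fires, i.e. the code "gives up".
bool updateOldOrdering(FakeTimer& timer, const std::function<void()>& on_config_update) {
  timer.disable();
  try {
    on_config_update();
  } catch (const std::exception&) {
    return false;
  }
  return true;
}

// New ordering: deliver the update first. A throwing callback skips the
// disable, so the timer stays armed and the timeout path "keeps trying".
bool updateNewOrdering(FakeTimer& timer, const std::function<void()>& on_config_update) {
  try {
    on_config_update();
  } catch (const std::exception&) {
    return false;
  }
  timer.disable();
  return true;
}
```

Under this model, a config update that is rejected with an exception ends up with the timer disabled in the old ordering but still armed in the new one, which is the behavioral difference being discussed.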

@mattklein123
Member

> I would expect the difference there to be that the old code gives up in the face of some failures, whereas the new code keeps trying. If the config setup was relying on that giving up, that might explain the hang?

Possibly, but we should still be bound by the 10s timeout. So perhaps there is some way in which we are not adhering to the timeout. cc @ramaraochavali who has looked at this a lot.

@fredlas
Contributor Author

fredlas commented Dec 3, 2019

Hmm... yes, that is true, but it looks like it might be different if the particular error was a connection failure.

I think what would happen in the new code, if an exception-throwing failure is involved, is that the disableInitFetchTimeoutTimer() in onConfigUpdate() would never even get called, because the exception makes us jump out of the function first.

Now, when the init timeout fires, the result is an onConfigUpdateFailed(), which disables the timer in 2 of 3 cases. However, the ConfigUpdateFailureReason::ConnectionFailure case in there does not disable the timer (in both the old and new code).
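The case split described here can be modeled minimally (the enum values match Envoy's ConfigUpdateFailureReason; the struct is an illustrative stand-in, not the real subscription code):

```cpp
#include <cassert>

// The three failure reasons match Envoy's ConfigUpdateFailureReason; the
// timer handling below is only a sketch of the behavior described above.
enum class ConfigUpdateFailureReason { ConnectionFailure, FetchTimedout, UpdateRejected };

struct InitTimerSketch {
  bool armed = true;
  void onConfigUpdateFailed(ConfigUpdateFailureReason reason) {
    // 2 of 3 cases disable the init fetch timer...
    if (reason != ConfigUpdateFailureReason::ConnectionFailure) {
      armed = false;
    }
    // ...but ConnectionFailure leaves it armed (in both old and new code),
    // so a connection-level failure alone never settles the timeout.
  }
};
```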

@mattklein123
Member

> Now, when the init timeout fires, the result is an onConfigUpdateFailed(), which disables the timer in 2 of 3 cases. However, the ConfigUpdateFailureReason::ConnectionFailure case in there does not disable the timer (in both the old and new code).

I will let @ramaraochavali chime in, but I think the expected behavior is that we always time out after the timeout period, but allow retries if there are connection errors. This must be where the functional change is, and this definitely needs to be fixed. It's great, though, that we are getting close to understanding the problem!

@ramaraochavali
Contributor

Yes. The original intent was that Envoy initialization should always time out and proceed, making a best effort to get the initial config within the configured time (and retries continue in the background to get the config even after initialization is done).

The relevant PRs that made this change are #7571 and #7427.

Based on what is being discussed here, it is not finishing initialization after 10s, which seems to be the regression. I haven't looked at the new code to see what the problem is, though.

@fredlas
Contributor Author

fredlas commented Dec 4, 2019

It's not clear from Event::Timer's comments whether that timer is a one-time thing, or fires repeatedly until disabled. But if the latter, then there would be an ongoing stream of onConfigUpdateFailed() calls, forever. Would that be an issue for init completing?

If so, then the reordering of the two lines in onConfigUpdate() that I described would explain the different behavior. (I will swap those two lines regardless of whether we decide that was the cause, to match the existing ordering).
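For reference, a one-shot timer would behave like this sketch (hypothetical `OneShotTimerSketch`, not Envoy's Event::Timer API): after firing once it must be re-enabled explicitly, so a single timeout cannot turn into an endless stream of callbacks.

```cpp
#include <cassert>
#include <functional>

// Toy one-shot timer (illustrative only): it fires at most once per
// enableTimer() call, so without an explicit re-enable a single timeout
// cannot produce an ongoing stream of onConfigUpdateFailed() calls.
struct OneShotTimerSketch {
  bool enabled = false;
  int fired = 0;
  std::function<void()> callback;
  void enableTimer() { enabled = true; }
  void disableTimer() { enabled = false; }
  // Simulate the deadline passing.
  void deadlineElapsed() {
    if (!enabled) return;
    enabled = false;  // one-shot: disarm before invoking the callback
    ++fired;
    if (callback) callback();
  }
};
```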

…ing behavior, which treats any news about the fetch as precluding a timeout, as opposed to only wanting a successful one

@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #9189 was synchronize by fredlas.


@ramaraochavali
Contributor

ramaraochavali commented Dec 5, 2019

BTW, you can check the init_fetch_timeout stat on the setup where it initializes successfully, to see whether initialization finished because of the timeout (10s) with the old code.

AFAIK, the timer is a one-time thing.

@fredlas
Contributor Author

fredlas commented Dec 6, 2019

Ok, added some more logging; @rgs1 please give it a run! Even if it doesn't work, the logs will hopefully be enough to point me to the real problem...

@mattklein123
Member

@rgs1 friendly ping on this. Did you have a chance to test this out? Thank you!

@fredlas
Contributor Author

fredlas commented Dec 10, 2019

In progress - I merged just now so that the thing being tested would include the recent security patches.

@rgs1
Member

rgs1 commented Dec 12, 2019

> @rgs1 friendly ping on this. Did you have a chance to test this out? Thank you!

Sorry, getting to this now... should have something to share by tomorrow.

@rgs1
Member

rgs1 commented Dec 12, 2019

@mattklein123 so I just dropped @fredlas' branch on top of master as of yesterday, and got two hosts stuck on init... What's interesting is that they are not even running the "problematic" spiffe config that triggered this in the first place.

I'll get some log entries and follow up offline with @fredlas.

@mattklein123
Member

> I'll get some log entries and follow up offline with @fredlas.

Awesome thank you!

@rgs1
Member

rgs1 commented Dec 12, 2019

So, it seems we are stuck at GrpcMuxImpl::trySendDiscoveryRequests() in source/common/config/grpc_mux_impl.cc:

void GrpcMuxImpl::trySendDiscoveryRequests() {
  while (true) {
    // Do any of our subscriptions even want to send a request?
    absl::optional<std::string> request_type_if_any = whoWantsToSendDiscoveryRequest();
    if (!request_type_if_any.has_value()) {
      break;
    }
    // If so, which one (by type_url)?
    std::string next_request_type_url = request_type_if_any.value();
    SubscriptionState& sub = subscriptionStateFor(next_request_type_url);
    ENVOY_LOG(debug, "GrpcMuxImpl wants to send discovery request for {}", next_request_type_url);
    // Try again later if paused/rate limited/stream down.
    if (!canSendDiscoveryRequest(next_request_type_url)) {
      break;
    }
    // Get our subscription state to generate the appropriate discovery request, and send.
    if (!pausable_ack_queue_.empty()) {
      // Because ACKs take precedence over plain requests, if there is anything in the queue, it's
      // safe to assume it's of the type_url that we're wanting to send.
      //
      // getNextRequestWithAck() returns a raw unowned pointer, which sendGrpcMessage deletes.
      sendGrpcMessage(sub.getNextRequestWithAck(pausable_ack_queue_.front()));
      pausable_ack_queue_.pop();
      ENVOY_LOG(debug, "GrpcMuxImpl sent ACK discovery request for {}", next_request_type_url);
    } else {
      // getNextRequestAckless() returns a raw unowned pointer, which sendGrpcMessage deletes.
      sendGrpcMessage(sub.getNextRequestAckless());
      ENVOY_LOG(debug, "GrpcMuxImpl sent non-ACK discovery request for {}", next_request_type_url);
    }
  }
  maybeUpdateQueueSizeStat(pausable_ack_queue_.size());
}

From our logs, we never leave that loop (timestamps removed for readability, but it loops every ~2 secs):

 GrpcMuxImpl wants to send discovery request for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment
 GrpcMuxImpl sent ACK discovery request for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment
 type.googleapis.com/envoy.api.v2.ClusterLoadAssignment received SotW update
 type.googleapis.com/envoy.api.v2.ClusterLoadAssignment SotW update successful
 GrpcMuxImpl wants to send discovery request for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment
 GrpcMuxImpl sent ACK discovery request for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment
 type.googleapis.com/envoy.api.v2.ClusterLoadAssignment received SotW update
 type.googleapis.com/envoy.api.v2.ClusterLoadAssignment SotW update successful
 GrpcMuxImpl wants to send discovery request for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment
 GrpcMuxImpl sent ACK discovery request for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment
 type.googleapis.com/envoy.api.v2.ClusterLoadAssignment received SotW update
 type.googleapis.com/envoy.api.v2.ClusterLoadAssignment SotW update successful
 GrpcMuxImpl wants to send discovery request for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment
 GrpcMuxImpl sent ACK discovery request for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment
 type.googleapis.com/envoy.api.v2.ClusterLoadAssignment received SotW update
 type.googleapis.com/envoy.api.v2.ClusterLoadAssignment SotW update successful

@rgs1
Member

rgs1 commented Dec 12, 2019

So our two possibilities for exiting that loop are:

    // Do any of our subscriptions even want to send a request?
    absl::optional<std::string> request_type_if_any = whoWantsToSendDiscoveryRequest();
    if (!request_type_if_any.has_value()) {
      break;
    }
    // Try again later if paused/rate limited/stream down.
    if (!canSendDiscoveryRequest(next_request_type_url)) {
      break;
    }

Let me check what's going on in those...

@fredlas
Contributor Author

fredlas commented Dec 12, 2019

I think (and it looks confirmed by other lines in the log showing messages from the server) that it is not an infinite loop, but rather repeated visits to that loop. The server is sending a new DiscoveryResponse every ~2 seconds, prompting those bursts of activity.
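That reading — bursts, not a hung loop — can be modeled as a drain loop that exits once no subscription wants to send, and is simply re-entered when the server's next response arrives (hypothetical `MuxSketch`, not Envoy's GrpcMuxImpl):

```cpp
#include <cassert>
#include <deque>
#include <string>

// Sketch only: trySendDiscoveryRequests() drains whatever currently wants
// to go out and then returns; each server response queues more work and
// triggers another pass, which shows up as a ~2s burst in the logs.
struct MuxSketch {
  std::deque<std::string> wants_to_send;
  int sent = 0;
  void trySendDiscoveryRequests() {
    while (!wants_to_send.empty()) {  // exits once nothing wants to send
      wants_to_send.pop_front();
      ++sent;
    }
  }
  void onDiscoveryResponse(const std::string& type_url) {
    wants_to_send.push_back(type_url);  // e.g. an ACK for this type_url
    trySendDiscoveryRequests();
  }
};
```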

@rgs1
Member

rgs1 commented Dec 13, 2019

Quick update:

I can't repro the prev loop anymore:

GrpcMuxImpl wants to send discovery request for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment
GrpcMuxImpl sent ACK discovery request for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment
type.googleapis.com/envoy.api.v2.ClusterLoadAssignment received SotW update
type.googleapis.com/envoy.api.v2.ClusterLoadAssignment SotW update successful

But on the other hand, although it's still failing to load listeners/clusters, the fetch timeout seems to be working, because I am getting:

[2019-12-13 00:08:17.122][15][debug][config] [external/envoy/source/common/config/grpc_subscription_impl.cc:87] type.googleapis.com/envoy.api.v2.Cluster update failed: FetchTimedout
[2019-12-13 00:08:27.122][15][debug][config] [external/envoy/source/common/config/grpc_subscription_impl.cc:87] type.googleapis.com/envoy.api.v2.Listener update failed: FetchTimedout

And server state gets to LIVE:

$ curl -s 0:9901/server_info  |egrep state
 "state": "LIVE",

which I think we believe was the original behavior that regressed?

@rgs1
Member

rgs1 commented Dec 13, 2019

And... ignore what I just said. I moved the repro to the wrong env where configs were missing, thus the above error... Trying again.

@rgs1
Member

rgs1 commented Dec 13, 2019

Ok, sorry for the above confusion; my repro got mixed up with other ongoing experiments.

This is what we have when setting the log level to trace:

[2019-12-13 00:41:13.894][15][debug][config] [external/envoy/source/common/config/sotw_subscription_state.cc:80] Config update for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment (version 2019-12-12 22:52:06.730034005 +0000 UTC m=+0.022213113#3274) accepted with 1 resources
[2019-12-13 00:41:15.774][15][debug][config] [external/envoy/source/common/config/sotw_subscription_state.cc:80] Config update for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment (version 2019-12-12 22:52:06.730034005 +0000 UTC m=+0.022213113#3275) accepted with 1 resources
[2019-12-13 00:41:17.799][15][debug][config] [external/envoy/source/common/config/sotw_subscription_state.cc:80] Config update for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment (version 2019-12-12 22:52:06.730034005 +0000 UTC m=+0.022213113#3276) accepted with 1 resources
[2019-12-13 00:41:21.899][15][debug][config] [external/envoy/source/common/config/sotw_subscription_state.cc:80] Config update for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment (version 2019-12-12 22:52:06.730034005 +0000 UTC m=+0.022213113#3278) accepted with 1 resources
[2019-12-13 00:41:23.797][15][debug][config] [external/envoy/source/common/config/sotw_subscription_state.cc:80] Config update for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment (version 2019-12-12 22:52:06.730034005 +0000 UTC m=+0.022213113#3279) accepted with 1 resources
[2019-12-13 00:41:27.801][15][debug][config] [external/envoy/source/common/config/sotw_subscription_state.cc:80] Config update for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment (version 2019-12-12 22:52:06.730034005 +0000 UTC m=+0.022213113#3281) accepted with 1 resources
[2019-12-13 00:41:33.799][15][debug][config] [external/envoy/source/common/config/sotw_subscription_state.cc:80] Config update for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment (version 2019-12-12 22:52:06.730034005 +0000 UTC m=+0.022213113#3284) accepted with 1 resources

We are accepting the same ClusterLoadAssignment version again and again...

@mattklein123 @ramaraochavali would that cause init to be stuck on PRE_INITIALIZING?

@mattklein123
Member

> @mattklein123 @ramaraochavali would that cause init to be stuck on PRE_INITIALIZING?

IMO it should not be stuck here and should continue one way or the other, so there is some bug, but TBH I'm pretty far removed from the xDS client these days. cc @htuch @fredlas @wgallagher ^

@ramaraochavali
Contributor

Interesting. I think it is because the init fetch timer is disabled on every config update (https://github.com/envoyproxy/envoy/blob/master/source/common/config/grpc_mux_subscription_impl.cc#L52), so the timeout will not help. The timeout is also not intended for this case. But once ClusterLoadAssignments are received for every EDS cluster, cluster initialization should move forward to the next stage if all clusters have endpoints, or time out if endpoints for some clusters are missing.
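The per-subscription interaction described here can be sketched as follows (hypothetical `SubscriptionSketch`, not the real GrpcMuxSubscriptionImpl): the first successful update disables the init fetch timer, so with EDS updates arriving every ~2s that subscription's timeout can never fire; it only guards the window before the first update, and says nothing about a different subscription that never receives a response.

```cpp
#include <cassert>

// Sketch only: the init fetch timer is disabled on the first successful
// config update, so once updates are flowing the timeout can never fire
// for that subscription -- it only covers the pre-first-update window.
struct SubscriptionSketch {
  bool timer_armed = true;
  bool timed_out = false;
  void onConfigUpdate() { timer_armed = false; }  // disableInitFetchTimeoutTimer()
  void initFetchTimeout() {
    if (timer_armed) {
      timer_armed = false;
      timed_out = true;
    }
  }
};
```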

@fredlas
Contributor Author

fredlas commented Dec 13, 2019

I think it's worth mentioning #9206, which is a problem seen when using the existing DELTA_GRPC (so, basically identical logic to this). Unlike there, our current case does see an LDS request sent, so it's not exactly the same problem. However, it feels pretty close, since it's about init getting stuck.

The facts of that one point to something being wrong outside of the xDS client logic. That is, everything seen of the xDS client's behavior is correct (AFAICT), but Envoy is using GrpcSubscription in a way that gets it into trouble. (Or not using, more like: the problem there is that Envoy never even starts an LDS subscription).

@mattklein123
Member

@fredlas I think wherever the problem itself is (in the xDS client code or "outside" it in Envoy), the fact remains that there is a behavior change in this new patch that we must understand. If we don't feel that we can understand it in a reasonable amount of time, I think we need to start thinking about how to break it into smaller pieces. I think @wgallagher had some potential thoughts on how to do that.

@wgallagher

I think that splitting up this change into smaller pieces is probably the best way to get it submitted. I don't have any specific ideas about what the split lines should be but I would be more than happy to help if that's the route we decide we want to take.

@fredlas
Contributor Author

fredlas commented Dec 13, 2019

It turns out the every-2-seconds update thing was a red herring; the exact same thing happens with the working head-of-master code (and everything is fine). It sounds like the only real problem is listeners not getting received. So, I think this problem is the same as #9206, or else very close.

@fredlas
Contributor Author

fredlas commented Dec 13, 2019

Yeah, doing as much as possible in small pieces might be necessary. "Something about init" is not much to go on, and if nobody who is familiar with the init code narrows it down, it will be hard to make progress.

Unfortunately I think breaking it up is going to be awkward. The old gRPC ADS code needed to be rewritten because it wasn't amenable to supporting a slightly different sibling version like delta; it wasn't amenable because its logic had evolved into a single interwoven ball in places where the new code needs to be modular. I think at some point there will be fundamental mismatches.

But, at least as a start, the big try in GrpcMuxImpl::onDiscoveryResponse() should be replaceable pretty directly by the SotW variant of WatchMap::onConfigUpdate(). However, that will mean adding WatchMap to GrpcMuxImpl as the component responsible for watches. The current setup has watches being their own objects, with RAII behavior. GrpcMuxSubscriptionImpl will also need a little adjustment to its current interaction with the watches.

The only other logic touching watches is the "Maintain a set to avoid dupes" part of GrpcMuxImpl::sendDiscoveryRequest(). That is basically an on-the-fly reconstruction of what SotwSubscriptionState::names_tracked_ tracks. That can probably stay as-is, might just need to provide a way to reach into WatchMap and see resource names.
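The dedupe bookkeeping mentioned above could be sketched like this (hypothetical `WatchMapSketch`; the real WatchMap and SotwSubscriptionState::names_tracked_ differ in detail): the deduplicated union of all watches' resource names is what the next discovery request should carry.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Sketch only: each watch is interested in some resource names; the
// discovery request should carry the deduplicated union across watches,
// which is roughly what names_tracked_ reconstructs on the fly.
struct WatchMapSketch {
  std::map<int, std::set<std::string>> watch_interest_;  // watch id -> names

  void updateWatchInterest(int watch_id, std::set<std::string> names) {
    watch_interest_[watch_id] = std::move(names);
  }

  // Union of all watches' names, deduplicated by the set.
  std::set<std::string> allRequestedNames() const {
    std::set<std::string> out;
    for (const auto& entry : watch_interest_) {
      out.insert(entry.second.begin(), entry.second.end());
    }
    return out;
  }
};
```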

@mattklein123
Member

@fredlas I think @wgallagher has some thoughts on how to split this up. I'm going to close this out for now and @wgallagher will follow up with next steps. Thank you for helping to look into this issue. Really appreciate it.
