xds: Possibly incorrect ADS requests on new resources #4009

Closed
zerospiel opened this issue Nov 2, 2020 · 4 comments

@zerospiel

Hi,

I have a simple xDS setup that uses envoyproxy/go-control-plane as the management server. I've run into some strange behavior from the xDS client: it sends several new ADS requests to the server for the same resource type, stacking up resource names with each new request. I'm not sure whether this is expected behavior.

Let's look at the logs, given the following LB config generated by the xDS client:

{
   "loadBalancingConfig":[
      {
         "xds_routing_experimental":{
            "action":{
               "service1-70.platform:grpc_1130880091":{
                  "childPolicy":[
                     {
                        "weighted_target_experimental":{
                           "targets":{
                              "service1-70.platform:grpc":{
                                 "weight":100,
                                 "childPolicy":[
                                    {
                                       "cds_experimental":{
                                          "cluster":"service1-70.platform:grpc"
                                       }
                                    }
                                 ]
                              }
                           }
                        }
                     }
                  ]
               },
               "service1-canary.platform:grpc_service1.platform:grpc_233446717":{
                  "childPolicy":[
                     {
                        "weighted_target_experimental":{
                           "targets":{
                              "service1-canary.platform:grpc":{
                                 "weight":20,
                                 "childPolicy":[
                                    {
                                       "cds_experimental":{
                                          "cluster":"service1-canary.platform:grpc"
                                       }
                                    }
                                 ]
                              },
                              "service1.platform:grpc":{
                                 "weight":80,
                                 "childPolicy":[
                                    {
                                       "cds_experimental":{
                                          "cluster":"service1.platform:grpc"
                                       }
                                    }
                                 ]
                              }
                           }
                        }
                     }
                  ]
               }
            },
            "route":[
               {
                  "prefix":"/",
                  "headers":[
                     {
                        "name":"x-o3-meshversion",
                        "invertMatch":false,
                        "presentMatch":true
                     }
                  ],
                  "action":"service1-70.platform:grpc_1130880091"
               },
               {
                  "prefix":"/",
                  "headers":[
                     {
                        "name":"x-o3-meshversion",
                        "invertMatch":false,
                        "presentMatch":false
                     }
                  ],
                  "action":"service1-canary.platform:grpc_service1.platform:grpc_233446717"
               }
            ]
         }
      }
   ]
}

So there are three different CDS resource names. Here is what I see in the client logs:

[xds][xds-client 0xc000850000] ADS request sent: node:<id:"opaque_node_identifier_should_be_setted_here" metadata:<fields:<key:"R_GCP_PROJECT_NUMBER" value:<string_value:"some_postfix_node_id" > > > locality:<region:"RU_CENTRAL" zone:"DATALINE" > build_version:"gRPC Go 1.33.1" user_agent_name:"gRPC Go" user_agent_version:"1.33.1" client_features:"envoy.lb.does_not_support_overprovisioning" > resource_names:"service1-70.platform:grpc" type_url:"type.googleapis.com/envoy.api.v2.Cluster"
...
...
// response from the server on the already-opened stream, carrying a nonce
[xds][xds-client 0xc000850000] ADS response received: version_info:"3" resources:<type_url:"type.googleapis.com/envoy.api.v2.Cluster" value:"\\n\\031service1-70.platform:grpc\\032\\037\\n\\002\\032\\000\\022\\031service1-70.platform:grpc\\020\\003" > type_url:"type.googleapis.com/envoy.api.v2.Cluster" nonce:"3"
...
...
[xds][xds-client 0xc000850000] ADS request sent: node:<id:"opaque_node_identifier_should_be_setted_here" metadata:<fields:<key:"R_GCP_PROJECT_NUMBER" value:<string_value:"some_postfix_node_id" > > > locality:<region:"RU_CENTRAL" zone:"DATALINE" > build_version:"gRPC Go 1.33.1" user_agent_name:"gRPC Go" user_agent_version:"1.33.1" client_features:"envoy.lb.does_not_support_overprovisioning" > resource_names:"service1-canary.platform:grpc" resource_names:"service1-70.platform:grpc" type_url:"type.googleapis.com/envoy.api.v2.Cluster"
...
...
[xds][xds-client 0xc000850000] ADS request sent: node:<id:"opaque_node_identifier_should_be_setted_here" metadata:<fields:<key:"R_GCP_PROJECT_NUMBER" value:<string_value:"some_postfix_node_id" > > > locality:<region:"RU_CENTRAL" zone:"DATALINE" > build_version:"gRPC Go 1.33.1" user_agent_name:"gRPC Go" user_agent_version:"1.33.1" client_features:"envoy.lb.does_not_support_overprovisioning" > resource_names:"service1-70.platform:grpc" resource_names:"service1-canary.platform:grpc" resource_names:"service1.platform:grpc" type_url:"type.googleapis.com/envoy.api.v2.Cluster"
...
...
[xds][xds-client 0xc000850000] Sending ACK for response type: ClusterResource, version: 3, nonce: 3
...
...
[xds][xds-client 0xc000850000] ADS request sent: version_info:"3" node:<id:"opaque_node_identifier_should_be_setted_here" metadata:<fields:<key:"R_GCP_PROJECT_NUMBER" value:<string_value:"some_postfix_node_id" > > > locality:<region:"RU_CENTRAL" zone:"DATALINE" > build_version:"gRPC Go 1.33.1" user_agent_name:"gRPC Go" user_agent_version:"1.33.1" client_features:"envoy.lb.does_not_support_overprovisioning" > resource_names:"service1-70.platform:grpc" resource_names:"service1-canary.platform:grpc" resource_names:"service1.platform:grpc" type_url:"type.googleapis.com/envoy.api.v2.Cluster" response_nonce:"3"

As I wrote above, the client sends requests with an accumulating list of resource names, and finally sends an ACK that lists all of the resource names but carries the nonce of the very first server response, which covered only one of those names. On the server side it looks like this:

2020/11/02 10:57:42 OnStreamRequest [id: 1]: ["type.googleapis.com/envoy.api.v2.Cluster"]: Version: [""]; Nonce: [""]; ResourceNames [[service1-70.platform:grpc]]; 
2020/11/02 10:57:42 OnStreamResponse [id: 1]: ["type.googleapis.com/envoy.api.v2.Cluster"]: Version: ["3"]; Nonce: ["3"]
2020/11/02 10:57:43 OnStreamRequest [id: 1]: ["type.googleapis.com/envoy.api.v2.Cluster"]: Version: [""]; Nonce: [""]; ResourceNames [[service1-canary.platform:grpc service1-70.platform:grpc]]; 
2020/11/02 10:57:43 OnStreamRequest [id: 1]: ["type.googleapis.com/envoy.api.v2.Cluster"]: Version: [""]; Nonce: [""]; ResourceNames [[service1-70.platform:grpc service1-canary.platform:grpc service1.platform:grpc]]; 
// this is the ACK from the client
2020/11/02 10:57:43 OnStreamRequest [id: 1]: ["type.googleapis.com/envoy.api.v2.Cluster"]: Version: ["3"]; Nonce: ["3"]; ResourceNames [[service1.platform:grpc service1-70.platform:grpc service1-canary.platform:grpc]];

What I expected is that the client would send a single request to the server with all of the names, like:

node:
    id:"opaque_node_identifier_should_be_setted_here"
    metadata:
        fields:
            key:"R_GCP_PROJECT_NUMBER"
            value:"some_postfix_node_id"
    locality:
        region:"RU_CENTRAL" 
        zone:"DATALINE"
build_version:"gRPC Go 1.33.1" 
user_agent_name:"gRPC Go" 
user_agent_version:"1.33.1" 
client_features:"envoy.lb.does_not_support_overprovisioning"
resource_names:"service1-70.platform:grpc" 
resource_names:"service1-canary.platform:grpc" 
resource_names:"service1.platform:grpc" 
type_url:"type.googleapis.com/envoy.api.v2.Cluster"

To sum up, I have a simple question: is this behavior of the xDS client fully expected, meaning I should change the management server instead?

@menghanl
Contributor

menghanl commented Nov 5, 2020

I believe this is the right behavior.
The resource names indicate what the client needs. The client can add or remove items at any time.
The version/nonce indicates what the client has received. It's like a response to the response from the server.
There's no real relation between the resource names and the version/nonce.

And the server should only need to look at the last request for each resource type: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#when-to-send-an-update
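To illustrate why the names and the version/nonce move independently, here is a minimal sketch of the client-side bookkeeping for one resource type. It is a simplified model, not the grpc-go implementation: `Request`, `Subscription`, `Watch`, and `Ack` are hypothetical names made up for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Request is a simplified, hypothetical stand-in for an xDS DiscoveryRequest.
type Request struct {
	TypeURL       string
	ResourceNames []string
	VersionInfo   string
	ResponseNonce string
}

// Subscription holds the client's state for one resource type.
type Subscription struct {
	typeURL string
	names   map[string]bool // what the client wants
	version string          // last version the client accepted
	nonce   string          // nonce of the response being acknowledged
}

func NewSubscription(typeURL string) *Subscription {
	return &Subscription{typeURL: typeURL, names: map[string]bool{}}
}

// request snapshots the current state into a Request.
func (s *Subscription) request() Request {
	names := make([]string, 0, len(s.names))
	for n := range s.names {
		names = append(names, n)
	}
	sort.Strings(names)
	return Request{
		TypeURL:       s.typeURL,
		ResourceNames: names,
		VersionInfo:   s.version,
		ResponseNonce: s.nonce,
	}
}

// Watch adds a name to the wanted set; version and nonce are untouched.
func (s *Subscription) Watch(name string) Request {
	s.names[name] = true
	return s.request()
}

// Ack records the accepted version and nonce; the wanted set is untouched.
func (s *Subscription) Ack(version, nonce string) Request {
	s.version, s.nonce = version, nonce
	return s.request()
}

func main() {
	s := NewSubscription("type.googleapis.com/envoy.api.v2.Cluster")
	fmt.Printf("%+v\n", s.Watch("service1-70.platform:grpc"))
	fmt.Printf("%+v\n", s.Watch("service1-canary.platform:grpc"))
	fmt.Printf("%+v\n", s.Watch("service1.platform:grpc"))
	// The ACK for the first response still lists every name wanted right now.
	fmt.Printf("%+v\n", s.Ack("3", "3"))
}
```

In this model every request carries the full current set of names, so an ACK for an early response naturally lists names that were added after that response was produced.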

Let me know if I missed something, or misunderstood your question

@zerospiel
Author

Hi again,

Thanks a lot for the clarification! I understand the ACK/NACK behavior. What was unclear to me was the stacking of resource names across ADS requests.

I read the link you provided more carefully and found that the current envoyproxy/go-control-plane stream implementation does not support the expected behavior for LDS and CDS (the types for which I had errors on updates) described here:
https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#grouping-resources-into-responses

So I think my issues are with the management server, not the grpc-go xDS client.

I have an additional question (about the server) and would be grateful if you could help with it:
How can a management server determine that an xDS client has sent its last request for a resource type before responding to it?

For example, based on the logs from my initial message:

  • the client sent an initial ADS request with an empty version for resource type type.googleapis.com/envoy.api.v2.Cluster with resource name service1-70
  • the client sent the same request but with resource names service1-70, service1-canary
  • the same again but with resource names service1-70, service1-canary, service1
  • the server should now send a response for the requested names (actually not just the requested ones but the complete SotW), open a stream, and wait for an (N)ACK

The third request was in fact the last one for resource type Cluster, but what conditions should I check before sending a response for the requested resource names and opening a stream?

@menghanl
Contributor

menghanl commented Nov 6, 2020

The server should send a response when

  • the client request's resource names changed
  • a resource the client requested changed

So when a request is received, the server just needs to compare its resource names with those of the previous request and decide whether a new resource needs to be added to the response. On an ADS stream, a request should only be compared with earlier requests of the same type.
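A minimal sketch of that comparison, assuming the server keeps the last requested name set per type URL for each stream. `StreamState` and `NamesChanged` are hypothetical names for this example, not part of go-control-plane. The comparison ignores ordering, since the client may list the same names in a different order from one request to the next.

```go
package main

import "fmt"

// StreamState remembers, per type URL, the names from the previous request
// on one ADS stream. Hypothetical helper, for illustration only.
type StreamState struct {
	last map[string]map[string]bool
}

func NewStreamState() *StreamState {
	return &StreamState{last: map[string]map[string]bool{}}
}

// NamesChanged reports whether the requested name set differs from the
// previous request of the same type, and records the new set.
// The first request for a type always counts as a change.
func (s *StreamState) NamesChanged(typeURL string, names []string) bool {
	next := make(map[string]bool, len(names))
	for _, n := range names {
		next[n] = true
	}
	prev, seen := s.last[typeURL]
	s.last[typeURL] = next
	if !seen || len(prev) != len(next) {
		return true
	}
	for n := range next {
		if !prev[n] {
			return true
		}
	}
	return false
}

func main() {
	const cds = "type.googleapis.com/envoy.api.v2.Cluster"
	st := NewStreamState()
	// The four requests from the server log in this issue, in order.
	fmt.Println(st.NamesChanged(cds, []string{"service1-70"}))
	fmt.Println(st.NamesChanged(cds, []string{"service1-canary", "service1-70"}))
	fmt.Println(st.NamesChanged(cds, []string{"service1-70", "service1-canary", "service1"}))
	// The ACK lists the same three names in a different order.
	fmt.Println(st.NamesChanged(cds, []string{"service1", "service1-70", "service1-canary"}))
}
```

With this check the server does not need to guess which request is the "last" one: each request that changes the name set triggers a response, and the ACK that repeats the same names does not.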

SotW means that the server needs to send ALL the resources the client asked for (even if only one of them changed). It does not mean sending all resources the server knows about. https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#four-variants
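As a rough sketch of that rule, assuming the server keeps its resources in a map keyed by name: the response is built from the requested names, not from the whole store. `Cluster` and `SotWResponse` are hypothetical placeholders for this example.

```go
package main

import "fmt"

// Cluster is a hypothetical placeholder for a real resource payload.
type Cluster struct{ Name string }

// SotWResponse returns every requested resource the server currently has,
// whether or not it changed, and nothing the client did not ask for.
func SotWResponse(store map[string]Cluster, requested []string) []Cluster {
	out := make([]Cluster, 0, len(requested))
	for _, name := range requested {
		if c, ok := store[name]; ok {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	store := map[string]Cluster{
		"service1":        {Name: "service1"},
		"service1-canary": {Name: "service1-canary"},
		"service1-70":     {Name: "service1-70"},
		"unrelated":       {Name: "unrelated"},
	}
	// "unrelated" is known to the server but was not requested, so it is left out.
	fmt.Println(SotWResponse(store, []string{"service1-70", "service1-canary", "service1"}))
}
```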

There's a bug in go-control-plane's LDS support for gRPC: envoyproxy/go-control-plane#349. You could have been affected by this.


I'm not sure what you mean by the server "opening a stream". Servers don't open streams. They only accept requests on streams that are opened by the clients.


As for ACK/NACK: there's usually not much a server can do when it receives a NACK. The server should keep the ACK/NACK status for its users (https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/status/v3/csds.proto#enum-service-status-v3-clientconfigstatus).
A NACK usually also comes with an error message, which the server can log.
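A minimal sketch of how a server might tell these requests apart, assuming it remembers the nonce of the last response it sent for the type. In the xDS protocol a NACK is a request whose `error_detail` is populated; the `Request` struct, its `ErrorMessage` field, and `Classify` are simplified, hypothetical stand-ins for this example.

```go
package main

import "fmt"

// Request is a simplified, hypothetical stand-in for a DiscoveryRequest.
// ErrorMessage models the message inside the error_detail field.
type Request struct {
	VersionInfo   string
	ResponseNonce string
	ErrorMessage  string
}

type Kind string

const (
	Subscribe Kind = "subscribe" // no nonce: not a reply to any response
	Stale     Kind = "stale"     // replies to an older response; can be ignored
	NACK      Kind = "nack"      // client rejected the last response
	ACK       Kind = "ack"       // client accepted the last response
)

// Classify decides what a request means, given the nonce of the last
// response the server sent for this resource type.
func Classify(req Request, lastNonce string) Kind {
	switch {
	case req.ResponseNonce == "":
		return Subscribe
	case req.ResponseNonce != lastNonce:
		return Stale
	case req.ErrorMessage != "":
		return NACK
	default:
		return ACK
	}
}

func main() {
	// The last request from the server log in this issue.
	req := Request{VersionInfo: "3", ResponseNonce: "3"}
	kind := Classify(req, "3")
	fmt.Println(kind)
	if kind == NACK {
		// Nothing to retry automatically; record the status and log the reason.
		fmt.Println("client rejected config:", req.ErrorMessage)
	}
}
```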

@zerospiel
Author

Thanks a lot for your detailed answer!

Actually, I'm not affected by that bug because I'm not using full state snapshots.
I'll now try to solve my problems on the server side and use SotW properly.

Closing this issue because you provided the answers and there are no problems with the xds client in grpc-go (as of 1.34.0-dev at least).

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 25, 2021