
snapshot cache doesn't handle xDS request correctly #399

Closed
zhishi opened this issue Feb 26, 2021 · 10 comments

Comments

@zhishi

zhishi commented Feb 26, 2021

Based on the xDS protocol (https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol):

When the client sends a new request that changes the set of resources being requested, the server must resend any newly requested resources, even if it previously sent those resources without having been asked for them and the resources have not changed since that time.

It basically means that checking version_info alone is not enough to know whether the discovery server should send a response. The code below won't send a response if the request merely adds a new resource name while carrying the previous version_info:

	// if the requested version is up-to-date or missing a response, leave an open watch
	if !exists || request.VersionInfo == version {

It needs to check both version_info and resource_names, which is much more complicated. Because the request doesn't carry the previously requested resource_names, we must store them somewhere in the cache server; but the CreateWatch() interface is called per request rather than per stream, so there is no way to know which resource_names were previously sent. I think to fully resolve this issue correctly we need to add stream ID information to the CreateWatch() function.

@eduser25

eduser25 commented Mar 18, 2021

We think we are hitting this, and it's not obvious why versioning takes precedence when the protocol specifies that resources must be resent if the set of requested resources changes on the same version. We are wondering if we are missing something basic; example below.

Populate snapshot cache with a single version for an envoy

"message":"Proxy client.client"
"message":"[CDS] resources: [server/server client/client-local]"}
"message":"[EDS] resources: [server/server]"}
"message":"[LDS] resources: [outbound-listener]"}
"message":"[RDS] res: [Outbound]"}
"message":"[SDS] res: [tls-cert-for-server validation-context-secret]"}

Both SDS secrets are used for server/server's UpstreamTlsContext, one as the tls_certificate and the other as the validation_context.
After we reply to CDS, Envoy requests the certs; initially it asks for only one of them:

"message":"OnStreamRequest NODE: [client.client]"}
"message":"OnStreamRequest: TYPE: [type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret]"}
"message":"OnStreamRequest: REQUEST RESOURCES: [[tls-cert-for-server]]"}
time="2021-03-18T22:21:15Z" level=debug msg="respond type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret[tls-cert-for-server] version \"\" with version \"1\""
"message":"RESPONSE: Secret[tsl-cert-for-server] num resources: 1"}

Just milliseconds later, a request on the same version is issued for both of them:

"message":"OnStreamRequest NODE: [client.client]"}
"message":"OnStreamRequest: TYPE: [type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret]"}
"message":"OnStreamRequest: REQUEST RESOURCES: [[tls-cert-for-server validation-context-secret]]"}
time="2021-03-18T22:21:15Z" level=debug msg="open watch 9 for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret[tsl-cert-for-server validation-context-secret] from nodeID \"client.client\", version \"1\""

A watcher is spawned, but validation-context-secret never seems to be sent to Envoy after this, even though it was in the snapshot from the very beginning.
We suspect this is because version "1" has already been acknowledged.

After this interaction Envoy never recovers: validation-context-secret remains uninitialized and the server/server cluster stays warming forever.

This is go-control-plane 0.9.8 and Envoy 1.17.1.

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Apr 18, 2021
@shashankram

Commenting so this issue doesn't get wiped out by the stale bot.
@kyessenov, @yangminzhu, @jyotima, @jessicayuen could you help with this?

@github-actions github-actions bot removed the stale label Apr 21, 2021
@czhu-wish

czhu-wish commented May 16, 2021

We also found a similar issue recently where we use ADS to update clusters and routes. We print the snapshot and everything looks correct. However, looking at the discovery callbacks, we can see OnStreamRequest asking for both CDS and RDS on the same stream, but the server only responds to CDS; the RDS response never happens. Envoy still runs afterward, but the config is messed up.

{"level":"info","timestamp":"2021-05-15T00:17:15.048Z","caller":"server/envoy_management_server_callbacks.go:29","msg":"stream 1 open for ADS"}
{"level":"info","timestamp":"2021-05-15T00:17:15.048Z","caller":"server/envoy_management_server_callbacks.go:59","msg":"stream 1 gets request of resources [] with type type.googleapis.com/envoy.config.cluster.v3.Cluster and version "}
{"level":"info","timestamp":"2021-05-15T00:17:15.049Z","caller":"server/envoy_management_server_callbacks.go:76","msg":"stream 1 responds ......] with type type.googleapis.com/envoy.config.cluster.v3.Cluster and version \"8ed2d1c65d155492820c179ae3e815d4\""}
{"level":"info","timestamp":"2021-05-15T00:17:15.054Z","caller":"server/envoy_management_server_callbacks.go:59","msg":"stream 1 gets request of resources [] with type type.googleapis.com/envoy.config.cluster.v3.Cluster and version \"8ed2d1c65d155492820c179ae3e815d4\""}
{"level":"info","timestamp":"2021-05-15T00:17:15.221Z","caller":"server/envoy_management_server_callbacks.go:59","msg":"stream 1 gets request of resources [egress_route_dynamic] with type type.googleapis.com/envoy.config.route.v3.RouteConfiguration and version "}
{"level":"info","timestamp":"2021-05-15T00:17:15.221Z","caller":"server/envoy_management_server_callbacks.go:59","msg":"stream 1 gets request of resources [ingress_route_dynamic egress_route_dynamic] with type type.googleapis.com/envoy.config.route.v3.RouteConfiguration and version "}

@alecholmez
Contributor

I think this issue is related to the following: #431

We potentially have a solution and we are looking into it. Thank you all for your patience!


@github-actions github-actions bot added the stale label Jun 25, 2021
@github-actions

github-actions bot commented Jul 2, 2021

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@kkalin68
Contributor

kkalin68 commented Aug 31, 2021

I have the same issue using go-control-plane (v0.9.9). Initially I thought it was a gRPC client issue (grpc/grpc-go#4703), but with @easwars's help we found that the issue is really in the go-control-plane implementation. I checked the code: go-control-plane ignores a second DiscoveryRequest of the same type, regardless of whether the client requests more resources than in the first request.
I checked the Java control-plane implementation and found that it handles this situation properly, because it tracks the resource names that the control plane has sent back to a client on the stream (see https://github.com/envoyproxy/java-control-plane/blob/main/cache/src/main/java/io/envoyproxy/controlplane/cache/SimpleCache.java#L101 and https://github.com/envoyproxy/java-control-plane/blob/main/server/src/main/java/io/envoyproxy/controlplane/server/DiscoveryRequestStreamObserver.java#L77).

I implemented a patch for go-control-plane (similar to the Java implementation) and tested it; it works. I can open a PR if the way Java handles this looks correct here as well. At least it seems quite logical to me.

@snowp
Contributor

snowp commented Aug 31, 2021

Yep, I made the java-control-plane change way back; it would make sense to do the same thing in go-control-plane.

kkalin68 added a commit to kkalin68/go-control-plane that referenced this issue Sep 1, 2021
…voyproxy#399

The change is implemented similarly to the Java implementation.
The control plane keeps track of resource names that a caller knows
already.

Signed-off-by: Konstantin Kalin <[email protected]>
@kkalin68
Contributor

kkalin68 commented Sep 1, 2021

I've opened PR #490

@alecholmez alecholmez added the bug label Sep 1, 2021
kkalin68 added a commit to kkalin68/go-control-plane that referenced this issue Oct 15, 2021
kkalin68 added a commit to kkalin68/go-control-plane that referenced this issue Oct 21, 2021
kkalin68 added a commit to kkalin68/go-control-plane that referenced this issue Oct 22, 2021