Skip to content

clarify xds docs on resource deletion#20523

Merged
lizan merged 7 commits intoenvoyproxy:mainfrom
ramaraochavali:fix/clarify_xds
Mar 29, 2022
Merged

clarify xds docs on resource deletion#20523
lizan merged 7 commits intoenvoyproxy:mainfrom
ramaraochavali:fix/clarify_xds

Conversation

@ramaraochavali
Copy link
Copy Markdown
Contributor

Commit Message: clarify xds docs on resource deletion
Additional Description: Resource deletion for non wildcard resource types like eds and rds will only happen on parent resource deletion for both Sotw and Incremental variants. This PR clarifies it in doc.
Risk Level: N/A
Testing: N/A
Docs Changes: N/A
Release Notes:N/A
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @lizan
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #20523 was opened by ramaraochavali.

see: more, trace.

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
In the incremental protocol variants, the server signals the client that a resource should be
deleted via the :ref:`removed_resources <envoy_v3_api_field_service.discovery.v3.DeltaDiscoveryResponse.removed_resources>`
field of the response. This tells the client to remove the resource from its local cache.
field of the response for wildcard resource types like
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about add a section named "wildcard resource" and remove like here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added "wildcard resource" paragraph

@daixiang0
Copy link
Copy Markdown
Member

Also fix CI please.

@phlax
Copy link
Copy Markdown
Member

phlax commented Mar 25, 2022

@phlax phlax self-assigned this Mar 25, 2022
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Copy link
Copy Markdown
Member

@phlax phlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ramaraochavali small nit with the headers - other than that lgtm from docs pov

probably would be good to get it checked over by someone who knows xds better than me

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@ramaraochavali
Copy link
Copy Markdown
Contributor Author

@phlax thank you. fixed. PTAL

Copy link
Copy Markdown
Member

@phlax phlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm from docs pov - thanks @ramaraochavali

cc @lizan for api review

@lizan lizan merged commit 642a7b6 into envoyproxy:main Mar 29, 2022
@ramaraochavali ramaraochavali deleted the fix/clarify_xds branch March 29, 2022 07:05

Other Resource Types
""""""""""""""""""""
In both SotW and incremental protocol variants, deletions are indicated implicitly by parent resources being
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. The incremental protocol variant does explicitly indicate resource removal for all resource types, not just for wildcard resource types.

Please fix this in a subsequent PR.

Copy link
Copy Markdown
Contributor Author

@ramaraochavali ramaraochavali Mar 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The incremental protocol variant does explicitly indicate resource removal for all resource types, not just for wildcard resource types.

I think implementation for delta even though removed_resources are specified, would cleanup resources only on parent deletion for non wildcard types like RDS and EDS
cc: @howardjohn

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that is my understanding of how Envoy EDS works.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example RDS:

"Server sent a delta {} update attempting to remove a resource (name: {}). Ignoring.",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that the current Envoy implementation works that way, but I think that's because the current implementation does not correctly separate the xDS data model layer from the the xDS transport protocol layer.

The xDS transport protocol layer is just a generic pub/sub mechanism where clients can subscribe to arbitrary resources by name and resource type. The transport protocol layer does not know or care about the contents of those resources; it treats them as opaque blobs that are passed up to the data model layer. The data model layer is where the contents of the resources are actually understood -- e.g., when a Listener resource is provided to the data model layer, the data model layer knows what all of the fields in the proto mean and how to use them.

At the xDS transport protocol layer, there are two things matching up:

  1. The set of resources that the client is subscribed to. This is controlled by the client, not the server.
  2. The current value of each resource. This is controlled by the server, not the client.

Note that (1) is determined at the data model layer. Envoy initially makes a wildcard query for Listener and Cluster resources, and the returned set of Listener and Cluster resources tell it what RouteConfiguration and ClusterLoadAssignment resources to subscribe to.

At the transport protocol layer, during the interval between when the client subscribes to a resource and when it unsubscribes from that resource, the server must keep the client updated as to the current state of that resource. "Does not exist" is just one possible distinct value for the resource. If the resource is removed on the server, the server must inform the client that the resource does not exist. If the resource is later recreated on the server and the client is still subscribed to it, the server must then inform the client that the resource now has a new value. The server stops sending updates for the resource only once the client unsubscribes from it.

When a Listener resource goes away, that tells the client (at the data model layer) that it no longer needs to subscribe to the RouteConfiguration resource that was mentioned in the now-removed Listener resource. That is a change to (1), not a change to (2). However, because Envoy does not currently draw a distinction between those two things, it uses that same signal for both (1) and (2).

This behavior is not really correct, but it mostly works because of the fact that Envoy does not currently have a real xDS transport protocol layer that provides proper caching. However, as per discussion in #13009, Envoy will soon get such a separate xDS transport protocol layer (@adisuissa is starting to look at this now). When that happens, it will need to implement proper support for resource deletion at the transport protocol layer.

This spec is talking about the xDS transport protocol layer, and it is a spec about how the protocol is intended to work. It is not attempting to describe the data model layer. In other words, the deletion that we're talking about here is the deletion of the resource in that transport protocol-layer cache, not the deletion of the resource at the data model layer.

I believe that the original text of this document was already correct. I don't think we should change the spec based on Envoy's current implementation, which (a) isn't really correct and (b) is going to change in the not-too-distant future.

Copy link
Copy Markdown
Contributor Author

@ramaraochavali ramaraochavali Mar 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with most of what you said

"Does not exist" is just one possible distinct value for the resource. If the resource is removed on the server, the server must inform the client that the resource does not exist.

It is not clear what client should do with it? Basically I want to understand what is the behaviour as per xds spec when a resource is sent by server in removed_resources when there is a reference to it in dependent resource?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the client sees (e.g.) a Listener resource that points to a non-existent RouteConfiguration resource, I think that's fundamentally the same type of problem as a case where the Listener resource contained invalid data in one of its fields. This is a problem to be handled at the data model layer, not at the transport protocol layer, because the transport protocol layer does not know anything about the contents of the resources or the relationships between the resources; it treats them simply as opaque blobs, each of which is completely independent of any other.

In practice, I think there are a couple of basic choices for how to handle problems at the data model layer. The most obvious option is to handle it exactly the same way as if the requested resource never existed in the first place. (It's worth noting that at the transport protocol layer, if the client first starts up and requests a resource that does not exist, the server should report it as being deleted, exactly the same way that it would if the resource was deleted later, after having already been sent to the client.) For example, what would happen if Envoy first started up and was given a Listener resource that pointed to a RouteConfiguration resource that did not exist? I suspect it would fail traffic at L7 in that case, but maybe it could fail to warm the listener or something.

The client does have another option in the case where it had already gotten a copy of the resource from before it was deleted: it can choose to just keep running with the copy it already had, essentially ignoring the deletion. This is probably safer to avoid an outage, but if the client does that, it should probably also provide some signal that it's doing so, so that humans aren't blind to the fact that the client would stop working if it was restarted at that point. (I plan to work on a design for that in the near future as part of #11396.)

In any case, though, this document is a spec for the transport protocol layer; it does not define anything about the data model layer (e.g., it does not say anything about what fields need to be populated with what values in any given resource type), so I don't think it's in scope for this document to define the behavior in cases where there is a config problem at that layer. You may argue (and I think you would be correct to do so :) ) that we should also document the error handling behavior at the data model layer, but I think that would need to be a separate document.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Thanks for the explanation. I think as it stands today this is not just transport protocol spec though - it explains few cases on error handling or how to do certain things like when to send an update etc. But I get your point.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point me at specific things you see in the doc that are not about the transport protocol? I think that shouldn't be the case.

Note that when to send an update is actually a function of the transport protocol layer, not the data model layer.

@markdroth
Copy link
Copy Markdown
Contributor

I think this PR should probably be reverted. It seems that the whole purpose of the PR was to change the doc to say something that is not accurate.

@markdroth
Copy link
Copy Markdown
Contributor

I've created #20568 to revert this.

markdroth added a commit to markdroth/envoy that referenced this pull request Mar 31, 2022
This reverts commit 642a7b6.

Signed-off-by: Mark D. Roth <roth@google.com>
lizan pushed a commit that referenced this pull request Apr 4, 2022
This reverts commit 642a7b6.

Signed-off-by: Mark D. Roth <roth@google.com>
ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request Jun 8, 2022
Resource deletion for non wildcard resource types like eds and rds will only happen on parent resource deletion for both Sotw and Incremental variants. This PR clarifies it in doc.
Risk Level: N/A
Testing: N/A
Docs Changes: N/A
Release Notes:N/A

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request Jun 8, 2022
…voyproxy#20568)

This reverts commit 642a7b6.

Signed-off-by: Mark D. Roth <roth@google.com>
@AmitKatyal1980
Copy link
Copy Markdown

AmitKatyal1980 commented May 27, 2023

@markdroth we are observing that delta RDS resource removal is getting ignored as stated above

image

and finally getting removed after the delay. I am not sure if it is getting removed due to the removal of parent resource.
Could you please share the details how does the delta stream route config is getting removed, if the notification sent from the server is being ignored by the client.

I am verifying the configuration using the envoy admin interface.

@howardjohn
Copy link
Copy Markdown
Contributor

You cannot remove it. Its probably getting removed at the "delay" of listener deletion (which has a drain period, so its not instant), which cleans up the last references

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants