Skip to content

Update endpoints watcher to not fetch pods for removed endpoints#10013

Merged
adleong merged 4 commits intomainfrom
alex/endpoint-slice-scale-down
Jan 3, 2023
Merged

Update endpoints watcher to not fetch pods for removed endpoints#10013
adleong merged 4 commits intomainfrom
alex/endpoint-slice-scale-down

Conversation

@adleong
Copy link
Member

@adleong adleong commented Dec 16, 2022

Fixes #10003

When endpoints are removed from an EndpointSlice resource, the destination controller builds a list of addresses to remove. However, if any of the removed endpoints have a Pod as their targetRef, we will attempt to fetch that pod to build the address to remove. If that pod has already been removed from the informer cache, this will fail and the endpoint will be skipped in the list of endpoints to be removed. This results in stale endpoints being stuck in the address set and never being removed.

We update the endpoint watcher to construct only a list of endpoint IDs for endpoints to remove, rather than fetching the entire pod object. Since we no longer attempt to fetch the pod, this operation is now infallible and endpoints will no longer be skipped during removal.

We also add a TestEndpointSliceScaleDown test to exercise this.

Signed-off-by: Alex Leong <alex@buoyant.io>
Signed-off-by: Alex Leong <alex@buoyant.io>
@adleong adleong requested a review from a team as a code owner December 16, 2022 20:26
Copy link
Member

@alpeb alpeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great fix! 👍

}
}

// Test that when an EndpointSlice is scaled down, the EndpointsWatcher sends
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: godoc should start with the function name

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we typically use godoc for tests, do we? (note the // rather than ///)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok never mind, I thought one could use go doc to show docs from test functions, but that doesn't appear to be the case. OTOH, I'm not aware of the usage of ///, can you clarify?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, /// isn't a go thing. I got confused; holiday brain.

Copy link
Contributor

@kleimkuhler kleimkuhler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

Signed-off-by: Alex Leong <alex@buoyant.io>
@jeremychase
Copy link
Contributor

@adleong Nice fix!

It took me some time to digest why endpointSliceToIDs would resolve this issue over endpointSliceToAddresses, and without your PR description, I don't know that I ever would have 😆

In particular, I found it unclear why one method would be leaky while the other isn't. Will you please update the comment(s) to avoid future confusion?

@adleong adleong merged commit 768e04d into main Jan 3, 2023
@adleong adleong deleted the alex/endpoint-slice-scale-down branch January 3, 2023 18:04
hawkw pushed a commit that referenced this pull request Feb 3, 2023
)

Fixes #10003

When endpoints are removed from an EndpointSlice resource, the destination controller builds a list of addresses to remove.  However, if any of the removed endpoints have a Pod as their targetRef, we will attempt to fetch that pod to build the address to remove.  If that pod has already been removed from the informer cache, this will fail and the endpoint will be skipped in the list of endpoints to be removed.  This results in stale endpoints being stuck in the address set and never being removed.

We update the endpoint watcher to construct only a list of endpoint IDs for endpoints to remove, rather than fetching the entire pod object.  Since we no longer attempt to fetch the pod, this operation is now infallible and endpoints will no longer be skipped during removal.

We also add a `TestEndpointSliceScaleDown` test to exercise this.

Signed-off-by: Alex Leong <alex@buoyant.io>
hawkw pushed a commit that referenced this pull request Feb 6, 2023
)

Fixes #10003

When endpoints are removed from an EndpointSlice resource, the destination controller builds a list of addresses to remove.  However, if any of the removed endpoints have a Pod as their targetRef, we will attempt to fetch that pod to build the address to remove.  If that pod has already been removed from the informer cache, this will fail and the endpoint will be skipped in the list of endpoints to be removed.  This results in stale endpoints being stuck in the address set and never being removed.

We update the endpoint watcher to construct only a list of endpoint IDs for endpoints to remove, rather than fetching the entire pod object.  Since we no longer attempt to fetch the pod, this operation is now infallible and endpoints will no longer be skipped during removal.

We also add a `TestEndpointSliceScaleDown` test to exercise this.

Signed-off-by: Alex Leong <alex@buoyant.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Destination routing to incorrect Pod IPs

4 participants