Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-3015: PreferSameZone/PreferSameNode traffic distribution #5140

Merged
merged 1 commit into from
Feb 11, 2025

Conversation

danwinship
Copy link
Contributor

@danwinship danwinship commented Feb 6, 2025

  • One-line PR description: Adds PreferSameZone and PreferSameNode as values for Service's TrafficDistribution field, deprecates PreferClose.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 6, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 6, 2025
@danwinship danwinship self-assigned this Feb 6, 2025
@shaneutt shaneutt requested review from aojea and MikeZappa87 February 6, 2025 15:14
#### DNS

As a cluster administrator, I plan to run a DNS pod on each node, and
would like DNS requests from other pods to always go to the local DNS

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above you state “preferentially” and here you state always. Probably just wording, does this flag guarantee traffic will always be on the same node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some discussion in #4931 about exactly what "Prefer" implies... if it means "definitely unless there are no preferred endpoints" or if it means "probably unless there's a good reason not to". It is not totally clear. At any rate, the "Prefer" here means the same thing as the "Prefer" in "PreferClose" does.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the wiggle room is justified. We've made the mistake of being to prescriptive without a good corpus of implementations in the past.

@thockin thockin self-assigned this Feb 6, 2025
Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. Tell me if I misunderstood the fallback comments

`PreferSameNode`, indicating traffic for a service should
preferentially be routed to endpoints on the same node as the client.

(This is the third attempt at this feature, which was previously
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Demonstrating both our willingness to admit we were wrong, and our relentless pursuit of excellence! :)

#### DNS

As a cluster administrator, I plan to run a DNS pod on each node, and
would like DNS requests from other pods to always go to the local DNS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the wiggle room is justified. We've made the mistake of being to prescriptive without a good corpus of implementations in the past.

keps/sig-network/3015-prefer-same-node/README.md Outdated Show resolved Hide resolved
keps/sig-network/3015-prefer-same-node/README.md Outdated Show resolved Hide resolved
@thockin
Copy link
Member

thockin commented Feb 7, 2025

Maybe I am losing my mind, but why do you need three releases?

You might have nodes which are back rev by three, three but in this specific case, those fall back on the zone hints. I am racking my brain trying to figure out a failure mode that would require three releases but I'm not seeing it


By checking if any Service has `TrafficDistribution: PreferSameNode`.

###### How can someone using this feature know that it is working for their instance?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the feature is working you'll see a new field in the EndpointSlices ForNodes populated, in addition to that they need to do the manual inspection you indicate to validate that the traffic directed to those endpoints only goes to those nodes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That only tells you that half of the feature is working. Notably, that would happen even if you still had an old kube-proxy, but the old kube-proxy would ignore the ForNodes hint.

@aojea
Copy link
Member

aojea commented Feb 7, 2025

this LGTM modulo some minor comments
@thockin , are we opting for this in 1.33

### Goals

- Allow configuring a service so that connections will be delivered to
a local endpoint when possible, and a remote endpoint if not.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a possibility of allowing PreferSameNode to fall back to PreferSameZone, and PreferSameZone to fall back to any?

I can see that as useful in my environment, but it also increases the complexity of how a user needs to define what they want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is exactly what is implied

@thockin
Copy link
Member

thockin commented Feb 7, 2025

It's up to Dan whether he wants to go for 33, but we have to decide soon, and we have to decide between this and the PreferClose. They all have a clock on them, so getting it to GA starts the clock

@danwinship
Copy link
Contributor Author

Maybe I am losing my mind, but why do you need three releases?

I dunno. I thought that was just a rule.

OK. So:

  1. I assume PreferSameNode needs to start as Alpha, in which case, it needs its own feature gate, since ServiceTrafficDistribution is already Beta. Right?
  2. I assume PreferSameZone doesn't need to start as Alpha, since we were ready to declare PreferClose GA, and this is just a rename of it. Also, if we know we want to deprecate PreferClose, then we should make its replacement fully-available sooner rather than later, right?

So I feel like the right approach is:

  • 1.33: Update KEP-4444 to include PreferSameZone and deprecate PreferClose, but leave it as Beta. Implement this KEP with PreferSameNode as Alpha behind its own feature gate.
  • 1.34: PreferSameNode goes to Beta. TrafficDistribution+PreferSameZone can go to GA, if we want.
  • 1.35: PreferSameNode goes to GA. TrafficDistribution as a whole goes to GA if it didn't in 1.34.

@thockin
Copy link
Member

thockin commented Feb 7, 2025

  1. I assume PreferSameNode needs to start as Alpha, in which case, it needs its own feature gate, since ServiceTrafficDistribution is already Beta. Right?

Correct.

  1. I assume PreferSameZone doesn't need to start as Alpha, since we were ready to declare PreferClose GA, and this is just a rename of it. Also, if we know we want to deprecate PreferClose, then we should make its replacement fully-available sooner rather than later, right?

Unfortunately no. If someone sets "PreferSameZone" and then we rollback to 32, their Service is invalid. It has to start behind an off-by-default gate. I suggested to @gauravkghildiyal that he make that "rename" (alias, really) under your same gate. It's unlikely that "PreferClose" would actually ever go away (risk >> reward), so let's just embrace it. If "PreferClose" goes GA in 33, then you don't have to handle the corner case of PreferClose being gated-off but PreferSameZone being on.

From the POV of the PreferSameZone gate, PreferClose is locked to default.

  • 1.33: Update KEP-4444 to include PreferSameZone and deprecate PreferClose, but leave it as Beta. Implement this KEP with PreferSameNode as Alpha behind its own feature gate.
  • 1.34: PreferSameNode goes to Beta. TrafficDistribution+PreferSameZone can go to GA, if we want.
  • 1.35: PreferSameNode goes to GA. TrafficDistribution as a whole goes to GA if it didn't in 1.34.

I think it has to be:

  • 1.33: GA TrafficDistribution+PreferClose, introduce PreferSameZone and PreferSameNode behind an alpha gate
  • 1.34: PreferSame* goes to beta
  • 1.35+: PreferSame* goes GA
  • 1.36: Remove the TrafficDistribution gate
  • 1.38+: Remove the PreferSame* gate

It's simpler and really, I am just pre-NAKing the PR to remove "PreferClose". I would probably NAK it anyway.

@danwinship
Copy link
Contributor Author

Maybe I am losing my mind, but why do you need three releases?

You might have nodes which are back rev by three, three but in this specific case, those fall back on the zone hints.

Ah, I was thinking about past problems with apiserver/kubelet skew, but kube-controller-manager can only be 1 release older than kube-apiserver, so a single release as Alpha addresses that. OK.

@adrianmoisey
Copy link
Member

In principal: lgtm

I'm actually super keen to get this feature into our clusters!

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 8, 2025
@danwinship danwinship changed the title KEP-3015: PreferSameNode traffic distribution KEP-3015: PreferSameZone/PreferSameNode traffic distribution Feb 8, 2025

// forNodes indicates the node(s) this endpoint should be targeted by.
// +listType=atomic
ForNodes []string `json:"forNodes,omitempty" protobuf:"bytes,2,name=forNodes"`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the ForNodes field just a slice of strings, whereas the ForZones field is defined as a slice of ForZone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ForZone struct was apparently added to support future ideas like having multiple zones with weight values... (https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/2433-topology-aware-hints/README.md#future-api-expansion)

I guess we could do the same for ForNodes... ? I think the idea of weighted hints is based more on the "smart controllers / ambiguous API" model of TopologyAwareHints, while TrafficDistribution has moved more toward "simple controllers / explicit API" so that sort of future API expansion seems unlikely.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in #5140 (comment), even if we are in the "smart controllers / ambiguous API" model, actually, there's no need to include weight information in ForZones hints. It’s enough to determine the weight within the service proxy implementation, so it's unnecessary.

However, looking at this wording, it might be difficult to definitively say that a string slice is the best choice...🤷‍♂️ For example, if fields other than trafficDistribution are added, allowing users to specify the behavior in more detail, is there a possibility that Hints would need to include additional information?

Although it is unclear if either expansion will be necessary, the API is designed in a way to make expansions straightforward.

When updating EndpointSlices, if the EndpointSlice controller sees a
service with `PreferSameNode` traffic distribution, then for each
endpoint in the slice, it will add a `ForNodes` hint including the
name of the endpoint's node. (The field is an array for future
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does 'future extensibility' here mean leaving room to implement safeguards against overload later on (e.g., CPU core-based correction in the Topology Aware Routing)? If the safeguard is intended to distribute the load to other nodes, should we add not only the node where the endpoint resides but also the node names that became candidates as a result of the correction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, "future extensibility" means other use cases besides simple "PreferSameNode".

For example, you could theoretically implement "prefer same rack" by manually filling in multiple ForNodes values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Our assumption has been that if you were going to have some sort of "smart" backoff/fallback behavior, that the behavior would be entirely within the proxy, so nobody would be writing out hints about it. The proxy would just determine that it should avoid Node X and starting using Nodes Y and Z isntead, and then do that.)

@thockin
Copy link
Member

thockin commented Feb 9, 2025

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 9, 2025
Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor comments - other than that lgtm from prr pov.

keps/sig-network/3015-prefer-same-node/README.md Outdated Show resolved Hide resolved
keps/sig-network/3015-prefer-same-node/README.md Outdated Show resolved Hide resolved
keps/sig-network/3015-prefer-same-node/README.md Outdated Show resolved Hide resolved

An initial rollout cannot fail and won't impact already-running
workloads, because at the time of the initial rollout, there cannot
already be any `PreferSameZone` or `PreferSameNode` services.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well - in some catastrophic scenario you may break endpoint-slice controller (fairly unlikely but not technically impossible).
It still won't impact running workloads, but would affect propagating changes to these.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there's an implied "if there are bugs in the code then arbitrary bad things might happen"? Like, we could crash all of kcm too, but it doesn't seem right to say "an initial rollout might cause the DaemonSet controller to fail".

keps/sig-network/3015-prefer-same-node/README.md Outdated Show resolved Hide resolved
@wojtek-t wojtek-t self-assigned this Feb 10, 2025
Comment on lines +196 to +199
* `PreferSameNode`: Indicates a preference for routing traffic to
endpoints that are on the same node as the client. In general, the
proxy should always route to a same-node endpoint if any is
available.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this field interact with the iTP field?

I assume that what is written here will also apply to PreferSameNode: https://kubernetes.io/docs/reference/networking/virtual-ips/#interaction-with-traffic-policies

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. All topology-ish features should behave the same way with respect to other features; the only difference is which endpoints they pick as "topologically available".

Local traffic policy ignores topology, because its own routing concerns render topology irrelevant. But if we added other traffic policy types in the future, they might work differently and we'd have to define that then.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2025
Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A query about ForNodes, and some style nits.


### Goals

- Make `TrafficDistribution` less ambiguous.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit)

Suggested change
- Make `TrafficDistribution` less ambiguous.
- Make `trafficDistribution` less ambiguous.

In the API, we most commonly see this field name in camelCase. So, use that here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but in code we see it capitalized. I have no data but I feel like we tend to use field names both ways regularly in KEPs...


#### DNS

As a cluster administrator, I plan to run a DNS pod on each node, and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a user story?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes... it's under the User Stories heading. (I just didn't number them.)


When updating EndpointSlices, if the EndpointSlice controller sees a
service with `PreferSameNode` traffic distribution, then for each
endpoint in the slice, it will add a `ForNodes` hint including the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
endpoint in the slice, it will add a `ForNodes` hint including the
endpoint in the slice, it computes an internal `ForNodes` hint including the

If this isn't an internal detail of kube-controller-manager, I'm puzzled about why https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/ doesn't mention either hints or ForNodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't mention ForNodes because that field is new in this KEP.

It doesn't mention hints because the "Topology information" subsection of that document apparently never got updated to explain the TopologyAwareHints feature... but it's not any more "internal" than anything else in EndpointSlice.

keps/sig-network/3015-prefer-same-node/README.md Outdated Show resolved Hide resolved
@wojtek-t
Copy link
Member

/lgtm
/approve PRR

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 11, 2025
@k8s-ci-robot k8s-ci-robot merged commit 598575c into kubernetes:master Feb 11, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 11, 2025
@danwinship danwinship deleted the prefer-same-node branch February 11, 2025 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/network Categorizes an issue or PR as relevant to SIG Network. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Pre-Alpha
Development

Successfully merging this pull request may close these issues.

9 participants