# KEP-3015: PreferSameZone/PreferSameNode traffic distribution #5140

## Conversation
> #### DNS
>
> As a cluster administrator, I plan to run a DNS pod on each node, and
> would like DNS requests from other pods to always go to the local DNS
Above you state "preferentially" and here you state "always". Probably just wording, but does this flag guarantee that traffic will always stay on the same node?
There's some discussion in #4931 about exactly what "Prefer" implies... if it means "definitely unless there are no preferred endpoints" or if it means "probably unless there's a good reason not to". It is not totally clear. At any rate, the "Prefer" here means the same thing as the "Prefer" in "PreferClose" does.
I think the wiggle room is justified. We've made the mistake of being too prescriptive without a good corpus of implementations in the past.
LGTM overall. Tell me if I misunderstood the fallback comments.
> `PreferSameNode`, indicating traffic for a service should
> preferentially be routed to endpoints on the same node as the client.
>
> (This is the third attempt at this feature, which was previously
Demonstrating both our willingness to admit we were wrong, and our relentless pursuit of excellence! :)
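For context, a minimal sketch of how a user would opt a Service into this behavior (the Service name, selector, and port are made up for illustration; `TrafficDistribution` is the existing `*string` field on `ServiceSpec`):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

func main() {
	// Hypothetical per-node DNS Service opting into the value this KEP proposes.
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "node-local-dns", Namespace: "kube-system"},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"k8s-app": "node-local-dns"},
			Ports: []corev1.ServicePort{{
				Name:     "dns",
				Port:     53,
				Protocol: corev1.ProtocolUDP,
			}},
			// "PreferSameNode": route to a same-node endpoint when one is
			// available; fall back to other endpoints otherwise.
			TrafficDistribution: ptr.To("PreferSameNode"),
		},
	}
	_ = svc // e.g. pass to a client-go Create call
}
```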
Maybe I am losing my mind, but why do you need three releases? You might have nodes which are back-rev by three, but in this specific case those fall back on the zone hints. I am racking my brain trying to figure out a failure mode that would require three releases, but I'm not seeing it.
> By checking if any Service has `TrafficDistribution: PreferSameNode`.
>
> ###### How can someone using this feature know that it is working for their instance?
If the feature is working you'll see the new `ForNodes` field in the EndpointSlices populated; in addition to that, they need to do the manual inspection you indicate to validate that the traffic directed to those endpoints only goes to those nodes.
That only tells you that half of the feature is working. Notably, that would happen even if you still had an old kube-proxy, but the old kube-proxy would ignore the `ForNodes` hint.
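A rough sketch of the inspection half in Go, assuming a configured client-go clientset; note that `ForNodes` here is the draft field this KEP proposes, not a released API:

```go
import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// printForNodesHints lists a Service's EndpointSlices and prints any
// forNodes hints. This only demonstrates the controller half of the
// feature; an old kube-proxy would still ignore the hint.
func printForNodesHints(ctx context.Context, client kubernetes.Interface, ns, svcName string) error {
	slices, err := client.DiscoveryV1().EndpointSlices(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "kubernetes.io/service-name=" + svcName,
	})
	if err != nil {
		return err
	}
	for _, slice := range slices.Items {
		for _, ep := range slice.Endpoints {
			if ep.Hints != nil {
				// ForNodes is the draft field proposed in this KEP.
				fmt.Printf("%v -> forNodes: %v\n", ep.Addresses, ep.Hints.ForNodes)
			}
		}
	}
	return nil
}
```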
This LGTM modulo some minor comments.
> ### Goals
>
> - Allow configuring a service so that connections will be delivered to
>   a local endpoint when possible, and a remote endpoint if not.
Is there a possibility of allowing `PreferSameNode` to fall back to `PreferSameZone`, and `PreferSameZone` to fall back to any? I can see that as useful in my environment, but it also increases the complexity of how a user needs to define what they want.
I think that is exactly what is implied.
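In other words, the implied cascade would look something like this hypothetical selection helper (a sketch, not actual kube-proxy code):

```go
// pickEndpoints sketches the implied fallback order: same-node
// endpoints win, then same-zone endpoints, then the full set.
func pickEndpoints(all, sameNode, sameZone []string) []string {
	if len(sameNode) > 0 {
		return sameNode
	}
	if len(sameZone) > 0 {
		return sameZone
	}
	return all
}
```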
It's up to Dan whether he wants to go for 33, but we have to decide soon, and we have to decide between this and `PreferClose`. They all have a clock on them, so getting it to GA starts the clock.
I dunno. I thought that was just a rule. OK. So:
So I feel like the right approach is:
Correct.
Unfortunately no. If someone sets "PreferSameZone" and then we rollback to 32, their Service is invalid. It has to start behind an off-by-default gate. I suggested to @gauravkghildiyal that he make that "rename" (alias, really) under your same gate. It's unlikely that "PreferClose" would actually ever go away (risk >> reward), so let's just embrace it. If "PreferClose" goes GA in 33, then you don't have to handle the corner case of PreferClose being gated-off but PreferSameZone being on. From the POV of the PreferSameZone gate, PreferClose is locked to default.
I think it has to be:
It's simpler and really, I am just pre-NAKing the PR to remove "PreferClose". I would probably NAK it anyway.
Ah, I was thinking about past problems with apiserver/kubelet skew, but kube-controller-manager can only be 1 release older than kube-apiserver, so a single release as Alpha addresses that. OK.
In principle: lgtm. I'm actually super keen to get this feature into our clusters!
```go
// forNodes indicates the node(s) this endpoint should be targeted by.
// +listType=atomic
ForNodes []string `json:"forNodes,omitempty" protobuf:"bytes,2,name=forNodes"`
```
Is the `ForNodes` field just a slice of strings, whereas the `ForZones` field is defined as a slice of `ForZone`?
The `ForZone` struct was apparently added to support future ideas like having multiple zones with weight values... (https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/2433-topology-aware-hints/README.md#future-api-expansion)

I guess we could do the same for `ForNodes`...? I think the idea of weighted hints is based more on the "smart controllers / ambiguous API" model of TopologyAwareHints, while TrafficDistribution has moved more toward "simple controllers / explicit API", so that sort of future API expansion seems unlikely.
As mentioned in #5140 (comment), even if we are in the "smart controllers / ambiguous API" model, there's actually no need to include weight information in `ForZones` hints; it's enough to determine the weight within the service proxy implementation.

However, looking at this wording, it might be difficult to definitively say that a string slice is the best choice... 🤷‍♂️ For example, if fields other than `trafficDistribution` are added, allowing users to specify the behavior in more detail, is there a possibility that the hints would need to include additional information?

> Although it is unclear if either expansion will be necessary, the API is designed in a way to make expansions straightforward.
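For comparison, the two hint shapes under discussion, paraphrasing the existing discovery/v1 types alongside this KEP's draft field:

```go
// Existing discovery/v1 shape: a struct wrapper that left room for
// per-zone fields (such as weights) that were never added.
type ForZone struct {
	Name string `json:"name"`
}

// Hints as they would look under this KEP's draft: forZones keeps its
// struct slice, while forNodes is proposed as a bare string slice.
type EndpointHints struct {
	ForZones []ForZone `json:"forZones,omitempty"`
	ForNodes []string  `json:"forNodes,omitempty"`
}
```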
> When updating EndpointSlices, if the EndpointSlice controller sees a
> service with `PreferSameNode` traffic distribution, then for each
> endpoint in the slice, it will add a `ForNodes` hint including the
> name of the endpoint's node. (The field is an array for future
Does 'future extensibility' here mean leaving room to implement safeguards against overload later on (e.g., CPU core-based correction as in Topology Aware Routing)? If the safeguard is intended to distribute the load to other nodes, should we add not only the node where the endpoint resides but also the node names that became candidates as a result of the correction?
No, "future extensibility" means other use cases besides simple "PreferSameNode".
For example, you could theoretically implement "prefer same rack" by manually filling in multiple ForNodes
values.
(Our assumption has been that if you were going to have some sort of "smart" backoff/fallback behavior, the behavior would be entirely within the proxy, so nobody would be writing out hints about it. The proxy would just determine that it should avoid Node X and start using Nodes Y and Z instead, and then do that.)
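Stepping back, a minimal sketch of the controller behavior quoted at the top of this thread, under the draft API (not the real EndpointSlice controller code; a hypothetical "prefer same rack" controller would fill in multiple node names instead):

```go
import (
	corev1 "k8s.io/api/core/v1"
	discoveryv1 "k8s.io/api/discovery/v1"
)

// applySameNodeHints mirrors the quoted behavior: for a Service with
// PreferSameNode traffic distribution, each endpoint gets a hint
// naming its own node. ForNodes here is the draft []string field.
func applySameNodeHints(svc *corev1.Service, slice *discoveryv1.EndpointSlice) {
	if svc.Spec.TrafficDistribution == nil || *svc.Spec.TrafficDistribution != "PreferSameNode" {
		return
	}
	for i := range slice.Endpoints {
		ep := &slice.Endpoints[i]
		if ep.NodeName == nil {
			continue
		}
		ep.Hints = &discoveryv1.EndpointHints{ForNodes: []string{*ep.NodeName}}
	}
}
```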
Thanks! /lgtm
A few minor comments; other than that, lgtm from the PRR POV.
> An initial rollout cannot fail and won't impact already-running
> workloads, because at the time of the initial rollout, there cannot
> already be any `PreferSameZone` or `PreferSameNode` services.
Well, in some catastrophic scenario you may break the endpoint-slice controller (fairly unlikely but not technically impossible). It still won't impact running workloads, but it would affect propagating changes to them.
I feel like there's an implied "if there are bugs in the code then arbitrary bad things might happen"? Like, we could crash all of kcm too, but it doesn't seem right to say "an initial rollout might cause the DaemonSet controller to fail".
> * `PreferSameNode`: Indicates a preference for routing traffic to
>   endpoints that are on the same node as the client. In general, the
>   proxy should always route to a same-node endpoint if any is
>   available.
How does this field interact with the iTP field? I assume that what is written here will also apply to `PreferSameNode`: https://kubernetes.io/docs/reference/networking/virtual-ips/#interaction-with-traffic-policies
Yes. All topology-ish features should behave the same way with respect to other features; the only difference is which endpoints they pick as "topologically available". `Local` traffic policy ignores topology, because its own routing concerns render topology irrelevant. But if we added other traffic policy types in the future, they might work differently and we'd have to define that then.
A query about `ForNodes`, and some style nits.
> ### Goals
>
> - Make `TrafficDistribution` less ambiguous.
(nit) Suggested change: "Make `TrafficDistribution` less ambiguous." → "Make `trafficDistribution` less ambiguous."

In the API, we most commonly see this field name in camelCase. So, use that here.
But in code we see it capitalized. I have no data, but I feel like we tend to use field names both ways regularly in KEPs...
> #### DNS
>
> As a cluster administrator, I plan to run a DNS pod on each node, and
Isn't this a user story?
Yes... it's under the User Stories heading. (I just didn't number them.)
> When updating EndpointSlices, if the EndpointSlice controller sees a
> service with `PreferSameNode` traffic distribution, then for each
> endpoint in the slice, it will add a `ForNodes` hint including the
Suggested change: "it will add a `ForNodes` hint including the" → "it computes an internal `ForNodes` hint including the"

If this isn't an internal detail of kube-controller-manager, I'm puzzled about why https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/ doesn't mention either hints or ForNodes.
It doesn't mention `ForNodes` because that field is new in this KEP. It doesn't mention hints because the "Topology information" subsection of that document apparently never got updated to explain the `TopologyAwareHints` feature... but it's not any more "internal" than anything else in EndpointSlice.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: danwinship, thockin, wojtek-t. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Adds `PreferSameZone` and `PreferSameNode` as values for Service's `TrafficDistribution` field, deprecates `PreferClose`.