KEP-4814: graduate DRA Partitionable Devices to beta #5767
k8s-ci-robot merged 4 commits into kubernetes:master
Conversation
> in the cluster and the troubleshooting steps provided through the link above
> should be sufficient to determine the cause.
>
> ###### How does this feature react if the API server and/or etcd is unavailable?
Please make sure to answer these for beta.
The answer here is the same as for the main DRA feature, so I've added links here rather than duplicate the information.
> are missing a bunch of machinery and tooling and can't do that now.
> -->
>
> Will be considered for beta.
> This will be done manually before transition to beta by bringing up a KinD cluster with kubeadm
Based on https://github.com/kubernetes/enhancements/pull/5716/changes#r2609510062, it seems like automated tests here should be possible. Will try to follow the patterns set for DeviceTaintRule here.
@mortent please update the PR if you are now planning automated tests.
> Will be considered for beta.
> See https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters#what-are-other-known-failure-modes.
>
> ###### What steps should be taken if SLOs are not being met to determine the problem?
I see that DRA did not have an answer for this, but I think we should probably answer it.
Good point. I did dig a little deeper into the SLOs for DRA in general, so I've updated both this section and the SLO section further up in the document.
|
Based on kubernetes/kubernetes#133362 and offline discussions, I think there might be some open questions around how the Partitionable Devices feature will work with Autoscaling. It would be great to get some input from @jackfrancis or @towca |
|
/wg device-management |
|
/retitle: KEP-4814: graduate Partitionable Devices to beta |
|
/retitle KEP-4814: graduate DRA Partitionable Devices to beta |
Thanks for tagging me! This is a pretty complex topic for Cluster Autoscaler, not necessarily tied to this specific KEP. Sorry for the length here... Let me try to sum up the context first:
IMO adapting the utilization logic to handle arbitrary DRA Devices will necessarily mean adding the ability to configure different behaviors for different types of DRA Devices. It seems that we need the ability to configure at least:
Similarly for the cluster-wide resource limits logic - adapting it to DRA will IMO mean adding the ability to configure:
For both changes above, we essentially need the ability to configure some simple metadata for DRA Devices and ResourcePools, specifically for the purpose of Node autoscaling. I see 2 distinct ways we can do that:
IMO from an API design standpoint Option 2 seems much better here - we're not bloating the DRA API with autoscaling-specific information that will only be used in some clusters. The big advantage of Option 1, however, is that these config options are easy to determine for already-created Devices and ResourcePools, by the same component that publishes ResourceSlices. If we want to go with Option 2, we need to be able to configure these options ahead of time, targeting a whole "class" of Devices or ResourcePools. For example we could want to express "Nvidia GPU X ResourcePools are more important for utilization than Nvidia GPU Y ResourcePools" - so we'd have to be able to somehow select "Nvidia GPU X/Y ResourcePools" from the autoscaling API. And similarly we'd want to express "Nvidia GPU X Device consumes

The problems described above intersect with this KEP because it adds the CounterSets, which makes the full picture a bit more complex. But I don't think the KEP changes the fundamental problems we need to solve meaningfully. If I'm right here, we just need to make sure we're not fundamentally limiting the DRA API in a way that would prevent us from solving these problems in the future without breaking backwards compatibility. But I don't think this is true either, right? @mortent

With the full context in mind - do you agree that this KEP isn't making the CA problems meaningfully more difficult, and that we're not restricting ourselves from solving them in the future with this KEP? Happy to schedule a meeting if something's not clear in my ramblings above, I know it's a ton of context. @jackfrancis @mtrqq @MenD32 Could you validate that my analysis above makes sense? |
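To make Option 2 a bit more concrete, here's a minimal Go sketch of what an autoscaling-specific config object that selects a "class" of devices by published attributes might look like. Every name here (`DeviceAutoscalingPolicy`, `UtilizationWeight`, the attribute keys) is hypothetical - nothing like this exists in the DRA or CA APIs today.

```go
package main

import "fmt"

// DeviceAutoscalingPolicy is a hypothetical, autoscaling-specific config
// object (e.g. a CRD) attaching scale-down metadata to a whole class of
// DRA devices selected by attribute values, rather than putting that
// metadata into the DRA API itself (Option 1).
type DeviceAutoscalingPolicy struct {
	Name string
	// MatchAttributes selects devices by published attribute values,
	// e.g. {"model": "X"}.
	MatchAttributes map[string]string
	// UtilizationWeight says how strongly an allocated device of this
	// class should count against scale-down (0 = ignore).
	UtilizationWeight float64
}

// Device is a simplified stand-in for a DRA device with its attributes.
type Device struct {
	Name       string
	Attributes map[string]string
}

// policyFor returns the first policy whose attributes all match the device.
func policyFor(d Device, policies []DeviceAutoscalingPolicy) *DeviceAutoscalingPolicy {
	for i, p := range policies {
		matched := true
		for k, v := range p.MatchAttributes {
			if d.Attributes[k] != v {
				matched = false
				break
			}
		}
		if matched {
			return &policies[i]
		}
	}
	return nil
}

func main() {
	policies := []DeviceAutoscalingPolicy{
		{Name: "gpu-x", MatchAttributes: map[string]string{"model": "X"}, UtilizationWeight: 1.0},
		{Name: "gpu-y", MatchAttributes: map[string]string{"model": "Y"}, UtilizationWeight: 0.5},
	}
	d := Device{Name: "gpu0", Attributes: map[string]string{"model": "Y"}}
	if p := policyFor(d, policies); p != nil {
		fmt.Println(p.Name, p.UtilizationWeight) // gpu-y 0.5
	}
}
```

The point of the sketch is only that such a selector can be evaluated against already-published ResourceSlice attributes, so the config can be created ahead of time for a class of devices.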
|
@towca @mortent could we get some representative examples of ResourcePools that express ambiguity in terms of their overall utilization status? The way the problem is expressed it sounds like we don't have a way to inspect the sum of Pending Pod device requests and answer the basic question "do we have room on existing nodes for these devices or do we need to build new infra?" I'm not 100% up on this API but I'm surprised that you can't introspect CapacityPool membership across all pools and deduce the answers. Some examples would help. The cluster-wide problem is more complicated, in particular: "cluster-wide" defines a sum that is calculated by adding up from a common unit. Because DRA enables infinitely flexible device classifications, there doesn't appear to be a real-world value to carrying this forward. The options would be:
We may wish to worry about the first one as a useful boundary for folks to use. Do we care about the 2nd one? Is that a real problem users will care about from an autoscaler perspective? To be more concrete, how do we calculate cluster-wide limits on a cluster that has both |
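One way to read the "common unit" problem above: since heterogeneous DRA devices have no shared unit, a cluster-wide limit could instead be tracked per device class. The sketch below is illustrative only - the `Device` struct and class strings are not the real DRA API, and real CA limit handling would key off DeviceClass or driver names.

```go
package main

import "fmt"

// Device is a simplified stand-in for an allocated DRA device, tagged with
// the class it belongs to (hypothetical class names below).
type Device struct {
	Name  string
	Class string // e.g. "gpu.nvidia.com", "nic.example.com"
}

// countByClass aggregates devices into per-class totals - the unit a
// per-class cluster limit could be checked against, sidestepping the need
// for one cluster-wide sum across heterogeneous devices.
func countByClass(devices []Device) map[string]int {
	counts := map[string]int{}
	for _, d := range devices {
		counts[d.Class]++
	}
	return counts
}

func main() {
	devices := []Device{
		{"gpu0", "gpu.nvidia.com"}, {"gpu1", "gpu.nvidia.com"},
		{"nic0", "nic.example.com"},
	}
	counts := countByClass(devices)
	fmt.Println(counts["gpu.nvidia.com"], counts["nic.example.com"]) // 2 1
}
```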
|
@towca great analysis of the current status of CA regarding those issues! I'd like to highlight 2 of what I think are the biggest difficulties regarding DRA Partitionable Devices and autoscaling:
Addendum: when I say physical device I'm referring to it from a cost perspective, i.e. a device that is part of the physical rack that runs the node (GPU, network card, etc.) and that a user is probably paying for. A virtual device would be an interface that allows incomplete use of the available capacity of the physical device (like time slices, partitions, etc.). |
|
Thanks @towca for the great analysis! I agree that this KEP isn't adding to the problems we already have in the autoscaler. A couple points from me:
About the ideas for adding metadata I think it's reasonable to introduce separate CRD(s) for these purposes so that cluster admins may influence how autoscaler performs such calculations. One thing worth delegating to driver maintainers is utilization calculation, potentially with a fallback to reasonable default behavior. |
CounterSets often correspond to physical devices. Similar issues of how we articulate capacity to the end user for partitionable devices came up in #5677, which @jackfrancis has tagged for SIG Autoscaling to take a look at. CounterSets have a name, which likely will be things like "gpu0", "gpu1", etc. So, would treating them as physical devices work for autoscaling? I think those may not be as confusing for users, at least in these common cases, as we are worried about. Sure, I expect people may find other uses for CounterSets that don't correspond to physical devices, but I don't think that will be the common case. |
The way that autoscaler deals with this in kubernetes/autoscaler#8559 is by trying to count unique devices via counterSets. One device can consume from multiple counterSets, so counterSets are not a 1:1 map to physical devices; the logic looks for resourceSlices with devices that consume multiple counterSets and groups those counterSets as parts of the same device. This is not a foolproof solution because it is not guaranteed that there would be a device that consumes all the counters of a device. Following the discussion in kubernetes/kubernetes#133362, IMO a possible solution could be to create a convention for DRA plugins for device attributes. This would help in both resource limits and utilization calculation, since it'd be a standard way to count physical devices and to define precise cluster limits, and it could also be applicable to ResourceQuota with DRA. |
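A minimal sketch of the grouping heuristic described above: treat two counterSets as belonging to the same physical device whenever some device consumes counters from both, then count the resulting groups. The types are illustrative, not the real DRA or autoscaler API; the real logic in kubernetes/autoscaler#8559 works on ResourceSlices.

```go
package main

import "fmt"

// Device is a simplified stand-in for a DRA device, listing the counter
// sets it consumes from (hypothetical names like "gpu0").
type Device struct {
	Name        string
	CounterSets []string
}

// find is a standard union-find lookup with path compression.
func find(parent map[string]string, x string) string {
	for parent[x] != x {
		parent[x] = parent[parent[x]]
		x = parent[x]
	}
	return x
}

// physicalDeviceCount estimates the number of physical devices by merging
// counter sets that are linked through a shared consumer, then counting
// the remaining groups. As noted above, this undercounts links when no
// single device spans the sets of one physical device.
func physicalDeviceCount(sets []string, devices []Device) int {
	parent := map[string]string{}
	for _, s := range sets {
		parent[s] = s
	}
	for _, d := range devices {
		if len(d.CounterSets) < 2 {
			continue
		}
		for _, s := range d.CounterSets[1:] {
			ra, rb := find(parent, d.CounterSets[0]), find(parent, s)
			if ra != rb {
				parent[rb] = ra
			}
		}
	}
	roots := map[string]bool{}
	for _, s := range sets {
		roots[find(parent, s)] = true
	}
	return len(roots)
}

func main() {
	sets := []string{"gpu0", "gpu1", "gpu2"}
	devices := []Device{
		{Name: "mig-a", CounterSets: []string{"gpu0"}},
		{Name: "pair", CounterSets: []string{"gpu0", "gpu1"}}, // spans two sets
		{Name: "mig-b", CounterSets: []string{"gpu2"}},
	}
	fmt.Println(physicalDeviceCount(sets, devices)) // 2
}
```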
|
@jackfrancis For your question about ResourcePools that express ambiguity in terms of their overall utilization status, I think the fundamental challenge that comes with Partitionable Devices is that it will no longer be possible to determine available capacity by counting the number of devices in a resource pool. Since the devices share the underlying hardware, allocating a single device can make several of the remaining devices in the pool unavailable. Also, the devices will be different partitions of the underlying hardware, so they are not homogeneous. From what I can tell about the proposal in kubernetes/autoscaler#8559, it seems like it would handle what we think are the most common scenarios where there is a 1:1 mapping between a physical device and a CounterSet. |
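A small sketch of why device counting breaks down with overlapping partitions: the partitions below draw from one shared counter, so allocating one of them makes a "free" sibling unallocatable. The types and numbers are illustrative, not the real DRA counter model.

```go
package main

import "fmt"

// Partition is a simplified stand-in for one advertised partition of a
// physical device, consuming some amount of a single shared counter
// (think GPU memory units).
type Partition struct {
	Name   string
	Memory int
}

// allocatable reports how many of the listed partitions could still be
// individually allocated given the remaining shared counter budget.
func allocatable(parts []Partition, remaining int) int {
	n := 0
	for _, p := range parts {
		if p.Memory <= remaining {
			n++
		}
	}
	return n
}

func main() {
	// One GPU with 40 memory units exposed as overlapping partitions.
	all := []Partition{{"full", 40}, {"half-a", 20}, {"half-b", 20}, {"quarter", 10}}
	fmt.Println(allocatable(all, 40)) // 4: everything fits before any allocation
	// After allocating "half-a" (20 units), only 20 remain: "full" no
	// longer fits even though 3 of the 4 devices are still unallocated.
	rest := []Partition{{"full", 40}, {"half-b", 20}, {"quarter", 10}}
	fmt.Println(allocatable(rest, 20)) // 2
}
```

So a pool that "looks" three-quarters free by device count may actually be unable to satisfy its largest advertised device.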
|
So I agree with @towca that this KEP doesn't change the fundamental challenges that already exist for adapting CA to DRA. It seems like there are a few areas discussed here:
So to summarize, the impact of this KEP is mostly on computing utilization, but we are making progress on it. There is still work to do in order to support this in the general case for DRA. I agree that handling this with CA-specific configuration of DRA using CRDs seems like a good way to handle it. |
If there is a convention or even a set of fields we can add to the API to make this better for CA, please feel free to suggest them. |
pohly
left a comment
Some nits, but overall this looks good to me.
The discussion around autoscaling also seems to have settled down.
I'd like to highlight one thing here - right now CA doesn't really support multi-host DRA autoscaling at all:
Adding support for Node-local Devices to CA was far from easy, but it didn't require changes to how CA behaves at a fundamental level. Adding support for multi-host Devices (or for attaching new Devices to existing Nodes) will require fundamental changes to how CA models its simulations. I imagine this would be a dedicated effort with a comprehensive design (KEP/AEP) of its own. So IMO discussing it now is out of scope, and we should focus on the Node-local case that CA supports right now.
Similarly here - I think it's hard to determine the fields/conventions without a dedicated, autoscaling-specific design effort. As mentioned in my previous comment, it's e.g. not clear to me if these fields would be better placed in the DRA API, or a new dedicated autoscaling-specific CRD. We have this effort planned for the near future (kubernetes/autoscaler#7781, kubernetes/autoscaler#8184), but not in time for K8s 1.36. With the 2 points above in mind, IMO we should focus the Node autoscaling discussion here on 2 aspects:
For 1., as mentioned before I don't personally think this KEP is restricting us from the possible future changes. There are also other voices with a similar opinion in this thread IIUC. 2. is a bit more tricky, and I'd like to clarify a few things there. Apart from one point, IMO this part isn't really blocking for the KEP approval because it's largely orthogonal. The PR is still being discussed, but here's my view of how the utilization logic should work in CA before we have the dedicated effort for it (copied from the PR review):
IIUC @mortent validated on the PR review that these seem like reasonable assumptions for CA to make in the common case. It'd be great if we could get more eyes on this to double-check. I also have some follow-up questions:
|
I think the assumptions you have listed are reasonable. The main caveat I can think of is that users might choose to do a static partition of their devices, meaning that the devices listed in the ResourceSlice no longer map 1:1 to physical devices. And in this situation there isn't any need to use the Partitionable Devices feature, since the partitioning is done such that the devices don't use overlapping hardware. So similar to the way MIGs can be supported with Device Plugin today. But like I said, this is separate from the Partitionable Devices feature.
The requirements for the node selection fields (
There is nothing that prevents a driver from publishing a ResourcePool where
For node-local devices, I'm not aware of any use-cases where a single device would consume counters from multiple CounterSets. So I think it is safe to assume only one for the base common-case support we are planning for the first phase. |
You mean that we could have a ResourcePool with statically partitioned GPUs, so that SharedCounters are not used at all, but the Devices within the pool are not homogeneous? Yeah this seems fully orthogonal and would fall under the list of use-cases not supported by CA in the initial logic.
Just to double-check here - so in the examples from the KEP (e.g. the ResourceSlice with
Yeah, I understand that part - these would also fall under the list of use-cases not supported by CA. We do probably need more explicit error handling for this in CA, so that a single ResourcePool breaking this assumption doesn't affect others (there's kubernetes/autoscaler#7784 planned to tackle this and other similar problems). But as you're saying - this is not introduced in this KEP.
Ack, thanks for the explanation! Seems best to keep the shared path with summing up both kinds in CA logic then.
That makes sense, thanks a lot for the confirmation! cc @MenD32 I don't think I have any more questions or concerns. @mtrqq @MenD32 @jackfrancis Do you? Also tagging the Karpenter folks before formal approval - @jonathan-innis @njtran Have you had a chance to review this KEP update? Any concerns from the Karpenter end? |
|
/approve for sig-scheduling There is nothing new from the scheduler perspective and the scheduling challenges have been identified, so LGTM. We're currently expanding gang scheduling to support topologies (see #5733), so the scheduler should be able to consider various scheduling options (devices) and pick the one that allows binding of all pods in the gang. |
I would expect For resource pools containing devices that are not node-local, they will almost certainly use a different node selection than |
|
Looks good to me |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dom4ha, johnbelamaric, mortent

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
One-line PR description: Update KEP to prepare for beta in 1.36
Issue link: DRA: Partitionable Devices #4815
Other comments: Some of the material is already covered in KEP-4381: DRA Structured Parameters, so that KEP is referenced in some places.