
KEP-4814: graduate DRA Partitionable Devices to beta#5767

Merged
k8s-ci-robot merged 4 commits into kubernetes:master from mortent:Update4815ForBeta
Jan 28, 2026

Conversation

@mortent
Member

@mortent mortent commented Jan 4, 2026

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jan 4, 2026
@k8s-ci-robot k8s-ci-robot requested a review from macsko January 4, 2026 20:48
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 4, 2026
in the cluster and the troubleshooting steps provided through the link above
should be sufficient to determine the cause.

###### How does this feature react if the API server and/or etcd is unavailable?
Contributor


Please make sure to answer these for beta.

Member Author


The answer here is the same as for the main DRA feature, so I've added links here rather than duplicate the information.

are missing a bunch of machinery and tooling and can't do that now.
-->
Will be considered for beta.
This will be done manually before transition to beta by bringing up a KinD cluster with kubeadm
Member Author


Based on https://github.com/kubernetes/enhancements/pull/5716/changes#r2609510062, it seems like automated tests here should be possible. Will try to follow the patterns set for DeviceTaintRule here.

Member


@mortent please update the PR if you are now planning automated tests.

Will be considered for beta.
See https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters#what-are-other-known-failure-modes.

###### What steps should be taken if SLOs are not being met to determine the problem?
Contributor


I see that DRA did not have an answer for this but I think we should probably answer this.

Member Author


Good point. I did dig a little deeper into the SLOs for DRA in general, so I've updated both this section and the SLO section further up in the document.

@mortent
Member Author

mortent commented Jan 5, 2026

Based on kubernetes/kubernetes#133362 and offline discussions, I think there might be some open questions around how the Partitionable Devices feature will work with Autoscaling. It would be great to get some input from @jackfrancis or @towca

@mortent
Member Author

mortent commented Jan 5, 2026

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jan 5, 2026
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jan 6, 2026
@pohly
Contributor

pohly commented Jan 6, 2026

/retitle: KEP-4814: graduate Partitionable Devices to beta

@k8s-ci-robot k8s-ci-robot changed the title Update KEP-4815 for beta : KEP-4814: graduate Partitionable Devices to beta Jan 6, 2026
@pohly
Contributor

pohly commented Jan 6, 2026

/retitle KEP-4814: graduate DRA Partitionable Devices to beta

@k8s-ci-robot k8s-ci-robot changed the title : KEP-4814: graduate Partitionable Devices to beta KEP-4814: graduate DRA Partitionable Devices to beta Jan 6, 2026
@towca

towca commented Jan 16, 2026

Based on kubernetes/kubernetes#133362 and offline discussions, I think there might be some open questions around how the Partitionable Devices feature will work with Autoscaling. It would be great to get some input from @jackfrancis or @towca

Thanks for tagging me! This is a pretty complex topic for Cluster Autoscaler, not necessarily tied to this specific KEP. Sorry for the length here...

Let me try to sum up the context first:

  • Cluster Autoscaler has some logic tightly coupled with Device Plugin extended resources. One important place is calculating resource utilization for a Node, the other is calculating cluster-wide resource limits. Both these places need to be adapted for DRA.
  • CA computes Node utilization by dividing the scheduled Pods' resource requests by the Node allocatable resources. This initially was only for CPU and memory, then the logic was extended for GPUs/accelerators exposed via Device Plugin. The CPU/memory requests/allocatable division logic works out of the box for that. The name of the extended resource is delegated to cloud-provider-specific logic. There is an implicit assumption that GPU is more costly than CPU and memory, so if a Node has a GPU, only the GPU utilization matters and CPU/memory is ignored.
  • CA has a concept of "cluster-wide resource limits" - a min/max limit for the total of a given resource (CPU/memory/Device Plugin extended resource) across all Nodes in the cluster. CA doesn't exceed the limits when adding/removing Nodes in the cluster. Determining the limits is delegated to cloud-provider-specific logic, and they're defined as resourceName -> quantity entries. The cluster-wide resource limits haven't been extended for DRA yet; IIUC Shared Counter attributes in DRA kubernetes#133362 was part of an ongoing attempt to tackle this.
  • The utilization logic was extended for DRA in the following way: utilization is calculated for each Node-local ResourcePool as the number of allocated Devices in the pool divided by the number of total Devices in the pool; the highest Node-local ResourcePool utilization is returned as the Node utilization. Similarly to the Device Plugin GPU case, if a Node has Node-local DRA Devices, only the DRA utilization matters and CPU/memory is ignored. This mirrors the previous logic for Device Plugin GPU utilization (works perfectly for a single Node-local ResourcePool of homogeneous Devices significantly more expensive than CPU/memory), but doesn't really work for other cases (e.g. DRA Devices cheaper than CPU/memory, heterogeneous Devices within a ResourcePool, multiple completely different ResourcePools on the same Node).
  • We have Feat: partitionable devices support autoscaler#8559 in review which further adapts the DRA utilization logic for the Partitionable Devices KEP. The new logic still has the same limiting assumptions, but now when counting total Devices it counts each Device without ConsumesCounters as 1, and each SharedCounter as 1. When counting allocated Devices, Devices without ConsumesCounters count as 1, and the utilization of each SharedCounter (in the [0-1] range) is added. So basically it assumes that Devices without ConsumesCounters, and SharedCounters all represent the same homogeneous "full Devices" within a single ResourcePool.
  • Summing up - CA has some Device-Plugin-based logic that needs to be adapted to DRA. Parts of it are adapted already, but the DRA logic assumes a lot about the Devices. Ideally we adapt the logic in a generic way that works for arbitrary Devices.
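As a rough sketch of the per-pool utilization logic described above (types and names here are illustrative, not the actual Cluster Autoscaler code), the "highest Node-local ResourcePool wins" behavior could look like this:

```go
package main

import "fmt"

// poolUsage captures allocated vs. total device counts for one
// Node-local ResourcePool (an illustrative stand-in, not the CA API).
type poolUsage struct {
	name      string
	allocated int
	total     int
}

// nodeDRAUtilization mirrors the behavior described above: utilization is
// computed per pool as allocated/total, and the highest pool utilization
// is returned as the Node utilization. CPU/memory are ignored whenever
// Node-local DRA devices are present.
func nodeDRAUtilization(pools []poolUsage) float64 {
	highest := 0.0
	for _, p := range pools {
		if p.total == 0 {
			continue
		}
		u := float64(p.allocated) / float64(p.total)
		if u > highest {
			highest = u
		}
	}
	return highest
}

func main() {
	pools := []poolUsage{
		{name: "gpu-pool", allocated: 1, total: 4}, // 25% used
		{name: "nic-pool", allocated: 3, total: 4}, // 75% used
	}
	fmt.Println(nodeDRAUtilization(pools)) // highest pool wins: 0.75
}
```

This is exactly the case where the assumption breaks for heterogeneous devices: a pool at 75% of cheap NICs dominates a pool at 25% of expensive GPUs.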

IMO adapting the utilization logic to handle arbitrary DRA Devices will necessarily mean adding the ability to configure different behaviors for different types of DRA Devices. It seems that we need the ability to configure at least:

  • How "important" a given ResourcePool is for the Node utilization - in relation to other ResourcePools, and CPU/memory. This is so that CA can reason about the utilization of multiple Node-local ResourcePools and CPU/memory without assuming anything about the pools. Could be a priority order, could be weights for a weighted sum.
  • How "much" a given Device/CounterSet contributes to the overall utilization of its ResourcePool. This is so that CA can reason about the utilization of a ResourcePool without assuming all Devices are homogeneous.
  • We might also need the ability to influence how the utilization of a given Counter contributes to its CounterSet utilization, but I'm not sure here - maybe taking the max among the Counters is good enough.

Similarly for the cluster-wide resource limits logic - adapting it to DRA will IMO mean adding the ability to configure:

  • For a given DRA Device, which cluster-wide resource limits it consumes, and in what quantity.

For both changes above, we essentially need the ability to configure some simple metadata for DRA Devices and ResourcePools, specifically for the purpose of Node autoscaling. I see 2 distinct ways we can do that:

  1. We extend the DRA API with the required configurability. For example we could add a UtilizationWeight float field to ResourcePool, and standardize a utilization SharedCounter that would track how much a given Device contributes to the ResourcePool utilization (or add new equivalent fields just for utilization instead of piggybacking on SharedCounters). And for the limits we could add a new consumesAutoscalingLimits map[string]quantity field to Device.
  2. We introduce new, autoscaling-specific APIs with the required configurability. For example we could introduce a new DraAutoscalingConfig CRD with a field allowing to configure weights per ResourcePool, another field allowing to configure how much each Device contributes to utilization of its ResourcePool, and another field allowing to configure which autoscaling limits it consumes (or 3 separate CRDs for each of the options).

IMO from an API design standpoint Option 2 seems much better here - we're not bloating the DRA API with autoscaling-specific information that will only be used in some clusters. The big advantage of Option 1, however, is that these config options are easy to determine for already-created Devices and ResourcePools, by the same component that publishes ResourceSlices.

If we want to go with Option 2, we need to be able to configure these options ahead of time, targeting a whole "class" of Devices or ResourcePools. For example we could want to express "Nvidia GPU X ResourcePools are more important for utilization than Nvidia GPU Y ResourcePools" - so we'd have to be able to somehow select "Nvidia GPU X/Y ResourcePools" from the autoscaling API. And similarly we'd want to express "Nvidia GPU X Device consumes gpu.nvidia.com/X: 1; gpu.nvidia.com/X/memory: 1024Gi from autoscaling limits", and also probably something like "Nvidia MIG GPU X partitioned to Y consumes gpu.nvidia.com/X: 500m; gpu.nvidia.com/X/memory: 512Gi". Expressing this selection seems easy for individual Devices using CEL, but difficult for whole ResourcePools or CounterSets representing a "full Device".
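The kind of limit accounting this implies could be sketched as follows (purely illustrative names and quantities, not a proposed API - the point is just that each allocated Device maps to per-limit consumption that CA sums and checks):

```go
package main

import "fmt"

// consumption maps a cluster-wide limit name (e.g. "gpu.example.com/X",
// an invented name) to the quantity one allocated device consumes.
type consumption map[string]float64

// withinLimits sums per-device consumption across all devices and checks
// the totals against configured cluster-wide maximums - roughly the check
// an autoscaler would run before adding a Node.
func withinLimits(devices []consumption, max map[string]float64) bool {
	totals := map[string]float64{}
	for _, c := range devices {
		for name, q := range c {
			totals[name] += q
		}
	}
	for name, total := range totals {
		if limit, ok := max[name]; ok && total > limit {
			return false
		}
	}
	return true
}

func main() {
	devices := []consumption{
		{"gpu.example.com/X": 1},   // a full GPU
		{"gpu.example.com/X": 0.5}, // a partition consuming half a GPU
	}
	fmt.Println(withinLimits(devices, map[string]float64{"gpu.example.com/X": 2})) // true: 1.5 <= 2
}
```

Note how partitions consuming fractional quantities of a limit (the "500m" case above) fall out naturally once the per-device consumption is configurable.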

The problems described above intersect with this KEP because it adds CounterSets, which makes the full picture a bit more complex. But I don't think the KEP meaningfully changes the fundamental problems we need to solve. If I'm right here, we just need to make sure we're not fundamentally limiting the DRA API in a way that would prevent us from solving these problems in the future without breaking backwards compatibility - and I don't think the KEP does that either, right?

@mortent With the full context in mind - do you agree that this KEP isn't making the CA problems meaningfully more difficult, and that we're not restricting ourselves from solving them in the future with this KEP? Happy to schedule a meeting if something's not clear in my ramblings above, I know it's a ton of context.

@jackfrancis @mtrqq @MenD32 Could you validate that my analysis above makes sense?

@jackfrancis

@towca @mortent could we get some representative examples of ResourcePools that express ambiguity in terms of their overall utilization status? The way the problem is expressed it sounds like we don't have a way to inspect the sum of Pending Pod device requests and answer the basic question "do we have room on existing nodes for these devices or do we need to build new infra?" I'm not 100% up on this API but I'm surprised that you can't introspect CapacityPool membership across all pools and deduce the answers. Some examples would help.

The cluster-wide problem is more complicated, in particular: "cluster-wide" defines a sum that is calculated by adding up from a common unit. Because DRA enables infinitely flexible device classifications, there doesn't appear to be a real-world value to carrying this forward. The options would be:

  • Per-device type+unit "cluster-wide" limits
  • All-up (across all devices and device units) min/max

We may wish to worry about the first one as a useful boundary for folks to use. Do we care about the 2nd one? Is that a real problem users will care about from an autoscaler perspective? To be more concrete, how do we calculate cluster-wide limits on a cluster that has both gpu.nvidia.com- and tpu.google.com-attached devices?

@MenD32

MenD32 commented Jan 17, 2026

@towca great analysis of the current status of CA regarding those issues!

I'd like to highlight 2 of what I think are the biggest difficulties regarding DRA Partitionable Devices and autoscaling:

  1. Distinguishing "physical" and "virtual" devices - as @towca already mentioned in his comment, understanding what physical devices exist on each node is vital for features like cluster limits. Ideally the DRA implementation should be as independent of specific DRA plugins as possible (e.g. by not depending on device attributes).

  2. Autoscaling with multi-host architectures - the partitionable devices KEP describes 2 potential use cases, intra-node resources (like NVIDIA MIGs) and inter-node resources (like TPU ICI). There is currently much less support for the second case than the first one. IMO autoscaling for the second case is a much deeper issue than I originally anticipated, and will require its own discussion internally within CA.

Addendum: when I say physical device I'm referring to it from a cost perspective, i.e. a device that is part of the physical rack that runs the node (GPU, network card, etc.) and that a user is probably paying for. A virtual device would be an interface that allows incomplete use of the available capacity of the physical device (like time slices, partitions, etc.).

@mtrqq

mtrqq commented Jan 20, 2026

Thanks @towca for the great analysis! I agree that this KEP isn't adding to the problems we already have in the autoscaler. A couple points from me:

  • Resource limits. This API takes away the possibility of assuming that the count of devices exposed in the resource pool equals the amount of physical resources available on the node, e.g. 8 devices exposed by the gpu.nvidia.com driver would be treated the same as having nvidia.com/gpu: 8 set in the allocatable of the node. I don't think that this approach was a viable option in the long term, as it breaches the driver API surface by assuming that device == physical device, so it's definitely not a loss from my perspective.
  • Utilization weight. The current assumption that DRA devices always outweigh CPU/memory is incorrect for certain device types, though this seems like an orthogonal problem to partitionable devices support.
  • Utilization calculation. CA logic needs to handle counter set semantics to correctly measure pool utilization, and there's already a PR in flight that you've mentioned: Feat: partitionable devices support autoscaler#8559

About the ideas for adding metadata, I think it's reasonable to introduce separate CRD(s) for these purposes so that cluster admins may influence how the autoscaler performs such calculations. One thing worth delegating to driver maintainers is utilization calculation, potentially with a fallback to reasonable default behavior.

@haircommander haircommander mentioned this pull request Jan 20, 2026
@johnbelamaric
Member

  • Distinguishing "physical" and "virtual" devices - as @towca already mentioned in his comment, understanding what physical devices exist on each node is vital for features like cluster limits. Ideally the DRA implementation should be as independent of specific DRA plugins as possible (e.g. by not depending on device attributes).

CounterSets often correspond to physical devices. Similar issues of how we articulate capacity to the end user for partitionable devices came up in #5677, which @jackfrancis has tagged for SIG autoscaling to take a look at. CounterSets have a name, which likely will be things like "gpu0", "gpu1", etc. So, would treating them as physical devices work for autoscaling? I think those may not be as confusing for users, at least in these common cases, as we are worried about. Sure, I expect people may find other uses for CounterSets that don't correspond to physical devices, but I don't think that will be the common case.

@MenD32

MenD32 commented Jan 21, 2026

CounterSets often correspond to physical devices. Similar issues of how we articulate capacity to the end user for partitionable devices came up in #5677, which @jackfrancis has tagged for SIG autoscaling to take a look at. CounterSets have a name, which likely will be things like "gpu0", "gpu1", etc. So, would treating them as physical devices work for autoscaling? I think those may not be as confusing for users, at least in these common cases, as we are worried about. Sure, I expect people may find other uses for CounterSets that don't correspond to physical devices, but I don't think that will be the common case.

The way the autoscaler deals with this in kubernetes/autoscaler#8559 is by trying to count unique devices via counterSets. One device can consume multiple counterSets, so counterSets are not a 1:1 map to physical devices; the logic looks for devices that consume multiple counterSets and groups those counterSets as parts of the same device. This is not a foolproof solution, because it is not guaranteed that there will be a device that consumes all the counters of a physical device.

Following the discussion in kubernetes/kubernetes#133362, IMO a possible solution could be to create a convention for DRA plugins' device attributes, specifically device_id and device_type, where device_id distinguishes physical devices and device_type identifies "products" (like A100 or H100 in NVIDIA GPUs).

This would help in both resource limits and utilization calculation, since it'd be a standard way to count physical devices and to define precise cluster limits, and it could also be applicable to ResourceQuota with DRA.


@mortent
Member Author

mortent commented Jan 21, 2026

@jackfrancis For your question about ResourcePools that express ambiguity in terms of their overall utilization status, I think the fundamental challenge that comes with Partitionable Devices is that it will no longer be possible to determine available capacity by counting the number of devices in a resource pool. Since the devices share the underlying hardware, allocating a single device can make several of the remaining devices in the pool unavailable. Also, the devices will be different partitions of the underlying hardware, so they are not homogeneous. From what I can tell about the proposal in kubernetes/autoscaler#8559, it seems like it would handle what we think are the most common scenarios, where there is a 1:1 mapping between a physical device and a CounterSet.

@mortent
Member Author

mortent commented Jan 21, 2026

So I agree with @towca that this KEP doesn't change the fundamental challenges that already exist for adapting CA to DRA. It seems like there are a few areas discussed here:

  • CA needs to determine the number of physical devices to correctly manage cluster-wide resource limits. It is not clear to me that the ResourceSlice is the best place for CA to determine the number of physical devices on a node. Could this be something that is handled a different way, maybe through cloud-provider-specific logic?

  • Computing the utilization for node-local resource pools. This seems like the most challenging one in the general case. For example, it is possible in DRA to create a resource pool that contains node-local devices for different nodes, although I can't think of any reason why someone would want to do that. This gets somewhat more complex with the Partitionable Devices feature, but I think the assumptions in Feat: partitionable devices support autoscaler#8559 should make it work for the normal node-local scenarios. Autoscaling for multi-host devices is, I think, more complex in several dimensions, so it should probably be handled separately.

  • Determining the relative importance of utilization between different types of devices. This seems to be a challenge that predates DRA and I don't think this KEP has any impact on this.

So to summarize, the impact of this KEP is mostly on computing utilization, but we are making progress on it. There is still work to do in order to support this in the general case for DRA. I agree that handling this with CA-specific configuration of DRA using CRDs seems like a good way to handle it.

@johnbelamaric
Member

The way the autoscaler deals with this in kubernetes/autoscaler#8559 is by trying to count unique devices via counterSets. One device can consume multiple counterSets, so counterSets are not a 1:1 map to physical devices; the logic looks for devices that consume multiple counterSets and groups those counterSets as parts of the same device. This is not a foolproof solution, because it is not guaranteed that there will be a device that consumes all the counters of a physical device.

Following the discussion in kubernetes/kubernetes#133362, IMO a possible solution could be to create a convention for DRA plugins' device attributes, specifically device_id and device_type, where device_id distinguishes physical devices and device_type identifies "products" (like A100 or H100 in NVIDIA GPUs).

If there is a convention or even a set of fields we can add to the API to make this better for CA, please feel free to suggest them.

Contributor

@pohly pohly left a comment


Some nits, but overall this looks good to me.

The discussion around autoscaling also seems to have settled down.

@towca

towca commented Jan 23, 2026

Autoscaling with multi-host architectures - the partitionable devices KEP describes 2 potential use cases, intra-node resources (like NVIDIA MIGs) and inter-node resources (like TPU ICI). There is currently much less support for the second case than the first one. IMO autoscaling for the second case is a much deeper issue than I originally anticipated, and will require its own discussion internally within CA.

I'd like to highlight one thing here - right now CA basically doesn't support multi-host DRA autoscaling at all:

  • While CA tracks all kinds of ResourceSlices and passes them to the vendor kube-scheduler logic, the direct logic in CA mostly deals with Node-local ResourceSlices (the ones with Spec.NodeName set).
  • So e.g. CA can correctly determine that a pending Pod with a non-Node-local ResourceClaim can actually be scheduled on an existing Node in the cluster (by delegating to the kube-scheduler code). But this is pretty much the only thing that works for such claims/slices.
  • For scale-up logic, CA adds new Nodes to the cluster simulation. Each new Node only brings new Node-local ResourceSlices to the simulation. So CA wouldn't scale anything up for a pending Pod with a ResourceClaim that would require adding new non-Node-local ResourceSlices.
  • Similarly for scale-down logic - it works per-Node, and only considers the Node-local ResourceSlices when computing utilization. So a Node with an allocated multi-host DRA Device would be counted as 0% DRA utilization, and CA could scale it down.

Adding support for Node-local Devices to CA was far from easy, but it didn't require changes to how CA behaves at a fundamental level. Adding support for multi-host Devices (or for attaching new Devices to existing Nodes) will require fundamental changes to how CA models its simulations. I imagine this would be a dedicated effort with a comprehensive design (KEP/AEP) of its own. So IMO discussing it now is out of scope, and we should focus on the Node-local case that CA supports right now.

If there is a convention or even a set of fields we can add to the API to make this better for CA, please feel free to suggest them.

Similarly here - I think it's hard to determine the fields/conventions without a dedicated, autoscaling-specific design effort. As mentioned in my previous comment, it's e.g. not clear to me if these fields would be better placed in the DRA API, or a new dedicated autoscaling-specific CRD. We have this effort planned for the near future (kubernetes/autoscaler#7781, kubernetes/autoscaler#8184), but not in time for K8s 1.36.

With the 2 points above in mind, IMO we should focus the Node autoscaling discussion here on 2 aspects:

  1. Making sure we're not extending the DRA API in a way that would prevent us from adding new, autoscaling-specific fields in the future if the dedicated effort lands on that as a solution.
  2. Validating that CA behavior wrt Partitionable Devices proposed in Feat: partitionable devices support autoscaler#8559 is reasonable in the "common" case of Node-local partitioned hardware devices (e.g. Nvidia MIG GPU). This is the behavior we'll be stuck with until the dedicated effort lands.

For 1., as mentioned before I don't personally think this KEP is restricting us from the possible future changes. There are also other voices with a similar opinion in this thread IIUC.

2. is a bit more tricky, and I'd like to clarify a few things there. Apart from one point, IMO this part isn't really blocking for the KEP approval because it's largely orthogonal.

The PR is still being discussed, but here's my view of how the utilization logic should work in CA before we have the dedicated effort for it (copied from the PR review):

  • We assume that each CounterSet in a ResourcePool's SharedCounters corresponds to a "full device".
  • We assume that each DRA Device without ConsumesCounters corresponds to a "full device".
  • We assume that each DRA ResourcePool represents a pool of homogeneous "full devices" - regardless of whether they're expressed via a SharedCounter, or a Device without ConsumesCounters.
  • The utilization logic:
    1. Adds up "full devices" exposed via a Device without ConsumesCounters.
    2. For each "full device" exposed via a CounterSet, its utilization is calculated as the highest utilization among its Counters. This utilization represents a portion of a "full device", which is added to the allocated total calculated in step 1.
    3. Divides the allocated total calculated in steps 1. and 2. by the total number of "full devices".
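The three steps above can be sketched in Go as follows (illustrative types only, not the actual CA or DRA Go structs; per-Counter utilization values are assumed to be precomputed in [0,1]):

```go
package main

import "fmt"

// counterSet stands in for one "full device" exposed via SharedCounters.
type counterSet struct {
	// counterUtilization holds per-Counter utilization in [0,1],
	// e.g. allocated memory / total memory for that counter.
	counterUtilization map[string]float64
}

// poolUtilization sketches the three steps above:
//  1. each allocated Device without ConsumesCounters counts as 1,
//  2. each CounterSet contributes its highest per-Counter utilization,
//  3. the sum is divided by the number of "full devices"
//     (plain devices plus CounterSets).
func poolUtilization(plainTotal, plainAllocated int, sets []counterSet) float64 {
	fullDevices := float64(plainTotal + len(sets))
	if fullDevices == 0 {
		return 0
	}
	allocated := float64(plainAllocated)
	for _, cs := range sets {
		maxU := 0.0
		for _, u := range cs.counterUtilization {
			if u > maxU {
				maxU = u
			}
		}
		allocated += maxU
	}
	return allocated / fullDevices
}

func main() {
	// One whole GPU allocated out of two plain devices, plus one MIG-style
	// GPU whose memory counter is half consumed: (1 + 0.5) / 3 = 0.5.
	sets := []counterSet{{counterUtilization: map[string]float64{"memory": 0.5, "mig-slices": 0.25}}}
	fmt.Println(poolUtilization(2, 1, sets)) // 0.5
}
```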

IIUC @mortent validated on the PR review that these seem like reasonable assumptions for CA to make in the common case. It'd be great if we could get more eyes on this to double-check. I also have some follow-up questions:

  • [this is the potentially blocking point] What's the expectation wrt the "availability fields" (NodeName/NodeSelector/AllNodes/PerDeviceNodeSelection) for ResourceSlices that set SharedCounters? I think the KEP doesn't mandate anything here, right? It'd simplify the logic in CA if we could depend on something like "if a ResourcePool has all Node-local Devices on the same Node, all its ResourceSlices (including the ones with SharedCounters) should have the NodeName field set". Otherwise we'll have to adapt the existing logic from the current simple "attach all slices with NodeName X to Node X" to something more complex like "attach all slices with NodeName X to Node X; go over all the ResourcePools in the slices, and for each pool also add ResourceSlices from the same pool without any availability field set".
  • Is it possible for there to be both "non-partitioned" (Device without ConsumesCounters) and "partitioned" (SharedCounter entry) devices in the same ResourcePool? I.e. could we have a single ResourcePool with gpu-0 exposed via Device without ConsumesCounters, and with gpu-1 exposed via SharedCounter? If possible, is it "likely", i.e. does it make sense for CA to handle this for the "common" case? It would simplify the utilization logic described above if we could have different paths for calculating utilization for "partitioned" ResourcePools and "non-partitioned" ResourcePools.
  • What's the motivation for a single Device consuming multiple SharedCounters? Could you share some example use-case?

@mortent
Member Author

mortent commented Jan 23, 2026

The PR is still being discussed, but here's my view of how the utilization logic should work in CA before we have the dedicated effort for it (copied from the PR review):

  • We assume that each CounterSet in a ResourcePool's SharedCounters corresponds to a "full device".

  • We assume that each DRA Device without ConsumesCounters corresponds to a "full device".

  • We assume that each DRA ResourcePool represents a pool of homogeneous "full devices" - regardless of whether they're expressed via a SharedCounter, or a Device without ConsumesCounters.

  • The utilization logic:

    1. Adds up "full devices" exposed via a Device without ConsumesCounters.
    2. For each "full device" exposed via a CounterSet, its utilization is calculated as the highest utilization among its Counters. This utilization represents a portion of a "full device", which is added to the allocated total calculated in step 1.
    3. Divides the allocated total calculated in steps 1. and 2. by the total number of "full devices".
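The three steps above can be sketched as follows. This is a minimal illustration only; the types (`Pool`, `CounterSet`, `Counter`) are hypothetical stand-ins, not the real ResourceSlice API or Cluster Autoscaler types.

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration; the real
// ResourceSlice and Cluster Autoscaler types differ.
type Counter struct {
	Capacity  float64
	Allocated float64
}

type CounterSet struct {
	Name     string
	Counters map[string]Counter
}

type Pool struct {
	// Each CounterSet represents one partitionable "full device".
	CounterSets []CounterSet
	// Devices without ConsumesCounters: total and currently allocated.
	PlainDevicesTotal     int
	PlainDevicesAllocated int
}

// poolUtilization implements the three steps described above:
//  1. count allocated devices without ConsumesCounters,
//  2. for each CounterSet, take the highest utilization among its
//     counters as that device's fractional allocation,
//  3. divide the allocated total by the total number of "full devices".
func poolUtilization(p Pool) float64 {
	allocated := float64(p.PlainDevicesAllocated)
	for _, cs := range p.CounterSets {
		highest := 0.0
		for _, c := range cs.Counters {
			if c.Capacity > 0 {
				if u := c.Allocated / c.Capacity; u > highest {
					highest = u
				}
			}
		}
		allocated += highest
	}
	total := float64(p.PlainDevicesTotal + len(p.CounterSets))
	if total == 0 {
		return 0
	}
	return allocated / total
}

func main() {
	p := Pool{
		CounterSets: []CounterSet{{
			Name: "gpu-0-counter-set",
			Counters: map[string]Counter{
				"memory": {Capacity: 40, Allocated: 20}, // 50% used
				"mig":    {Capacity: 7, Allocated: 2},   // ~29% used
			},
		}},
		PlainDevicesTotal:     1,
		PlainDevicesAllocated: 1,
	}
	// One fully allocated plain device plus a half-used partitioned one,
	// out of two full devices: (1 + 0.5) / 2 = 0.75.
	fmt.Println(poolUtilization(p))
}
```

Taking the highest utilization among a CounterSet's counters matches the intuition that the most-consumed counter is the binding constraint on further partition allocations from that device.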

I think the assumptions you have listed are reasonable. The main caveat I can think of is that users might choose to statically partition their devices, meaning that the devices listed in the ResourceSlice no longer map 1:1 to physical devices. In that situation there isn't any need to use the Partitionable Devices feature, since the partitioning is done such that the devices don't use overlapping hardware. This is similar to the way MIG devices can be supported with the Device Plugin today. But like I said, this is separate from the Partitionable Devices feature.

  • [this is the potentially blocking point] What's the expectation wrt the "availability fields" (NodeName/NodeSelector/AllNodes/PerDeviceNodeSelection) for ResourceSlices that set SharedCounters? I think the KEP doesn't mandate anything here, right? It'd simplify the logic in CA if we could depend on something like "if a ResourcePool has all Node-local Devices on the same Node, all its ResourceSlices (including the ones with SharedCounters) should have the NodeName field set". Otherwise we'll have to adapt the existing logic from the current simple "attach all slices with NodeName X to Node X" to something more complex like "attach all slices with NodeName X to Node X; go over all the ResourcePools in the slices, and for each pool also add ResourceSlices from the same pool without any availability field set".

The requirements for the node selection fields (NodeName, NodeSelector, AllNodes, PerDeviceNodeSelection) are the same for all ResourceSlices, meaning that one and only one of the fields must be set. For drivers running on a node that publish resource pool(s) of node-local devices, the expectation is that they will set the same node selector, which would be NodeName, for all ResourceSlices in the pool. This lets the driver fetch all of its ResourceSlices using a field selector.
However, there is nothing in the DRA spec that requires all ResourceSlices within a resource pool to have the same node selector. So it is possible for a resource pool to have ResourceSlices with different values for the NodeName field, or where some set AllNodes and others NodeName. We found a bug related to just this scenario relatively recently: kubernetes/kubernetes#134466. That being said, I don't think we have any use-case where someone would do something like this, and I think we should consider it a best practice in DRA that all ResourceSlices within a resource pool have identical node selectors.
The resourceslice controller that publishes ResourceSlices for drivers built on the DRA kubeletplugin framework doesn't allow setting different node selectors across ResourceSlices, with the exception of PerDeviceNodeSelection.
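The "one and only one" rule for the node selection fields can be sketched as a simple validation check. The types here are hypothetical simplifications for illustration; the real fields live on the ResourceSlice spec in the `resource.k8s.io` API group.

```go
package main

import "fmt"

// Hypothetical, simplified view of a ResourceSlice's node-selection
// fields; the real API types differ (e.g. NodeSelector is a full
// selector object, not a bool).
type SliceNodeSelection struct {
	NodeName               string
	NodeSelector           bool // true if a node selector is set
	AllNodes               bool
	PerDeviceNodeSelection bool
}

// exactlyOneSet mirrors the validation rule described above: one and
// only one of the node-selection fields must be set on a ResourceSlice.
func exactlyOneSet(s SliceNodeSelection) bool {
	n := 0
	if s.NodeName != "" {
		n++
	}
	if s.NodeSelector {
		n++
	}
	if s.AllNodes {
		n++
	}
	if s.PerDeviceNodeSelection {
		n++
	}
	return n == 1
}

func main() {
	fmt.Println(exactlyOneSet(SliceNodeSelection{NodeName: "node-a"}))                 // valid: exactly one field set
	fmt.Println(exactlyOneSet(SliceNodeSelection{NodeName: "node-a", AllNodes: true})) // invalid: two fields set
}
```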

  • Is it possible for there to be both "non-partitioned" (Device without ConsumesCounters) and "partitioned" (SharedCounter entry) devices in the same ResourcePool? I.e. could we have a single ResourcePool with gpu-0 exposed via Device without ConsumesCounters, and with gpu-1 exposed via SharedCounter? If possible, is it "likely", i.e. does it make sense for CA to handle this for the "common" case? It would simplify the utilization logic described above if we could have different paths for calculating utilization for "partitioned" ResourcePools and "non-partitioned" ResourcePools.

There is nothing that prevents a driver from publishing a ResourcePool where gpu-1 is available with MIG support (so there will be a CounterSet representing the device and one or more devices consuming counters from the CounterSet) and gpu-2 is only available as a device that doesn't consume any counters. I would expect the most likely scenario is that all devices on a node would be made available in the same way, i.e. either all available as MIGs or none of them. But I can't confidently say that there aren't any use-cases where that would be desirable. Maybe @klueska can give some input on what the NVIDIA DRA driver is doing here?
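A mixed pool like the one described above can still be counted with a single shared path, since partitions that consume counters are excluded from the "full device" count. The sketch below uses hypothetical local types (`Pool`, `Device`) for illustration only, not the real ResourceSlice API types.

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration; the real
// ResourceSlice API types differ.
type Device struct {
	Name             string
	ConsumesCounters []string // names of CounterSets consumed; empty for a "plain" device
}

type Pool struct {
	CounterSetNames []string // each CounterSet represents one partitionable full device
	Devices         []Device
}

// fullDeviceCount counts the "full devices" in a mixed pool: every
// CounterSet is one full device, and every device that consumes no
// counters is one full device. Partitions that consume counters are not
// counted, because the CounterSet they draw from already represents the
// underlying physical device.
func fullDeviceCount(p Pool) int {
	n := len(p.CounterSetNames)
	for _, d := range p.Devices {
		if len(d.ConsumesCounters) == 0 {
			n++
		}
	}
	return n
}

func main() {
	// gpu-1 is partitionable (a CounterSet plus MIG partitions that
	// consume it); gpu-2 is exposed directly without counters.
	p := Pool{
		CounterSetNames: []string{"gpu-1-counter-set"},
		Devices: []Device{
			{Name: "gpu-1-mig-1g", ConsumesCounters: []string{"gpu-1-counter-set"}},
			{Name: "gpu-1-mig-2g", ConsumesCounters: []string{"gpu-1-counter-set"}},
			{Name: "gpu-2"},
		},
	}
	fmt.Println(fullDeviceCount(p)) // 2 full devices: gpu-1 and gpu-2
}
```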

  • What's the motivation for a single Device consuming multiple SharedCounters? Could you share some example use-case?

For node-local devices, I'm not aware of any use-cases where a single device would consume counters from multiple CounterSets. So I think it is safe to assume only one for the base common-case support we are planning for the first phase.

@towca

towca commented Jan 27, 2026

I think the assumptions you have listed are reasonable. The main caveat I can think of is that users might choose to statically partition their devices, meaning that the devices listed in the ResourceSlice no longer map 1:1 to physical devices. In that situation there isn't any need to use the Partitionable Devices feature, since the partitioning is done such that the devices don't use overlapping hardware. This is similar to the way MIG devices can be supported with the Device Plugin today. But like I said, this is separate from the Partitionable Devices feature.

You mean that we could have a ResourcePool with statically partitioned GPUs, so that SharedCounters are not used at all, but the Devices within the pool are not homogeneous? Yeah this seems fully orthogonal and would fall under the list of use-cases not supported by CA in the initial logic.

The requirements for the node selection fields (NodeName, NodeSelector, AllNodes, PerDeviceNodeSelection) are the same for all ResourceSlices, meaning that one and only one of the fields must be set. For drivers running on a node that publish resource pool(s) of node-local devices, the expectation is that they will set the same node selector, which would be NodeName, for all ResourceSlices in the pool. This lets the driver fetch all of its ResourceSlices using a field selector.

Just to double-check here - so in the examples from the KEP (e.g. the ResourceSlice with gpu-0-counter-set), we could normally expect NodeName to be set on the slice defining SharedCounters?

However, there is nothing in the DRA spec that requires all ResourceSlices within a resource pool to have the same node selector. So it is possible for a resource pool to have ResourceSlices with different values for the NodeName field, or where some set AllNodes and others NodeName. We found a bug related to just this scenario relatively recently: kubernetes/kubernetes#134466. That being said, I don't think we have any use-case where someone would do something like this, and I think we should consider it a best practice in DRA that all ResourceSlices within a resource pool have identical node selectors.

Yeah, I understand that part - these would also fall under the list of use-cases not supported by CA. We do probably need more explicit error handling for this in CA, so that a single ResourcePool breaking this assumption doesn't affect others (there's kubernetes/autoscaler#7784 planned to tackle this and other similar problems). But as you're saying - this is not introduced in this KEP.

There is nothing that prevents a driver from publishing a ResourcePool where gpu-1 is available with MIG support (so there will be a CounterSet representing the device and one or more devices consuming counters from the CounterSet) and gpu-2 is only available as a device that doesn't consume any counters. I would expect the most likely scenario is that all devices on a node would be made available in the same way, i.e. either all available as MIGs or none of them. But I can't confidently say that there aren't any use-cases where that would be desirable. Maybe @klueska can give some input on what the NVIDIA DRA driver is doing here?

Ack, thanks for the explanation! Seems best to keep the shared path with summing up both kinds in CA logic then.

For node-local devices, I'm not aware of any use-cases where a single device would consume counters from multiple CounterSets. So I think it is safe to assume only one for the base common-case support we are planning for the first phase.

That makes sense, thanks a lot for the confirmation! cc @MenD32

I don't think I have any more questions or concerns. @mtrqq @MenD32 @jackfrancis Do you?

Also tagging the Karpenter folks before formal approval - @jonathan-innis @njtran Have you had a chance to review this KEP update? Any concerns from the Karpenter end?

@dom4ha
Member

dom4ha commented Jan 28, 2026

/approve for sig-scheduling

There is nothing new from the scheduler perspective and the scheduling challenges have been identified, so LGTM. We're currently expanding gang scheduling to support topologies (see #5733), so the scheduler should be able to consider various scheduling options (devices) and pick the one that allows binding of all pods in the gang.

@mortent
Member Author

mortent commented Jan 28, 2026

Just to double-check here - so in the examples from the KEP (e.g. the ResourceSlice with gpu-0-counter-set), we could normally expect NodeName to be set on the slice defining SharedCounters?

I would expect NodeName to be set for all ResourceSlices published by drivers running on nodes, i.e. ResourceSlices containing node-local devices. There is nothing in DRA that prevents a driver from publishing ResourceSlices with a different node selection, but I don't know of any use-case where they would do so.

For resource pools containing devices that are not node-local, they will almost certainly use a different node selection than NodeName.

@harche
Contributor

harche commented Jan 28, 2026

Looks good to me

@johnbelamaric
Member

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dom4ha, johnbelamaric, mortent

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 28, 2026
@k8s-ci-robot k8s-ci-robot merged commit 0463ac3 into kubernetes:master Jan 28, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Jan 28, 2026
@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Jan 30, 2026