
KEP-5517: Introduce DRA for Native Resources KEP#5755

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from pravk03:native-dra
Feb 11, 2026

Conversation

@pravk03
Contributor

@pravk03 pravk03 commented Dec 25, 2025

  • One-line PR description: Support native resource management through DRA.
  • Other comments:

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Dec 25, 2025
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 25, 2025
@pravk03 pravk03 force-pushed the native-dra branch 2 times, most recently from dfa064c to 02706c7 on December 26, 2025 at 19:53
@ffromani
Contributor

/cc

@kad
Member

kad commented Dec 29, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from kad December 29, 2025 17:24
@pravk03 pravk03 force-pushed the native-dra branch 3 times, most recently from df2979a to 9146641 on January 1, 2026 at 11:48
@pravk03 pravk03 marked this pull request as ready for review January 5, 2026 18:25
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 5, 2026
@pravk03
Contributor Author

pravk03 commented Jan 5, 2026

/cc @johnbelamaric

@mortent
Member

mortent commented Jan 6, 2026

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jan 6, 2026
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jan 7, 2026
Member

@dom4ha dom4ha left a comment


Have you considered defining some sort of framework level consumable counter that different plugins could consume capacity from? DRA devices could use this counter using Consumable Capacity feature and NodeResources plugin would be extended somehow to use it as well.

It would be a bit cleaner conceptually, but we'd be introducing a quite significant framework change, which may be harder to get through the process.

#### Unreferenced Claims

If a `ResourceClaim` is listed in `pod.Spec.ResourceClaims` but not referenced by any container in `pod.Spec.Containers[*].Resources.Claims`,
the resources associated with this claim are still accounted for against the node's capacity. The DRA allocator reserves the devices for the pod,
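To make the quoted case concrete, here is a minimal sketch of such a pod (all names are hypothetical): the claim appears in `spec.resourceClaims`, but no container references it under `resources.claims`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod              # hypothetical
spec:
  resourceClaims:
  - name: cpu-claim              # listed at the pod level...
    resourceClaimName: my-cpu-claim
  containers:
  - name: app
    image: registry.example/app  # hypothetical image
    resources:
      requests:
        memory: "1Gi"
      # ...but no `claims` entry references cpu-claim, so the claim is
      # unreferenced in the sense described above; its resources are
      # still counted against the node's capacity.
```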
Member


What about in case of "AddPerReference" accounting policy?

Member


Have you considered defining some sort of framework level consumable counter that different plugins could consume capacity from? DRA devices could use this counter using Consumable Capacity feature and NodeResources plugin would be extended somehow to use it as well.

So, one other option we thought about was simply to treat the NodeAllocatable as a CounterSet that is implicitly always available with a well-known name. If we additionally layer in a concept of translating consumed capacity into counter set consumption (i.e., a mapping between our current consumable capacity concept and our counter set concept), then we can solve many of the same problems.
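This is not part of any current API; purely as a hypothetical sketch, a node's allocatable resources exposed as an implicitly available counter set under a well-known name might look like:

```yaml
# Hypothetical sketch only: NodeAllocatable modeled as a CounterSet
# that is implicitly always available under a well-known name.
sharedCounters:
- name: node-allocatable   # well-known name (assumed, not defined anywhere)
  counters:
    cpu:
      value: "8"           # mirrors the node's allocatable CPU
    memory:
      value: "32Gi"        # mirrors the node's allocatable memory
```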

Contributor Author


What about in case of "AddPerReference" accounting policy?

Updated this section. In the case of unreferenced claims, we should account for them once when we update NodeInfo, irrespective of the accounting policy.

Member


So, one other option we thought about was simply to treat the NodeAllocatable as a CounterSet that is implicitly always available with a well-known name. If we additionally layer in a concept of translating consumed capacity into counter set consumption (i.e., a mapping between our current consumable capacity concept and our counter set concept), then we can solve many of the same problems.

@johnbelamaric I believe this is something that we can still consider post-alpha. Note that it may also address the problem of scoring (node fit would see a counter aggregating native resource requests coming from different plugins).

Maybe we could mention this option for future consideration.

Contributor Author


@dom4ha I have covered scoring in the "Future Enhancements" section. In the current proposal (and draft implementation), scoring in NodeResourcesFit does not consider claim requests, but it should be possible to introduce unified scoring.

@pravk03 pravk03 force-pushed the native-dra branch 2 times, most recently from c5ae967 to b7623ed on February 5, 2026 at 07:35
@liggitt liggitt moved this to Assigned in API Reviews Feb 5, 2026
@pravk03
Contributor Author

pravk03 commented Feb 9, 2026

Summary of modifications made to the KEP based on discussions. This should help reduce the scope and implementation complexity on the scheduler side.

  1. Removed AccountingPolicy from the Device and DeviceClass APIs. For the Alpha scope, we will support a single accounting policy where claim requests are allocated once and are additive to the pod's standard resource requests. The AccountingPolicy field, along with the supported policies, has been moved to the "Future Enhancements" section for reconsideration in Alpha 2/Beta.
  2. Sharing native resource claims between pods is excluded from alpha. The DynamicResources plugin will be updated to reject any pod that references a claim already in use by an existing pod. This reduces implementation complexity and allows for more discussion on node-level enforcement and cgroup management.
  3. Pod Level Resources, if specified, will take precedence and determine the overall resource footprint. Added a validation step in the DynamicResources plugin to ensure that container requests (including native resource claims) do not exceed pod-level requests.
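As a rough illustration of point 3 (the field layout follows the Pod Level Resources feature; the claim contents and names are hypothetical), pod-level requests define the overall footprint that container requests, including native resources satisfied via claims, must fit within:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-level-example           # hypothetical
spec:
  resources:                        # pod-level requests take precedence
    requests:
      cpu: "4"
      memory: "8Gi"
  resourceClaims:
  - name: cpu-claim
    resourceClaimName: my-cpu-claim # assume this claim allocates 2 CPUs
  containers:
  - name: app
    image: registry.example/app     # hypothetical image
    resources:
      requests:
        memory: "4Gi"
      claims:
      - name: cpu-claim
# Validation sketch: the containers' aggregate requests plus the native
# resources from cpu-claim (2 CPUs) must not exceed spec.resources.requests.
```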

cc @johnbelamaric @dchen1107 @SergeyKanzhelev @dom4ha

@dom4ha
Member

dom4ha commented Feb 10, 2026

/approve

Thanks for reducing the scope, which should be helpful.

Member

@johnbelamaric johnbelamaric left a comment


PRR is fine for alpha

@pravk03
Contributor Author

pravk03 commented Feb 10, 2026

/hold

(for sig-node lgtm)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 10, 2026
- The Kubelet's admission handler is updated to consider native resource claims in `Pod.Status`.
- All unit and integration tests outlined in the Test Plan are implemented and verified.

#### Alpha2 / Beta
Member


How do we know that we are moving in the right direction by making the kubelet read native resources added by DRA from the Pod Status? Is it part of "Gather feedback from alpha"?

Contributor Author

@pravk03 pravk03 Feb 11, 2026


Do you mean populating the claim information in the pod status rather than the ResourceClaim status? Or the general idea of the kubelet having to look outside of pod.spec?

@SergeyKanzhelev
Member

The new scope seems more reasonable. I am still very uneasy about a few topics:

  1. Can we GA the Native Resources DRA plugin so people can start adopting it (before we fully GA-ed this KEP)? If so - we may want to define a migration plan or just say that it will be a different DRA plugin once this KEP is GA.
  2. Are there real-world frameworks or users who have lined up to try the fungibility between GPU and CPU? If we have nobody confirming that this is "likely" the way to go, it feels like we are implementing an expensive solution to a problem nobody is solving right now. Will we have enough feedback after alpha to decide on the path forward?

@johnbelamaric
Member

/approve for PRR

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2026
@pravk03
Contributor Author

pravk03 commented Feb 10, 2026

  1. Can we GA the Native Resources DRA plugin so people can start adopting it (before we fully GA-ed this KEP)? If so - we may want to define a migration plan or just say that it will be a different DRA plugin once this KEP is GA.

I think the answer depends heavily on the specific plugin implementation. Specifically thinking of dra-driver-cpu, we can support adoption now using existing workarounds, such as mirroring claim requests in the Pod Spec. I view having Pod Level Resources define the overall shape (including claims) as one clean path for migrating to this KEP, but this definitely needs more thought. I am not sure if the same approach works for Memory or other drivers. @ffromani, since you are working on both the CPU and Memory drivers, do you have any thoughts here?

On the contrary, without an understanding of native claims in the scheduler and kubelet, we would have to rely on (potentially different) workarounds in all the native resource driver implementations.
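The mirroring workaround mentioned above might look roughly like this (claim and image names are hypothetical): the container requests the same CPU quantity through the standard resources stanza that the DRA claim is expected to allocate, so today's scheduler accounts for it even without native-claim awareness:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mirrored-cpu-pod              # hypothetical
spec:
  resourceClaims:
  - name: cpu-claim
    resourceClaimName: exclusive-cpus # e.g. a dra-driver-cpu claim for 2 CPUs
  containers:
  - name: app
    image: registry.example/app       # hypothetical image
    resources:
      requests:
        cpu: "2"                      # mirrors the claim's 2 CPUs so the
      limits:                         # scheduler and kubelet account for them
        cpu: "2"
      claims:
      - name: cpu-claim
```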

  1. Are there real-world frameworks or users who have lined up to try the fungibility between GPU and CPU? If we have nobody confirming that this is "likely" the way to go, it feels like we are implementing an expensive solution to a problem nobody is solving right now. Will we have enough feedback after alpha to decide on the path forward?

Yes, we have seen interest in GPU/CPU fungibility.

@ffromani
Contributor

ffromani commented Feb 11, 2026

  1. Can we GA the Native Resources DRA plugin so people can start adopting it (before we fully GA-ed this KEP)? If so - we may want to define a migration plan or just say that it will be a different DRA plugin once this KEP is GA.

I think the answer depends heavily on the specific plugin implementation. Specifically thinking of dra-driver-cpu, we can support adoption now using existing workarounds, such as mirroring claim requests in the Pod Spec. I view having Pod Level Resources define the overall shape (including claims) as one clean path for migrating to this KEP, but this definitely needs more thought. I am not sure if the same approach works for Memory or other drivers. @ffromani, since you are working on both the CPU and Memory drivers, do you have any thoughts here?

Like you, I tend to believe that combining with Pod Level Resources should work, but we need more data here.

I see the point about experimenting with drivers before adding important changes into core Kubernetes. I'm not against it, even if I clearly see the value in this KEP. The issues with this approach (try external drivers first) boil down to chicken-and-egg problems:

  • the number of workarounds needed is likely to significantly limit the UX, and therefore adoption, which gives us too little data
  • what drivers can actually do is limited, which in turn negatively constrains adoption, which circles back to too little data.

What I'm trying to say is that I see the value in experimenting out of tree, and I personally would encourage that, but without some core Kubernetes changes, be they in the scheduler, in the kubelet, or likely both, what drivers for core resources can do is very limited, and that hampers early adoption.

@dchen1107
Member

/lgtm
/approve

Thank you all for the extensive discussions, both online and offline. We’ve successfully narrowed the scope to the most essential components. While some open questions remain, I believe it's ready to move forward into the implementation phase to continue exploring those details.

Agreed with Sergey above to move the CPU DRA driver to GA without being blocked by this KEP, so that we can get earlier and more feedback from users.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, dom4ha, johnbelamaric, pravk03

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@johnbelamaric
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 11, 2026
@k8s-ci-robot k8s-ci-robot merged commit 7e0cdf1 into kubernetes:master Feb 11, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 11, 2026
@mortent mortent moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Feb 17, 2026

Labels

api-review Categorizes an issue or PR as actively needing an API review.
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory
lgtm "Looks good to me", indicates that a PR is ready to be merged.
sig/node Categorizes an issue or PR as relevant to SIG Node.
sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.
size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Projects

Archived in project
Status: Done
