
Add KEP for DRA: Extended Resource #5136

Open — wants to merge 13 commits into base: master
Conversation

@yliaog commented Feb 5, 2025:

  • One-line PR description:
    Add new KEP for supporting extended resource requests in DRA

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Feb 5, 2025
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 5, 2025
@yliaog (Author) commented Feb 5, 2025:

/assign @johnbelamaric

@johnbelamaric (Member) left a comment:

Awesome, thanks @yliaog

pod on the node.
* It is *ignored* by the scheduler DRA plugin as it has no spec, hence
scheduler does not need to do any allocation for it.
* It is *read* by scheduler DRA plugin for the devices allocated, so that
Member:

Can you clarify the distinction between these two bullets (which seem to be saying the opposite)? Are you saying the plugin will not try to allocate it, but instead will only use it for accounting for the extended-resource-based device allocations?

Author:

Normally the scheduler DRA plugin would try to allocate devices for a ResourceClaim. This one, however, is special: it has no spec (it exists to record the devices allocated for extended resources), so the scheduler DRA plugin does not need to allocate any devices for it.

However, the scheduler DRA plugin does need to read the devices allocated to extended resources from this ResourceClaim, so that it does not consider those devices when trying to allocate devices for other DRA resource claims.

resource.kubernetes.io/extended-resource-name: foo.domain/bar
```
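For illustration, such a special claim might look roughly like the following sketch. The annotation is taken from the KEP excerpt above; the claim name and the specific allocation values are assumptions, and the status fields are modeled on the existing ResourceClaim API.

```yaml
# Sketch of the special ResourceClaim: no spec.devices.requests,
# only the annotation plus an allocation result in status.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: extended-resource-foo.domain-bar   # hypothetical name
  namespace: default
  annotations:
    resource.kubernetes.io/extended-resource-name: foo.domain/bar
spec: {}   # empty: nothing for the scheduler DRA plugin to allocate
status:
  allocation:
    devices:
      results:
        - request: extended-resource       # no matching request exists in spec
          driver: gpu.example.com          # hypothetical driver
          pool: node-a
          device: gpu-0
```

The empty spec is what makes the claim invisible to the normal allocation path, while `status.allocation.devices` is what the accounting path reads.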

The special resource claim lifecycle is managed by the resource claim
Member:

Can you explain a bit more about how the scheduler will operate on this? Like, what is the sequence diagram of how the scheduler decides to service the extended resource request via the DRA driver?

Can the cluster contain both pure device plugin based extended resources, AND DRA-based extended resources? I think it's clear with this plan that any given node will have a single driver, and with the DRA ones it can advertise as either DRA or extended resource. But if there are some nodes advertising the extended resource via device plugin, and some doing it via this mechanism, does the scheduler need to know that up front (I think it would because the resource slice exists), and then does it only update the special resource claim if it picks a node that uses DRA too? I think a few more details will help.

Are there any race conditions between the scheduler and the resource claim controller we need to worry about?

Author:

Yes, a cluster can have one node with a device plugin installed and another node with a DRA driver installed, both advertising the same resource name. I clarified that in the KEP.

The scheduler needs to know that and keep track of it in NodeInfo.

The ResourceClaim is revised to be created by the scheduler. I agree it is better to keep it as close to a regular resource claim as possible, so it is now created one per pod, to keep track of the allocation results for the dynamic extended resource requests in the pod.

kep-number: 5004
authors:
- "@yliaog"
owning-sig: sig-scheduling
Member:

You'll need to move it to the sig-scheduling directory, too.

Also add the PRR file (I can be the reviewer)

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 6, 2025
* It is a singleton. There is at most one resource claim object for a given
extended resource in a given namespace.
* It is not owned by a pod, its owner reference is null.
* Its field `status.allocation.devices` is used, other fields are unused,
Member:

The DeviceRequestAllocationResult has a Request field that is a reference to the request in the ResourceClaim spec. But in this situation, there is no corresponding request in the spec (since the spec is empty). Drivers might use this value, among other things for looking up any device configuration. How do we plan to handle this?

Author:

A DRA driver has two parts:
1. The first part handles DRA resource claim actuation.
2. The second part handles extended resource actuation.

The first part does not need to actuate on this special resource claim object; the claim is ignored there.

The second part needs to read the list of allocated devices from the special resource claim object, and use only the devices in that list for extended resource actuation.

Contributor:

This pushes the responsibility of choosing a device back into the DRA driver. I don't think this is the right direction from an architectural perspective.

cc @klueska

I also very much dislike that DRA drivers need to be updated at all.

@johnbelamaric (Member) commented Feb 7, 2025:

No, I think you misunderstood (or I did). As I read it, the allocations are made by the scheduler and stored in the special singleton resource claim, not decided by the driver.

Author:

The DRA driver has to be updated to advertise devices as an 'extended resource'.

That said, I revised the design based on the feedback, thanks for the comments!

A cluster node installs either a DRA driver or a device plugin driver for a given named resource. Devices are picked at scheduling time, and communicated to the kubelet and DRA driver through the special resource claim.

Contributor:

No, I think you misunderstood (or I did). As I read it, the allocations are made by the scheduler and stored in the special singleton resource claim, not decided by the driver.

That's a bit different from what I understood. If it's still the scheduler which picks devices and tells the driver about it, then it's fine.

From #5136 (review):

When you and I talked on Friday, we discussed allowing kubelet to remain unchanged. Instead, it would call the Device Plugin grpc API for those with extended resources, but since that grpc API would be handled by the DRA driver, it would know when receiving calls on that API, it should look for the special resource claim.

That probably goes back to my comment above: "I also very much dislike that DRA drivers need to be updated at all."

Why should we put additional work on all DRA drivers, now and in the future, if we can instead do something once in the kubelet? It has implications for the graduation of this feature, but should that be a deciding factor?

I know that I've said that we want to keep the kubelet as dumb as possible, but in this case I think it's simpler overall to do this in the kubelet. I'm also a bit worried about the implications of skipping some of the usual admission checks that the kubelet does for claims, like "reserved for". Perhaps that doesn't matter, but then we need to explain in the KEP why.

Author:

My initial thought was to minimize kubelet changes and shift the work to the device driver. But after some more thought, I agree it's better to do the work once, and do it right, in the kubelet, so the device driver has less work to do and fewer chances to implement it wrongly.

@pohly (Contributor) commented Feb 6, 2025:

/cc


## Proposal

The basic idea is the following:
@johnbelamaric (Member) commented Feb 6, 2025:

I have been thinking about this a bit more. I think the essential idea here is to have the scheduler account for the resources in both the traditional extended resources AND in the DRA resources. So, the scheduler needs to know how to map between extended resource strings and DRA devices. The questions that come in are:

  1. Where do we store that mapping? It could be its own resource or configuration stanza, the DeviceClass, or the ResourceSlice. The current proposal is in the ResourceSlice.
  2. Where do we store the allocation? The current proposal has a special singleton ResourceClaim in a namespace for tracking this for all Pods. As @mortent pointed out, that may not work because drivers need to map between the allocations and the specific requests values, so that they can place devices in the correct containers. Instead, we may need to generate a per-Pod ResourceClaim, just like we do if there is a ResourceClaimTemplate for the Pod.
  3. If we generate a per-Pod ResourceClaim, when do we do that? With a webhook as described in the alternatives? Or with the ResourceClaim controller as described in the current proposal and as is done for ResourceClaimTemplates?
  4. If we generate a per-Pod ResourceClaim in the ResourceClaim controller, how do we know in the scheduler to defer scheduling until that is created? I am guessing today we know if there is a ResourceClaimTemplate reference from the PodSpec, we have to wait for it to exist in order to schedule. But in this case for these implicit templates that exist because requests were made via the extended resources, we would have to have the same logic in the scheduler plugin to "know" that it should wait for the RC generation.
  5. How does all of this work if there are existing nodes with the device plugin running that provide that extended resource? If we generate a per-pod RC, then ONLY the DRA-based drivers will be able to handle the request. Is that OK?

Member:

Generating a per-pod ResourceClaim that looks "normal" seems like it would be ideal, but I'm not sure if that is possible. Do we have enough information from the Pod and the ResourceSlice to generate a basic ResourceClaim with a proper spec?

If we let the ResourceClaim controller generate the ResourceClaim, we might be able to have the DynamicResources scheduler plugin delay scheduling of the Pod until the ResourceClaim has been generated. But it would require the plugin to be able to tell which Pods should use the "extended resource to DRA" bridge, since we can't delay just by looking for the pod spec.

Member:

I think we do have enough information, if we add DeviceClass somewhere. What we need for each request is: the device class, the container, the count. Since the resource request is in the container already, we know that. The count is also there already. It's the mapping to device class that we need to add somewhere.

In the current proposal, the mapping is between devices in the ResourceSlice and the extended resource names. We would need to add DeviceClass. However, if possible, it would be nice to avoid having to have a separate device class for every extended resource; the original "put the mapping in deviceclass" idea has that problem. But in this proposal, we could maybe allow the ResourceSlice to specify one of the following for each mapped resource name:

  • deviceclassname
  • deviceclassname + cel expression

I think this would allow the accounting to be done per extended resource name as this KEP suggests, while still allowing multiple extended resources to map to the same device class plus a CEL selector expression.

Note that there is no reason this need apply ONLY to extended resources...we could map any resource name this way, which provides a path toward managing standard resources.
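As a sketch of that idea (every field under `extendedResourceMappings` is hypothetical — no such mapping exists in the ResourceSlice API today, and the driver and attribute names are made up), the per-resource-name mapping might look like:

```yaml
# Hypothetical ResourceSlice stanza mapping extended resource names to a
# device class, optionally narrowed by a CEL selector expression.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: node-a-gpu-slice
spec:
  nodeName: node-a
  driver: gpu.example.com
  extendedResourceMappings:          # hypothetical field
    - extendedResourceName: foo.domain/bar
      deviceClassName: gpu.example.com
    - extendedResourceName: foo.domain/bar-large
      deviceClassName: gpu.example.com
      selector:
        cel:
          expression: 'device.attributes["gpu.example.com"].memory >= quantity("40Gi")'
```

This shape would let several extended resource names share one DeviceClass while still distinguishing them with per-name selectors.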

@pohly (Contributor) commented Feb 7, 2025:

Make DeviceClassName optional in a ResourceClaim?

Sorry, couldn't resist.

Contributor:

Here's a different thought. Why don't we make the scheduler responsible for creating a per-pod ResourceClaim for the extended resource requests in a pod if and only if needed?

The requests in that claim can be empty. This also means that we don't need to name or create any DeviceClass. Instead, the scheduler directly sets the allocation result. The allocation result may need a request name in some places. I think we can simply say "extended-resource" and it should be okay. Perhaps the container name might also be useful (see below).

This special claim only gets created if the scheduler decides to use some devices advertised in a ResourceSlice, but not sooner. This gives the scheduler the freedom to use either normal extended resources advertised by a Device Plugin on node A or DRA devices on node B.

Once the claim is created, the status records that the devices are allocated. No changes needed in the kubelet or DRA drivers: they see a normal claim and proceed accordingly.

I haven't thought in detail about how this could be implemented in the scheduler plugin, but it seems feasible to me.

There are also questions around how the kubelet then decides where to make which CDI devices available. This may depend on using the container name I mentioned above.

Also, how do kubelet and the scheduler find the generated claim? We need to record it in the pod status with a PodResourceClaimStatus, which again may depend on some special name.

Finally, there's the question of error recovery when binding to the node fails and the scheduler needs to try elsewhere. The existing "deallocate allocated claim" path may help.
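A rough sketch of what the scheduler-generated per-pod claim and its linkage in the pod status might look like under this proposal. The `extended-resource` request name, the generated claim name, and the allocation values are assumptions, not settled API; the field layout follows the existing ResourceClaim and PodResourceClaimStatus types.

```yaml
# Scheduler-generated per-pod claim: empty spec, allocation set directly
# by the scheduler, reserved for the pod it was generated for.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: my-pod-extended-resources    # hypothetical generated name
  namespace: default
spec: {}
status:
  allocation:
    devices:
      results:
        - request: extended-resource # placeholder request name, assumed
          driver: gpu.example.com    # hypothetical driver
          pool: node-b
          device: gpu-1
  reservedFor:
    - resource: pods
      name: my-pod
---
# The pod status records the generated claim under a special name so the
# kubelet and scheduler can find it (PodResourceClaimStatus).
status:
  resourceClaimStatuses:
    - name: extended-resource        # special name, assumed
      resourceClaimName: my-pod-extended-resources
```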

Contributor:

But wouldn't that require changes to drivers to understand how to map these things to containers

That mapping is done by the kubelet. The kubelet obviously needs to know about this feature, not least to prevent it from trying to invoke a Device Plugin which won't exist on the node. Associating each allocated device with a container instead of a request should work.

In my proposal, there will be "a full claim" as far as the DRA driver is concerned, so NodePrepareResources should work. We could even support configs, although where they come from would need to be determined.

I think the eventual solution will need to handle clusters that have some nodes running the old Device Plugin, and some nodes running the new DRA driver

Yes, I was assuming that. Seems like a reasonable simplification to me.

Author:

Thanks all for the comments. I agree it is better to have the special claim as close to normal as possible. I revised the KEP to clarify that it supports a cluster where one node has a DRA driver installed and another node has a device plugin installed, both advertising the same name. I have not thought about how to support configs; my current thinking is that config should be done through regular DRA. The extended resource is a shorthand for a pool of devices, so we should keep it that simple.

Contributor:

I think the fundamental issue being discussed here would be resolved with my proposal above, i.e. enable this feature by introducing an API server object that provides a mapping between an extended resource name and a DeviceClass. Using this, it's obvious how to construct a one-off resource claim for the device.

Contributor:

We still need to do that construction in the scheduler if we want to support clusters where some nodes have a device plugin and others have a DRA driver. At that point it's clear what needs to be in that claim, also without this mapping to DeviceClass.

But I like that proposal for other reasons, see #5136 (comment).

Author:

Where that mapping is added (slice, device class, or a new API object) can be discussed, but that only solves one problem, i.e. how to map a device to an extended resource name.

The other problems still exist and need to be resolved:
1. Where to store the allocation results (a special claim)?
2. Who creates that special claim (the scheduler)?
3. What is the special claim's scope — per pod, per extended resource?
4. How does the scheduler noderesource plugin allow the node with the DRA-backed device to fit the pod?
5. How does the scheduler dynamic resource plugin do the proper accounting and allocation?
6. When (at what scheduling framework phase) is the claim created?
7. How to link the pod to the special claim?
8. How does the kubelet admit the pod to the node (its admission logic has to change)?
9. How does the kubelet pass the devices to the containers inside the pod?

@yliaog yliaog force-pushed the master branch 11 times, most recently from 7ccd621 to a1d3c16 Compare February 6, 2025 23:30

@yliaog yliaog force-pushed the master branch 3 times, most recently from 2f3ec06 to 388ce87 Compare February 12, 2025 01:24
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 12, 2025
@yliaog yliaog force-pushed the master branch 2 times, most recently from 38a2e05 to 3867a9b Compare February 12, 2025 01:40
@yliaog yliaog force-pushed the master branch 3 times, most recently from 9b79ede to 5fad8d7 Compare February 12, 2025 19:16
@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/windows Categorizes an issue or PR as relevant to SIG Windows. labels Mar 7, 2025
yliaog and others added 13 commits March 7, 2025 20:00
a resourceclaim by scheduler for each dynamic extended resource in pre-binding phase.
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yliaog
Once this PR has been reviewed and has the lgtm label, please ask for approval from johnbelamaric. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor):

@yliaog: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-verify | 2ca8341 | link | true | /test pull-enhancements-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/ui Categorizes an issue or PR as relevant to SIG UI. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Status: Needs Triage
Status: Needs Review
Development

9 participants