
Add KEP for DRA: Extended Resource #5136

Open — wants to merge 13 commits into base: master
Conversation

@yliaog commented Feb 5, 2025:

  • One-line PR description:
    Add new KEP for supporting extended resource requests in DRA

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Feb 5, 2025
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 5, 2025
@yliaog (Author) commented Feb 5, 2025:

/assign @johnbelamaric

@johnbelamaric (Member) left a comment:

Awesome, thanks @yliaog

pod on the node.
* It is *ignored* by the scheduler DRA plugin as it has no spec, hence
scheduler does not need to do any allocation for it.
* It is *read* by scheduler DRA plugin for the devices allocated, so that
Member:

Can you clarify the distinction between these two bullets (which seem to be saying the opposite)? Are you saying the plugin will not try to allocate it, but instead will only use it for accounting for the extended-resource-based device allocations?

Author:

Normally the scheduler DRA plugin would try to allocate devices for a ResourceClaim. This one, however, is special: it has no spec (it exists to record the devices allocated for extended resources), so the scheduler DRA plugin does not need to allocate any devices for it.

However, the scheduler DRA plugin does need to read the devices allocated to extended resources from this ResourceClaim, so that it does not consider those devices when trying to allocate devices for other DRA resource claims.

resource.kubernetes.io/extended-resource-name: foo.domain/bar
```
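For illustration, such a special claim might look roughly like the following sketch. The annotation is taken from the KEP excerpt above; the claim name and the specific allocation values are assumptions, and the status fields are modeled on the existing ResourceClaim API.

```yaml
# Sketch of the special ResourceClaim: no spec.devices.requests,
# only the annotation plus an allocation result in status.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: extended-resource-foo.domain-bar   # hypothetical name
  namespace: default
  annotations:
    resource.kubernetes.io/extended-resource-name: foo.domain/bar
spec: {}   # empty: nothing for the scheduler DRA plugin to allocate
status:
  allocation:
    devices:
      results:
        - request: extended-resource       # no matching request exists in spec
          driver: gpu.example.com          # hypothetical driver
          pool: node-a
          device: gpu-0
```

The empty spec is what makes the claim invisible to the normal allocation path, while `status.allocation.devices` is what the accounting path reads.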

The special resource claim lifecycle is managed by the resource claim
Member:

Can you explain a bit more about how the scheduler will operate on this? Like, what is the sequence diagram of how the scheduler decides to service the extended resource request via the DRA driver?

Can the cluster contain both pure device plugin based extended resources, AND DRA-based extended resources? I think it's clear with this plan that any given node will have a single driver, and with the DRA ones it can advertise as either DRA or extended resource. But if there are some nodes advertising the extended resource via device plugin, and some doing it via this mechanism, does the scheduler need to know that up front (I think it would because the resource slice exists), and then does it only update the special resource claim if it picks a node that uses DRA too? I think a few more details will help.

Are there any race conditions between the scheduler and the resource claim controller we need to worry about?

Author:

Yes, a cluster can have one node with a device plugin installed and another node with a DRA driver installed, both advertising the same resource name. I clarified that in the KEP.

The scheduler needs to know that and keep track of it in NodeInfo.

The ResourceClaim is revised to be created by the scheduler. I agree it is better to keep it as close to a regular resource claim as possible, so it is now created one per pod, to keep track of the allocation results for the dynamic extended resource requests in the pod.

kep-number: 5004
authors:
- "@yliaog"
owning-sig: sig-scheduling
Member:

You'll need to move it to the sig-scheduling directory, too.

Also add the PRR file (I can be the reviewer)

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 6, 2025
* It is a singleton. There is at most one resource claim object for a given
extended resource in a given namespace.
* It is not owned by a pod, its owner reference is null.
* Its field `status.allocation.devices` is used, other fields are unused,
Member:

The DeviceRequestAllocationResult has a Request field that is a reference to the request in the ResourceClaim spec. But in this situation, there is no corresponding request in the spec (since the spec is empty). Drivers might use this value, among other things for looking up any device configuration. How do we plan to handle this?

Author:

A DRA driver has two parts:
1. The first part handles DRA resource claim actuation.
2. The second part handles extended resource actuation.

The first part does not need to actuate on this special resource claim object; the claim is ignored there.

The second part needs to read the list of allocated devices from the special resource claim object, and use only the devices in that list for extended resource actuation.

Contributor:

This pushes the responsibility of choosing a device back into the DRA driver. I don't think this is the right direction from an architectural perspective.

cc @klueska

I also very much dislike that DRA drivers need to be updated at all.

@johnbelamaric (Member) commented Feb 7, 2025:

No, I think you misunderstood (or I did). As I read it, the allocations are made by the scheduler and stored in the special singleton resource claim, not decided by the driver.

Author:

The DRA driver has to be updated to advertise devices as an 'extended resource'.

That said, I revised the design based on the feedback, thanks for the comments!

A cluster node installs either a DRA driver or a device plugin driver for a given named resource. Devices are picked at scheduling time, and communicated to the kubelet and DRA driver through the special resource claim.

Contributor:

No, I think you misunderstood (or I did). As I read it, the allocations are made by the scheduler and stored in the special singleton resource claim, not decided by the driver.

That's a bit different from what I understood. If it's still the scheduler which picks devices and tells the driver about it, then it's fine.

From #5136 (review):

When you and I talked on Friday, we discussed allowing kubelet to remain unchanged. Instead, it would call the Device Plugin grpc API for those with extended resources, but since that grpc API would be handled by the DRA driver, it would know when receiving calls on that API, it should look for the special resource claim.

That probably goes back to my comment above: "I also very much dislike that DRA drivers need to be updated at all."

Why should we put additional work on all DRA drivers, now and in the future, if we can instead do something once in the kubelet? It has implications for the graduation of this feature, but should that be a deciding factor?

I know that I've said that we want to keep the kubelet as dumb as possible, but in this case I think it's simpler overall to do this in the kubelet. I'm also a bit worried about the implications of skipping some of the usual admission checks that the kubelet does for claims, like "reserved for". Perhaps that doesn't matter, but then we need to explain in the KEP why.

Author:

My initial thought was to minimize kubelet changes and shift the work to the device driver. But after some more thought, I agree it's better to do the work once, and do it right, in the kubelet, so the device driver has less work to do and fewer chances to implement it wrongly.

@pohly (Contributor) commented Feb 6, 2025:

/cc


## Proposal

The basic idea is the following:
@johnbelamaric (Member) commented Feb 6, 2025:

I have been thinking about this a bit more. I think the essential idea here is to have the scheduler account for the resources in both the traditional extended resources AND in the DRA resources. So, the scheduler needs to know how to map between extended resource strings and DRA devices. The questions that come in are:

  1. Where do we store that mapping? It could be its own resource or configuration stanza, the DeviceClass, or the ResourceSlice. The current proposal is in the ResourceSlice.
  2. Where do we store the allocation? The current proposal has a special singleton ResourceClaim in a namespace for tracking this for all Pods. As @mortent pointed out, that may not work because drivers need to map between the allocations and the specific requests values, so that they can place devices in the correct containers. Instead, we may need to generate a per-Pod ResourceClaim, just like we do if there is a ResourceClaimTemplate for the Pod.
  3. If we generate a per-Pod ResourceClaim, when do we do that? With a webhook as described in the alternatives? Or with the ResourceClaim controller as described in the current proposal and as is done for ResourceClaimTemplates?
  4. If we generate a per-Pod ResourceClaim in the ResourceClaim controller, how do we know in the scheduler to defer scheduling until that is created? I am guessing today we know if there is a ResourceClaimTemplate reference from the PodSpec, we have to wait for it to exist in order to schedule. But in this case for these implicit templates that exist because requests were made via the extended resources, we would have to have the same logic in the scheduler plugin to "know" that it should wait for the RC generation.
  5. How does all of this work if there are existing nodes with the device plugin running that provide that extended resource? If we generate a per-pod RC, then ONLY the DRA-based drivers will be able to handle the request. Is that OK?

Member:

Generating a per-pod ResourceClaim that looks "normal" seems like it would be ideal, but I'm not sure if that is possible. Do we have enough information from the Pod and the ResourceSlice to generate a basic ResourceClaim with a proper spec?

If we let the ResourceClaim controller generate the ResourceClaim, we might be able to have the DynamicResources scheduler plugin delay scheduling of the Pod until the ResourceClaim has been generated. But it would require the plugin to be able to tell which Pods should use the "extended resource to DRA" bridge, since we can't delay just by looking for the pod spec.

Member:

I think we do have enough information, if we add DeviceClass somewhere. What we need for each request is: the device class, the container, the count. Since the resource request is in the container already, we know that. The count is also there already. It's the mapping to device class that we need to add somewhere.

In the current proposal, the mapping is between devices in the ResourceSlice and the extended resource names. We would need to add DeviceClass. However, if possible, it would be nice to avoid having to have a separate device class for every extended resource; the original "put the mapping in deviceclass" idea has that problem. But in this proposal, we could maybe allow the ResourceSlice to specify one of the following for each mapped resource name:

  • deviceclassname
  • deviceclassname + cel expression

I think this would allow the accounting to be done per extended resource name as this KEP suggests, while still allowing multiple extended resources to map to the same device class plus a CEL selector expression.

Note that there is no reason this need apply ONLY to extended resources...we could map any resource name this way, which provides a path toward managing standard resources.
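As a sketch of that idea (every field under `extendedResourceMappings` is hypothetical — no such mapping exists in the ResourceSlice API today, and the driver and attribute names are made up), the per-resource-name mapping might look like:

```yaml
# Hypothetical ResourceSlice stanza mapping extended resource names to a
# device class, optionally narrowed by a CEL selector expression.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: node-a-gpu-slice
spec:
  nodeName: node-a
  driver: gpu.example.com
  extendedResourceMappings:          # hypothetical field
    - extendedResourceName: foo.domain/bar
      deviceClassName: gpu.example.com
    - extendedResourceName: foo.domain/bar-large
      deviceClassName: gpu.example.com
      selector:
        cel:
          expression: 'device.attributes["gpu.example.com"].memory >= quantity("40Gi")'
```

This shape would let several extended resource names share one DeviceClass while still distinguishing them with per-name selectors.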

@pohly (Contributor) commented Feb 7, 2025:

Make DeviceClassName optional in a ResourceClaim?

Sorry, couldn't resist.

Contributor:

Here's a different thought. Why don't we make the scheduler responsible for creating a per-pod ResourceClaim for the extended resource requests in a pod if and only if needed?

The requests in that claim can be empty. This also means that we don't need to name or create any DeviceClass. Instead, the scheduler directly sets the allocation result. The allocation result may need a request name in some places. I think we can simply say "extended-resource" and it should be okay. Perhaps the container name might also be useful (see below).

This special claim only gets created if the scheduler decides to use some devices advertised in a ResourceSlice, but not sooner. This gives the scheduler the freedom to use either normal extended resources advertised by a Device Plugin on node A or DRA devices on node B.

Once the claim is created, the status records that the devices are allocated. No changes needed in the kubelet or DRA drivers: they see a normal claim and proceed accordingly.

I haven't thought in detail about how this could be implemented in the scheduler plugin, but it seems feasible to me.

There are also questions around how the kubelet then decides where to make which CDI devices available. This may depend on using the container name I mentioned above.

Also, how do kubelet and the scheduler find the generated claim? We need to record it in the pod status with a PodResourceClaimStatus, which again may depend on some special name.

Finally, there's the question of error recovery when binding to the node fails and the scheduler needs to try elsewhere. The existing "deallocate allocated claim" path may help.
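A rough sketch of what the scheduler-generated per-pod claim and its linkage in the pod status might look like under this proposal. The `extended-resource` request name, the generated claim name, and the allocation values are assumptions, not settled API; the field layout follows the existing ResourceClaim and PodResourceClaimStatus types.

```yaml
# Scheduler-generated per-pod claim: empty spec, allocation set directly
# by the scheduler, reserved for the pod it was generated for.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: my-pod-extended-resources    # hypothetical generated name
  namespace: default
spec: {}
status:
  allocation:
    devices:
      results:
        - request: extended-resource # placeholder request name, assumed
          driver: gpu.example.com    # hypothetical driver
          pool: node-b
          device: gpu-1
  reservedFor:
    - resource: pods
      name: my-pod
---
# The pod status records the generated claim under a special name so the
# kubelet and scheduler can find it (PodResourceClaimStatus).
status:
  resourceClaimStatuses:
    - name: extended-resource        # special name, assumed
      resourceClaimName: my-pod-extended-resources
```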

Contributor:

But wouldn't that require changes to drivers to understand how to map these things to containers

That mapping is done by the kubelet. The kubelet obviously needs to know about this feature, not least to prevent it from trying to invoke a Device Plugin which won't exist on the node. Associating each allocated device with a container instead of a request should work.

In my proposal, there will be "a full claim" as far as the DRA driver is concerned, so NodePrepareResources should work. We could even support configs, although where they come from would need to be determined.

I think the eventual solution will need to handle clusters that have some nodes running the old Device Plugin, and some nodes running the new DRA driver

Yes, I was assuming that. Seems like a reasonable simplification to me.

Author:

Thanks all for the comments. I agree it is better to have the special claim as close to normal as possible. I revised the KEP to clarify that it supports a cluster where one node has a DRA driver installed and another node has a device plugin installed, both advertising the same name. I have not thought about how to support configs; my current thinking is that config should be done through regular DRA. The extended resource is a shorthand for a pool of devices, so we should keep it that simple.

Contributor:

I think the fundamental issue being discussed here would be resolved with my proposal above, i.e. enable this feature by introducing an API server object that provides a mapping between an extended resource name and a DeviceClass. Using this, it's obvious how to construct a one-off resource claim for the device.

Contributor:

We still need to do that construction in the scheduler if we want to support clusters where some nodes have a device plugin and others have a DRA driver. At that point it's clear what needs to be in that claim, also without this mapping to DeviceClass.

But I like that proposal for other reasons, see #5136 (comment).

Author:

Where that mapping is added (slice, device class, or a new API object) can be discussed, but that only solves one problem, i.e. how to map a device to an extended resource name.

The other problems still exist and need to be resolved:
1. Where to store the allocation results (a special claim)?
2. Who creates that special claim (the scheduler)?
3. What is the special claim's scope — per pod, per extended resource?
4. How does the scheduler noderesource plugin allow the node with the DRA-backed device to fit the pod?
5. How does the scheduler dynamic resource plugin do the proper accounting and allocation?
6. When (at what scheduling framework phase) is the claim created?
7. How to link the pod to the special claim?
8. How does the kubelet admit the pod to the node (its admission logic has to change)?
9. How does the kubelet pass the devices to the containers inside the pod?

@yliaog yliaog force-pushed the master branch 11 times, most recently from 7ccd621 to a1d3c16 Compare February 6, 2025 23:30

@yliaog yliaog force-pushed the master branch 3 times, most recently from 2f3ec06 to 388ce87 Compare February 12, 2025 01:24
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 12, 2025
@yliaog yliaog force-pushed the master branch 2 times, most recently from 38a2e05 to 3867a9b Compare February 12, 2025 01:40
@yliaog yliaog force-pushed the master branch 3 times, most recently from 9b79ede to 5fad8d7 Compare February 12, 2025 19:16
@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/windows Categorizes an issue or PR as relevant to SIG Windows. labels Mar 7, 2025
yliaog and others added 13 commits March 7, 2025 20:00
a resourceclaim by scheduler for each dynamic extended resource in pre-binding phase.
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yliaog
Once this PR has been reviewed and has the lgtm label, please ask for approval from johnbelamaric. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor):

@yliaog: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-verify | 2ca8341 | link | true | /test pull-enhancements-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/ui Categorizes an issue or PR as relevant to SIG UI. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Status: Needs Triage
Status: Needs Review
Development

9 participants