DRA: Handle extended resource requests via DRA Driver #5004

Open
4 tasks
klueska opened this issue Dec 17, 2024 · 12 comments
Labels: lead-opted-in, sig/node, sig/scheduling, stage/alpha, wg/device-management

Comments

@klueska
Contributor

klueska commented Dec 17, 2024

Enhancement Description

  • One-line enhancement description (can be used as a release note):
    Allow DRA drivers to honor requests made via the extended resource API (e.g. nvidia.com/gpu: 2) rather than requiring a standard device plugin be used.

  • Kubernetes Enhancement Proposal:

  • Discussion Link:

  • Primary contact (assignee):
    @klueska, @pohly, @johnbelamaric

  • Responsible SIGs:
    /sig node
    /wg device-management

  • Enhancement target (which target equals which milestone):

    • Alpha release target: 1.33
    • Beta release target: 1.34
    • Stable release target: 1.35
  • Alpha

    • KEP (k/enhancements) update PR(s):
      • TBD
    • Code (k/k) update PR(s):
      • TBD
    • Docs (k/website) update PR(s):
      • TBD
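
For readers unfamiliar with the extended resource API mentioned in the one-line description: a request such as nvidia.com/gpu: 2 is a domain-qualified resource name with an integer count in a container's limits. A minimal sketch in Go, assuming a simplified detection rule (domain-qualified and outside kubernetes.io) and a stand-in `Limits` type; neither is the real Kubernetes API type or helper, and both names are hypothetical:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Limits is a simplified stand-in for a container's resource limits
// (resource name -> integer quantity). It is NOT the real Kubernetes type.
type Limits map[string]int64

// isExtendedResource applies a simplified rule (an assumption for this
// sketch): the name is domain-qualified and is not in the kubernetes.io domain.
func isExtendedResource(name string) bool {
	return strings.Contains(name, "/") && !strings.Contains(name, "kubernetes.io/")
}

// extendedRequests returns the extended resource names found in limits,
// sorted so the output is stable.
func extendedRequests(limits Limits) []string {
	var names []string
	for name := range limits {
		if isExtendedResource(name) {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// The example request from the description, next to two native resources.
	limits := Limits{"cpu": 4, "memory": 8 << 30, "nvidia.com/gpu": 2}
	for _, name := range extendedRequests(limits) {
		fmt.Printf("%s: %d\n", name, limits[name])
	}
}
```

These are the requests that the enhancement proposes a DRA driver could satisfy in place of a device plugin.
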
@k8s-ci-robot added the sig/node and wg/device-management labels Dec 17, 2024
@johnbelamaric
Member

+1 yes please!

@johnbelamaric
Member

johnbelamaric commented Dec 17, 2024

We need to sort out the requirements. A few initial questions:

  1. For newly created pods, I think it's clear we want this to be transparent. Existing manifests that use the extended resource API should continue to work as before, without modification.
  2. Can we handle this invisibly in the driver layer, or do we need to have DRA invoked at the control plane level and select the specific devices? If we don't, we will likely have a race condition, unless the scheduler can do some magical accounting (which seems possible).
  3. How do we handle upgrades? If a node is running a device plugin and we switch to the DRA driver (or upgrade to a driver that supports both), do the pods have to be deleted? Do they automatically adopt the devices? If so, how do we write those back to the allocation logic, since no DRA claim exists?
  4. What happens if there are pods in a deployment, and some land on nodes with a device plugin and some on nodes with DRA drivers?
  5. We talked about letting specific device classes be advertised as specific extended resources. This could mean the existing resource names get mapped to specific device classes by the admin. It could also mean we have a convention like deviceclass.k8s.io/foo: 4 for extended resource names. How do these choices interplay with the questions above?
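
The two naming options in question 5 can be sketched as a single lookup. This is only an illustration under stated assumptions, not the KEP's design: the `adminMapping` type, the `deviceClassFor` function, the precedence of the admin mapping over the convention, and the class name `example-gpu-class` are all hypothetical. Only the `deviceclass.k8s.io/` prefix comes from the discussion above.

```go
package main

import (
	"fmt"
	"strings"
)

// conventionPrefix is the naming convention floated in the discussion.
const conventionPrefix = "deviceclass.k8s.io/"

// adminMapping is a hypothetical admin-provided table from existing
// extended resource names to device class names.
type adminMapping map[string]string

// deviceClassFor resolves an extended resource name to a device class name.
// Assumption for this sketch: an explicit admin mapping takes precedence,
// and the convention prefix is the fallback.
func deviceClassFor(resource string, mapping adminMapping) (string, bool) {
	if class, ok := mapping[resource]; ok {
		return class, true
	}
	if strings.HasPrefix(resource, conventionPrefix) {
		class := strings.TrimPrefix(resource, conventionPrefix)
		if class != "" {
			return class, true
		}
	}
	return "", false
}

func main() {
	mapping := adminMapping{"nvidia.com/gpu": "example-gpu-class"}
	for _, resource := range []string{"nvidia.com/gpu", "deviceclass.k8s.io/foo", "cpu"} {
		if class, ok := deviceClassFor(resource, mapping); ok {
			fmt.Printf("%s -> device class %s\n", resource, class)
		} else {
			fmt.Printf("%s -> not handled via DRA\n", resource)
		}
	}
}
```

With the admin mapping, existing manifests keep their resource names (question 1); with the convention alone, manifests would have to adopt the new prefix.
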

@lengrongfu
Member

Could each DRA driver implement a webhook that creates a ResourceClaimTemplate when a pod is created and rewrites how the pod requests its resources?

@klueska
Contributor Author

klueska commented Jan 7, 2025

@lengrongfu that is what this KEP would be designed to avoid. There would be integrated scheduler support for all drivers, rather than requiring each DRA driver to provide a webhook.

@alculquicondor
Member

Open questions (from SIG Scheduling meeting):

  • How to handle resource quotas.
  • Scheduling throughput (API requests and overall processing).

@ffromani
Contributor

/cc

@yliaog

yliaog commented Jan 28, 2025

/cc

@johnbelamaric
Member

/sig scheduling

@k8s-ci-robot added the sig/scheduling label Jan 30, 2025
@github-project-automation moved this to Needs Triage in SIG Scheduling Jan 30, 2025
@johnbelamaric
Member

/assign @yliaog

Yu, I am assigning this to you; let me know if that's OK.

@johnbelamaric moved this from 🆕 New to 🏗 In progress in SIG Node: Dynamic Resource Allocation Feb 4, 2025
@haircommander
Contributor

/label lead-opted-in
/milestone v1.33

Note: PRR freeze is tomorrow! You need to have a KEP update for this opened before then. Thanks!

@k8s-ci-robot added this to the v1.33 milestone Feb 5, 2025
@k8s-ci-robot added the lead-opted-in label Feb 5, 2025
@johnbelamaric
Member

/stage alpha

@k8s-ci-robot added the stage/alpha label Feb 5, 2025
@dipesh-rawat
Member

Hello @klueska @pohly @johnbelamaric @yliaog 👋, v1.33 Enhancements team here.

Just checking in as we approach enhancements freeze on 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025.

This enhancement is targeting stage alpha for v1.33 (correct me, if otherwise)

Here's where this enhancement currently stands:

  • KEP readme using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable for latest-milestone: v1.32.
  • KEP readme has up-to-date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements. (For more information on the PRR process, check here). If your production readiness review is not completed yet, please make sure to fill the production readiness questionnaire in your KEP by the PRR Freeze deadline on Thursday 6th February 2025 so that the PRR team has enough time to review your KEP.

For this KEP, we would need to update the following:

  • Create the KEP readme using the latest template and merge it in the k/enhancements repo.
  • Ensure that the KEP has undergone a production readiness review and has been merged into k/enhancements.

The status of this enhancement is marked as At risk for enhancements freeze. Please keep the issue description up-to-date with the appropriate stages as well.

If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!

@dipesh-rawat moved this to At risk for enhancements freeze in 1.33 Enhancements Tracking Feb 6, 2025