
KEP-5677: DRA Resource Availability Visibility #5749

Open
nmn3m wants to merge 7 commits into kubernetes:master from nmn3m:kep-5677-dra-resource-availability-visibility

Conversation

@nmn3m (Member) commented Dec 23, 2025

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Dec 23, 2025
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 23, 2025
@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from ca95081 to d9ac678 Compare December 29, 2025 23:26
@nmn3m nmn3m marked this pull request as ready for review December 29, 2025 23:31
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 29, 2025
@k8s-ci-robot (Contributor)

@nmn3m: GitHub didn't allow me to request PR reviews from the following users: kubernetes/sig-scheduling, kubernetes/sig-node, kubernetes/sig-cli, kubernetes/wg-device-management.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.


In response to this:

/cc @kubernetes/sig-scheduling
/cc @kubernetes/sig-node
/cc @kubernetes/sig-cli
/cc @kubernetes/wg-device-management

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from d9ac678 to 495b6cb Compare December 29, 2025 23:36
@nmn3m (Member Author) commented Dec 29, 2025

/cc @johnbelamaric
/cc @pohly

@nmn3m (Member Author) commented Dec 29, 2025

/cc @kubernetes/sig-cli-kubectl-maintainers

@mortent (Member) commented Jan 6, 2026

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jan 6, 2026
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jan 7, 2026
@johnbelamaric (Member) left a comment

First pass, this is looking really really good to me so far

@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from 495b6cb to fdbf949 Compare January 14, 2026 22:37
@nmn3m (Member Author) commented Feb 5, 2026

> Is there an existing pattern for kubectl to reach KCM endpoints?

I don't think so.

> Should we reconsider the out-of-tree aggregated API server approach to avoid this connectivity challenge?

That would be the more standard approach if someone were to build this out-of-tree. We are looking for in-tree, out-of-the-box support. It might still be the best solution, though.

> Any other approaches I should explore?

We could add something directly to the apiserver. I am not sure how I feel about that. It adds some overhead there even when not in use, and it's a fairly central component where we don't want to risk causing crashes.

1. If we go with the apiserver, what would be the preferred pattern?
   - A new endpoint under /apis/resource.k8s.io/?
   - Something else?
2. For the out-of-tree approach, would it make sense to start there and potentially graduate to in-tree later if there's demand?

@johnbelamaric (Member)

@liggitt WDYT?

An always available, in-tree approach is preferable, but I am not sure there's a great option for that. Here's what I could think of:

  1. A "request" is made by creating a ResourcePool object, or maybe it's even called something like "ResourcePoolStatusRequest" or something. A controller runs in KCM that sees that request, makes the calculation, and writes the result to the object's status, where it can be observed by the user. It is a one-time operation with a timestamp. To recalculate, the user has to delete and recreate the object. The object probably should be cluster scoped.
  2. A specialized API endpoint built into API server. I suspect this is a no-go. Jordan, is there any precedent for that?
  3. A specialized API endpoint in KCM that is then exposed via an aggregated API configuration. Jordan, any precedent?

The first one seems promising if we want to do this in-tree. I actually think it's fine, and there is precedent for similar "imperative operations through declarative APIs" with things like CSR and even the way device taints with the "None" effect work. It gives us the ability to control permissions on the object, too.
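As a sketch of that first option, the flow might look like the manifest below. This is illustrative only: "ResourcePoolStatusRequest" is the name floated above, but the group/version, spec field, and status shape are assumptions, not a settled API.

```yaml
# Hypothetical, cluster-scoped request object; the apiVersion and all
# field names below are illustrative assumptions, not a final schema.
apiVersion: resource.k8s.io/v1alpha1
kind: ResourcePoolStatusRequest
metadata:
  name: gpu-pool-availability
spec:
  # Hypothetical selector naming the pool to evaluate.
  poolName: example.com/gpu-pool

# A controller in kube-controller-manager would see the request,
# perform the calculation once, and write the result to status, e.g.:
#
# status:
#   observationTime: "2026-02-07T13:00:00Z"
#   conditions:
#   - type: Complete
#     status: "True"
#
# It is a one-time operation: to recalculate, the user deletes and
# recreates the object.
```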

For out-of-tree (could be in k-sigs), we could implement JUST a kubectl plugin and rely on user permissions, to start. And add in some aggregated API server later, if we see the need.

The advantage of in-tree: always available and in-sync with K8s releases, all users can depend on it. Disadvantage: locked to K8s release cycle.

Advantage of out-of-tree: we can implement it independently of the release cycle.

My preference: the first in-tree option.

@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from 5708e86 to 4949759 Compare February 7, 2026 13:06
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 7, 2026
@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from 4949759 to 09cfcc3 Compare February 7, 2026 13:13
@nmn3m (Member Author) commented Feb 7, 2026

@johnbelamaric, thanks for the detailed design options! I've updated the KEP to implement your first suggestion, the CSR-like pattern with ResourcePoolStatusRequest.

@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from 09cfcc3 to 720ae52 Compare February 7, 2026 13:21
@nmn3m nmn3m requested review from mortent and pohly February 7, 2026 13:22
@nmn3m (Member Author) commented Feb 7, 2026

/cc @kannon92

@k8s-ci-robot k8s-ci-robot requested a review from kannon92 February 7, 2026 13:25
Signed-off-by: Nour <nurmn3m@gmail.com>
@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from 720ae52 to b5ca902 Compare February 8, 2026 19:22
@kannon92 (Contributor) left a comment

Please check over the questionnaire. There are missing questions.


Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in
Contributor: Please make sure to mark this off as you go.


| Risk | Mitigation |
|------|------------|
| Request accumulation in etcd | Document cleanup; consider TTL for Beta |
Contributor: If we don't get TTL, should we consider ways to limit the number of ResourcePoolStatusRequests?

I worry that this will be an informational API that could be useful for scheduling decisions or even information for quota management solutions like Kueue. TTL would be useful but we should also consider


#### RBAC

Access is controlled via standard RBAC on the ResourcePoolStatusRequest API:
Contributor: Would the default admin permissions allow access to this resource? Or would cluster admins need to also add these ClusterRoles?
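As one illustration of how such access could be granted, a cluster admin might use a standard ClusterRole and binding like the sketch below. The API group, resource plural, role name, and subject group are all assumptions made for the example, not the KEP's actual RBAC definitions.

```yaml
# Illustrative only: group, resource plural, and names are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: resourcepoolstatusrequest-user
rules:
- apiGroups: ["resource.k8s.io"]
  resources: ["resourcepoolstatusrequests"]
  verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: resourcepoolstatusrequest-users
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: resourcepoolstatusrequest-user
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: dra-operators  # hypothetical group; scope to real users in practice
```

Whether the built-in aggregated admin/edit/view roles would pick up the new resource by default is exactly the open question raised in the review.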

5. Validation errors detected
6. RBAC enforced correctly

#### e2e tests
Contributor: Should there also be e2e tests for kubectl?

- Older kubectl can still create/read objects (standard API)
- No special version skew concerns

## Production Readiness Review Questionnaire
Contributor: Can you please check this questionnaire and make sure to follow the template?

Missing Questions

1. "Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?"
   This entire question is absent from the Scalability section (after line 809).

Incomplete Answers

1. "How can someone using this feature know that it is working for their instance?" (lines 754-756)
   The template expects structured responses using checkboxes:
   - [ ] Events — with Event Reason
   - [ ] API .status — with Condition name / Other field
   - [ ] Other
   The KEP just has a plain-text answer. It should at minimum check off API .status with Condition name: Complete.
2. "What are the SLIs (Service Level Indicators)?" (lines 764-767)
   The template expects structured responses:
   - [ ] Metrics — with Metric name, Aggregation method, Components exposing the metric
   - [ ] Other
   The KEP lists metric names but doesn't specify which component exposes them (should be kube-controller-manager), and doesn't use the template checkbox format.
3. "Does this feature depend on any specific services running in the cluster?" (lines 776-778)
   The template expects a structured table/list per dependency with fields like: name, usage, impact of unavailability, impact of degraded performance, and whether the feature can operate with a degraded/unavailable dependency. The KEP just gives a brief bullet list.
4. "Will enabling / using this feature result in any new API calls?" (lines 783-786)
   The template expects specifics: API call type, estimated throughput, originating component. The KEP gives a general description without these details.
5. "Will enabling / using this feature result in introducing new API types?" (lines 788-789)
   The template expects: API type, supported operations, estimated count. Only the type name is given.
6. "What are other known failure modes?" (lines 819-822)
   The template expects structured entries with: failure mode description, detection method, mitigations, diagnostics, and testing (is it covered by e2e tests?). The KEP uses a brief bullet list.

Member Author: I followed the rules and fixed them. Thanks for the detailed feedback.

Contributor: Thanks, looks great.

@github-project-automation github-project-automation bot moved this from Backlog to Needs Review in SIG Scheduling Feb 8, 2026
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 8, 2026
…alidation

Signed-off-by: Nour <nurmn3m@gmail.com>
@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from e435e31 to 1b123f5 Compare February 8, 2026 21:46
@nmn3m nmn3m requested a review from kannon92 February 8, 2026 21:48
- "@liggitt"
- "@pohly"
approvers:
- TBD
Contributor: @haircommander do you know who on sig-node signed up to approve this?

- Other field: `status.observationTime` is set when calculation is performed
- [ ] Other (Alarm, K8s resources status)

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
Contributor:

https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md#kubernetes-slisslos

I think you are supposed to answer this with respect to the kube-controller-manager SLI/SLO.

This isn't required for beta, so I won't block the approval on this.

@kannon92 (Contributor) left a comment

/approve

For PRR.
This is a very well thought-out proposal for alpha.

Thank you!

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92, nmn3m
Once this PR has been reviewed and has the lgtm label, please ask for approval from mrunalp. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Labels

- cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
- kind/kep - Categorizes KEP tracking issues and PRs modifying the KEP directory.
- sig/node - Categorizes an issue or PR as relevant to SIG Node.
- sig/scheduling - Categorizes an issue or PR as relevant to SIG Scheduling.
- size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.
- wg/device-management - Categorizes an issue or PR as relevant to WG Device Management.

Projects

Status: Backlog
Status: 👀 In review
Status: Needs Review


9 participants