KEP-5677: DRA Resource Availability Visibility#5749
KEP-5677: DRA Resource Availability Visibility#5749nmn3m wants to merge 7 commits intokubernetes:masterfrom
Conversation
|
Skipping CI for Draft Pull Request. |
ca95081 to
d9ac678
Compare
|
@nmn3m: GitHub didn't allow me to request PR reviews from the following users: kubernetes/sig-scheduling, kubernetes/sig-node, kubernetes/sig-cli, kubernetes/wg-device-management. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
d9ac678 to
495b6cb
Compare
|
/cc @johnbelamaric |
|
/cc @kubernetes/sig-cli-kubectl-maintainers |
|
/wg device-management |
keps/sig-scheduling/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
johnbelamaric
left a comment
There was a problem hiding this comment.
First pass, this is looking really really good to me so far
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-scheduling/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-scheduling/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-scheduling/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
495b6cb to
fdbf949
Compare
|
|
@liggitt WDYT? An always available, in-tree approach is preferable, but I am not sure there's a great option for that. Here's what I could think of:
The first one seems promising if we want to do this in-tree. I actually think it's fine, and there is precedent for similar "imperative operations through declarative APIs" with things like CSR and even the way device taints with "None" effect works. It gives us the ability to controller permissions on the object, too. For out-of-tree (could be in k-sigs), we could implement JUST a kubectl plugin and rely on user permissions, to start. And add in some aggregated API server later, if we see the need. The advantage of in-tree: always available and in-sync with K8s releases, all users can depend on it. Disadvantage: locked to K8s release cycle. Advantage of out-of-tree: we can implement it independently of the release cycle. My preference: the first in-tree option. |
…tate enum, add Security Considerations and Consistency Handling sections
Signed-off-by: Nour <nurmn3m@gmail.com>
5708e86 to
4949759
Compare
4949759 to
09cfcc3
Compare
|
@johnbelamaric , Thanks for the detailed design options! I've updated the KEP to implement your first |
09cfcc3 to
720ae52
Compare
|
/cc @kannon92 |
Signed-off-by: Nour <nurmn3m@gmail.com>
720ae52 to
b5ca902
Compare
kannon92
left a comment
There was a problem hiding this comment.
Please check over the questionaire. There are missing questions.
|
|
||
| Items marked with (R) are required *prior to targeting to a milestone / release*. | ||
|
|
||
| - [ ] (R) Enhancement issue in release milestone, which links to KEP dir in |
There was a problem hiding this comment.
Please make sure to mark this off as you go.
|
|
||
| | Risk | Mitigation | | ||
| |------|------------| | ||
| | Request accumulation in etcd | Document cleanup; consider TTL for Beta | |
There was a problem hiding this comment.
If we don't get TTL should we consider ways to limit the number of ResourcePoolStatusRequests?
I worry that this will be an informational API that could be useful for scheduling decisions or even information for quota management solutions like Kueue. TTL would be useful but we should also consider
|
|
||
| #### RBAC | ||
|
|
||
| Access is controlled via standard RBAC on the ResourcePoolStatusRequest API: |
There was a problem hiding this comment.
Would the default admin permissions allow access to this resource? Or would cluster admins need to also add these ClusterRoles?
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
| 5. Validation errors detected | ||
| 6. RBAC enforced correctly | ||
|
|
||
| #### e2e tests |
There was a problem hiding this comment.
Should there also be e2e tests for kubectl?
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
keps/sig-node/5677-dra-resource-availability-visibility/README.md
Outdated
Show resolved
Hide resolved
| - Older kubectl can still create/read objects (standard API) | ||
| - No special version skew concerns | ||
|
|
||
| ## Production Readiness Review Questionnaire |
There was a problem hiding this comment.
Can you please check this questionaire and make sure to follow the template?
Missing Questions
- "Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?"
This entire question is absent from the Scalability section (after line 809).
Incomplete Answers
- "How can someone using this feature know that it is working for their instance?" (line 754-756)
The template expects structured responses using checkboxes:
- [ ] Events — with Event Reason
- [ ] API .status — with Condition name / Other field
- [ ] Other
The KEP just has a plain text answer. It should at minimum check off API .status with Condition name: Complete.
3. "What are the SLIs (Service Level Indicators)?" (line 764-767)
The template expects structured responses:
- [ ] Metrics — with Metric name, Aggregation method, Components exposing the metric
- [ ] Other
The KEP lists metric names but doesn't specify which component exposes them (should be kube-controller-manager), and doesn't use the
template checkbox format.
4. "Does this feature depend on any specific services running in the cluster?" (line 776-778)
The template expects a structured table/list per dependency with fields like: name, usage, impact of unavailability, impact of degraded
performance, whether it can operate with a degraded/unavailable dependency. The KEP just gives a brief bullet list.
5. "Will enabling / using this feature result in any new API calls?" (line 783-786)
The template expects specifics: API call type, estimated throughput, originating component. The KEP gives a general description without
these details.
6. "Will enabling / using this feature result in introducing new API types?" (line 788-789)
The template expects: API type, supported operations, estimated count. Only the type name is given.
7. "What are other known failure modes?" (line 819-822)
The template expects structured entries with: failure mode description, detection method, mitigations, diagnostics, testing (is it covered
by e2e tests?). The KEP uses a brief bullet list.
There was a problem hiding this comment.
I followed the rules and fixed them. Thanks for detailed feedback.
…alidation Signed-off-by: Nour <nurmn3m@gmail.com>
e435e31 to
1b123f5
Compare
| - "@liggitt" | ||
| - "@pohly" | ||
| approvers: | ||
| - TBD |
There was a problem hiding this comment.
@haircommander do you know who on sig-node signed up to approve this?
| - Other field: `status.observationTime` is set when calculation is performed | ||
| - [ ] Other (Alarm, К8s resources status) | ||
|
|
||
| ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? |
There was a problem hiding this comment.
I think you are supposed to answer this with respect to kube-controller-manager SLI/SLO.
This isn't required for beta so I won't block the approval on this.
kannon92
left a comment
There was a problem hiding this comment.
/approve
For PRR.
This is a very well thoughout proposal for alpha.
Thank you!
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: kannon92, nmn3m The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
One-line PR description: Adding KEP-5677 for DRA Resource Availability Visibility
Issue link: DRA: Resource Availability Visibility #5677