diff --git a/keps/prod-readiness/sig-node/5677.yaml b/keps/prod-readiness/sig-node/5677.yaml new file mode 100644 index 000000000000..7081c1d5945b --- /dev/null +++ b/keps/prod-readiness/sig-node/5677.yaml @@ -0,0 +1,3 @@ +kep-number: 5677 +alpha: + approver: "@kannon92" diff --git a/keps/sig-node/5677-dra-resource-availability-visibility/README.md b/keps/sig-node/5677-dra-resource-availability-visibility/README.md new file mode 100644 index 000000000000..886ec7147236 --- /dev/null +++ b/keps/sig-node/5677-dra-resource-availability-visibility/README.md @@ -0,0 +1,998 @@ +# KEP-5677: DRA Resource Availability Visibility + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [Architecture](#architecture) + - [User Stories](#user-stories) + - [Story 1: Cluster Administrator Checking Pool Status](#story-1-cluster-administrator-checking-pool-status) + - [Story 2: Developer Debugging Resource Allocation](#story-2-developer-debugging-resource-allocation) + - [Story 3: Automation and Monitoring](#story-3-automation-and-monitoring) + - [Notes/Constraints/Caveats](#notesconstraintscaveats) + - [Risks and Mitigations](#risks-and-mitigations) + - [Scaling Risks](#scaling-risks) + - [Operational Risks](#operational-risks) + - [Security Considerations](#security-considerations) + - [RBAC](#rbac) + - [Information Exposure](#information-exposure) + - [Security Risks](#security-risks) + - [Controller Security](#controller-security) + - [Future Consideration: Namespace-scoped Requests](#future-consideration-namespace-scoped-requests) +- [Design Details](#design-details) + - [API Definition](#api-definition) + - [ResourcePoolStatusRequest Object](#resourcepoolstatusrequest-object) + - [Spec Fields](#spec-fields) + - [Status Fields](#status-fields) + - [Controller Implementation](#controller-implementation) + - [Controller in KCM](#controller-in-kcm) + - [One-time Processing](#one-time-processing) + - [Reusing Existing Informers](#reusing-existing-informers) + - [kubectl Integration](#kubectl-integration) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Alternative 1: Out-of-tree Aggregated API Server](#alternative-1-out-of-tree-aggregated-api-server) + - [Alternative 2: Synchronous Review Pattern](#alternative-2-synchronous-review-pattern) + - [Alternative 3: Status in ResourceSlice](#alternative-3-status-in-resourceslice) + - [Alternative 4: Client-side only](#alternative-4-client-side-only) +- [Infrastructure Needed](#infrastructure-needed) + + +## Release Signoff Checklist + +Items marked with 
(R) are required *prior to targeting to a milestone / release*. + +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in + [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and + SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests] + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints] must be hit by [Conformance Tests] +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [x] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for + publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to + mailing list discussions/SIG meetings, relevant PRs/issues, release notes + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website +[Conformance Tests]: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md +[all GA Endpoints]: https://github.com/kubernetes/community/pull/1806 + +## Summary + +This KEP addresses a visibility gap in Dynamic Resource Allocation (DRA) by +enabling users to view available device capacity across resource pools. While +ResourceSlices store capacity data and ResourceClaims track consumption, there +is currently no straightforward way for users to view the available capacity +remaining in a pool or on a node. + +This enhancement introduces a **ResourcePoolStatusRequest** API following the +CertificateSigningRequest (CSR) pattern: + +1. User creates a ResourcePoolStatusRequest object specifying a driver (required) and optional pool filter +2. A controller in kube-controller-manager watches for new requests +3. Controller computes pool availability and writes result to status +4. User reads the status to see pool availability +5. To recalculate, user deletes and recreates the request + +This in-tree approach was chosen based on API review feedback to: +- Provide an always-available, in-sync solution with Kubernetes releases +- Follow established patterns (CSR, device taints with "None" effect) +- Control permissions via standard RBAC on the request object +- Avoid continuous controller overhead (one-time computation per request) + +## Motivation + +Dynamic Resource Allocation (DRA) provides a flexible framework for managing +specialized hardware resources like GPUs, FPGAs, and other accelerators. +However, the current implementation lacks visibility into resource availability: + +**Current State:** +- ResourceSlices are cluster-scoped resources that publish total capacity of + devices in a pool +- ResourceClaims are namespaced and track individual allocations +- Users with limited RBAC permissions cannot see ResourceClaims outside their + namespace +- No API-level view of "available" vs "allocated" capacity +- Difficult to understand why scheduling is failing or plan capacity + +**Problems this creates:** +1. 
**Debugging difficulty**: When pods fail to schedule due to insufficient + resources, users cannot easily see what is available vs. what is consumed +2. **Capacity planning**: Cluster administrators cannot easily determine if + more resources are needed +3. **Cross-namespace visibility**: Even cluster admins need to query multiple + namespaces to understand total consumption + +### Goals + +- Provide pool-level availability summaries via a standard Kubernetes API +- Follow established request/status patterns (like CSR) +- Compute availability on-demand (only when requested) +- Always available in-tree, in-sync with Kubernetes releases +- Require driver specification, with optional pool name filter +- Provide cross-slice validation to surface pool consistency issues +- Control access via standard RBAC on the request object +- Keep ResourceClaim and ResourceSlice APIs unchanged, requiring no + modifications to existing DRA drivers or scheduler +- Allow less-privileged users to access resource usage information without + exposing data beyond their normal RBAC access (e.g., cross-namespace claims) + +### Non-Goals + +- Adding real-time metrics/monitoring (this is point-in-time status) +- Implementing quotas or limits based on availability (future work) +- Providing historical consumption data (use multiple requests for that) +- Watch support for continuous updates (create new requests instead) + +## Proposal + +This KEP proposes a **ResourcePoolStatusRequest** API following the +CertificateSigningRequest (CSR) pattern - an established Kubernetes pattern +for imperative operations through declarative APIs. + +### Architecture + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ User Workflow │ +│ │ +│ Step 1: CREATE Step 2: WAIT Step 3: READ │ +│ $ kubectl create $ kubectl wait $ kubectl get │ +│ resourcepoolstatusrequest --for=condition=Complete rpsr/my-check │ +└───────────┬─────────────────────────┬─────────────────────────┬─────────────┘ + │ │ │ + ▼ ▼ ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ kube-apiserver │ +│ │ +│ ┌───────────────────────────────────────────────────────────────────────┐ │ +│ │ ResourcePoolStatusRequest (stored in etcd) │ │ +│ │ │ │ +│ │ metadata: │ │ +│ │ name: my-check │ │ +│ │ │ │ +│ │ spec: status: │ │ +│ │ driver: example.com/gpu ───► observationTime: │ │ +│ │ poolName: node-1 pools: │ │ +│ │ - driver: example.com/gpu │ │ +│ │ poolName: node-1 │ │ +│ │ totalDevices: 4 │ │ +│ │ allocatedDevices: 3 │ │ +│ │ availableDevices: 1 │ │ +│ │ conditions: │ │ +│ │ - type: Complete │ │ +│ │ status: "True" │ │ +│ └───────────────────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────────────────┘ + ▲ + │ Watch + UpdateStatus + │ +┌───────────────────────────────────────┴─────────────────────────────────────┐ +│ kube-controller-manager │ +│ │ +│ ┌────────────────────────────────────────────────────────────────────────┐ │ +│ │ ResourcePoolStatusRequest Controller │ │ +│ │ │ │ +│ │ 1. Watch for new ResourcePoolStatusRequest objects │ │ +│ │ 2. Skip if status.observationTime already set (one-time processing) │ │ +│ │ 3. Read ResourceSlices matching spec filters (driver, poolName) │ │ +│ │ 4. Read ResourceClaims to determine allocations │ │ +│ │ 5. Compute availability summary per pool │ │ +│ │ 6. Write result to status with timestamp │ │ +│ │ 7. 
Set condition Complete=True │ │ +│ └────────────────────────────────────────────────────────────────────────┘ │ +│ │ +│ Reuses existing informers: │ +│ ┌─────────────────┐ ┌─────────────────┐ │ +│ │ ResourceSlices │ │ ResourceClaims │ │ +│ └─────────────────┘ └─────────────────┘ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +**Key design points:** + +1. **CSR-like pattern**: User creates request, controller processes, user reads + status - established pattern in Kubernetes +2. **One-time processing**: Controller skips requests that already have status, + ensuring each request is processed exactly once +3. **Reuses existing informers**: Controller reuses ResourceSlice and + ResourceClaim informers already in KCM, adding minimal overhead +4. **Always available**: In-tree implementation, no additional deployment needed +5. **Standard RBAC**: Access controlled via RBAC on ResourcePoolStatusRequest + +### User Stories + +#### Story 1: Cluster Administrator Checking Pool Status + +As a cluster administrator, I want to see at a glance how many GPU resources +are available across my cluster so that I can understand current utilization +and plan for capacity expansion. + +**Workflow:** +```bash +# Create a status request for all GPU pools +$ kubectl create -f - <downgrade->upgrade path tested? + +Will be tested manually before Beta promotion and documented here. For Alpha, +the feature is behind a feature gate and has no persistent state that could +cause issues during upgrade/downgrade cycles. + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + +No. + +### Monitoring Requirements + +###### How can an operator determine if the feature is in use by workloads? + +- Check if ResourcePoolStatusRequest objects exist: `kubectl get resourcepoolstatusrequests` +- Check controller metrics: `resourcepoolstatus_requests_processed_total > 0` + +###### How can someone using this feature know that it is working for their instance? + +- [ ] Events + - Event Reason: N/A (no events emitted) +- [x] API .status + - Condition name: `Complete` (status: "True" when processing finished) + - Other field: `status.observationTime` is set when calculation is performed +- [ ] Other (Alarm, К8s resources status) + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + +- Request processing: 99% of requests complete within 30 seconds +- No impact on existing scheduling or pod startup SLOs + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + +- [x] Metrics + - Metric name: `resourcepoolstatus_request_processing_duration_seconds` + - Aggregation method: histogram + - Components exposing the metric: kube-controller-manager + - Metric name: `resourcepoolstatus_request_processing_errors_total` + - Aggregation method: counter + - Components exposing the metric: kube-controller-manager + - Metric name: `resourcepoolstatus_requests_processed_total` + - Aggregation method: counter + - Components exposing the metric: kube-controller-manager +- [ ] Other (describe) + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + +No, the controller will expose the standard metrics listed above. + +### Dependencies + +###### Does this feature depend on any specific services running in the cluster? 
+ +| Dependency | Usage | Impact of Unavailable | Impact of Degraded | Can Operate Without | +|------------|-------|----------------------|-------------------|---------------------| +| kube-controller-manager | Runs the ResourcePoolStatusRequest controller | Requests will not be processed (status stays empty) | Slower processing | No (required for status computation) | +| DRA drivers | Create ResourceSlices that are aggregated | No pools to report (empty results) | Incomplete pool data | Yes (returns empty/partial results) | + +### Scalability + +###### Will enabling / using this feature result in any new API calls? + +Yes: + +| API Call Type | Estimated Throughput | Originating Component | +|---------------|---------------------|----------------------| +| CREATE ResourcePoolStatusRequest | User-driven, typically < 1/min per user | kubectl / client applications | +| GET ResourcePoolStatusRequest | User-driven, typically < 10/min per user | kubectl / client applications | +| DELETE ResourcePoolStatusRequest | User-driven, typically < 1/min per user | kubectl / client applications | +| UPDATE ResourcePoolStatusRequest/status | 1 per request created | kube-controller-manager | +| LIST/WATCH ResourceSlices | Reuses existing informer (no new calls) | kube-controller-manager | +| LIST/WATCH ResourceClaims | Reuses existing informer (no new calls) | kube-controller-manager | + +###### Will enabling / using this feature result in introducing new API types? + +Yes: + +| API Type | Supported Operations | Estimated Max Objects | +|----------|---------------------|----------------------| +| ResourcePoolStatusRequest | CREATE, GET, LIST, DELETE, WATCH | Hundreds per cluster (user-managed, ephemeral) | + +Note: Objects are intended to be short-lived. Users should delete requests after reading +the status. TTL-based auto-cleanup will be added in Beta. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No. + +###### Will enabling / using this feature result in increasing size or count of existing API objects? + +No. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +No impact on scheduling or pod startup. + +###### Will enabling / using this feature result in non-negligible increase of resource usage? + +Minimal: +- etcd: Small objects, users should clean up (TTL in Beta) +- KCM: Reuses existing informers, adds small controller +- API server: Standard API operations +- Response size: Bounded by required `driver` field (one driver's pools) and optional `limit` field (values determined at implementation based on size calculations) + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + +No. This feature runs entirely in kube-controller-manager and kube-apiserver: +- No node-level resources are consumed +- No new processes or sockets created on nodes +- No file system operations on nodes +- Controller uses existing informers (no additional watch connections) + +### Troubleshooting + +###### How does this feature react if the API server and/or etcd is unavailable? + +Requests cannot be created or read. No workload impact. + +###### What are other known failure modes? 
| Failure Mode | Description | Detection | Mitigations | Diagnostics | Testing |
|--------------|-------------|-----------|-------------|-------------|---------|
| Controller not running | ResourcePoolStatusRequest controller in KCM is not running or crashed | Requests stay in pending state (no `status.observationTime`), `resourcepoolstatus_requests_processed_total` metric stays at 0 | Restart KCM, check KCM logs | Check KCM logs for controller startup errors, verify feature gate enabled | Covered by integration tests |
| Informers not synced | ResourceSlice or ResourceClaim informers have not completed initial sync | Controller logs warning, requests delayed | Wait for informer sync, check API server connectivity | Check KCM logs for informer sync status | Covered by integration tests |
| Request accumulation | Users create many requests without cleanup | etcd storage grows, `kubectl get rpsr` shows many objects | Delete old requests, implement cleanup automation | List requests with `kubectl get rpsr`, check etcd metrics | Documented, TTL planned for Beta |

###### What steps should be taken if SLOs are not being met?

1. Check KCM logs for controller errors
2. Check controller metrics
3. Verify informers are synced
4. Check for excessive request volume

## Implementation History

- 2025-12-20: KEP created in provisional state
- 2026-01-15: Design revision: from ResourceSlice status to ResourcePool
- 2026-02-07: Design revision: in-tree CSR-like pattern per API review

## Drawbacks

1. **Asynchronous operation**: Users must wait for the controller, unlike synchronous APIs
   - Mitigation: Processing is fast (seconds); `kubectl wait` helps

2. **Objects persist in etcd**: Users must clean up old requests
   - Mitigation: Document cleanup; consider TTL in the future

3. **Not real-time**: Shows a point-in-time snapshot, not live data
   - Mitigation: Timestamp shows age; recreate the request for fresh data

## Alternatives

### Alternative 1: Out-of-tree Aggregated API Server

Deploy a separate aggregated API server (like metrics-server) that computes
pool status on-demand.

**Pros:**
- On-demand computation (no persistence)
- Independent release cycle
- No etcd storage

**Cons:**
- Additional deployment to manage
- Not always available by default
- Duplicate informers add API server load

**Rejected because:** API review preferred an in-tree solution that is always
available and in-sync with Kubernetes releases.

### Alternative 2: Synchronous Review Pattern

Use a SubjectAccessReview-like pattern where status is computed synchronously
in the API server during the Create call.

**Pros:**
- Immediate response
- No persistence needed
- Simpler user flow

**Cons:**
- Cannot reuse KCM informers (would need informers in the API server)
- Computation happens in the API server request path
- No established pattern for this in resource.k8s.io

**Rejected because:** Would require new informers in the API server; the CSR
pattern is more established for operations that need controller processing.

### Alternative 3: Status in ResourceSlice

Add a Status field to ResourceSlice to track per-device allocations.

**Pros:**
- No new API type

**Cons:**
- Increases ResourceSlice size significantly
- RBAC issues: claim info exposed to slice readers
- Cross-pool aggregation awkward

**Rejected because:** Size, churn, and RBAC concerns from API review.

### Alternative 4: Client-side only

Only provide a kubectl plugin that computes everything locally, as sketched below.
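For illustration, this is a minimal sketch of the aggregation such a plugin would have to repeat on every invocation, which is also essentially the computation the proposed controller performs once per request. It assumes the `resource.k8s.io/v1beta1` client bundled with client-go (the served version may differ by the time this ships); the `poolKey` and `poolSummary` helpers are hypothetical names used only to keep the listing self-contained and are not part of this proposal:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// poolKey identifies a pool by driver and pool name (hypothetical helper type).
type poolKey struct{ driver, pool string }

// poolSummary mirrors the per-pool counts a ResourcePoolStatusRequest would
// report in status.pools (hypothetical helper type).
type poolSummary struct{ total, allocated int }

func main() {
	// Simplified out-of-cluster setup; a real plugin would honor kubeconfig flags.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// The client-side approach must fetch ALL slices and ALL claims on every run,
	// which is the "each invocation fetches all slices and claims" drawback.
	slices, err := cs.ResourceV1beta1().ResourceSlices().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	claims, err := cs.ResourceV1beta1().ResourceClaims(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Total capacity: count devices published per (driver, pool) across slices.
	summaries := map[poolKey]*poolSummary{}
	for _, s := range slices.Items {
		k := poolKey{driver: s.Spec.Driver, pool: s.Spec.Pool.Name}
		if summaries[k] == nil {
			summaries[k] = &poolSummary{}
		}
		summaries[k].total += len(s.Spec.Devices)
	}

	// Consumption: count allocated devices recorded in claim statuses.
	for _, c := range claims.Items {
		if c.Status.Allocation == nil {
			continue
		}
		for _, r := range c.Status.Allocation.Devices.Results {
			k := poolKey{driver: r.Driver, pool: r.Pool}
			if summaries[k] == nil {
				summaries[k] = &poolSummary{}
			}
			summaries[k].allocated++
		}
	}

	for k, v := range summaries {
		fmt.Printf("%s/%s: total=%d allocated=%d available=%d\n",
			k.driver, k.pool, v.total, v.allocated, v.total-v.allocated)
	}
}
```

Even this simplified version must list every ResourceSlice and every ResourceClaim on each run, requiring broad read access and scaling poorly with cluster size; those are the main reasons this alternative was rejected.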
+ +**Pros:** +- No server-side changes +- Zero cluster overhead + +**Cons:** +- Each invocation fetches all slices and claims +- Poor performance for large clusters +- No API for automation tools + +**Rejected because:** Poor performance at scale; no API for automation. + +## Infrastructure Needed + +None - this is an in-tree feature. diff --git a/keps/sig-node/5677-dra-resource-availability-visibility/kep.yaml b/keps/sig-node/5677-dra-resource-availability-visibility/kep.yaml new file mode 100644 index 000000000000..288b0503ca4f --- /dev/null +++ b/keps/sig-node/5677-dra-resource-availability-visibility/kep.yaml @@ -0,0 +1,41 @@ +title: DRA Resource Availability Visibility +kep-number: 5677 +authors: + - "@nmn3m" +owning-sig: sig-node +participating-sigs: + - sig-auth + - sig-api-machinery + +status: implementable +creation-date: 2025-12-20 +reviewers: + - "@johnbelamaric" + - "@mortent" + - "@liggitt" + - "@pohly" +approvers: + - "@mrunalp" + +see-also: + - "/keps/sig-node/4381-dra-structured-parameters" + - "/keps/sig-node/3063-dynamic-resource-allocation" + +stage: alpha + +latest-milestone: "v1.36" + +milestone: + alpha: "v1.36" + +feature-gates: + - name: DRAResourcePoolStatus + components: + - kube-apiserver + - kube-controller-manager +disable-supported: true + +metrics: + - resourcepoolstatus_request_processing_duration_seconds + - resourcepoolstatus_request_processing_errors_total + - resourcepoolstatus_requests_processed_total